Semantic 3D Mapping from Deep Image Segmentation
Abstract
1. Introduction
 Two novel algorithms that calculate the 3D space occupied by objects detected through deep image segmentation.
 Their application to the semantic mapping problem, which is used to validate their performance.
2. Related Work
2.1. Semantic Mapping
2.2. Object Detection and Segmentation
3. Segmentation and Mapping in 3D
3.1. 3D Segmentation
 The color of each object class is determined by the 2D segmentation software, and no color is repeated across classes.
 The colored octomap allows a color to be assigned to each voxel, so that detections are uniquely marked by class.
 It also makes it possible to visualize the octomap in a simple and intuitive way.
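The idea of a colored octomap can be illustrated with a minimal sketch. This is hypothetical code, not the authors' implementation (which builds on the OctoMap library): each voxel is keyed by its quantized coordinates and stores the class color together with an occupancy probability.

```python
# Hypothetical sketch of a colored-octomap-like voxel store: each voxel
# keeps the class color assigned by the 2D segmenter and a probability.
# A real system would use OctoMap's colored octree instead of a dict.

def voxel_key(point, resolution=0.05):
    """Quantize a 3D point to the voxel that contains it."""
    return tuple(int(c // resolution) for c in point)

class ColoredVoxelMap:
    def __init__(self, resolution=0.05):
        self.resolution = resolution
        self.voxels = {}  # key -> (rgb_color, probability)

    def insert(self, point, color, prob=1.0):
        self.voxels[voxel_key(point, self.resolution)] = (color, prob)

    def query(self, point):
        return self.voxels.get(voxel_key(point, self.resolution))

vmap = ColoredVoxelMap(resolution=0.05)
vmap.insert((0.51, 0.26, 1.03), color=(255, 0, 0), prob=0.9)  # a "cup" voxel
print(vmap.query((0.52, 0.27, 1.04)))  # -> ((255, 0, 0), 0.9), same voxel
```

Because nearby points quantize to the same key, a second query inside the same 5 cm voxel returns the stored color and probability.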
 [line 2]: The function $deep\_2d\_segment$ returns a set of detections in the image. Each detection is a 2D bounding box containing a mask. Each pixel in the mask that belongs to the detected object is marked with the object class color, as shown in Figure 2.
 [line 3]: The result of the algorithm will be stored in $detection\_octomap$, a colored octomap with each voxel containing a color and a probability.
 [lines 4–7]: This loop iterates over each one of the detections in the image:
 (a)
 [line 5]: This algorithm uses a KD-tree [30] together with a fast neighbor search algorithm [31]. This data structure is initialized with a point cloud and allows querying which points lie in the vicinity of a reference point. Because the point cloud is registered with the image (the i-th pixel of the image corresponds to the i-th point of the point cloud), and the pixels that belong to a detection are known, we can create point clouds that contain only the 3D points of the detection made in the image.
 (b)
 [line 6]: Processing begins with a 3D point that belongs to the detected object. This 3D point is obtained by looking up the position in the point cloud that corresponds to one of the detection pixels in the image. The detection center, or centroid, is valid for this purpose.
 (c)
 [line 7]: Starts the recursive processing that ensures the connectivity of the output octomap.
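The per-detection point-cloud extraction described in (a) and (b) can be sketched as follows. This is an illustrative fragment, not the paper's code: SciPy's `cKDTree` stands in for the KD-tree [30] and fast neighbor search [31], and the toy image and cloud sizes are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

width = 4                                   # toy 4x3 image, one point per pixel
cloud = np.arange(36, dtype=float).reshape(12, 3) / 10.0
mask_pixels = [(1, 1), (1, 2), (2, 1)]      # (row, col) pixels of one detection

# The cloud is registered with the image: the i-th pixel maps to the i-th
# point, so the detection's 3D points are gathered by flat pixel index.
indices = [r * width + c for r, c in mask_pixels]
detection_cloud = cloud[indices]

# A KD-tree over the detection's points supports fast radius queries from a
# seed point; here the seed is the centroid of the detection.
tree = cKDTree(detection_cloud)
seed = detection_cloud.mean(axis=0)
neighbors = tree.query_ball_point(seed, r=0.9)
print(sorted(neighbors))                    # -> [0, 1]
```

Only the two points within 0.9 of the centroid are returned; the third point of this toy detection lies outside the search radius.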
 [lines 2,3]: Neighbors of the input point are found by searching within a radius $kr$, where $kr$ is the voxel size of the output octomap.
 [lines 4,5]: If a neighbor is found, the point is included in the output octomap. Note that the output octomap will contain points that were not in the original detection point cloud, but rather represent the continuous space occupied by this point cloud.
 [lines 6–9]: This function will be called recursively with the expansion points that are around the input point, to expand the output octomap in all axes.
 [lines 10,11]: If the expansion point is already in the output octomap, the recursion at this point stops. If not, the recursive function is called on it.
 In Figure 4b, the starting point is established and the neighbor search is performed.
 In Figure 4c, when finding neighbors (orange points from the original point cloud), their position is added to the output octomap (green box), new search points (green points) are created and the process is repeated in each of them.
 Each successful search produces a new cell in the output octomap and generates new search points (Figure 4d).
 The process is repeated in Figure 4e–g, until the detection limits are reached, or until there are no new search points that can be generated (green points with no candidates within the search radius).
 The final result (Figure 4h) is an octomap that includes only the points that actually belong to the detected object. Outliers are not included in this octomap.
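The expansion process of Figure 4 can be sketched as follows. This is an iterative rendering of the recursive algorithm (to sidestep Python's recursion limit); all names and the toy data are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

# From a seed point, search neighbors within radius kr (the voxel size);
# every hit adds a voxel to the output map and spawns six new search
# points, one along each axis, until no new neighbors are found.
def grow_octomap(detection_cloud, seed, kr):
    tree = cKDTree(detection_cloud)
    occupied = set()                        # output voxel keys
    visited = set()
    stack = [tuple(seed)]
    while stack:
        p = np.array(stack.pop())
        key = tuple(np.round(p / kr).astype(int))
        if key in visited:
            continue                        # expansion point already processed
        visited.add(key)
        if not tree.query_ball_point(p, kr):
            continue                        # no neighbors: stop this branch
        occupied.add(key)
        for axis in range(3):
            for sign in (-1.0, 1.0):
                q = p.copy()
                q[axis] += sign * kr        # expansion point along one axis
                stack.append(tuple(q))
    return occupied

# Two nearby points and one far outlier: growth never reaches the outlier.
pts = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 5.0, 5.0]])
vox = grow_octomap(pts, seed=pts[0], kr=0.1)
print((0, 0, 0) in vox, (50, 50, 50) in vox)  # -> True False
```

The outlier behaves exactly as in Figure 4h: since no chain of radius-$kr$ neighbors connects it to the seed, it is excluded from the output octomap.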
Algorithm 1 3D Segmentation 

3.2. Semantic Mapping
 The transform connecting the frame map with the root of the robot's frame tree, the frame odom: $RT^{map\to odom}$.
 A covariance matrix $\mathbf{E}_{6\times 6}$ that indicates the precision of the robot's location $(x, y, z, roll, pitch, yaw)$ in space.
 There is an uncertainty in $RT^{map\to sensor}$, since the particle filter estimates the robot's position with uncertainty represented by the covariance matrix $\mathbf{E}$ of the location.
 There may sporadically be erroneous perceptions, due to an error in the detection of the objects in the image, or in the synchronization between the point cloud and the image.
 Obtain the current uncertainty $\mathbf{E}$ of the location:
$$\mathbf{E}=\begin{pmatrix}\sigma_{xx}^{2} & \sigma_{xy}^{2} & \cdots & \sigma_{xw}^{2}\\ \sigma_{yx}^{2} & \sigma_{yy}^{2} & \cdots & \sigma_{yw}^{2}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{wx}^{2} & \sigma_{wy}^{2} & \cdots & \sigma_{ww}^{2}\end{pmatrix}$$
 Generate a random noise transform $RT'$ from the uncertainty, using normal probability distributions:
$$Noise_{Translation}=random(\mathcal{N}(0,\sigma_{xx}^{2}),\mathcal{N}(0,\sigma_{yy}^{2}),\mathcal{N}(0,\sigma_{zz}^{2}))$$
$$Noise_{Rotation}=random(\mathcal{N}(0,\sigma_{rr}^{2}),\mathcal{N}(0,\sigma_{pp}^{2}),\mathcal{N}(0,\sigma_{ww}^{2}))$$
$$RT'=(Noise_{Translation},Noise_{Rotation})$$
 Calculate a new noisy transform $RT'^{map\to sensor}$:
$$RT'^{map\to sensor}=RT^{map\to odom}\ast RT'\ast RT^{odom\to base\_footprint}\ast RT^{base\_footprint\to base\_link}\ast RT^{base\_link\to sensor}$$
 We repeat steps 2–3 $N$ times for each ${\mathcal{O}}_{t}$. We update ${\mathcal{M}}_{t}$ from ${\mathcal{M}}_{t-1}$ by accumulating the probabilities of each cell as follows, where $K$ is a constant in the range $[0,1]$. If $K$ is near 1, the previous value of a cell has more weight than the new perception when the two are fused:
$$\mathcal{M}_{t}=\sum_{i=1}^{N}\left(\frac{K}{N}\ast \mathcal{M}_{t-1}+\frac{1-K}{N}\ast RT_{i}'^{map\to sensor}\ast \mathcal{O}_{t}\right)$$
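The sampling and fusion steps above can be sketched as follows. This simplified fragment (assumed names, translation-only noise, dictionary-based maps) illustrates the update rule; it is not the authors' implementation, which composes full 6-DoF transforms.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noise_translation(cov_diag):
    """Noise_Translation = random(N(0, s_xx), N(0, s_yy), N(0, s_zz))."""
    return rng.normal(0.0, np.sqrt(cov_diag))

def fuse(prev_map, observation, cov_diag, n_samples, k, res=0.1):
    """M_t = sum_{i=1..N} ( K/N * M_{t-1} + (1-K)/N * RT'_i * O_t )."""
    fused = {}
    for _ in range(n_samples):
        noise = sample_noise_translation(cov_diag)   # one RT'_i (translation only)
        for point, prob in observation:
            key = tuple(np.round((point + noise) / res).astype(int))
            fused[key] = fused.get(key, 0.0) + (1.0 - k) / n_samples * prob
    # The K/N term summed over the N samples reduces to K * M_{t-1}.
    for key, prob in prev_map.items():
        fused[key] = fused.get(key, 0.0) + k * prob
    return fused

prev = {(0, 0, 0): 0.8}                              # M_{t-1}: one known voxel
obs = [(np.array([0.0, 0.0, 0.0]), 1.0)]             # O_t: one perceived voxel
m = fuse(prev, obs, cov_diag=np.array([1e-6, 1e-6, 1e-6]),
         n_samples=10, k=0.7)
print(round(m[(0, 0, 0)], 3))                        # -> 0.86
```

With $K=0.7$, the fused cell keeps 70% of its previous probability (0.56) and gains 30% of the new perception (0.3), giving 0.86; larger location covariances would instead smear the observation's probability over neighboring voxels.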
4. Results
4.1. Experiments of the Algorithm for 3D Segmentation
4.1.1. Methodology
4.1.2. Results
4.2. Application: Mapping a Domestic Environment
5. Limitations
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
 Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-time Instance Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019.
 Lambert, J.; Liu, Z.; Sener, O.; Hays, J.; Koltun, V. MSeg: A composite dataset for multi-domain semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2879–2888.
 Meagher, D. Geometric modeling using octree encoding. Comput. Graph. Image Process. 1982, 19, 129–147.
 Hornung, A.; Wurm, K.M.; Bennewitz, M.; Stachniss, C.; Burgard, W. OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Auton. Robots 2013, 34, 189–206.
 Lang, D.; Paulus, D. Semantic maps for robotics. In Proceedings of the Workshop “Workshop on AI Robotics” at ICRA, Chicago, IL, USA, 14–18 September 2014; pp. 14–18.
 Grinvald, M.; Furrer, F.; Novkovic, T.; Chung, J.J.; Cadena, C.; Siegwart, R.; Nieto, J. Volumetric Instance-Aware Semantic Mapping and 3D Object Discovery. IEEE Robot. Autom. Lett. 2019, 4, 3037–3044.
 Bowman, S.L.; Atanasov, N.; Daniilidis, K.; Pappas, G.J. Probabilistic data association for semantic SLAM. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1722–1729.
 McCormac, J.; Handa, A.; Davison, A.; Leutenegger, S. SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 4628–4635.
 Nakajima, Y.; Tateno, K.; Tombari, F.; Saito, H. Fast and Accurate Semantic Mapping through Geometric-based Incremental Segmentation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 385–392.
 Nakajima, Y.; Saito, H. Efficient Object-Oriented Semantic Mapping with Object Detector. IEEE Access 2019, 7, 3206–3213.
 Prisacariu, V.A.; Kähler, O.; Golodetz, S.; Sapienza, M.; Cavallari, T.; Torr, P.H.S.; Murray, D.W. InfiniTAM v3: A Framework for Large-Scale 3D Reconstruction with Loop Closure. arXiv 2017, arXiv:cs.CV/1708.00783.
 Furrer, F.; Novkovic, T.; Fehr, M.; Gawel, A.; Grinvald, M.; Sattler, T.; Siegwart, R.; Nieto, J. Incremental Object Database: Building 3D Models from Multiple Partial Observations. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 6835–6842.
 Yu, H.; Moon, J.; Lee, B. A variational observation model of 3D object for probabilistic semantic SLAM. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 5866–5872.
 Berrio, J.S.; Zhou, W.; Ward, J.; Worrall, S.; Nebot, E. Octree map based on sparse point cloud and heuristic probability distribution for labeled images. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 3174–3181.
 Yang, S.; Huang, Y.; Scherer, S. Semantic 3D occupancy mapping through efficient high order CRFs. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 590–597.
 Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
 Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
 He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
 Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016.
 Kendall, A.; Badrinarayanan, V.; Cipolla, R. Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding. arXiv 2015, arXiv:1511.02680.
 Ophoff, T.; Van Beeck, K.; Goedemé, T. Exploring RGB-Depth Fusion for Real-Time Object Detection. Sensors 2019, 19, 866.
 Linder, T.; Pfeiffer, K.Y.; Vaskevicius, N.; Schirmer, R.; Arras, K.O. Accurate detection and 3D localization of humans using a novel YOLO-based RGB-D fusion approach and synthetic training data. In Proceedings of the IEEE International Conference on Robotics and Automation, Paris, France, 31 May–31 August 2020; pp. 1000–1006.
 Liang, M.; Yang, B.; Chen, Y.; Hu, R.; Urtasun, R. Multi-Task Multi-Sensor Fusion for 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
 Wang, W.; Neumann, U. Depth-Aware CNN for RGB-D Segmentation. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 144–161.
 Song, S.; Yu, F.; Zeng, A.; Chang, A.X.; Savva, M.; Funkhouser, T. Semantic Scene Completion from a Single Depth Image. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 190–198.
 Charles, R.Q.; Su, H.; Kaichun, M.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 77–85.
 Quigley, M.; Conley, K.; Gerkey, B.; Faust, J.; Foote, T.; Leibs, J.; Wheeler, R.; Ng, A.Y. ROS: An open-source Robot Operating System. In Proceedings of the ICRA Workshop on Open Source Software, Kobe, Japan, 12–17 May 2009; Volume 3, p. 5.
 Grinvald, M. ROS Wrapper for DeepLab. 2018. Available online: https://github.com/ethzasl/deeplab_ros (accessed on 21 January 2021).
 Milioto, A.; Stachniss, C. Bonnet: An Open-Source Training and Deployment Framework for Semantic Segmentation in Robotics using CNNs. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 7094–7100.
 Bentley, J.L. K-d Trees for Semidynamic Point Sets. In Proceedings of the Sixth Annual Symposium on Computational Geometry (SCG ’90), Berkeley, CA, USA, 6–8 June 1990; Association for Computing Machinery: New York, NY, USA, 1990; pp. 187–197.
 Muja, M.; Lowe, D. Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. In Proceedings of the VISAPP, Lisboa, Portugal, 5–8 February 2009.
Stat.     Erosion   Proposed Method
Maximum   0.88 s    0.42 s
Average   0.37 s    0.27 s
Minimum   0.15 s    0.04 s
Class    Color
Cup      FF0000
Bottle   00FF00
Chair    0000FF
Oven     FF00FF
TV       00FFFF
Martín, F.; González, F.; Guerrero, J.M.; Fernández, M.; Ginés, J. Semantic 3D Mapping from Deep Image Segmentation. Appl. Sci. 2021, 11, 1953. https://doi.org/10.3390/app11041953