Article

Birds Eye View Look-Up Table Estimation with Semantic Segmentation

1 Department of Smart Car Engineering, Chungbuk National University, Seowon-gu, Chungdae-ro 1, Cheongju-si 28644, Korea
2 School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Ave, Singapore 639798, Singapore
3 Department of Intelligent Systems and Robotics, Chungbuk National University, Seowon-gu, Chungdae-ro 1, Cheongju-si 28644, Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2021, 11(17), 8047; https://doi.org/10.3390/app11178047
Submission received: 30 July 2021 / Revised: 21 August 2021 / Accepted: 27 August 2021 / Published: 30 August 2021
(This article belongs to the Topic Machine and Deep Learning)

Abstract

In this work, a study was carried out to estimate a look-up table (LUT) that converts a camera image plane to a bird's eye view (BEV) plane using a single camera. Traditional camera pose estimation methods incur high research and manufacturing costs for future autonomous vehicles and may require pre-configured infrastructure. This paper proposes a camera calibration system for autonomous driving that is low cost and requires minimal infrastructure. We studied a network that estimates the camera pose under urban road driving conditions from a single camera and outputs an image in the form of an LUT that converts the camera image into a BEV. The proposed network predicts the pose from a single image in a human-like manner. We collected synthetic data using a simulator, generated BEV and LUT ground truth, and used the proposed network and ground truth to train the pose estimation function. In the process, the network predicts the pose by interpreting the semantic segmentation features, and its performance is increased by attaching a layer that guides the overall direction of the network. The network also outputs the camera angles (roll/pitch/yaw) in the 3D coordinate system so that the user can monitor learning. Since the network's output is an LUT, no additional calculation is needed, and real-time performance is improved.

1. Introduction

Autonomous driving is currently receiving much attention, with many research institutes and companies conducting related research. The autonomous vehicle field can be divided into three major categories: recognition, judgment, and control. In particular, the cognitive area is rapidly developing, along with the development of machine learning. In addition, estimating 3D data based on 2D cognitive data using a camera is an essential element for autonomous driving. Estimating the pose of a camera attached to an autonomous vehicle is a method for outputting 3D data. In this paper, we propose a network that predicts an LUT [1] that transforms a camera image plane to a BEV [2] plane, and we aim to estimate the pose of the camera through this.
A camera is a sensor based on 2D data projected through a lens, so it is almost impossible to estimate perfect 3D data. Using a single camera to obtain the distance to an object (that is, the depth) is possible only with non-occluded data, and the texture of an object along with its Z-value cannot be accurately obtained. Therefore, we considered the free space to be a relatively strong feature for estimating a pose in the image plane, and we set pose estimation based on features in the area specified as free space as an initial goal. A semantic segmentation network [3,4,5,6] was selected as the backbone network to construct a network that outputs the LUT by utilizing the feature points of the free-space domain.
A BEV was used to show depth intuitively using camera data. This BEV is similar to the around view monitoring (AVM) [7] used for parking assistance [8]. However, since an object located above the ground is projected onto the ground plane, any obstacle on the road surface appears stretched compared to its original shape. Moreover, since camera-based depth lacks a reference distance, only the relative distance is valid; the absolute distance cannot be calculated. In general, a ToF [9] sensor, the size of surrounding artificial landmarks, or the motion data of the driving vehicle are used to estimate the actual distance. However, since the purpose of this paper is to obtain the camera pose of a driving vehicle using only a camera, which is a low-cost sensor, it does not matter that the BEV output provides only relative distance.
Traditionally, to measure the pose of a camera, sensor calibration with a ToF sensor is performed, or the camera calibration process is carried out in a tolerance calibration room [10], where markers with specified actual distances are placed. However, the development of autonomous driving is progressing with mass production as the target, and the cost of the sensors must be kept low enough for mass production. Therefore, we propose a system based on a single camera that does not require other sensors or infrastructure.
The rest of this paper is organized as follows: Section 2 introduces related works; Section 3 discusses the DB used in the experiment; Section 4 describes the structure and details of the entire network utilized in machine learning; Section 5 presents experiments with different methods and their results; finally, Section 6 concludes the paper and presents future work.

2. Related Work

Le et al. [11] claimed that dynamic object detection and pose estimation are tightly coupled tasks: when a network is constructed and trained to perform both, the results of dynamic object detection and pose estimation complement each other. Applying this insight, we used semantic segmentation, which expresses the contour of an object, rather than bounding-box-based detection. The LUT is predicted using the segmentation result as a feature, and we constructed the network to estimate the roll/pitch/yaw of the camera image and to induce an interaction between segmentation and pose estimation.
Jaderberg et al. [12] proposed the spatial transformer network (STN), which warps the original image and provides a layer that can obtain an image with better features than the original. In this process, a fully connected layer is placed inside the STN to consider the interrelationship of all features. We adapted this idea so that the entire network is suitable for pose estimation: we created a layer that outputs the pose by itself and gave this layer the ability to convert encoded features into poses by using a fully connected layer.
Ronneberger et al. [13] proposed a general encoder–decoder and described how to efficiently perform up-sampling after down-sampling, and many studies have utilized similar methods. Our paper also deals with an image-to-image task (original image to LUT), and since encoded data are used in processing it, the overall configuration is composed of an encoder–decoder.

3. Synthetic Database

We needed to collect image semantic segmentation and camera pose ground truth to implement the proposed network. Representative datasets such as MS-COCO [14] and KITTI [15] contain such data, but they have the disadvantage that the variation in camera poses is not significant. The camera roll, pitch, and yaw can each range from 0 to 360 degrees, but the variety available in these open datasets does not reach that level.
Our solution was to use a simulator that reproduces real environments and places. We acquired data from various camera poses using the MORAI Sim Standard [16] (Figure 1). When a camera is attached to an actual vehicle, experimental data for various angles cannot be acquired due to in-vehicle structures such as the windshield; if learning proceeds with such data, the results may not generalize and may be overfitted. Since this paper proposes a network that predicts the LUT for producing a BEV using pose data, a simulator was used to construct and utilize data covering various poses.

3.1. Data Collection

By utilizing the simulator’s characteristics, various ground truths related to the camera and the camera pose were collected. The process was configured so that no separate hand-crafted labeling was required.
An RGB camera image for the input of the whole system, the segmentation ground truth image used for the backbone semantic segmentation, and the pose (x, y, z, roll, pitch, and yaw) expressing the camera attachment position were acquired.
Among the pose data, the roll, pitch, and yaw, which express the angles about each 3D axis, range from 0 to 360 degrees. Due to the vehicle’s windshield, it is difficult to rotate the camera freely in a real vehicle, so only a tiny change in pose compared to the entire range can be expressed. In this paper, various camera poses were constructed using the simulator, because overfitting occurs when learning from data whose pose range is small compared to the actual range.

3.2. LUT Generation

Because the simulator provides camera images without lens distortion, the distortion removal procedure is omitted; an ideal camera matrix is created, and the rotation and translation matrices are generated using the extracted camera pose (roll/pitch/yaw) and translation information. Here, $c_x$/$c_y$ denote the principal point of the camera and $f_x$/$f_y$ denote the focal length; based on these, the camera matrix $K$, which is the conversion matrix between the camera’s original plane and the normalized plane, is estimated.
$$c_x = \mathrm{width}/2$$
$$c_y = \mathrm{height}/2$$
$$\pi = 3.141592\ldots$$
$$f_x = c_x / \tan\left(0.5 \cdot \mathit{fov} \cdot \pi / 180\right)$$
$$f_y = c_y / \tan\left(0.5 \cdot \mathit{fov} \cdot \pi / 180\right)$$
$$f = \begin{cases} f_x & (f_x \ge f_y) \\ f_y & (f_x < f_y) \end{cases}$$
$$K = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
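For illustration, the intrinsic matrix above can be computed with a few lines of NumPy. This is a minimal sketch under our own assumptions (a single field-of-view value `fov_deg` in degrees and an image of size `width` × `height`); the function name and the choice of the larger focal length follow our reconstruction of the case expression above, not the authors' code.

```python
import numpy as np

def camera_matrix(width, height, fov_deg):
    """Build the ideal (distortion-free) camera matrix K from the image size and FoV."""
    cx = width / 2.0
    cy = height / 2.0
    fx = cx / np.tan(0.5 * fov_deg * np.pi / 180.0)
    fy = cy / np.tan(0.5 * fov_deg * np.pi / 180.0)
    f = fx if fx >= fy else fy   # larger focal length, per our reading of the cases above
    return np.array([[f,   0.0, cx],
                     [0.0, f,   cy],
                     [0.0, 0.0, 1.0]])
```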
In the homography production stage, existing image calibration methods combine the 3 × 3 rotation matrix ($R$) and the 3 × 1 translation matrix ($T$) into a 3 × 4 matrix $[R|T]$; however, with this method the translation is applied along the axes before rotation, and due to the gimbal lock phenomenon that occurs when large angles are rotated about an axis, the conversion may not be performed properly. Therefore, we apply the 3 × 4 matrix $[R|T_{null}]$ first and then multiply by the 3 × 4 matrix $[R_i|T]$ to solve this problem: the rotation of the coordinate axes is considered first, and the translation is then applied with respect to a new three-dimensional orthogonal basis (3 × 3 identity matrix) rather than the rotated (and possibly gimbal-locked) axes. The details are as follows.
$$[R|T_{null}] = \begin{bmatrix} r_{11} & r_{12} & r_{13} & 0 \\ r_{21} & r_{22} & r_{23} & 0 \\ r_{31} & r_{32} & r_{33} & 0 \end{bmatrix}$$
$$[R_i|T] = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \end{bmatrix}$$
$$RT = [R_i|T]\,[R|T_{null}]$$
Since the $RT$ obtained in this way is a 3 × 4 matrix, representing a transformation between a 3D homogeneous coordinate system and a 2D homogeneous coordinate system, the z-axis data are removed, based on the camera coordinates, so that it can be used in the BEV transformation (a 2D-to-2D transformation with z = 0). Deleting the third column of $RT$ has the effect of setting the z-axis data to 0 and yields a 3 × 3 matrix.
The homography $H$ between the original image and the BEV is the product of $RT_{\mathit{3rdColRemoved}}$ (the rotation matrix between the image and the BEV), the camera matrix $K$, and the matrix that sets the scale and maps the top-left of the BEV to (0, 0).
$$H = \begin{bmatrix} \mathit{bevScale} & 0 & \mathit{bevWidth}/2 \\ 0 & \mathit{bevScale} & \mathit{bevHeight} \\ 0 & 0 & 1 \end{bmatrix} K \, RT_{\mathit{3rdColRemoved}}$$
The x and y coordinates of the original image corresponding to each BEV point can be obtained using the homography $H$. The x and y coordinate values are divided by the width and height of the original image, respectively, multiplied by 65,535, converted into 16-bit data, and then stored as images named LUT_X and LUT_Y.
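To make this step concrete, the following NumPy/OpenCV sketch builds the LUT images from a given homography. It is illustrative only: the assumption that $H$ maps BEV pixel coordinates to original-image coordinates, the file names, and the clipping are ours, not taken from the paper.

```python
import cv2
import numpy as np

def save_lut(H, bev_w, bev_h, img_w, img_h):
    """For every BEV pixel, compute the corresponding source-image coordinate via the
    homography H (assumed to map BEV pixels to image pixels), normalize it by the
    image size, and store it as 16-bit images LUT_X / LUT_Y."""
    xs, ys = np.meshgrid(np.arange(bev_w), np.arange(bev_h))
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3 x N homogeneous

    src = H @ pts.astype(np.float64)
    src_x = (src[0] / src[2]).reshape(bev_h, bev_w)
    src_y = (src[1] / src[2]).reshape(bev_h, bev_w)

    lut_x = np.clip(src_x / img_w, 0.0, 1.0) * 65535   # normalize and quantize to 16 bits
    lut_y = np.clip(src_y / img_h, 0.0, 1.0) * 65535
    cv2.imwrite("LUT_X.png", lut_x.astype(np.uint16))
    cv2.imwrite("LUT_Y.png", lut_y.astype(np.uint16))
```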

4. Proposed Deep Learning Network

The network (Figure 2) consists of an encoder, a decoder, an LUT generator, and a pose regressor. As loss functions, segmentation loss in the encoder, LUT loss in the LUT generator, and pose loss in the pose regressor were used.
The input of the whole network is a three-channel RGB image, and the output is the predicted pose and LUT of the BEV. The predicted pose is output through the encoder and pose regressor, and the LUT is output through the encoder, decoder, and LUT generator.
The pose regressor is attached in the form of an add-on, and by connecting it, the overall direction of the network is given and the performance is improved. The predicted pose helps the user to intuitively monitor the learning progress during the learning process.
Because the network outputs the LUT as an image, post-processing time is reduced: the BEV can be produced immediately with only a memory copy and a simple interpolation operation, without additional computation.
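To illustrate this, the predicted LUT can be fed directly to a remap operation; this is a minimal sketch assuming the LUT images produced earlier (the file names are ours).

```python
import cv2
import numpy as np

# Load the 16-bit LUT images and a camera frame (hypothetical file names).
lut_x = cv2.imread("LUT_X.png", cv2.IMREAD_UNCHANGED).astype(np.float32)
lut_y = cv2.imread("LUT_Y.png", cv2.IMREAD_UNCHANGED).astype(np.float32)
frame = cv2.imread("camera_frame.png")

h, w = frame.shape[:2]
map_x = lut_x / 65535.0 * w   # de-normalize back to source pixel coordinates
map_y = lut_y / 65535.0 * h

# A memory copy plus simple (bilinear) interpolation is enough to produce the BEV.
bev = cv2.remap(frame, map_x, map_y, cv2.INTER_LINEAR)
cv2.imwrite("bev.png", bev)
```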
In Table 1, Table 2, Table 3, Table 4 and Table 5, C denotes the customizable channel, and H and W denote the height and width of the input image (or input feature map), respectively.

4.1. Encoder

To create an LUT that converts a single camera image to a BEV, we used semantic segmentation-based features [5] from the segmentation backbone network. After fitting the original image and the segmentation result to the same scale, down-sampling is performed through the compress block. The output from each compress block comprises encoded data for multiple scales, and through this, a network robust to multiple scales is constructed.
The semantic segmentation layer aims to find precise boundaries for objects in the image. In addition, since the proposed network aims to estimate the pose using image features, we inserted a semantic segmentation layer into the encoder (Figure 3, Table 1) to utilize the information on the precise boundary as a feature.
The compress block (Figure 4a, Table 2a,b) integrates features using the custom residual block (Figure 4b, Table 2c,d), which follows He et al. [17], and a structure that fuses the original image and the segmentation features.
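Reading Table 2c,d literally, the custom residual block can be sketched in PyTorch as follows. This is our interpretation of the table (a stride-2 main path plus a stride-2 projection shortcut and an element-wise add); the activation function is an assumption, and this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class CustomResidualBlock(nn.Module):
    """Residual block as read from Table 2c,d: C -> 4C channels with a /2 spatial reduction."""
    def __init__(self, c):
        super().__init__()
        # Main path: RL00 (stride 2) then RL01 (stride 1).
        self.conv1 = nn.Conv2d(c, 2 * c, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(2 * c, 4 * c, kernel_size=3, stride=1, padding=1)
        # Projection shortcut: RL02 (stride 2).
        self.skip = nn.Conv2d(c, 4 * c, kernel_size=3, stride=2, padding=1)
        self.act = nn.ReLU(inplace=True)   # activation is our assumption

    def forward(self, x):
        out = self.act(self.conv1(x))
        out = self.conv2(out)
        # RL03: element-wise add of the main path and the shortcut.
        return self.act(out + self.skip(x))
```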

4.2. Decoder

Due to the characteristics of a camera, which projects light from a specific space, far-distance data are sparse compared to near-distance data, which causes aliasing. We constructed a parallel path to perform anti-aliasing efficiently by utilizing the features delivered from the encoder.
To mitigate the aliasing that may occur in the up-scaling process, a structure for restoring and up-scaling the multi-scale encoded data was constructed by composing a parallel path of a transposed convolution [8] and a pixel shuffle [18].
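As a concrete example, one decoder stage with the two parallel branches (the 64-to-16-channel pattern of Table 3a) might look like the sketch below. How the two branches are fused is not specified here, so the element-wise sum is our assumption.

```python
import torch
import torch.nn as nn

class ParallelUpsample(nn.Module):
    """One decoder up-sampling stage (cf. Table 3a): a pixel-shuffle branch and a
    transposed-convolution branch, both 64 -> 16 channels with 2x spatial up-scaling."""
    def __init__(self, in_ch=64):
        super().__init__()
        self.shuffle = nn.PixelShuffle(upscale_factor=2)               # 64 -> 16 channels, 2x size
        self.deconv = nn.ConvTranspose2d(in_ch, in_ch // 4,
                                         kernel_size=2, stride=2, padding=0)

    def forward(self, x):
        # Fusing by summation is our assumption.
        return self.shuffle(x) + self.deconv(x)

# Example: a 64-channel feature map is up-scaled to 16 channels at twice the resolution.
feat = torch.randn(1, 64, 32, 32)
up = ParallelUpsample()(feat)      # shape: (1, 16, 64, 64)
```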
The decoder (Figure 5, Table 3) does not output a separate loss, but delivers data to the LUT generator.

4.3. LUT Generator

By converging and compressing the results obtained from the decoder, three LUT channels are finally generated. The first/second channels represent the x/y coordinates of the original image, respectively, and the third channel represents the boundary of the camera’s field of view (FoV) area during the LUT conversion process (the boundary takes the maximum value, while the rest takes the minimum value). Figure 6 and Table 4 show the structure of the LUT generator.

4.4. Pose Regressor

In essence, the LUT is an interpretation of the geometric information of the image (in this paper, the geometric information is the camera pose, i.e., the roll/pitch/yaw), so camera pose regression must be considered from the network design stage. To effectively add pose regression information to the network, we attached a pose regressor (Figure 7, Table 5) so that the network explicitly recognizes the task of estimating the pose.
To estimate the pose, we compress the encoder data and utilize a fully connected layer [19] to capture the correspondence among all the encoder features [12].
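Following Table 5, a minimal PyTorch sketch of such a pose regressor is given below. The assumption that the concatenated input is two 4096-channel encoder feature maps at 1/16 resolution, as well as the class and variable names, are ours.

```python
import torch
import torch.nn as nn

class PoseRegressor(nn.Module):
    """Pose regressor as read from Table 5: concatenate -> 3x3 convolution -> flatten -> FC."""
    def __init__(self, h, w):
        super().__init__()
        self.conv = nn.Conv2d(8192, 1024, kernel_size=3, stride=1, padding=1)
        # 1024 * (h/16) * (w/16) = 4*h*w flattened features, mapped to roll/pitch/yaw.
        self.fc = nn.Linear(1024 * (h // 16) * (w // 16), 3)

    def forward(self, feat_a, feat_b):
        x = torch.cat([feat_a, feat_b], dim=1)    # 4096 + 4096 = 8192 channels
        x = self.conv(x)
        x = torch.flatten(x, start_dim=1)
        return torch.sigmoid(self.fc(x))           # angle targets encoded in the 0~1 range
```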
In angle representations using degrees or radians, discontinuities such as 0/360 degrees and –π/π may hinder learning. Therefore, the cosine of each angle is taken and then translated and scaled so that its range changes from –1~1 to 0~1, and a sigmoid is used as the activation function of the final layer to infer values in the 0~1 range efficiently.
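The angle encoding described above can be written as a small helper; this is a minimal sketch of the idea (cosine removes the 0/360-degree discontinuity, and the shift maps the range from –1~1 to 0~1 to match the sigmoid output). The function names are ours, and note that a single cosine value cannot distinguish an angle from its negative.

```python
import numpy as np

def encode_angle(deg):
    """Map an angle in degrees to the 0~1 target range used for the pose regressor."""
    return (np.cos(np.deg2rad(deg)) + 1.0) / 2.0   # cos in [-1, 1] -> [0, 1]

def decode_angle(t):
    """Invert the encoding (up to the sign ambiguity inherent to cosine)."""
    return np.rad2deg(np.arccos(2.0 * t - 1.0))

# 0 deg and 360 deg map to the same target, so the discontinuity disappears.
print(encode_angle(0.0), encode_angle(360.0))   # 1.0 1.0
```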

5. Experiments

5.1. Loss Cost

The loss cost is calculated at three points: the seg loss at the end of the segmentation backbone in the encoder, the LUT loss at the LUT generator output, and the pose loss at the pose regressor output.
The contour of the semantic segmentation feature must be precise to help improve the entire network’s performance, so we output the segmentation loss using the pixel-wise cross-entropy [20] of the segmentation.
The LUT loss is calculated through a pixel-wise mean squared error (MSE), and the weights of the first/second channels and the third channel, which have different basic properties, were determined experimentally.
For the result of the pose regressor, the pose loss is obtained by using the MSE. Due to the relatively small number of elements, a smaller value is output compared to the other losses.
Since the domain covered by each loss and the convergence speed in the learning process are different, the weight multiplied by each loss cost in calculating the total loss was experimentally obtained and is as follows.
$$\mathit{SegLoss} = \mathrm{CrossEntropy}(\mathit{Seg})$$
$$\mathit{LUTLoss} = 5 \cdot \mathrm{MSE}(\mathit{LUT_X}) + 5 \cdot \mathrm{MSE}(\mathit{LUT_Y}) + 3 \cdot \mathrm{MSE}(\mathit{LUT_{FOV}})$$
$$\mathit{PoseLoss} = \mathrm{MSE}(\mathit{Pose})$$
$$\mathit{TotalLoss} = 1 \cdot \mathit{SegLoss} + 5 \cdot \mathit{LUTLoss} + 100 \cdot \mathit{PoseLoss}$$
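In PyTorch-style code, the weighted total loss above could be assembled as in the sketch below; the tensor names are placeholders, and the channel ordering (LUT_X, LUT_Y, FoV) follows Section 4.3.

```python
import torch
import torch.nn.functional as F

def total_loss(seg_logits, seg_gt, lut_pred, lut_gt, pose_pred, pose_gt):
    """Weighted sum of the three losses (weights as given in the equations above).
    lut_pred / lut_gt are 3-channel tensors: LUT_X, LUT_Y and the FoV boundary map."""
    seg_loss = F.cross_entropy(seg_logits, seg_gt)               # pixel-wise cross-entropy
    lut_loss = (5 * F.mse_loss(lut_pred[:, 0], lut_gt[:, 0])     # LUT_X
                + 5 * F.mse_loss(lut_pred[:, 1], lut_gt[:, 1])   # LUT_Y
                + 3 * F.mse_loss(lut_pred[:, 2], lut_gt[:, 2]))  # FoV boundary
    pose_loss = F.mse_loss(pose_pred, pose_gt)
    return 1 * seg_loss + 5 * lut_loss + 100 * pose_loss
```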

5.2. Quantitative Evaluation

In this paper, the seg loss is a learning measure for segmentation features inside the network, and the LUT and pose losses are quantitative indicators for the LUT/3D spatial angle, respectively (Table 6).
We gradually changed the layers to infer the change in performance from the structural evolution of our network (Table 7).
The resolution representation of the encoder, decoder, and LUT generator was tested with 1 (single scale)/4 (multi-scale), and the parallel path of the decoder was tested with two pixel shuffles or pixel shuffle and convolution transposed. Finally, we tried to improve the performance through the combination with a pose regressor.
All tests were run (inference) on an NVIDIA GTX 1080, and the three losses were evaluated.
If the parallel path composed of two pixel shuffles is changed to a layer consisting of a pixel shuffle and a transposed convolution, the processing time increases 1.11 times, but the segmentation loss is reduced to 0.89 times and the LUT loss is reduced dramatically (to 0.06 times).
If the resolution representation considered by the encoder, decoder, and LUT generator is changed from single scale to multi-scale, the processing time increases 3.7 times, the segmentation loss is reduced to 0.38 times, and the LUT loss is reduced to 0.44 times. When changing to multi-scale, the segmentation loss in particular is significantly reduced.
When a pose regressor is added to the end of the encoder, the processing time increases only 1.09 times, while the segmentation loss is reduced to 0.64 times and the LUT loss to 0.25 times; the loss cost is therefore significantly reduced at only a small cost in processing time. This shows that it is meaningful to connect the pose regressor to the end of the encoder to give the entire network its overall direction.

5.3. Qualitative Evaluation

Since the coordinates of the original image corresponding to each coordinate of the BEV can be estimated using the LUT data obtained at the end of the network, the BEV was generated from these values. It was tested using data from two maps, Chungbuk National University (CBNU) and KATRI K-city (Table 8 and Table 9).
As we progressed from v1 to v4, the aliasing decreased. Particularly, if we compare v3 (without a pose regressor) and v4 (with a pose regressor), we can see that the concept of the overall pose is added, and the distant region is converted relatively well.

6. Conclusions

In this work, we studied BEV conversion based on a single camera image. We used segmentation backbone-based features during the study, and the performance difference before and after attachment was analyzed by adding on a pose regressor. Since it is challenging to collect various camera poses using an actual camera, we tested the network through a simulator.
We plan to conduct research using actual camera data (or actual + synthetic data) with this network in the future and to reduce aliasing by improving the network. In addition, to supplement the characteristics of a single camera, which make it difficult to estimate scale, we intend to produce a BEV based on real distance with an explicit unit, rather than only relative distance, by combining the camera with a ToF sensor or other odometry methods, such as using adjacent frames of a single camera.

Author Contributions

Conceptualization, D.L., W.P.T.; methodology, D.L., W.P.T.; validation, D.L.; writing—original draft preparation, D.L.; writing—review and editing, S.-C.K.; supervision, S.-C.K.; funding acquisition, S.-C.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MOTIE (Ministry of Trade, Industry, and Energy) in Korea, under the Fostering Global Talents for Innovative Growth Program (P0008751) supervised by the Korea Institute for Advancement of Technology (KIAT). This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the Grand Information Technology Research Center support program (IITP-2021-2020-0-01462) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Choi, D.Y.; Choi, J.H.; Choi, J.W.; Song, B.C. CNN-based Pre-Processing and Multi-Frame-Based View Transformation for Fisheye Camera-Based AVM System. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017. [Google Scholar]
  2. Zhu, X.; Yin, Z.; Shi, J.; Li, H.; Lin, D. Generative Adversarial Frontal View to Bird View Synthesis. In Proceedings of the 2018 International conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018. [Google Scholar]
  3. Chao, P.; Kao, C.Y.; Ruan, Y.S.; Huang, C.H.; Lin, Y.L. HarDNet: A low memory traffic network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019. [Google Scholar]
  4. Hong, Y.; Pan, H.; Sun, W.; Jia, Y. Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes. arXiv 2021, arXiv:2101.06085. [Google Scholar]
  5. Tao, A.; Sapra, K.; Catanzaro, B. Hierarchical Multi-scale Attention for Semantic Segmentation. arXiv 2020, arXiv:2005.10821. [Google Scholar]
  6. Mehta, S.; Rastegari, M.; Caspi, A.; Shapiro, L.; Hajishirzi, H. ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  7. Hsu, C.M.; Chen, J.Y. Around View Monitoring-Based Vacant Parking Space Detection and Analysis. Appl. Sci. 2019, 9, 3403. [Google Scholar] [CrossRef] [Green Version]
  8. Lee, D.; Lee, J.S.; Lee, S.; Kee, S.C. The Real-time Implementation for the Parking Line Departure Warning System. In Proceedings of the 3rd IEEE International Conference on Intelligent Transportation Engineering (ICITE), Singapore, 3–5 September 2018. [Google Scholar]
  9. Dhall, A.; Chelani, K.; Radhakrishnan, V.; Krishna, K.M. LiDAR-Camera Calibration using 3D-3D Point correspondences. arXiv 2017, arXiv:1705.09785. [Google Scholar]
  10. Lee, D.; Kee, S.C. Real-time Implementation of the Parking Line Departure Warning System Using Partitioned Vehicle Region Images. Trans. KSAE 2019, 7, 553–560. [Google Scholar]
  11. Le, H.; Liu, F.; Zhang, S.; Agarwala, A. Deep Homography Estimation for Dynamic Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  12. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial Transformer Networks. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
  13. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015. [Google Scholar]
  14. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  15. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets Robotics: The KITTI Dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef] [Green Version]
  16. MORAI Sim Standard. Available online: http://www.morai.ai (accessed on 30 August 2021).
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  18. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  19. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  20. Rubinstein, R.Y.; Kroese, D.P.; Cohen, I.; Porotsky, S.; Taimre, T. Cross-Entropy Method; Springer: Boston, MA, USA, 2013. [Google Scholar]
Figure 1. MORAI Sim Standard: (a) Chungbuk National University map; (b) KATRI K-city map.
Figure 2. Entire network structure.
Figure 3. Encoder structure.
Figure 4. (a) Compress block structure; (b) custom residual block structure.
Figure 5. Decoder structure.
Figure 6. LUT generator structure.
Figure 7. Pose regressor structure.
Table 1. (a). Encoder layer detail; (b). Encoder feature map detail.
(a)
Name | In Channels | Out Channels | Layer | Remarks
EL00 | 3 | C | Semantic segmentation | Encoder layer
EL01 | C | 16 | 2D convolution (kernel: 3; stride: 1; padding: 1) |
EL02 | 3 | 16 | |
EL03 | 16 | 64 | Compress block |
EL04 | 64 | 256 | |
EL05 | 256 | 1024 | |
EL06 | 1024 | 4096 | |
(b)
Name | Channel | Height | Width | Remarks
EF00 | 3 | H | W | Encoder feature
EF01 | C | H | W |
EF02 | 16 | H | W |
EF03 | | | |
EF04 | 64 | H/2 | W/2 |
EF05 | | | |
EF06 | 256 | H/4 | W/4 |
EF07 | | | |
EF08 | 1024 | H/8 | W/8 |
EF09 | | | |
MF00 | 4096 | H/16 | W/16 | Middle feature
MF01 | | | |
MF02 | 64 | H/2 | W/2 |
MF03 | 256 | H/4 | W/4 |
MF04 | 1024 | H/8 | W/8 |
MF05 | 4096 | H/16 | W/16 |
Table 2. (a). Compress Block Layer Detail; (b). Compress block feature map detail. (c). Custom residual block layer detail. (d). Custom residual block feature map detail.
(a)
Name | In Channels | Out Channels | Layer | Remarks
CL00 | C | 4C | Custom residual block | Compress layer
CL01 | | | |
CL02 | 4C | 8C | Concatenate |
CL03 | 8C | 4C | 2D convolution (kernel: 3; stride: 1; padding: 1) |
(b)
Name | Channel | Height | Width | Remarks
CF00 | C | H | W | Compress feature
CF01 | | | |
CF02 | 4C | H/2 | W/2 |
CF03 | | | |
CF04 | 8C | H/2 | W/2 |
CF05 | 4C | H/2 | W/2 |
CF06 | | | |
CF07 | | | |
(c)
Name | In Channels | Out Channels | Layer | Remarks
RL00 | C | 2C | 2D convolution (kernel: 3; stride: 2; padding: 1) | Residual layer
RL01 | 2C | 4C | 2D convolution (kernel: 3; stride: 1; padding: 1) |
RL02 | C | 4C | 2D convolution (kernel: 3; stride: 2; padding: 1) |
RL03 | 4C | 4C | Element add |
(d)
Name | Channel | Height | Width | Remarks
RF00 | C | H | W | Residual feature
RF01 | 2C | H/2 | W/2 |
RF02 | 4C | H/2 | W/2 |
RF03 | 4C | H/2 | W/2 |
RF04 | 4C | H/2 | W/2 |
Table 3. (a). Decoder layer detail. (b). Decoder feature map detail.
(a)
Name | In Channels | Out Channels | Layer | Remarks
DL00 | 64 | 16 | Pixel shuffle | Decoder layer
DL01 | 64 | 16 | Convolution transposed (kernel: 2; stride: 2; padding: 0) |
DL02 | 256 | 64 | Pixel shuffle |
DL03 | 256 | 64 | Convolution transposed (kernel: 2; stride: 2; padding: 0) |
DL04 | 64 | 16 | Pixel shuffle |
DL05 | 64 | 16 | Convolution transposed (kernel: 2; stride: 2; padding: 0) |
DL06 | 1024 | 256 | Pixel shuffle |
DL07 | 1024 | 256 | Convolution transposed (kernel: 2; stride: 2; padding: 0) |
DL08 | 256 | 64 | Pixel shuffle |
DL09 | 256 | 64 | Convolution transposed (kernel: 2; stride: 2; padding: 0) |
DL10 | 64 | 16 | Pixel shuffle |
DL11 | 64 | 16 | Convolution transposed (kernel: 2; stride: 2; padding: 0) |
DL12 | 4096 | 1024 | Pixel shuffle |
DL13 | 4096 | 1024 | Convolution transposed (kernel: 2; stride: 2; padding: 0) |
DL14 | 1024 | 256 | Pixel shuffle |
DL15 | 1024 | 256 | Convolution transposed (kernel: 2; stride: 2; padding: 0) |
DL16 | 256 | 64 | Pixel shuffle |
DL17 | 256 | 64 | Convolution transposed (kernel: 2; stride: 2; padding: 0) |
DL18 | 64 | 16 | Pixel shuffle |
DL19 | 64 | 16 | Convolution transposed (kernel: 2; stride: 2; padding: 0) |
(b)
Name | Channel | Height | Width | Remarks
DF00 | 64 | H/2 | W/2 | Decoder feature
DF01 | | | |
DF02 | 256 | H/4 | W/4 |
DF03 | | | |
DF04 | 64 | H/2 | W/2 |
DF05 | | | |
DF06 | 1024 | H/8 | W/8 |
DF07 | | | |
DF08 | 256 | H/4 | W/4 |
DF09 | | | |
DF10 | 64 | H/2 | W/2 |
DF11 | | | |
MF02 | 64 | H/2 | W/2 | Middle feature
MF03 | 256 | H/4 | W/4 |
MF04 | 1024 | H/8 | W/8 |
MF05 | 4096 | H/16 | W/16 |
MF06 | 16 | H | W |
MF07 | | | |
MF08 | | | |
MF09 | | | |
MF10 | | | |
MF11 | | | |
MF12 | | | |
MF13 | | | |
Table 4. (a). LUT generator layer detail. (b). LUT generator feature map detail.
(a)
Name | In Channels | Out Channels | Layer | Remarks
LL00 | 16 | 32 | Concatenate | LUT layer
LL01 | | | |
LL02 | | | |
LL03 | | | |
LL04 | 32 | 16 | 2D convolution (kernel: 3; stride: 1; padding: 1) |
LL05 | | | |
LL06 | | | |
LL07 | | | |
LL08 | 16 | 64 | Concatenate |
LL09 | 64 | 16 | 2D convolution (kernel: 3; stride: 1; padding: 1) |
LL10 | 16 | 3 | 2D convolution (kernel: 3; stride: 1; padding: 1) |
(b)
Name | Channel | Height | Width | Remarks
LF00 | 32 | H | W | LUT feature
LF01 | | | |
LF02 | | | |
LF03 | | | |
LF04 | 16 | H | W |
LF05 | | | |
LF06 | | | |
LF07 | | | |
LF08 | 64 | H | W |
LF09 | 16 | H | W |
LF10 | 3 | H | W |
MF06 | 16 | H | W | Middle feature
MF07 | | | |
MF08 | | | |
MF09 | | | |
MF10 | | | |
MF11 | | | |
MF12 | | | |
MF13 | | | |
Table 5. (a). Pose regressor layer detail. (b). Pose regressor feature map detail.
(a)
Name | In Channels | Out Channels | Layer | Remarks
PL00 | 4096 | 8192 | Concatenate | Pose layer
PL01 | 8192 | 1024 | 2D convolution (kernel: 3; stride: 1; padding: 1) |
PL02 | 1024 | 4HW | Flatten (3D to 1D) |
PL03 | 4HW | 3 | Fully connected layer |
(b)
Name | Channel | Height | Width | Remarks
PF00 | 8192 | H/16 | W/16 | Pose feature
PF01 | 1024 | H/16 | W/16 |
PF02 | 4HW | 1 | 1 |
PF03 | 3 | 1 | 1 |
MF00 | 4096 | H/16 | W/16 |
MF01 | | | |
Table 6. Differences between the versions of the network (v1~v4).
No | Scale | Parallel Path | Pose Regressor | Processing Time (s) | Seg Loss | LUT Loss | Pose Loss
v1 | Single | 2 pixel shuffles | X | 0.18 | 0.65 | 1.59 | -
v2 | Single | Pixel shuffle and convolution transposed | X | 0.20 | 0.58 | 0.09 | -
v3 | Multi | Pixel shuffle and convolution transposed | X | 0.74 | 0.22 | 0.04 | -
v4 | Multi | Pixel shuffle and convolution transposed | O | 0.81 | 0.14 | 0.01 | 0.14
Table 7. Metrics of gradual change for the network.
From | To | Processing Time Change (To/From) | Seg Loss Change (To/From) | LUT Loss Change (To/From)
v1 | v2 | 1.11 | 0.89 | 0.06
v2 | v3 | 3.70 | 0.38 | 0.44
v3 | v4 | 1.09 | 0.64 | 0.25
Table 8. Chungbuk National University map-based BEV generation evaluation.
(Image grid: Origin, GT, and v1–v4 BEV results for three sample views of the CBNU map.)
Table 9. KATRI K-city map-based BEV generation evaluation.
(Image grid: Origin, GT, and v1–v4 BEV results for three sample views of the KATRI K-city map.)
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
