Article

An Efficient Vehicle Localization Method by Using Monocular Vision

MOE Key Laboratory of Optoelectronic Imaging Technology and System, School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China
* Author to whom correspondence should be addressed.
Electronics 2021, 10(24), 3092; https://doi.org/10.3390/electronics10243092
Submission received: 8 November 2021 / Revised: 9 December 2021 / Accepted: 10 December 2021 / Published: 12 December 2021
(This article belongs to the Section Electrical and Autonomous Vehicles)

Abstract

Accurate localization of surrounding vehicles helps drivers perceive the surrounding environment; it can be described by two parameters: depth and direction angle. This research presents a new, efficient monocular-vision-based pipeline to obtain a vehicle's location. We propose a plug-and-play convolutional block that is combined with a basic target detection algorithm to improve the accuracy of vehicle bounding boxes. The boxes are then transformed into actual depth and angle through a conversion relationship deduced from monocular imaging geometry and camera parameters. Experimental results on the KITTI dataset show the high accuracy and efficiency of the proposed method. The mAP increased by about 2% with an additional inference time of less than 5 ms. The average depth error was about 4% for near objects and about 7% for far objects, and the average angle error was about two degrees.

1. Introduction

As of 2020, the number of vehicles worldwide has exceeded 1.2 billion, and automobile accidents have become the eighth leading cause of injury and death worldwide. The National Highway Traffic Safety Administration (NHTSA) states that up to 90% of car accidents are related to human error, such as misjudged safety distances and distraction [1]. Accurate depth and direction angle, which are used to locate surrounding vehicles, help drivers perceive the surrounding environment and prevent dangerous situations. Methods based on global depth-map reconstruction are widely used; among them, monocular vision methods are more economical, flexible, and easy to use. With the development of deep learning, monocular vision based on convolutional neural networks has become a research hotspot. To make vehicle localization more rapid, accurate, and targeted, we focus on more direct methods based on target detection, which can provide the depth, angle, and category of a target at the same time. The accuracy of the bounding boxes and the correctness of the location mapping relationship directly affect the localization of the target, and they are studied separately in this work.
In this paper, a new monocular pipeline is proposed to provide a rapid and accurate estimate of a vehicle's location from a single frame. We present a plug-and-play convolutional block, inspired by semantic segmentation [2,3,4] and named Segmentation Block (SegBlock), to improve the accuracy of vehicle detection bounding boxes. Any target detection algorithm with feature extraction layers can be combined with it to construct an end-to-end network, which greatly improves the accuracy of target boxes at an extremely low computational cost. We then use imaging geometry and camera parameters to deduce a conversion from the vehicle's location in the image to actual depth and angle, named the Pixel to Real-Space (PRS) relationship, which provides a mapping from planar image regions to space.
In order to evaluate the performance of SegBlock, we combined it with SSD (ResNet50 [5] feature extraction backbone) [6] and YOLOv3-spp [7,8] to construct SegSSD and SegYOLO, respectively, and then trained and tested them on the KITTI dataset. We conducted depth and direction angle estimation experiments on KITTI with the four models and the PRS relationship to verify the accuracy and validity of the proposed pipeline. The main contributions of the paper are the following:
  • The combination of target detection and semantic segmentation to improve the precision of bounding boxes was demonstrated.
  • An exact relationship between bounding boxes and real-world location was deduced from imaging geometry and camera parameters.
  • The utilization of deep learning and monocular vision technology for vehicle localization was studied.
The rest of the paper is organized as follows. A literature survey of vehicle localization is conducted in Section 2. The proposed method is described in detail in Section 3. Experimental results and discussion are presented in Section 4. Finally, Section 5 presents the conclusions.

2. Related Work

In this section, we present a literature survey based on two categories of vehicle localization methods: (1) Laser Simultaneous Localization and Mapping (Laser SLAM) and (2) visual localization.

2.1. Laser SLAM

The laser SLAM methods rely on a light detection and ranging (LiDAR) unit to collect surrounding point cloud information. As the name suggests, a moving platform can build a map of the environment while localizing itself and other targets.
The alignment of LiDAR scans is the most critical aspect of achieving this goal. The classic methods are Iterative Closest Point (ICP) [9] and its variants, whose main idea is to align point clouds iteratively until a stopping criterion is satisfied. They are effective but often suffer from high computational cost. Although methods such as Generalized ICP [10] and parallel methods such as VGICP [11] accelerate the registration, efficiency remains a disadvantage.
Methods based on feature matching solve this problem to a certain extent. They extract 3D features, such as planes or edges, from the point clouds and then match them. Zhang et al. proposed a feature-based LiDAR odometry and mapping (LOAM) method to find 3D feature relationships between point clouds and achieved near-real-time results; however, its performance deteriorates when resources are limited [12]. The method was later improved into LeGO-LOAM, which is lightweight and faster [13]. With longer trajectories, feature-based methods can suffer from large estimation errors. To overcome this problem, systems based on graph SLAM have been proposed, such as pose-graph SLAM [14].
Contemporary autonomous vehicles typically employ an expensive LiDAR as the main sensor to gather the depth of and angle to various objects on the road. However, these methods either sacrifice trajectory quality to obtain real-time performance or achieve high accuracy at the cost of computational time [15], and the cost of the system is also considerable [16].

2.2. Visual Localization

Visual localization methods based on computer vision fall into two main classes: binocular vision and monocular vision. They use one or two cameras to acquire environmental information, from which the depth and direction angle of surrounding objects are calculated.
Binocular methods measure the spatial depth of the target area by stereo vision. Traditionally, this consists of four steps: matching cost computation, cost aggregation, disparity computation, and optimization [17]. Recently, disparity computation has become a problem that can be solved with learning approaches. Kendall et al. used an end-to-end network to directly generate the final disparity map: more contextual information was obtained using 3D convolution, and a regression approach was used to predict the disparity values [18]. This idea has been adopted in many subsequent approaches. Building on it, later studies added pyramid structures, fused semantic information, and designed faster training methods to obtain finer disparity maps [19,20,21]. However, binocular ranging requires precise matching [22]. The matching process is time consuming, and its impact on the real-time performance of the visual system is not negligible. At the same time, many strict constraints are required to ensure positioning accuracy, such as camera calibration, imaging quality, the baseline between the left and right camera optical axes, and camera focal length [23].
In contrast, monocular methods use only one camera and have fewer constraints than binocular vision. They offer higher usability, simple operation, and low cost. An intuitive idea is to reconstruct a global depth map from consecutive frames [24]; however, methods using only one frame are simpler and more efficient. Early learning approaches for monocular depth estimation were based on the hidden Markov model (HMM) [25], Markov random fields [26], and defocus cues [27]. Deep learning provides a different kind of solution. Eigen et al. used two networks, one to estimate the global depth structure and the other to refine it locally, with a scale-invariant error to measure depth relationships, but the spatial feature extraction capability was insufficient [28,29]. Inspired by this, Xu et al. proposed a network that fuses complementary information derived from multiple CNN side outputs to model the depth map better [30]. Wang et al. first combined semantic segmentation and depth estimation to capture spatial relationships, further enforced in the joint inference of a hierarchical conditional random field [31]. Laina et al. used ResNet [5] as the base network and added a pyramid structure to fuse multi-layer semantic features, constructing a deeper depth prediction network [32]. Meng et al. improved Mask R-CNN to speed up detection and used the detected boxes to fit the depth of the target [33]. Parmar et al. presented an enhancement of the classical CNN with a module for distance determination, used to estimate the range of a target from object features [34].
Tracking features across 2D images or 3D scene points can measure and record the trajectory of the camera, a technique called Visual Odometry (VO) [35]. Early methods usually obtained depth maps of the scene with a binocular stereo camera, followed by the registration of 3D points for pose and motion estimation [36,37], but the matching process was computationally burdensome. A method based on corner feature detection was then proposed to decrease the computation time [38]. In addition to point matching, methods using line features have proved to be more robust [39]. VO techniques based on a monocular camera, such as [40,41], are more attractive because of fewer system setups and constraints. Among these, many RGB-D-image-based VO techniques achieve promising results. Lin et al. proposed a sparse tracking method using keypoint management and pose adjustment, which adopts sparse edge points for the visual odometry model and significantly improves the accuracy of the sparse direct technique [42].
Visual localization methods are more economical and practical than laser SLAM, and among them, monocular vision methods are more flexible, less constrained, and easier to use. In this work, we focus primarily on vehicle localization based on monocular vision in combination with deep learning.

3. Method

3.1. Vehicle Detection Improvement Method

Accurate location of vehicles in the image, which can be obtained from a target detection algorithm, is crucial for localization in the real world. However, almost no target detection algorithm gives precise bounding boxes. An important reason is the low accuracy of the labeled ground truth boxes, as shown in Figure 1. This limits the application of target detection algorithms in tasks requiring accurate target location.
In terms of labeling, annotations for semantic segmentation are often more accurate than those for target detection because of efficient tools such as polygon tools and the magnetic lasso. With the development of artificial intelligence, many high-precision semi-automated tools for semantic annotation have emerged. Based on this, we propose a plug-and-play segmentation block inspired by semantic segmentation, which helps to improve the boundary accuracy of the boxes in target detection.

3.1.1. Structure

The target detection algorithm can be abstracted into three parts: backbone, neck, and head. SegBlock and the combination diagram are shown in Figure 2. SegBlock provides a pixel-level mask of the target that is used to adjust the bounding box given by the basic network. It has three decoding layers: each layer enlarges the feature map by upsampling and concatenates it with a feature map from the external input. The backbone and the neck are natural encoders. Feature maps from the backbone carry a large amount of low-level semantic information, while neck feature maps capture the overall characteristics of the target because they have been deeply convolved. In SegBlock we reuse the feature maps of the basic algorithm: the lower decoding layers receive feature maps from the backbone, and the higher layer receives the feature map from the neck. With such a structure, high-level semantic information can be combined with low-level superficial information during decoding to accurately identify targets and delineate their boundaries.
In this paper, we use SegBlock in combination with SSD (ResNet50 backbone) and YOLOv3-spp, and name the resulting networks SegSSD and SegYOLO, respectively. Detailed network parameters are given in Table 1, where the transform column ((Conv + Maxpool)_X) denotes a channel and size transformation consisting of a convolutional layer with a kernel size of 1, a crop layer, and a pooling layer; the latter two are used only when needed.
All pixels of the image are dichotomized into target and background by SegBlock, producing a probability map from which a mask map is generated by a softmax function; the target category is provided by the basic algorithm. We conducted a series of experiments to find a trade-off between the performance of the Segmentation Block and the extra computational cost; ultimately, a structure with three convolutional decoding layers proved better.
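To make the structure concrete, the following is a minimal PyTorch sketch of a SegBlock-style decoder. It illustrates the idea described above rather than reproducing the authors' code: class names, channel counts, default scale factors, and the exact ordering of concatenation and upsampling are assumptions that only loosely follow Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecodeLayer(nn.Module):
    """One decoding step: transform the reused detector feature map with a 1x1
    convolution, concatenate it at the current resolution, apply a 3x3
    convolution, then upsample by `scale`."""

    def __init__(self, in_ch, skip_ch, out_ch, scale):
        super().__init__()
        self.transform = nn.Conv2d(skip_ch, out_ch, kernel_size=1)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.scale = scale

    def forward(self, x, skip):
        skip = self.transform(skip)
        if x.shape[-2:] != skip.shape[-2:]:              # crop/resize to match the skip map
            x = F.interpolate(x, size=skip.shape[-2:], mode="nearest")
        x = self.conv(torch.cat([x, skip], dim=1))
        return F.interpolate(x, scale_factor=self.scale, mode="nearest")


class SegBlock(nn.Module):
    """Three decoding layers (neck feature first, backbone features after)
    followed by a head that outputs two-class (target/background) logits."""

    def __init__(self, deep_ch, skip_chs=(512, 256, 64), scales=(4, 2, 2)):
        super().__init__()
        self.dec1 = DecodeLayer(deep_ch, skip_chs[0], 256, scales[0])
        self.dec2 = DecodeLayer(256, skip_chs[1], 128, scales[1])
        self.dec3 = DecodeLayer(128, skip_chs[2], 64, scales[2])
        self.head = nn.Sequential(
            nn.Conv2d(64, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 2, kernel_size=3, padding=1),
        )

    def forward(self, deep_feat, neck_skip, mid_skip, low_skip):
        x = self.dec1(deep_feat, neck_skip)   # fuse neck-level features
        x = self.dec2(x, mid_skip)            # fuse shallower backbone features
        x = self.dec3(x, low_skip)
        return self.head(x)                   # softmax over dim=1 gives the mask map
```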

3.1.2. Training

Because our model combines target detection with segmentation, the training targets should include ground truth bounding boxes, classes, and a semantic segmentation label. We use the smooth L1 loss and the focal loss for target detection, and the dice loss and the cross-entropy loss for segmentation. The properties of these loss functions help to reduce the impact of the significant imbalance between target and background training examples. Although our model is end-to-end, when bounding box labels and semantic segmentation labels are not both available for the same images, another semantic segmentation dataset containing the same categories can be used, because the objects share the same edge features. Benefiting from the plug-and-play nature of SegBlock, we can use a step-by-step training strategy rather than re-labeling.
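As an illustration of the segmentation branch's objective, the snippet below combines the dice loss with cross entropy for the two-class SegBlock output. This is a standard formulation written as a sketch; the equal weighting of the two terms is an assumption, since the paper does not report the exact weights.

```python
import torch
import torch.nn.functional as F


def segmentation_loss(logits, target, eps=1.0):
    """Dice + cross-entropy loss for SegBlock.
    logits: (N, 2, H, W) raw scores; target: (N, H, W) long tensor in {0, 1}."""
    ce = F.cross_entropy(logits, target)
    prob = torch.softmax(logits, dim=1)[:, 1]            # target-class probability
    tgt = target.float()
    inter = (prob * tgt).sum(dim=(1, 2))
    denom = prob.sum(dim=(1, 2)) + tgt.sum(dim=(1, 2))
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)     # soft dice per image
    return ce + dice.mean()
```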
First, the segmentation block is disconnected, and the parameters of the backbone and additional layers are trained with the target detection dataset. Second, the trained backbone parameters are loaded and frozen; the segmentation block is reconnected while the connection behind the neck layers is cut, and the parameters of the segmentation block are trained with the segmentation dataset. Finally, the layers behind the neck are reconnected, the trained segmentation block parameters are loaded, and the finalized network is obtained.
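A rough sketch of the second stage, which trains only SegBlock on a segmentation dataset while the detector backbone stays frozen, is given below. It reuses the segmentation_loss helper sketched above; attribute names such as model.backbone, model.segblock, and model.backbone_features are placeholders rather than the authors' actual identifiers.

```python
import torch


def train_segblock_stage(model, seg_loader, epochs, device="cuda"):
    """Stage 2: freeze the backbone trained in stage 1 and fit only SegBlock."""
    model.to(device).train()
    for p in model.backbone.parameters():        # keep stage-1 weights fixed
        p.requires_grad = False
    optimizer = torch.optim.SGD(model.segblock.parameters(), lr=5e-4, momentum=0.9)
    for _ in range(epochs):
        for images, masks in seg_loader:
            images, masks = images.to(device), masks.to(device)
            feats = model.backbone_features(images)   # reused encoder feature maps
            logits = model.segblock(*feats)
            loss = segmentation_loss(logits, masks)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```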

3.1.3. Inference

When an image is fed to our network, there are three outputs: predicted boxes with confidence above the threshold, the class of every box, and a mask map. The mask map characterizes whether each pixel belongs to a target. The boxes detected by the basic algorithm are treated as coarse areas, and each boundary is corrected by the mask map: when a boundary line contains no target pixel, it is moved along its perpendicular direction to the position where the first target pixel appears. SegBlock is robust because the boundary optimization occurs only within the coarse area; even if the segmentation contains errors, they do not seriously affect the results.
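The boundary-correction rule can be written compactly with NumPy. The following is a sketch of the idea under stated assumptions (a binary mask and boxes in (x1, y1, x2, y2) pixel coordinates), not the authors' exact implementation.

```python
import numpy as np


def refine_box(box, mask):
    """Shrink a coarse detection box so that each boundary sits at the first
    row/column inside the box that contains target pixels. If the mask has no
    target pixels inside the box, the coarse box is kept unchanged."""
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    crop = mask[y1:y2, x1:x2]
    if crop.size == 0 or not crop.any():
        return box                                   # segmentation missed: keep coarse box
    rows = np.flatnonzero(crop.any(axis=1))          # rows containing target pixels
    cols = np.flatnonzero(crop.any(axis=0))          # columns containing target pixels
    # each side moves inward, so the correction never leaves the coarse area
    return (x1 + cols[0], y1 + rows[0], x1 + cols[-1] + 1, y1 + rows[-1] + 1)
```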

3.2. PRS Relationships

When taking an image, the displayed size of a target is related to its actual depth, while the pixel offset between the target and the image center is related to the direction angle. A schematic diagram of the spatial relations of primary imaging is shown in Figure 3.
According to Figure 3, the spatial relation can be expressed as Equation (1).
$2d\tan\alpha_h = H,\quad 2d\tan\alpha_w = W,\quad d\tan\theta = \Delta w$    (1)
The field angle can be calculated by Equation (2).
$\alpha_i = \tan^{-1}\!\left(\frac{c_i}{2f}\right),\quad i \in \{h, w\}$    (2)
where $c_i$ is the sensor chip size along dimension $i$ and $f$ is the effective focal length. These parameters are fixed once the imaging system is determined.
During imaging, the magnification at the edges of the lens differs from that at the center, which leads to distortion, an inherent characteristic of optical lenses. However, most current industrial cameras have small distortion in both the axial and lateral directions, usually less than 1%, so the similarity between object and image remains high enough to ignore the effect of distortion; the simulation in Figure 4 and the error analysis below illustrate this. The camera lens can therefore be treated as an ideal optical system: the object plane and the imaging plane are conjugate, and every pixel corresponds to an equal-sized region in the actual plane of the object location. Equation (3) can then be obtained.
$\frac{h}{H} = \frac{y}{Y} = \beta,\quad \frac{2\Delta w}{W} = \frac{2x}{X} = \gamma$    (3)
where $x$ is the number of pixels between the center of the target bounding box and the center of the image, $y$ is the number of vertical pixels of the bounding box, and $X$ and $Y$ are the width and height of the whole image, respectively. The axial and lateral ratios are defined as $\beta$ and $\gamma$, respectively. The PRS relationship in Equation (4) then follows from Equations (1)–(3).
$d = \frac{h}{2\beta\tan\alpha_h},\quad \theta = \tan^{-1}(\gamma\tan\alpha_w)$    (4)
When the lens and sensor models are determined, $\alpha_h$ and $\alpha_w$ are fixed. $h$ is the actual height of the target object, which can be obtained from a pre-constructed lookup table using the category given by the target detection algorithm. $\beta$ and $\gamma$ are produced from the bounding box of the target detection algorithm; the higher the accuracy of the bounding box, the higher the precision of the distance and angle.
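A direct transcription of Equations (2)–(4) into Python is shown below as a sketch. The argument names and unit choices are assumptions, and distortion is ignored as discussed above.

```python
import math


def prs_locate(box, image_size, prior_height, chip_size, focal_length):
    """Map a detection box to depth d (same unit as prior_height) and direction
    angle theta (degrees) via the PRS relationship.
    box:         (x1, y1, x2, y2) in pixels
    image_size:  (X, Y) image width and height in pixels
    chip_size:   (c_w, c_h) sensor dimensions, same unit as focal_length"""
    x1, y1, x2, y2 = box
    X, Y = image_size
    c_w, c_h = chip_size

    alpha_h = math.atan(c_h / (2.0 * focal_length))    # Equation (2), vertical half-angle
    alpha_w = math.atan(c_w / (2.0 * focal_length))    # Equation (2), horizontal half-angle

    y = y2 - y1                                        # vertical pixels of the box
    x = (x1 + x2) / 2.0 - X / 2.0                      # signed offset from the image centre
    beta = y / Y                                       # axial ratio, Equation (3)
    gamma = 2.0 * x / X                                # lateral ratio, Equation (3)

    d = prior_height / (2.0 * beta * math.tan(alpha_h))         # Equation (4)
    theta = math.degrees(math.atan(gamma * math.tan(alpha_w)))  # Equation (4)
    return d, theta
```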
As an example, we conducted a simulation using the parameters of the imaging system used to capture the KITTI dataset (1.4 megapixels, 1/2″ Sony ICX267 CCD, Edmund Optics lens, 4 mm focal length), as shown in Figure 4.
Figure 4a shows that the depth and its error increase gradually as y decreases, and a larger h corresponds to a faster rate; improving the accuracy of y can therefore significantly reduce the ranging error. In Figure 4b, the positive and negative signs indicate direction and the absolute values indicate magnitude. A larger pixel offset corresponds to a larger measured angle, and the error rate is largest at the center of the image when the bounding box is not exact; accurate pixel offsets can greatly reduce the angle estimation error.
Assuming axial and lateral distortions of 1% each, the depth error rate of a bounding box 70 pixels high and 200 pixels wide at the top or bottom of the image is less than 1.41% with respect to the depth at the center, and the angle error rate at the left or right border is close to zero, so the error due to longitudinal and tangential distortion is negligible.

4. Experiments

We used PyTorch 1.1.0 and a Quadro GV100 GPU with 32 GB of memory to train SSD, SegSSD, YOLOv3-spp, and SegYOLO as contrast experiments on the KITTI [43,44] dataset and the BDD100K [45] segmentation dataset. We then evaluated their target detection results and real-time performance on a low-compute GPU (GTX 1650) used to simulate an edge device. Finally, we evaluated the depth and angle estimation results of the SegYOLO and PRS pipeline to verify the method proposed in this paper.
All networks were trained with an SGD optimizer, a batch size of 8, and an initial learning rate of 0.0005. The total number of epochs was set to 30, and we used the StepLR strategy to automatically decrease the learning rate to 0.7 times its value every 5 epochs.
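In PyTorch terms, this training schedule amounts to the sketch below; the momentum value and the train_one_epoch helper are illustrative assumptions not specified in the paper.

```python
import torch


def train(model, train_loader, train_one_epoch, epochs=30):
    """SGD with lr = 5e-4 and StepLR decay of 0.7 every 5 epochs, as in Section 4."""
    optimizer = torch.optim.SGD(model.parameters(), lr=5e-4, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.7)
    for _ in range(epochs):
        train_one_epoch(model, train_loader, optimizer)   # one pass, batch size 8 set in the loader
        scheduler.step()                                  # lr -> 0.7 * lr every 5 epochs
```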

4.1. Vehicle Detection

4.1.1. Dataset Implementation

The KITTI dataset is commonly used for evaluating computer vision algorithms in autonomous driving scenarios. It contains several subsets for different evaluation tasks; the object detection subset contains 7481 annotated images and 8 classes. We divided the training and test sets at a ratio of 8 to 2 and selected five classes (car, van, truck, cyclist, and tram) to regenerate the annotation files. For the models with SegBlock, a semantic segmentation dataset is also needed. The semantic segmentation subset of KITTI contains only 200 images, and such a small amount of training data may lead to overfitting. Because of the plug-and-play nature of SegBlock, it can be trained across datasets, so we used 3000 images from the BDD100K segmentation dataset for its training. SSD, SegSSD, YOLOv3-spp, and SegYOLO were then trained according to the strategy described in Section 3.1.2.

4.1.2. Qualitative Analysis

To visualize the improvement in vehicle detection, we plotted the detection results of SSD and YOLOv3-spp, the mask map of SegBlock, and the optimized bounding boxes of SegSSD and SegYOLO on the same pictures, as shown in Figure 5.
Both SegSSD and SegYOLO obtain more accurate bounding boxes than their basic methods because of the accurate target mask provided by SegBlock. In comparison, SegYOLO obtains a more accurate target mask than SegSSD, which is related to the encoding capability of the basic algorithm.

4.1.3. Quantitative Analysis

The mean average precision (mAP) over all categories at an IoU threshold of 0.5 was used to evaluate the prediction results objectively; it reflects both the predicted boxes and the classification accuracy, and larger is better.
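For reference, a prediction counts as a true positive for AP when its IoU with a ground-truth box is at least 0.5; a minimal IoU implementation for axis-aligned (x1, y1, x2, y2) boxes is sketched below.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```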
As shown in Table 2, after adding SegBlock, the mAP of both SSD and YOLOv3-spp is improved. However, the AP values of some categories decrease slightly, for two reasons. (1) For some long-distance targets, SegBlock slightly overcorrects; as shown in Figure 6a, the left boundary line shrinks excessively after correction of the right car. (2) Some bounding boxes corrected by SegBlock are better than the ground truth boxes: as shown in Figure 6b, the corrected results of the left car and the right cyclist are clearly more accurate than both the basic predictions and the annotated boxes, but because AP is evaluated against the ground truth bounding boxes, the AP value decreases slightly.
The extra computational cost of improving the bounding box quality with SegBlock is extremely low. Since the encoding feature maps come from the basic network and SegBlock contains only three convolutional decoding layers with a small number of channels, the inference time increases by only 3.5 ms for SegSSD and 4.2 ms for SegYOLO.

4.2. Distance and Angle Measurement

The KITTI 2D object detection dataset provides ground truth depth and angle values for each target. We performed a series of experiments on it, combining SSD, SegSSD, YOLOv3-spp, and SegYOLO with PRS, to evaluate the accuracy and effectiveness of the proposed pipeline. Because targets closer than 40 m account for about 80% of the entire test set, we set 40 m as the boundary between near and far distances, and 20 degrees as the boundary between small and large angles; this tests the performance of the methods at both small and large depth and angle scales. The prior height of each category is the mean value in the KITTI training dataset, as listed in Table 3. The experimental results are shown in Table 4.
The methods that use SegBlock (SegSSD and SegYOLO) perform better than the corresponding basic algorithms (SSD and YOLOv3-spp). At the same time, selecting an appropriate basic model is important: the detection capability of YOLOv3-spp is much better than that of SSD, so the depth and angle estimation performance of SegYOLO is better than that of SegSSD. Indeed, all the indexes indicate that the pipeline based on SegYOLO achieves the best results among the four methods. Its recall over all targets is 87% at near distances and 73% at far distances, so most of the targets appearing in the pictures are detected, while the average depth error rate is kept below 7% for both near and far distances. However, the long-distance recall of the cyclist category is only 5.3%, and its depth error rate reaches 8.49%, because cyclists beyond 40 m occupy too few effective pixels. The average angular error is less than 2 degrees for both small-angle and large-angle detection, except for the car category in the large-angle case; the reason is that some cars are partially occluded by street trees at large angles, so the width of the detection box covers only part of the true value. Figure 7 shows some outputs of the pipeline.
As shown in Figure 7, our method clearly improves the accuracy of bounding boxes for small targets, overlapping targets, and different categories of targets. Accurate depth and angle of deviation from the center of the field of view are obtained from the pipeline as well.

4.3. Discussion

Compared with previous methods, the main contribution of this study is a new monocular vision pipeline that provides the target category, depth, and direction angle at the same time, combining simplicity with effectiveness and high accuracy. We conducted a series of experiments on KITTI with four methods to verify its accuracy and validity: for the proposed SegYOLO and PRS pipeline, the average depth error rate is about 4% for targets closer than 40 m and about 7% for targets beyond 40 m, the average angle error is about 2 degrees, and the inference time for a single image is only 39.7 ms on a GTX 1650 GPU.
The pipeline is valid over the clear imaging range of the camera because it depends entirely on the image rather than on any other constraints. Since the system is inexpensive and easy to use, it has great potential for practical applications. It is also more efficient, because the target is located directly rather than by first reconstructing a map of the environment through point cloud matching or stereo matching and then deriving target locations, as in SLAM. Another unique capability is that the method obtains multivariate information, namely the depth, angle, and category of the target, at the same time, which is very useful for movement decisions. However, if no vehicles appear in the field of view or vehicles are not detected, it outputs no localization results, which is a limitation. There are two error sources in the localization results. First, since the number of sensor pixels is fixed, different depths within a range may yield the same depth estimate because they produce the same number of vertical pixels in the picture; for example, a truck at 150 to 155 m may have the same vertical pixel count, and the larger the depth, the greater this range. This systematic error is the primary error source for long-distance ranging, but it decreases with increasing resolution. Second, the prior heights currently used are the average heights of the categories in the KITTI dataset, which makes specific depth estimates larger or smaller; a finer partition of the categories would reduce this impact. For example, the car category could be divided into sedan, SUV, Jeep, etc.
Our follow-up study will concentrate on two parts. First, images of different resolutions with finer category partitions will be tried to explore the trade-off between computational cost and systematic error. Second, different structures of the segmentation block will be investigated to reduce the extra inference time and improve segmentation accuracy.

5. Conclusions

In this paper, we proposed a new monocular visual pipeline to estimate the depth and direction angle used to locate surrounding vehicles. A robust, real-time basic target detection algorithm is combined with SegBlock into an end-to-end network to provide the categories of targets and their high-precision bounding boxes on the image. The actual depth and angle are then obtained through PRS from the number of vertical pixels of each bounding box and the pixel offset between the box center and the image center, respectively.
For validation, we combined SegBlock with SSD and YOLOv3-spp to form SegSSD and SegYOLO, respectively; the four networks were then trained and tested on the KITTI dataset. Objectively, the mAP was improved, and subjectively, we obtained more accurate bounding boxes than the labeled annotations. We then conducted depth and angle estimation experiments on KITTI with the four methods to verify the accuracy and validity of the proposed pipeline, and the SegYOLO and PRS combination gave excellent results. However, there are systematic errors that cannot be ignored, especially for distant targets. In the future, we will try to reduce these systematic errors and investigate more efficient structures for SegBlock.

Author Contributions

Conceptualization, Y.L. and Y.H.; methodology, Y.L.; software, Y.L.; validation, Y.L., Y.H. and J.Y.; formal analysis, M.L.; investigation, J.Y.; resources, Y.L.; data curation, Y.L.; writing—original draft preparation, Y.H.; writing—review and editing, Y.L.; visualization, Y.H.; supervision, Y.H.; project administration, W.J.; funding acquisition, W.J. and Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China under Grants No. 2020YFF0304104 and 2018YFB0504901.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, D.; Xu, Z.; Huang, Z.; Gutierrez, A.R.; Norris, T.B. Neural network based 3D tracking with a graphene transparent focal stack imaging system. Nat. Commun. 2021, 12, 2413.
  2. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.
  3. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397.
  4. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  5. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
  6. Liu, W.; Anguelov, D.; Erhan, D. SSD: Single Shot MultiBox Detector. Lect. Notes Comput. Sci. 2016, 9905, 21–37.
  7. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
  8. Sri, J.S.; Esther, R.P. LittleYOLO-SPP: A Delicate Real-Time Vehicle Detection Algorithm. Optik 2020, 225, 165–173.
  9. Paul, B.; Neil, M. Method for registration of 3-D shapes. Sens. Fusion IV Control. Paradig. Data Struct. 1992, 1611, 586–606.
  10. Aleksandr, S.; Dirk, H.; Sebastian, T. Generalized-ICP. Robot. Sci. Syst. 2009, 2, 435.
  11. Kenji, K.; Masashi, Y.; Shuji, O.; Atsuhiko, B. Voxelized GICP for fast and accurate 3D point cloud registration. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11054–11059.
  12. Ji, Z.; Sanjiv, S. LOAM: Lidar Odometry and Mapping in Real-time. Robot. Sci. Syst. 2014, 2, 592.
  13. Tixiao, S.; Brendan, E. LeGO-LOAM: Lightweight and ground-optimized lidar odometry and mapping on variable terrain. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 7 January 2019; pp. 4758–4765.
  14. Ellon, M.; Pierrick, K.; Simon, L. ICP-based pose-graph SLAM. In Proceedings of the 2016 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), Lausanne, Switzerland, 23–27 October 2016; pp. 195–200.
  15. Frosi, M.; Matteo, M. ART-SLAM: Accurate Real-Time 6DoF LiDAR SLAM. arXiv 2021, arXiv:2109.05483.
  16. Scharstein, D.; Szeliski, R.; Zabih, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In Proceedings of the IEEE Workshop on Stereo and Multi-Baseline Vision 2001 (SMBV), Kauai, HI, USA, 9–10 December 2001; pp. 131–140.
  17. Navarro, J.; Duran, J.; Buades, A. Disparity adapted weighted aggregation for local stereo. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 2249–2253.
  18. Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 66–75.
  19. Chang, J.; Chen, Y. Pyramid stereo matching network. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 5410–5418.
  20. Yang, G.; Zhao, H.; Shi, J.; Deng, Z.; Jia, J. SegStereo: Exploiting Semantic Information for Disparity Estimation. Lect. Notes Comput. Sci. 2018, 11211, 660–676.
  21. Guo, X.; Yang, K.; Yang, W.; Wang, X.; Li, H. Group-wise correlation stereo network. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3268–3277.
  22. Young, T.; Bourke, M.; Zhou, X.; Toshitake, A. Ten-m2 is required for the generation of binocular visual circuits. J. Neurosci. 2013, 33, 12490–12509.
  23. Lin, Y.; Yang, J.; Lv, Z.; Wei, W.; Song, H. A Self-Assessment Stereo Capture Model Applicable to the Internet of Things. Sensors 2015, 15, 20925–20944.
  24. Perdices, E.; Cañas, J.M. SDVL: Efficient and Accurate Semi-Direct Visual Localization. Sensors 2019, 19, 302.
  25. Nagai, T.; Naruse, T.; Ikehara, M.; Kurematsu, A. HMM-based surface reconstruction from single images. In Proceedings of the International Conference on Image Processing, New York, NY, USA, 22–25 September 2002; p. 11.
  26. Saxena, A.; Chung, S.; Ng, A. 3-D depth reconstruction from a single still image. Int. J. Comput. Vis. 2008, 76, 53–69.
  27. Zhuo, S.; Sim, T. On the recovery of depth from a single defocused image. Lect. Notes Comput. Sci. 2009, 5702, 889–897.
  28. Eigen, D.; Puhrsch, C.; Fergus, R. Depth Map Prediction from a Single Image using a Multi-Scale Deep Network. arXiv 2014, arXiv:1406.2283.
  29. Eigen, D.; Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2650–2658.
  30. Xu, D.; Ricci, E.; Ouyang, W.; Wang, X.; Sebe, N. Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 161–167.
  31. Wang, P.; Shen, X.; Lin, Z.; Cohen, S.; Price, B.; Yuille, A. Towards unified depth and semantic prediction from a single image. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 July 2015; pp. 2800–2909.
  32. Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 239–248.
  33. Meng, C.; Bao, H.; Ma, Y.; Xu, X.; Li, Y. Visual Meterstick: Preceding Vehicle Ranging Using Monocular Vision Based on the Fitting Method. Symmetry 2019, 11, 1081.
  34. Parmar, Y.; Natarajan, S.; Sobha, G. Deep Range: Deep-learning-based object detection and ranging in autonomous driving. Intell. Transp. Syst. 2019, 13, 1256–1264.
  35. Nister, D.; Naroditsky, O.; Bergen, J. Visual odometry. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Washington, DC, USA, 27 July 2004; p. 1.
  36. Lin, H.Y.; Lin, J.H. A visual positioning system for vehicle or mobile robot navigation. IEICE Trans. Inf. Syst. 2006, 89, 2109–2116.
  37. Aladem, M.; Rawashdeh, S. Lightweight visual odometry for autonomous mobile robots. Sensors 2018, 18, 2837.
  38. Howard, A. Real-time stereo visual odometry for autonomous ground vehicles. In Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, Nice, France, 22–26 September 2008; pp. 3946–3952.
  39. Yan, L.; Dezheng, S. Robust RGB-D odometry using point and line features. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 3934–3942.
  40. Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 15–22.
  41. Zhou, D.; Dai, Y.; Li, H. Ground-Plane-Based Absolute Scale Estimation for Monocular Visual Odometry. Trans. Intell. Transp. Syst. 2020, 21, 791–802.
  42. Lin, H.Y.; Hsu, J.L. A Sparse Visual Odometry Technique Based on Pose Adjustment With Keyframe Matching. IEEE Sens. J. 2020, 21, 11810–11821.
  43. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
  44. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
  45. Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y. BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. arXiv 2018, arXiv:1805.04687.
Figure 1. In the KITTI training dataset, the green boxes indicate the annotated ground truth bounding boxes. The labeling errors of the truck on the left and the car on the right cannot be ignored, and the box of the black car in front is too small.
Figure 2. The collocation structure of Segmentation Block and basic target detection algorithm. Upward arrows in segmentation block indicate upsampling. Bolded arrows indicate the delivery of one or more feature maps. The green rectangle is the output result of the basic algorithm and the blue rectangle is the bounding box with the segmentation block, w and h are the input size of basic detection algorithm.
Figure 3. Diagram of the spatial relationship when taking a photo. (a): Side view. AB indicates the actual plane where the object is located; $\alpha_h$ is half of the maximum lateral field of view angle. (b): Front view. Rectangle $A_{left}A_{right}B_{right}B_{left}$ is the imaging region in the actual plane of the object location; $h$ and $H$ are the heights of the target and the region, respectively. $\Delta w$ indicates the offset distance between the target center and the center of the field of view. (c): Top view. $W$ is the width of the imaging region. $\alpha_w$ and $\theta$ are half of the maximum horizontal field of view and the angle of the target center offset, respectively.
Figure 4. PRS simulations using the imaging system parameters of the KITTI dataset, assuming that x and y are accurate. (a) The depth results at different h and the error rate when the vertical pixel count of the bounding box is y + Δ. (b) The angle results and the error rate when the pixel offset is x + Δ.
Figure 5. The green boxes are the detection output of the basic target detection algorithm (SSD, YOLOv3spp), and the light-sky-blue boxes are the output of SegSSD and SegYOLO, the target regions in mask map of SegBlock are marked by red. (a,b) are the results of YOLOv3spp and SegYOLO, (c,d) are the results of SSD and SegSSD.
Figure 6. (a,b) shows the two cases of the decrease of AP values in some categories: overcorrection and better than ground truth, respectively. The yellow-green bounding boxes indicate the output of YOLOv3spp, the bright blue indicates the output of SegYOLO, and the blackish green boxes indicate the ground truth of annotation.
Figure 7. Outputs of the SegYOLO and PRS pipeline (a–e). For clarity, only one depth and angle estimation result is marked in each figure, with the ground truth in parentheses.
Table 1. SegSSD and SegYOLO detailed parameters. X denotes the layer index; 's' in the Decode_X column indicates the upsampling scale factor.

| Model | X | Encode_X structure | Output size | (Conv + Maxpool)_X structure | Output size | Decode_X structure | Output size |
|---|---|---|---|---|---|---|---|
| SegSSD (300 × 300) | 1 | 1 × 1, 256; 3 × 3, 512 | 19 × 19 | concat; 1 × 1, 256 | 19 × 19 | 3 × 3, 256; upsampling, s4 | 76 × 76 |
| | 2 | [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3 | 75 × 75 | crop; concat; 1 × 1, 128 | 75 × 75 | 3 × 3, 128; 3 × 3, 128; upsampling, s2 | 150 × 150 |
| | 3 | 7 × 7, 64, stride 2 | 150 × 150 | concat; 1 × 1, 64 | 150 × 150 | 3 × 3, 64; 3 × 3, 64; upsampling, s2 | 300 × 300 |
| | final | | | | | 3 × 3, 32; 3 × 3, classes | 300 × 300 |
| SegYOLO (512 × 512) | 1 | 3 × 3, 1024 | 16 × 16 | concat; 1 × 1, 256 | 16 × 16 | 3 × 3, 256; upsampling, s4 | 64 × 64 |
| | 2 | [1 × 1, 128; 3 × 3, 256] × 8 | 64 × 64 | concat; 1 × 1, 128 | 64 × 64 | 3 × 3, 128; 3 × 3, 128; upsampling, s4 | 256 × 256 |
| | 3 | 1 × 1, 256; 3 × 3, 512 | 256 × 256 | concat; 1 × 1, 64 | 256 × 256 | 3 × 3, 64; 3 × 3, 64; upsampling, s2 | 512 × 512 |
| | final | | | | | 3 × 3, 32; 3 × 3, classes | 512 × 512 |
Table 2. KITTI test detection results. Inference times are measured on a GTX 1650 GPU.

| Model | mAP@0.5 | Car AP | Van AP | Truck AP | Cyclist AP | Tram AP | Inference Time (ms) |
|---|---|---|---|---|---|---|---|
| SSD | 0.449 | 0.752 | 0.336 | 0.504 | 0.276 | 0.375 | 26.9 |
| SegSSD | 0.452 | 0.762 | 0.344 | 0.498 | 0.273 | 0.391 | 30.4 |
| YOLOv3-spp | 0.596 | 0.837 | 0.499 | 0.599 | 0.580 | 0.463 | 35.9 |
| SegYOLO | 0.620 | 0.840 | 0.549 | 0.550 | 0.570 | 0.592 | 39.7 |
Table 3. The prior height and the number of annotated samples (total and broken down by distance and angle) of the KITTI test set.

| Category | Prior Height (m) | Total | ≤40 m | >40 m | ≤20° | >20° |
|---|---|---|---|---|---|---|
| Car | 1.53 | 2566 | 2087 | 479 | 1724 | 842 |
| Van | 2.21 | 247 | 152 | 95 | 160 | 87 |
| Truck | 3.25 | 93 | 40 | 53 | 71 | 22 |
| Cyclist | 1.74 | 127 | 108 | 19 | 87 | 40 |
| Tram | 3.53 | 56 | 30 | 26 | 46 | 10 |
Table 4. The detection results of SSD, SegSSD, YOLOv3-spp, and SegYOLO along with the depth and angle evaluation results of the pipeline. The confidence thresholds are all set to 0.4.

| Category | Model | Recall ≤40 m | Recall >40 m | Avg. Depth Error Rate ≤40 m (%) | Avg. Depth Error Rate >40 m (%) | Avg. Angle Error ≤20° (°) | Avg. Angle Error >20° (°) |
|---|---|---|---|---|---|---|---|
| Car | SSD | 0.57 | 0.47 | 9.15 | 12.23 | 2.34 | 2.57 |
| | SegSSD | 0.61 | 0.51 | 6.11 | 8.79 | 2.11 | 2.47 |
| | YOLOv3-spp | 0.87 | 0.72 | 5.02 | 9.12 | 2.21 | 2.42 |
| | SegYOLO | 0.88 | 0.75 | 3.05 | 5.52 | 1.96 | 2.38 |
| Van | SSD | 0.63 | 0.42 | 10.61 | 11.06 | 2.01 | 2.34 |
| | SegSSD | 0.67 | 0.50 | 7.63 | 9.28 | 1.85 | 2.16 |
| | YOLOv3-spp | 0.85 | 0.78 | 7.36 | 10.96 | 1.58 | 1.87 |
| | SegYOLO | 0.91 | 0.79 | 4.82 | 7.31 | 1.47 | 1.83 |
| Truck | SSD | 0.43 | 0.56 | 9.24 | 12.33 | 2.12 | 2.32 |
| | SegSSD | 0.47 | 0.60 | 7.92 | 10.69 | 1.94 | 2.12 |
| | YOLOv3-spp | 0.71 | 0.79 | 8.93 | 9.25 | 1.26 | 1.92 |
| | SegYOLO | 0.73 | 0.83 | 5.33 | 6.17 | 1.16 | 1.57 |
| Cyclist | SSD | 0.56 | 0.11 | 12.33 | 14.73 | 2.57 | 2.85 |
| | SegSSD | 0.59 | 0.16 | 8.26 | 11.56 | 2.17 | 2.45 |
| | YOLOv3-spp | 0.82 | 0.19 | 6.76 | 10.04 | 2.38 | 2.46 |
| | SegYOLO | 0.83 | 0.21 | 3.91 | 8.49 | 1.79 | 1.82 |
| Tram | SSD | 0.40 | 0.46 | 10.28 | 13.75 | 1.99 | 2.21 |
| | SegSSD | 0.47 | 0.54 | 5.60 | 7.61 | 1.55 | 1.43 |
| | YOLOv3-spp | 0.57 | 0.70 | 4.95 | 8.81 | 1.52 | 1.22 |
| | SegYOLO | 0.63 | 0.77 | 4.34 | 5.3 | 1.43 | 1.02 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
