
Transformer-Based Global PointPillars 3D Object Detection Method

School of Automobile and Traffic Engineering, Wuhan University of Science and Technology, Wuhan 430065, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(14), 3092; https://doi.org/10.3390/electronics12143092
Submission received: 2 July 2023 / Revised: 13 July 2023 / Accepted: 14 July 2023 / Published: 16 July 2023

Abstract

The PointPillars algorithm can detect vehicles, pedestrians, and cyclists on the road and is widely used for environmental perception in autonomous driving. However, its feature encoding network relies on a minimalist PointNet to extract features from the point cloud: it does not consider the global context of the point cloud, and local structure features are extracted insufficiently. These feature losses seriously degrade the performance of the object detection network. To address this problem, this paper proposes an improved PointPillars algorithm named TGPP (Transformer-based Global PointPillars). After the point cloud is divided into pillars, global context features and local structure features are extracted through a multi-head attention mechanism, so that the encoded point cloud carries both global context and local structure information. The resulting two-dimensional pseudo-image is processed by a two-dimensional convolutional neural network for feature learning, and an SSD detection head is then used to perform 3D object detection. Experiments demonstrate that TGPP achieves an average precision improvement of 2.64% on the KITTI test set.

1. Introduction

The 3D object detection technology is an important part of the environment perception module in an autonomous driving system. Accurately identifying objects such as vehicles, pedestrians, and cyclists on the road is the basis for vehicle planning and control. To accomplish this goal, self-driving cars rely on a variety of sensors, among which lidar is one of the most important. Lidar measures the distance to the surrounding environment with a scanning laser and directly generates sparse 3D point clouds, which gives it inherent advantages in the 3D object detection task. Traditional methods usually down-sample the point cloud, remove the ground, and then apply Euclidean, DBSCAN, and other clustering methods combined with 3D bounding boxes to detect objects [1,2,3,4,5]. These methods require cumbersome parameter tuning during deployment, making them difficult to apply in practice. With the rapid development of deep learning and parallel computing hardware, end-to-end 3D object detection based on deep learning has become a key research topic.
With the rapid development of computer vision and deep learning, 2D object detection has made great progress, but point clouds and images are essentially different data forms. Because point clouds are unordered, convolving them directly leads to severe distortion of the features [6], so excellent 2D object detection algorithms cannot be applied directly to 3D object detection. In 2017, Qi et al. proposed the PointNet [7] and PointNet++ [8] deep neural networks, which take the raw point cloud as input and can be applied to point-by-point feature extraction, point cloud recognition, point cloud semantic segmentation, and other tasks; they also provide feature extraction tools for 3D object detection based on point cloud data. Point-based 3D object detection methods followed. PointRCNN [9] is a classic point-based method whose main idea is to extract point-by-point features with a PointNet network and predict 3D proposals to achieve 3D object detection. Methods of this type spend a great deal of time retrieving points, so the computation is very large and the detection efficiency is low. In response, Zhou et al. proposed VoxelNet [10], the earliest voxel-based method. This algorithm represents the point cloud as voxels; subsequent processing on voxels reduces the amount of computation and makes target feature extraction more convenient, but because the 3D convolutional neural network is slow at inference, its detection efficiency is still not ideal. As an upgraded version of VoxelNet, the SECOND [11] algorithm replaces ordinary 3D convolution with sparse 3D convolution to reduce inference time, yet it still cannot eliminate the drawback of slow 3D convolution. To this end, the PointPillars [12] algorithm proposes a novel encoder that realizes end-to-end learning for 3D object detection using only 2D convolutional neural networks. Its unique pillar-based encoding greatly increases detection speed, and its simple framework can be easily deployed on a variety of lidars. It is currently one of the most widely used methods in engineering practice, so research on and improvement of the algorithm have practical application value and engineering significance.
At present, the PointPillars algorithm still has a large advantage in detection speed, but its detection accuracy is inferior to later work. For example, Li et al. proposed UVTR [13], which explicitly represents image and point cloud features in voxel space and lets them interact; Lai et al. proposed SphereFormer [14], which addresses the problems of discontinuous information and a limited receptive field. In the past two years, some scholars have therefore proposed improvements to PointPillars. In 2021, He et al. [15] proposed an intra-pillar multi-scale feature extraction module to enhance the overall learning ability of the PointPillars algorithm and thereby improve detection accuracy. That work improved the local structural feature extraction of the point cloud but still did not consider its global context features. In 2022, Chen et al. [16] improved the 2D convolutional down-sampling module of PointPillars based on Swin Transformer [17], optimizing the original 2D convolutional neural network and improving the Average Orientation Similarity (AOS) to a certain extent. That improvement optimizes the 2D convolutional neural network to strengthen the learning of point cloud features.
However, the above schemes still do not make full use of point cloud features. In the original feature encoding process, all points are divided into uniform pillars, where each pillar can be understood as the stack of voxels at the corresponding position along the z-axis; the points then pass through a minimalist PointNet network that performs local feature extraction, max pooling is used to obtain the point that represents the features of each pillar, and finally a sparse 2D pseudo-image is generated through position mapping. In this encoding process, local feature extraction is insufficient and global features are not considered, which causes a loss of point cloud features.
To solve the aforementioned problem, this paper proposes an improved PointPillars algorithm named TGPP (Transformer-based Global PointPillars), which improves the feature encoding network based on Transformer [18]: after the point cloud is divided into pillars, global position features and local structure features are computed by an improved Transformer module, giving each pillar rich global context features and local features. This preserves the local features of the point cloud and accurate global position information during feature encoding and improves the object detection accuracy of the algorithm.

2. TGPP Algorithm Network

TGPP is an improvement on PointPillars. The inference speed of the PointPillars algorithm is very fast, exceeding the scanning frequency of the lidar, so its real-time performance is very good. The algorithm takes 3D point clouds as input, enables end-to-end learning, and detects the three common road objects: vehicles, pedestrians, and cyclists.
The TGPP algorithm structure is shown in Figure 1. The algorithm consists of three main parts: (1) Pillar Feature Net: divide the 3D point cloud into pillars and generate a 2D pseudo-image. (2) Two-dimensional convolutional neural network: down-sample the 2D pseudo-image several times to obtain feature maps of different resolutions, then up-sample these feature maps to the same size and concatenate them to generate the final feature map. (3) Object detection head: generate 3D detection boxes and object classifications from the feature map to obtain the position and type of each object. The main difference from the original method lies in the feature encoding network; the structure of the algorithm is introduced in detail below, and a high-level sketch of how the three parts compose follows this paragraph.
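The composition of these three parts can be summarized in the following minimal PyTorch-style sketch; the module and argument names are illustrative placeholders, not the actual OpenPCDet classes.

```python
import torch.nn as nn

class TGPP(nn.Module):
    """Illustrative composition of the three TGPP stages (names are hypothetical)."""
    def __init__(self, pillar_encoder, backbone_2d, ssd_head):
        super().__init__()
        self.pillar_encoder = pillar_encoder  # pillars -> (C, H, W) pseudo-image
        self.backbone_2d = backbone_2d        # 2D CNN: down-sample, up-sample, concatenate
        self.ssd_head = ssd_head              # classification + 3D box regression

    def forward(self, pillar_features, pillar_coords):
        pseudo_image = self.pillar_encoder(pillar_features, pillar_coords)
        feature_map = self.backbone_2d(pseudo_image)
        boxes, scores = self.ssd_head(feature_map)
        return boxes, scores
```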

2.1. Overall Algorithm Process

This algorithm takes the raw point cloud data as input and first represents the point cloud as uniformly distributed pillars: the three-dimensional point cloud is viewed directly from the top, all points are discretized into a uniform grid on the x–y plane, and each pillar is the column of a grid cell extending without limit along the z-axis.
Due to the sparsity of the point cloud, most pillars are empty, and non-empty pillars usually contain only a few points. This sparsity is exploited to create a dense tensor of size (D, P, N), where D is the feature dimension of each point, P is the number of pillars, and N is the maximum number of points per pillar. When a pillar contains more than N points, N points are selected by random sampling; when it contains fewer than N points, it is padded with zeros, as sketched below.
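The sampling-and-padding scheme can be illustrated with the following NumPy sketch; the function name and array layout are assumptions for illustration only.

```python
import numpy as np

def build_pillar_tensor(points, pillar_ids, max_pillars=12000, max_points=32):
    """Group points by pillar index and build a dense (D, P, N) tensor.

    points: (M, D) array of point features; pillar_ids: (M,) pillar index per point.
    Pillars with more than max_points points are randomly sub-sampled;
    pillars with fewer points are zero-padded (a sketch of the scheme in Section 2.1).
    """
    D = points.shape[1]
    unique_ids = np.unique(pillar_ids)[:max_pillars]    # keep at most P non-empty pillars
    tensor = np.zeros((D, len(unique_ids), max_points), dtype=np.float32)
    for p, pid in enumerate(unique_ids):
        idx = np.where(pillar_ids == pid)[0]
        if len(idx) > max_points:                        # random sampling when over-full
            idx = np.random.choice(idx, max_points, replace=False)
        tensor[:, p, :len(idx)] = points[idx].T          # zero padding fills the rest
    return tensor, unique_ids
```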
After the (D, P, N) tensor is obtained, it is fed into the improved Transformer-based feature encoding network for feature extraction. First, an MLP (Multi-Layer Perceptron) is used for position encoding and dimension raising, changing the (D, P, N) tensor into a (C, P, N) tensor, where C is the feature dimension after the increase (256). Then, based on the multi-head attention mechanism, global context features are calculated across pillars and local structure features are calculated for the points within each pillar, so that the point cloud information in every pillar carries both global context features and local structure features; in particular, to fully extract the local structure features of the point cloud, a combination of local and global position encoding is used. Next, max pooling extracts the feature point that best represents each pillar. Finally, according to the pillar indices, the features are remapped to the corresponding positions of the original grid, generating a 2D pseudo-image of size (C, H, W), where H and W are the height and width of the image. A simplified sketch of this flow is given below.
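The following simplified PyTorch sketch illustrates the flow (MLP up-dimension, per-pillar pooling, global attention across pillar tokens, and scattering back onto the pseudo-image canvas). The intra-pillar local attention is omitted, the pooling is done before the cross-pillar attention for brevity (the paper applies attention to the per-point features first), and the layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PillarFeatureEncoder(nn.Module):
    """Sketch of the encoding flow in Section 2.1 (layer sizes are illustrative)."""
    def __init__(self, in_dim, out_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())  # position encoding / up-dimension
        self.attention = nn.MultiheadAttention(out_dim, num_heads=2, batch_first=True)

    def forward(self, pillars, coords, grid_hw):
        # pillars: (P, N, D) points per pillar; coords: (P, 2) long tensor of (row, col) pillar indices
        x = self.mlp(pillars)                                   # (P, N, C)
        pillar_tokens = x.max(dim=1).values.unsqueeze(0)        # (1, P, C) one token per pillar
        global_feat, _ = self.attention(pillar_tokens, pillar_tokens, pillar_tokens)
        feat = global_feat.squeeze(0)                           # (P, C) with global context
        H, W = grid_hw
        canvas = torch.zeros(feat.shape[1], H, W)               # (C, H, W) pseudo-image
        canvas[:, coords[:, 0], coords[:, 1]] = feat.t()        # scatter by pillar index
        return canvas
```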
The generated 2D pseudo-image is fed into the 2D convolutional neural network for feature learning, and finally a detection head based on the SSD (Single Shot MultiBox Detector) [19] design performs the classification and regression of 3D object detection and generates the 3D detection boxes.

2.2. Feature Encoding Network Based on Transformer

The Transformer model is a deep learning model based on the attention mechanism, which has been widely used in natural language processing (NLP), image processing, and other fields. Its core idea is to split the input sequence into a set of vector representations, and then use the attention mechanism to learn the dependencies between positions. Through the multi-head attention mechanism, Transformer can perform more comprehensive and accurate feature extraction on point cloud data, and its application in 3D object detection tasks based on point cloud data has gradually become a trend. PCT [20], Point Transformer [21], SOE-Net [22], VoxSeT [23], FlatFormer [24] and other works have achieved good results. Therefore, it is feasible to improve the feature encoding network based on Transformer.
The feature encoding network structure of this algorithm is shown in Figure 2; it is based on an encoder–decoder architecture. The input of the feature encoding network is the point cloud information represented by the pillar distribution. A vector sequence is generated through position encoding and fed into the multi-head attention module, where each element of the input sequence interacts with the other elements and is weighted according to its relevance; this interaction is realized by computing an attention weight matrix. The result is then passed to a feedforward neural network module, which applies a further nonlinear transformation to the output of the attention layer. To prevent degradation during training, a ResNet residual connection [25] and an LN layer [26] are added (the Add&Norm module in Figure 2). The decoder differs from the encoder in that it adds a masked multi-head attention module, whose input is the predicted output of the entire feature calculation process; the final output layer converts the decoder output into a probability distribution through a linear transformation and a Softmax function to generate the prediction results. Nx denotes the number of encoder and decoder layers, that is, the number of Transformer layers; each layer processes its input independently and passes its output to the next layer.
The core of the feature encoding network is to use the multi-head attention module to calculate the global context features of the pillars and the local features of the structure inside each pillar:
(1) Global Feature Calculation:
The calculation formula of the global attention [18] is
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$  (1)
In Formula (1), Q, K, and V are the feature encodings of the point cloud pillars. First, the dot product of the Q and K matrices is computed and divided by the scale $\sqrt{d_k}$ to keep the dot product result from becoming too large, where $d_k$ is the dimension of the Q and K vectors. The Softmax function then normalizes the result into a probability distribution, which is finally multiplied by the matrix V to obtain the attention score matrix between different pillars.
The multi-head attention allows the model to simultaneously focus on information from different pillars and different locations. The representation of the multi-head attention [18] is as follows:
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$  (2)
where $W_i^{Q}, W_i^{K} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}$, $d_v = d_k$, $d_{model} = d_k \times h$, and $h$ is the number of attention heads.
All non-empty pillars perform global context feature calculation through the multi-head attention mechanism, which can add global attention to each pillar.
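As a concrete reference for Formulas (1) and (2), the following minimal PyTorch sketch computes scaled dot-product attention over pillar tokens and a multi-head variant; the per-head projection matrices are assumed to be supplied by the caller.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = Softmax(QK^T / sqrt(d_k)) V, i.e., Formula (1)."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # pillar-to-pillar affinities
    weights = torch.softmax(scores, dim=-1)            # normalized to a probability distribution
    return weights @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Formula (2): run h independent heads on projected inputs and concatenate.
    W_q, W_k, W_v are lists of per-head projection matrices; W_o is the output projection."""
    heads = [scaled_dot_product_attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
             for i in range(len(W_q))]
    return torch.cat(heads, dim=-1) @ W_o
```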
(2) Local Feature Calculation:
In the process of calculating the global features of the pillars, a local feature calculation is added: the local geometric relationship between a centre point and its adjacent points is used to aggregate local features effectively by learning attention weights. Specifically, a subtraction relation is used, and the local position information δ is added to both the attention vector γ and the feature vector α when aggregating features. The overall calculation [27] is
$g_i = \sum_{\mu_j \in \mu(i)} \rho\big(\gamma(\varphi(\mu_i) - \phi(\mu_j) + \delta)\big) \odot \big(\alpha(\mu_j) + \delta\big)$  (3)
where $\mu = \{f_i \mid i = 1, 2, \ldots, n\}$ is the set of feature vectors of the points in a pillar and $\mu(i) \subseteq \mu$ is the local neighbourhood of point $i$; $g_i$ is the feature output after adding local attention; $\varphi$, $\phi$, and $\alpha$ are point-wise feature transformation functions, similar to linear projections; $\gamma$ is the attention-generating mapping function; and $\rho$ is a normalization function. The local position information $\delta$ is calculated as
$\delta = \varepsilon(p_i - p_j)$  (4)
where $p_i$ and $p_j$ are the coordinates of the 3D points, and $\varepsilon$ is composed of two ReLU functions [28].
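A minimal sketch of the subtraction-relation local attention of Formulas (3) and (4) is given below; the neighbourhood grouping, the use of Softmax for the normalization ρ, and the exact layer structure of γ and ε are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LocalPillarAttention(nn.Module):
    """Sketch of the subtraction-relation local attention in Formula (3)."""
    def __init__(self, dim):
        super().__init__()
        self.phi = nn.Linear(dim, dim)      # feature transform of the centre point
        self.psi = nn.Linear(dim, dim)      # feature transform of the neighbour point
        self.alpha = nn.Linear(dim, dim)    # value transform
        self.gamma = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.epsilon = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))  # position encoding

    def forward(self, feats, coords, neighbor_idx):
        # feats: (N, C) point features; coords: (N, 3); neighbor_idx: (N, k) neighbour indices
        f_i, f_j = feats.unsqueeze(1), feats[neighbor_idx]        # (N, 1, C), (N, k, C)
        p_i, p_j = coords.unsqueeze(1), coords[neighbor_idx]      # (N, 1, 3), (N, k, 3)
        delta = self.epsilon(p_i - p_j)                           # relative position encoding
        weights = torch.softmax(self.gamma(self.phi(f_i) - self.psi(f_j) + delta), dim=1)
        return (weights * (self.alpha(f_j) + delta)).sum(dim=1)   # aggregated local feature g_i
```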
After the above processing, the encoded point cloud features carry both global position features and local structure features, which reduces the feature loss caused by feature encoding; the 2D pseudo-image generated afterwards is therefore more conducive to subsequent feature learning and improves object detection accuracy.

2.3. 2D Convolutional Neural Network and SSD Detection Head

After the raw point cloud has passed through the feature encoding network, a 2D pseudo-image is generated, and a 2D convolutional neural network can conveniently be used for feature learning. The structure of the 2D convolutional neural network is shown in Figure 1. The backbone consists of two sub-networks: a top-down feature extraction network and an up-sampling and feature concatenation network. The top-down sub-network acquires features at gradually decreasing spatial resolution and consists of a series of blocks, where each block is described by three parameters (S, L, F): the block contains L 3 × 3 2D convolutional layers with F output channels, and the stride of the convolutional layers is S. The up-sampling and concatenation network up-samples the features from the first sub-network and applies a BN layer [29] and the ReLU function to form the final output features, as sketched below. Using a 2D convolutional neural network avoids the slow inference of algorithms such as VoxelNet that use 3D convolutions, simplifies the model structure, reduces the amount of computation, achieves good detection accuracy, and greatly improves detection speed.
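The block structure can be sketched as follows; the specific (S, L, F) values and channel widths are illustrative and follow common PointPillars-style settings rather than the exact configuration used here.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, F, L, S):
    """One backbone block (S, L, F): L 3x3 conv layers, F output channels, first layer stride S."""
    layers = []
    for i in range(L):
        stride = S if i == 0 else 1
        layers += [nn.Conv2d(in_ch if i == 0 else F, F, kernel_size=3, stride=stride, padding=1, bias=False),
                   nn.BatchNorm2d(F), nn.ReLU()]
    return nn.Sequential(*layers)

class Backbone2D(nn.Module):
    """Sketch of the top-down + up-sampling/concatenation backbone; channel counts are illustrative."""
    def __init__(self, in_ch=256):
        super().__init__()
        self.block1 = conv_block(in_ch, 64, 4, 2)
        self.block2 = conv_block(64, 128, 6, 2)
        self.block3 = conv_block(128, 256, 6, 2)
        self.up1 = nn.Sequential(nn.ConvTranspose2d(64, 128, 1, 1), nn.BatchNorm2d(128), nn.ReLU())
        self.up2 = nn.Sequential(nn.ConvTranspose2d(128, 128, 2, 2), nn.BatchNorm2d(128), nn.ReLU())
        self.up3 = nn.Sequential(nn.ConvTranspose2d(256, 128, 4, 4), nn.BatchNorm2d(128), nn.ReLU())

    def forward(self, x):
        x1 = self.block1(x)
        x2 = self.block2(x1)
        x3 = self.block3(x2)
        # up-sample all scales to the same resolution and concatenate into the final feature map
        return torch.cat([self.up1(x1), self.up2(x2), self.up3(x3)], dim=1)
```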
The SSD detection head predicts the position, category, and orientation of 3D objects. The 2D intersection-over-union (IoU) is used to match prior boxes with ground-truth boxes; height is not used for matching but is treated as an additional regression target, because on real roads all objects can be considered to lie roughly in the same plane of three-dimensional space, the height differences between object categories are not large, and good results can be obtained by regressing height directly with the SmoothL1 function [30]. At the same time, an FPN (feature pyramid network) [31] operation is introduced in the detection head to handle objects of different sizes: by extracting features at different scales, objects of different sizes can be located more accurately. A simple sketch of the 2D IoU matching is given below.
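The 2D IoU matching that ignores height can be sketched as follows; the function names and the example positive/negative thresholds are assumptions for illustration.

```python
import numpy as np

def bev_iou(box_a, box_b):
    """Axis-aligned 2D IoU in the bird's-eye view; boxes are (x1, y1, x2, y2), height ignored."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_anchors(anchors, gt_boxes, pos_thr, neg_thr):
    """Label anchors as positive/negative by their best BEV IoU with any ground-truth box
    (thresholds are class-dependent, e.g. 0.6/0.45 for cars in PointPillars-style setups)."""
    labels = np.full(len(anchors), -1)                   # -1 = ignored during training
    for i, a in enumerate(anchors):
        best = max((bev_iou(a, g) for g in gt_boxes), default=0.0)
        if best >= pos_thr:
            labels[i] = 1                                # positive anchor
        elif best < neg_thr:
            labels[i] = 0                                # negative anchor
    return labels
```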

3. Algorithm Implementation Details

3.1. Details of Feature Encoding Network Structure Parameters

The cross-section of each pillar is a square with a side length of 0.16 m. In the actual feature encoding, only the front-view part of the scene is kept for generating the pseudo-image, because the ground-truth labels of the KITTI dataset are annotated only in the images captured by the front camera; therefore, points of the raw point cloud in the negative direction of the x-axis are discarded, and points that are too far away are removed. Following the original algorithm, the minimum and maximum values of (x, y, z) in the point cloud space are min: (0, −39.68, −3) and max: (69.12, 39.68, 1), in metres; the maximum number of pillars P is 12,000, and the maximum number of points sampled in each pillar, N, is set to 32. The number of Transformer layers is 4, the number of heads is 2, and a 2-layer learnable MLP is used for position encoding. These settings are summarized in the sketch below.
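For reference, these settings can be collected into a single configuration sketch; the key names below are hypothetical and do not correspond to actual OpenPCDet configuration fields.

```python
# Illustrative summary of the feature-encoding parameters from Section 3.1
# (key names are hypothetical, not the actual OpenPCDet config keys).
TGPP_ENCODER_CONFIG = {
    "pillar_size_xy": (0.16, 0.16),            # square pillar cross-section, metres
    "point_cloud_range": (0.0, -39.68, -3.0,   # (x_min, y_min, z_min,
                          69.12, 39.68, 1.0),  #  x_max, y_max, z_max), metres
    "max_pillars": 12000,                      # P
    "max_points_per_pillar": 32,               # N
    "feature_dim": 256,                        # C after the MLP up-dimension
    "transformer_layers": 4,
    "attention_heads": 2,
    "position_encoding_mlp_layers": 2,
}
```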

3.2. Loss Calculation

This article uses the same loss calculation as the original algorithm. Each ground-truth box contains 7 parameters (x, y, z, w, l, h, θ), where (x, y, z) is the three-dimensional coordinate of the object centre, (w, l, h) is the width, length, and height of the box, and θ is the rotation angle. The regression residuals of the localization task between the prior box and the ground-truth box are defined as
$\Delta x = \dfrac{x^{gt} - x^{a}}{d^{a}},\quad \Delta y = \dfrac{y^{gt} - y^{a}}{d^{a}},\quad \Delta z = \dfrac{z^{gt} - z^{a}}{d^{a}},\quad \Delta w = \log\dfrac{w^{gt}}{w^{a}},\quad \Delta l = \log\dfrac{l^{gt}}{l^{a}},\quad \Delta h = \log\dfrac{h^{gt}}{h^{a}},\quad \Delta\theta = \sin(\theta^{gt} - \theta^{a})$  (5)
where $x^{gt}$ is the x value of the ground-truth box, $x^{a}$ is the x value of the prior box, and y, z, w, l, h, θ are defined analogously; $d^{a}$ is the diagonal of the prior box in the ground plane, defined as $d^{a} = \sqrt{(w^{a})^2 + (l^{a})^2}$. The total localization loss is
$L_{loc} = \sum_{b \in (x, y, z, w, l, h, \theta)} \mathrm{SmoothL1}(\Delta b)$  (6)
Since angle regression cannot completely distinguish two prior boxes with opposite orientations, a direction classification is added to the prior boxes. The direction classification loss uses the Softmax function and is denoted $L_{dir}$ [11]. The object classification loss uses Focal Loss [32]:
$L_{cls} = -\lambda^{a}(1 - p^{a})^{r}\log p^{a}$  (7)
where $p^{a}$ is the probability that the predicted prior box belongs to the positive class, λ = 0.25, and r = 2. The total loss function is
$L = \dfrac{1}{N_{pos}}\left(\beta_{loc}L_{loc} + \beta_{cls}L_{cls} + \beta_{dir}L_{dir}\right)$  (8)
where $N_{pos}$ is the number of positive prior boxes. Following the SECOND algorithm, the weights are set to $\beta_{loc} = 2$, $\beta_{cls} = 1$, and $\beta_{dir} = 0.2$. A sketch of the whole loss computation is given below.
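A compact sketch of Formulas (5)-(8) follows; it is a simplification that assumes the residuals, positive-class probabilities, and direction logits have already been gathered, and the helper names are hypothetical.

```python
import torch
import torch.nn.functional as F

def box_residuals(gt, anchor):
    """Encode the regression targets of Formula (5); gt and anchor are 1-D tensors (x, y, z, w, l, h, theta)."""
    xg, yg, zg, wg, lg, hg, tg = gt
    xa, ya, za, wa, la, ha, ta = anchor
    da = (wa ** 2 + la ** 2) ** 0.5                        # diagonal of the prior box
    return torch.stack([(xg - xa) / da, (yg - ya) / da, (zg - za) / da,
                        torch.log(wg / wa), torch.log(lg / la), torch.log(hg / ha),
                        torch.sin(tg - ta)])

def total_loss(pred_res, target_res, p_pos, dir_logits, dir_labels,
               n_pos, lam=0.25, gamma=2.0, b_loc=2.0, b_cls=1.0, b_dir=0.2):
    """Sketch of Formulas (6)-(8): Smooth-L1 localization, focal classification, direction loss."""
    loc = F.smooth_l1_loss(pred_res, target_res, reduction="sum")         # L_loc
    cls = (-lam * (1 - p_pos) ** gamma * torch.log(p_pos + 1e-9)).sum()   # focal loss L_cls
    direction = F.cross_entropy(dir_logits, dir_labels, reduction="sum")  # L_dir (softmax)
    return (b_loc * loc + b_cls * cls + b_dir * direction) / max(n_pos, 1)
```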

4. Testing Results

4.1. KITTI Dataset Division

The model is trained and tested on KITTI's 3D object detection dataset [33], which consists of lidar point clouds and image samples. Training uses only the lidar point clouds, while the combination of lidar point clouds and images is used to compare the prior boxes with the ground-truth labels. The dataset contains 7481 training samples. For ease of comparison, the same split as the PointPillars algorithm is used: the samples are divided into 3712 training samples and 3769 testing samples.

4.2. Experiment Analysis

4.2.1. Model Training

The computer environment used for training and testing is the Ubuntu 20.04 system, with an Intel® Core™ i9-9900 CPU @ 3.10 GHz × 16, an Nvidia A40 graphics card, and 48 GB of video memory. TGPP is implemented on top of the PointPillars model in the OpenPCDet framework and written in Python 3.8.
OpenPCDet is an open-source point cloud object detection library based on PyTorch. The PointPillars implementation in this framework adopts more advanced data augmentation, optimizers, learning-rate strategies, and other techniques to optimize the model, so the trained model has better detection accuracy.
The optimizer used during training is Adam_onecycle, with a maximum learning rate of 0.002; a simplified stand-in is sketched below. This algorithm and PointPillars are trained under the same conditions. The loss curves before and after the improvement are shown in Figure 3, from which it can be seen that TGPP has a stronger feature learning ability.
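As an illustration of the training setup, the following sketch builds an Adam optimizer with a one-cycle learning-rate schedule using standard PyTorch components; the actual Adam_onecycle implementation in OpenPCDet differs in detail, and the epoch and step counts shown are placeholders.

```python
import torch

def build_optimizer(model, max_lr=0.002, epochs=120, steps_per_epoch=1000):
    # Adam with a one-cycle learning-rate schedule peaking at max_lr;
    # epochs and steps_per_epoch are placeholders, not the values used in the paper.
    optimizer = torch.optim.Adam(model.parameters(), lr=max_lr)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=max_lr, epochs=epochs, steps_per_epoch=steps_per_epoch)
    return optimizer, scheduler
```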

4.2.2. Model Testing

The trained model is tested on the KITTI 3D object detection testing set. The test scenarios are divided into three difficulty levels: easy, moderate, and hard. The test mainly uses the average precision (AP) of 3D object detection as the evaluation metric. During testing, vehicle detection uses the IoU = 0.7 criterion, and pedestrian and cyclist detection use the IoU = 0.5 criterion.
(1) Comparison with PointPillars:
The test results of this algorithm and the PointPillars algorithm are shown in Table 1. Compared with PointPillars, this method improves the 3D object detection performance for vehicles, pedestrians, and cyclists. The vehicle detection AP in the three difficulty levels increases by 2.68%, 1.84%, and 2.62%, respectively; the pedestrian detection AP increases by 4.84%, 3.97%, and 3.42%; and the cyclist detection AP increases by 1.41%, 2.12%, and 2.24%. To evaluate the overall detection performance, the mAP over vehicles, pedestrians, and cyclists at moderate difficulty is calculated: TGPP reaches 63.56% mAP and PointPillars 60.92%. TGPP thus improves the mAP on the testing set by 2.64%, which corresponds to a relative improvement of about 4.3% over PointPillars.
In terms of detection speed, PointPillars processes one frame of point cloud data in an average of only 16 ms. The detection speed of TGPP is lower, with an average time of 21 ms per frame, or about 47 Hz. Since the scanning frequency of a vehicle-mounted lidar is usually 10–20 Hz, this method still meets the real-time detection requirement.
(2) Comparison with Other 3D Object Detection Methods:
Table 2 compares this method with strong methods from recent years. The 3D detection performance of the other methods is taken from their own papers; for those that did not report detection speed, the values are taken from KITTI's 3D object detection leaderboard. As the table shows, compared with commonly used methods based on the fusion of image and point cloud data, such as MV3D [34], RoarNet [35], AVOD-FPN [36], and F-PointNet [37], this method has clear advantages in both speed and detection AP. Among lidar-only methods, it also has advantages over voxel-based methods: compared with VoxelNet, SECOND, TANet [38], and PSA-Det3D [39], the mAP is 14.51%, 7.17%, 2.93%, and 2.43% higher, respectively. Point-based methods usually have higher detection accuracy, but this method still compares favourably: its mAP is 4.51% and 2.85% higher than PointRCNN and STD [40], respectively. At the same time, this method is faster than all the methods mentioned above.
In summary, this method retains the speed advantage of the PointPillars algorithm while surpassing the current mainstream methods in detection accuracy, which demonstrates that the feature encoding network improvement proposed in this paper is feasible and practical.

4.3. Comparison of Actual Road Environment Test Results

We use this method and the original method to test object detection in the same road environments, as shown in Figure 4. In Figure 4 (scenario a), the original method has a higher false detection rate, recognizing many non-object point clouds as vehicles and cyclists; in Figure 4 (scenario b), the original method has a higher false detection rate for pedestrians, recognizing non-pedestrian point clouds as pedestrians. This shows that the detection accuracy of this method is better than that of the original method in real road environments.

5. Ablation Experiments

To verify the effectiveness of the improved Transformer-based feature encoding network, an ablation experiment is performed. The hyperparameters of the feature encoding network are the number of Transformer layers and the number of heads; these two parameters are varied to observe their impact on detection performance. To avoid the influence of random seeds, each parameter setting is trained five times. For convenience, the number of epochs is set to 120, and the average mAP at moderate difficulty is used as the evaluation metric. The results are shown in Table 3. When the numbers of layers and heads are small, the detection performance is worse than PointPillars; increasing them within a certain range improves performance, with the best result at 4 layers and 2 heads; increasing them further degrades detection performance.

6. Conclusions

In this paper, an improved Transformer-based feature encoding network for the PointPillars 3D object detection algorithm is proposed; the improved algorithm is named TGPP. The new feature encoding network uses a multi-head attention mechanism to extract global context features and local structure features from the pillars, which improves the feature extraction ability of the original algorithm during feature encoding and reduces feature loss. Experimental results show that TGPP achieves better object detection performance than PointPillars, with an average detection accuracy on the KITTI testing set that is 2.64% higher, and it is also competitive with other recent methods.

Author Contributions

Conceptualization, L.Z. and H.M.; methodology, H.M.; software, H.M.; validation, L.Z., H.M., Y.Y. and X.X.; formal analysis, L.Z. and H.M.; resources, Y.Y. and X.X.; writing—original draft preparation, L.Z. and H.M.; writing—review and editing, Y.Y. and X.X.; project administration, L.Z. and Y.Y.; funding acquisition, L.Z. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China [51975428], “Chunhui Plan” Cooperative Scientific Research Project of the Education Department of China [HZKY20220330] and Guidance Project of Scientific Research Plan of the Education Department of Hubei Province of China [B2022027].

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Himmelsbach, M.; Mueller, A.; Lüttel, T.; Wünsche, H.-J. LIDAR-based 3D object perception. In Proceedings of the 1st International Workshop on Cognition for Technical Systems, Munich, Germany, 6–7 October 2008. [Google Scholar]
  2. Xie, D.; Xu, Y.; Wang, R.; Su, Z. Obstacle Detection and Tracking for Unmanned Vehicles Based on 3D Laser Radar. Automot. Eng. 2018, 40, 952–959. [Google Scholar] [CrossRef]
  3. Xia, X.; Zhu, S.; Zhou, Y.; Ye, M.; Zhao, Y. LiDAR K-means Clustering Algorithm Based on Threshold. J. Beijing Univ. Aeronaut. Astronaut. 2020, 46, 115–121. [Google Scholar] [CrossRef]
  4. Zong, C.; Wen, L.; He, L. Object Detection Based on Euclidean Clustering Algorithm with 3D Laser Scanner. J. Jilin Univ. Eng. Technol. Ed. 2020, 50, 107–113. [Google Scholar] [CrossRef]
  5. Ning, X.; Gong, L.; Zhang, J. Detection Method of Passable Road Areas Based on Laser Point Clouds. Comput. Eng. 2022, 48, 22–29. [Google Scholar] [CrossRef]
  6. Qian, R.; Lai, X.; Li, X. 3D Object Detection for Autonomous Driving: A Survey. Pattern Recognit. 2021, 130, 108796. [Google Scholar] [CrossRef]
  7. Qi, C.; Su, H.; Mo, K.; Guibas, L. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 652–660. [Google Scholar]
  8. Qi, C.; Yi, L.; Su, H.; Guibas, L. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the NIPS, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  9. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 770–779. [Google Scholar]
  10. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  11. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [Google Scholar]
  12. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection From Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
  13. Li, Y.; Chen, Y.; Qi, X.; Li, Z.; Sun, J.; Jia, J. Unifying voxel-based representation with transformer for 3d object detection. Adv. Neural Inf. Process. Syst. 2022, 35, 18442–18455. [Google Scholar]
  14. Lai, X.; Chen, Y.; Lu, F.; Liu, J.; Jia, J. Spherical transformer for lidar-based 3d recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 17545–17555. [Google Scholar]
  15. He, X.; Yan, A.; Chen, L.; Hou, P.; Dong, D.; Ma, Y. An Improved PointPillars for Fast and Accurate 3D Object Detection. In Proceedings of the 2021 Unmanned Systems Summit Forum (USS 2021), Changsha, China, 23–24 September 2021; pp. 115–120. [Google Scholar]
  16. Chen, D.; Yu, W.; Gao, Y. Lidar 3D Object Detection Based on Improved Point Pillars. Laser Optoelectron. Prog. 2023, 60, 447–453. [Google Scholar]
  17. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  18. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the NIPS, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  19. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.E.; Fu, C.-Y.; Berg, A. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  20. Guo, M.-H.; Cai, J.; Liu, Z.-N.; Mu, T.-J.; Martin, R.R.; Hu, S. PCT: Point cloud transformer. Comput. Vis. Media 2020, 7, 187–199. [Google Scholar] [CrossRef]
  21. Engel, N.; Belagiannis, V.; Dietmayer, K. Point Transformer. IEEE Access 2020, 9, 16259–16268. [Google Scholar] [CrossRef]
  22. Xia, Y.; Xu, Y.; Li, S.; Wang, R.; Du, J.; Cremers, D.; Stilla, U. SOE-Net: A self-attention and orientation encoding network for point cloud based place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11348–11357. [Google Scholar]
  23. He, C.; Li, R.; Li, S.; Zhang, L. Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8417–8427. [Google Scholar]
  24. Liu, Z.; Yang, X.; Tang, H.; Yang, S.; Han, S. FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 1200–1211. [Google Scholar]
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  26. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  27. Qiu, S.; Wu, Y.; Anwar, S.; Li, C. Investigating Attention Mechanism in 3D Point Cloud Object Detection. In Proceedings of the 2021 International Conference on 3D Vision (3DV), Prague, Czech Republic, 12–15 September 2022; pp. 403–412. [Google Scholar]
  28. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323. [Google Scholar]
  29. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  30. Girshick, R.B. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1440–1448. [Google Scholar]
  31. Lin, T.-Y.; Dollár, P.; Girshick, R.B.; He, K.; Hariharan, B.; Belongie, S.J. Feature Pyramid Networks for Object Detection. In Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2117–2125. [Google Scholar]
  32. Lin, T.-Y.; Goyal, P.; Girshick, R.B.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  33. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  34. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D Object Detection Network for Autonomous Driving. In Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1907–1915. [Google Scholar]
  35. Shin, K.; Kwon, Y.; Tomizuka, M. RoarNet: A Robust 3D Object Detection based on RegiOn Approximation Refinement. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 2510–2515. [Google Scholar]
  36. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D Proposal Generation and Object Detection from View Aggregation. In Proceedings of the IEEE/RJS International Conference on Intelligent Robots and Systems, Vancouver, BC, Canada, 24–28 September 2017; pp. 1–8. [Google Scholar]
  37. Qi, C.; Liu, W.; Wu, C.; Su, H.; Guibas, L. Frustum PointNets for 3D Object Detection from RGB-D Data. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 918–927. [Google Scholar]
  38. Liu, Z.; Zhao, X.; Huang, T.; Hu, R.; Zhou, Y.; Bai, X. TANet: Robust 3D Object Detection from Point Clouds with Triple Attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 11677–11684. [Google Scholar]
  39. Huang, Z.; Zhao, J.; Zheng, Z.; Chena, D.; Hu, H. PSA-Det3D: Pillar Set Abstraction for 3D object Detection. Pattern Recognit. Lett. 2022, 168, 138–145. [Google Scholar] [CrossRef]
  40. Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. STD: Sparse-to-Dense 3D Object Detector for Point Cloud. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1951–1960. [Google Scholar]
Figure 1. TGPP algorithm network structure.
Figure 2. TGPP algorithm feature-encoded network structure.
Figure 3. Loss change curve.
Figure 4. Object detection results in the actual road environment. Car: green bounding boxes; pedestrian: blue bounding boxes; cyclist: purple bounding boxes. The red box displays the difference in detection performance between TGPP and PointPillars. By comparing the object in the red box, it can be found that TGPP has better detection performance.
Table 1. Comparison of average precision of 3D object detection (%).

Method       | Runtime (ms) | Car (IoU = 0.7)      | Pedestrian (IoU = 0.5) | Cyclist (IoU = 0.5)  | mAP Mod.
             |              | Easy   Mod.   Hard   | Easy   Mod.   Hard     | Easy   Mod.   Hard   |
PointPillars | 16           | 85.06  76.05  72.03  | 52.08  45.88  41.67    | 78.64  60.83  57.43  | 60.92
TGPP         | 21           | 87.74  77.89  74.65  | 56.92  49.85  45.09    | 80.05  62.95  59.67  | 63.56
Table 2. Comparison of 3D object detection accuracy with other methods (%).

Method     | Runtime (ms) | Car (IoU = 0.7)      | Pedestrian (IoU = 0.5) | Cyclist (IoU = 0.5)  | mAP Mod.
           |              | Easy   Mod.   Hard   | Easy   Mod.   Hard     | Easy   Mod.   Hard   |
Lidar & Img.
MV3D       | 360          | 71.29  62.68  56.56  | -      -      -        | -      -      -      | -
RoarNet    | 100          | 83.71  73.04  59.16  | -      -      -        | -      -      -      | -
AVOD-FPN   | 100          | 81.94  71.88  66.38  | 50.80  42.81  40.88    | 64.00  52.18  46.61  | 55.62
F-PointNet | 169          | 81.20  70.39  62.19  | 51.21  44.89  40.23    | 71.96  56.77  50.39  | 57.35
Only Lidar, voxel-based
VoxelNet   | 220          | 77.47  65.11  57.73  | 39.48  33.69  31.50    | 61.22  48.36  44.37  | 49.05
SECOND     | 50           | 83.13  73.66  66.20  | 51.07  42.56  37.29    | 70.51  53.85  46.90  | 56.39
TANet      | 35           | 83.81  75.38  67.66  | 54.92  46.67  42.42    | 73.84  59.86  53.46  | 60.63
PSA-Det3D  | 80           | 87.46  78.80  74.47  | 49.72  42.81  39.58    | 75.82  61.79  55.12  | 61.13
Only Lidar, point-based
PointRCNN  | 100          | 85.94  75.76  68.32  | 49.43  41.78  38.63    | 73.93  59.60  53.59  | 59.05
STD        | 80           | 86.61  77.63  76.06  | 53.08  44.24  41.97    | 78.89  62.53  55.77  | 60.71
TGPP       | 21           | 87.74  77.89  74.65  | 56.92  49.85  45.09    | 80.05  62.95  59.67  | 63.56
Table 3. Results of ablation experiments.

Group            | Transformer Layers | Transformer Heads | Epoch | Training Time | mAP Mod.
1 (PointPillars) | -                  | -                 | 120   | 5             | 60.42
2 (TGPP)         | 1                  | 1                 | 120   | 5             | 59.76
3 (TGPP)         | 2                  | 1                 | 120   | 5             | 59.95
4 (TGPP)         | 2                  | 2                 | 120   | 5             | 61.34
5 (TGPP)         | 4                  | 1                 | 120   | 5             | 62.26
6 (TGPP)         | 4                  | 2                 | 120   | 5             | 63.32
7 (TGPP)         | 6                  | 1                 | 120   | 5             | 62.81
8 (TGPP)         | 6                  | 2                 | 120   | 5             | 62.94
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
