Article

An Efficient Point Cloud Semantic Segmentation Method Based on Bilateral Enhancement and Random Sampling

1 School of Mechatronical Engineering, Beijing Institute of Technology, Beijing 100081, China
2 School of Electrical and Control Engineering, Shenyang Jianzhu University, Shenyang 110168, China
3 School of Software Engineering, Xi’an Jiaotong University, Xi’an 710049, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(24), 4927; https://doi.org/10.3390/electronics12244927
Submission received: 6 November 2023 / Revised: 1 December 2023 / Accepted: 5 December 2023 / Published: 7 December 2023

Abstract

Point cloud semantic segmentation is of utmost importance in practical applications. However, most existing methods have become increasingly intricate, and this growing complexity makes them impractical for real-world use: their efficiency and ease of implementation deteriorate, leaving them ill-suited to time-sensitive and resource-constrained environments. To address this issue, we propose an efficient and lightweight segmentation method that achieves remarkable performance in segmentation accuracy, training speed, and space consumption. Specifically, we adopt random sampling in place of farthest-point sampling to improve efficiency. Moreover, a lightweight decoding module and an improved bilateral enhancement module are developed to further improve performance. The proposed method achieved 73.6% and 60.7% mIoU on the S3DIS and SemanticKITTI datasets, respectively. In the future, random sampling and the proposed bilateral enhancement module can be adopted in an even more concise and lightweight network to achieve faster and more accurate point cloud segmentation.

1. Introduction

Owing to the rapid advancement of 3D data-acquisition technology, various types of 3D scanners, such as LiDAR scanners and RGB-D cameras, are becoming increasingly prevalent in our day-to-day lives [1,2,3]. These scanners capture vast amounts of data, enabling AI-powered machines to perceive and recognize the world more accurately. Among these data, point clouds serve as the primary representation. Moreover, the development of deep learning places increasing emphasis on practical applications [4]. Among its various applications, point cloud segmentation [5,6] stands as a pivotal step in point cloud data processing.
In recent years, point cloud semantic segmentation has emerged as a hot research topic due to its immense potential in sectors like autonomous driving [7], virtual reality [8], and robotics [9]. Existing studies mainly focus on enhancing the accuracy of segmentation [10,11,12], which has led to an increase in model complexity. However, this has given rise to another issue. It is widely known that point clouds are scattered, irregular, disordered, and unevenly distributed in 3D spaces [13]. Meanwhile, point cloud data collected from the real-world environment usually consist of large scenes with millions or even billions of points. The complexity of the model and the intrinsic properties of point cloud data result in prolonged training times and extremely high memory usage. Hence, we address this issue by proposing a method that can achieve remarkable results in both segmentation accuracy and efficiency.
Point cloud segmentation is primarily achieved through deep learning. PointNet [14] was the first deep learning network to take raw 3D point clouds directly as input and produce segmentation results, representing a groundbreaking advance. PointNet++ [15] adopts a hierarchical, multi-scale receptive field inspired by CNNs to address some limitations of PointNet, which increases point cloud segmentation accuracy. However, due to constraints on the input size of point clouds, these methods are unable to sustain the same exceptional performance in large-scale semantic segmentation tasks. To effectively manage millions of points, the superpoint graph (SPG) [16] was proposed, representing a point cloud as a graph of superpoints and employing a graph convolutional neural network to extract contextual features. Subsequently, RandLA-Net [17] adopted the simple and efficient random sampling (RS) strategy for downsampling point clouds, achieving favorable results in terms of both time and space complexity. BAAF-Net [13] introduced a bilateral structure to efficiently handle multiresolution point clouds and utilized adaptive fusion in the decoding module to represent point-pair features more comprehensively and efficiently.
As existing point cloud semantic segmentation methods become more and more accurate, the cost is that the networks become increasingly complex, which deviates from the needs of practical applications and reduces their practical value [18]. Therefore, in this work, we design a lightweight network to achieve faster speed. In addition, because the volume of point cloud data will inevitably continue to grow, the burden on hardware will become heavier; in response to this problem, we design a network with fewer parameters. Specifically, our contributions towards the above issues are listed as follows:
  • A more-effective bilateral enhancement module is proposed to address the problem of missing key points caused by random sampling. This module applies a modified sampling method and enhances the algorithm to detect key points more accurately. Additionally, it adaptively samples data based on the characteristics of the input data to better fit different datasets.
  • A two-branch structure that jointly applies adaptive pooling and mean pooling is proposed to further enhance the extraction of key features. This structure splits the input data into two branches, each employing a different pooling method to extract features. This design allows for a more comprehensive capture of the various features in the data, enhancing the richness and diversity of the extracted features.
  • A lightweight decoding module is proposed to further reduce the time and space complexity of the model. This module employs a series of compact techniques to quickly process the input data and fully extract features. Additionally, it utilizes lightweight network structures and optimization techniques to further reduce parameter counts and computational requirements, allowing the model to operate in resource-constrained environments.

2. Methodology

As mentioned before, point cloud data are severely scattered, irregular, disordered, and unevenly distributed. Moreover, point cloud datasets collected from the real world typically consist of large scenes with millions or even billions of points. These issues mean that existing point cloud methods are often accompanied by long training times and low efficiency. Therefore, as shown in Figure 1, we propose to adopt random sampling (see Section 2.1) to address this issue. Additionally, to achieve better segmentation accuracy while maintaining high training efficiency and low memory consumption, a bilateral enhancement structure (see Section 2.2) combined with the CBAM attention mechanism [19] and a two-branch structure with attentive pooling and mean pooling are designed in the encoding module. Lastly, a lightweight multiresolution decoding module is designed to further reduce the time and space complexity (see Section 2.3).

2.1. Random Sampling

Currently, the majority of point cloud segmentation algorithms utilize farthest-point sampling, which selects, at each step, the point that is farthest from the k − 1 points obtained in the previous sampling steps. However, this sampling operation has a time complexity of O(M²N) [20,21], which results in high computational complexity and low efficiency. In contrast, random sampling uniformly selects M points from the original N points, with each point selected with equal probability. This method is independent of the total number of input points and therefore has a time complexity of O(M). Random sampling has two advantages: firstly, it is highly computationally efficient because it is independent of the total number of input points; secondly, it does not require additional memory. However, random sampling inevitably has some drawbacks. In particular, some significant point features may be discarded by chance, especially in sparse areas, leading to degraded segmentation performance. A small comparison of the two strategies is sketched below.
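To make the comparison concrete, the NumPy sketch below contrasts a naive farthest-point sampling loop with uniform random sampling; the function names and the synthetic point cloud are illustrative, not part of the released code.

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Naive FPS: iteratively pick the point farthest from the set of
    already-selected points. Cost grows with both m and the cloud size n."""
    n = points.shape[0]
    selected = [np.random.randint(n)]            # arbitrary seed point
    dist = np.full(n, np.inf)
    for _ in range(m - 1):
        # distance of every point to the newest selected point
        d = np.linalg.norm(points - points[selected[-1]], axis=1)
        dist = np.minimum(dist, d)               # distance to the selected set
        selected.append(int(np.argmax(dist)))    # farthest remaining point
    return points[selected]

def random_sampling(points, m):
    """RS: uniformly pick m indices; cost is independent of n."""
    idx = np.random.choice(points.shape[0], m, replace=False)
    return points[idx]

cloud = np.random.rand(100_000, 3).astype(np.float32)  # synthetic point cloud
sub_fps = farthest_point_sampling(cloud, 1024)          # slow for large clouds
sub_rs = random_sampling(cloud, 1024)                    # near-instant
```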

2.2. Improved Bilateral Enhancement

To address the problems arising from random sampling, a modified bilateral enhancement module is employed to balance the model’s efficiency and effectiveness. The input comprises geometric information P and semantic information F, as shown in Figure 2. The semantic information is first refined by the CBAM attention mechanism; neighboring features are then gathered via the KNN algorithm and integrated to generate the local semantic context. For the geometric information, neighbors of P are likewise identified with the KNN algorithm, and the absolute position of P and the relative positions of its neighbors are merged into the local geometric context. Additionally, an offset is obtained from the local semantic context through a simple MLP operation, and the rich semantic information is used to enrich the local geometric information. The enhanced local geometric information is then employed to enhance the local semantic information in turn, and the enhanced local semantic information is integrated with the geometric information to acquire the enhanced local context. Finally, this local context alone may not be sufficient to segment boundary points that share similar neighborhoods. To address this issue, a generalized view of the neighborhood is obtained by attentive pooling, while more neighborhood details are obtained via average pooling. The two branches are then combined to obtain the final output features, which serve as the input for the decoding module.
Figure 2. The bilateral enhancement module. GI refers to geometric information; SI refers to semantic information; C_{GI_i} is the local geometric context of GI_i and its neighbors obtained by the 3D-KNN algorithm; similarly, C_{SI_i} is the local semantic context. SI_j and GI_j are the offsets; the enhanced C_{GI_i} and C_{SI_i} are the enhanced local geometric and semantic contexts. Mean pooling is the weighted average. The final output P_i is the input of the decoding module.
Since the raw point cloud data contain three-dimensional coordinate (XYZ) information and RGB information, the proposed method first treats the three-dimensional coordinates as separate geometric information, denoted as G. In addition, the feature information obtained through a feature extractor is treated as preliminary semantic information, resulting in two branches of information. Owing to the flexibility of MLPs in representing features in three-dimensional space, the feature extractor applies a 1 × 1 convolutional layer followed by batch normalization and a ReLU activation function to obtain the preliminary semantic features.
Afterwards, an attention module is applied to the preliminary semantic features to learn the importance of channel and spatial features. The attention module is based on the CBAM attention mechanism [19] and consists of two attention mechanisms connected in sequence: the input features undergo channel attention and then spatial attention, producing a refined feature map whose spatial and channel dimensions are consistent with those of the input. A minimal sketch of this step is given below.
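For concreteness, the following NumPy sketch shows a CBAM-style sequential channel/spatial attention adapted to an N × C per-point feature map; the reduction ratio and the single-projection spatial branch are simplifying assumptions rather than the exact configuration used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam_style_attention(feats, w1, w2, w_sp):
    """feats: (N, C) per-point features.
    w1: (C, C // r) and w2: (C // r, C) form the shared MLP of the channel branch.
    w_sp: (2, 1) projection of the spatial branch."""
    # --- channel attention: squeeze over the point dimension ---
    avg_c = feats.mean(axis=0)                    # (C,)
    max_c = feats.max(axis=0)                     # (C,)
    mlp = lambda v: np.maximum(v @ w1, 0) @ w2    # shared two-layer MLP
    ch_w = sigmoid(mlp(avg_c) + mlp(max_c))       # (C,) channel weights
    feats = feats * ch_w                          # reweight channels

    # --- spatial attention: squeeze over the channel dimension ---
    avg_s = feats.mean(axis=1, keepdims=True)     # (N, 1)
    max_s = feats.max(axis=1, keepdims=True)      # (N, 1)
    sp_w = sigmoid(np.concatenate([avg_s, max_s], axis=1) @ w_sp)  # (N, 1)
    return feats * sp_w                           # reweight points

N, C, r = 1024, 32, 4                             # illustrative sizes
rng = np.random.default_rng(0)
refined = cbam_style_attention(rng.normal(size=(N, C)),
                               rng.normal(size=(C, C // r)),
                               rng.normal(size=(C // r, C)),
                               rng.normal(size=(2, 1)))
```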
After the attention module acts on the semantic information S, for each center point s_i, KNN is used under the three-dimensional Euclidean distance to find its neighbors s_j ∈ N_i(s_i), where N_i(s_i) denotes the neighborhood of s_i, and the corresponding geometric point is denoted g_i. In order to obtain global and local information about the neighborhood simultaneously, the absolute and relative positions of the center point are combined into a local context F_l. Thus, F_l(s_i) = [s_i; s_i − s_j] represents the local semantic context, and F_l(g_i) = [g_i; g_i − g_j] represents the local geometric context in three-dimensional space.
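To make the neighbor gathering concrete, the following NumPy sketch builds F_l for both branches with a brute-force 3D KNN; the point counts, feature dimension, and variable names are illustrative assumptions.

```python
import numpy as np

def knn_indices(coords, k):
    """Brute-force 3D KNN: for every point, return the indices of its
    k nearest neighbours (including itself) under the Euclidean distance."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)   # (N, N)
    return np.argsort(d2, axis=1)[:, :k]                             # (N, k)

def local_context(center_feats, neigh_idx):
    """F_l = [center ; center - neighbour] for every neighbour."""
    k = neigh_idx.shape[1]
    neigh = center_feats[neigh_idx]                                  # (N, k, D)
    center = np.repeat(center_feats[:, None, :], k, axis=1)          # (N, k, D)
    return np.concatenate([center, center - neigh], axis=-1)         # (N, k, 2D)

coords = np.random.rand(2048, 3)       # geometric information g_i
sem = np.random.rand(2048, 16)         # semantic information s_i (after attention)
idx = knn_indices(coords, k=16)
F_g = local_context(coords, idx)       # local geometric context, shape (N, 16, 6)
F_s = local_context(sem, idx)          # local semantic context, shape (N, 16, 32)
```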
However, because F_l(s_i) and F_l(g_i) only use absolute distances to find and aggregate neighbor points, they may lead to insufficient generalization ability and redundant representations at neighborhood boundary positions. Therefore, to solve these problems and enhance the generalization ability of the features, bilateral offsets are added to enhance the local context. Firstly, based on the rich semantic information in F_l(s_i), the local geometric context F_l(g_i) is enhanced by applying an MLP to F_l(s_i) to estimate the offsets of the neighbors g_j ∈ N_i(g_i). This process is represented by Equation (1):
$$\tilde{g}_j = \mathcal{M}\left(F_l(s_i)\right) + g_j, \qquad \tilde{g}_j \in \mathbb{R}^3. \tag{1}$$
Then, $\tilde{F}_l(g_i) = [\,g_i;\; g_i - \tilde{g}_j;\; \tilde{g}_j\,]$ is used to represent the enhanced local geometric context, where $\tilde{F}_l(g_i) \in \mathbb{R}^{k \times 9}$ and k represents the number of neighbors.
Similarly, an offset $\tilde{s}_j$ for the neighbor features can be obtained from the geometric information, so the enhanced local geometric context is in turn used to enhance the local semantic context. This process is represented by Equation (2):
$$\tilde{s}_j = \mathcal{M}\left(\tilde{F}_l(g_i)\right) + s_j, \qquad \tilde{s}_j \in \mathbb{R}^d, \tag{2}$$
where d denotes the feature dimension, $\tilde{F}_l(s_i) = [\,s_i;\; s_i - \tilde{s}_j;\; \tilde{s}_j\,]$ denotes the enhanced local semantic context, and $\tilde{F}_l(s_i) \in \mathbb{R}^{k \times 3d}$.
Finally, after applying an MLP to $\tilde{F}_l(s_i)$ and $\tilde{F}_l(g_i)$, the two are concatenated to obtain the enhanced local context, as shown in Equation (3):
$$F_i = \mathrm{concat}\left(\mathcal{M}(\tilde{F}_l(g_i)),\; \mathcal{M}(\tilde{F}_l(s_i))\right), \qquad F_i \in \mathbb{R}^{k \times d}. \tag{3}$$
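Under the same illustrative shapes as the earlier sketch, Equations (1)–(3) can be sketched as follows; the MLPs $\mathcal{M}$ are stand-ins implemented as single random linear maps, and taking the absolute-position slice from the context tensors is purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k, d = 2048, 16, 16
F_g = rng.normal(size=(N, k, 6))           # [g_i ; g_i - g_j]
F_s = rng.normal(size=(N, k, 2 * d))       # [s_i ; s_i - s_j]
g_nb = rng.normal(size=(N, k, 3))          # neighbour coordinates g_j
s_nb = rng.normal(size=(N, k, d))          # neighbour features s_j

# toy stand-in for an MLP: a single random linear projection
mlp = lambda x, out_dim: x @ rng.normal(size=(x.shape[-1], out_dim))

# Eq. (1): the semantic context predicts a geometric offset for each neighbour
g_shift = mlp(F_s, 3) + g_nb                                        # (N, k, 3)
g_center = F_g[:, :, :3]                   # absolute positions g_i per neighbour
F_g_enh = np.concatenate([g_center, g_center - g_shift, g_shift], -1)  # (N, k, 9)

# Eq. (2): the enhanced geometric context predicts a semantic offset
s_shift = mlp(F_g_enh, d) + s_nb                                    # (N, k, d)
s_center = F_s[:, :, :d]                   # absolute semantic features s_i
F_s_enh = np.concatenate([s_center, s_center - s_shift, s_shift], -1)  # (N, k, 3d)

# Eq. (3): project both contexts and concatenate them
F_i = np.concatenate([mlp(F_g_enh, d // 2), mlp(F_s_enh, d // 2)], -1)  # (N, k, d)
```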
Pointwise feature representation is crucial for semantic segmentation tasks. Although bilateral offsets can effectively summarize local information, they cannot accurately represent local uniqueness, especially for nearby points that share a similar local context. To address this issue, a dual-branch pooling method is proposed to fully collect accurate neighborhood representations.
Given the enhanced local context F_i, adaptive pooling is first used to obtain comprehensive neighborhood information and highlight important points. Weighted average pooling is further employed to learn a weighted average of the features over the neighborhood and acquire more detailed information. Finally, the two types of information are combined to obtain an accurate point representation.
In other words, each neighbor uses its own features to compute weights for each feature dimension, and these weights are multiplied with the original features to obtain new features for each neighbor. The features of all neighbors of a point are then aggregated by summation to obtain a new feature for that point. From the perspective of the neighbor points, this approach amplifies the important features of each neighbor and suppresses the unimportant ones; from the perspective of each point itself, it weighs and sums the features of its neighbors at the feature-dimension level. This aggregation not only adaptively learns the importance of each neighbor for each point but also adaptively learns the importance of different feature dimensions. In contrast, average pooling averages the features within the neighborhood, treating different neighbors of a point equally and treating different features within each neighbor equally by assigning them the same weight. It therefore has a smoothing effect that can reduce the impact of noise and irregular features. By combining the two pooling methods into a dual-branch structure, more useful point information can be effectively aggregated, and the resulting accurate point representation is used as the input to the decoding module.
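A minimal sketch of the dual-branch pooling follows; replacing the learned attention scores with a single random linear map followed by a softmax over the neighbors is an assumption about the exact form of the attentive branch.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_branch_pooling(ctx, w_score):
    """ctx: (N, k, D) enhanced local context; w_score: (D, D) score weights."""
    # attentive branch: per-neighbour, per-dimension weights learned from ctx
    scores = softmax(ctx @ w_score, axis=1)            # (N, k, D)
    attentive = (scores * ctx).sum(axis=1)             # (N, D) weighted sum
    # mean branch: all neighbours and feature dimensions treated equally
    mean = ctx.mean(axis=1)                            # (N, D)
    # combine the generalized and the detailed view of the neighbourhood
    return np.concatenate([attentive, mean], axis=-1)  # (N, 2D)

rng = np.random.default_rng(2)
ctx = rng.normal(size=(2048, 16, 32))                  # illustrative context
point_repr = dual_branch_pooling(ctx, rng.normal(size=(32, 32)))
```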

2.3. Multiresolution Decoding Module

Typically, in end-to-end neural network models, the decoder’s task is to map the low-dimensional features output by the encoder back to a high-dimensional representation of the point cloud data, thereby generating the segmentation result. In our encoder, cascaded bilateral enhancement modules are used to exploit multiresolution point cloud information for more effective segmentation. Therefore, in order to effectively analyze real 3D point cloud scenes composed of a large number of points and further improve training efficiency, the decoder upsamples the multiresolution outputs of the encoder and fuses them into a comprehensive feature map of the entire scene. The complex decoder structure is replaced with a relatively simple and efficient multiresolution decoding module, as depicted in Figure 3. This module incorporates multiple upsampling and deconvolution operations as well as concatenation operations, and it handles both geometric information (GI) and semantic information (SI). The subsampling (DS) and upsampling (US) operations are integral components of its architecture. Such a structure significantly alleviates the computational burden of a traditional decoder.
Figure 3. The encoder and decoder structure used in our network. GI refers to geometric information and SI refers to semantic information. The decoder module consists of multiple upsampling and deconvolution operations. DS stands for subsampling, US stands for upsampling, and C means concatenation.
Firstly, the features of the last encoder layer are upsampled by nearest-neighbor interpolation so that their size is consistent with that of the encoded features of the previous layer. The decoded features of the current layer are then concatenated with the encoded features of the previous layer and passed through a transposed convolution for dimensionality reduction; the result serves as the decoded features of the previous layer. This process is repeated until the encoded features of all resolutions have been fused and the first layer of decoded features is obtained. These decoded features have the same resolution as the original point cloud, and the final segmentation result is obtained by applying an MLP to reduce their dimensionality. In this way, the decoder considers the analysis and interpretation of different resolutions while avoiding redundant operations, achieving both efficiency and accuracy. A minimal sketch of one decoding step is given below.
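The following sketch shows one level of such a decoder, assuming the nearest coarse-level point of every fine-level point has been precomputed; a pointwise linear layer stands in for the transposed convolution used in the paper.

```python
import numpy as np

def nearest_upsample(coarse_feats, nn_idx):
    """Copy each dense point's feature from its nearest coarse point.
    coarse_feats: (M, D); nn_idx: (N,) index of the nearest coarse point."""
    return coarse_feats[nn_idx]                        # (N, D)

def decode_step(coarse_feats, skip_feats, nn_idx, w):
    """One level of the multiresolution decoder.
    skip_feats: (N, D_enc) encoder features at the finer resolution.
    w: (D + D_enc, D_out) pointwise projection standing in for the
    transposed convolution."""
    up = nearest_upsample(coarse_feats, nn_idx)        # upsampling (US)
    fused = np.concatenate([up, skip_feats], axis=-1)  # concatenation (C)
    return np.maximum(fused @ w, 0)                    # dimensionality reduction

rng = np.random.default_rng(3)
coarse = rng.normal(size=(256, 64))                    # deepest encoder output
skip = rng.normal(size=(1024, 32))                     # previous encoder layer
nn_idx = rng.integers(0, 256, size=1024)               # nearest coarse point per point
decoded = decode_step(coarse, skip, nn_idx, rng.normal(size=(96, 32)))
```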

3. Experimental Results and Analysis

3.1. Datasets and Evaluation Metrics

The proposed methods were evaluated on the real indoor scene point cloud dataset S3DIS [22] and the real outdoor scene point cloud dataset SemanticKITTI [23].
The Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS) [22] is a large-scale indoor 3D point cloud dataset provided by Stanford University. It contains six teaching and office areas with a total of 695,878,620 3D points with color information and semantic labels. The data are semantically separated into 272 rooms and annotated with 12 semantic elements and an additional label for clutter. It is typically used for indoor semantic segmentation [24].
It should be noted that, to verify the effectiveness of the proposed method, a six-fold cross-validation experiment was conducted on the S3DIS dataset to effectively avoid overfitting and underfitting. Since the S3DIS dataset has six areas, the six-fold cross-validation proceeds as follows: each area is treated in turn as the test set while the remaining five areas form the training set, yielding six results whose average is reported as the final experimental result, as sketched below.
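The protocol can be summarized by the following sketch, where train_and_evaluate is a hypothetical helper that trains on the given areas and returns the mIoU on the held-out area.

```python
AREAS = ["Area_1", "Area_2", "Area_3", "Area_4", "Area_5", "Area_6"]

def six_fold_cross_validation(train_and_evaluate):
    """Each area is held out once as the test set; the remaining five areas
    form the training set. The final score is the average over the six folds."""
    scores = []
    for test_area in AREAS:
        train_areas = [a for a in AREAS if a != test_area]
        scores.append(train_and_evaluate(train_areas, test_area))
    return sum(scores) / len(scores)

# Example with a dummy evaluator that always returns the same score:
print(six_fold_cross_validation(lambda train, test: 70.0))   # -> 70.0
```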
SemanticKITTI [23] is a large-scale outdoor scene dataset containing 28 classes, including moving and nonmoving objects. These classes cover traffic participants such as pedestrians and vehicles as well as ground facilities such as parking areas and sidewalks. The original ranging data comprise 22 sequences, of which sequences 00–10 are used for training and sequences 11–21 for testing. SemanticKITTI is an authoritative dataset in the field of autonomous driving. Built on the KITTI dataset, it annotates all sequences of the KITTI Vision Odometry Benchmark and provides dense point-wise annotations for all targets captured within the LiDAR’s 360-degree range.
For the evaluation of the segmentation results, the mean intersection over union (mIoU), accuracy (Acc), and mean accuracy (mAcc) are used. The mIoU is the average of the IoU values over all classes in the dataset. Let TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives, respectively; TP is the intersection of the ground truth and the prediction, while FN + FP + TP is their union. The IoU is the ratio of the intersection to the union, as given by Equation (4):
$$\mathrm{IoU} = \frac{TP}{FN + FP + TP}. \tag{4}$$
In addition, mIoU can be represented by Equation (5):
$$\mathrm{mIoU} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{TP_i}{FN_i + FP_i + TP_i}, \tag{5}$$
where i indexes the label categories (k + 1 in total) and mIoU is the standard metric for semantic segmentation.
Accuracy (Acc) is the most common evaluation metric, namely the percentage of correctly predicted samples, defined by Equation (6):
$$\mathrm{Acc} = \frac{TP + TN}{TN + TP + FN + FP}. \tag{6}$$
In addition, mean accuracy (mAcc) is the average accuracy of all categories.
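A confusion-matrix-based sketch of the three metrics is given below; it matches Equations (4)–(6), with the overall accuracy computed as the trace of the confusion matrix divided by the number of points (the multi-class form of Equation (6)).

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """pred, gt: 1D integer label arrays of equal length."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt, pred), 1)                     # confusion matrix
    tp = np.diag(cm).astype(np.float64)
    fp = cm.sum(axis=0) - tp                         # predicted as c but not c
    fn = cm.sum(axis=1) - tp                         # labelled c but missed
    iou = tp / np.maximum(tp + fp + fn, 1)           # Eq. (4), per class
    acc = tp.sum() / cm.sum()                        # overall accuracy, Eq. (6)
    class_acc = tp / np.maximum(cm.sum(axis=1), 1)   # per-class accuracy
    return iou.mean(), acc, class_acc.mean()         # mIoU, Acc, mAcc

pred = np.random.randint(0, 13, size=10_000)         # dummy predictions
gt = np.random.randint(0, 13, size=10_000)           # dummy labels
miou, acc, macc = segmentation_metrics(pred, gt, num_classes=13)
```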

3.2. Experimental Environment

The proposed method is trained for 100 epochs on a single GeForce RTX 2080Ti GPU, using Python and the TensorFlow 1.13.1 deep learning framework, with CUDA 10.1 for computational acceleration. Adam is used as the optimizer with a batch size of five and a learning rate of 0.025, and the cross-entropy loss function is used for training. As shown in Figure 4, the loss curve stabilizes by 100 epochs, indicating that the model has converged.
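For reference, a minimal TensorFlow 1.x-style sketch of this optimization setup is shown below; the one-layer stand-in network and the placeholder shapes are assumptions for illustration and are not the released model.

```python
import tensorflow as tf   # TensorFlow 1.x API, as used in the paper

BATCH_SIZE, LEARNING_RATE, NUM_CLASSES = 5, 0.025, 13

def network(x):
    """Tiny stand-in for the segmentation network (one pointwise layer)."""
    return tf.layers.dense(x, NUM_CLASSES)            # (B, N, NUM_CLASSES)

points = tf.placeholder(tf.float32, [BATCH_SIZE, None, 6])   # xyz + rgb
labels = tf.placeholder(tf.int32, [BATCH_SIZE, None])

logits = network(points)

# cross-entropy loss averaged over all points in the batch
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

train_op = tf.train.AdamOptimizer(LEARNING_RATE).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # for each of the 100 epochs: sess.run(train_op, feed_dict={...})
```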

3.3. Parameter Selection

To investigate the impact of the hyperparameters, namely the batch size and the learning rate, we conducted multiple experiments, as shown in Figure 5. Due to hardware limitations caused by the large amount of point cloud data, the batch size was only varied from three to five; within this range, the larger the batch size, the better the segmentation results. We also report results for different learning rates under each batch size. As shown in Figure 5, the batch size and learning rate have a significant influence on the performance of the proposed method: for each batch size, the segmentation performance first improves and then declines as the learning rate increases. The performance of the proposed method is optimal when the batch size is five and the learning rate is 0.025.

3.4. Ablation Analysis

To verify the effectiveness of the attention mechanism and adaptive pooling, we report ablation results on the S3DIS Area5 dataset in terms of both quantitative results and visualizations. Note that since there is no beam category in S3DIS Area5, its segmentation accuracy is always 0, so it is not shown in the table. Table 1 presents the quantitative results of the ablation experiments, where the rows (ceiling, window, and so on) represent the different point cloud categories. The baseline refers to the model consisting of the bilateral enhanced encoding structure without the attention mechanism and adaptive pooling, together with the multiresolution decoding module. Baseline + FPS denotes the same model under farthest-point sampling. Baseline + CBAM uses the attention module but not adaptive pooling, while baseline + adaptive pooling uses adaptive pooling but not the CBAM attention mechanism. The final model uses both the CBAM attention mechanism and adaptive pooling under random sampling. From the results in the table, we can see that each module plays an important role in the final segmentation effect. After integrating the attention mechanism, the mIoU improves by 0.62% compared to the baseline, as the attention mechanism enables the network to pay more attention to important points and regions, thereby improving segmentation accuracy. After using adaptive pooling, the mIoU improves by 1.03% compared to the baseline, mainly because adaptive pooling enhances the expressive ability of local features, thus positively affecting the segmentation effect. Finally, integrating both the attention mechanism and adaptive pooling improves the mIoU by 2.58% compared to the baseline.
The effectiveness of each module is visualized in Figure 6. Figure 6a shows the ground truth, Figure 6b the result of the baseline, Figure 6c the result under farthest-point sampling (FPS), Figure 6d the result of baseline + CBAM, Figure 6e the result of baseline + adaptive pooling, and Figure 6f the result of the final proposed model.
It can be clearly seen from the visualization results that the final model achieves better segmentation than the other models. While the overall segmentation results are similar, the black frames in the figure highlight the differences in segmentation performance for the window and sofa classes; compared with the other models, the final proposed model performs better in both areas.

3.5. Comparative Experiment

We first compare the sampling time and GPU memory usage of two sampling methods: farthest-point sampling (FPS) and random sampling (RS). Then, we compare the computational time and network parameters of the proposed method and other advanced methods on the S3DIS Area5 dataset. Finally, we compare the final experimental results of the proposed method with several popular methods on two datasets: S3DIS and SemanticKITTI. At the same time, we also report visualized results as a part of the comparative experiments.
As shown in Figure 7, taking 1000 points as an example, the point cloud in the proposed method undergoes downsampling five times, with sampling rates of 4, 4, 4, 4, and 2. That is, the number of points retained after each downsampling step is 25%, 25%, 25%, 25%, and 50% of the previous step, respectively (see the short sketch after this paragraph). In this experiment, the comparison uses the same sampling rates over the five rounds of downsampling, and experiments with other point cloud sizes follow the same principle. Since farthest-point sampling is the most commonly used sampling method in point cloud processing and its sampling time and memory usage both have a critical impact on training, we also conducted comparative experiments between random sampling and farthest-point sampling in terms of sampling time and GPU memory usage.
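For example, starting from 1000 points, the five stages retain the following numbers of points; integer division is an assumption about how fractional counts are handled.

```python
points = 1000
for rate in (4, 4, 4, 4, 2):      # keep 1/4, 1/4, 1/4, 1/4, then 1/2
    points //= rate
    print(points)                  # prints 250, 62, 15, 3, 1
```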
In Figure 7a, for smaller point cloud sizes, there is no significant difference in sampling time between FPS and RS. However, as the point cloud size increases, the sampling time of FPS gradually becomes much higher than that of RS. This indicates that RS is far more efficient than FPS on large-scale point clouds and is therefore better suited to such data.
Meanwhile, as shown in Figure 7b, with the same number of points, FPS requires more memory than RS, and as the number of points increases, FPS’s memory consumption remains much larger than that of RS. Therefore, in terms of both sampling time and GPU memory consumption, random sampling is more suitable than farthest-point sampling for large-scale point cloud scenarios.
Table 2 reports the computation time and network parameters of current popular methods on the S3DIS Area5 dataset. As shown in the table, the proposed method has a lower computation time than the other methods, and its number of parameters is also smaller than that of most methods, mainly owing to the random sampling strategy and the efficient multiresolution decoding module. It can also be seen that PointNet has far fewer parameters than the other methods because, as the pioneering direct point-based network, it is simpler but also less capable of learning point features. Less computation time means higher model efficiency, and fewer parameters mean lower memory requirements. The proposed method is therefore more efficient than the other methods and requires fewer memory resources, achieving competitive results in efficiency.
Next, we report the comparative results on the S3DIS dataset between the proposed method and several currently popular methods, using six-fold cross-validation over the six areas and the two metrics mAcc and mIoU. As shown in Table 3, the proposed method outperforms the others on both metrics; specifically, it outperforms BAAF-Net by 0.6% in mAcc and CBL by 0.5% in mIoU. In summary, the proposed method also achieves good results in segmentation accuracy. Table 4 reports the mIoU (%) of the proposed method and several other popular methods on the SemanticKITTI dataset; the proposed method again performs best, with an mIoU 0.8% higher than BAAF-Net, and its segmentation results are relatively accurate.
Figure 8 shows visual comparisons on the SemanticKITTI dataset for three scenes. Each column corresponds to one scene; the first row shows the original data of the sequence, and the subsequent rows show the visualization results of PointNet++, BAAF-Net, and the proposed method.
It can be observed from the visualization results that due to the large point cloud scenes in this dataset, it is difficult to show very subtle differences in segmentation. The overall scene-segmentation effect does not seem to vary much. However, in some difficult-to-segment and error-prone areas, there are observable differences in the segmentation results. As shown in the red box, for all three scenes, it can be observed that PointNet++ has poorer segmentation results with more errors, while the proposed method performs relatively better.
Although the proposed method achieves a satisfactory performance on the used datasets, it is still necessary to point out the difficulties in practical applications [27]. The application of point cloud semantic segmentation faces challenges in real-world scenarios. Data quality and diversity are crucial factors that affect the accuracy of segmentation. Inadequate data quality, such as noise, incompleteness, or missing data, can undermine the performance of the system. Additionally, a lack of diverse datasets may limit the system’s ability to generalize. Furthermore, environmental changes, such as variations in lighting, object movement, or background changes, can affect the acquisition and segmentation of point cloud data. In many practical applications, real-time performance is essential, such as in autonomous driving and robot navigation. Therefore, processing speed and efficiency are crucial considerations.

4. Conclusions

This paper proposed an efficient semantic segmentation method for point clouds based on random sampling and bilateral enhancement, where the latter addresses the problem of missing key points caused by random sampling. We also proposed a dual-branch structure of adaptive pooling and average pooling in the network encoder to further enhance the extraction of key features, and a lightweight decoding module to further reduce the time and space complexity of the model. The effectiveness of the method is validated on the large-scale indoor dataset S3DIS with 73.6% mIoU and the large-scale outdoor dataset SemanticKITTI with 60.7% mIoU. The experimental results also demonstrate that the proposed method outperforms several competitors in terms of accuracy, training speed, and memory consumption.

Author Contributions

Data curation, Y.Z.; methodology, D.S. and X.W.; writing—original draft preparation, W.L. and Y.L.; writing—review and editing, X.M. and X.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grants No. 62003225 and No. 62073039), the Shenyang Science and Technology Project (23-407-3-05), and the Educational Department of Liaoning Provincial Basic Research Project (JYTMS20231572).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare that they have no known competing financial interest or personal relationship that could have appeared to influence the work reported in this paper.

References

  1. Xiao, A.; Huang, J.; Guan, D.; Zhang, X.; Lu, S.; Shao, L. Unsupervised point cloud representation learning with deep neural networks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11321–11339. [Google Scholar] [CrossRef] [PubMed]
  2. Zhang, R.; Li, G.; Wunderlich, T.; Wang, L. A survey on deep learning-based precise boundary recovery of semantic segmentation for images and point clouds. Int. J. Appl. Earth Obs. Geoinf. 2021, 102, 102411. [Google Scholar] [CrossRef]
  3. Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep learning for 3d point clouds: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 4338–4364. [Google Scholar] [CrossRef] [PubMed]
  4. Khan, H.; Hussain, T.; Khan, S.U.; Khan, Z.A.; Baik, S.W. Deep multi-scale pyramidal features network for supervised video summarization. Expert Syst. Appl. 2024, 237, 121288. [Google Scholar] [CrossRef]
  5. Lu, Y.; Jiang, Q.; Chen, R.; Hou, Y.; Zhu, X.; Ma, Y. See more and know more: Zero-shot point cloud segmentation via multi-modal visual data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 21674–21684. [Google Scholar]
  6. Ye, M.; Wan, R.; Xu, S.; Cao, T.; Chen, Q. Efficient point cloud segmentation with geometry-aware sparse networks. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 196–212. [Google Scholar]
  7. Cui, Y.; Chen, R.; Chu, W.; Chen, L.; Tian, D.; Li, Y.; Cao, D. Deep learning for image and point cloud fusion in autonomous driving: A review. IEEE Trans. Intell. Transp. Syst. 2021, 23, 722–739. [Google Scholar] [CrossRef]
  8. Wirth, F.; Quehl, J.; Ota, J.; Stiller, C. Pointatme: Efficient 3d point cloud labeling in virtual reality. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 1693–1698. [Google Scholar]
  9. Seita, D.; Wang, Y.; Shetty, S.J.; Li, E.Y.; Erickson, Z.; Held, D. Toolflownet: Robotic manipulation with tools via predicting tool flow from point clouds. In Proceedings of the Conference on Robot Learning, Atlanta, GA, USA, 6–9 November 2023; pp. 1038–1049. [Google Scholar]
  10. Li, D.; Shi, G.; Li, J.; Chen, Y.; Zhang, S.; Xiang, S.; Jin, S. PlantNet: A dual-function point cloud segmentation network for multiple plant species. ISPRS J. Photogramm. Remote Sens. 2022, 184, 243–263. [Google Scholar] [CrossRef]
  11. Nie, D.; Lan, R.; Wang, L.; Ren, X. Pyramid architecture for multi-scale processing in point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17284–17294. [Google Scholar]
  12. Yang, C.K.; Wu, J.J.; Chen, K.S.; Chuang, Y.Y.; Lin, Y.Y. An mil-derived transformer for weakly supervised point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11830–11839. [Google Scholar]
  13. Qiu, S.; Anwar, S.; Barnes, N. Semantic segmentation for real point cloud scenes via bilateral augmentation and adaptive fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1757–1767. [Google Scholar]
  14. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  15. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  16. Xu, Q.; Zhou, Y.; Wang, W.; Qi, C.R.; Anguelov, D. Spg: Unsupervised domain adaptation for 3d object detection via semantic point generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 20–25 June 2021; pp. 15446–15456. [Google Scholar]
  17. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11108–11117. [Google Scholar]
  18. Khan, K.; Khan, R.U.; Albattah, W.; Nayab, D.; Qamar, A.M.; Habib, S.; Islam, M. Crowd counting using end-to-end semantic image segmentation. Electronics 2021, 10, 1293. [Google Scholar] [CrossRef]
  19. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  20. Wang, C.; Ning, X.; Sun, L.; Zhang, L.; Li, W.; Bai, X. Learning Discriminative Features by Covering Local Geometric Space for Point Cloud Analysis. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  21. Liu, K.; Gao, Z.; Lin, F.; Chen, B.M. Fg-net: A fast and accurate framework for large-scale lidar point cloud understanding. IEEE Trans. Cybern. 2022, 53, 553–564. [Google Scholar] [CrossRef] [PubMed]
  22. Armeni, I.; Sener, O.; Zamir, A.R.; Jiang, H.; Brilakis, I.; Fischer, M.; Savarese, S. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1534–1543. [Google Scholar]
  23. Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9297–9307. [Google Scholar]
  24. Thabet, A.; Alwassel, H.; Ghanem, B. Self-supervised learning of local features in 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 938–939. [Google Scholar]
  25. Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. Pointcnn: Convolution on x-transformed points. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
  26. Tang, L.; Zhan, Y.; Chen, Z.; Yu, B.; Tao, D. Contrastive boundary learning for point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8489–8499. [Google Scholar]
  27. Khan, H.; Ullah, M.; Al-Machot, F.; Cheikh, F.A.; Sajjad, M. Deep learning based speech emotion recognition for Parkinson patient. Image 2023, 298, 2. [Google Scholar] [CrossRef]
Figure 1. Framework of our method. The proposed efficient point cloud semantic segmentation method is based on bilateral enhancement modules (BEMs) (detailed in Figure 2) and random sampling. Random sampling is used to reduce the computational burden, while the BEM is proposed to address the resulting degradation problem, so the two components complement each other. In addition, a lightweight multiresolution decoding module (detailed in Figure 3) is proposed to further reduce the time and space complexity of the model.
Figure 4. Loss curves during training.
Figure 5. Influence of batch size and learning rate.
Figure 6. Visualized ablation results.
Figure 7. Comparison of farthest-point sampling (FPS) and random sampling (RS) in terms of training speed and space consumption.
Figure 8. Visualization results on SemanticKITTI dataset.
Table 1. Ablation results in terms of IoU (%) and mIoU (%) on S3DIS Area5 dataset.
Class | Baseline | Baseline + FPS | Baseline + CBAM | Baseline + Adaptive Pooling | Ours
Ceiling | 92.08 | 92.71 | 93.09 | 92.53 | 92.45
Floor | 97.58 | 97.88 | 97 | 97.55 | 98.04
Wall | 81.93 | 82.13 | 82.83 | 80.8 | 81.85
Column | 27.54 | 25.11 | 25.78 | 29.46 | 27.91
Window | 58.27 | 62.16 | 63.88 | 59.82 | 65.27
Door | 49.13 | 47.06 | 49.42 | 50.23 | 58.92
Table | 77.58 | 79.3 | 77.13 | 76.05 | 78.72
Chair | 85.89 | 85.93 | 86.78 | 86.13 | 86.4
Sofa | 65.17 | 70.1 | 68.17 | 73.51 | 78.51
Bookcase | 68.96 | 72.5 | 69.14 | 69.71 | 69.16
Board | 64.97 | 68.68 | 63.88 | 65.02 | 66.51
Clutter | 52.48 | 55.2 | 52.54 | 54.1 | 51.21
Average (mIoU) | 63.19 | 64.52 | 63.81 | 64.22 | 65.77
Table 2. Comparison of experiment results on S3DIS Area5 dataset for computing time and network parameters.
Method | Time (s) | #Parameters (Millions)
PointNet [14] | 324 | 1.4
PointNet++ [15] | 985 | 11.52
PointCNN [25] | 8426 | 17.31
RandLA-Net [17] | 241 | 2.37
BAAF-Net [13] | 283 | 7.13
CBL [26] | 352 | 9.54
Ours | 228 | 5.05
Table 3. Comparison experiment results on S3DIS dataset.
Method | mAcc (%) | mIoU (%)
PointNet [14] | 66.2 | 47.6
PointNet++ [15] | 67.1 | 54.5
PointCNN [25] | 75.6 | 65.4
RandLA-Net [17] | 82 | 70
BAAF-Net [13] | 83.1 | 72.2
CBL [26] | 79.4 | 73.1
Ours | 83.7 | 73.6
Table 4. IoU (%) and mIoU (%) of different methods on the SemanticKITTI dataset, where each row lists the per-category IoU values of each method.
Class | PointNet [14] | PointNet++ [15] | RandLA-Net [17] | BAAF-Net [13] | Ours
road | 61.6 | 72 | 90.7 | 90.9 | 90.3
sidewalk | 35.7 | 41.8 | 73.7 | 74.4 | 75.1
parking | 15.8 | 18.7 | 60.3 | 62.2 | 64.4
other-ground | 1.4 | 5.6 | 20.4 | 23.6 | 23.2
building | 41.4 | 62.3 | 86.9 | 89.8 | 88.7
car | 46.3 | 53.7 | 94.2 | 95.4 | 95.8
truck | 0.1 | 0.9 | 40.1 | 48.7 | 49.4
bicycle | 1.3 | 1.9 | 26 | 31.8 | 35.2
motorcycle | 0.3 | 0.2 | 25.8 | 35.5 | 37.6
other-vehicle | 0.8 | 0.2 | 38.9 | 46.7 | 47.3
vegetation | 31 | 46.5 | 81.4 | 82.7 | 84.4
trunk | 4.6 | 13.8 | 61.3 | 63.4 | 65.2
terrain | 17.6 | 30 | 66.8 | 67.9 | 68.1
person | 0.2 | 0.9 | 49.2 | 49.5 | 50.3
bicyclist | 0.2 | 1 | 48.2 | 55.7 | 56.1
motorcyclist | 0 | 0 | 7.2 | 53 | 56.9
fence | 12.9 | 16.9 | 56.3 | 60.8 | 58.3
pole | 2.4 | 6 | 49.2 | 53.7 | 54.1
traffic sign | 3.7 | 8.9 | 47.4 | 52 | 53.6
Average (mIoU) | 14.6 | 20.1 | 53.9 | 59.9 | 60.7
