Article

DB-Net: Detecting Vehicle Smoke with Deep Block Networks

College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518118, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(8), 4941; https://doi.org/10.3390/app13084941
Submission received: 10 March 2023 / Revised: 6 April 2023 / Accepted: 12 April 2023 / Published: 14 April 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract
Vision-based vehicle smoke detection aims to locate the regions of vehicle smoke in video frames, which plays a vital role in intelligent surveillance. In the deep learning era, existing methods mainly treat vehicle smoke detection as either bounding-box-based detection or pixel-level semantic segmentation, both of which struggle to balance localization accuracy and speed. In addition, although various studies have been reported, there is no open benchmark available for real vehicle smoke detection. To address these issues, we make three contributions: (i) We build a real-world vehicle smoke semantic segmentation dataset with 3962 polygon-annotated vehicle smoke images, which will be released to the community. (ii) We regard vehicle smoke detection as a block-wise prediction problem and propose a conceptually new yet simple deep block network (DB-Net). It provides more accurate localization than bounding-box-based methods at a lower computational cost than semantic segmentation methods. (iii) We introduce a coarse-to-fine training strategy that first pre-trains the model on bounding-box-annotated data and then fine-tunes it on pixel-wise labeled data. We compare DB-Net to several advanced methods and evaluate them on several metrics. Extensive experiments demonstrate that our method significantly outperforms the others.

1. Introduction

Over the last few decades, the number of vehicles has been increasing worldwide, especially in developing countries [1]. With this growth, vehicle emissions, including carbon dioxide, volatile organic compounds, and particulate matter, are becoming one of the major sources of air pollution in many areas and have been associated with many adverse human health effects [2]. Visible vehicle emissions, a heavily polluting type of vehicle emission, are defined as vehicle smoke; a typical example is shown in Figure 1. The prevailing methods for detecting vehicle smoke rely on manual reports from traffic police or on analyzing vehicle emissions with an exhaust gas analyzer, which is inefficient and extremely expensive. Due to these limitations, detecting vehicle smoke in surveillance videos has recently drawn widespread attention, as it provides a broader monitoring field and faster response than the prevailing methods.
Detecting vehicle smoke in surveillance videos aims to locate the region of vehicle smoke in each video frame, the core of which is smoke detection. Vision-based smoke detection has been studied extensively over the past 20 years [3]. However, most existing research has focused on the wild scenario [4,5,6], with little attention paid to road vehicle scenarios (i.e., vehicle smoke detection). Compared with the wild scenario, the road vehicle scenario is more cluttered, with distractors such as shadows behind vehicles and interference between vehicles, which makes vehicle smoke detection a more challenging problem. Although many methods for detecting smoke in the wild can be transplanted to vehicle smoke detection, these transplanted methods still struggle due to the significant differences between wild and road vehicle scenarios.
Existing vehicle smoke detection methods can be roughly categorized into traditional image-processing-based and deep-learning-based methods. Traditional image-processing-based approaches usually rely on hand-crafted features, recognizing smoky and non-smoky regions according to their different characteristics. These methods involve two critical steps. First, hand-crafted features are extracted according to visual or statistical clues from regions, such as texture features [7], dynamic features [8], and spatiotemporal features [9]. After feature extraction, a classifier such as K-Nearest Neighbors (K-NN) [10] or a Support Vector Machine (SVM) [11] categorizes these regions as non-smoky or smoky. Research on traditional image-processing-based approaches mainly focuses on designing interpretable and efficient feature descriptors to improve the recognition accuracy of smoke regions. However, the performance of these feature descriptors depends on prior knowledge and manual design and can be severely influenced by complex environments. For example, the dynamic features of smoke can change with vehicle speed, and different backgrounds can make the texture features of smoke differ significantly.
Recently, deep learning techniques have been applied to various computer vision tasks and have made tremendous progress, including in object detection and semantic segmentation [12,13]. Both deep-learning-based object detection and semantic segmentation approaches can be applied to vehicle smoke detection. Object detection approaches provide rough location information, while semantic segmentation approaches realize pixel-level prediction by outputting a mask that contains classification, localization, and a detailed depiction of boundaries. Nevertheless, semantic segmentation for vehicle smoke detection still faces several challenges. First, training a semantic segmentation model requires extensive, high-quality labeled data; even worse, there is still no public semantic segmentation dataset for vehicle smoke in the community. Second, compared with conventional objects, it is more difficult to infer an accurate mask for vehicle smoke due to the significant variations in smoke appearance and its ambiguous boundaries. Third, detecting vehicle smoke in surveillance videos usually requires real-time processing, while semantic segmentation models are generally time-consuming and heavy.
In this paper, we strike a trade-off between object detection and semantic segmentation and propose a conceptually new yet simple deep block network (DB-Net). DB-Net is built with fully convolutional layers: it takes images of arbitrary resolution as input and outputs block-wise classification masks. Specifically, DB-Net contains a main branch for feature extraction and a feature aggregation branch that enhances features for smoke classification. Compared to popular semantic segmentation methods, DB-Net has fewer parameters, runs much faster, and maintains comparable or superior performance. In addition, to facilitate progress in vehicle smoke detection, we built a real-world vehicle smoke semantic segmentation dataset, termed the Polygon-based annotated Vehicle Smoke Segmentation dataset (PoVSSeg). The images in PoVSSeg are richly diverse in road conditions, weather, vehicle types, smoke types (light or strong), and so forth. In total, PoVSSeg contains 3962 vehicle smoke images with polygon annotations. We expect PoVSSeg to become a new benchmark for smoke detection and segmentation in images. Furthermore, we propose a coarse-to-fine training strategy to make full use of existing bounding-box-annotated data. Annotating smoke regions with accurate boundaries is difficult, while marking a bounding box for each smoke plume is relatively easy. To this end, we regard a massive collection of bounding-box-labeled data as coarse polygon annotations and pre-train DB-Net on it as a warm-up. We then fine-tune the pre-trained model on our PoVSSeg to learn more accurate smoke features. Overall, this training strategy boosts DB-Net effectively.
This paper is organized as follows. Related works on vehicle smoke detection, including some works on wild smoke detection, are briefly reviewed in Section 2. Our dataset is then introduced and compared with existing datasets in Section 3. Our method is described in Section 4, and the experimental results and analysis are presented in Section 5. Finally, we conclude and discuss future work in Section 6.

2. Related Works

In this section, we briefly review the related work on vehicle smoke detection, including some works on wild smoke detection. We roughly classify existing methods into traditional image-processing-based methods and deep-learning-based methods.

2.1. Traditional Image-Processing-Based Methods

Existing works on traditional image-processing-based approaches mainly focus on designing interpretable and efficient feature descriptors. Yuan et al. [7] proposed a feature extraction method based on similarity and dissimilarity matching measures of local binary patterns and combined it with local binary patterns for modeling the texture features of smoke. Dimitropoulos et al. [8] introduced a higher-order linear dynamical system descriptor for extracting dynamic texture features in video smoke detection. Tao et al. [14] presented a robust volume local binary count pattern for smoke sequence recognition; this pattern extends the 2D texture local binary count to the spatiotemporal domain, combining texture and dynamic features. Tao et al. [15] developed an adaptive-scale local binary pattern and a discriminative edge orientation histogram to obtain texture and gradient-based features, respectively, and used a multi-feature fusion technique for detecting vehicle smoke. Overall, because hand-crafted feature descriptors extract only shallow features of the smoke, traditional image-processing-based methods are limited by the representation ability of these shallow features.

2.2. Deep-Learning-Based Methods

Deep learning methods for vehicle smoke detection can be divided into object detection models and semantic segmentation models according to the form of their outputs.
For object detection models, Wang et al. [16] proposed a two-stage model for vehicle smoke detection: the first stage locates the regions around exhaust pipes where suspicious smoky regions may appear, and the second stage uses a multi-region convolutional tower network to further verify these suspicious regions. Zhou et al. [17] applied an efficient spatial attention module in a convolutional neural network to recognize smoky regions. Wang et al. [18] designed a lightweight network based on YOLOv5 (https://github.com/ultralytics/yolov5, accessed on 1 July 2022) to detect vehicle smoke rapidly when deployed on embedded devices.
For semantic segmentation models, Yuan et al. [19] built a classification-assisted gated recurrent network with feature extraction, semantic segmentation, classification, and fusion modules for smoke semantic segmentation. In the segmentation module, they introduced attention convolutional gated recurrent units to learn long-range context dependencies of features; additionally, a multi-scale context-contrasted local feature structure and a dense pyramid pooling module were applied to improve the representation ability of the network. Inspired by fully convolutional networks (FCNs) [20], Yuan et al. [21] established a deep segmentation network built from two encoder–decoder FCNs with different skip structures, one extracting global context information about the smoke and the other retaining its fine spatial details. Sheng et al. [22] presented a dual-branch network composed of deep and shallow branches for vehicle smoke segmentation, where the deep branch handles global prediction and the shallow branch preserves spatial details; moreover, a pyramid attention structure and skip modules were used to enlarge the receptive field and integrate multi-scale features.

3. Vehicle Smoke Dataset

Although many methods have been proposed for vehicle smoke detection, to the best of our knowledge, there is still no publicly available vehicle smoke dataset annotated with precise location information. Existing smoke datasets primarily consist of images or videos in wild or outdoor settings; the few datasets focused on road vehicle scenarios provide only rough annotations of smoke locations. To this end, we built a high-quality vehicle smoke image dataset with polygon annotations for precise localization. In the rest of this section, we outline the creation of our dataset and compare it with existing datasets.

3.1. Dataset Collection and Annotation

To build a high-quality vehicle smoke dataset, we first obtained real-world surveillance videos from various traffic scenes and captured vehicle smoke frames from these videos. After data collection, we manually annotated the polygonal smoke regions in each image using the LabelMe toolbox (https://github.com/CSAILVision/LabelMeAnnotationTool, accessed on 1 July 2022). Due to the smoke's translucency, manually annotating polygonal vehicle smoke was an arduous task. To obtain high-quality annotations, we employed several annotators to label the polygonal smoke regions carefully, and each annotator's work was further validated by other colleagues. Eventually, we built a high-quality dataset of 3962 vehicle smoke images with polygon-based annotations, which we named the Polygon-based annotated Vehicle Smoke Segmentation dataset (PoVSSeg). PoVSSeg covers a wide diversity of road conditions (highway or urban), weather (sunny, cloudy, and rainy), vehicle types (bus, truck, and car), and smoke types. Figure 2 illustrates some samples from PoVSSeg. We randomly split PoVSSeg into a training set and a test set, containing 3812 and 150 images, respectively.

3.2. Dataset Comparison

We compare our dataset with several existing datasets in Table 1. The VSD [23] (see Figure 3a) consists of four sub-datasets totaling 24,217 images, which were manually collected from the Internet, then cropped, resized, and labeled as smoke or non-smoke for smoke recognition. The RF dataset [24] (see Figure 3b) is composed of synthetic images, since it is challenging to obtain wild-scenario smoke images in the real world; its samples were produced by inserting real smoke into forest backgrounds and annotating the smoke with bounding boxes. The Synthetic Smoke dataset [21] (see Figure 3c) is also composed of synthetic images but differs from the RF dataset in that its samples were generated by inserting simulated smoke into real-world images, with the smoke annotated at the pixel level. These datasets have driven progress in smoke recognition, detection, and segmentation in wild or outdoor scenarios. For the road vehicle scenario, Peng et al. [25] proposed the LaSSoV-video and LaSSoV datasets. LaSSoV-video, annotated with temporal segments, consists of 82 training videos and 81 test videos across four traffic scenes. LaSSoV includes a training set of 70,000 images and a test set of 5000 images, annotated with bounding boxes that provide coarse smoke locations. In contrast with these existing datasets, our PoVSSeg is the only real-world dataset that provides precisely annotated smoke locations.

4. Method

4.1. Problem Formulation

Semantic segmentation methods regard vehicle smoke detection as a pixel-wise prediction task, which usually incurs an expensive computational cost [26]. To overcome this inherent limitation, we leverage a block-wise recognition strategy for vehicle smoke detection, which lies between semantic segmentation and object detection in granularity. Specifically, an image is gridded into R rows and C columns to obtain R × C image blocks, and recognition is performed block-wise, which reduces the computational cost significantly compared with the segmentation strategy. Suppose an image has pixel size W × H, and X is the feature extracted by the model. The segmentation method can then be formulated as follows:
$$P_{i,j,:} = f_{ij}(X), \quad \mathrm{s.t.}\; i \in [1, W],\; j \in [1, H],$$
where $f_{ij}$ is the classifier that predicts vehicle smoke at position (i, j), and $P_{i,j,:}$ is the corresponding probability vector. In a similar way, the block-wise prediction method can be formulated as follows:
$$P_{i,j,:} = f_{ij}(X), \quad \mathrm{s.t.}\; i \in [1, C],\; j \in [1, R].$$
Therefore, the computation reduction ratio $r$ can be expressed as:
$$r = \frac{W \cdot H}{C \cdot R}.$$
Taking the setting of DB-Net as an example, with an input image of size 960 × 540 and a grid of 60 × 34 blocks, the computation reduction ratio reaches approximately 254.
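To make this concrete, the following minimal sketch (our illustration, not the authors' code) computes the reduction ratio for the DB-Net setting:

```python
def reduction_ratio(width: int, height: int, cols: int, rows: int) -> float:
    """Compute r = (W * H) / (C * R), the pixel-to-block prediction ratio."""
    return (width * height) / (cols * rows)

# DB-Net setting: a 960 x 540 input gridded into 60 x 34 blocks
print(round(reduction_ratio(960, 540, 60, 34)))  # 254
```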

4.2. DB-Net

The FCN [20] dramatically influenced the field of semantic segmentation with its encoder–decoder architecture, in which the encoder extracts features while the decoder aggregates and classifies them, the two operating sequentially. Inspired by the encoder–decoder structure and based on the block-wise prediction formulation, we propose DB-Net, a dual-branch network with a main branch and an aggregation branch. The main branch is designed for feature extraction, and the aggregation branch is for feature enhancement. With this dual-branch architecture, DB-Net can extract and aggregate features in parallel, which is more efficient than encoder–decoder structured models.
The architecture of the proposed DB-Net is presented in Figure 4. In detail, the main branch is a four-stage convolutional neural network in which each stage halves the spatial size of its input along each dimension. In this work, we used the top four stages of a 50-layer ResNet [27] as the main branch to balance computational efficiency and performance. After the cumulative $1/2^4$ downsampling in the main branch, an input image with a spatial size of 960 × 540 is reduced to a feature map of spatial size 60 × 34. This feature map already matches the size of the labeled ground truth, but much detailed information is discarded during forward propagation. Therefore, we designed an aggregation branch that fuses the features yielded by each stage of the main branch, from shallow to deep, to obtain contextual information.
In the aggregation branch, convolution and downsampling operations unify the outputs of each shallower stage and the next stage to the same shape: convolutions ensure channel consistency, while downsampling adjusts the size of the shallower feature map to match the feature map produced by the next stage. For example, the first and second stages of the main branch produce feature maps of sizes 64 × 480 × 270 and 256 × 240 × 135, respectively. The aggregation branch first uses convolution layers to unify the channel size of both feature maps, then downsamples the feature map produced by the first stage to match that of the second stage; after downsampling, both feature maps have the same shape of 2 × 240 × 135. The aggregation branch then performs a pixel-wise summation of the two feature maps. We repeat this process from shallow to deep until the features from every stage are fused.
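The following PyTorch sketch illustrates this dual-branch design under our reading of the description; the specific layer choices (1 × 1 convolutions for channel unification, average pooling for downsampling) are our assumptions, not the authors' released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class DBNetSketch(nn.Module):
    """Illustrative dual-branch network: a ResNet-50 main branch (top four
    stages, each halving the spatial size) plus an aggregation branch that
    unifies every stage output to num_classes channels and sums them from
    shallow to deep."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        r = torchvision.models.resnet50(weights=None)
        self.stages = nn.ModuleList([
            nn.Sequential(r.conv1, r.bn1, r.relu),  # stage 1: 64 ch,   1/2
            nn.Sequential(r.maxpool, r.layer1),     # stage 2: 256 ch,  1/4
            r.layer2,                               # stage 3: 512 ch,  1/8
            r.layer3,                               # stage 4: 1024 ch, 1/16
        ])
        self.unify = nn.ModuleList([
            nn.Conv2d(c, num_classes, kernel_size=1)
            for c in (64, 256, 512, 1024)
        ])

    def forward(self, x):
        agg = None
        for stage, unify in zip(self.stages, self.unify):
            x = stage(x)      # main branch feature for this stage
            f = unify(x)      # channel unification to num_classes
            if agg is not None:
                # downsample the running aggregate to this stage's spatial
                # size, then fuse by pixel-wise summation
                agg = F.adaptive_avg_pool2d(agg, f.shape[-2:]) + f
            else:
                agg = f
        return agg  # block-wise logits

model = DBNetSketch()
logits = model(torch.randn(1, 3, 540, 960))
print(logits.shape)  # torch.Size([1, 2, 34, 60]): one score pair per block
```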
After obtaining the aggregated features, a softmax function converts them into probabilities, which can be written as:
$$P_{i,j,:} = \mathrm{Softmax}(X_{i,j,:}),$$
where $X_{i,j,:}$ represents the feature at position (i, j). In the training phase, a standard cross-entropy (CE) loss function regularizes the parameter optimization. Let $G_{i,j,:}$ denote the ground-truth label of the block at position (i, j). The classification loss can be expressed as:
$$L_{cls} = \sum_{i=1}^{C} \sum_{j=1}^{R} \mathrm{CE}\left(P_{i,j,:},\, G_{i,j,:}\right).$$
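In code, this block-wise loss reduces to a standard cross-entropy over the grid of logits; a minimal sketch follows (the shapes match the 60 × 34 grid, but the tensors here are random placeholders for illustration):

```python
import torch
import torch.nn.functional as F

# logits: (batch, num_classes, R, C) block-wise scores from the network;
# labels: (batch, R, C) with 0 = background block, 1 = smoke block
logits = torch.randn(16, 2, 34, 60)
labels = torch.randint(0, 2, (16, 34, 60))

# reduction="sum" matches the summation over all C x R blocks in the formula
loss = F.cross_entropy(logits, labels, reduction="sum")
```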

4.3. Coarse-to-Fine Training

As mentioned in the previous section, LaSSoV [25] has the largest number of images with bounding-box annotations, which can be regarded as rough polygon annotations with abundant background information. To take full advantage of this dataset, we propose a coarse-to-fine training strategy that first pre-trains DB-Net on LaSSoV and then fine-tunes it on our PoVSSeg.
Before training, we converted both the semantic segmentation dataset and the bounding-box-based detection dataset into block-wise labeled datasets to fit the DB-Net structure. To do this, we divided each image into blocks of the same size; according to DB-Net's structure, the block size was set to 16 × 16, so a 960 × 540 image is divided into 60 × 34 blocks. Blocks containing smoke were labeled as smoke blocks, while the rest were labeled as background blocks. The original PoVSSeg is annotated with high-quality polygons, and this detail is retained in the block-wise labeled PoVSSeg-block. The bounding-box-annotated LaSSoV was converted in the same way. Notably, a bounding box contains both vehicle smoke and some background, so after conversion some background blocks are incorrectly labeled as vehicle smoke; hence only a coarsely annotated block-wise dataset, LaSSoV-block, can be obtained. Figure 5 shows converted block-wise annotated samples from PoVSSeg and LaSSoV.
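A possible conversion routine is sketched below; the rule that any smoke pixel marks a block as smoke, and the zero-padding of 540 rows up to 544 so that the grid reaches 34 rows, are our assumptions about the paper's procedure:

```python
import numpy as np

BLOCK = 16  # block size from the paper: 16 x 16 pixels

def mask_to_blocks(mask: np.ndarray) -> np.ndarray:
    """Grid a binary smoke mask (H, W) into block labels.

    A block is labeled smoke (1) if it contains at least one smoke pixel.
    The mask is zero-padded to a multiple of BLOCK, so a 540 x 960 mask
    yields the paper's 34 x 60 grid.
    """
    h, w = mask.shape
    mask = np.pad(mask, ((0, (-h) % BLOCK), (0, (-w) % BLOCK)))
    H, W = mask.shape
    blocks = mask.reshape(H // BLOCK, BLOCK, W // BLOCK, BLOCK)
    return (blocks.max(axis=(1, 3)) > 0).astype(np.int64)

def boxes_to_blocks(boxes, h: int, w: int) -> np.ndarray:
    """Rasterize (x1, y1, x2, y2) boxes into a mask, then grid it.

    Every block inside a box becomes a smoke block, which is exactly why
    the converted LaSSoV-block labels are coarse: boxes include background.
    """
    mask = np.zeros((h, w), dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1
    return mask_to_blocks(mask)

labels = boxes_to_blocks([(100, 200, 400, 380)], h=540, w=960)
print(labels.shape)  # (34, 60)
```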
After the data conversion, the model was first trained on the LaSSoV-block training set to mine the information hidden in various backgrounds and to learn coarse knowledge about the smoke. Due to the annotation errors in the LaSSoV-block dataset, it is difficult for the model to extract accurate features through this coarse training alone. Therefore, the PoVSSeg-block training set, which contains precise annotations, was used to fine-tune the model after coarse training. Fine-tuning is a crucial step for the model to acquire correct knowledge about the smoke. Training the model in this manner lets us take advantage of both the high-quality annotations in PoVSSeg-block and the rich samples in LaSSoV-block, resulting in a more effective model.

5. Experiments

In this section, we first introduce the implementation details and evaluation metrics. Subsequently, we design ablation experiments to verify the rationality of each branch of the network and of the training strategy. Finally, we conduct evaluation experiments to demonstrate the effectiveness of our method.

5.1. Implementation Details and Evaluation Metrics

Implementation details. All experiments were performed using the PyTorch toolbox (https://pytorch.org/, accessed on 1 July 2022) on a Linux server with two GeForce RTX 3090 GPUs. Runtime evaluation was performed on a single GeForce RTX 3090 GPU, and we report the average over 100 frames for the frames-per-second (FPS) measurement. During training, we first resized images to 960 × 540 and randomly cropped them to 896 × 448 for data augmentation. We used stochastic gradient descent (SGD) with batch size 16, momentum 0.9, weight decay $5 \times 10^{-4}$, and an initial learning rate of $1 \times 10^{-3}$, training for 50 epochs in total. The frames and images were resized to 960 × 540 resolution in all experiments.
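For reference, the optimizer settings above translate directly into PyTorch as follows (the model is a stand-in; only the hyperparameters come from the paper):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, kernel_size=1)  # stand-in for DB-Net
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-3,           # initial learning rate
    momentum=0.9,
    weight_decay=5e-4,
)
# Training then runs for 50 epochs with batch size 16; each image is
# resized to 960 x 540 and randomly cropped to 896 x 448 (the block
# labels must be cropped consistently with the image).
```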
Evaluation metrics. The evaluation experiments were conducted on the LaSSoV-video and PoVSSeg test sets, which are a video dataset and an image dataset, respectively. On the image dataset, performance was measured using the mean Intersection-over-Union (mIoU):
$$mIoU = \frac{1}{N} \sum_{i=1}^{N} \frac{|P_i \cap G_i|}{|P_i \cup G_i|},$$
where $N$ is the total number of samples, and $P_i$ and $G_i$ represent the prediction and ground truth of the $i$-th sample, respectively. A greater mIoU value indicates higher block recognition accuracy. On the LaSSoV-video dataset, recall and precision are used to evaluate the methods at the frame level, formulated as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN},$$
where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively. Based on recall and precision, the F1 score is also an important evaluation metric for method comparison, formulated as follows:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
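These metrics are straightforward to compute; a small sketch of both the image-level mIoU and the frame-level scores follows (function names are ours, for illustration):

```python
import numpy as np

def mean_iou(preds, gts):
    """mIoU over N binary block maps, per the formula above."""
    ious = []
    for p, g in zip(preds, gts):
        union = np.logical_or(p, g).sum()
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))

def frame_scores(tp: int, fp: int, fn: int):
    """Frame-level precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```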

5.2. Ablation Study

In this subsection, we conduct a series of experiments to demonstrate the rationality of the aggregation branch in DB-Net and verify the effectiveness of the coarse-to-fine training strategy. All the ablation studies were evaluated on the PoVSSeg-block test set.
Aggregation branch. To investigate whether the aggregation branch improves the effectiveness of DB-Net, we discarded it and built a network with only the main branch. As shown in Table 2, when the models were trained in the same way, the model with the aggregation branch consistently outperformed the main-branch-only model. The gain in smoke mIoU was even larger when both models were trained with the Coarse or Fine-to-Coarse strategy, demonstrating that the aggregation branch helps the model capture smoke features better, especially under coarse annotations. Furthermore, as shown in the last column of Table 2, the aggregation branch adds only 1670 parameters. These findings indicate that the aggregation branch significantly improves the model's performance with negligible extra computational overhead.
Training strategy. We compared four training strategies, namely Fine, Coarse, Fine-to-Coarse, and Coarse-to-Fine, to explore their impact on the model's performance. As Table 2 shows, first, both models improved when trained with the Coarse-to-Fine strategy rather than the Fine strategy. Second, the models obtained the worst results when trained only on the coarsely annotated dataset. Third, the Fine-to-Coarse strategy also resulted in poor performance. Coarse-to-Fine and Fine-to-Coarse both utilize the fine and coarse datasets; the only difference is the training order. The Fine-to-Coarse strategy first trains the model on the finely annotated dataset and then tunes it on the coarsely annotated one, so the model's outputs eventually converge toward the ambiguous details in the coarse annotations. This suggests that even when a model is trained on the same data, the order still makes a massive difference in performance.
Figure 6 illustrates heat maps of the prediction results from DB-Net under different training strategies. As can be observed in Figure 6, when the model was trained solely on the finely annotated dataset, it responded strongly only to heavy smoke regions. However, when the model was trained or fine-tuned using only the coarsely annotated dataset, it even recognized some background areas as smoke. The results are impressive when the model is trained with the coarse-to-fine strategy: it is activated not only by heavy smoke regions but also by light smoke regions, achieving the best prediction performance. All this proves that coarse-to-fine training is the optimal strategy among those tested.

5.3. Performance Comparisons

In this subsection, we compare our method with several popular deep learning semantic segmentation methods (FCN-32s [20], PSP-Net [28], DeepLab-v3 [29], and PAN [30]) and a traditional image-processing-based method (R-VLBC [14]). We first report block-level prediction accuracy on the PoVSSeg test set and analyze inference speed; we then present results on the LaSSoV-video test set to contrast the methods' frame-level prediction accuracy. Note that all the deep learning segmentation methods use a 50-layer ResNet as the backbone in these experiments.
Block prediction accuracy and parameter comparisons. The deep learning segmentation methods output pixel-level predictions; to evaluate all methods under the same criterion, their pixel-level results were converted to block-level results. Here we report results on the PoVSSeg test set. Figure 7 shows the visualization results of each method, with the ground truth in the first column. Table 3 presents the quantitative comparison. As shown in Table 3, our DB-Net achieves 71.14% global mIoU with only 8.54 million parameters, an impressive result: compared with the other methods, we use half or even fewer of their parameters while maintaining comparable performance. We also list FPS in Table 3 for reference: DB-Net processes a 960 × 540 image at 59.12 FPS. DeepLab-v3 is the most accurate method, with a global mIoU 3.3% higher than ours, but it runs at only 13.45 FPS, far slower than our method. This comparison demonstrates that our method offers fast inference while maintaining comparable accuracy.
Frame-level prediction accuracy comparisons. We applied several popular frame-level metrics for evaluation, setting the detection threshold to 0.2 for all methods for a fair comparison. The evaluation results for each scene of the LaSSoV-video test set are presented in Table 4, and the average recall, precision, and F1 over the four scenes are shown in Figure 8. As Table 4 shows, our DB-Net achieved the best performance in Scenes 1 and 2 and remarkable performance in Scenes 3 and 4, indicating better robustness to different environments than the other methods. Moreover, as Figure 8 shows, DeepLab-v3 and FCN-32s achieved the best recall and precision, respectively, yet both underperformed in F1; our method achieved the best F1, indicating the best trade-off between recall and precision. Apart from our method, R-VLBC, a hand-crafted method designed explicitly for smoke detection, outperformed the deep learning segmentation methods, owing to the inherent difference between the smoke detection task and the generic semantic segmentation task.

6. Conclusions and Future Work

In this paper, we designed DB-Net, a conceptually new yet simple model for block-wise vehicle smoke detection, and employed a coarse-to-fine training strategy to train it. Moreover, since no suitable dataset was available, we built PoVSSeg, a real-world vehicle smoke dataset with polygonal annotations, which we further converted into a block-wise annotated dataset to fit the structure of DB-Net. A series of experiments demonstrated the effectiveness of our method, which achieved 71.14% global mIoU on the PoVSSeg-block test set with only 8.54 million parameters and runs at 59.12 FPS on 960 × 540 high-resolution images. In conclusion, our method achieves the best trade-off between computational efficiency and detection accuracy among the compared methods. Future research could explore multi-feature fusion for vehicle smoke detection, such as combining spatial and temporal features, to further improve detection accuracy.

Author Contributions

Conceptualization, J.C. and X.P.; methodology, J.C. and X.P.; software, J.C.; data curation, J.C. and X.P.; writing—original draft preparation, J.C.; writing—review and editing, J.C. and X.P.; All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the National Natural Science Foundation of China (62176165), the Stable Support Projects for Shenzhen Higher Education Institutions (SZWD2021011), and the Natural Science Foundation of Top Talent of SZTU (grant no. GDRC202131).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available on request due to restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Gakenheimer, R. Urban mobility in the developing world. Transp. Res. Part A Policy Pract. 1999, 33, 671–689.
2. Zhang, K.; Batterman, S. Air pollution and health risks due to vehicle traffic. Sci. Total Environ. 2013, 450, 307–316.
3. Chaturvedi, S.; Khanna, P.; Ojha, A. A survey on vision-based outdoor smoke detection techniques for environmental safety. ISPRS J. Photogramm. Remote Sens. 2022, 185, 158–187.
4. Lin, G.; Zhang, Y.; Zhang, Q.; Jia, Y.; Wang, J. Smoke detection in video sequences based on dynamic texture using volume local binary patterns. KSII Trans. Internet Inf. Syst. 2017, 11, 5522–5536.
5. Yuan, F.; Xia, X.; Shi, J.; Zhang, L.; Huang, J. Learning multi-scale and multi-order features from 3D local differences for visual smoke recognition. Inf. Sci. 2018, 468, 193–212.
6. Yuan, F. A double mapping framework for extraction of shape-invariant features based on multi-scale partitions with AdaBoost for video smoke detection. Pattern Recognit. 2012, 45, 4326–4336.
7. Yuan, F.; Shi, J.; Xia, X.; Huang, Q.; Li, X. Co-occurrence matching of local binary patterns for improving visual adaption and its application to smoke recognition. IET Comput. Vis. 2019, 13, 178–187.
8. Dimitropoulos, K.; Barmpoutis, P.; Grammalidis, N. Higher order linear dynamical systems for smoke detection in video surveillance applications. IEEE Trans. Circuits Syst. Video Technol. 2016, 27, 1143–1154.
9. Tao, H. Detecting smoky vehicles from traffic surveillance videos based on dynamic features. Appl. Intell. 2020, 50, 1057–1072.
10. Zhao, L.; Luo, Y.M.; Luo, X.Y. Based on dynamic background update and dark channel prior of fire smoke detection algorithm. Appl. Res. Comput. 2017, 34, 957–960.
11. Appana, D.K.; Islam, R.; Khan, S.A.; Kim, J.M. A video-based smoke detection using smoke flow pattern and spatial-temporal energy analyses for alarm systems. Inf. Sci. 2017, 418, 91–101.
12. Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232.
13. Lateef, F.; Ruichek, Y. Survey on semantic segmentation using deep learning techniques. Neurocomputing 2019, 338, 321–348.
14. Tao, H.; Lu, X. Smoke vehicle detection based on robust codebook model and robust volume local binary count patterns. Image Vis. Comput. 2019, 86, 17–27.
15. Tao, H.; Lu, X. Smoke vehicle detection based on multi-feature fusion and hidden Markov model. J. Real-Time Image Process. 2020, 17, 745–758.
16. Wang, X.; Kang, Y.; Cao, Y. SDV-Net: A two-stage convolutional neural network for smoky diesel vehicle detection. In Proceedings of the 2019 Chinese Control Conference (CCC), Guangzhou, China, 27–30 July 2019; pp. 8611–8616.
17. Zhou, J.; Qian, S.; Yan, Z.; Zhao, J.; Wen, H. ESA-Net: A network with efficient spatial attention for smoky vehicle detection. In Proceedings of the 2021 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Glasgow, UK, 17–20 May 2021; pp. 1–6.
18. Wang, C.; Wang, H.; Yu, F.; Xia, W. A high-precision fast smoky vehicle detection method based on improved YOLOv5 network. In Proceedings of the 2021 IEEE International Conference on Artificial Intelligence and Industrial Design (AIID), Guangzhou, China, 28–30 May 2021; pp. 255–259.
19. Yuan, F.; Zhang, L.; Xia, X.; Huang, Q.; Li, X. A gated recurrent network with dual classification assistance for smoke semantic segmentation. IEEE Trans. Image Process. 2021, 30, 4409–4422.
20. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
21. Yuan, F.; Zhang, L.; Xia, X.; Wan, B.; Huang, Q.; Li, X. Deep smoke segmentation. Neurocomputing 2019, 357, 248–260.
22. Sheng, C.; Hu, B.; Meng, F.; Yin, D. Lightweight dual-branch network for vehicle exhausts segmentation. Multimed. Tools Appl. 2021, 80, 17785–17806.
23. Yuan, F.; Shi, J.; Xia, X.; Fang, Y.; Fang, Z.; Mei, T. High-order local ternary patterns with locality preserving projection for smoke detection and image classification. Inf. Sci. 2016, 372, 225–240.
24. Zhang, Q.X.; Lin, G.H.; Zhang, Y.M.; Xu, G.; Wang, J.J. Wildland forest fire smoke detection based on Faster R-CNN using synthetic smoke images. Procedia Eng. 2018, 211, 441–446.
25. Peng, X.; Fan, X.; Wu, Q.; Zhao, J.; Gao, P. Video-based smoky vehicle detection with a coarse-to-fine framework. arXiv 2022, arXiv:2207.03708.
26. Holder, C.J.; Shafique, M. On efficient real-time semantic segmentation: A survey. arXiv 2022, arXiv:2206.08605.
27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
28. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
29. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
30. Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid attention network for semantic segmentation. arXiv 2018, arXiv:1805.10180.
Figure 1. A typical example of vehicle smoke.
Figure 2. Several samples from PoVSSeg, which contains vehicle smoke images from various traffic scenes annotated with polygons.
Figure 3. Some samples from existing datasets.
Figure 4. Overall structure of DB-Net. The main branch is responsible for extracting features, while the aggregation branch is designed to fuse these features. The "⨁" represents pixel-wise summation.
Figure 5. Block-wise annotation converted from PoVSSeg and LaSSoV. Left: fine-annotated sample. Right: coarse-annotated sample.
Figure 6. Heat maps of the prediction results from DB-Net trained with different training strategies.
Figure 7. Block prediction results for vehicle smoke.
Figure 8. Average recall, F1, and precision over the four scenes.
Table 1. The comparison between existing datasets and ours.

| Database | Type | Scenario | Total Samples | Annotation | Available |
|---|---|---|---|---|---|
| VSD [23] | image | wild | 24,217 | category | public |
| RF dataset [24] | image | wild | 12,620 | bounding boxes | public |
| Synthetic-Smoke [21] | image | - | 3000 | pixel-level | private |
| LaSSoV-video [25] | video | road vehicle | 163 | temporal segments | public |
| LaSSoV [25] | image | road vehicle | 75,000 | bounding boxes | public |
| PoVSSeg (Ours) | image | road vehicle | 3962 | polygon | public |
Table 2. Ablation study results on the PoVSSeg test set.

| Aggregation Branch | Training Strategy | Background mIoU (%) | Smoke mIoU (%) | Global mIoU (%) | Parameters |
|---|---|---|---|---|---|
| without | Fine | 99.21 | 36.25 | 67.73 | 8,545,346 |
| without | Coarse | 98.84 | 17.58 | 58.21 | 8,545,346 |
| without | Fine-to-Coarse | 98.84 | 17.93 | 58.38 | 8,545,346 |
| without | Coarse-to-Fine | 99.16 | 40.31 | 69.73 | 8,545,346 |
| with | Fine | 99.18 | 36.99 | 68.08 | 8,547,016 |
| with | Coarse | 98.87 | 24.28 | 61.57 | 8,547,016 |
| with | Fine-to-Coarse | 98.91 | 24.95 | 61.93 | 8,547,016 |
| with | Coarse-to-Fine | 99.17 | 43.12 | 71.14 | 8,547,016 |
Table 3. Comparison between our method and well-known semantic segmentation methods in terms of mIoU on the PoVSSeg test set.

| Method | Background mIoU (%) | Smoke mIoU (%) | Global mIoU (%) | Parameters (M) | FPS |
|---|---|---|---|---|---|
| PSP-Net | 99.10 | 25.19 | 62.14 | 24.31 | 23.55 |
| FCN-32s | 99.20 | 34.26 | 66.73 | 23.51 | 24.35 |
| PAN | 99.24 | 42.63 | 70.94 | 24.26 | 23.89 |
| DeepLab-v3 | 99.34 | 49.54 | 74.44 | 39.36 | 13.45 |
| DB-Net (Ours) | 99.17 | 43.12 | 71.14 | 8.54 | 59.12 |
Table 4. Comparison between our method and state-of-the-art methods on varied metrics on the LaSSoV-video test set (R = recall, P = precision).

| Method | Scene 1 (R / P / F1) | Scene 2 (R / P / F1) | Scene 3 (R / P / F1) | Scene 4 (R / P / F1) |
|---|---|---|---|---|
| R-VLBC | 0.8155 / 0.4045 / 0.5407 | 0.8021 / 0.3942 / 0.5285 | 0.9259 / 0.1663 / 0.2819 | 0.8381 / 0.3038 / 0.4459 |
| PSP-Net | 0.4234 / 0.4594 / 0.4407 | 0.2212 / 0.4949 / 0.3058 | 0.7451 / 0.1436 / 0.2408 | 0.7743 / 0.4133 / 0.5389 |
| DeepLab-v3 | 0.8235 / 0.2384 / 0.3698 | 0.8260 / 0.2518 / 0.3859 | 0.9976 / 0.0631 / 0.1186 | 0.9805 / 0.1379 / 0.2418 |
| FCN-32s | 0.4791 / 0.4961 / 0.4874 | 0.3491 / 0.6250 / 0.4479 | 0.8413 / 0.1525 / 0.2582 | 0.7061 / 0.5129 / 0.5942 |
| PAN | 0.6638 / 0.4617 / 0.5446 | 0.7016 / 0.3999 / 0.5094 | 0.9013 / 0.1254 / 0.2202 | 0.8685 / 0.2649 / 0.4060 |
| DB-Net (Ours) | 0.6161 / 0.6283 / 0.6221 | 0.6341 / 0.5864 / 0.6093 | 0.9242 / 0.1155 / 0.2053 | 0.8246 / 0.3895 / 0.5291 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
