Article

An Improved Forest Smoke Detection Model Based on YOLOv8

1 School of Electronics and Information Engineering, Changchun University of Science and Technology, Changchun 130022, China
2 Institute of Science and Technical Information of Jilin, Changchun 130022, China
* Author to whom correspondence should be addressed.
Forests 2024, 15(3), 409; https://doi.org/10.3390/f15030409
Submission received: 31 December 2023 / Revised: 31 January 2024 / Accepted: 19 February 2024 / Published: 21 February 2024
(This article belongs to the Section Natural Hazards and Risk Management)

Abstract

This study leverages smoke detection for early forest fire warning. Owing to the inherent ambiguity and uncertainty of smoke characteristics, existing smoke detection algorithms suffer from reduced detection accuracy, elevated false alarm rates, and missed detections. To resolve these issues, this paper employs the efficient YOLOv8 network and integrates three novel detection modules for enhancement. These modules comprise the edge feature enhancement module, designed to identify ambiguous smoke features, alongside the multi-feature extraction module and the global feature enhancement module, targeting the detection of uncertain smoke features. These modifications improve the accuracy of smoke area identification while notably lowering the false alarm rate and the occurrence of missed detections. Meanwhile, a large forest smoke dataset is created in this paper, which includes not only smoke images with normal forest backgrounds but also a considerable quantity of smoke images with complex backgrounds to enhance the algorithm's robustness. The proposed algorithm achieves an AP of 79.1%, 79.2%, and 93.8% on the self-made dataset, XJTU-RS, and USTC-RF, respectively. These results surpass those obtained by current state-of-the-art target detection-based and neural network-based improved smoke detection algorithms.

1. Introduction

Two primary means of early forest fire warning are fire detection and smoke detection. Compared with fire detection, smoke detection can better ensure the safety of firefighters and minimize the loss of forest and firefighting resources. Smoke detection technology mainly includes smoke identification, segmentation, detection, and concentration estimation. Smoke recognition cannot pinpoint the exact location of a smoke occurrence, and both smoke segmentation and smoke concentration estimation fall short of meeting the real-time demands of smoke detection. Therefore, this article uses smoke detection technology, which can quickly identify the exact whereabouts of smoke instances, has a wide working range, and meets the requirements for timely rescue and loss reduction in early forest fire warning. Existing smoke detection methods predominantly encompass manual detection, sensor-based detection, and machine vision detection; manual detection is costly and the sensor detection range is narrow, so this study employs the machine vision approach. Machine vision techniques for smoke detection can be classified into conventional manual feature selection and deep learning methodologies.
Conventional manual feature selection methods usually detect smoke by manual selection or threshold setting based on one or more smoke features such as frequency domain, color, shape, texture, and time. Toreyin et al. [1,2,3] presented a smoke and flame detection method utilizing frequency domain features. Besbes et al. [4] introduced a smoke detection approach utilizing color features. Gomes et al. [5] presented a smoke detection approach integrating various features including color, frequency domain, and time. Wang [6] suggested a smoke detection technique integrating diverse features like time, color, texture, and shape. Hossain et al. [7] introduced a forest fire and smoke detection method utilizing an ANN and LBP. Wang et al. [8] presented a fire and smoke detection algorithm relying on LBP, LBPV, and a Support Vector Machine (SVM): LBP and LBPV extract smoke texture features, while the SVM classifies them. Although the above techniques offer good real-time efficiency and detection precision, they depend strongly on manually selected features, generalize poorly, and are difficult to apply in real-world environments.
Recent research has transitioned from traditional manual feature selection to deep learning smoke detection algorithms. These methods fall into two primary categories: target detection-based and neural network-based improved smoke detection algorithms. Target detection-based improved smoke detection algorithms encompass classical two-stage algorithms like R-CNN [9], Fast R-CNN [10], and Faster R-CNN [11], along with single-stage algorithms like SSD [12] and the YOLO series [13,14,15,16,17,18,19,20,21,22]. While two-stage methods achieve higher detection accuracy, they often suffer from slower real-time performance. Conversely, single-stage algorithms provide faster detection speeds but may compromise detection accuracy. In addition, He et al. [23] introduced RetinaNet-50 and RetinaNet-101, which enhance accuracy in deeper layers but fail to achieve real-time processing. Tan et al. [24] proposed EfficientDet, which improves accuracy and efficiency but lags in detection speed compared to the YOLO series, particularly YOLOv8, renowned for its real-time performance in video surveillance. Guo et al. [25] presented an enhanced smoke detection algorithm built upon YOLOv8; its average smoke detection accuracy reaches 90.1%, a 3% improvement over the original model, but the enhanced model encounters challenges in detecting smoke under light or haze interference. Yang et al. [26] proposed an improved tomato detection algorithm based on YOLOv8, achieving an average accuracy of 93.4%, a 2% enhancement over the original model; nevertheless, this enhanced algorithm struggles to detect tomatoes in complex backgrounds. Various approaches have been explored within neural network-based improved smoke detection algorithms. Gu et al. [27] introduced a two-channel depth sub-network that struggled with texture-lacking smoke. Khan et al. [28] proposed a lightweight neural network showing better performance only on synthetic datasets. Yuan et al. [29] developed a waveform neural network whose improved performance was likewise limited to synthetic datasets. Cao et al. [30] presented a spatiotemporal crossover network for industrial smoke detection that performs well on an industrial dataset but faces high false alarm rates and difficulty detecting obscured smoke targets. Hu et al. [31] introduced a multi-directional detection network for forest fire smoke, exhibiting improved performance on a self-made dataset but limited capability in complex environments. Wang et al. [32] introduced a self-cooperative network for smoke detection, showing better performance on a self-made dataset but struggling to differentiate between smoke plumes and similar objects, resulting in elevated false alarms. Yin et al. [33] introduced a smoke detection method based on recurrent neural networks, achieving superior performance on a proprietary dataset; however, the algorithm faces challenges in detecting smoke in complex environments. Zhang et al. [34] proposed a rapid smoke detection method for forest fires using the MMFNet network; while it exhibited improved performance on an in-house dataset, it encountered difficulties detecting smoke in complex environments. Tran et al. [35] presented a forest fire detection and damage area estimation method fusing DetNAS and BNN networks; although accurate in detecting various types of smoke on a larger dataset, the algorithm struggled with faint smoke and complex environments. Wang et al. [36] introduced a two-phase forest wildfire detection experiment based on IP, ML, and DL; while performing better on a proprietary dataset, the algorithm faced challenges detecting smoke in complex environments. Armando et al. [37] proposed an early smoke detection method based on ResNet and EfficientNet networks; despite improved performance on a proprietary dataset, it exhibited a high false alarm rate. Almeida et al. [38] proposed a wildfire smoke detection method based on a lightweight convolutional neural network; despite demonstrating superior performance on CCTV and UAV datasets, its partially synthetic dataset prevents direct evaluation for natural forest surveillance. Li et al. [39] introduced a wildfire smoke detection algorithm based on the 3D-PFCN network; despite identifying early-stage smoke, it exhibited a high false alarm rate and limitations in detecting slow-moving smoke in complex environments. Cao et al. [40] proposed a smoke source prediction and detection method based on the EFFNet network; despite incorporating a temporal module for enhanced spatiotemporal feature representation, the algorithm faced challenges detecting smoke in complex environments.
In machine vision-based smoke detection, some methods leverage 2D convolution to extract spatial features, while others utilize 3D convolution to capture spatiotemporal features. Furthermore, a hybrid network integrating a CNN with a temporal modeling module can be employed to extract spatiotemporal features for smoke detection. Lin et al. [41] proposed a smoke detection algorithm based on a Fast R-CNN and 3D CNN network, achieving commendable network performance; nevertheless, the algorithm carries high hardware requirements and computational complexity and falls short of delivering satisfactory real-time results. Donahue et al. [42] introduced a visual recognition and description method employing a CNN + LSTM network, which utilizes a CNN to extract spatial features and an LSTM to integrate temporal features of moving targets, making it applicable to smoke video recognition. Wang et al. [43] presented a novel framework for video action recognition based on the TSN network, providing an efficient and effective solution with potential application in smoke video recognition. Zhou et al. [44] proposed a framework based on the TRN-Multiscale network for capturing temporal relationships over multiple time scales, which holds promise for smoke video recognition. Lin et al. [45] introduced video understanding based on the TSM network, another candidate for smoke video recognition. However, forest surveillance cameras are typically positioned on watchtowers, so forest fires occur at a distance and smoke movement appears slow. Consequently, CNN and hybrid networks struggle to extract evident movement information from smoke. This limitation gives rise to overfitting in existing smoke detection algorithms based on the spatial and temporal features of smoke. Therefore, this paper adopts a smoke detection algorithm centered on the spatial features of smoke.
To address the aforementioned challenges, this study proposes MGE-YOLOv8, an improved YOLOv8. The improvements involve integrating new modules: multi-feature extraction, global feature enhancement, and edge feature enhancement. The multi-feature extraction module and edge feature enhancement module strengthen the network's feature extraction abilities, boosting smoke detection accuracy and reducing false alarm rates. Additionally, the global feature enhancement module extracts structural smoke information globally, aiding the network in accurately identifying smoke areas and minimizing missed alarms. For effective model performance, neural networks require adequate real forest scene smoke images as datasets. However, available datasets like USTC-RF [46] and XJTU-RS [32] are limited. The USTC-RF dataset lacks semantic meaning and is unsuitable for real forest fire safety monitoring, while the XJTU-RS dataset, with only 6845 smoke images, lacks the generalizability and complexity needed for smoke detection against intricate forest backgrounds. To address this, this paper combines existing real forest smoke datasets with actual forest scene images provided by forest security companies to create a comprehensive dataset called RSF. This dataset comprises 15,373 forest fire smoke images, including both normal and complex scenes, serving as a practical resource for forest security monitoring.
In summary, the network architecture proposed in this paper has the following four contributions:
(1)
To detect the uncertain features of smoke, this study introduces a multi-feature extraction module that combines three operations: standard convolution, deformable convolution, and involution. The module achieves adaptive spatial aggregation and adaptive weight assignment, enabling smoke features to be extracted locally, globally, and deformably and facilitating differentiation between smoke plumes and smoke-like objects, thereby enhancing smoke detection accuracy and minimizing false alarms.
(2)
To accurately identify the smoke region, this paper proposes a global feature enhancement module, which extracts the structural information of smoke from a global perspective, making the extraction of the smoke region more comprehensive and thus reducing missed smoke detections.
(3)
To detect the ambiguity of smoke, this paper proposes an edge feature enhancement module, which reduces smoke edge noise and strengthens smoke feature extraction, thereby improving smoke detection accuracy and decreasing false alarms.
(4)
It is well known that a dataset's quality significantly impacts the efficacy of deep learning algorithms, so this paper produces a large forest fire smoke dataset that contains not only smoke images of normal scenes but also smoke images with complex backgrounds; compared with existing forest fire smoke datasets, the proposed dataset has greater practical value.

2. Materials and Methods

2.1. Datasets

This study employs three distinct datasets to showcase the superior performance of the introduced algorithm. As shown in Figure 1a, USTC-RF, developed by Zhang et al. [46], focuses on synthetic smoke in expansive forest settings. This dataset comprises 12,620 forest smoke images synthesized by extracting smoke plume features from 2800 genuine smoke images and embedding them randomly into forest background images. As shown in Figure 1b, XJTU-RS, introduced by Wang et al. [32], caters to real smoke scenarios within broader real-world applications. It was curated from two benchmark datasets, CVPR [47] and USTC [46]. Recognizing the limitations of existing datasets in simulating forest fire warnings due to their inadequacy in capturing smoke in forest environments, a novel dataset, RSF, is proposed in this study, as shown in Figure 1c. The RSF dataset amalgamates 13,675 forest scenario images, incorporating ambiguous and uncertain smoke characteristics from current public smoke datasets and an additional 1698 images sourced from a security company. The detailed training configurations for different datasets are shown in Table 1.
The smoke images of the above three datasets are shown in Figure 1.

2.2. The Proposed Network Architecture

The vague and uncertain nature of early smoke characteristics in forest fires poses challenges for existing smoke detection algorithms in accurately identifying and pinpointing smoke locations. To address this issue, this paper introduces a target detection network, MGE-YOLOv8, illustrated in Figure 2.
Since YOLOv8 demonstrates superior performance in target detection, this paper extends its application to smoke detection. In addition, this paper adds a new MEM (multi-feature extraction module), GEM (global feature enhancement module), and EEM (edge feature enhancement module) to the YOLOv8 network, motivated by the ambiguity and uncertainty characteristics of smoke. To detect the uncertainty features of smoke, the MEM splices the local, non-rigid, and global features of smoke through the residual concatenation of standard convolution, deformable convolution, and involution, which enhances the multi-feature representation of smoke. To augment the global dependence of high-level features, this paper proposes the GEM, which strengthens the structural information of high-level features through serially connected spatial and channel self-attention mechanisms, enhancing their global character. Finally, to detect the ambiguity features of smoke, this study introduces the EEM, which enhances smoke boundary information by suppressing smoke edge noise through a median filter, while a distinct convolutional feature extraction process strengthens the smoke feature representation capability. These modules enable the proposed smoke detection algorithm to accurately identify varied and ambiguous smoke features, improving the precision of smoke detection in the early stage of forest fires while reducing both false alarms and missed alarms.

2.3. Multi-Feature Extraction Module

Smoke uncertainty characteristics mainly come from two aspects. First, smoke is a non-rigid object without a fixed geometric shape, so its morphology is uncertain, which diminishes the precision of existing smoke detection algorithms. Second, many objects in the background of a smoke image resemble smoke, such as clouds, light, fog, and haze, so the classification of smoke is uncertain, which leaves existing smoke detection algorithms with a high rate of false positives. The primary cause of the low accuracy and elevated false alarm rate in existing smoke detection algorithms lies in the predominant use of various forms of standard convolution within neural networks for feature extraction. Standard convolution's fixed geometric mechanism fails to capture the deformable geometric attributes inherent in non-rigid objects like smoke. Moreover, the weight-sharing property of standard convolution struggles to discern effectively between smoke plumes and smoke-like objects. Consequently, existing neural networks extract a single kind of feature and cannot adequately represent multiple features. To address these issues, this study proposes an MEM in which standard convolution, deformable convolution, and involution are spliced in parallel with a residual connection structure. The detailed network architecture is shown in Figure 2b, while an extensive architectural analysis is presented in Section 3.2.
Standard convolution (Conv), as the base operator of a neural network, has spatial invariance and channel specificity, i.e., the parameters of the convolution kernel are shared spatially but not across channels; the specific network structure is shown in Figure 2b. Standard convolution effectively captures local image features but lacks adaptability to varying visual patterns across spatial locations, thereby restricting the convolutional kernel's receptive field. Assuming a given input feature map X, the standard convolution output feature map $Y_s$ can be computed using the following equation:

$$y_s(p) = \sum_{k=1}^{K} w_s(p_k) \cdot x(p + p_k)$$

where K is the number of convolutional kernel parameters, $p_k$ is the position corresponding to each kernel parameter, $w_s$ denotes the learnable parameters of the standard convolution, and p is the position of each pixel in the feature map.
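As a quick illustration (not the authors' code), the equation above corresponds directly to a framework convolution call; the PyTorch sketch below uses illustrative shapes.

```python
import torch
import torch.nn.functional as F

# Standard convolution as in the equation above: a spatially shared kernel
# w_s is slid over the input, summing w_s(p_k) * x(p + p_k) over the K taps.
x = torch.randn(1, 3, 64, 64)        # input feature map X (illustrative shape)
w_s = torch.randn(16, 3, 3, 3)       # learnable kernel, shared across all positions
y_s = F.conv2d(x, w_s, padding=1)    # output feature map Y_s: 1 x 16 x 64 x 64
```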
Unlike standard convolution, involution (IVC) [48] possesses spatial specificity and channel invariance. The network structure is depicted in Figure 2b. Involution can aggregate context over a broader space and adaptively assign weights to different pixel positions on the feature map, thus highlighting the most informative visual elements in the spatial domain and distinguishing smoke plumes from smoke-like objects well. The feature map $Y_i$ output by the involution can be computed using the following equation:

$$y_i(p) = \sum_{k=1}^{K} w_{i1} \cdot w_{i2}(p) \cdot x(p + p_k)$$

where $w_{i1} \in \mathbb{R}^{\frac{C}{r} \times C}$ and $w_{i2} \in \mathbb{R}^{(K \times K \times G) \times \frac{C}{r}}$ are both linear transformation matrices, C is the number of input channels, G (with $G \ll C$) is the number of convolution kernels shared by all the channels, and r is the channel reduction ratio. The two linear transformations adaptively compute the weight parameters at different locations of the convolution kernel according to the input.
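To make the kernel-generation step concrete, here is a minimal PyTorch sketch of involution under our own assumptions (hyperparameter names, group handling, and defaults are illustrative, not the paper's implementation): the two 1 × 1 convolutions play the roles of $w_{i1}$ and $w_{i2}$, producing a distinct K × K kernel at every spatial position.

```python
import torch
import torch.nn as nn

class Involution(nn.Module):
    """Minimal involution sketch: per-pixel kernels are generated from the
    input by two linear transforms, then shared across G channel groups."""
    def __init__(self, channels, kernel_size=3, groups=1, reduction=4):
        super().__init__()
        self.k, self.g = kernel_size, groups
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)   # w_i1
        self.span = nn.Conv2d(channels // reduction,
                              kernel_size ** 2 * groups, 1)           # w_i2
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # adaptively compute kernel weights at every spatial position
        weight = self.span(self.reduce(x))                  # B, K*K*G, H, W
        weight = weight.view(b, self.g, 1, self.k ** 2, h, w)
        # gather K x K neighbourhoods and aggregate with the generated kernels
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k ** 2, h, w)
        return (weight * patches).sum(dim=3).reshape(b, c, h, w)
```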
Deformable convolution (DCN) [49] can adaptively adjust the sampling offsets and modulation scalars according to the input data to achieve adaptive spatial aggregation; the detailed network structure is illustrated in Figure 2b. DCN captures the non-rigid features of an image well, but the deformable convolution region is sometimes larger than the region where the target is located, making it prone to erroneous detection results. The feature map $Y_d$ output by the deformable convolution can be computed using the following equation:

$$y_d(p) = \sum_{k=1}^{K} w_d \cdot x(p + p_k + \Delta p_k) \cdot \Delta m_k$$

where $\Delta p_k$ and $\Delta m_k$ are obtained by a convolution applied to the original input feature map. This convolution layer preserves the spatial resolution of the input and outputs 3K channels: the first 2K channels are the x-axis and y-axis offsets of each convolution kernel parameter, the last K channels, passed through a sigmoid layer, are the modulation scalars for each kernel parameter, and $w_d$ denotes the learnable parameters of the deformable convolution. Since $\Delta p_k$ is usually fractional, the pixel value of $x(p + p_k + \Delta p_k)$ is computed by bilinear interpolation:

$$x(p) = \sum_{q} G(p, q) \cdot x(q)$$

where q runs over the spatial positions in the input feature map involved in the computation, p is the fractional position, and $G(\cdot, \cdot)$ is the bilinear interpolation kernel.
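A hedged sketch of this modulated deformable convolution is shown below using torchvision's `deform_conv2d`, which resolves the fractional offsets by bilinear interpolation internally; the block layout and initialization are our illustrative assumptions rather than the paper's exact DCN module.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableBlock(nn.Module):
    """DCNv2-style sketch: a plain conv predicts 3K channels per position,
    split into 2K sampling offsets and K sigmoid-gated modulation scalars."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, k, k))
        nn.init.kaiming_uniform_(self.weight)
        self.offset_mask = nn.Conv2d(in_ch, 3 * k * k, k, padding=k // 2)

    def forward(self, x):
        om = self.offset_mask(x)
        offset = om[:, :2 * self.k ** 2]                # Delta p_k (x/y per tap)
        mask = torch.sigmoid(om[:, 2 * self.k ** 2:])   # Delta m_k in (0, 1)
        return deform_conv2d(x, offset, self.weight,
                             padding=self.k // 2, mask=mask)
```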
Using the above equations, $\mathrm{Conv} = \mathrm{SiLU}(\mathrm{BN}(Y_s(X)))$, $\mathrm{DCN} = \mathrm{SiLU}(\mathrm{BN}(Y_s(Y_d(X))))$, and $\mathrm{IVC} = \mathrm{SiLU}(\mathrm{BN}(Y_s(Y_i(X))))$, where the standard convolution within the DCN and IVC modules is a simple 1 × 1 convolution used to align parameter dimensions. For a given input feature map X, the output feature map $f_{out}$ of the MEM can be calculated using the following equations:

$$f_1 = \mathrm{split}(\mathrm{Conv}(X))$$
$$f_2 = \mathrm{Conv}(\mathrm{Conv}(f_1)) + f_1$$
$$f_3 = \mathrm{DCN}(\mathrm{Conv}(f_2)) + f_2$$
$$f_4 = \mathrm{IVC}(\mathrm{Conv}(f_3)) + f_3$$
$$f_5 = \mathrm{Conv}(\mathrm{Conv}(f_4)) + f_4$$
$$f_6 = \mathrm{DCN}(\mathrm{Conv}(f_5)) + f_5$$
$$f_7 = \mathrm{IVC}(\mathrm{Conv}(f_6)) + f_6$$
$$f_{out} = \begin{cases} \mathrm{concat}[f_1, f_2, f_3, f_4], & n = 3 \\ \mathrm{concat}[f_1, f_2, f_3, f_4, f_5, f_6, f_7], & n = 6 \end{cases}$$

where n is the number of Bottleneck block repetitions.
The MEM proposed in this paper combines the advantages of various convolutions, which can well extract the local and global features of the image, and achieves the functions of adaptive spatial aggregation and adaptive weight assignment.
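Read as pseudocode, the MEM forward pass above can be sketched as follows; `Conv`, `DCN`, and `IVC` are placeholders for the paper's conv + BN + SiLU, deformable, and involution blocks (for instance, the sketches above), and the initial split step is simplified.

```python
import torch

def mem_forward(x, Conv, DCN, IVC, n=6):
    """Residual splice of standard conv, deformable conv, and involution,
    following the f1..f7 equations above. Conv, DCN, IVC are callables."""
    f1 = Conv(x)                  # split step simplified for clarity
    f2 = Conv(Conv(f1)) + f1      # local features via standard convolution
    f3 = DCN(Conv(f2)) + f2       # non-rigid (deformable) features
    f4 = IVC(Conv(f3)) + f3       # global, adaptively weighted features
    feats = [f1, f2, f3, f4]
    if n == 6:                    # deeper variant repeats the pattern
        f5 = Conv(Conv(f4)) + f4
        f6 = DCN(Conv(f5)) + f5
        f7 = IVC(Conv(f6)) + f6
        feats += [f5, f6, f7]
    return torch.cat(feats, dim=1)
```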

2.4. Global Feature Enhancement Module

As the neural network deepens, the high-level feature map offers rich semantic information beneficial for accurate classification but lacks the structural information essential for precise localization in the regression network. Therefore, this paper proposes a GEM to enhance the structural information of the high-level feature map; the specific network structure is depicted in Figure 2a. The GEM mines the structural information of smoke in the feature map from a global perspective for better feature learning. Specifically, for each feature location in the high-level feature map, to densely capture global structural information and local appearance information, this paper concatenates the pairwise correlations between each feature point and all feature locations with the feature's own representation, and learns attention along the spatial and channel dimensions [50]. This not only achieves semantic enhancement of cluster-like information but also augments the structural information of the target.
In this study, we first introduce the channel attention calculation. Assume the given input feature map $f_{in}$ yields $X \in \mathbb{R}^{C \times H \times W}$ after convolution, where C, H, and W are the channels, height, and width of the feature map. Each channel of X is a $W \times H$-dimensional vector; stacking all channels forms a graph $G_c$ with C nodes, each denoted $x_i \in \mathbb{R}^{WH}$, $i = 1, 2, \ldots, C$. The correlation between two nodes i and j, denoted $r_{i,j}^c$, is computed as the dot product of the affinity matrix. To maintain the structural details of the extracted features, this study adopts the bidirectional pairwise correlations of each node as its correlation vector:

$$r_i = \mathrm{concat}[r_{i,j}^c, r_{j,i}^c] = \mathrm{concat}\left[\varphi_c(x_i)^T \delta_c(x_j),\ \varphi_c(x_j)^T \delta_c(x_i)\right], \quad j = 1, 2, \ldots, C$$

where $\varphi_c$ and $\delta_c$ are both modular functions consisting of a 1 × 1 convolution, batch normalization, and an activation function, and $r_i \in \mathbb{R}^{2C}$. To make the proposed channel attention focus not only on the correlation vector of each feature point but also on the feature itself, this paper concatenates the input feature map with its correlation vectors and applies the sigmoid function to obtain the channel attention $a_c$:

$$a_c = \mathrm{sigmoid}\left(\mathrm{conv}\left(\mathrm{concat}\left[\mathrm{GAP}_s(\alpha_c(x_i)), \beta_c(r_i)\right]\right)\right)$$

where $\alpha_c$ and $\beta_c$ are both modular functions consisting of a 1 × 1 convolution, batch normalization, and an activation function, and $\mathrm{GAP}_s$ is global average pooling along the spatial dimension.

The computation of spatial attention mirrors that of channel attention and operates on the feature map $X^* \in \mathbb{R}^{C \times H \times W}$ obtained by multiplying $a_c$ and X. Each spatial position has a C-dimensional vector; stacking all positions forms a graph $G_s$ with $A = W \times H$ nodes. The correlation between two nodes i and j is denoted $r_{i,j}^s$, and the bidirectional pairwise correlation vector of each node is:

$$r_i = \mathrm{concat}[r_{i,j}^s, r_{j,i}^s] = \mathrm{concat}\left[\varphi_s(x_i^*)^T \delta_s(x_j^*),\ \varphi_s(x_j^*)^T \delta_s(x_i^*)\right], \quad j = 1, 2, \ldots, A$$

where $\varphi_s$ and $\delta_s$ are both modular functions consisting of a 1 × 1 convolution, batch normalization, and an activation function, and $r_i \in \mathbb{R}^{2A}$. Analogous to channel attention, the spatial attention is computed as:

$$a_s = \mathrm{sigmoid}\left(\mathrm{conv}\left(\mathrm{concat}\left[\mathrm{GAP}_c(\alpha_s(x_i^*)), \beta_s(r_i)\right]\right)\right)$$

where $\alpha_s$ and $\beta_s$ are both modular functions consisting of a 1 × 1 convolution, batch normalization, and an activation function, and $\mathrm{GAP}_c$ is global average pooling along the channel dimension.

Combining the above, for an input feature map $f_{in}$, the output $f_{out}$ of the GEM is calculated as follows:

$$f_{out} = a_s \cdot a_c \cdot \mathrm{conv}(f_{in}) + f_{in}$$
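The serial channel-then-spatial gating with a residual connection can be sketched in PyTorch as below. This is a deliberately simplified approximation: the bidirectional pairwise correlations are implemented for the channel gate, the spatial gate is collapsed to a channel-mean map for brevity, and the paper's separate $\varphi$, $\delta$, $\alpha$, $\beta$ blocks are reduced to single layers.

```python
import torch
import torch.nn as nn

class GEMSketch(nn.Module):
    """Simplified GEM sketch: bidirectional pairwise channel correlations
    feed a channel gate, followed serially by a spatial gate, plus a
    residual connection back to the input."""
    def __init__(self, c):
        super().__init__()
        self.phi = nn.Conv2d(c, c, 1)          # stands in for phi_c
        self.delta = nn.Conv2d(c, c, 1)        # stands in for delta_c
        self.gate = nn.Linear(2 * c + 1, 1)    # fuses r_i with pooled feature
        self.conv = nn.Conv2d(c, c, 1)

    def forward(self, f):
        x = self.conv(f)
        p = self.phi(x).flatten(2)             # B, C, HW
        d = self.delta(x).flatten(2)
        r = p @ d.transpose(1, 2)              # B, C, C channel affinities
        ri = torch.cat([r, r.transpose(1, 2)], dim=2)        # r_{i,j} and r_{j,i}
        gap = x.mean(dim=(2, 3)).unsqueeze(2)                # B, C, 1 (GAP_s)
        a_c = torch.sigmoid(self.gate(torch.cat([ri, gap], dim=2)))  # B, C, 1
        x = x * a_c.unsqueeze(3)               # apply channel attention
        a_s = torch.sigmoid(x.mean(dim=1, keepdim=True))     # crude spatial gate
        return x * a_s + f                     # f_out = a_s * a_c * conv(f) + f
```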
The introduced GEM in this study enhances the network’s attention towards the structural information of smoke targets. Consequently, the entire neural network achieves more precise identification of smoke areas, thereby reducing instances of missed detections.

2.5. Edge Feature Enhancement Module

The ambiguity features of smoke mainly come from the interference of noise and bad weather. Noise, heavy fog, bright light, and other conditions can damage the edges of a smoke image, thus affecting the classification and localization results for the smoke region. To attenuate image edge noise and sharpen the discrimination between smoke plumes and smoke-like objects, this paper proposes an EEM, which mainly consists of a median filter [51] and an enhanced convolution, as shown in Figure 2a. The median filter is highly responsive to edge information within the image, effectively eliminating noise while preserving edge integrity. The enhanced convolution consists of standard convolution, deformable convolution, and involution connected in parallel, which improves the extraction of smoke features and thus helps distinguish between smoke plumes and smoke-like objects. For the input feature map $f_{eem}$, the output feature map $y_{eem}$ computed by the EEM is calculated as follows:

$$y_{eem} = f_{eem} + \gamma_2\left(\mathrm{pool}\left(\gamma_1(f_{eem})\right)\right)$$

where $\gamma_1$ represents the combined operation of a median filter of size 3 × 3 and an enhanced convolution, and $\gamma_2$ represents the combined operation of a median filter of size 5 × 5 and an enhanced convolution.
The median filter suggested in this study can effectively remove the noise in the feature mapping without significantly degrading the clarity of the image, and the enhanced convolution enriches the edge information of the smoke image and enhances the distinction between smoke plumes and smoke-like objects.
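A sketch of the EEM residual under our own assumptions is given below: the channel-wise median filter is built from `unfold`, the "enhanced convolution" is stood in by a plain 3 × 3 convolution, and the pooling is assumed to be size-preserving so the residual shapes match.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def median_filter(x, k):
    """k x k median filter applied channel-wise to a feature map."""
    patches = F.unfold(x, k, padding=k // 2)          # B, C*k*k, H*W
    patches = patches.view(x.size(0), x.size(1), k * k, -1)  # B, C, k*k, H*W
    return patches.median(dim=2).values.view_as(x)

class EEMSketch(nn.Module):
    """Edge feature enhancement residual, following y_eem above."""
    def __init__(self, c):
        super().__init__()
        self.enh1 = nn.Conv2d(c, c, 3, padding=1)     # placeholder enhanced conv
        self.enh2 = nn.Conv2d(c, c, 3, padding=1)
        self.pool = nn.MaxPool2d(3, stride=1, padding=1)  # size-preserving (assumption)

    def forward(self, f):
        g1 = self.enh1(median_filter(f, 3))              # gamma_1: 3x3 median + conv
        g2 = self.enh2(median_filter(self.pool(g1), 5))  # gamma_2: 5x5 median + conv
        return f + g2                                    # residual output y_eem
```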

3. Results

This section initially outlines the experimental details. Subsequently, to showcase the superior performance of the algorithm introduced in this study, the following analyses are conducted: architectural analysis of the MEM, ablation experiments, comparisons with target detection-based improved smoke detection algorithms and neural network-based improved smoke detection algorithms, visualization analysis, and real application analysis.

3.1. Experimental Details

In this study, the experimental platform consisted of a personal desktop computer running the Ubuntu operating system. The hardware configuration included an AMD Ryzen 9 5900X 12-core processor and an NVIDIA GeForce RTX 3090 GPU, with the PyTorch framework. Image inputs were uniformly resized to 640 × 640. Data augmentation primarily employed Mosaic and horizontal-flip (fliplr) strategies. The activation function used was the Sigmoid-weighted Linear Unit (SiLU). The batch size was set to 8, and the model was trained using stochastic gradient descent (SGD) with a momentum of 0.937. The initial learning rate was 0.01, and the decay coefficient was 0.0005. Because many similar objects interfere with smoke images in forest scenes, such as fog, clouds, light changes, and white roofs or forest paths, the false alarm rate (FAR) is a vital evaluation index for smoke detection algorithms. Therefore, besides the typical average precision AP (IoU = 0.50:0.95), AP50 (IoU = 0.50), and average recall AR, this paper also uses the false alarm rate FAR as an evaluation index.
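For reference, these settings map roughly onto an Ultralytics-style training call as sketched below; the model and dataset YAML names are placeholders, and interpreting the 0.0005 decay coefficient as weight decay is our assumption.

```python
# Hypothetical training call mirroring the reported hyperparameters.
# "mge_yolov8.yaml" and "rsf.yaml" are placeholder config names.
from ultralytics import YOLO

model = YOLO("mge_yolov8.yaml")      # custom model definition (assumed)
model.train(
    data="rsf.yaml",                 # dataset config (placeholder)
    imgsz=640,                       # inputs resized to 640 x 640
    batch=8,
    optimizer="SGD",
    momentum=0.937,
    lr0=0.01,                        # initial learning rate
    weight_decay=0.0005,             # reported decay coefficient (assumption)
    mosaic=1.0,                      # Mosaic augmentation
    fliplr=0.5,                      # horizontal-flip augmentation
)
```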

3.2. Architectural Analysis of MEM

Currently, the mainstream feature extraction operators include standard convolution, deformable convolution, and involution; to better extract smoke features, a total of six fusion strategies are proposed for multi-feature extraction; the specific network architecture is illustrated in Figure 3, and the performance comparison of the various strategies is illustrated in Table 2.
The six fusion strategies proposed in this paper are the Conv module only; the DCN module only; the IVC module only; the combination of the Conv and DCN modules; the combination of the Conv and IVC modules; and the combination of the Conv, DCN, and IVC modules, where the Conv is a combination of standard convolution, batch normalization, and activation function, the DCN is a combination of standard convolution, deformable convolution, batch normalization, and activation function, and the IVC is a combination of standard convolution, involution, batch normalization, and activation function. The standard convolution within the DCN and IVC modules is just a simple 1 × 1 convolution used to align the dimensions of parameters.
From the data analysis in Table 2, it is evident that, compared with the baseline network, using the DCN module alone improves the AP of the network by 2.1%, and using the combination of the Conv and DCN modules improves it by 3.5%. The combined module outperforms the separate module by 1.4%, which shows that the DCN operation significantly enhances the accuracy of the network and that combining multiple modules improves it further. This is mainly because the introduction of deformable convolution makes the network more accurate in extracting the non-rigid features of smoke, and the extraction of multiple features significantly strengthens the feature extraction ability of the network, thus improving detection accuracy. Similarly, Table 2 shows that, compared with the baseline network, using the IVC module alone reduces the FAR of the network by 1.3%, and using the combination of the Conv and IVC modules reduces it by 1.6%. The combined module outperforms the individual module by 0.3%, which shows that the IVC operation notably decreases the false alarm rate of the network and that combining multiple modules reduces it further. This is mainly because the IVC module extracts global features over a wider range and adaptively assigns weights to each pixel, helping the network differentiate between smoke plumes and smoke-like objects and thus decreasing the false alarm rate. Finally, the combination of all three modules achieves the best performance across all metrics, so the MEM introduced in this paper adopts the combination of the three modules.

3.3. Ablation Experiments

Compared to the baseline algorithm, the advantages of this paper's algorithm mainly come from the improvements brought by the MEM, GEM, and EEM. Table 3 shows the ablation analysis of the proposed MGE-YOLOv8 on the RSF dataset with different modules. Considering that early forest fire warning places higher demands on time, this paper adds a comparative analysis of inference time and GFLOPs in Table 3.

3.3.1. MEM

Experiments 2 to 4 show that adding any single module to the baseline network leads to a significant enhancement in average precision coupled with a substantial reduction in the false alarm rate. Taking MEM as the variable under test, comparing experiments 3 and 5, 4 and 7, and 6 and 8 shows that incorporating the MEM increases the AP of the algorithm by 2.8%, 2%, and 2.3%, respectively, while decreasing the FAR by 1.3%, 0.9%, and 0.8%, respectively. These comparisons verify that the MEM combines the advantages of various convolutions to extract the local, non-rigid, and global features of the feature map, thus improving the average precision and recall and decreasing the false alarm rate; the increase in average precision and recall mainly comes from the standard and deformable convolution operations, while the reduction in the false alarm rate primarily comes from the involution operation.

3.3.2. GEM

Taking GEM as the variable under test, comparing experiments 2 and 5, 4 and 6, and 7 and 8 shows that incorporating the GEM increases the AP of the algorithm by 1.4%, 1.2%, and 1.5%, respectively, while decreasing the FAR by 0.2%, 0.3%, and 0.2%, respectively. These comparisons verify that the GEM extracts the structural information of the feature map from a global perspective and enriches the semantic and spatial information of the high-level features, thus significantly improving the algorithm's average precision and recall, slightly reducing its false alarm rate, and significantly reducing missed alarms, as described in Section 3.6 and Section 3.7.

3.3.3. EEM

Taking EEM as the variable under test, comparing experiments 3 and 6, 2 and 7, and 5 and 8 shows that incorporating the EEM increases the AP of the algorithm by 0.8%, 0.2%, and 0.3%, respectively, while decreasing the FAR by 1.0%, 0.5%, and 0.5%, respectively. These comparisons verify that the EEM improves the extraction of the fuzzy features of smoke, which is conducive to differentiating smoke plumes from smoke-like objects, thus significantly reducing the false alarm rate while slightly improving the average precision and recall of the algorithm.

3.3.4. Evaluation Time

Because early forest fire warning has strict timeliness requirements, this paper takes inference time as an evaluation index. The inference times of the baseline and of this paper's model are 4.51 ms and 4.86 ms, respectively. Although the proposed model's inference time increases slightly over the baseline, its processing speed is nearly four times the 25 fps real-time threshold, comfortably achieving real-time operation and buying firefighters more time to contain losses. In conclusion, the MGE-YOLOv8 proposed in this paper substantially enhances the average precision and recall of the algorithm while meeting real-time requirements, significantly reduces the false alarm rate, and reduces missed alarms, thereby improving the efficiency of security personnel.

3.4. Comparisons with Target Detection-Based Improved Smoke Detection Algorithms

In this section, we compare the introduced algorithm and 12 state-of-the-art target detection-based improved smoke detection algorithms on RSF, XJTU-RS, and USTC-RF datasets, as presented in Table 4. As XJTU-RS and USTC-RF datasets feature single backgrounds with minimal interference from similar objects, this paper does not include comparative analyses of false alarm rates for these datasets. The smoke images selected for the self-made dataset in this paper possess uncertainty and ambiguity, leading to lower performance indices for the algorithm proposed herein. Conversely, the algorithm showcases higher performance indices on the USTC-RF dataset.
The EfficientDet model is simple, but its average precision is low, which is fatal for a forest fire warning task with potentially devastating losses. The YOLOX algorithm has better real-time performance but a high false alarm rate, and frequent false alarms turn a smoke detection system from a help into a nuisance with no practical application value. Compared with other smoke detection algorithms, the SASC-YOLOX algorithm achieves higher detection accuracy, faster processing speed, and a lower false alarm rate on the RSF dataset. Still, compared with the MGE-YOLOv8 algorithm introduced in this paper, its AP50 is 2.4% lower and its false alarm rate 0.9% higher, mainly because the proposed algorithm improves the network's ability to extract features. The GEM attention module presented in this paper performs significantly better than CBAM [34]; although its processing speed is relatively slower, it still runs at almost four times the real-time frame rate and thus provides good real-time processing. In summary, the algorithm introduced in this study has good network performance, and the MEM, GEM, and EEM modules are simple and can be inserted into any network architecture with good generalization.

3.5. Comparisons with Neural Network-Based Improved Smoke Detection Algorithms

To comprehensively analyze the superior performance of the introduced algorithms, this study compared six state-of-the-art neural network-based improved smoke detection algorithms with the proposed algorithm on RSF, XJTU-RS, and USTC-RF datasets. The experimental comparison results are presented in Table 5.
In Table 5, DCNN, Deep CNN, W-Net, and STCNet employ deep neural networks as the foundational framework for smoke detection algorithms. These methods exhibit comparable performance metrics with lower average precision and higher false alarm rates. Conversely, MVMNet and SASC-YOLOX utilize the YOLO network architecture for one-stage target detection, resulting in superior network performance compared to deep neural networks. Hence, this paper adopts the latest YOLOv8 network architecture as the foundational framework, demonstrating the most optimal performance across all evaluation metrics. Furthermore, except for the algorithm proposed in this paper, all other methods rely on standard convolution for smoke detection. This approach limits the network to extracting a single feature, inadequately representing the non-rigid features of smoke. Consequently, it leads to reduced detection accuracy in identifying smoke characteristics and results in a higher rate of false alarms due to the inability to effectively differentiate between smoke plumes and smoke-like objects.

3.6. Visualization Analysis

To effectively demonstrate the efficacy of the algorithm introduced in this paper, a visualization analysis is conducted on different datasets. Specifically, the model proposed in this study is trained and subsequently tested on RSF, XJTU-RS, and USTC-RF datasets. Figure 4 presents the visual analysis proposed in this paper, encompassing bounding box visualization and Grad-CAM [52] visualization. The Grad-CAM visualization is primarily employed on the RSF dataset, offering a more intuitive judgment regarding the enhancement of the introduced algorithm compared to the baseline algorithm in mitigating the false alarm rate. On the other hand, bounding box visualization is predominantly used on the XJTU-RS and USTC-RF datasets, facilitating a more intuitive assessment of the improvement brought about by the proposed algorithm in terms of confidence values compared to the baseline algorithm.
In Figure 4a, the RSF dataset displays smoke images with various interferences: the first column involves light interference, the second and fifth columns exhibit similar-color interferences (pavement, roof), and the third and fourth columns showcase diverse cloud interferences. The proposed algorithm accurately identifies smoke regions, excluding light, similar-color, and cloud interferences from the highlighted boxes. In contrast, the baseline method incorrectly categorizes smoke-like objects as smoke, resulting in a higher false alarm rate. This discrepancy is primarily due to the MEM and EEM in this paper, which enable adaptive spatial aggregation and weight assignment. Consequently, these modules enhance smoke edge feature extraction, significantly improving accuracy and reducing false alarms. Moreover, the sixth to eighth column images depict typical forest smoke images. The Grad-CAM visualization highlights that the proposed algorithm identifies smoke regions more accurately than the baseline. This enhancement is attributed to the GEM proposed in this paper, which effectively captures global structural smoke information, thus mitigating missed alarms.
This study employed bounding box visualization for comparison on the XJTU-RS and USTC-RF datasets, which are characterized by a single background without interference from smoke-like objects. In Figure 4b, a considerable enhancement is observed in the confidence of the bounding box detections on the XJTU-RS dataset. This improvement is primarily attributed to the multi-feature extraction module introduced in this study, which significantly enhances detection precision. The proposed algorithm accurately identifies the smoke region in the fifth column image, primarily because of the GEM's superior ability to capture global structural smoke detail. In Figure 4c, a slight elevation in bounding box confidence is noticed on the USTC-RF dataset. This marginal improvement stems from the dataset's straightforward semantic information and uniform smoke shapes, which allow good performance across various network models.

3.7. Real Applications Analysis

To verify the practicality of the algorithm introduced in this study, a set of smoke images was downloaded from the internet, and both the baseline network and the network presented in this paper were used to perform target detection on them. Because these downloaded smoke images have no labels, only bounding box visualization is carried out on this set of data, as shown in Figure 5.
In Figure 5a, the presented algorithm accurately identifies the smoke region and effectively excludes interference from clouds and similar-colored objects, thus notably reducing the false alarm rate in practical application scenarios. Figure 5b demonstrates the algorithm’s comprehensive smoke region identification, significantly mitigating missed alarms in real-world applications. Additionally, Figure 5c highlights the clear advantage of the proposed algorithm’s detection frame confidence over the baseline algorithm in various scenarios, affirming the reliability of the presented algorithm’s detection accuracy in practical applications.

4. Discussion

Forest fires present a significant threat to both natural resources and human life. Accurate and prompt forest fire detection is paramount for minimizing associated losses. This paper suggests employing smoke detection as a method for early forest fire warning, recognizing that smoke often serves as the initial indicator of a fire and is observable from a distance. Nevertheless, existing smoke detection methods grapple with two primary challenges: low accuracy and a high false alarm rate. Low accuracy results in missed alarms, posing a potentially fatal risk for early forest fire warning. Conversely, a high false alarm rate generates nuisance alarms that provide little value, particularly when interfering targets are present. Addressing these challenges is crucial for enhancing the effectiveness of early forest fire warning systems.
The low accuracy and high false alarm rate observed in current smoke detection methods primarily stem from the inaccurate extraction of discriminative smoke features by the feature extraction network, which degrades overall performance. Forest fire smoke exhibits characteristics of uncertainty and ambiguity. The uncertainty features of smoke originate from two sources. Firstly, smoke is a non-rigid object with no fixed geometric shape, contributing to the reduced accuracy of existing smoke detection algorithms. Secondly, the presence of numerous smoke-like objects in the background of smoke images introduces uncertainty in smoke classification, resulting in a high false alarm rate in current smoke detection algorithms. The ambiguity feature of smoke mainly arises from noise and adverse weather conditions, whose interference damages the edges of smoke images and, in turn, affects the classification and localization of the smoke region. Addressing these challenges is crucial for enhancing the performance of smoke detection algorithms. Existing neural network-based smoke detection algorithms, including DCNN, Deep CNN, W-Net, STCNet, MVMNet, and SASC-YOLOX, predominantly utilize standard convolution for feature extraction. Despite incorporating various attention mechanisms, their detection performance improves only marginally. The fixed structure of standard convolution prevents the effective extraction of deformable smoke features, thereby limiting overall detection accuracy. Furthermore, the spatial invariance and channel specificity of standard convolution lead the neural network to assign identical weights to similar features during extraction, which is not conducive to distinguishing between smoke and smoke-like objects. Fast R-CNN demonstrates the highest detection accuracy among current target detection-based improved smoke detection algorithms; however, it exhibits the highest GFLOPs and poor FPS, rendering it unsuitable for the real-time requirements of early forest fire warning. EfficientDet has the fewest model parameters and GFLOPs, offering better real-time performance, but suffers from lower detection accuracy. Within the YOLO series, YOLOv8 has the fewest model parameters and the highest AP and FPS. Other detectors, such as SSD, RetinaNet-50, and RetinaNet-101, have higher model parameters and GFLOPs, exhibiting reduced real-time performance and less satisfactory detection accuracy. Consequently, this paper adopts YOLOv8 as the foundational network framework.
To accurately detect smoke's uncertain and fuzzy features, this paper introduces three novel modules (MEM, GEM, and EEM) into the original YOLOv8 network. The MEM strategically combines the strengths of different convolutions to adaptively extract the shape of smoke and assign weights to different pixels. The GEM is designed to capture the global features of smoke, enhancing the identification of smoke regions. In feature mapping, the EEM reduces smoke edge noise and improves the discrimination of smoke plumes from smoke-like objects through enhanced convolution. This integrated approach enhances detection accuracy and reduces false alarms. Additionally, this paper creates a large-scale real-scenario forest fire smoke dataset to facilitate direct application in forest fire monitoring, ensuring better model robustness and generalization. The experimental findings affirm that the proposed MGE-YOLOv8 algorithm is well suited for direct application in early forest fire warning systems. In practical scenarios, the algorithm can be deployed on fire monitoring towers, forest stations, industrial parks, or city perimeters for real-time monitoring of potential fires and timely warnings. Additionally, the algorithm is adaptable for implementation on surveillance devices such as UAVs to monitor expansive areas and issue prompt warnings. Customizing these deployment strategies to specific forest terrains and requirements is essential to improve the sensitivity and accuracy of fire warnings. The extensive coverage, rapid detection speed, and cost-effectiveness of vision-based detection devices facilitate widespread adoption. However, it is crucial to acknowledge the limitations of the proposed algorithm. Although it performs real-time detection and early warning effectively during daylight hours, its efficacy diminishes in low-light or nighttime environments. Future enhancements should prioritize optimizing the dataset, expanding nighttime images of natural scenes, or improving nighttime image preprocessing. These efforts will enable the algorithm to achieve all-weather real-time detection and early warning of forest fires.

5. Conclusions

To overcome the challenges of low accuracy and a high false alarm rate in forest fire smoke detection, this paper presents the MGE-YOLOv8 network architecture as an enhancement of the original YOLOv8 network. The network incorporates several crucial modifications for enhanced performance. First, the MEM replaces the original feature extraction module to strengthen the network's ability to extract discriminative smoke features. Second, the GEM is introduced at the end of the feature extraction network to enhance the global dependence of high-level features. Finally, the EEM is added before the prediction network to reduce edge noise in smoke without compromising image clarity. Experimental results demonstrate that the proposed algorithm achieves a 4.2% improvement in AP and a 3.9% increase in AR compared to the baseline network on the RSF dataset. Moreover, it reduces the false alarm rate by 2.1% while maintaining an operational speed of 87.682 FPS and a computational cost of 33.682 GFLOPs. These results highlight the exceptional detection capabilities of the algorithm.

Author Contributions

Conceptualization, Y.W. and Y.P.; methodology, Y.W.; software, Y.W. and H.Z.; validation, Y.W. and B.L.; formal analysis, Y.W.; writing—original draft, Y.W.; writing—review and editing, Y.W.; visualization, Y.W.; supervision, Y.P.; resources, Y.P.; project administration, Y.P.; funding acquisition, Y.P.; data curation, H.W. and B.L.; investigation, H.W. and H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Project of the Jilin Provincial Department of Science and Technology (Grant No. 20220201062GX).

Data Availability Statement

The data introduced in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Toreyin, B.U.; Cetin, A.E. Wildfire detection using LMS based active learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 19–24 April 2009; pp. 1461–1464. [Google Scholar]
  2. Toreyin, B.U.; Cinbis, R.G.; Dedeoglu, Y.; Cetin, A.E. Fire detection in infrared video using wavelet analysis. Opt. Eng. 2007, 46, 7204. [Google Scholar] [CrossRef]
  3. Toreyin, B.U.; Dedeoglu, Y.; Cetin, A.E. Contour based smoke detection in video using wavelets. In Proceedings of the 14th European Signal Processing Conference, Florence, Italy, 4–8 September 2006; pp. 1–5. [Google Scholar]
  4. Besbes, O.; Benazza-Benyahia, A. A novel video-based smoke detection method based on color invariants. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 1911–1915. [Google Scholar]
  5. Gomes, P.; Santana, P.; Barata, J. A vision-based approach to fire detection. Int. J. Adv. Rob. Syst. 2014, 11, 149. [Google Scholar] [CrossRef]
  6. Wang, Y. Smoke recognition based on machine vision. In Proceedings of the International Symposium on Computer, Consumer and Control (IS3C), Xi’an, China, 4–6 July 2016; pp. 668–671. [Google Scholar]
  7. Hossain, F.A.; Zhang, Y.M.; Tonima, M.A. Forest fire flame and smoke detection from uav-captured images using fire-specific color features and multi-color space local binary pattern. J. Unmanned Veh. Syst. 2020, 8, 285–309. [Google Scholar] [CrossRef]
  8. Wang, Y.; Wu, A.; Zhang, J.; Zhao, M.; Li, W.; Dong, N. Fire smoke detection based on texture features and optical flow vector of contour. In Proceedings of the 2016 12th World Congress on Intelligent Control and Automation (WCICA), Guilin, China, 29 September 2016; pp. 2879–2883. [Google Scholar]
  9. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  10. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  11. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  12. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  14. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  15. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  16. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  17. Ultralytics-YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 1 October 2022).
  18. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  19. Wang, C.; Bochkovskiy, A.; Liao, H.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  20. Ultralytics-YOLOv8. Available online: https://github.com/ultralytics/ultralytics (accessed on 10 January 2023).
  21. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  22. Liu, S.; Zha, J.; Sun, J.; Li, Z.; Wang, G. EdgeYOLO: An edge-real-time object detector. arXiv 2023, arXiv:2302.07483. [Google Scholar]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  24. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
  25. Guo, X.; Cao, Y.; Hu, T. An Efficient and Lightweight Detection Model for Forest Smoke Recognition. Forests 2024, 15, 210. [Google Scholar] [CrossRef]
  26. Yang, G.; Wang, J.; Nie, Z.; Yang, H.; Yu, S. A Lightweight YOLOv8 Tomato Detection Algorithm Combining Feature Enhancement and Attention. Agronomy 2023, 13, 1824. [Google Scholar] [CrossRef]
  27. Gu, K.; Xia, Z.; Qiao, J.; Lin, W. Deep dual-channel neural network for image-based smoke Detection. IEEE Trans. Multimed. 2020, 22, 311–323. [Google Scholar] [CrossRef]
  28. Khan, S.; Muhammad, K.; Mumtaz, S.; Baik, S.W.; de Albuquerque, V.H.C. Energy-efficient deep CNN for smoke detection in foggy IoT environment. IEEE Internet Things J. 2019, 6, 9237–9245. [Google Scholar] [CrossRef]
  29. Yuan, F.; Zhang, L.; Xia, X.; Huang, Q.; Li, X. A wave-shaped deep neural network for smoke density estimation. IEEE Trans. Image Process. 2019, 29, 2301–2313. [Google Scholar] [CrossRef] [PubMed]
  30. Cao, Y.; Tang, Q.; Lu, X.; Li, F.; Cao, J. STCNet: Spatiotemporal cross network for industrial smoke detection. Multimed. Tools Appl. 2022, 81, 10261–10277. [Google Scholar] [CrossRef]
  31. Hu, Y.; Zhan, J.; Zhou, G.; Chen, A.; Cai, W.; Guo, K.; Hu, Y.; Li, L. Fast forest fire smoke detection using MVMNet. Knowl. Based Syst. 2022, 241, 108219. [Google Scholar] [CrossRef]
  32. Wang, J.; Zhang, X.; Jing, K.; Zhang, C. Learning precise feature via self-attention and self-cooperation YOLOX for smoke detection. Expert Syst. Appl. 2023, 228, 120330. [Google Scholar] [CrossRef]
  33. Yin, M.; Lang, C.; Li, Z.; Feng, S.; Wang, T. Recurrent convolutional network for video-based smoke detection. Multimed. Tools Appl. 2019, 78, 237–256. [Google Scholar] [CrossRef]
  34. Zhang, L.; Lu, C.; Xu, H.; Chen, A.; Li, L.; Zhou, G. MMFNet: Forest Fire Smoke Detection Using Multiscale Convergence Coordinated Pyramid Network With Mixed Attention and Fast-Robust NMS. IEEE Internet Things J. 2023, 10, 18168–18180. [Google Scholar] [CrossRef]
  35. Tran, D.Q.; Park, M.; Jeon, Y.; Bak, J.; Park, S. Forest-Fire Response System Using Deep-Learning-Based Approaches With CCTV Images and Weather Data. IEEE Access 2022, 10, 66061–66071. [Google Scholar] [CrossRef]
  36. Wang, L.; Zhang, H.; Zhang, Y.; Hu, K.; An, K. A Deep Learning-Based Experiment on Forest Wildfire Detection in Machine Vision Course. IEEE Access 2023, 11, 32671–32681. [Google Scholar] [CrossRef]
  37. Armando, M.F.; Andrei, B.U.; Paulo, C. Automatic Early Detection of Wildfire Smoke With Visible Light Cameras Using Deep Learning and Visual Explanation. IEEE Access 2022, 10, 12814–12828. [Google Scholar]
  38. Almeida, J.S.; Huang, C.; Nogueira, F.G.; Bhatia, S.; de Albuquerque, V.H.C. EdgeFireSmoke: A Novel Lightweight CNN Model for Real-Time Video Fire–Smoke Detection. IEEE Trans. Ind. Inform. 2022, 18, 7889–7898. [Google Scholar] [CrossRef]
  39. Li, X.; Chen, Z.; Wu, Q.M.; Liu, C. 3D Parallel Fully Convolutional Networks for Real-Time Video Wildfire Smoke Detection. IEEE Trans. Circuits Syst. Video Technol. 2018, 30, 89–103. [Google Scholar] [CrossRef]
  40. Cao, Y.; Tang, Q.; Wu, X.; Lu, X. EFFNet: Enhanced Feature Foreground Network for Video Smoke Source Prediction and Detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1820–1833. [Google Scholar] [CrossRef]
  41. Lin, G.; Zhang, Y.; Xu, G.; Zhang, Q. Smoke detection on video sequences using 3D convolutional neural networks. Fire Technol. 2019, 55, 1827–1847. [Google Scholar] [CrossRef]
  42. Donahue, J.; Hendricks, L.A.; Rohrbach, M.; Venugopalan, S.; Guadarrama, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2625–2634. [Google Scholar]
  43. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 20–36. [Google Scholar]
  44. Zhou, B.; Andonian, A.; Oliva, A.; Torralba, A. Temporal relational reasoning in videos. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 831–846. [Google Scholar]
  45. Lin, J.; Gan, C.; Han, S. TSM: Temporal shift module for efficient video understanding. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7082–7092. [Google Scholar]
  46. Zhang, Q.; Lin, G.; Zhang, Y.; Xu, G.; Wang, J. Wildland forest fire smoke detection based on faster R-CNN using synthetic smoke images. Procedia Eng. 2018, 211, 441–446. [Google Scholar] [CrossRef]
  47. Ko, B.; Ham, S.; Nam, J. Modeling and formalization of fuzzy finite automata for detection of irregular fire flames. IEEE Trans. Circuits Syst. Video Technol. 2011, 21, 1903–1912. [Google Scholar] [CrossRef]
  48. Li, D.; Hu, J.; Wang, C.; Li, X.; She, Q.; Zhu, L.; Zhang, T.; Chen, Q. Involution: Inverting the inherence of convolution for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12321–12330. [Google Scholar]
  49. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  50. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  51. Luo, J.; Yang, Z.; Li, S.; Wu, Y. FPCB surface defect detection: A decoupled two-stage object detection framework. IEEE Trans. Instrum. Meas. 2021, 70, 5012311. [Google Scholar] [CrossRef]
  52. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-Cam: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Figure 1. Smoke datasets used in the experiments of this paper. (a) USTC-RF smoke dataset; (b) XJTU-RS smoke dataset; (c) RSF smoke dataset (Proposed).
Figure 2. The overall network architecture of the proposed MGE-YOLOv8. (a) Overall network architecture and the GEM and EEM module architectures; (b) the MEM module architecture and the architectures of the various convolutional modules.
Figure 3. MEM architecture. (a) Conv module architecture, (b) DCN module architecture, (c) IVC module architecture, (d) Combined Conv and DCN module architecture, (e) Combined Conv and IVC module architecture, (f) Combined Conv, DCN and IVC module architecture (Proposed).
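To make the parallel-branch design of Figure 3f concrete, the following is a minimal PyTorch sketch, not the authors' released code: it fuses a standard convolution, a deformable convolution (DCN [49]), and an involution (IVC [48]) by element-wise summation. The branch widths, kernel sizes, normalization, activation, and fusion rule here are illustrative assumptions.

```python
# A minimal sketch (assumptions noted above) of a three-branch block in the
# spirit of Figure 3f: Conv, DCN, and IVC branches fused by summation.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class Involution(nn.Module):
    """Simplified involution (Li et al. [48]): kernels are generated per
    spatial position instead of being shared across the feature map."""
    def __init__(self, channels, kernel_size=3, groups=4, reduction=4):
        super().__init__()
        self.k, self.g = kernel_size, groups
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.span = nn.Conv2d(channels // reduction, kernel_size**2 * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # Position-specific kernels: (b, groups, k*k, h, w).
        kernel = self.span(self.reduce(x)).view(b, self.g, self.k**2, h, w)
        # Local patches: (b, groups, c/groups, k*k, h, w).
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k**2, h, w)
        return (kernel.unsqueeze(2) * patches).sum(dim=3).view(b, c, h, w)

class MEMBlockSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        # DCN needs per-position offsets: 2 values (dy, dx) per kernel tap.
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)
        self.dcn = DeformConv2d(channels, channels, 3, padding=1)
        self.ivc = Involution(channels)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        # Assumed fusion rule: element-wise sum of the three branches.
        y = self.conv(x) + self.dcn(x, self.offset(x)) + self.ivc(x)
        return self.act(self.bn(y))

x = torch.randn(1, 64, 80, 80)
print(MEMBlockSketch(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```

The intuition behind such a combination: the deformable branch learns sampling offsets so the receptive field can follow irregular smoke contours, while the involution branch generates position-specific kernels, both of which suit the diffuse, non-rigid appearance of smoke better than a fixed convolution alone.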
Figure 4. Comparison of experimental results between MGE-YOLOv8 and YOLOv8 on different datasets. (a) RSF dataset, (b) XJTU-RS dataset, (c) USTC-RF dataset.
Figure 5. Comparison of experimental results between MGE-YOLOv8 and YOLOv8 in real applications. (a) Smoke images with reduced false alarms; (b) smoke images with reduced missed detections; (c) smoke images with increased confidence values.
Table 1. Training configurations.

| Dataset | Train Samples | Validation Samples | Test Samples | Total Samples |
|---|---|---|---|---|
| USTC-RF | 3155 | 3155 | 6310 | 12,620 |
| XJTU-RS | 4792 | 1369 | 684 | 6845 |
| RSF | 10,762 | 3074 | 1537 | 15,373 |
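As a quick sanity check on Table 1, the split counts correspond to a 25/25/50 partition for USTC-RF and approximately 70/20/10 partitions for XJTU-RS and RSF; the short Python snippet below (illustrative only, not from the paper) verifies the totals and ratios.

```python
# Verify that the Table 1 splits are internally consistent.
splits = {
    "USTC-RF": (3155, 3155, 6310),
    "XJTU-RS": (4792, 1369, 684),
    "RSF": (10762, 3074, 1537),
}
for name, (train, val, test) in splits.items():
    total = train + val + test
    print(f"{name}: total={total}, "
          f"train/val/test = {train/total:.2f}/{val/total:.2f}/{test/total:.2f}")
# e.g. XJTU-RS: total=6845, train/val/test = 0.70/0.20/0.10
```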
Table 2. MEM architecture analysis.

| Structure | AP | AP50 | AR | FAR |
|---|---|---|---|---|
| (a) | 0.749 | 0.961 | 0.737 | 0.058 |
| (b) | 0.770 | 0.970 | 0.750 | 0.055 |
| (c) | 0.761 | 0.968 | 0.742 | 0.045 |
| (d) | 0.784 | 0.982 | 0.776 | 0.049 |
| (e) | 0.775 | 0.974 | 0.758 | 0.042 |
| (f) | 0.791 | 0.986 | 0.776 | 0.037 |
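Tables 2–5 report AP, AP50, AR, and FAR. AP and AP50 follow the usual COCO-style conventions (AP averaged over IoU thresholds, AP50 at IoU 0.50), and FAR reads most naturally as a false alarm rate. The sketch below illustrates one plausible image-level reading of FAR, the fraction of smoke-free images on which the detector fires; the authors' exact evaluation protocol may differ, and the function name and data layout are hypothetical.

```python
# Hypothetical image-level false alarm rate (FAR): the share of smoke-free
# images on which the detector raises at least one smoke alarm. This is one
# plausible reading of the FAR column, not the paper's confirmed definition.
def false_alarm_rate(results):
    """results: iterable of (has_smoke_gt: bool, alarm_raised: bool) per image."""
    negatives = [alarm for has_smoke, alarm in results if not has_smoke]
    return sum(negatives) / max(len(negatives), 1)

# Toy example: 3 smoke-free images, 1 spurious alarm -> FAR = 1/3.
demo = [(False, True), (False, False), (False, False), (True, True)]
print(f"FAR = {false_alarm_rate(demo):.3f}")
```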
Table 3. Ablation analysis.

| Experiment Number | MEM | GEM | EEM | AP | AP50 | AR | FAR | Inference Time (ms) | GFLOPs |
|---|---|---|---|---|---|---|---|---|---|
| 1 |  |  |  | 0.749 | 0.961 | 0.737 | 0.058 | 4.51 | 28.624 |
| 2 |  |  |  | 0.774 | 0.974 | 0.759 | 0.044 | 4.75 | 30.432 |
| 3 |  |  |  | 0.760 | 0.971 | 0.755 | 0.055 | 4.62 | 29.201 |
| 4 |  |  |  | 0.756 | 0.966 | 0.748 | 0.048 | 4.58 | 28.862 |
| 5 |  |  |  | 0.788 | 0.988 | 0.782 | 0.042 | 4.81 | 32.662 |
| 6 |  |  |  | 0.768 | 0.977 | 0.762 | 0.045 | 4.68 | 30.428 |
| 7 |  |  |  | 0.776 | 0.981 | 0.771 | 0.039 | 4.74 | 31.868 |
| 8 |  |  |  | 0.791 | 0.986 | 0.776 | 0.037 | 4.86 | 33.682 |
Table 4. Comparative performances of the proposed and target detection-based improved smoke detection algorithms.

| Database | Method | AP | AP50 | AR | Parameters | FPS | GFLOPs | FAR |
|---|---|---|---|---|---|---|---|---|
| RSF | Fast R-CNN [10] | 0.912 | - | 0.938 | 28.28 M | 8.133 | 278.422 | 0.055 |
|  | SSD [12] | 0.655 | 0.951 | 0.682 | 23.75 M | 28.858 | 64.562 | 0.058 |
|  | RetinaNet-50 [23] | 0.657 | 0.953 | 0.684 | 36.33 M | 8.681 | 32.688 | 0.048 |
|  | RetinaNet-101 [23] | 0.662 | 0.958 | 0.702 | 55.32 M | 6.244 | 68.642 | 0.048 |
|  | EfficientDet [24] | 0.614 | 0.940 | 0.670 | 3.83 M | 44.285 | 4.648 | 0.044 |
|  | YOLOv3 [15] | 0.618 | 0.942 | 0.674 | 61.50 M | 28.623 | 23.262 | 0.054 |
|  | YOLOv5 [17] | 0.632 | 0.949 | 0.685 | 7.05 M | 109.117 | 16.486 | 0.054 |
|  | YOLOv7 [19] | 0.701 | 0.949 | 0.728 | 9.14 M | 22.713 | 32.464 | 0.054 |
|  | EdgeYOLO [22] | 0.652 | 0.951 | 0.696 | 9.86 M | 91.244 | 19.271 | 0.052 |
|  | YOLOX [21] | 0.686 | 0.952 | 0.712 | 8.94 M | 220.688 | 12.943 | 0.055 |
|  | SASC-YOLOX [32] | 0.712 | 0.962 | 0.724 | 8.94 M | 198.424 | 15.514 | 0.046 |
|  | YOLOv8 [20] | 0.749 | 0.961 | 0.737 | 6.28 M | 112.315 | 28.624 | 0.058 |
|  | MGE-YOLOv8 | 0.791 | 0.986 | 0.776 | 8.46 M | 87.682 | 33.682 | 0.037 |
| XJTU-RS | Fast R-CNN [10] | 0.936 | - | 0.961 | 28.28 M | 8.597 | 278.421 | - |
|  | SSD [12] | 0.659 | 0.953 | 0.687 | 23.75 M | 39.214 | 64.552 | - |
|  | RetinaNet-50 [23] | 0.690 | 0.957 | 0.700 | 36.33 M | 8.421 | 32.658 | - |
|  | RetinaNet-101 [23] | 0.691 | 0.964 | 0.701 | 55.32 M | 6.539 | 68.632 | - |
|  | EfficientDet [24] | 0.654 | 0.951 | 0.696 | 3.83 M | 41.997 | 4.644 | - |
|  | YOLOv3 [15] | 0.724 | 0.951 | 0.721 | 61.50 M | 31.695 | 23.262 | - |
|  | YOLOv5 [17] | 0.661 | 0.957 | 0.704 | 7.05 M | 109.208 | 16.486 | - |
|  | YOLOv7 [19] | 0.709 | 0.957 | 0.700 | 9.14 M | 25.242 | 32.462 | - |
|  | EdgeYOLO [22] | 0.711 | 0.959 | 0.705 | 9.86 M | 92.851 | 19.261 | - |
|  | YOLOX [21] | 0.683 | 0.953 | 0.678 | 8.94 M | 232.019 | 12.943 | - |
|  | SASC-YOLOX [32] | 0.726 | 0.964 | 0.714 | 8.94 M | 213.675 | 15.514 | - |
|  | YOLOv8 [20] | 0.732 | 0.968 | 0.718 | 6.28 M | 132.414 | 28.624 | - |
|  | MGE-YOLOv8 | 0.792 | 0.990 | 0.742 | 8.46 M | 102.568 | 33.682 | - |
| USTC-RF | Fast R-CNN [10] | 0.334 | - | 0.558 | 28.28 M | 8.861 | 278.423 | - |
|  | SSD [12] | 0.810 | 0.987 | 0.852 | 23.75 M | 24.138 | 64.553 | - |
|  | RetinaNet-50 [23] | 0.829 | 0.989 | 0.860 | 36.33 M | 6.917 | 32.682 | - |
|  | RetinaNet-101 [23] | 0.868 | 0.989 | 0.907 | 55.32 M | 5.268 | 68.641 | - |
|  | EfficientDet [24] | 0.344 | 0.524 | 0.370 | 3.83 M | 43.126 | 4.648 | - |
|  | YOLOv3 [15] | 0.960 | 0.990 | 0.969 | 61.50 M | 30.642 | 23.262 | - |
|  | YOLOv5 [17] | 0.885 | 0.990 | 0.907 | 7.05 M | 109.96 | 16.486 | - |
|  | YOLOv7 [19] | 0.905 | 0.990 | 0.924 | 9.14 M | 28.736 | 32.461 | - |
|  | EdgeYOLO [22] | 0.911 | 0.990 | 0.928 | 9.86 M | 93.545 | 19.268 | - |
|  | YOLOX [21] | 0.905 | 0.990 | 0.923 | 8.94 M | 265.252 | 12.943 | - |
|  | SASC-YOLOX [32] | 0.921 | 0.990 | 0.939 | 8.94 M | 255.102 | 15.514 | - |
|  | YOLOv8 [20] | 0.926 | 0.990 | 0.936 | 6.28 M | 158.668 | 28.624 | - |
|  | MGE-YOLOv8 | 0.938 | 0.990 | 0.941 | 8.46 M | 122.483 | 33.682 | - |
Table 5. Comparative performances of the proposed and neural network-based improved smoke detection algorithms.

| Database | Method | AP | AP50 | AR | FAR |
|---|---|---|---|---|---|
| RSF | DCNN [27] | 0.642 | 0.927 | 0.662 | 0.056 |
|  | Deep CNN [28] | 0.624 | 0.930 | 0.701 | 0.056 |
|  | W-Net [29] | 0.686 | 0.942 | 0.714 | 0.044 |
|  | STCNet [30] | 0.624 | 0.924 | 0.689 | 0.039 |
|  | MVMNet [31] | 0.718 | 0.951 | 0.704 | 0.048 |
|  | SASC-YOLOX [32] | 0.712 | 0.962 | 0.724 | 0.046 |
|  | MGE-YOLOv8 | 0.791 | 0.986 | 0.776 | 0.037 |
| XJTU-RS | DCNN [27] | 0.660 | 0.937 | 0.706 | - |
|  | Deep CNN [28] | 0.627 | 0.937 | 0.678 | - |
|  | W-Net [29] | 0.651 | 0.948 | 0.704 | - |
|  | STCNet [30] | 0.572 | 0.931 | 0.635 | - |
|  | MVMNet [31] | 0.714 | 0.961 | 0.707 | - |
|  | SASC-YOLOX [32] | 0.726 | 0.964 | 0.714 | - |
|  | MGE-YOLOv8 | 0.792 | 0.990 | 0.742 | - |
| USTC-RF | DCNN [27] | 0.853 | 0.989 | 0.876 | - |
|  | Deep CNN [28] | 0.876 | 0.990 | 0.896 | - |
|  | W-Net [29] | 0.770 | 0.986 | 0.807 | - |
|  | STCNet [30] | 0.709 | 0.979 | 0.755 | - |
|  | MVMNet [31] | 0.888 | 0.990 | 0.907 | - |
|  | SASC-YOLOX [32] | 0.921 | 0.990 | 0.939 | - |
|  | MGE-YOLOv8 | 0.938 | 0.990 | 0.941 | - |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
