Article

Research on Coal and Gangue Recognition Model Based on CAM-Hardswish with EfficientNetV2

College of Computer Science and Technology, Xi’an University of Science and Technology, Xi’an 710054, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2023, 13(15), 8887; https://doi.org/10.3390/app13158887
Submission received: 8 June 2023 / Revised: 25 July 2023 / Accepted: 30 July 2023 / Published: 2 August 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Under actual production conditions, coal and gangue appear in multiscale shapes, and existing coal separation methods recognize them inefficiently, causing environmental pollution and other problems. Combining image data preprocessing and deep learning techniques, this paper presents an improved EfficientNetV2 network for coal and gangue recognition. To expand the dataset and prevent network overfitting, a pipeline-based data enhancement method is applied to the small-sample dataset to simulate coal and gangue production under actual working conditions. The proposed approach modifies the attention mechanism module in the model by adopting the CAM attention module, selects the Hardswish activation function, and updates the block structure of the network. Compared with the single pooling layer of the SE module, the parallel pooling layers introduced in the CAM module minimize information loss and extract richer feature information. The Hardswish activation function offers excellent numerical stability and fast computation. It can be deployed effectively to solve complex computation and derivation problems, compensates for the limitations of the ReLU activation function, and improves the efficiency of neural network training. By selecting optimized hyperparameters for the network structure, we increased the training speed of the network while maintaining model accuracy. Finally, we applied the improved model to the problem of coal and gangue recognition. The experimental results show that the improved EfficientNetV2 coal and gangue recognition method is easy to train, converges quickly, and trains fast, and thus achieves high recognition accuracy even with a limited dataset. The accuracy of coal and gangue recognition increased by 3.98% compared with the original model, reaching 98.24%. Moreover, the training speed improved, and the inference time of the improved model decreased by 6.6 ms. These observations confirm the effectiveness of our proposed model improvements.

1. Introduction

As one of China’s most important energy minerals, coal is a vital part of modern industry and is widely regarded as the backbone of the economy [1]. In coal mining, the production of gangue, an associated waste material, is a major obstacle in the coal extraction process. Gangue mixed into the coal reduces the combustion efficiency and utilization rate of the coal, and its combustion also produces large quantities of harmful gases that harm the human body and pollute the environment [2]. Therefore, coal and gangue sorting directly affects the quality of coal products, and coal and gangue recognition is an important research topic in intelligent coal washing production.
Various techniques are currently used to recognize coal and gangue, all of which use physical characteristics as the basic criterion for recognition [3]. Density-based methods exploit the difference in density between coal and gangue to perform sorting [4,5]. However, owing to high costs and extensive water consumption, this approach is impractical and also causes environmental pollution. Radiation-based recognition methods rely mainly on γ-ray or X-ray techniques [6,7]. However, these methods carry a higher radiation hazard, and special protective measures must be taken to prevent radiation leakage, which increases equipment costs. In addition to the above methods, there is also an acoustic-signal method that recognizes coal and gangue in conjunction with the longwall shearer during coal mining [8].
Deep learning techniques are widely used in various fields. In agriculture, for example, combining deep learning with classical image processing algorithms yielded a new method for detecting and counting banana bunches in complex orchard environments [9]. In engineering, deep learning has been used to measure the width of cracks in dams [10]. In the field of coal mining, therefore, deep learning-based coal and gangue recognition is undoubtedly an important means of solving the recognition problem. In deep learning-based recognition, image acquisition is a critical step. However, under actual production conditions, coal and gangue images vary in shape and gray texture and suffer from low illumination, resulting in nonuniform characteristics [11,12].
An image dataset of coal and gangue must be acquired manually. In most cases, however, the acquired images cannot fully capture the characteristics of coal and gangue under actual working conditions, so image processing technology is needed to artificially enrich the images with feature information [13]. To this end, Wu Guoping [14] applied image preprocessing to a collection of coal and gangue images using image processing and deep learning technology before feeding the data into a convolutional neural network (CNN) for training; the image samples were then classified by training a ResNet18 CNN [15]. As attention mechanisms have become a research hotspot in various fields, they have gradually been introduced into deep learning networks and have been found to improve various mainstream networks to differing degrees [16,17]. Gao Yasong [18] developed a deep learning-based coal and gangue recognition method using an improved MobileNetV3 network and incorporated the Convolutional Block Attention Module (CBAM) to significantly enhance model performance [19]. Coal and gangue recognition with deep learning must balance critical factors, including model accuracy and training speed. The pursuit of accuracy has led to increasingly complex network structures, so finding network structure hyperparameters suited to the coal and gangue problem has become especially important [20,21]. Hence, both image data preprocessing and the selection of an appropriate deep learning network structure are significant for the coal and gangue recognition problem [22,23].
Image classification network models have evolved rapidly, with significant changes in network structure: MobileNetV3 modified the inverted residual structure, Vision Transformer applied the transformer to computer vision, and ConvNeXt combined transformer design ideas with existing CNNs [24,25]. All of these enhancements aim to improve CNN performance. In this paper, the EfficientNetV2-S network is used as the base network. The EfficientNet model, proposed by Google in 2019, achieved the highest ImageNet top-1 accuracy of that year, 84.3%, while significantly reducing the number of parameters and improving inference speed; it is smaller and more accurate than many mainstream models. EfficientNetV2, proposed in 2021, has shown excellent performance in image recognition; whereas EfficientNetV1 focuses on accuracy, EfficientNetV2 also focuses on training speed [26,27]. The attention mechanism module plays an important role in improving network representation, but when the Squeeze-and-Excitation (SE) module in the EfficientNetV2 network was applied to the gangue recognition problem, its single pooling layer was found to lose important information in complex production environments.
In this paper, an improved coal and gangue recognition network based on EfficientNetV2 is proposed by combining image data preprocessing and deep learning techniques. The EfficientNetV2-S network is used as the basic feature extraction backbone; the block structure of the network is updated, the channel attention module (CAM) is adopted, and the activation function is changed. In addition, without degrading model performance, this paper improves network training speed by exploring and adjusting the network structure hyperparameters. Previous researchers preprocessed coal and gangue image data with methods that were too simplistic. Here, we use pipeline-based preprocessing and image processing techniques to obtain a multiscale, multimorphology coal and gangue dataset, prevent network overfitting, and achieve efficient and accurate recognition of coal and gangue.

2. Coal and Gangue Recognition Network Model

2.1. EfficientNetV2 Network Model

The EfficientNetV2 network is a CNN characterized by the use of the Fused-MBConv module and a progressive learning strategy. The Fused-MBConv module was introduced to address the problem of reduced speed caused by the utilization of DepthWise (DW) convolution. Additionally, the network consists of multiple stages, with the MBConv and Fused-MBConv modules comprising the primary components.
The MBConv module is derived from the block in the MobileNetV3 network, where it is a variation of the inverted residual structure with an SE module added, as shown in Figure 1. The feature map matrix input to the module is first up-dimensioned by a 1 × 1 convolutional layer, where the number of output feature matrix channels is n times the number of input channels. The up-dimensioned feature map then undergoes a DW convolution, which reduces the number of parameters and operations in the convolution process. The feature matrix is then recalibrated by the SE attention mechanism module. Finally, a 1 × 1 convolution operation is performed on the feature map matrix.
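To make this data flow concrete, a minimal PyTorch sketch of such an MBConv block follows. The channel counts, expansion factor, and SiLU activations are illustrative assumptions; the SE step, described in detail below, is written here with 1 × 1 convolutions, which act as fully connected layers on the pooled features.

import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Minimal MBConv sketch: 1x1 expansion -> 3x3 depthwise conv -> SE -> 1x1 projection."""
    def __init__(self, in_ch: int, out_ch: int, expand: int = 4, stride: int = 1):
        super().__init__()
        mid = in_ch * expand                              # output channels = n x input channels
        self.expand = nn.Sequential(                      # 1x1 conv raises the channel count
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.SiLU())
        self.dw = nn.Sequential(                          # DW conv: one filter per channel
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU())
        self.se = nn.Sequential(                          # SE attention on the expanded features
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid, in_ch // 4, 1), nn.SiLU(),     # first layer: in_ch / 4 nodes
            nn.Conv2d(in_ch // 4, mid, 1), nn.Sigmoid())  # second layer: back to mid channels
        self.project = nn.Sequential(                     # final 1x1 conv
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))
        self.residual = stride == 1 and in_ch == out_ch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.dw(self.expand(x))
        h = h * self.se(h)                                # reweight channels by attention
        h = self.project(h)
        return x + h if self.residual else h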
The module structure of the SE attention mechanism, shown in Figure 2, is composed of global average pooling and two fully connected layers. The feature map matrix after the average pooling is converted into a one-dimensional vector and is then passed through two fully connected layers one after the other. The number of nodes in the first fully connected layer is a quarter of the number of channels in the feature matrix input to the MBConv module. The number of nodes in the second fully connected layer is identical to the number of channels in the feature matrix input to the SE module.
The output of the second fully connected layer in the SE module structure is the attention weights of each channel of the feature matrix input to the SE module. The higher the weight value of the channel, the more important it is. Conversely, the lower the weight value of the channel, the less important it is.
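Matching the description above, a standalone sketch of the SE module with explicit fully connected layers follows; the ReLU between the two FC layers is an assumption, not something stated in the text.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """SE sketch: global average pooling followed by two fully connected layers.
    block_in_ch is the channel count entering the MBConv module, so the first FC
    layer has block_in_ch // 4 nodes; the second restores se_in_ch channels."""
    def __init__(self, se_in_ch: int, block_in_ch: int):
        super().__init__()
        self.fc1 = nn.Linear(se_in_ch, block_in_ch // 4)
        self.fc2 = nn.Linear(block_in_ch // 4, se_in_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))                 # global average pooling -> one value per channel
        w = torch.relu(self.fc1(w))            # activation between the FC layers (assumed)
        w = torch.sigmoid(self.fc2(w))         # per-channel attention weights in (0, 1)
        return x * w.view(b, c, 1, 1)          # higher weight = more important channel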
The purpose of the Fused-MBConv module is to address the slowdown in network speed caused by DW convolution. The structure of Fused-MBConv is shown in Figure 3. Fused-MBConv replaces the 1 × 1 expansion convolution and the DW convolution of MBConv with a single regular 3 × 3 convolution, which makes better use of hardware accelerators and speeds up network training.
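A minimal sketch of the Fused-MBConv block, under the same illustrative assumptions as the MBConv sketch above:

import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    """Minimal Fused-MBConv sketch: one regular 3x3 conv replaces the 1x1
    expansion and DW convolutions of MBConv."""
    def __init__(self, in_ch: int, out_ch: int, expand: int = 4, stride: int = 1):
        super().__init__()
        mid = in_ch * expand
        self.fused = nn.Sequential(            # single 3x3 conv: expansion + spatial mixing
            nn.Conv2d(in_ch, mid, 3, stride, 1, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU())
        self.project = nn.Sequential(
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))
        self.residual = stride == 1 and in_ch == out_ch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.project(self.fused(x))
        return x + h if self.residual else h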

2.2. EfficientNetV2 Network Based on CAM-Hardswish

The attention mechanism module adds weights so that the model learns which parts of the content are more important, amplifying important features and improving model accuracy. The SE attention mechanism module is used in both EfficientNetV1 and V2. In coal and gangue recognition, relying solely on a single pooling layer can lose essential information. To improve the extraction of critical information from coal and gangue images by EfficientNetV2, we devised a new block structure that uses EfficientNetV2-S as the basic feature network backbone and the CAM attention mechanism module. This module differs from the SE module by introducing parallel pooling layers, allowing a more comprehensive and richer extraction of high-level features. Compared with a single pooling operation, the parallel pooling of AvgPool and MaxPool minimizes information loss and improves efficiency. The CAM attention mechanism is similar to the SE attention mechanism; both are channel attention mechanisms. The structure of the CAM module is shown in Figure 4. The CAM module consists of two parallel pooling layers (max pooling and global average pooling) and a shared MLP module.
The feature map matrix input to the module first passes through the two parallel pooling layers (max pooling and average pooling). The pooling operation does not change the number of channels of the feature map, only its spatial size, reducing the feature map matrix from C × H × W to C × 1 × 1, where C, H, and W are the number of channels, height, and width of the feature map, respectively. Each pooled result then passes through the shared MLP module, which first compresses the number of channels to 1/R of the original (where R is the reduction ratio), applies a ReLU activation, and then expands back to the original number of channels, yielding two activated results. Finally, the two outputs are added element by element and passed through a sigmoid activation function to obtain the final channel weights, which are multiplied with the original feature map to restore the C × H × W size.
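A minimal PyTorch sketch of this channel attention computation follows; the default reduction ratio R = 16 and the use of Linear layers for the shared MLP are conventional choices, not values reported in the paper.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM sketch: parallel max/avg pooling feed a shared MLP; the two outputs
    are summed, passed through a sigmoid, and used to reweight the channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP: compress to C/R, expand to C
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = x.mean(dim=(2, 3))                       # AvgPool: C x H x W -> C x 1 x 1
        mx = x.amax(dim=(2, 3))                        # MaxPool: C x H x W -> C x 1 x 1
        w = torch.sigmoid(self.mlp(avg) + self.mlp(mx))    # element-wise sum, then sigmoid
        return x * w.view(b, c, 1, 1)                  # multiply back to C x H x W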
After replacing the SE module in the EfficientNetV2 structure with the CAM module, a new MBConv module and a Fused-MBConv module are formed. The module structure is shown in Figure 5.
The activation function in the original CAM attention mechanism module is ReLU. Its advantage is that it improves the training of neural networks and converges quickly. However, ReLU suffers from dead neurons and vanishing gradients: if the input value is negative, the neuron can die and stop updating its parameters, and if the input value is too large, the gradient vanishes and the neural network becomes ineffective. As a result, the network cannot learn useful features effectively, which affects model performance and accuracy. To further improve the performance of the neural network, the Hardswish activation function is introduced, which offers good numerical stability and fast computation. Hardswish combines the ReLU6 and hard sigmoid activation functions to provide faster computation, solve complex calculation and derivation problems, compensate for the shortcomings of ReLU, and improve the effectiveness of neural network training. The function is expressed as follows:
$$\mathrm{hswish}(x) = x \cdot \frac{\mathrm{ReLU6}(x + 3)}{6}$$
where x represents the input pixel value.
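For reference, the function can be transcribed directly in PyTorch (torch.nn.Hardswish provides the same operation as a built-in module):

import torch

def hardswish(x: torch.Tensor) -> torch.Tensor:
    """Direct transcription of the formula: x * ReLU6(x + 3) / 6."""
    return x * torch.clamp(x + 3, min=0, max=6) / 6    # clamp to [0, 6] implements ReLU6

x = torch.linspace(-5, 5, 5)
print(hardswish(x))                                    # agrees with torch.nn.Hardswish()(x)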

2.3. Setting of Network Structure Hyperparameters

From input to classification, images pass through a series of convolution and pooling operations that progressively abstract information into a final output feature map. Several structural hyperparameters affect the effectiveness of network training, such as image resolution, network depth, network width, the number and size of convolutional kernels, and the number of convolutional layers. Classic image classification models are mostly trained on ImageNet at a resolution of 224 pixels × 224 pixels; the EfficientNet network differs in that the image size it requires is considerably larger.
In the EfficientNet network, a deeper structure and a larger input image size lead to high video memory usage and slower training on different datasets. To further improve training speed, the network structure hyperparameters were explored and modified without compromising model accuracy. The structural parameters of the EfficientNetV2 network modified to improve training speed are shown in Table 1.
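For illustration, the modified stage settings of Table 1 can be written as a plain configuration list; this is a transcription sketch, not the paper's code.

# Transcription sketch of the modified stage settings in Table 1, one tuple per
# stage: (block, output channels, stride, layers). Expansion ratios are omitted
# because the table does not list them; the Conv1x1 & Pooling & FC head follows
# these stages.
stages = [
    ("Conv3x3",          24, 2,  1),
    ("Fused-MBConv3x3",  24, 1,  2),
    ("Fused-MBConv3x3",  48, 2,  2),
    ("Fused-MBConv3x3",  64, 2,  2),
    ("MBConv3x3",       128, 2,  3),
    ("MBConv3x3",       160, 1,  6),
    ("MBConv3x3",       256, 2, 10),
]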
In this paper, the network that integrates the CAM attention mechanism module and the Hardswish activation function and modifies the hyperparameters of the network structure is named EfficientNetV2-CAMHardswish.

3. Image Data Preprocessing

3.1. Acquisition of Image Data

The image data collected for this experiment comprise 200 coal samples and 200 gangue samples, captured with an industrial camera, as shown in Figure 6. Coal and gangue display distinct physical characteristics: the coal samples exhibit darker color, rounder contours, and reflective areas, as shown in Figure 6a, whereas the collected gangue samples are grayish-white, have rougher textures, and lack reflections, as shown in Figure 6b.
The accuracy of a deep recognition network model is heavily influenced by the quality of the dataset. As there are no public datasets of coal and gangue images, the image data had to be acquired manually. Preprocessing the image data produced a coal and gangue dataset with rich feature information.

3.2. Coal and Gangue Image Data Preprocessing

Data augmentation is applied before the image data are fed into the network. It makes effective use of limited data, increasing the intrinsic value of the dataset and enriching its content.
The preprocessing of the coal and gangue image data is divided into two parts: pipeline-based image processing and a histogram equalization operation.

3.2.1. Image Processing of Coal and Gangue Based on Pipeline

This is a stochastic, pipeline-based approach that chains together operations, each executed with a user-defined probability, on every image that passes through it. Most operations are parameterized, and the parameters are chosen randomly within user-specified ranges, so the degree of variation can be controlled. The main operations include geometric transformation of the image; adjustment of brightness, chroma, contrast, and sharpness; random occlusion; and other operations. An example pipeline is shown in Figure 7.
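This description maps naturally onto the Augmentor library cited in [13]. A minimal sketch follows; the directory path, probabilities, and parameter ranges are illustrative assumptions, not the paper's exact settings.

import Augmentor

p = Augmentor.Pipeline("data/coal_gangue/raw")                            # hypothetical path
p.rotate(probability=0.7, max_left_rotation=15, max_right_rotation=15)    # geometric transform
p.zoom(probability=0.5, min_factor=1.1, max_factor=1.4)
p.flip_left_right(probability=0.5)
p.random_brightness(probability=0.5, min_factor=0.7, max_factor=1.3)      # brightness
p.random_color(probability=0.3, min_factor=0.8, max_factor=1.2)           # chroma
p.random_contrast(probability=0.3, min_factor=0.8, max_factor=1.2)        # contrast
p.random_erasing(probability=0.3, rectangle_area=0.2)                     # random occlusion
p.sample(6272)   # each sampled image takes a random path through the pipeline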

3.2.2. Histogram Equalization Operation

Histogram equalization differs from contrast adjustment: it improves the overall contrast of an image by correcting the poor visibility caused by an uneven distribution of gray levels. As an effective image enhancement technique, it is particularly useful for images with extremely bright or dark backgrounds, as shown in Figure 8.
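As an illustration of this operation, histogram equalization can be applied with OpenCV; a minimal sketch, assuming a color image file and equalizing only the luminance channel so that color is preserved:

import cv2

img = cv2.imread("gangue_sample.png")                    # hypothetical file name
ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)
ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])        # spread out the gray-level histogram
cv2.imwrite("gangue_equalized.png", cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR))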
After performing data enhancement and other operations on the raw dataset, a total of 6272 coal and gangue images were obtained, consisting of 3223 coal images and 3049 gangue images.
Subsequently, data-augmented coal and gangue images are fed into the EfficientNetV2-CAMHardswish network for training. The size of input images is normalized to 224 pixels × 224 pixels to fit the network model.

4. Experimental Results and Analysis

4.1. Training of CNN Model

The simulation environment is as follows: Windows 10, an Intel(R) Xeon(R) Platinum 8255C CPU, 43 GB of memory, and an NVIDIA RTX 3080 graphics card with 10 GB of video memory. A deep coal and gangue recognition network was built on the PyTorch 1.11.0 framework, and training was accelerated through CUDA (Compute Unified Device Architecture).
The model configuration parameters are as follows: the network input size is 224 pixels × 224 pixels; the number of input channels is three; the number of epochs is 100; the optimizer is Adam; the learning rate is 0.0001; and the batch size is 16.
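A hedged reconstruction of this training setup in PyTorch is sketched below; the dataset path and the EfficientNetV2CAMHardswish class are placeholders for the paper's own data and model, not published artifacts.

import torch
from torch import nn, optim
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
tfm = transforms.Compose([
    transforms.Resize((224, 224)),                      # input size: 224 x 224, 3 channels
    transforms.ToTensor()])
train_set = datasets.ImageFolder("data/coal_gangue/train", transform=tfm)
loader = torch.utils.data.DataLoader(train_set, batch_size=16, shuffle=True)

model = EfficientNetV2CAMHardswish(num_classes=2).to(device)   # hypothetical model class
optimizer = optim.Adam(model.parameters(), lr=1e-4)            # Adam, learning rate 0.0001
criterion = nn.CrossEntropyLoss()

for epoch in range(100):                                       # 100 epochs
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()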

4.2. Evaluation Index

In this paper, the second- and third-level indicators derived from the confusion matrix, together with training speed, are the main performance indicators used to evaluate the model.
The confusion matrix is a widely used method for evaluating classification models, as shown in Table 2. Predictions are categorized as positive or negative, yielding four basic counts (TP, FN, FP, and TN), known as first-level metrics; together, these four values form the confusion matrix. Higher values of TP and TN and lower values of FN and FP indicate better classification.
The confusion matrix counts the correctly and incorrectly classified observations, but it is difficult to judge model quality from these counts alone. For this reason, second- and third-level indicators are often used for evaluation. Accuracy and recall were selected as the second-level indicators for model evaluation. Accuracy is calculated as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
The formula for calculating the recall rate is as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
The third-level metric is the F1 score, which evaluates a binary classification model by combining its precision and recall. The larger the F1 score, the better the model performs. The F1 score is calculated as follows:
$$F1\ \mathrm{Score} = \frac{2PR}{P + R}$$
where P is precision and R is recall.
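These indicators are straightforward to compute from the four first-level counts; a small sketch follows, with illustrative counts rather than the paper's measured values.

def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Second- and third-level indicators computed from the first-level counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall, "f1": f1}

# Illustrative counts, not the paper's measured values:
print(classification_metrics(tp=95, tn=93, fp=5, fn=7))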
The model’s training speed is a crucial metric for assessing its performance. It is reflected by two measures: model training time and inference time (or inference speed). Shorter training and inference times indicate better performance.

4.3. Comparative Analysis of Model Performance

To verify the effectiveness of the improved EfficientNetV2 network, performance was compared experimentally before and after modifying the attention mechanism module and activation function on the EfficientNetV2-S basic feature network backbone. Four networks were compared: EfficientNetV2-SE with the SE attention mechanism and ReLU activation function, EfficientNetV2-CAMReLU with the CAM attention mechanism and ReLU activation function, EfficientNetV2-CBAM with the CBAM attention mechanism and ReLU activation function, and the EfficientNetV2-CAMHardswish network. Figure 9 shows the test accuracy graph, where EV2 stands for EfficientNetV2-S.
According to Figure 9, the accuracy curve of the improved EfficientNetV2 network is higher and converges faster than that of the original network. The graph shows that after 50 epochs, the EfficientNetV2-S network based on CAM-Hardswish converged more quickly and steadily and achieved an accuracy improvement of 3.98% compared with the original model, reaching 98.24%. As shown in Table 3, the introduced CAM module and the usage of the Hardswish activation function resulted in a significant increase in the accuracy of the newly formed network model as compared with the original network and its variants.

4.4. Comparative Analysis of Model Training Speeds

We also conducted a comparative analysis of training speed between EfficientNetV2-S, EfficientNetV2-M, EfficientNetV2-L, and the improved network to ensure that the latter did not sacrifice performance. EfficientNetV2-S, EfficientNetV2-M, and EfficientNetV2-L are three versions of the EfficientNetV2 network that differ mainly in input image size and network depth (number of layers): EfficientNetV2-S takes inputs of 300 pixels × 300 pixels and is 42 layers deep; EfficientNetV2-M takes inputs of 384 pixels × 384 pixels and is 59 layers deep; and EfficientNetV2-L takes inputs of 384 pixels × 384 pixels and is 81 layers deep. As shown in Table 4, the training speed of the modified model surpasses that of all three versions of the original network, and it also outperforms them in terms of accuracy.

4.5. Comparison between EfficientNetV2-CAMHardswish and Mainstream Models

To further evaluate the improved model’s performance, its test accuracy, recall, F1 score, and training speed were used as evaluation metrics to compare it quantitatively with network models that have achieved exceptional image classification performance in recent years. The results of this analysis are presented in Table 5.
As shown in Table 5, the construction of the coal and gangue recognition network with EfficientNetV2-CAMHardswish architecture in this paper led to significant improvements in both the recognition accuracy and training speed. Consequently, the revised EfficientNetV2 network is an apt choice for actual coal and gangue production conditions.

5. Conclusions

The aim of this paper was to recognize coal and gangue efficiently by combining image data preprocessing and deep learning technology. Pipeline-based preprocessing and image processing techniques were applied to the small-sample coal and gangue dataset to simulate harsh production conditions under actual working conditions and to obtain coal and gangue samples with multiscale shapes, multiple morphologies, and varied color. The improved EfficientNetV2 network was then trained, the performance of the model was tested, and the following conclusions were drawn:
(1)
Compared with the SE module in the original model, the CAM channel attention module introduced in this study adopts a parallel connection of average pooling and max pooling. This module efficiently extracts essential image information and solves the information loss caused by the SE module’s single pooling operation. To further improve the performance of the coal and gangue recognition network, the ReLU activation function was replaced by the Hardswish activation function, which improved recognition accuracy.
(2)
The original EfficientNetV2 network suffered from excessive video memory usage, which reduced training speed and efficiency. The coal and gangue recognition network proposed in this study significantly increased training speed, reduced video memory usage, and shortened training time without compromising model accuracy.
(3)
This study produced an improved EfficientNetV2 network with excellent performance in recognizing coal and gangue. It achieves a recognition rate of 98.24%, an improvement of 3.98% over the original network, and reduces inference time by 6.6 ms compared with its predecessor. The effectiveness of the improved model was thus validated.

Author Contributions

Conceptualization, N.L. (Na Li); methodology, N.L. (Na Li) and J.X.; supervision, S.W. and K.Q.; investigation, J.X.; writing—original draft preparation, J.X.; writing—review and editing, S.W., K.Q. and N.L. (Na Liu); funding acquisition, N.L. (Na Li). All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (grant no.: 62002285).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available because the dataset is self-constructed and will continue to be used in subsequent studies.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gao, L. Analysis on composition characteristics and resource utilization status of coal gangue. Jiangxi Coal Sci. Technol. 2022, 176, 233–235+238. [Google Scholar]
  2. Wang, X.; Niu, Y. Review of research on coal gangue: Classification, hazard and comprehensive utilization. Ind. Miner. Process. 2023. Available online: https://kns.cnki.net/kcms/detail/32.1492.TQ.20230228.1822.002.html (accessed on 7 June 2023).
  3. Cao, X.; Li, Y.; Wang, P. Coal and gangue recognition method research status and prospect. J. Ind. Autom. 2020, 46, 38–43. [Google Scholar]
  4. Li, L.; Zhu, J.; Liu, L. Coal and gangue recognition method based on density difference study. J. Coal Technol. 2022, 41, 181–184. [Google Scholar]
  5. Huo, P.; Zeng, H.; Huo, K. Research on coal gangue density recognition system based on image processing. Coal Prep. Technol. 2015, 249, 69–73. [Google Scholar]
  6. Pan, Y.; Zeng, Z.; Zhang, E. Research on Recognition of coal gangue in X-ray Detection Based on MATLAB and Image Gray Value. Coal Technol. 2017, 36, 307–309. [Google Scholar]
  7. Guo, Y.; He, L.; Liu, P. Coal gangue dual-energy X-ray image dimensional analysis recognition method. J. China Coal Soc. 2021, 46, 300–309. [Google Scholar]
  8. Kiljan, P.; Moczulski, W.; Kalinowski, K. Initial Study into the Possible Use of Digital Sound Processing for the Development of Automatic Longwall Shearer Operation. Energies 2021, 14, 2877. [Google Scholar] [CrossRef]
  9. Wu, F.; Yang, Z.; Mo, X.; Wu, Z.; Tang, W.; Duan, J.; Zou, X. Detection and counting of banana bunches by integrating deep learning and classic image-processing algorithms. Comput. Electron. Agric. 2023, 209, 107827. [Google Scholar] [CrossRef]
  10. Tang, Y.; Huang, Z.; Chen, Z.; Chen, M.; Zhou, H.; Zhang, H.; Sun, J. Novel visual crack width measurement based on backbone double-scale features for improved detection automation. Eng. Struct. 2023, 274, 115158. [Google Scholar] [CrossRef]
  11. Li, N.; Gong, X.; Jia, P. Segmentation method for low quality images of coal and gangue based on Retinex and local texture features with multifractal. J. Electron. Imaging 2022, 31, 061820. [Google Scholar] [CrossRef]
  12. Li, N.; Gong, X. An Image Preprocessing Model of Coal and Gangue in High Dust and Low Light Conditions Based on the Joint Enhancement Algorithm. Comput. Intel. Neurosc. 2021, 2021, 2436486. [Google Scholar] [CrossRef] [PubMed]
  13. Bloice, M.; Stocker, C.; Holzinger, A. Augmentor: An image augmentation library for machine learning. arXiv 2017, arXiv:1708.04680. [Google Scholar] [CrossRef]
  14. Wu, G.; Liang, X.; Hu, J.; Zhang, X. Coal gangue recognition method based on image processing and convolutional Neural Network. Microcomput. Appl. 2021, 37, 100–103. [Google Scholar]
  15. He, K.; Zhang, X.; Ren, S. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  16. Woo, S.; Park, J.; Lee, J. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  17. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  18. Gao, Y.; Zhang, B.; Lang, L. Based on in-depth study of coal gangue identification technology and implementation. Coal Sci. Technol. 2021, 49, 202–208. [Google Scholar]
  19. Howard, A.; Sandler, M.; Chu, G. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  20. Ramachandran, P.; Zoph, B.; Le, Q. Searching for activation functions. arXiv 2017, arXiv:1710.05941. [Google Scholar]
  21. Pan, X.; Cao, Y.; Jia, R. Neural network architecture search development review. J. Xi’an Univ. Posts Telecommun. 2022, 27, 43–63. [Google Scholar]
  22. Hong, H.; Zheng, L.; Zhu, J. Automatic recognition of coal and gangue based on convolution neural network. arXiv 2017, arXiv:1712.00720. [Google Scholar]
  23. Li, D.; Zhang, Z.; Xu, Z. An image-based hierarchical deep learning framework for coal and gangue detection. IEEE Access 2019, 7, 184686–184699. [Google Scholar] [CrossRef]
  24. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  25. Liu, Z.; Mao, H.; Wu, C. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE/CVF), New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  26. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  27. Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; pp. 10096–10106. [Google Scholar]
Figure 1. MBConv structure.
Figure 2. SE structure.
Figure 3. Fused-MBConv structure.
Figure 4. CAM structure.
Figure 5. Modified MBConv (a) and Fused-MBConv (b) modules.
Figure 6. Partial sample images ((a) partial sample of coal; (b) partial sample of gangue).
Figure 7. Example image preprocessing pipeline (the original image undergoes a series of pipeline operations to generate a new multiscale image set).
Figure 8. Example diagram of the histogram equalization operation.
Figure 9. Test accuracy of the improved network model compared with the original network and its variants.
Table 1. Structural parameters of the EfficientNetV2 network after improving network training speed.

Stage | Block                   | Channels | Stride | Layers
0     | Conv3×3                 | 24       | 2      | 1
1     | Fused-MBConv3×3         | 24       | 1      | 2
2     | Fused-MBConv3×3         | 48       | 2      | 2
3     | Fused-MBConv3×3         | 64       | 2      | 2
4     | MBConv3×3               | 128      | 2      | 3
5     | MBConv3×3               | 160      | 1      | 6
6     | MBConv3×3               | 256      | 2      | 10
7     | Conv1×1 & Pooling & FC  | 1280     | 1 or 2 | 1
Table 2. Confusion matrix.

Confusion Matrix          | True Label: Positive | True Label: Negative
Predicted Label: Positive | TP                   | FP
Predicted Label: Negative | FN                   | TN
Table 3. Comparative analysis of model performance.

Model            | Attention Mechanism | Activation Function | Accuracy (%) | Recall (%) | F1 Score | Training Time | Inference Time (ms)
EV2-SE           | SE                  | ReLU                | 94.26        | 90.4       | 0.942    | 1 h 6 min     | 25.7
EV2-CBAM         | CBAM                | ReLU                | 97.45        | 96.9       | 0.973    | 1 h 31 min    | 34
EV2-CAMReLU      | CAM                 | ReLU                | 97.87        | 98.6       | 0.984    | 1 h 35 min    | 28
EV2-CAMHardswish | CAM                 | Hardswish           | 98.24        | 99.1       | 0.989    | 48 min        | 19.1
Table 4. Model training speed comparison.

Model                        | Accuracy (%) | Recall (%) | F1 Score | Training Time | Inference Time (ms)
EfficientNetV2-S             | 94.26        | 90.4       | 0.942    | 1 h 6 min     | 25.7
EfficientNetV2-M             | 94.57        | 93.8       | 0.962    | 2 h 13 min    | 38.2
EfficientNetV2-L             | 95.0         | 92.9       | 0.96     | 3 h 45 min    | 54.8
EfficientNetV2-CAMHardswish  | 98.24        | 99.1       | 0.989    | 48 min        | 19.1
Table 5. Model evaluation results.

Model                        | Accuracy (%) | Recall (%) | F1 Score | Training Time | Inference Time (ms)
Vision Transformer           | 94.31        | 94.2       | 0.952    | 1 h 17 min    | 8.2
MobileNetV3-large            | 94.57        | 94.1       | 0.956    | 30 min        | 8.5
ConvNeXt                     | 94.0         | 93.8       | 0.954    | 1 h 8 min     | 6.5
EfficientNetV2-CAMHardswish  | 98.24        | 99.1       | 0.989    | 48 min        | 19.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
