Article

Anomaly Detection in Chest X-rays Based on Dual-Attention Mechanism and Multi-Scale Feature Fusion

1 School of Computer and Artificial Intelligence, Xiangnan University, Chenzhou 423300, China
2 Hunan Engineering Research Center of Advanced Embedded Computing and Intelligent Medical Systems, Xiangnan University, Chenzhou 423300, China
3 Key Laboratory of Medical Imaging and Artificial Intelligence of Hunan Province, Xiangnan University, Chenzhou 423300, China
4 School of Computer Science and Technology, Jilin University, Changchun 130012, China
* Author to whom correspondence should be addressed.
Symmetry 2023, 15(3), 668; https://doi.org/10.3390/sym15030668
Submission received: 1 February 2023 / Revised: 3 March 2023 / Accepted: 5 March 2023 / Published: 7 March 2023
(This article belongs to the Special Issue Symmetry/Asymmetry in Computer Vision and Image Processing)

Abstract

The efficient and automatic detection of chest abnormalities is vital for the auxiliary diagnosis of medical images. Many studies utilize computer vision and deep learning approaches involving symmetry and asymmetry concepts to detect chest abnormalities and have achieved promising results. However, accurate instance-level and multi-label detection of abnormalities in chest X-rays remains a significant challenge. Here, a novel anomaly detection method for symmetric chest X-rays using dual-attention and multi-scale feature fusion is proposed. Three aspects of our method should be noted in comparison with the previous approaches. First, we improved the deep neural network with channel-dimensional and spatial-dimensional attention to capture abundant contextual features. Second, we used an optimized multi-scale learning framework for feature fusion to adapt to the scale variation of the abnormalities. Third, considering the influence of data imbalance and other factors, we introduced a seesaw loss function to flexibly adjust the sample weights and enhance the model learning efficiency. A rigorous experimental evaluation on a public chest X-ray dataset with fourteen different types of abnormalities demonstrates that our model achieves a mean average precision of 0.362 and outperforms existing methods.

1. Introduction

The rapid and accurate diagnosis of chest abnormalities is crucial, as some paroxysmal diseases may cause death within a very short time [1]. Medical imaging, as a widespread screening method, is of great significance for thoracic anomaly detection, particularly for detecting heart and lung diseases and skeletal abnormalities [2]. Chest X-ray, as one of the most common types of medical imaging, is usually the first radiological examination in clinical diagnosis. Indeed, chest X-rays have remarkable advantages, such as low cost, fast imaging speed, a low radiation dose, and reasonable sensitivity to various pathologies [3], allowing them to play a vital role in routine screening and emergency care for conditions such as heart failure, pneumonia, pulmonary edema, pulmonary nodules, pleurisy, and chest foreign bodies.
Currently, chest X-rays are mainly read manually by professional radiologists, who are primarily limited by their experience, efficiency, and the complexity of the images themselves. The analysis and recognition of chest X-rays are formidable challenges, because their anatomical structures and pathological features are extremely complicated. It is difficult to identify abnormalities in certain parts, especially multi-scale abnormalities located in indistinguishable regions with different appearances. Therefore, observing and analyzing chest X-rays are time-consuming and labor-intensive tasks for radiologists. With the development of artificial intelligence and big data, data-driven computer vision technology has played an increasingly significant role in assisting radiologists in the detection and diagnosis of illnesses. The intelligent computing and analysis of medical images can not only effectively relieve the working pressure of radiologists, but also increase the diagnosis speed and improve sensitivity to subtle abnormalities [4]. Considering the interpretation complexity and clinical value of chest X-ray images, many scholars have been inspired to research automated algorithms for identifying heart and lung diseases.
Recently, deep learning technology has demonstrated the advantages of the automatic and efficient analysis of big data, and it is widely used in various fields, including medical image analysis [5]. Numerous variants of convolutional neural networks (CNNs) have been developed. These variants achieved remarkable performance in numerous medical analysis tasks, comparable to that of professional radiologists [6,7,8]. Specifically, many advanced CNN algorithms have demonstrated promising performance in the interpretation of chest X-rays. Based on the type of visual label, chest X-ray analysis can be roughly divided into image-level prediction, image segmentation, and instance-level detection. First, image-level prediction is the process of predicting a category label or a set of continuous values for the entire image. The classification labels can include common abnormalities, such as pneumonia and emphysema, and the regression value may indicate the severity score of a particular anomaly. In terms of image-level prediction, a representative study of chest X-rays was performed on the ChestX-ray14 dataset [9]. Some studies used classification networks such as ResNet [10] and DenseNet [11] to predict the anomaly labels of the ChestX-ray14 dataset, and then showed abnormal regions through visualization methods, such as class activation mapping (CAM) [12] or gradient-weighted class activation mapping (Grad-CAM) [13]. These strategies can improve the interpretability of the network to some extent, but they cannot be used for an accurate quantitative analysis of the anomaly localization accuracy. Second, chest X-ray segmentation refers to the labeling of each pixel in the image. In the chest X-ray imaging domain, segmentation targets can include organs, anomalies, the lungs, the skeleton, and external objects [14]. These tasks usually segment only one object of interest, and all remaining pixels are marked as background, because accurately annotating multiple outlines of chest abnormalities in X-ray images is difficult and time-consuming. Finally, chest X-ray instance-level prediction refers to the labeling of one or multiple specific regions, typically rectangular boxes, in a chest X-ray image. This prediction pays more attention to the detection of anomalous locations and provides rapid and practical assistance for clinical diagnosis. In this study, we aim to improve deep networks and achieve more accurate instance-level predictions under supervised learning.
Many deep networks have been developed to achieve more accurate and faster detection, such as the YOLO series models [15], RetinaNet [16], Faster R-CNN [17], and Swin Transformer [18]. Symmetry and asymmetry ideas are frequently used in the design of these methods. For example, symmetrical convolutions or network architectures can be developed to capture hierarchical features. Symmetry has been incorporated as an inductive bias into deep models to enhance their generalization performance [19]. Symmetrical convolutional block attention has been designed to improve feature maps in deep networks [20]. Asymmetric skip connections have been proposed to reduce the degradation of deep neural networks [10]. Some of these methods have been improved and applied to chest X-rays, significantly enhancing the performance of anomaly detection. However, accurate chest radiograph detection remains problematic. First, many chest X-ray datasets are not adequately annotated, and many have category imbalance problems that affect the effectiveness of deep learning models. Furthermore, most current studies focus on the detection of one or a few types of chest anomalies, such as lung nodules and pneumothoraxes, whereas there are relatively few studies on the detection of multi-label chest abnormalities. Yet heart and lung diseases are often concurrent, so the detection of multi-label chest abnormalities has crucial clinical value. In general, anomaly detection for chest X-rays still has plenty of room for improvement, especially for multi-label and instance-level detection tasks, owing to the subtle changes and diversity of chest anomalies, the complexity of pathological features, and the limitations of X-ray imaging characteristics.
Therefore, to address the above problems, this study is devoted to the multi-label and instance-level anomaly detection of chest X-rays. To summarize, our contribution is threefold: (1) We propose a novel network framework with the dual embedding of channel attention and spatial context attention mechanisms, which can flexibly capture rich context information. (2) An enhanced structure, Aug-FPN, is introduced to optimize the multi-scale feature fusion process. (3) We introduce seesaw loss as the classification loss to dynamically reduce the excessive negative sample gradient imposed on the tail category by the head category to effectively overcome the influence of an imbalance in the medical data distribution. Rigorous experimental evaluations on a benchmark dataset with 14 types of chest abnormalities demonstrate that the proposed method outperforms representative conventional models and compares favorably with more sophisticated approaches.
The remainder of this article is structured as follows. First, the related work is introduced in Section 2. The proposed detection approach for chest X-rays is described in Section 3. The experimental evaluation and results are described in Section 4. Finally, the conclusions and future work are presented in the last section.

2. Related Work

Many approaches with various strategies have been proposed for automatic chest X-ray image analysis, including the recognition or localization of specific categories of anomalies and the recognition of multi-label anomalies. Next, we summarize the related work, moving from the detection of specific chest diseases to the detection of multiple chest diseases using computer vision technology.

2.1. Chest Anomaly Detection of Specific Diseases

Among the numerous chest diseases, pulmonary nodules, pneumonia, and pneumothoraxes are the most common.
Taking pulmonary nodule detection as an example, Schultheiss et al. [21] proposed a RetinaNet network to detect pulmonary nodules and used the UNet network to generate a pulmonary mask and effectively remove redundant detections. Their convincing experimental results demonstrated the network's potential to help doctors detect pulmonary nodules in clinical practice. Capizzi et al. [22] proposed an effective method for small lung nodule detection, in which a fuzzy logic system was employed for segmentation and a probabilistic neural network with bioinspired reinforcement learning was further designed to improve the detection performance. In the study of Peng et al. [23], a two-stage lung anomaly detection strategy was proposed: first, Mask-RCNN [24] was used to obtain rough segmentation results, and then a closed principal curve and multiple learning strategies were employed to optimize the results. Li et al. [25] proposed a multi-instance learning strategy for pulmonary nodule recognition with a performance comparable to that of radiologists. The image was first segmented and pre-processed with rib suppression, and image blocks centered on the lung field region were extracted. Then, a multi-resolution CNN was utilized to extract the features of the different image blocks. Finally, a fusion strategy was developed to classify lung nodules. The COVID-19 outbreak has posed a significant challenge for all humankind, and the auxiliary diagnosis of pneumonia has been widely studied. Many deep learning models have been designed to identify COVID-19 and pneumonia in chest X-rays and generate binary or multi-class predictions. Rahman et al. [26] presented a comprehensive survey of the automated detection of COVID-19 using X-ray-based deep learning models. Specifically, the authors analyzed the diversity of the datasets, the performance of mainstream deep algorithms, and strategies for image processing. The survey also noted challenges for pneumonia detection, such as imbalances in data and the weak interpretability of features. Recently, El-Dahshan et al. [27] provided a novel residual deep framework for COVID-19 detection that achieved superior accuracy in COVID-19 diagnosis. In their study, chest X-rays were first processed using an empirical wavelet transform, and then a ResNet-50 integrated with a temporal convolutional neural network was built for recognition. To address the inability of existing COVID-19 detection studies to capture cross-channel and cross-spatial inter-relationships at multiple ranges, Fan et al. [28] proposed a multi-kernel-size spatial channel attention method to detect COVID-19 in chest X-rays, with the performance verified on three public datasets. Cha et al. [29] used three pre-trained models to extract visual features from X-rays, and then designed an attention mechanism to select these features, improving pneumonia detection accuracy.
A pneumothorax is a potentially life-threatening emergency, and delays in its recognition and treatment can cause serious harm to patients, and even death. Several researchers have conducted studies to detect pneumothoraxes using chest X-rays. Park et al. [30] used a 26-layer YOLO model to identify the location of pneumothoraxes in chest X-rays with an overall accuracy of 87.3%. Another typical work is that of Tolkachev et al. [31], who combined UNet with various classification networks for pneumothorax segmentation. In their study, the recently published challenge dataset from Kaggle was used, and a Dice coefficient of 0.8574 was achieved.

2.2. Chest Anomaly Detection of Multiple Diseases

Compared with single-category anomaly recognition or detection, it is a more complex and arduous challenge to construct a deep learning framework that accurately detects multiple categories of anomalies in chest X-ray images. At present, many studies have chosen to locate abnormalities in the form of classification labels and activation maps, because most public chest X-ray datasets currently only have image-level labels or very few instance-level annotations of rectangular boxes. For example, Hwang et al. [32] improved the DenseNet network to detect malignant pulmonary tumors, active tuberculosis, pneumonia, and pneumothoraxes; their model can predict the image-level probability and pixel-level localization probability map of each anomaly. Pham et al. [33] used domain knowledge and the hierarchical dependence between abnormal labels to train a deep CNN model for atelectasis, cardiomegaly, edema, and lung consolidation recognition. However, these methods generally detect fewer types of anomalies, and their localization accuracy cannot be quantitatively evaluated.
Furthermore, some researchers have used weakly supervised, semi-supervised, and unsupervised learning to recognize or locate chest anomalies. Weakly supervised anomaly detection can identify anomalies using only image-level labels, whereas semi-supervised anomaly detection requires both instance-level and image-level labels. Pesce et al. [34] proposed an attention feedback CNN to detect lung nodules in chest X-ray samples, with the training process completed using semi-supervised learning. In their study, the image features and saliency maps were obtained using the VGG13 network. The localization loss from instance-level labels and the classification loss from image-level labels were combined to supervise the model training; the localization loss acted as useful feedback information and helped to further reduce classification errors. However, saliency maps obtained in this manner are usually noisy, which may lead to increased false positives and low localization quality. Zhao et al. [35] used the context information of symmetric regions to enhance the feature representation of anomaly candidate regions, which improved the performance of chest anomaly detection under weak supervision. However, the positioning accuracy of this method was not sufficiently high, and the prior box generation algorithm reduced the inference speed of the model. Antoni et al. [36] proposed the use of the heuristic red fox optimization algorithm for medical image processing, in which the best threshold was selected in an unsupervised manner; it showed good potential in lung image segmentation and detection. However, this method still faces challenges when dealing with multiple anomalies of varied scales.
According to the discussions above, many studies have proven valuable for detecting chest X-ray anomalies. However, certain constraints still need to be addressed. To this end, we propose a new strategy that utilizes a dual-attention mechanism and multi-scale feature fusion. Fourteen types of chest abnormalities, including aortic enlargement, cardiomegaly, pleural thickening, pulmonary fibrosis, nodules and masses, pulmonary opacity, pleural effusion, infiltration, interstitial lung disease, calcification, lung consolidation, atelectasis, pneumothoraxes, and other lesions, are included with instance-level labels. The proposed method achieves significant improvements compared with other popular detection models.

3. Methods

In this study, we developed a novel dual-attention mechanism and multi-scale feature fusion strategy for the anomaly detection of chest X-rays to obtain more discriminant visual features and improve the detection performance. We first present the feature extraction approach based on a dual-attention residual network in Section 3.1. Then, we introduce the multi-scale feature fusion strategy in Section 3.2. Finally, the design of the loss function is presented in Section 3.3.

3.1. Feature Extraction Based on Dual-Attention Mechanism Residual Network

The proposed overall framework for the anomaly detection of chest X-rays is shown in Figure 1, which mainly included the backbone network feature extraction, multi-scale fusion, and detection head. To extract discriminant features from chest X-ray images, a dual-attention mechanism-guided network (DAMGnet) was proposed as the feature extraction module of the detection model to obtain a feature representation. On the one hand, we used channel dimension attention to ensure the effective interaction of abnormal information between different channels to effectively eliminate the redundant features contained in the recombination features. On the other hand, a spatial context attention mechanism was proposed as a supplement to the channel information. More specifically, channel dimension attention was introduced to explicitly model the inter-dependence between channels and determine the content that needed to be focused on in each layer of the feature map. Spatial context attention was introduced to extract effective spatial context information based on the retained spatial location information, which was used to determine the concerned location in the corresponding feature map. These two types of attention complemented each other and assisted in detecting chest abnormalities. Specifically, our method could be summarized in three key steps. First, the features reconstructed in the channel dimension were fused with the attention weight in the spatial dimension. Subsequently, modulated deformable convolution was used to obtain a more extensive and flexible receptive field and to improve the sensitivity of capturing the context information. Finally, the feature map could be modeled by expressing the spatial correlation between features at different positions.
The network structure of the DAMGnet is presented in Figure 1. Conv1 to Conv5 represent the five stages of the network, and each stage was composed of convolution, max pooling, and a network-building module. Two network-building blocks were used: the BottleNeck module from ResNeSt [37] and the proposed DAM-BottleNeck. The DAM-BottleNeck, as shown in Figure 2, is an improved version of the original ResNeSt building block that embeds a spatial context attention mechanism alongside the split (channel) attention, forming the dual-attention design. The two attention mechanisms are described below.

3.1.1. Split Attention Module

We first introduce the inter-channel attention, called the split attention module. As shown in Figure 1, the DAM-BottleNeck first divided the feature map into K cardinal groups (cardinal 1, cardinal 2, …, cardinal K) along the channel dimension, and then each cardinal group was divided into R smaller split groups (split 1, split 2, …, split R); thus, G = K × R groups were obtained in total. Next, a series of convolution operations was performed on each group of feature maps, and the intermediate result of the convolution operations in each group was expressed as $U_i = F_i(X)$, where $i \in \{1, 2, \ldots, G\}$. Finally, the $U_i$ were taken as the input of the split attention module. The main calculation process of the split attention module was as follows:
STEP 1: Feature compression. The element-wise summation of the feature maps of all split groups in the kth cardinal group was first calculated, denoted as $\hat{U}^k$:

$$\hat{U}^k = \sum_{i=R(k-1)+1}^{Rk} U_i$$

where $k \in \{1, 2, \ldots, K\}$ and $\hat{U}^k \in \mathbb{R}^{H \times W \times C/K}$. H, W, and C are the sizes of the output feature maps of the DAM-BottleNeck module. Then, the channel weight statistics $s^k \in \mathbb{R}^{C/K}$ could be obtained using the global average pooling operation:

$$s^k_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \hat{U}^k_c(i, j)$$

where $s^k_c$ represents the global average pooling result of the cth channel.
STEP 2: Incentive calculation. The attention factors were calculated for the different split groups within each cardinal group. Mathematically, we assumed that $\gamma_i^c$ denoted the fully connected mapping result of the ith split group. Then, the attention factor $a_i^k(c)$ could be calculated as follows:

$$a_i^k(c) = \begin{cases} \dfrac{\exp(\gamma_i^c(s^k))}{\sum_{j=1}^{R} \exp(\gamma_j^c(s^k))}, & \text{if } R > 1 \\[2ex] \dfrac{1}{1 + \exp(-\gamma_i^c(s^k))}, & \text{if } R = 1 \end{cases}$$
It is worth mentioning that all channels within a cardinal group were treated as a whole if R = 1.
STEP 3: Channel-dimension feature recombination. First, weighted fusion over the different split groups in the kth cardinal group was performed according to Equation (4), and the result was denoted as $V^k \in \mathbb{R}^{H \times W \times C/K}$:

$$V^k_c = \sum_{i=1}^{R} a_i^k(c) \, U_{R(k-1)+i}$$

After all the split groups of all the cardinal groups were processed, the outputs of the cardinal groups were concatenated to obtain the recombined result $V = \mathrm{Concat}\{V^1, V^2, \ldots, V^K\}$. Then, a 1 × 1 convolution was performed to fuse the features of the different cardinal groups, and the output Y of the DAM-BottleNeck could be obtained through element-wise addition with the input X:

$$Y = \mathrm{conv}(V) + X$$

where conv denotes the 1 × 1 convolution operation.
The split attention mechanism could effectively learn the correlation between the channel features to avoid the generation of redundant features and further enhance the discriminative ability of visual features for chest image anomaly detection.
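To make the three steps above concrete, the following is a minimal PyTorch sketch of the split attention computation for one cardinal group. It is an illustrative reading of Equations (1)–(5), not the authors' exact implementation: the fully connected mapping $\gamma$ is realized here as two 1 × 1 convolutions, and the class name, layer widths, and reduction ratio are assumed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    # channels: channels per cardinal group (C/K); radix: number of splits R
    def __init__(self, channels: int, radix: int = 2, reduction: int = 4):
        super().__init__()
        self.radix = radix
        inter = max(channels // reduction, 32)
        # gamma: two-layer fully connected mapping, implemented as 1x1 convs
        self.fc1 = nn.Conv2d(channels, inter, kernel_size=1)
        self.fc2 = nn.Conv2d(inter, channels * radix, kernel_size=1)

    def forward(self, splits: torch.Tensor) -> torch.Tensor:
        # splits: (B, R, C, H, W), the R split feature maps U_i of one cardinal group
        b, r, c, h, w = splits.shape
        # STEP 1: element-wise sum over splits, then global average pooling -> s^k
        u_hat = splits.sum(dim=1)                      # (B, C, H, W)
        s = F.adaptive_avg_pool2d(u_hat, 1)            # (B, C, 1, 1)
        # STEP 2: attention factors a_i^k(c); r-softmax, or sigmoid when R == 1
        logits = self.fc2(F.relu(self.fc1(s)))         # (B, R*C, 1, 1)
        logits = logits.view(b, r, c)
        attn = logits.softmax(dim=1) if r > 1 else torch.sigmoid(logits)
        # STEP 3: weighted recombination V^k = sum_i a_i^k(c) * U_i
        return (attn.view(b, r, c, 1, 1) * splits).sum(dim=1)  # (B, C, H, W)

# Usage: R = 2 splits of a 64-channel cardinal group on a 32x32 feature map
sa = SplitAttention(channels=64, radix=2)
v = sa(torch.randn(8, 2, 64, 32, 32))   # (8, 64, 32, 32)
```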

3.1.2. Spatial Context Attention Module

Context provides important information for radiologists when analyzing medical images. To further optimize the feature expression, we proposed embedding a spatial context attention module in the DAM-BottleNeck to characterize the context information between features at different locations. Spatial context attention supplemented the channel feature information and was mainly composed of two aspects.
In fact, the relationship between the region of interest and its surrounding context was modeled to enhance the network's ability to capture relevant discriminant features. We extracted the C4 and C5 layers of the network and replaced the ordinary convolution with a modulated deformable convolution [38]. More specifically, suppose that the convolution kernel has S sampling positions, and let $w_s$ and $p_s$ denote the weight and the predefined offset of the sth position, respectively. Let $X(p)$ and $Y(p)$ denote the input feature map X and the output feature map Y at position p, respectively. The modulated deformable convolution could then be defined as follows:

$$Y(p) = \sum_{s=1}^{S} w_s \cdot X(p + p_s + \Delta p_s) \cdot \Delta m_s$$

where $\Delta p_s$ is the learnable offset at the sth position, and $\Delta m_s \in [0, 1]$ is the modulation scalar at the sth position. Both $\Delta p_s$ and $\Delta m_s$ were obtained through separate convolutions on the input feature map X. Because the position after adding the offsets, $p + p_s + \Delta p_s$, was usually fractional, it did not correspond to actual pixels in the feature map. Therefore, bilinear interpolation was used to obtain the features $X(p + p_s + \Delta p_s)$:

$$X(t) = \sum_{q} \max(0, 1 - |q_x - t_x|) \cdot \max(0, 1 - |q_y - t_y|) \cdot X(q)$$

where $t = p + p_s + \Delta p_s$ denotes an arbitrary fractional position, q enumerates all integral spatial positions in the feature map X, and the subscripts x and y denote the two-dimensional components. Equation (7) sets the value at the interpolated position to the weighted sum of the pixels in its neighborhood. It is also worth explaining why we embedded the modulated deformable convolution only in the C4 and C5 layers rather than throughout the network: the higher network layers extract more abstract overview information, which reduces the interference from redundant background information around the target in the feature map.
With the modulated deformable convolution, adaptive, irregular, and effective receptive fields could be obtained during the learning process; these receptive fields described the spatial context information and salient features among the different locations. Abnormal regions in chest X-rays in particular exhibit complex geometric variations. This spatial-context-aware convolution strategy could effectively perceive these complex changes, thus improving the performance of the model.
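For reference, modulated deformable convolution as described by Equations (6) and (7) is available off the shelf; the sketch below wires it up with torchvision's DeformConv2d, where a plain convolution predicts the offsets $\Delta p_s$ and modulation scalars $\Delta m_s$ from the input, and the bilinear sampling of Equation (7) is handled internally. The offset/mask head and its zero initialization are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ModulatedDeformBlock(nn.Module):
    # Hypothetical drop-in for an ordinary 3x3 conv in the C4/C5 stages.
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        pad = k // 2
        # One conv predicts the 2*k*k offsets (delta p_s) and k*k masks (delta m_s)
        self.offset_mask = nn.Conv2d(in_ch, 3 * k * k, kernel_size=k, padding=pad)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=pad)
        nn.init.zeros_(self.offset_mask.weight)  # zero offsets = regular sampling grid
        nn.init.zeros_(self.offset_mask.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        om = self.offset_mask(x)
        k2 = om.shape[1] // 3
        offset = om[:, : 2 * k2]                    # learnable per-position offsets
        mask = torch.sigmoid(om[:, 2 * k2 :])       # modulation scalars in [0, 1]
        return self.deform(x, offset, mask)         # bilinear sampling inside

y = ModulatedDeformBlock(256, 256)(torch.randn(1, 256, 32, 32))  # (1, 256, 32, 32)
```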

3.2. Multi-Scale Feature Fusion Framework

For object detection tasks, the feature pyramid network (FPN), as an embedded feature fusion framework, demonstrated good performance in dealing with multi-scale targets. However, the original FPN had some drawbacks, and there was room for improvement.
In this study, we introduced an optimized FPN for multi-scale feature fusion inspired by Guo et al. [39], called Aug-FPN. Aug-FPN improved the FPN in three stages: before, during, and after the fusion. Its model structure is shown in Figure 3. As in the FPN, each layer of the feature pyramid was selected from {C2, C3, C4, C5}; {M2, M3, M4, M5} were the reduced features after a 1 × 1 convolution operation; and the {P2, P3, P4, P5} features were obtained via the pyramid.
Compared with the original FPN, Aug-FPN was optimized from three aspects.
(1) Consistent supervision: The same supervision signal was applied to the multi-scale features before the fusion. Specifically, each candidate region obtained from the deep network was mapped to {M2, M3, M4, M5} to obtain the corresponding feature maps. Next, classification and regression operations were performed directly on these feature maps to calculate an auxiliary supervision loss, which was added to the loss of the network itself. With this strategy, similar semantic information could be effectively obtained from different feature maps, which was beneficial for improving the discriminative ability of the model.
(2) Residual feature augmentation: Ratio-invariant adaptive pooling was used to capture different context features, reduce the information loss in the M5 channel, and improve the performance of the feature pyramid. Specifically, the C5 layer with scale S underwent ratio-invariant adaptive pooling, producing feature maps at multiple scales $(a_1 \times S, a_2 \times S, \ldots, a_n \times S)$. Next, the channel dimension was reduced to 256 through a 1 × 1 convolution. Finally, the feature maps were up-sampled back to scale S through bilinear interpolation. However, the interpolation operation could introduce aliasing, i.e., discontinuous gray levels and jagged shapes in areas with drastic grayscale changes. To this end, an adaptive spatial fusion module was introduced to learn fusion weights adaptively, and the features of the different layers were then fused with these adaptive weights during training instead of through direct addition (a minimal sketch of this augmentation step is given after this list).
(3) Soft RoI selection: In the original FPN model, the features of each RoI were extracted from one specific feature layer. However, the overlooked features from the other layers also contained important information useful for target classification or regression. In Aug-FPN, a soft RoI selection strategy was used, in which adaptive weights were introduced to better measure the importance of the features of different RoIs. The final RoI features were generated based on adaptive weights rather than hard selection methods, such as RoI assignment or maximization operations. More specifically, the features of all pyramid layers were first collected for each RoI. Subsequently, an adaptive spatial fusion module was used to integrate these features adaptively; that is, various spatial weight maps were generated for the different levels of RoI features. Finally, the weighted aggregation and fusion of the RoI features were performed.
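As a concrete illustration of the residual feature augmentation and adaptive spatial fusion described in aspect (2), the following is a minimal PyTorch sketch. The pooling ratios, channel widths, and the single-convolution weight predictor are assumptions for illustration; Aug-FPN's actual modules are defined in [39].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualFeatureAugmentation(nn.Module):
    # ratios: pooling ratios a_1..a_n relative to the C5 scale S (illustrative)
    def __init__(self, in_ch: int = 2048, out_ch: int = 256, ratios=(0.1, 0.2, 0.3)):
        super().__init__()
        self.ratios = ratios
        self.reduce = nn.ModuleList([nn.Conv2d(in_ch, out_ch, 1) for _ in ratios])
        # adaptive spatial fusion: predict a per-pixel weight map for each branch
        self.fuse = nn.Conv2d(out_ch * len(ratios), len(ratios), kernel_size=1)

    def forward(self, c5: torch.Tensor) -> torch.Tensor:
        h, w = c5.shape[-2:]
        branches = []
        for ratio, conv in zip(self.ratios, self.reduce):
            size = (max(1, int(h * ratio)), max(1, int(w * ratio)))
            p = F.adaptive_avg_pool2d(c5, size)      # ratio-invariant pooling
            p = conv(p)                              # reduce channels to 256
            p = F.interpolate(p, (h, w), mode="bilinear", align_corners=False)
            branches.append(p)
        weights = torch.softmax(self.fuse(torch.cat(branches, dim=1)), dim=1)
        # weighted sum instead of direct addition, to suppress aliasing
        return sum(weights[:, i : i + 1] * b for i, b in enumerate(branches))

m6 = ResidualFeatureAugmentation()(torch.randn(1, 2048, 16, 16))  # (1, 256, 16, 16)
```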

3.3. Loss Function

The imbalance problem of medical data often exists in computer vision tasks. In particular, for chest anomaly detection, samples with different anomalies present the characteristics of a long-tail distribution, posing huge challenges in the training process. More specifically, the head categories that have many samples impose excessive negative sample gradients, resulting in the near disappearance of the positive sample gradient of the tail category with a small number of samples. Therefore, the prediction accuracy of the model for the tail category is extremely low, which significantly affects the overall detection performance. To address this issue, we used seesaw loss [40] as the classification loss function, which could dynamically reduce the excessive weight of negative sample gradients inflicted by head categories on tail categories to achieve a relatively balanced positive and negative sample gradient. Mathematically, seesaw loss was expressed as follows:
$$L_{seesaw}(z) = -\sum_{i=1}^{C} y_i \log(\hat{\sigma}_i)$$

$$\hat{\sigma}_i = \frac{e^{z_i}}{\sum_{j \neq i}^{C} S_{ij} e^{z_j} + e^{z_i}}$$

where $y_i \in \{0, 1\}$ is the one-hot label, $z = [z_1, z_2, \ldots, z_C]$ contains the predicted logits for each category, and $S_{ij}$ is the balance factor. By adjusting $S_{ij}$, the effect of the negative sample gradient imposed by samples from the ith category on those from the jth category could be enlarged or reduced. Therefore, a balance between the positive and negative sample gradients could be achieved by selecting an appropriate balance factor.
The selection of $S_{ij}$ was mainly based on two aspects. First, the relationship between the sample distributions of the categories needed to be emphasized, so as to reduce the "penalty" (negative sample gradient) of the head category on the tail category. Second, it had to be ensured that misclassified samples received enough "punishment" during training to avoid increasing the risk of misclassification. Accordingly, the balance factor was designed as $S_{ij} = M_{ij} \cdot C_{ij}$, where $M_{ij}$ was used to alleviate the excess of negative sample gradients on the tail category and $C_{ij}$ was used to supplement the "penalty" on misclassified samples. They were calculated as follows:
$$M_{ij} = \begin{cases} 1, & \text{if } N_i \le N_j \\ \left( N_j / N_i \right)^p, & \text{if } N_i > N_j \end{cases}$$

$$C_{ij} = \begin{cases} 1, & \text{if } \sigma_j \le \sigma_i \\ \left( \sigma_j / \sigma_i \right)^q, & \text{if } \sigma_j > \sigma_i \end{cases}$$

where $N_i$ represents the number of samples of the ith category and $\sigma_i$ is the classification confidence of the ith category. $M_{ij}$ indicates that when the ith category was more frequent than the jth category, the negative sample gradient imposed by the ith category on the jth category was automatically reduced according to the magnitude of the imbalance between the two categories. $C_{ij}$ indicates that if a sample from the ith category was misclassified into the jth category, the penalty on the jth category would be increased according to the classification confidence ratio between the two categories. p and q are adjustable hyperparameters, which were set to 0.8 and 2.0 in the experiments.
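The following is a minimal PyTorch sketch of seesaw loss as defined by Equations (8)–(11); it folds the $S_{ij}$ factors into the off-diagonal logits so that a standard cross-entropy recovers $\hat{\sigma}_i$. The class-count vector is passed in as a fixed tensor here for illustration; a full implementation would typically accumulate class counts online during training, as in [40].

```python
import torch
import torch.nn.functional as F

def seesaw_loss(logits, labels, class_counts, p=0.8, q=2.0):
    """Sketch of seesaw loss.
    logits:       (B, C) raw scores z
    labels:       (B,) integer class indices
    class_counts: (C,) sample count N_i per category
    """
    onehot = F.one_hot(labels, logits.shape[1]).float()        # y_i
    probs = logits.softmax(dim=-1).detach()                    # sigma, for C_ij

    # M_ij = min(1, (N_j / N_i)^p): mitigates head-on-tail gradients (Eq. 10)
    M = (class_counts[None, :] / class_counts[:, None]).clamp(max=1.0) ** p

    # C_ij = max(1, (sigma_j / sigma_i)^q): compensates misclassification (Eq. 11)
    sigma_i = probs.gather(1, labels[:, None])                 # (B, 1)
    Cf = (probs / sigma_i).clamp(min=1.0) ** q                 # (B, C)

    S = M[labels] * Cf                                         # (B, C), S_ii = 1
    # sigma_hat_i = exp(z_i) / (sum_{j != i} S_ij * exp(z_j) + exp(z_i))
    off_diag = 1.0 - onehot
    adjusted = logits + torch.log(S.clamp(min=1e-12)) * off_diag
    return F.cross_entropy(adjusted, labels)

# Usage: 14 anomaly categories with a long-tailed count vector (illustrative)
counts = torch.tensor([5000., 2000., 800., 300., 120., 60., 40., 30.,
                       25., 20., 15., 12., 10., 8.])
loss = seesaw_loss(torch.randn(4, 14), torch.tensor([0, 3, 7, 13]), counts)
```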
In summary, we constructed a backbone network with a dual-attention mechanism, Aug-FPN, and a seesaw loss function. Although Aug-FPN performed well in multi-scale object detection, data imbalance affected the performance of the backbone network and Aug-FPN in the chest anomaly detection task. The combination of seesaw loss and Aug-FPN could significantly mitigate this impact, and we performed rigorous experiments to verify this.

4. Experiments and Analysis

4.1. Experimental Datasets and Pre-Processing

In this study, we used the VinDr-CXR chest X-ray dataset [41], collected from two hospitals in Vietnam, for the experimental evaluation. A total of 15,000 images were provided as the original data to the academic community for scientific research, including 10,606 normal chest images and 4394 images with abnormal chest targets. The chest X-ray abnormalities in the VinDr-CXR dataset were further divided into 14 abnormal types. The number of instances of each type is shown in Table 1, and sample images of the various types are shown in Figure 4. It is worth noting that the public dataset we used was published and peer-reviewed; the X-ray images were processed, and no privacy issues or harm were involved. During our research, we also strictly followed the ethics of using medical datasets.
By analyzing the distribution of the normal category and the 14 categories of chest abnormalities in Table 1, it could be determined that the sample distribution among the different categories in VinDr-CXR was very unbalanced and presented the characteristics of an obvious long-tailed distribution. The number of normal images was very large compared with that of the abnormal categories; if all of the more than 10,000 normal images were taken as negative samples for training, the learning effect of the model would be significantly affected. Therefore, we randomly selected 500 images from the normal category and combined them with the 4394 abnormal images to form the final dataset, which was randomly divided into 80% for training and 10% each for the validation and test sets.
Because the chest X-ray abnormalities were independently labeled by three professional radiologists, many overlaps in annotation boxes were found in the VinDr-CXR dataset. Therefore, the weighted box fusion (WBF) technology [42] was used to reduce the overlap and redundancy of the annotation boxes. In Figure 5, the left image shows the original label and the right image shows the result obtained using WBF. It can be observed that the annotation box was significantly smaller and more concise after pre-processing. In addition, the original chest X-rays had a low contrast and brightness, which may have influenced the learning effect. For this reason, we performed image pre-processing for each sample through histogram equalization, and the results are shown in Figure 6.
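As an illustration of this pre-processing, the sketch below applies weighted box fusion with the ensemble-boxes package and histogram equalization with OpenCV; the IoU threshold, box coordinates, and file name are assumed values for demonstration.

```python
import cv2
from ensemble_boxes import weighted_boxes_fusion  # pip install ensemble-boxes

# Three radiologists' boxes for one image, coordinates normalized to [0, 1]
boxes_list = [
    [[0.10, 0.20, 0.45, 0.60]],   # radiologist 1
    [[0.12, 0.22, 0.48, 0.62]],   # radiologist 2
    [[0.09, 0.18, 0.44, 0.58]],   # radiologist 3
]
scores_list = [[1.0], [1.0], [1.0]]   # annotations treated as confidence 1.0
labels_list = [[0], [0], [0]]         # same anomaly category

# Overlapping annotations are merged into a single weighted box
boxes, scores, labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list, iou_thr=0.5)

# Contrast enhancement of the raw X-ray via histogram equalization
img = cv2.imread("chest_xray.png", cv2.IMREAD_GRAYSCALE)
img_eq = cv2.equalizeHist(img)
```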

4.2. Experimental Setup and Evaluation

For the experimental environment, we used a Tesla K80 platform with a single GPU as the hardware. The software environment comprised PyTorch 1.10.0, CUDA 11.1, cuDNN 8.0.5, Python 3.7.12, torchvision 0.11.1, OpenCV 4.1.2, and mm-detection 2.21.0. In the training process, conventional random data augmentation operations, such as scaling, flipping, and translation, were used to extend the training data. Then, cascade R-CNN [43] with the ResNeSt50 backbone network and the FPN feature fusion module was selected as the baseline, where the ResNeSt50 backbone was pre-trained on the ImageNet dataset. Single-scale training and testing with a scale of (1333, 800) were used in the experiments. For the experimental parameters, the batch size and training period were set to 2 and 12 epochs, respectively. The initial learning rate was set to 0.02, the momentum to 0.9, and the weight decay to 0.0001. In addition, an SGD optimizer was employed to regulate the learning rate.
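For reproducibility, these settings map onto an mm-detection 2.x-style configuration roughly as follows; this is a hedged fragment reflecting the stated hyperparameters, with the augmentation list as an illustrative placeholder rather than the authors' exact config file.

```python
# mm-detection 2.x-style config fragment (illustrative, not the authors' file)
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
runner = dict(type='EpochBasedRunner', max_epochs=12)   # training period: 12 epochs
data = dict(samples_per_gpu=2, workers_per_gpu=2)       # batch size: 2
train_pipeline = [
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),  # single scale
    dict(type='RandomFlip', flip_ratio=0.5),                      # random flipping
]
```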
For the experimental evaluation, we used the average precision (AP) and mean average precision (mAP) as the main evaluation indicators. More specifically, in the object detection task, the intersection over union (IoU) denotes the ratio of the intersection to the union of the areas of the prediction box and the ground-truth bounding box. True positives (TP) are the prediction boxes with an IoU greater than the threshold, false positives (FP) are the prediction boxes with an IoU less than or equal to the threshold, and false negatives (FN) are the undetected ground truths. The precision and recall were defined as follows:
Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
Additionally, the P–R curve consisted of multiple groups of precisions and recalls, and the area under the P–R curve for each category was denoted as AP. Then, the mAP could be calculated by averaging the APs for all categories. In particular, the IoU threshold in our experiment was set to 0.5, according to the COCO evaluation benchmark.
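As a quick reference, the IoU that underlies these definitions can be computed as in the minimal sketch below; the box coordinates are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# A prediction counts as a TP when its IoU with a ground-truth box exceeds 0.5
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))   # ~0.39 -> FP at IoU threshold 0.5
```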

4.3. Comparison with the Latest Methods

For an objective evaluation of our approach, three popular approaches (i.e., cascade R-CNN [43], TridentNet [44], and Libra R-CNN [45]) and two of the latest approaches (Sparse R-CNN [46] and Swin Transformer [18]) were selected for comparison, all of which achieved state-of-the-art object detection performance at the time of publication. For the experimental parameters of the compared methods, we used the best configuration recommended by the mm-detection toolbox. Specifically, for cascade R-CNN, TridentNet, and Libra R-CNN, the initial learning rate, momentum, and weight decay were set to 0.02, 0.9, and 0.0001, respectively, and the SGD optimizer was employed to regulate the learning rate; for Sparse R-CNN and Swin Transformer, the initial learning rate and weight decay were set to 0.000025 and 0.0001, respectively, and the AdamW optimizer was employed. For the sake of a fair comparison, the proposed method used the same parameters as cascade R-CNN, TridentNet, and Libra R-CNN, since the parameters they required were similar. In addition, in terms of the convergence criteria, the batch size and training period were set to 2 and 12, respectively, for all compared methods. The AP and mAP of the compared methods on the VinDr-CXR dataset are listed in Table 2, where the values marked in bold indicate the best performance. The P–R diagrams of some representative abnormal categories for the compared methods are shown in Figure 7. To demonstrate the effectiveness of the proposed approach more intuitively, detection examples of the compared methods are shown in Figure 8.
The following findings were obtained from the above experimental results.
First, the proposed method achieved the highest mAP of 0.362, which was 3.2% higher than that of cascade R-CNN (the baseline) and 4.8%, 5.6%, 11.3%, and 2.6% higher than those of TridentNet, Libra R-CNN, Sparse R-CNN, and Swin Transformer, respectively. Second, different models were suitable for different types of chest anomaly detection, but our model achieved the highest AP in most categories; that is, we achieved the best performance in five of the fourteen categories. For example, Libra R-CNN achieved superior detection precision for pleural effusion, whereas Sparse R-CNN was suitable for detecting large-scale chest abnormalities such as aortic enlargement. Another recently published method worth mentioning is Swin Transformer, which showed good potential in the anomaly detection of chest X-rays and achieved the highest AP in three categories, i.e., calcification, pneumothorax, and pulmonary fibrosis. However, the proposed method achieved the best results in most categories, including large-scale and small-scale abnormalities, such as infiltration, atelectasis, pleural thickening, consolidation, and cardiomegaly.
Third, by observing the anomaly detection examples shown in Figure 8, visual comparisons also proved that the proposed method generally achieved more accurate detection results, and that our approach could effectively identify the specific locations of various types of chest anomalies. Even for small targets, our method exhibited a high degree of coincidence with the ground truth. This was mainly due to the fact that we improved the backbone networks with embedded local features and context information. Moreover, with the optimized feature fusion process, information from different scales was rationally used to obtain a more accurate prediction region.
In addition, another objective phenomenon that requires further explanation is that the difference in AP values across the categories was relatively large. This was mainly because the sample distribution among the different categories in the dataset was very unbalanced. The categories that did not obtain a high AP were mainly those representing small proportions of the dataset, such as atelectasis, pneumothorax, and consolidation. The category of other lesions had the smallest AP value among all the categories, not only because it had less training data, but also because it included many other types of anomalies that were difficult to distinguish. In particular, compared with the baseline and other remarkable models, the proposed method still achieved considerable performance gains in these challenging categories, such as other lesions, atelectasis, consolidation, and infiltration, which indicated that our method could handle imbalanced data effectively.
To further study the computational complexity of the compared method, we used giga floating point operations (GFLOPs) and parameter memory as the evaluation indicators for the computational consumption; the results are reported in Table 3. To provide a more intuitive and comprehensive comparison, we also presented a schematic diagram of the detection capability vs. computational complexity requirements in Figure 9.
From Table 3, we could see that our method had a moderate and satisfactory computational complexity for both the GFLOPs and parameter memory. We significantly improved the baseline in mAP with very little computational complexity growth. Figure 9 illustrates that there was a balanced tradeoff between the detection capability and computing requirements. Considering the superior detection precision obtained, we could conclude that the proposed method could achieve a good balance between the computational complexity and detection accuracy.
In general, the proposed method achieved the best overall performance for chest X-ray anomaly detection and improved the existing representative models with a moderate computational complexity requirement. This was because the proposed dual-attention mechanism embedded rich contextual information. Combined with optimized multi-feature fusion and seesaw loss, the detection performance for both large and small targets was enhanced.

4.4. Ablation Study

In this section, an ablation experiment was presented to evaluate the effectiveness of each component in the proposed method. Five models with different configurations were selected for the comparison: (1) the L1 model, which was the baseline; (2) the L2 model, which added the dual-attention mechanism on top of L1; (3) the L3 model, which used Aug-FPN instead of the FPN based on L1; (4) the L4 model, which used seesaw loss based on L1; (5) the L5 model, which used the dual-attention mechanism, Aug-FPN, and seesaw loss, i.e., the model proposed in this article. The experimental results for the five compared models are reported in Table 4, where the values marked in bold indicated those higher than the baseline.
From Table 4, the following conclusions could be drawn: First, the dual-attention mechanism achieved a 1.7% improvement over the backbone network and surpassed the baseline in most categories of chest anomalies. Second, the optimized multi-feature fusion and seesaw loss components improved the backbone network by 1.1% and 0.8%, respectively. From the perspective of the quantitative values, the improvements from the Aug-FPN and seesaw loss components seemed insignificant. However, Aug-FPN and seesaw loss strengthened the model's capability in different aspects. More specifically, the optimized Aug-FPN enhanced the detection performance for multi-scale abnormalities, such as atelectasis, interstitial lung disease, calcification, and cardiomegaly, whereas seesaw loss significantly improved the detection precision of most tail categories, such as atelectasis, pneumothorax, and calcification. Finally, it should be emphasized that the proposed method, that is, the L5 model, took full advantage of the three optimization strategies and achieved a 3.2% improvement. More importantly, our approach yielded performance gains in 11 categories. Only three types of abnormalities, i.e., nodules/masses, lung opacity, and pulmonary fibrosis, showed a slight decrease in AP. This may have been due to the fact that these three categories were all small-scale abnormalities, which are among the most difficult to accurately detect of the 14 multi-scale categories. However, the overall findings of the ablation study indicated that our approach could effectively improve the performance of chest X-ray anomaly detection by integrating the advantages of context information extraction, feature fusion, dynamic balanced adjustments of positive and negative sample gradients, and other aspects.
Although the proposed method improved on the baseline and compared favorably with more sophisticated models, there were still many detection errors. To provide a further analysis, the general toolbox TIDE [47] for identifying object detection errors was used to evaluate our model. The results for the baseline and our method, covering classification errors (Cls), localization errors (Loc), combined classification and localization errors (Both), duplicate detection errors (Dupe), background errors (Bkg), missed ground truths (Miss), and special errors, are presented in Table 5. From the results, we found that our method reduced most error types compared to the baseline. However, although our method reduced false positive detections, it relatively increased false negative detections. This error may have been caused by the imbalanced dataset with multiple anomalies, and this is a limitation of our method.
Finally, to carry out an interpretative analysis of our model, we used Grad-CAM [13] to generate a heat map visualization. Some representative examples of our model and the baseline are presented in Figure 10. By analyzing the results, the following conclusions could be drawn. First, compared with the baseline, the heat map generated by our model was more consistent with the ground truth. Second, the results of our qualitative prediction box also showed a certain correspondence with the heat map. This implied that Grad-CAM had a certain faithfulness to our model, and that the proposed method had good reliability. In general, the heat map produced with our method highlighted the discriminative regions that were more consistent with the visual judgment of radiologists, although it must be noted that this method of comparison was a relatively subjective way of evaluating the proposed method.

5. Conclusions and Future Work

In this article, we mainly researched chest X-ray anomaly detection problems and proposed a novel detection method based on dual-attention and multi-scale feature fusion. In the proposed method, context information and spatial correlation characterization could be captured through the channel and spatial context attention mechanisms. The optimized pyramid structure was used as a feature fusion module to strengthen the feature discrimination ability. Finally, seesaw loss was used as the classification loss function for the weight adjustment of positive and negative samples, which improved the anomaly detection performance at different scales and overcame the influence of data imbalances. Experiments on the VinDr-CXR dataset indicated that the proposed model achieved the best performance among the compared methods in terms of the objective evaluation indicators. The ablation experiments demonstrated that the modules in our model were organically connected and mutually reinforcing. Based on an objective evaluation and subjective observation of the experimental data, the chest anomaly detection results obtained with the proposed method were close to the diagnostic marks of radiologists for some abnormalities, especially aortic enlargement, cardiomegaly, and pleural effusion, which is expected to provide an important reference for chest X-ray image-assisted diagnosis. As a whole, both symmetry and asymmetry ideas were applied in our method. For example, symmetry was fully considered in the backbone network structure and multi-scale fusion framework, while the deformable convolution we used could automatically adjust the scale or receptive field to improve the generalization ability, which can be considered asymmetric compared with the original convolution. These ideas together contributed to the performance improvement of our method.
Although the proposed method achieved outstanding results in the detection of some specific categories, such as aortic enlargement and cardiomegaly, errors in the detection of multiple diseases still exist. One reason for this is that the number of training samples in some categories was insufficient. In future work, we aim to address these limitations from the perspectives of weakly supervised learning and multi-modality data; more robust models are expected to yield more accurate results for chest X-ray anomaly detection.

Author Contributions

Conceptualization, D.L. and S.L.; methodology, S.L. and D.L.; software, S.L. and L.Z.; validation, D.L., L.Z. and Y.L.; formal analysis, D.L.; investigation, S.L. and D.L.; resources, D.L. and Y.L.; data curation, S.L.; writing—original draft preparation, D.L.; writing—review and editing, D.L., S.L. and Y.L.; visualization, S.L. and L.Z.; supervision, D.L. and Y.L.; project administration, D.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Scientific Research Start-up Fund for High-level Talents in Xiangnan University and the Key Research Project of the Hunan Engineering Research Center of Advanced Embedded Computing and Intelligent Medical Systems (No. GCZX202202; No. GCZX202201).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used to support the findings of this study is described within the article; it is a benchmark dataset that is publicly available to researchers.

Acknowledgments

We would like to thank MDPI (https://www.mdpi.com/authors/english, 1 March 2023) for its linguistic assistance during the improvement process of this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Boeddinghaus, J.; Nestelberger, T.; Lopez-Ayala, P.; Ratmann, P.D.; Wussler, D.; Zimmermann, T.; Wildi, K.; Gimenez, M.R.; Miro, O.; Martin-Sanchez, F.J.; et al. Early diagnosis of myocardial infarction in patients presenting late after chest pain onset. Eur. Heart J. 2020, 41, 1706. [Google Scholar] [CrossRef]
  2. Zhao, H.; Li, Y.X.; He, N.J.; Ma, K.; Fang, L.Y.; Li, H.Q.; Zheng, Y.F. Anomaly Detection for Medical Images using Self-supervised and Translation-consistent Features. IEEE Trans. Med. Imaging 2021, 40, 3641–3651. [Google Scholar] [CrossRef] [PubMed]
  3. Çallı, E.; Sogancioglu, E.; Ginneken, B.V.; Leeuwen, K.G.; Murphy, K. Deep learning for chest X-ray analysis: A survey. Med. Image Anal. 2021, 72, 102125. [Google Scholar] [CrossRef]
  4. Wang, G.Y.; Liu, X.H.; Shen, J.; Wang, C.D.; Li, Z.H.; Ye, L.S.; Wu, X.W.; Chen, T.; Wang, K.; Zhang, X.; et al. A deep-learning pipeline for the diagnosis and discrimination of viral, non-viral and COVID-19 pneumonia from chest X-ray images. Nat. Biomed. Eng. 2021, 5, 509–521. [Google Scholar] [CrossRef] [PubMed]
  5. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; Laak, J.A.W.M.V.D.; Ginneken, B.V.; Sánchez, C.I. A Survey on Deep Learning in Medical Image Analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar] [CrossRef] [Green Version]
  6. Liu, Y.; Jian, A.; Eng, C.; Way, D.H. A deep learning system for differential diagnosis of skin diseases. Nat. Med. 2020, 26, 900–908. [Google Scholar] [CrossRef]
  7. Li, X.; Chen, H.; Qi, X.; Dou, Q.; Fu, C.W.; Heng, P.A. H-DenseUNet: Hybrid densely connected UNet for liver and tumor segmentation from CT volumes. IEEE Trans. Med. Imaging 2018, 37, 2663–2674. [Google Scholar] [CrossRef] [Green Version]
  8. Liu, Y.H.; Zhang, F.D.; Zhang, Q.Y.; Wang, S.W.; Wang, Y.Z.; Yu, Y.Z. Cross-view correspondence reasoning based on bipartite graph convolutional network for mammogram mass detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3812–3822. [Google Scholar]
  9. Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2097–2106. [Google Scholar]
  10. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  11. Huang, G.; Liu, Z.; Maaten, L.V.D.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  12. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2921–2929. [Google Scholar]
  13. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  14. Połap, D.; Woźniak, M.; Damaševičius, R.; Wei, W. Chest radiographs segmentation by the use of nature-inspired algorithm for lung disease detection. In Proceedings of the IEEE Symposium Series on Computational Intelligence, Bangalore, India, 18–21 November 2018. [Google Scholar]
  15. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  16. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
17. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [Green Version]
18. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  19. Wang, R.; Walters, R.; Yu, R. Incorporating Symmetry into Deep Dynamics Models for Improved Generalization. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 4 May 2021. [Google Scholar]
  20. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  21. Schultheiss, M.; Schober, S.A.; Lodde, M.; Bodden, J.; Aichele, J.; Müller-Leisse, C.; Renger, B.; Pfeiffer, F.; Pfeiffer, D. A robust convolutional neural network for lung nodule detection in the presence of foreign bodies. Sci. Rep. 2020, 10, 12987. [Google Scholar] [CrossRef]
22. Capizzi, G.; Lo Sciuto, G.; Napoli, C.; Połap, D.; Woźniak, M. Small lung nodules detection based on fuzzy-logic and probabilistic neural network with bioinspired reinforcement learning. IEEE Trans. Fuzzy Syst. 2020, 28, 1178–1189. [Google Scholar]
  23. Peng, T.; Gu, Y.D.; Ye, Z.Y.; Cheng, X.X.; Wang, J. A-LugSeg: Automatic and explainability-guided multi-site lung detection in chest X-ray images. Expert Syst. Appl. 2022, 198, 116873. [Google Scholar] [CrossRef]
  24. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  25. Li, X.; Shen, L.; Xie, X.; Huang, S.; Xie, Z.; Hong, X.; Yu, J. Multi-resolution convolutional networks for chest X-ray radiograph based lung nodule detection. Artif. Intell. Med. 2020, 103, 101744. [Google Scholar] [CrossRef] [PubMed]
  26. Rahman, S.; Sarker, S.; Miraj, A.A.; Nihal, R.A.; Haque, A.K.M.N.; Noman, A.A. Deep Learning Driven Automated Detection of COVID-19 from Radiography Images: A Comparative Analysis. Cogn. Comput. 2021. [Google Scholar] [CrossRef] [PubMed]
27. El-Dahshan, E.-S.A.; Bassiouni, M.M.; Hagag, A.; Chakrabortty, R.K.; Loh, H.; Acharya, U.R. RESCOVIDTCNnet: A residual neural network-based framework for COVID-19 detection using TCN and EWT with chest X-ray images. Expert Syst. Appl. 2022, 204, 117410. [Google Scholar] [CrossRef]
  28. Fan, Y.; Liu, J.; Yao, R.; Yuan, X. COVID-19 Detection from X-ray Images using Multi-Kernel-Size Spatial-Channel Attention Network. Pattern Recognit. 2021, 119, 108055. [Google Scholar] [CrossRef]
  29. Cha, S.-M.; Lee, S.-S.; Ko, B. Attention-Based Transfer Learning for Efficient Pneumonia Detection in Chest X-ray Images. Appl. Sci. 2021, 11, 1242. [Google Scholar] [CrossRef]
  30. Park, S.; Lee, S.M.; Kim, N.; Choe, J.; Cho, Y.; Do, K.H.; Seo, J.B. Application of deep learning–based computer-aided detection system: Detecting pneumothorax on chest radiograph after biopsy. Eur. Radiol. 2019, 29, 5341–5348. [Google Scholar] [CrossRef]
  31. Tolkachev, A.; Sirazitdinov, I.; Kholiavchenko, M.; Mustafaev, T.; Ibragimov, B. Deep learning for diagnosis and segmentation of pneumothorax: The results on the Kaggle Competition and Validation Against Radiologists. IEEE J. Biomed. Health Inform. 2020, 25, 1660–1672. [Google Scholar] [CrossRef]
32. Hwang, E.J.; Park, S.; Jin, K.N.; Kim, J.I.; Choi, S.Y.; Lee, J.H.; Goo, J.M.; Aum, J.; Yim, J.J.; Cohen, J.G.; et al. Development and validation of a deep learning–based automated detection algorithm for major thoracic diseases on chest radiographs. JAMA Netw. Open 2019, 2, e191095. [Google Scholar] [CrossRef] [Green Version]
  33. Pham, H.H.; Le, T.T.; Tran, D.Q.; Ngo, D.T.; Nguyen, H.Q. Interpreting chest X-rays via CNNs that exploit hierarchical disease dependencies and uncertainty labels. Neurocomputing 2021, 437, 186–194. [Google Scholar] [CrossRef]
  34. Pesce, E.; Withey, S.J.; Ypsilantis, P.P.; Bakewell, R.; Goh, V.; Montana, G. Learning to detect chest radiographs containing pulmonary lesions using visual attention networks. Med. Image Anal. 2019, 53, 26–38. [Google Scholar] [CrossRef] [Green Version]
  35. Zhao, G.; Fang, C.; Li, G.; Jiao, L.; Yu, Y. Contralaterally Enhanced Networks for Thoracic Disease Detection. IEEE Trans. Med. Imaging 2021, 40, 2428–2438. [Google Scholar] [CrossRef] [PubMed]
36. Jaszcz, A.; Połap, D.; Damaševičius, R. Lung X-Ray Image Segmentation Using Heuristic Red Fox Optimization Algorithm. Sci. Program. 2022, 2022, 4494139. [Google Scholar]
37. Zhang, H.; Wu, C.R.; Zhang, Z.Y.; Zhu, Y.; Lin, H.B.; Zhang, Z.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. ResNeSt: Split-attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2736–2746. [Google Scholar]
  38. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets V2: More Deformable, Better Results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–19 June 2019; pp. 9300–9308. [Google Scholar]
39. Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. AugFPN: Improving multi-scale feature learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12595–12604. [Google Scholar]
  40. Wang, J.; Zhang, W.; Zang, Y.; Cao, Y.; Pang, J.; Gong, T.; Chen, K.; Liu, Z.; Loy, C.C.; Lin, D. Seesaw loss for long-tailed instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9695–9704. [Google Scholar]
  41. Nguyen, H.Q.; Lam, K.; Le, L.T.; Pham, H.H.; Tran, D.Q.; Nguyen, D.B.; Le, D.D.; Pham, C.M.; Tong, H.T.T.; Dinh, D.H.; et al. VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations. Sci. Data 2022, 9, 429. [Google Scholar] [CrossRef] [PubMed]
  42. Solovyev, R.; Wang, W.; Gabruseva, T. Weighted boxes fusion: Ensembling boxes from different object detection models. Image Vis. Comput. 2021, 107, 104117. [Google Scholar] [CrossRef]
  43. Cai, Z.W.; Vasconcelos, N. Cascade R-CNN: Delving Into High Quality Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  44. Li, Y.H.; Chen, Y.T.; Wang, N.Y.; Zhang, Z.X. Scale-Aware Trident Networks for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
45. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 821–830. [Google Scholar]
46. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse R-CNN: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463. [Google Scholar]
47. Bolya, D.; Foley, S.; Hays, J.; Hoffman, J. TIDE: A General Toolbox for Identifying Object Detection Errors. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
Figure 1. The proposed overall framework for the anomaly detection of chest X-rays.
Figure 2. The DAM_BottleNeck module structure.
Figure 3. The Aug-FPN model structure.
Figure 4. Samples with different abnormalities in the VinDr-CXR dataset: (a) aortic enlargement; (b) cardiomegaly; (c) pleural thickening; (d) pulmonary fibrosis; (e) nodule/mass; (f) lung opacity; (g) pleural effusion; (h) infiltration; (i) interstitial lung disease; (j) calcification; (k) consolidation; (l) atelectasis; (m) pneumothorax; (n) other lesion; (o) normal.
Figure 5. Pre-processing of the label boxes: (a) original label in the VinDr-CXR dataset; (b) optimized annotation result obtained using WBF. The red box indicates the location of the chest abnormality.
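As a reproducibility aid, the following minimal sketch shows the weighted boxes fusion (WBF) step of Figure 5 using the open-source ensemble-boxes package of Solovyev et al. [42] (pip install ensemble-boxes); the box coordinates, class id, and thresholds are illustrative assumptions rather than the exact configuration used in this work.

```python
from ensemble_boxes import weighted_boxes_fusion

# One image, annotated independently by three radiologists; boxes are
# [x1, y1, x2, y2] normalized to [0, 1]. All values below are illustrative.
boxes_list = [
    [[0.10, 0.20, 0.45, 0.60]],   # radiologist 1
    [[0.12, 0.22, 0.47, 0.58]],   # radiologist 2
    [[0.09, 0.18, 0.44, 0.61]],   # radiologist 3
]
scores_list = [[1.0], [1.0], [1.0]]  # annotations treated as fully confident
labels_list = [[0], [0], [0]]        # 0 = one abnormality class (example id)

# Overlapping same-class boxes are merged into a single consensus box whose
# coordinates are the confidence-weighted average of the inputs.
boxes, scores, labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    iou_thr=0.5,          # assumed overlap threshold
    skip_box_thr=0.0001,  # discard near-zero-confidence boxes
)
print(boxes, labels)
```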
Figure 6. Image pre-processing with histogram equalization: (a) original image in the VinDr-CXR dataset; (b) pre-processing result using histogram equalization.
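The contrast enhancement in Figure 6 can be reproduced with a single OpenCV call; the sketch below assumes an 8-bit grayscale export of the radiograph (raw DICOM pixels would first need windowing and rescaling to 8 bits), and the file path is a hypothetical placeholder.

```python
import cv2

# Load the radiograph as an 8-bit grayscale image (hypothetical path).
img = cv2.imread("chest_xray.png", cv2.IMREAD_GRAYSCALE)

# Histogram equalization spreads the grayscale histogram over the full
# intensity range, increasing the contrast of low-contrast regions.
equalized = cv2.equalizeHist(img)

cv2.imwrite("chest_xray_equalized.png", equalized)
```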
Figure 7. The P–R diagram of the compared methods in detecting representative anomalies: (a) aortic enlargement; (b) cardiomegaly; (c) lung opacity; (d) pleural effusion; (e) pleural thickening; (f) pulmonary fibrosis.
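Each curve in Figure 7 is traced by ranking a class's detections by confidence and sweeping the decision threshold; the average precision (AP) is the area under the interpolated curve, and the mAP reported in Table 2 averages AP over the 14 classes. The NumPy sketch below illustrates the computation; the IoU-matching step that produces the true-positive flags (and any specific IoU threshold) is left as an assumption.

```python
import numpy as np

def pr_curve_and_ap(scores, tp, num_gt):
    """P-R curve and all-point-interpolated AP for one class."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # rank by confidence
    tp = np.asarray(tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
    # Area under the monotonically decreasing precision envelope.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    ap = float(np.sum((r[1:] - r[:-1]) * p[1:]))
    return recall, precision, ap

# Toy example: three detections, two of which match a ground-truth box.
recall, precision, ap = pr_curve_and_ap([0.9, 0.8, 0.6], [1, 0, 1], num_gt=3)
print(f"AP = {ap:.3f}")
```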
Figure 8. Detection examples of the compared methods.
Figure 9. Detection capability vs. computational complexity requirements: (a) mAP vs. GFLOPs; (b) mAP vs. parameter memory (M).
Figure 10. The heat map visualization for the proposed model and the baseline.
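Visualizations such as those in Figure 10 follow the class-activation paradigm of Grad-CAM [13]. The pure-PyTorch sketch below illustrates the mechanism with a stand-in ResNet-50 backbone, a random input tensor, and an arbitrary class index; all of these are assumptions, not the exact visualization pipeline used in this work.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()
activations, gradients = {}, {}

def save_activation(module, inputs, output):
    activations["value"] = output.detach()

def save_gradient(module, grad_input, grad_output):
    gradients["value"] = grad_output[0].detach()

target_layer = model.layer4[-1]                 # last convolutional stage
target_layer.register_forward_hook(save_activation)
target_layer.register_full_backward_hook(save_gradient)

x = torch.randn(1, 3, 224, 224)                 # stand-in for an X-ray tensor
model(x)[0, 0].backward()                       # gradient of one class logit

# Channel weights are the global-average-pooled gradients; the CAM is the
# ReLU of the weighted activation sum, upsampled to the input size.
weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
print(cam.shape)  # torch.Size([1, 1, 224, 224])
```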
Table 1. The 14 types of abnormal chest images and their proportions.

| Type of Abnormal Chest Image | Proportion | Type of Abnormal Chest Image | Proportion |
|---|---|---|---|
| Aortic Enlargement, AE | 10.54% | Other Lesion, OL | 3.24% |
| Cardiomegaly, CM | 7.99% | Infiltration, IF | 1.83% |
| Pleural Thickening, PT | 7.12% | Interstitial Lung Disease, ILD | 1.47% |
| Pulmonary Fibrosis, PF | 6.85% | Calcification, CC | 1.41% |
| Nodule/Mass, NM | 3.79% | Consolidation, CS | 0.81% |
| Lung Opacity, LO | 3.65% | Atelectasis, AL | 0.41% |
| Pleural Effusion, PE | 3.64% | Pneumothorax, PM | 0.33% |
Table 2. Experimental results for the compared methods on the VinDr-CXR dataset.

| Abnormal Categories | Baseline | TridentNet | Libra R-CNN | Sparse R-CNN | Swin Transformer | Ours |
|---|---|---|---|---|---|---|
| Aortic Enlargement | 0.899 | 0.891 | 0.876 | 0.917 | 0.907 | 0.908 |
| Atelectasis | 0.058 | 0.094 | 0.171 | 0.076 | 0.136 | 0.176 |
| Calcification | 0.089 | 0.092 | 0.090 | 0.112 | 0.192 | 0.131 |
| Cardiomegaly | 0.910 | 0.904 | 0.919 | 0.932 | 0.923 | 0.933 |
| Consolidation | 0.371 | 0.335 | 0.262 | 0.111 | 0.218 | 0.453 |
| Interstitial Lung Disease | 0.289 | 0.305 | 0.293 | 0.207 | 0.251 | 0.294 |
| Infiltration | 0.332 | 0.239 | 0.291 | 0.173 | 0.285 | 0.350 |
| Lung Opacity | 0.270 | 0.192 | 0.184 | 0.106 | 0.197 | 0.269 |
| Nodule/Mass | 0.286 | 0.156 | 0.172 | 0.130 | 0.270 | 0.274 |
| Pleural Effusion | 0.385 | 0.362 | 0.443 | 0.331 | 0.385 | 0.425 |
| Pleural Thickening | 0.232 | 0.162 | 0.170 | 0.194 | 0.224 | 0.238 |
| Pneumothorax | 0.145 | 0.328 | 0.099 | 0.028 | 0.374 | 0.255 |
| Pulmonary Fibrosis | 0.273 | 0.230 | 0.271 | 0.143 | 0.286 | 0.261 |
| Other Lesion | 0.081 | 0.106 | 0.049 | 0.026 | 0.061 | 0.097 |
| mAP | 0.330 | 0.314 | 0.306 | 0.249 | 0.336 | 0.362 |
Table 3. Computational complexity analysis of the compared methods.

| Metric | Baseline | TridentNet | Libra R-CNN | Sparse R-CNN | Swin Transformer | Ours |
|---|---|---|---|---|---|---|
| GFLOPs | 260.88 | 822.23 | 261.93 | 176.27 | 267.00 | 263.55 |
| Parameter Memory (M) | 70.85 | 32.89 | 71.12 | 107.85 | 47.44 | 74.03 |
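The parameter-memory column of Table 3 (and the axes of Figure 9b) can be reproduced directly in PyTorch, as in the sketch below; GFLOPs additionally require a FLOP-counting utility, which is not shown here. ResNet-50 serves only as a stand-in for the compared detectors and is an assumption, not the measurement script used in this work.

```python
from torchvision.models import resnet50

# Stand-in model; substitute any detector to reproduce its column.
model = resnet50(weights=None)

# Total number of learnable parameters, reported in millions.
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.2f} M parameters")

# At fp32 precision each weight occupies 4 bytes.
print(f"{num_params * 4 / 1024 ** 2:.2f} MiB of parameter memory")
```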
Table 4. The experimental results of the five compared models on the VinDr-CXR dataset.

| Abnormal Categories | L1 (Baseline) | L2 | L3 | L4 | L5 (Ours) |
|---|---|---|---|---|---|
| Aortic enlargement | 0.899 | 0.906 | 0.896 | 0.895 | 0.908 |
| Atelectasis | 0.058 | 0.112 | 0.130 | 0.210 | 0.176 |
| Calcification | 0.089 | 0.134 | 0.129 | 0.095 | 0.131 |
| Cardiomegaly | 0.910 | 0.933 | 0.935 | 0.917 | 0.933 |
| Consolidation | 0.371 | 0.310 | 0.376 | 0.319 | 0.453 |
| Interstitial lung disease | 0.289 | 0.306 | 0.307 | 0.249 | 0.294 |
| Infiltration | 0.332 | 0.315 | 0.342 | 0.336 | 0.350 |
| Lung opacity | 0.270 | 0.270 | 0.247 | 0.252 | 0.269 |
| Nodule/mass | 0.286 | 0.287 | 0.284 | 0.287 | 0.274 |
| Pleural effusion | 0.385 | 0.385 | 0.406 | 0.389 | 0.425 |
| Pleural thickening | 0.232 | 0.241 | 0.214 | 0.223 | 0.238 |
| Pneumothorax | 0.145 | 0.247 | 0.116 | 0.226 | 0.255 |
| Pulmonary fibrosis | 0.273 | 0.296 | 0.288 | 0.241 | 0.261 |
| Other lesion | 0.081 | 0.111 | 0.026 | 0.087 | 0.097 |
| mAP | 0.330 | 0.347 | 0.341 | 0.338 | 0.362 |
Table 5. The detection errors for the baseline and our method. Cls: classification errors; Loc: localization errors; Both: both Cls and Loc errors; Dupe: duplicate detection errors; Bkg: background errors; Miss: missed ground truths; FP: false positives; FN: false negatives. Cls through Miss are the main errors; FP and FN are the special errors.

| Method | Cls | Loc | Both | Dupe | Bkg | Miss | FP | FN |
|---|---|---|---|---|---|---|---|---|
| Baseline | 7.07 | 10.93 | 1.85 | 0.17 | 4.09 | 4.87 | 28.24 | 14.41 |
| Ours | 5.60 | 10.19 | 1.72 | 0.22 | 3.35 | 6.49 | 25.96 | 17.25 |
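The decomposition in Table 5 follows the TIDE toolbox of Bolya et al. [47] (pip install tidecv); a minimal usage sketch is given below, in which the annotation and result file paths are illustrative assumptions.

```python
from tidecv import TIDE, datasets

# COCO-format ground truth and detections (hypothetical paths).
gt = datasets.COCO("annotations/vindr_cxr_val.json")
preds = datasets.COCOResult("results/our_model_detections.json")

tide = TIDE()
tide.evaluate(gt, preds, mode=TIDE.BOX)  # box-level detection error analysis
tide.summarize()  # prints Cls, Loc, Both, Dupe, Bkg, Miss plus FP and FN
tide.plot()       # optional charts of the error distribution
```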