Article

SARFNet: Selective Layer and Axial Receptive Field Network for Multimodal Brain Tumor Segmentation

by Bin Guo 1,2, Ning Cao 1,*, Peng Yang 2 and Ruihao Zhang 2
1 College of Information Science and Engineering, Hohai University, Nanjing 210098, China
2 College of Computer and Information Engineering, Xinjiang Agricultural University, Urumqi 830052, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(10), 4233; https://doi.org/10.3390/app14104233
Submission received: 10 April 2024 / Revised: 11 May 2024 / Accepted: 14 May 2024 / Published: 16 May 2024
(This article belongs to the Special Issue Applications of Computer Vision and Image Processing in Medicine)

Abstract

Efficient magnetic resonance imaging (MRI) segmentation, which is helpful for treatment planning, is essential for identifying brain tumors from detailed images. In recent years, various convolutional neural network (CNN) structures have been introduced for brain tumor segmentation tasks and have performed well. However, the downsampling blocks of most existing methods are typically used only for processing the variation in image sizes and lack sufficient capacity for further feature extraction. We therefore propose SARFNet, a method based on the UNet architecture, which consists of the proposed SLiRF module and the AAM module. The SLiRF downsampling module can extract feature information and prevent the loss of important information while reducing the image size. The AAM block, incorporated into the bottleneck layer, captures more contextual information. The Channel Attention Module (CAM) is introduced into the skip connections to enhance the connections between channel features, improve accuracy, and produce better feature expression. Finally, deep supervision is utilized in the decoder layers to avoid vanishing gradients and generate better feature representations. Extensive experiments were performed to validate the effectiveness of our model on the BraTS2018 dataset. SARFNet achieved Dice coefficient scores of 90.40, 85.54, and 82.15 for the whole tumor (WT), tumor core (TC), and enhancing tumor (ET), respectively. The results show that the proposed model achieves state-of-the-art performance compared with more than twelve benchmarks.

1. Introduction

Brain tumors are usually caused by the abnormal growth of a large number of cells in or around the brain and pose a considerable risk to human health [1,2]. Gliomas, the most common brain tumors [3], are invasive [4] and are classified into four grades by the World Health Organization (WHO) [5]. Gliomas are divided into low-grade gliomas (LGGs, grades 1 and 2) and high-grade gliomas (HGGs, grades 3 and 4), with HGGs generally associated with increased mortality [6,7]. Medical imaging is often used in clinical diagnosis [8] to reduce brain tumor mortality. The medical images used for brain tumor diagnosis include X-rays [9], computed tomography (CT), and magnetic resonance imaging (MRI) [10]. Compared with other auxiliary methods, MRI is a primary tool for analyzing brain tumors [11,12,13] and has important clinical value in early diagnosis and treatment [14]. Conventional MRI in the BraTS2018 dataset involves fusing four sequences, namely T1-weighted (T1), T1 contrast-enhanced (T1ce), T2-weighted (T2), and fluid-attenuated inversion recovery (FLAIR), to assess the brain structure [15]. A common problem is that accurate manual segmentation of medical images takes radiologists a long time, and the accuracy is highly dependent on the radiologist’s experience [16,17]. Additionally, manual segmentation is subject to human error, and evaluation varies among radiologists [18], which leads to inconsistent results [19].
Many studies have proposed alternative segmentation methods for brain tumors. It is widely acknowledged that convolutional neural networks (CNNs) can be applied to image segmentation [20,21,22]. A CNN casts feature prediction as a classification problem and can effectively capture image details [23]. Networks with a fixed, carefully designed parameter structure at every level, such as AlexNet [24], VGGNet [25], and GoogLeNet [26], require a fixed input image size. The fully convolutional network (FCN) was the first deep learning network applied to semantic segmentation [27] and accepts input images of arbitrary size [28]. Unlike FCN, which fuses shallow and deep features via addition operations, UNet fuses them by concatenation. UNet, a network designed for medical image segmentation, has achieved excellent performance [29]. UNet has a symmetrical U-shaped structure in which the left branch serves as the encoder and the right branch as the decoder, and each encoder layer is concatenated with the corresponding decoder layer. UNet combines shallow features with deep features through this U-shaped structure, which is used to perform the final semantic segmentation [30]. The blocks of UNet are generally composed of multiple convolutional layers [31]. A convolutional layer usually contains multiple feature maps with different weight vectors, which makes it possible to retain abundant image features. After the convolutional layer, a pooling layer performs the downsampling operation, which reduces the size of the image and the number of parameters, thereby increasing robustness to translation and deformation [32]. However, during downsampling, features cannot be effectively extracted, so important information and the correlations between local and global features may be lost [33,34]. The UNet bottleneck features can be decomposed along the spatial or channel dimensions [35]. Spatial features carry the most important positional information, whereas channel features focus on the semantic categories of the segmented objects [36]. An appropriate increase in the receptive field and global dependence benefits spatial feature extraction [37]. However, the spatial features of the bottleneck block typically exploit only global dependencies or only local features. The main contributions of our work are as follows:
(1)
We proposed SARFNet, which can fully utilize the properties and characteristics of local and global features, to improve brain tumor segmentation performance.
(2)
We designed the SLiRF block, which is composed of different convolution kernels, a group convolution, an attention mechanism of spatial features, and channel features in each downsampling layer to capture detailed features, avoid losing important information and enhance the correlations between local and global features during the downsampling process.
(3)
We developed the AAM module in the bottleneck layer, which uses a global receptive field and prior knowledge with a multiscale receptive field to integrate more contextual information between the axial features.
(4)
We exploited CAM and deep supervision to improve the segmentation effect further.
The remainder of this paper is organized as follows: related work is described in Section 2. The methods and details of SARFNet are presented in Section 3. The datasets and preprocessing, implementation details, evaluation metrics, comparisons with other methods, and ablation experiments are presented and analyzed in Section 4. Subsequently, Section 5 provides the discussion and conclusions.

2. Related Work

2.1. Deep Learning-Based Methods for Medical Image Segmentation

In recent years, deep learning has become an important branch of artificial intelligence. It is widely used in image processing, natural language processing, and other fields. Image segmentation is an important application of deep learning. It divides an image into several regions in which the pixels share the same characteristics, providing more refined region-level information for subsequent processing [38]. In 1996, Sahiner et al. [39] proposed a CNN applied to medical image processing, in which regions of interest (ROIs) containing biopsy-confirmed masses were extracted from mammograms. A traditional CNN is mainly used for image classification, where each image is assigned a corresponding class label, while UNet mainly focuses on end-to-end pixel-level medical image segmentation. The UNet structure is generally composed of an encoder and a decoder. The encoder usually contains convolutional blocks and maximum pooling layers, whereas the decoder contains upsampling blocks that integrate deep and shallow semantic information and convolutional layers. Compared with 2D images, 3D images can provide more detailed brain tumor information. Çiçek et al. [40] proposed 3D UNet, a variant of the UNet architecture, to process volumetric data from CT or MR images with depth information. As the network gradually deepens, it learns more abstract and complex features and improves its expressive ability. However, with increasing network depth, problems such as vanishing or exploding gradients occur during backpropagation. Many UNet variants have been studied to address such challenges and improve brain tumor image segmentation results. A network architecture named UNet++ with dense connections was proposed by Zhou et al. [41] and makes full use of features at different levels to improve accuracy. In UNet++, the network parameters are greatly reduced to an acceptable range using the deep supervision mechanism. Milletari et al. [42] proposed the UNet-based V-Net architecture, which uses convolutional layers to replace the upsampling and downsampling layers. Chen et al. [43] proposed a 3D dilated multi-fiber network (DMFNet) architecture to reduce the computational overhead and make the model usable in practical large-scale applications. MBANet, developed by Cao et al. [44], achieved more accurate segmentation results using spatial and channel attention. Zhang et al. [45] constructed HMNet, which effectively focuses on enhancing the relationship between low-level and high-level semantic features. The above methods fully demonstrate the effectiveness of the U-shaped structure in medical image processing, so we chose the U-shaped network as our framework.

2.2. The Attention-Based Module for Medical Image Segmentation

The attention mechanism can focus on key features and reduce the loss of detailed information; it has therefore attracted attention and has been increasingly applied to medical imaging. Cao et al. [44] used a novel multi-branch 3D shuffle attention (SA) module as the attention layer in the encoder of MBANet to obtain more accurate segmentation results. Liu et al. [46] proposed SGEResU-Net, in which SGE attention is employed to enhance the ability to learn semantic features. Tian et al. [47] proposed AABTS-Net with an axial attention mechanism, which makes it easier for models to capture local–global contextual information and abundant semantic information. Zhuang et al. [48] proposed ACMINet to adaptively refine and fuse multimodal features. To reduce the semantic gap, Kuang et al. [49] presented UCA-Net to enhance the contextual extraction ability among distinct slices. Yang et al. [50] proposed a hybrid pooling (HP) module and a hybrid upsampling (HU) module to capture more contextual features. Wang et al. [51] designed a residual block that fuses large and small kernels to simultaneously capture global and local features while retaining the advantages of ViT. In other fields, Li et al. [52] developed an attention mechanism that enhanced the ability of the downsampling module to extract the essential features of maize diseases.
Dosovitskiy et al. [53] introduced the Transformer, which has achieved great success in the NLP field, into computer vision. At present, efforts are being made to apply the Transformer to medical image segmentation tasks, and these models are theoretically efficient. Chen et al. [54] proposed TransUNet, which combines the Transformer and UNet, to enhance finer details by recovering localized spatial information in medical image segmentation. Zhang et al. [55] developed a novel multimodal medical Transformer (mmFormer), which performs well at segmenting medical images with incomplete modalities. Lu et al. [56] combined the local modeling of CNNs and the long-range representation of the Transformer in an auxiliary MetaFormer decoding path. Zhang et al. [57] incorporated an efficient spatial-channel attention layer in the bottleneck for global interaction to further capture high-level semantic information and highlight important semantic features from the encoder path output. Undoubtedly, many UNet variants have shown great success in brain tumor segmentation. However, conventional downsampling blocks, which merely reduce the image size, extract features poorly. An architecture with an effective mechanism needs to be developed to address this challenge and help networks efficiently perform further brain tumor feature extraction.

3. Methodology

3.1. Network Architecture

The U-shaped structure, with the encoder on the left, the decoder on the right, and skip connections in the middle, achieves good segmentation performance in many image segmentation tasks. As shown in Figure 1, the network we propose is similar to the UNet structure. First, the images of the four brain tumor modalities (T1, T1ce, T2, and FLAIR) are processed using two 3 × 3 × 3 convolutions with a stride of 1, and the dimensions are adjusted from 4 × 128 × 128 × 128 to 32 × 128 × 128 × 128. Group Normalization (GN) and the Gaussian error linear unit (GeLU) are employed after each convolution. Empirically, a residual block is used in the first layer. The SLiRF module is then used for downsampling. Four layers are used in the encoder to perform feature extraction. In the bottleneck stage, three AAM modules and squeeze-and-excitation (SE) are used to improve model performance by explicitly constructing the interdependence between feature channels. The four upsampling blocks in the decoder path restore the image resolution via deconvolution with a stride of 2. Each layer in the decoder contains two convolutions, a GN, and a GeLU, and the size after convolution is the same as that of the corresponding encoder layer. There is a residual block between the convolutions. The SE module, used in the residual block, can dynamically adjust the weights of different channels by learning the relationships between them. Skip connections are constructed to integrate high-level and low-level semantic information between the encoder and decoder layers and accurately locate the target image. In the third and fourth layers of the skip connections, we apply CAM [58], shown in Figure 2, to guide high-level features in the selection of low-level semantic features. Deep supervision is introduced to alleviate convergence problems and improve training performance. At the end of the network, a 1 × 1 × 1 convolution is utilized to restore the dimensions of the obtained feature map to the original 4 × 128 × 128 × 128.
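To make the composition of an encoder layer concrete, the following is a minimal PyTorch sketch of the first encoder layer described above (two 3 × 3 × 3 convolutions with stride 1, each followed by GN and GeLU, wrapped in a residual connection). The class names, the number of normalization groups, and the 1 × 1 × 1 projection on the skip path are our own assumptions rather than details taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvGNGeLU(nn.Module):
    """One 3x3x3 convolution followed by Group Normalization and GeLU."""
    def __init__(self, in_ch, out_ch, gn_groups=8):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.GroupNorm(gn_groups, out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)

class FirstEncoderLayer(nn.Module):
    """Hypothetical first encoder layer: two Conv-GN-GeLU units with a
    residual connection, mapping the 4 MRI modalities to 32 channels."""
    def __init__(self, in_ch=4, out_ch=32):
        super().__init__()
        self.conv1 = ConvGNGeLU(in_ch, out_ch)
        self.conv2 = ConvGNGeLU(out_ch, out_ch)
        # 1x1x1 projection so the residual addition matches channel counts (assumption)
        self.skip = nn.Conv3d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.conv2(self.conv1(x)) + self.skip(x)

# A 4 x 128 x 128 x 128 input is mapped to 32 x 128 x 128 x 128, as stated above:
# y = FirstEncoderLayer()(torch.randn(1, 4, 128, 128, 128))
```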

3.2. Selective Layer Receptive Field Attention Module (SLiRF)

Spatial features are very important for medical image segmentation; however, the challenge of losing effective information during the downsampling stage has still not been addressed. Receptive field attention (RFA) [59] was introduced to weight the significance of each feature within the receptive field, thereby improving the efficiency of feature extraction and the performance of the model. Inspired by RFA, we propose the selective layer receptive field attention module (SLiRF), shown in Figure 3. The module is used for downsampling and utilizes convolutions with different kernels to reduce the image size while capturing more useful information on each channel and spatial feature. In contrast, conventional downsampling uses pooling or convolution with a stride of 2, which leads to information loss.
The SLiRF module consists of SLiConv, group convolution, and spatial and channel attention. SLiConv, which adopts a different convolution kernel in each layer, can expand the receptive field while reducing the feature map size. One of its main functions is to retain important feature information during downsampling. Group convolution can extract spatial features with a larger receptive field. Batch Normalization (BN) and the Rectified Linear Unit (ReLU) are used together to enhance the model's ability to learn nonlinear mappings. Using the global average pooling and maximum pooling operations in the spatial and channel attention branches, the spatial and channel weights can be obtained while maintaining a large receptive field, thereby improving segmentation performance. Based on the above operations, the SLiRF module can maintain a large receptive field and effectively reduce the loss of important information during downsampling.
Formally, as shown in Figure 3, $x \in \mathbb{R}^{C \times H \times W \times D}$ is the input feature of the SLiRF module, where C, H, W, and D represent the number of channels, height, width, and depth of the feature map, respectively. SLiConv selects different convolution kernels according to the position of the layer, which can be represented as follows:
$U_i = f_{sli}(x)$
where $f_{sli}$ denotes the convolution of the SLiConv module, whose kernel sizes are {3 × 3 × 3, 5 × 5 × 5, 7 × 7 × 7, 7 × 7 × 7}, $U_i$ is the result calculated by the SLiConv module, and the index i represents the position of each encoder layer. We define the convolution kernels according to the different positions of the layer as follows:
  • Layer 1: kernel size 3 × 3 × 3 → $U_1 = f_{sl1}(x) \in \mathbb{R}^{C \times H \times W \times D}$
  • Layer 2: kernel size 5 × 5 × 5 → $U_2 = f_{sl2}(x) \in \mathbb{R}^{2C \times \frac{1}{2}H \times \frac{1}{2}W \times \frac{1}{2}D}$
  • Layer 3: kernel size 7 × 7 × 7 → $U_3 = f_{sl3}(x) \in \mathbb{R}^{4C \times \frac{1}{4}H \times \frac{1}{4}W \times \frac{1}{4}D}$
  • Layer 4: kernel size 7 × 7 × 7 → $U_4 = f_{sl4}(x) \in \mathbb{R}^{8C \times \frac{1}{8}H \times \frac{1}{8}W \times \frac{1}{8}D}$
After SLiConv, group convolution with the same number of groups as the number of input channels is used to quickly capture features with receptive fields. The specific expression is as follows:
$GU_i = \mathrm{ReLU}(\mathrm{BN}(\mathrm{GConv}_{3 \times 3 \times 3}(U_i)))$
where $\mathrm{GConv}_{3 \times 3 \times 3}$ denotes group convolution with a 3 × 3 × 3 kernel and the number of groups equal to the number of input channels, BN represents Batch Normalization, and ReLU is the Rectified Linear Unit.
The top part in Figure 3 is the channel attention weight section, which captures the weights of the channel features. The specific expression is as follows:
$CA_i = \mathrm{LNS}(\mathrm{LNR}(\mathrm{GP}(U_i)))$
where LNR denotes a linear operation followed by ReLU, LNS denotes a linear operation followed by a sigmoid, and GP is global average pooling.
The spatial attention weight is presented in the bottom branch in Figure 3. The specific description is as follows:
$SA_i = \mathrm{GW}(\mathrm{CS}(\mathrm{concat}(\mathrm{AP}(GU_i), \mathrm{MP}(GU_i))))$
where AP denotes average pooling, MP is max pooling, CS is defined as a convolution with a 3 × 3 × 3 kernel followed by a sigmoid, and GW indicates the get-weight operation, which can be written as follows:
$\mathrm{GW}_i(\cdot) = \mathrm{AP}(\mathrm{GC}(\cdot))$
where GC is defined as group convolution with a kernel size of 1 and the number of groups equal to the number of input channels.
We combine the three results via multiplication to accurately extract spatial and channel features with receptive fields. The specific description is as follows:
$Z_i = U_i \times SA_i \times CA_i$
At the end of the SLiRF block, a convolution with a 3 × 3 × 3 kernel and stride of 3 is used to extract the spatial features of the receptive field.
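The following PyTorch sketch summarizes how these pieces can fit together in a single SLiRF downsampling block. It is a simplified reading of Figure 3: the stride-2 downsampling placed inside SLiConv, the channel-reduction ratio in the channel-attention branch, the channel-wise average/max pooling in the spatial branch, and folding the GW aggregation and the trailing convolution into one plain 3 × 3 × 3 convolution are all assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SLiRFBlock(nn.Module):
    """Simplified sketch of the SLiRF downsampling block. `kernel_size`
    follows the per-layer setting {3, 5, 7, 7}; the attention branches
    follow the equations above in spirit."""
    def __init__(self, in_ch, out_ch, kernel_size=3, reduction=4):
        super().__init__()
        pad = kernel_size // 2
        # SLiConv: layer-dependent kernel; stride 2 assumed to halve the resolution
        self.sli_conv = nn.Conv3d(in_ch, out_ch, kernel_size, stride=2, padding=pad)
        # Group convolution with groups == channels, followed by BN and ReLU
        self.group_conv = nn.Sequential(
            nn.Conv3d(out_ch, out_ch, 3, padding=1, groups=out_ch),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )
        # Channel attention: global average pooling -> Linear+ReLU -> Linear+Sigmoid
        self.channel_attn = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(out_ch, out_ch // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(out_ch // reduction, out_ch),
            nn.Sigmoid(),
        )
        # Spatial attention: concat(channel-wise avg, channel-wise max) -> conv -> sigmoid
        self.spatial_attn = nn.Sequential(
            nn.Conv3d(2, 1, 3, padding=1),
            nn.Sigmoid(),
        )
        # Final 3x3x3 convolution aggregating the weighted features (simplification)
        self.out_conv = nn.Conv3d(out_ch, out_ch, 3, padding=1)

    def forward(self, x):
        u = self.sli_conv(x)                                  # U_i (downsampled)
        gu = self.group_conv(u)                               # GU_i
        b, c = u.shape[:2]
        ca = self.channel_attn(u).view(b, c, 1, 1, 1)         # CA_i
        avg = gu.mean(dim=1, keepdim=True)                    # AP(GU_i)
        mx = gu.max(dim=1, keepdim=True).values               # MP(GU_i)
        sa = self.spatial_attn(torch.cat([avg, mx], dim=1))   # SA_i
        z = u * sa * ca                                       # Z_i = U_i x SA_i x CA_i
        return self.out_conv(z)

# Example for a hypothetical second encoder layer: 32 -> 64 channels, 5x5x5 kernel
# y = SLiRFBlock(32, 64, kernel_size=5)(torch.randn(1, 32, 128, 128, 128))
```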

3.3. Axial Atrous Spatial Pyramid Pooling Module (AAM)

The feature map in the bottleneck layer contains considerable detail, which plays an important role in image restoration during upsampling. Ordinary convolution lacks global information when dealing with detailed features, which affects the final restoration. Axial attention [60], based on self-attention, can establish global dependence and expand the receptive field of the image along each axis. Its receptive field is larger than that of convolution, and more contextual information can be obtained. However, the axial attention mechanism cannot effectively use prior knowledge of the scale invariance, translation invariance, or feature locality of the image, which prevents it from constructing accurate local relationships in some cases. ASPP [61] uses features generated by atrous convolutions with different dilation rates, so that the features contain multiscale receptive fields. It captures multiscale local information and retains the properties of ordinary convolution. The above two mechanisms are fused to create the axial atrous spatial pyramid pooling module (AAM) shown in Figure 4. It uses axial attention to extract global information and retains the characteristics of ordinary convolution and multiscale receptive fields. It can be represented as follows:
$Z = \mathrm{Conv}_{1 \times 1 \times 1}(\mathrm{cat}(\mathrm{atrous}(\mathrm{AP}(\mathrm{self\_attn}(\mathrm{self\_attn}(x, w), h)), \theta)))$
where $x \in \mathbb{R}^{C \times H \times W \times D}$ is the input feature; Z is the result processed by the AAM module; $\mathrm{Conv}_{1 \times 1 \times 1}$ denotes convolution with a 1 × 1 × 1 kernel; atrous denotes an atrous convolution with rate $\theta$, $\theta \in \{1, 2, 3, 4\}$; AP represents average pooling; and self_attn denotes the self-attention mechanism. Here, parameter h represents the column, and parameter w represents the row. Self-attention can be defined as follows:
$Z = \alpha \times \mathrm{bmm}(\mathrm{Conv}(x), \mathrm{sigmoid}(\mathrm{bmm}(\mathrm{Conv}(x), \mathrm{Conv}(x)^{T}))) + x$
where Z denotes the output of the module, x denotes the input feature, Conv represents convolution with a 1 × 1 × 1 kernel, $\alpha$ is a learnable scaling parameter whose value is learned during training, and bmm denotes a batched matrix product (torch.bmm).
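To illustrate the self-attention defined in the equation above, the sketch below applies it along a single spatial axis of a 5D feature map using torch.bmm. The reshaping of the remaining axes into the batch dimension, the single shared 1 × 1 × 1 projection, and the zero-initialized α are our own simplifications; the full AAM additionally applies this attention along rows and then columns and feeds the pooled result into the parallel atrous branches.

```python
import torch
import torch.nn as nn

class AxialSelfAttention(nn.Module):
    """Sketch of the bmm-based self-attention above, applied along one axis
    of a (B, C, H, W, D) feature map."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv3d(channels, channels, kernel_size=1)   # Conv(x)
        self.alpha = nn.Parameter(torch.zeros(1))                  # learnable scale, starts at 0

    def forward(self, x, axis):
        # axis in {2, 3, 4}: positions along this axis are the attention tokens
        v = self.proj(x)
        c = v.shape[1]
        # move the chosen axis and the channel axis to the end: (..., L, C)
        v_axis = v.movedim(axis, -1).movedim(1, -1)
        lead = v_axis.shape[:-2]
        seq = v_axis.reshape(-1, v_axis.shape[-2], c)              # (B', L, C)
        attn = torch.sigmoid(torch.bmm(seq, seq.transpose(1, 2)))  # (B', L, L)
        out = torch.bmm(attn, seq).reshape(*lead, -1, c)
        out = out.movedim(-1, 1).movedim(-1, axis)                 # back to (B, C, H, W, D)
        return self.alpha * out + x                                # residual connection

# Applying attention along the row axis and then the column axis mirrors the
# order shown in Figure 4:
# attn = AxialSelfAttention(256)
# y = attn(attn(torch.randn(1, 256, 8, 8, 8), axis=3), axis=2)
```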

4. Experiments and Results

4.1. Datasets and Preprocessing

The BraTS dataset is the brain tumor segmentation challenge dataset, a public medical image dataset used to research and develop brain tumor segmentation algorithms. The dataset is provided by multiple medical centers and contains MRI data from multiple brain tumor patients. BraTS 2018 [62,63,64,65,66], which consists of 285 cases used for training and 66 cases used for online validation, has become popular. The training cases all contain ground truths labeled by board-certified neuroradiologists, whereas the ground truths of the 66 validation cases are hidden from the public, and results can be obtained only through online validation. Our training strategy was to use all 285 cases for training. Our results were evaluated on the official BraTS platform (https://www.med.upenn.edu/sbia/brats2018.html, accessed on 13 May 2024).
The challenge organizers conducted data preprocessing, including registration, resampling, and skull stripping. However, further preprocessing was necessary for the segmentation task. First, we removed as much background as possible while ensuring that the entire brain was included and randomly cropped patches of a fixed size of 128 × 128 × 128. Second, the Z-score method was used to standardize each image to reduce the contrast differences between the four sequences, and all intensity values were clipped to the 1st and 99th percentiles of the nonzero voxel distribution of the volume. We also applied data augmentation to prevent overfitting: channel rescaling between 0.9 and 1.1, a channel intensity shift between −0.1 and 0.1, additive Gaussian noise from a centered normal distribution with a standard deviation of 0.1, channel dropping, and random flipping along each spatial axis with a probability of 80%.
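The normalization step can be sketched as follows; the exact ordering of clipping and standardization and the channel-first (4, H, W, D) layout are assumptions based on the description above.

```python
import numpy as np

def normalize_case(volume):
    """Per-modality percentile clipping and Z-score standardization over
    nonzero (brain) voxels; `volume` is assumed to be (4, H, W, D)."""
    out = np.zeros_like(volume, dtype=np.float32)
    for c in range(volume.shape[0]):
        channel = volume[c].astype(np.float32)
        brain = channel > 0                          # nonzero voxels only
        if not brain.any():
            continue
        low, high = np.percentile(channel[brain], [1, 99])
        clipped = np.clip(channel, low, high)        # clip to 1st/99th percentiles
        mean, std = clipped[brain].mean(), clipped[brain].std()
        out[c] = np.where(brain, (clipped - mean) / (std + 1e-8), 0.0)  # Z-score, background stays 0
    return out
```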

4.2. Implementation Details

Our network was implemented in Python 3.8.10 and PyTorch 1.10.0. A single NVIDIA GeForce RTX 3090 (Nvidia, Santa Clara, CA, USA) with 24 GB of memory and an Intel(R) Xeon(R) Platinum 8255C (Intel, Santa Clara, CA, USA) were used during training. As shown in Table 1, the initial learning rate was 1.00 × 10−4, and the batch size was 1. The Ranger optimizer was used to optimize our network. Unlike methods that use a hybrid loss, our network was trained with only the ordinary soft Dice loss. The sizes of the input to the first stage and the output of the last layer were both 128 × 128 × 128.
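A minimal soft Dice loss consistent with the description above is sketched below; the smoothing constant and the averaging over regions and batch elements are common choices rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class SoftDiceLoss(nn.Module):
    """Soft Dice loss over per-region probability maps."""
    def __init__(self, smooth=1.0):
        super().__init__()
        self.smooth = smooth

    def forward(self, logits, target):
        # logits, target: (B, C, H, W, D); target holds binary masks per region
        probs = torch.sigmoid(logits)
        dims = (2, 3, 4)                               # reduce over spatial dimensions
        intersection = (probs * target).sum(dims)
        denom = probs.sum(dims) + target.sum(dims)
        dice = (2.0 * intersection + self.smooth) / (denom + self.smooth)
        return 1.0 - dice.mean()                       # average over batch and regions
```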

4.3. Evaluation Metrics

Quantitative and qualitative analyses were carried out using evaluation metrics, including the Dice similarity coefficient (Dice) score and the Hausdorff distance (HD).
The Dice coefficient is a measure of the similarity between the network-predicted segmentation results and manual masks in image segmentation, which can be represented as follows:
$\mathrm{Dice} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$
where TP, FP, and FN are the values of true positive, false positive, and false negative voxels, respectively.
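For binary masks, the same definition can be computed directly from voxel counts, as in the following short sketch.

```python
import numpy as np

def dice_score(pred, gt):
    """Dice coefficient of two binary masks, following the TP/FP/FN definition above."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 1.0  # empty prediction and mask count as a match
```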
HD represents the maximum distance between the predicted segmentation region boundary and the real region boundary. The smaller the value, the smaller the prediction boundary segmentation error and the better the quality. The HD can be written as follows:
$\mathrm{HD}(P, T) = \max\left\{ \sup_{t \in T} \inf_{p \in P} d(t, p),\ \sup_{p \in P} \inf_{t \in T} d(t, p) \right\}$
where t and p represent points on the real region boundary T and the predicted segmentation region boundary P, respectively; d(·) represents the distance between t and p; and sup and inf denote the supremum and infimum, respectively.
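A straightforward way to compute this symmetric distance is with SciPy's directed Hausdorff routine, as sketched below. Note that the BraTS evaluation commonly reports the 95th-percentile variant (HD95); this sketch implements the plain maximum form of the equation above and, for simplicity, uses all foreground voxel coordinates rather than only boundary voxels as the point sets.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance(pred, gt):
    """Symmetric Hausdorff distance between two binary 3D masks (in voxel units)."""
    p = np.argwhere(pred > 0)   # (N, 3) coordinates of predicted foreground voxels
    t = np.argwhere(gt > 0)     # (M, 3) coordinates of ground-truth foreground voxels
    if len(p) == 0 or len(t) == 0:
        return np.inf           # undefined when one of the sets is empty
    return max(directed_hausdorff(p, t)[0], directed_hausdorff(t, p)[0])
```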

4.4. Comparison with Other Methods

To evaluate the advantages of the proposed model, we compared it with twelve advanced models. The numbers of compared networks were 2, 4, 2, 1, 1, and 2 for 2024, 2023, 2022, 2020, 2019, and 2016, respectively. 3D UNet, UNet++, V-Net, DMFNet, MBANet, HMNet, GMetaNet, RFTNet, SPA-Net, and ETUNet are variants based on the basic UNet architecture, while TransUNet and mmFormer are Transformer-based structures.
Table 2, Figure 5, and Figure 6 show that the Dice coefficients of SARFNet are 90.40, 85.54, and 82.15, and the HDs are 5.95, 7.30, and 2.72 for the three tumor subregions (WT, TC, and ET), respectively. Compared with the traditional 3D UNet baseline, our WT, TC, and ET Dice scores increased by 1.87, 13.77, and 6.19, respectively. In the experiments, we conducted multiple comparisons, including variants based on UNet that perform brain tumor image segmentation well, as well as CNN + Transformer architectures. The foundation of the Transformer is self-attention, an attention mechanism that can establish global dependencies and expand the receptive field of images. Compared with a CNN, it has a larger receptive field and can capture more contextual information, achieving better performance by selecting important information and ignoring unimportant information. However, its ability to capture effective local information is weaker than that of a CNN. Recently, many Transformer-based networks have also incorporated CNNs for local characteristics. Although this has significantly improved brain tumor segmentation results, it has not achieved the best performance, because local features are added without considering how to maintain the receptive field. As a result, some Transformer-based networks perform worse than UNet-based networks. The attention mechanism currently works well in UNet-based architectures; however, because its ability to establish accurate global relationships is inferior to that of the Transformer, the loss of global relationship information affects the final segmentation performance.
In brain tumor segmentation tasks, both global relationships and local detail features are crucial. The purpose of the network we propose is to maintain global relationships as much as possible, thereby improving performance in brain tumor segmentation. The results show that our network dramatically improved in TC and ET; that is, our network performs better than the baseline in small target segmentation.
Figure 7 shows the visualization results of the SARFNet model on five cases randomly chosen from the BraTS2021 [69] dataset. Cases 003_107, 020_84, 56_76, 62_83, and 66_80, corresponding to rows a, b, c, d, and e, are segmented by SARFNet. Green, yellow, and red represent the whole tumor, tumor core, and enhancing tumor, respectively. The results of SARFNet are close to the labeled ground truth. MBANet, the best UNet-based variant, has lower scores than our network, and we also outperformed the Transformer-based networks. In general, our architecture and modules achieved excellent results on BraTS2018, which provides a good basis for subsequent research.

4.5. Ablation Experiments

4.5.1. Ablation Study of Each Module in SARFNet

We conducted ablation experiments to verify the effect of the different modules in this architecture. Table 3 shows that, when AAM is added to UNet, the results are 89.62, 81.79, 80.98, and 84.13 for the WT, TC, ET, and average Dice, respectively. The results of UNet with AAM are better than those of the baseline, which shows that the AAM module is effective in brain tumor segmentation tasks because of its ability to use axial attention to extract global information while retaining the characteristics of ordinary convolution and multiscale receptive fields. After the SLiRF module is added, the results are 90.42, 84.00, 80.66, and 85.03 for the WT, TC, ET, and average Dice, respectively, indicating a further improvement over the previous configuration. The module enables further feature extraction during the downsampling stage.
After CAM was added, the TC, ET, and average Dice increased, with values of 90.13, 84.19, 81.14, and 85.15 for the WT, TC, ET, and average Dice, respectively. The ablation study of deep supervision shows that the WT, TC, ET, and average Dice are also improved. We fused all the above modules into the network; the results are 90.40, 85.54, 82.15, and 86.03 for WT, TC, ET, and average Dice, respectively. These best results show that our network and all its modules can be effectively applied to brain tumor segmentation tasks.

4.5.2. Ablation Experiments for SLiRF Module

In the previous ablation experiment, the SLiRF module was added to the network and improved its performance. However, the choice of convolution kernel size in each layer affects the SLiRF module. Four groups of experiments were performed to evaluate the convolution kernel sizes in different layers. The four experimental settings are as follows:
The convolution kernel sizes are 3 × 3 × 3 in all layers.
The convolution kernel sizes are 5 × 5 × 5 in all layers.
The convolution kernel sizes are 7 × 7 × 7 in all layers.
The convolution kernel sizes from the first to the fourth layers are 3 × 3 × 3, 5 × 5 × 5, 7 × 7 × 7, and 7 × 7 × 7, respectively.
The SLiRF configuration with layer-dependent kernel sizes (the last setting) performed best. The corresponding values in Table 4 are 90.40, 85.54, 82.15, and 86.03 for the WT, TC, ET, and average Dice, respectively. The results show that increasing the convolution kernel size in deeper layers benefits segmentation performance.

4.5.3. Ablation Experiments for CAM Module

The CAM, which is added to the skip connections between the encoder and decoder, can fuse high-level features to guide the selection of low-level features. The deeper the layer, the richer the features. For this reason, we studied the number of skip connection layers to which CAM is applied.
As shown in Table 5, when CAM is employed simultaneously on the third and fourth layers, the effect is the best, and the results are 90.40, 85.54, 82.15, and 86.03 for the WT, TC, ET, and average Dice, respectively. We speculate that the shallow semantic information in the first and second layers is less helpful to the decoder layers. CAM is beneficial to our architecture, but its effectiveness is limited.

4.5.4. Ablation Experiments for Axial Dimension of AAM Module

We conducted three experiments to analyze the relationship between axial dimensions and receptive fields. The experiments included:
A: expanding receptive fields on all three axial operations;
B: not expanding receptive fields on all three axial operations;
C: expanding receptive fields on two axial operations.
As shown in Table 6 and Figure 8, the depth axis lacks sufficient information in each slice, which negatively affects the results. Therefore, expanding the receptive field on all three axial operations is not as effective as expanding it on the horizontal and vertical axial operations alone. That is to say, AAM works best on the horizontal and vertical axial operations.

5. Discussion and Conclusions

An appropriately scaled receptive field is crucial for brain tumor segmentation tasks, and making good use of the receptive field can effectively enhance the performance of brain tumor segmentation. In this paper, we proposed a SARFNet architecture based on UNet that integrates the SLiRF module, AAM module, CAM, and deep supervision to improve brain tumor segmentation performance. The SLiRF module is developed for downsampling to reduce the image size and expand the receptive field to extract multi-scale information. The AAM module in the bottleneck layer can capture global information while preserving the advantages of convolution. The CAM is used in the skip connections of the third and fourth layers, using high-level semantic information to guide low-level semantic information to help restore images during the upsampling stage. In addition, we compared it with networks based on the UNet and Transformer from 2022 to 2024. Our results show Dice values of 90.40, 85.54, and 82.15 and HD values of 5.95, 7.30, and 2.72, respectively, for the three tumor subregions (WT, TC, and ET). The results show that the proposed model achieved state-of-the-art performance compared with more than twelve benchmarks. This indicates that the model and modules we proposed can achieve accurate brain tumor segmentation and enhance the performance of the segmentation.
By comparing with the traditional 3D UNet, we found significant improvements in the segmentation of small targets. Compared with variant networks based on UNet and the Transformer, the advantage of our proposed network lies in the fact that it uses different convolution kernels according to the characteristics of each layer, effectively maintains the relationship between the receptive field and global features, and prevents the loss of important information during feature processing. We also conducted ablation experiments on the SLiRF module, AAM module, CAM, and deep supervision, which demonstrated the effectiveness of our modules. Table 4 shows that, in our network, features become more detailed as the layers deepen; in this case, the information extracted by larger convolutional kernels is sufficient, and performance also improves. This indicates that, in the deeper layers of the network, the size of the receptive field affects the effective extraction of features. The results in Table 5 show that, in our network, the deeper the layers, the more necessary it is to use high-level semantic information to guide the selection of lower-level semantic information.
However, our research also has some limitations. The extensive use of attention mechanisms inevitably increases the complexity of the model. Therefore, we aim to investigate lightweight model designs in the future. Although our network performed well on the brain tumor dataset, we have not evaluated it on images unrelated to brain tumors, such as RGB images. In the next stage, we aim to broaden the scope of our experiments to make the proposed network more practical. We believe that the encouraging results obtained with SARFNet will inspire further research into brain tumor segmentation.

Author Contributions

Conceptualization, N.C. and B.G.; Methodology, B.G.; Software, P.Y.; Data Curation, R.Z.; Writing—Original Draft, B.G.; Writing—Review and Editing, N.C. and B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 41830110.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Datasets released to the public were analyzed in this study. The dataset can be found in the BraTS 2018 dataset: https://www.med.upenn.edu/sbia/brats2018/data.html, accessed on 13 May 2024. GitHub repository, https://github.com/drbinguo/SARFNet, accessed on 13 May 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wechsler-Reya, R.; Scott, M. The developmental biology of brain tumors. Annu. Rev. Neurosci. 2001, 24, 385–428. [Google Scholar] [CrossRef] [PubMed]
  2. Bondy, M.L.; Scheurer, M.E.; Malmer, B.; Barnholtz, J.S.S.; Davis, F.G.; Il’Yasova, D.; Kruchko, C.; McCarthy, B.J.; Rajaraman, P.; Schwartzbaum, J.A.; et al. Brain tumor epidemiology: Consensus from the Brain Tumor Epidemiology Consortium. Cancer 2008, 113 (Suppl. S7), 1953–1968. [Google Scholar] [CrossRef]
  3. Chandana, S.R.; Movva, S.; Arora, M.; Singh, T. Primary brain tumors in adults. Am. Fam. Physician 2008, 77, 1423–1430. [Google Scholar] [PubMed]
  4. Pidhorecky, I.; Urschel, J.; Anderson, T. Resection of invasive pulmonary aspergillosis in immunocompromised patients. Ann. Surg. Oncol. 2000, 7, 312–317. [Google Scholar] [CrossRef] [PubMed]
  5. Louis, D.N.; Perry, A.; Reifenberger, G.; von Deimling, A.; Figarella-Branger, D.; Cavenee, W.K.; Ohgaki, H.; Wiestler, O.D.; Kleihues, P.; Ellison, D.W. The 2016 World Health Organization Classification of tumors of the central nervous system: A summary. Acta Neuropathol. 2016, 131, 803–820. [Google Scholar] [CrossRef] [PubMed]
  6. Malmer, B.; Henriksson, R.; Grönberg, H. Different aetiology of familial low-grade and high-grade glioma? A nationwide cohort study of familial glioma. Neuroepidemiology 2022, 21, 279–286. [Google Scholar] [CrossRef]
  7. Cho, H.H.; Park, H. Classification of low-grade and high-grade glioma using multi-modal image radiomics features. In Proceedings of the 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Jeju, Republic of Korea, 11–15 July 2017; pp. 3081–3084. [Google Scholar]
  8. Mabray, M.C.; Barajas, R.F.; Cha, S. Modern brain tumor imaging. Brain Tumor Res. Treat. 2015, 3, 8–23. [Google Scholar] [CrossRef]
  9. Kasban, H.; El-Bendary, M.A.M.; Salama, D.H. A comparative study of medical imaging techniques. Int. J. Inf. Sci. Intell. Syst. 2015, 4, 37–58. [Google Scholar]
  10. Suneetha, B.; JhansiRani, A. A survey on image processing techniques for brain tumor detection using magnetic resonance imaging. In Proceedings of the 2017 International Conference on Innovations in Green Energy and Healthcare Technologies (IGEHT), Coimbatore, India, 16–18 March 2017; pp. 1–6. [Google Scholar]
  11. Mohan, G.; Subashini, M.M. MRI based medical image analysis: Survey on brain tumor grade classification. Biomed. Signal Process. Control 2018, 39, 139–161. [Google Scholar] [CrossRef]
  12. Bauer, S.; Wiest, R.; Nolte, L.P.; Reyes, M. A survey of MRI-based medical image analysis for brain tumor studies. Phys. Med. Biol. 2013, 58, R97. [Google Scholar] [CrossRef]
  13. Liu, J.; Li, M.; Wang, J.; Wu, F.; Liu, T.; Pan, Y. A survey of MRI-based brain tumor segmentation methods. Tsinghua Sci. Technol. 2014, 19, 578–595. [Google Scholar]
  14. Moffat, B.A.; Chenevert, T.L.; Lawrence, T.S.; Meyer, C.R.; Johnson, T.D.; Dong, Q.; Tsien, C.; Mukherji, S.; Quint, D.J.; Gebarski, S.S.; et al. Functional diffusion map: A noninvasive MRI biomarker for early stratification of clinical brain tumor response. Proc. Natl. Acad. Sci. USA 2005, 102, 5524–5529. [Google Scholar] [CrossRef] [PubMed]
  15. Blystad, I.; Warntjes, J.M.; Smedby, Ö.; Lundberg, P.; Larsson, E.M.; Tisell, A. Quantitative MRI for analysis of peritumoral edema in malignant gliomas. PLoS ONE 2017, 12, e0177135. [Google Scholar] [CrossRef] [PubMed]
  16. Yaniv, Z.; Cleary, K. Image-guided procedures: A review. Comput. Aided Interv. Med. Robot. 2006, 3, 7. [Google Scholar]
  17. Shi, F.; Wang, J.; Shi, J.; Wu, Z.; Wang, Q.; Tang, Z.; He, K.; Shi, Y.; Shen, D. Review of artificial intelligence techniques in imaging data acquisition, segmentation, and diagnosis for COVID-19. IEEE Rev. Biomed. Eng. 2020, 14, 4–15. [Google Scholar] [CrossRef] [PubMed]
  18. Mazzara, G.P.; Velthuizen, R.P.; Pearlman, J.L.; Greenberg, H.M.; Wagner, H. Brain tumor target volume determination for radiation treatment planning through automated MRI segmentation. Int. J. Radiat. Oncol. Biol. Phys. 2004, 59, 300–312. [Google Scholar] [CrossRef]
  19. Paul, J.S.; Plassard, A.J.; Landman, B.A.; Fabbri, D. Deep learning for brain tumor classification. In Medical Imaging 2017: Biomedical Applications in Molecular, Structural, and Functional Imaging; SPIE: Bellingham, WA, USA, 2017; Volume 10137, pp. 253–268. [Google Scholar]
  20. Sultana, F.; Sufian, A.; Dutta, P. Evolution of image segmentation using deep convolutional neural network: A survey. Knowl.-Based Syst. 2020, 201, 106062. [Google Scholar] [CrossRef]
  21. Lu, Y.; Chen, Y.; Zhao, D.; Liu, B.; Lai, Z.; Chen, J. CNN-G: Convolutional neural network combined with graph for image segmentation with theoretical analysis. IEEE Trans. Cogn. Dev. Syst. 2020, 13, 631–644. [Google Scholar] [CrossRef]
  22. Duan, Y.; Liu, F.; Jiao, L.; Zhao, P.; Zhang, L. SAR image segmentation based on convolutional-wavelet neural network and Markov random field. Pattern Recognit. 2017, 64, 255–267. [Google Scholar] [CrossRef]
  23. Li, G.; Yu, Y. Visual saliency detection based on multiscale deep CNN features. IEEE Trans. Image Process. 2016, 25, 5012–5024. [Google Scholar] [CrossRef]
  24. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  25. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  26. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  27. Zhang, Y.; Chi, M. Mask-R-FCN: A deep fusion network for semantic segmentation. IEEE Access 2020, 8, 155753–155765. [Google Scholar] [CrossRef]
  28. Chaiyasarn, K.; Buatik, A.; Mohamad, H.; Zhou, M.; Kongsilp, S.; Poovarodom, N. Integrated pixel-level CNN-FCN crack detection via photogrammetric 3D texture map of concrete structures. Autom. Constr. 2022, 140, 104388. [Google Scholar] [CrossRef]
  29. Siddique, N.; Paheding, S.; Elkin, C.P.; Devabhaktuni, V. U-net and its variants for medical image segmentation: A review of theory and applications. IEEE Access 2021, 9, 82031–82057. [Google Scholar] [CrossRef]
  30. Rehman, M.U.; Cho, S.; Kim, J.H.; Chong, K.T. Bu-net: Brain tumor segmentation using modified u-net architecture. Electronics 2020, 9, 2203. [Google Scholar] [CrossRef]
  31. Pan, P.; Xu, Y.; Xing, C.; Chen, Y. Crack detection for nuclear containments based on multi-feature fused semantic segmentation. Constr. Build. Mater. 2022, 329, 127137. [Google Scholar] [CrossRef]
  32. Zhou, D.X. Theory of deep convolutional neural networks: Downsampling. Neural Netw. 2022, 124, 319–327. [Google Scholar] [CrossRef]
  33. Ding, L.; Tang, H.; Bruzzone, L. LANet: Local attention embedding to improve the semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 59, 426–435. [Google Scholar] [CrossRef]
  34. Peng, D.; Yu, X.; Peng, W.; Lu, J. DGFAU-Net: Global feature attention upsampling network for medical image segmentation. Neural Comput. Appl. 2021, 33, 12023–12037. [Google Scholar] [CrossRef]
  35. Chen, F.; Wang, N.; Yu, B.; Wang, L. Res2-Unet, a new deep architecture for building detection from high spatial resolution images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 1494–1501. [Google Scholar] [CrossRef]
  36. Li, Y.; Si, S.; Li, G.; Hsieh, C.J.; Bengio, S. Learnable fourier features for multi-dimensional spatial positional encoding. Adv. Neural Inf. Process. Syst. 2021, 34, 15816–15829. [Google Scholar]
  37. Chen, B.; Liu, Y.; Zhang, Z.; Lu, G.; Kong, A.W.K. Transattunet: Multi-level attention-guided u-net with transformer for medical image segmentation. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 8, 55–68. [Google Scholar] [CrossRef]
  38. Liu, X.; Song, L.; Liu, S.; Zhang, Y. A review of deep-learning-based medical image segmentation methods. Sustainability 2021, 13, 1224. [Google Scholar] [CrossRef]
  39. Sahiner, B.; Chan, H.P.; Petrick, N.; Wei, D.; Helvie, M.A.; Adler, D.D.; Goodsitt, M.M. Classification of mass and normal breast tissue: A convolution neural network classifier with spatial domain and texture images. IEEE Trans. Med. Imaging 1996, 15, 598–610. [Google Scholar] [CrossRef] [PubMed]
  40. Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, 17–21 October 2016; pp. 424–432. [Google Scholar]
  41. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, Granada, Spain, 20 September 2018; pp. 3–11. [Google Scholar]
  42. Milletari, F.; Navab, N.; Ahmadi, S.A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  43. Chen, C.; Liu, X.; Ding, M.; Zheng, J.; Li, J. 3D dilated multi-fiber network for real-time brain tumor segmentation in MRI. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2019, Shenzhen, China, 13–17 October 2019; pp. 184–192. [Google Scholar]
  44. Cao, Y.; Zhou, W.; Zang, M.; An, D.; Feng, Y.; Yu, B. MBANet: A 3D convolutional neural network with multi-branch attention for brain tumor segmentation from MRI images. Biomed. Signal Process. Control 2023, 80, 104296. [Google Scholar] [CrossRef]
  45. Zhang, R.; Jia, S.; Adamu, M.J.; Nie, W.; Li, Q.; Wu, T. HMNet: Hierarchical multi-scale brain tumor segmentation network. J. Clin. Med. 2023, 12, 538. [Google Scholar] [CrossRef]
  46. Liu, D.; Sheng, N.; He, T.; Wang, W.; Zhang, J.; Zhang, J. SGEResU-Net for brain tumor segmentation. Math. Biosci. Eng. 2022, 19, 5576–5590. [Google Scholar] [CrossRef]
  47. Tian, W.; Li, D.; Lv, M.; Huang, P. Axial attention convolutional neural network for brain tumor segmentation with multi-modality MRI scans. Brain Sci. 2022, 13, 12. [Google Scholar] [CrossRef] [PubMed]
  48. Zhuang, Y.; Liu, H.; Song, E.; Hung, C.C. A 3D cross-modality feature interaction network with volumetric feature alignment for brain tumor and tissue segmentation. IEEE J. Biomed. Health Inform. 2022, 27, 75–86. [Google Scholar] [CrossRef]
  49. Kuang, H.; Yang, D.; Wang, S.; Wang, X.; Zhang, L. Towards simultaneous segmentation of liver tumors and intrahepatic vessels via cross-attention mechanism. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  50. Yang, L.; Zhai, C.; Liu, Y.; Yu, H. CFHA-Net: A polyp segmentation method with cross-scale fusion strategy and hybrid attention. Comput. Biol. Med. 2023, 164, 107301. [Google Scholar] [CrossRef] [PubMed]
  51. Wang, R.; Liu, H.; Zhou, Z.; Gou, S.; Wang, J.; Jiao, L. ASF-LKUNet: Adjacent-Scale Fusion U-Net with Large-kernel for Medical Image Segmentation. Authorea Prepr. 2023. [Google Scholar] [CrossRef]
  52. Li, H.; Qi, M.; Du, B.; Li, Q.; Gao, H.; Yu, J.; Bi, C.; Yu, H.; Liang, M.; Ye, G.; et al. Maize Disease Classification System Design Based on Improved ConvNeXt. Sustainability 2023, 15, 14858. [Google Scholar] [CrossRef]
  53. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  54. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  55. Zhang, Y.; He, N.; Yang, J.; Li, Y.; Wei, D.; Huang, Y.; Zhang, Y.; He, Z.; Zheng, Y. mmformer: Multimodal medical transformer for incomplete multimodal learning of brain tumor segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; pp. 107–117. [Google Scholar]
  56. Lu, Y.; Chang, Y.; Zheng, Z.; Sun, Y.; Zhao, M.; Yu, B.; Tian, C.; Zhang, Y. GMetaNet: Multi-scale ghost convolutional neural network with auxiliary MetaFormer decoding path for brain tumor segmentation. Biomed. Signal Process. Control 2023, 83, 104694. [Google Scholar] [CrossRef]
  57. Zhang, W.; Chen, S.; Ma, Y.; Liu, Y.; Cao, X. ETUNet: Exploring efficient transformer enhanced UNet for 3D brain tumor segmentation. Comput. Biol. Med. 2024, 171, 108005. [Google Scholar] [CrossRef] [PubMed]
  58. Huang, G.; Zhu, J.; Li, J.; Wang, Z.; Cheng, L.; Liu, L.; Li, H.; Zhou, J. Channel-attention U-Net: Channel attention mechanism for semantic segmentation of esophagus and esophageal cancer. IEEE Access 2020, 8, 122798–122810. [Google Scholar] [CrossRef]
  59. Zhang, X.; Liu, C.; Yang, D.; Song, T.; Ye, Y.; Li, K.; Song, Y. Rfaconv: Innovating spatital attention and standard convolutional operation. arXiv 2023, arXiv:2304.03198. [Google Scholar]
  60. Ho, J.; Kalchbrenner, N.; Weissenborn, D.; Salimans, T. Axial attention in multidimensional transformers. arXiv 2019, arXiv:1912.12180. [Google Scholar]
  61. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  62. Menze, B.H.; Jakab, A.; Bauer, S.; Kalpathy-Cramer, J.; Farahani, K.; Kirby, J.; Burren, Y.; Porz, N.; Slotboom, J.; Wiest, R. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging 2014, 34, 1993–2024. [Google Scholar] [CrossRef] [PubMed]
  63. Bakas, S.; Akbari, H.; Sotiras, A.; Bilello, M.; Rozycki, M.; Kirby, J.S.; Freymann, J.B.; Farahani, K.; Davatzikos, C. Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Nat. Sci. Data 2017, 4, 170117. [Google Scholar] [CrossRef] [PubMed]
  64. Bakas, S.; Reyes, M.; Jakab, A.; Bauer, S.; Rempfler, M.; Crimi, A.; Shinohara, R.T.; Berger, C.; Ha, S.M.; Rozycki, M. Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. arXiv 2018, arXiv:1811.02629. [Google Scholar]
  65. Bakas, S.; Akbari, H.; Sotiras, A.; Bilello, M.; Rozycki, M.; Kirby, J.; Freymann, J.; Farahani, K.; Davatzikos, C. Segmentation labels and radiomic features for the pre-operative scans of the TCGA-LGG collection. Cancer Imaging Arch. 2017, 286. [Google Scholar] [CrossRef]
  66. Bakas, S.; Akbari, H.; Sotiras, A.; Bilello, M.; Rozycki, M.; Kirby, J.; Freymann, J.; Farahani, K.; Davatzikos, C. Segmentation labels and radiomic features for the pre-operative scans of the TCGA-GBM collection. Cancer Imaging Arch. 2017, 4, 10–1038. [Google Scholar] [CrossRef]
  67. Jiao, C.; Yang, T.; Yan, Y.; Yang, A. RFTNet: Region–Attention Fusion Network Combined with Dual-Branch Vision Transformer for Multimodal Brain Tumor Image Segmentation. Electronics 2023, 13, 77. [Google Scholar] [CrossRef]
  68. Liu, H.; Huang, J.; Li, Q.; Guan, X.; Tseng, M. A deep convolutional neural network for the automatic segmentation of glioblastoma brain tumor: Joint spatial pyramid module and attention mechanism network. Artif. Intell. Med. 2024, 148, 102776. [Google Scholar] [CrossRef]
  69. Baid, U.; Ghodasara, S.; Mohan, S.; Bilello, M.; Calabrese, E.; Colak, E.; Farahani, K.; Kalpathy-Cramer, J.; Kitamura, F.C.; Pati, S. The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv 2021, arXiv:2107.02314. [Google Scholar]
Figure 1. An illustration of the proposed SARFNet for brain tumor image segmentation.
Figure 2. The structure of CAM. Input A and input B are one level apart.
Figure 3. An illustration of the proposed SLiRF block.
Figure 4. The structure of the proposed AAM module. In this module, rows are constructed first, followed by columns.
Figure 5. Comparison of Dice results of different segmentation methods.
Figure 6. Comparison of HD results of different segmentation methods.
Figure 7. Visualization results of medical cases. From left to right: T1, T1ce, T2, FLAIR, the result segmented by SARFNet, and the ground truth. Green, yellow, and red represent the whole tumor, tumor core, and enhancing tumor, respectively.
Figure 8. Dice result of ablation experiments for axial dimension of AAM module.
Table 1. Model parameter configuration.
Basic Configuration | Value
PyTorch Version | 1.10.0
Python | 3.8.10
GPU | NVIDIA GeForce RTX 3090 (24 GB)
CUDA | 11.3
Learning Rate | 1.00 × 10−4
Optimizer | Ranger
Batch Size | 1
Input Size | 128 × 128 × 128
Output Size | 128 × 128 × 128
Table 2. Comparison of different methods on the BraTS2018 validation dataset.
Methods | Dice WT (%) | Dice TC (%) | Dice ET (%) | Dice AVG (%) | HD WT (mm) | HD TC (mm) | HD ET (mm) | HD AVG (mm)
3D U-Net [40] (2016) | 88.53 | 71.77 | 75.96 | 78.75 | 17.10 | 11.62 | 6.04 | 11.59
UNet++ [41] (2020) | 88.76 | 82.05 | 77.94 | 82.92 | 6.31 | 8.64 | 4.82 | 6.59
V-Net [42] (2016) | 89.60 | 81.00 | 76.60 | 82.40 | 6.54 | 7.82 | 7.21 | 7.19
DMFNet [43] (2019) | 89.90 | 83.50 | 78.10 | 83.83 | 4.86 | 7.74 | 3.38 | 5.33
TransUNet [54] (2022) | 89.95 | 82.04 | 78.38 | 83.46 | 7.11 | 7.67 | 4.28 | 6.35
mmFormer [55] (2022) | 89.56 | 83.33 | 78.75 | 83.88 | 4.43 | 8.04 | 3.27 | 5.25
GMetaNet [56] (2023) | 90.10 | 84.20 | 82.00 | 85.40 | 5.16 | 5.26 | 2.62 | 4.34
MBANet [44] (2023) | 89.80 | 85.47 | 80.18 | 85.15 | 5.13 | 5.65 | 2.47 | 4.42
HMNet [45] (2023) | 90.10 | 84.30 | 78.60 | 84.33 | 4.73 | 7.73 | 2.70 | 5.05
RFTNet [67] (2023) | 90.30 | 82.15 | 80.24 | 84.23 | 5.97 | 6.41 | 3.16 | 5.18
SPA-Net [68] (2024) | 89.63 | 85.89 | 79.90 | 85.14 | 4.79 | 5.40 | 2.77 | 4.32
ETUNet [57] (2024) | 90.00 | 85.20 | 81.00 | 85.40 | 6.67 | 7.40 | 6.01 | 6.69
SARFNet (Ours) | 90.40 | 85.54 | 82.15 | 86.03 | 5.95 | 7.30 | 2.72 | 5.23
Table 3. Dice result of ablation study of each module in SARFNet.
Configuration | SLiRF | AAM | CAM | DP | WT | TC | ET | AVG
A | — | ✓ | — | — | 89.62 | 81.79 | 80.98 | 84.13
B | ✓ | ✓ | — | — | 90.42 | 84.00 | 80.66 | 85.03
C | ✓ | ✓ | ✓ | — | 90.13 | 84.19 | 81.14 | 85.15
D | ✓ | ✓ | — | ✓ | 90.83 | 85.09 | 82.00 | 85.97
SARFNet | ✓ | ✓ | ✓ | ✓ | 90.40 | 85.54 | 82.15 | 86.03
Table 4. Dice results of ablation experiments of selective kernel for SLiRF module.
Configuration | Layer 1 | Layer 2 | Layer 3 | Layer 4 | WT | TC | ET | AVG
A | 3 × 3 × 3 | 3 × 3 × 3 | 3 × 3 × 3 | 3 × 3 × 3 | 90.32 | 85.43 | 80.51 | 85.42
B | 5 × 5 × 5 | 5 × 5 × 5 | 5 × 5 × 5 | 5 × 5 × 5 | 89.86 | 85.45 | 80.65 | 85.32
C | 7 × 7 × 7 | 7 × 7 × 7 | 7 × 7 × 7 | 7 × 7 × 7 | 89.97 | 85.50 | 79.37 | 84.96
SLiRF | 3 × 3 × 3 | 5 × 5 × 5 | 7 × 7 × 7 | 7 × 7 × 7 | 90.40 | 85.54 | 82.15 | 86.03
Table 5. Dice results of ablation experiments for CAM module.
Configuration | CAM on 4th | CAM on 3rd | CAM on 2nd | CAM on 1st | WT | TC | ET | AVG
A | — | — | — | — | 90.83 | 85.09 | 82.00 | 85.9
B | ✓ | — | — | — | 90.65 | 85.31 | 80.00 | 85.32
C (SARFNet) | ✓ | ✓ | — | — | 90.40 | 85.54 | 82.15 | 86.03
D | ✓ | ✓ | ✓ | — | 89.24 | 84.10 | 81.74 | 85.02
E | ✓ | ✓ | ✓ | ✓ | 89.71 | 85.53 | 78.66 | 84.63
Table 6. Dice results of ablation experiments for axial dimension of AAM module.
Experiment | WT (%) | TC (%) | ET (%) | AVG (%)
A | 90.61 | 85.25 | 79.87 | 85.25
B | 90.20 | 84.95 | 78.39 | 84.52
C (SARFNet) | 90.40 | 85.54 | 82.15 | 86.03
