1. Introduction
MRI plays a key role in fetal diagnosis due to its high resolution, superb soft-tissue contrast, and 2D capabilities [
1]. Automatic segmentation of the whole fetus provides valuable insights into fetal development and serves as an important tool for advanced diagnostics. Accurate volumetric segmentation of the entire fetal body facilitates understanding of fetal growth rate, early diagnosis of pathologies, and improved surgical planning [
2]. Rapid development in deep learning (DL) over the past decade has given rise to numerous techniques suitable for automatic fetal image analysis.
DL employs convolutional neural networks (CNN) to perform an in-depth analysis of images and multidimensional data forms. These neural networks use a complex series of convolutions, maximum pooling, and up-sampling operations to learn advanced image analysis tasks efficiently and accurately. Previous studies have found that DL algorithms are robust and accurate. They can reduce the time and cost required for segmentation and diagnostic tasks [
3]. This makes DL an ideal tool for automated whole fetal segmentation, given the diversity in both morphology and image quality in conventional fetal imaging.
Segmentation algorithms have a variety of clinical applications ranging from cardiac segmentation to infant brain segmentation [
4]. DL algorithms contain many variables and functions that can be modified to optimize results, leaving room for improvement in any algorithm. The U-Net architecture is a popular choice of CNN, specifically for medical image segmentation applications. U-Net is unique in that the model performs up-sampling (i.e., interpolation) after convolution instead of the traditional pooling operation. This allows the algorithm to recognize local features and reassess the important details within an image [
5]. U-Net also concatenates the output of each encoder layer (i.e., the contracting path) with its parallel up-sampling/decoder layer (i.e., the expanding path). This design creates symmetry in the U-Net model, as the down-sampling layers are matched to the up-sampling layers. The concatenated data are then synthesized to produce a more precise segmentation [
5].
DL has been a popular choice for a variety of fetal imaging applications. Previous studies have described accurate DL algorithms for brain or brain tumor segmentation for fetal MRI [
6,
7]. Existing DL-based fetal segmentation algorithms include: a VGG16-based encoder/decoder type network to segment fetal tissue and amniotic fluid from ultrasound images [
8]; a multiscale CNN on 2D slices of fetal MRI to segment intracranial volumes [
9]; a combination of a CNN with data augmentation to segment intracranial volume and seven brain tissue types [
10]; two CNNs for coarse and fine segmentation and fetal brain reconstruction from MRIs [
11]; dynamic CNNs applied to echocardiography to segment the fetal left ventricle which, when combined with gradient boosting machines, were also used to segment the fetal abdomen in ultrasound images [
12]; a CNN for fetal skull segmentation from 3D ultrasound [
13]; and a semi-automated bounding box with a CNN pipeline for segmenting maternal-fetal organs from 2D MRI slices [
14].
In standard CNNs, including U-Net, all feature channels are weighted equally. The squeeze-and-excitation (SE) block adds adaptive weights to the channel-wise feature maps by modelling the interdependencies between these convolutional channels [
15]. This mechanism allows for feature recalibration, using global information to determine which features are emphasized and which are suppressed. The SE block achieves this by squeezing each channel to a single value using global average pooling. This value then undergoes a series of operations including a fully connected layer, a Rectified Linear Unit (ReLU), and a sigmoid activation function, resulting in the desired nonlinear characteristics. These nonlinear features are mapped back to the original channel, allowing all feature maps to obtain adaptive weights. The SE block can be integrated conveniently into any end-to-end network and accounts for only a minor increase of less than 0.5% in computational cost [
16].
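The squeeze, excitation, and recalibration steps described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work: the function name `se_block`, the weight arguments, and the toy values are hypothetical, and the second fully connected layer before the sigmoid follows the usual SE formulation rather than anything stated above.

```python
import math

def se_block(feature_maps, w1, b1, w2, b2):
    """Illustrative SE block. feature_maps is a list of C channels (H x W lists)."""
    # Squeeze: global average pooling reduces each channel to a single value.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]
    # Excitation, step 1: fully connected layer followed by ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(weights, squeezed)) + bias)
        for weights, bias in zip(w1, b1)
    ]
    # Excitation, step 2 (assumed): fully connected layer followed by a sigmoid,
    # giving one adaptive weight in (0, 1) per channel.
    scales = [
        1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(weights, hidden)) + bias)))
        for weights, bias in zip(w2, b2)
    ]
    # Recalibration: every value in a channel is multiplied by that channel's weight.
    recalibrated = [
        [[value * scale for value in row] for row in channel]
        for channel, scale in zip(feature_maps, scales)
    ]
    return recalibrated, scales

# Toy input: two 2 x 2 channels with means 1.0 and 2.0.
maps = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 2.0], [4.0, 2.0]],
]
w1, b1 = [[-1.0, 1.0]], [0.0]            # 2 channels -> 1 hidden unit
w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]     # 1 hidden unit -> 2 channel weights
out, scales = se_block(maps, w1, b1, w2, b2)
print(scales)
```

In this toy case the first channel is emphasized and the second is suppressed, while the spatial layout of each feature map is left unchanged.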
The addition of attention mechanisms before the concatenation step can provide greater emphasis on contextual information, which is likely to lead to higher segmentation accuracy. Skip connections can combine features from the contracting pathway into the expanding pathway by concatenating the two. The attention mechanism then combines the two signals from the skip connection and from the previously deconvoluted layer, in order to determine the significance of relevant features [
17,
18,
19]. The result is an attention map that is multiplied by the input to the skip connection to produce the attention block. As such, we hypothesized that there would be synergistic effects between applying adaptive weights to feature channels in the contracting pathway, as well as further emphasis on these weighted features. We anticipated that a network incorporating these features would have improved segmentation results.
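As a rough illustration of how an attention map can be computed from the two signals and multiplied by the skip-connection input, the following pure-Python sketch implements a generic additive attention gate. It assumes that both inputs already share the same spatial size and that the projections are 1 × 1 convolutions; the function name `attention_gate`, the weights, and the toy values are hypothetical and do not describe the exact gate used in the proposed architecture.

```python
import math

def attention_gate(skip, gating, w_x, w_g, psi, bias=0.0):
    """Illustrative additive attention gate.

    skip and gating are lists of channels (H x W lists) with equal H and W.
    w_x and w_g hold one row of 1 x 1 convolution weights per intermediate feature.
    """
    height, width = len(skip[0]), len(skip[0][0])
    attention = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            x = [channel[i][j] for channel in skip]
            g = [channel[i][j] for channel in gating]
            # Project both signals, add them, and apply ReLU.
            combined = [
                max(0.0, sum(a * b for a, b in zip(wx_row, x))
                    + sum(a * b for a, b in zip(wg_row, g)))
                for wx_row, wg_row in zip(w_x, w_g)
            ]
            # Collapse to one score per pixel and squash it into (0, 1).
            score = sum(p * c for p, c in zip(psi, combined)) + bias
            attention[i][j] = 1.0 / (1.0 + math.exp(-score))
    # The attention map rescales every channel of the skip-connection input.
    gated = [
        [[channel[i][j] * attention[i][j] for j in range(width)]
         for i in range(height)]
        for channel in skip
    ]
    return gated, attention

# Toy input: one skip channel and one gating channel, each 1 x 2 pixels.
skip = [[[1.0, 2.0]]]
gating = [[[3.0, -3.0]]]
gated, attention = attention_gate(skip, gating, w_x=[[0.0]], w_g=[[1.0]],
                                  psi=[2.0], bias=-3.0)
print(attention, gated)
```

Here the gating signal keeps the first pixel almost unchanged and nearly suppresses the second, which is the behaviour the attention map is meant to provide before concatenation.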
In the current study, we take advantage of the benefits offered by DL to develop a new architecture that can automatically and accurately segment the whole fetal volume from the respective MRI slice, using the U-Net architecture as the primitive framework. Given the success of U-Net in previous applications, we hypothesized that our proposed architecture would also perform as an efficient and useful segmentation tool that can improve the evaluation of fetal health and development. This work aims to develop a novel architecture by using previously established building blocks from SE blocks, attention mechanisms, and hyper-parametric changes. Our proposed architecture aims to contribute to improvements in clinical settings and provide a novel architecture for biomedical image segmentation.
Figure 1 illustrates the aim of our proposed work.
The contributions of this work are as follows:
Curation of a manually segmented sagittal fetal MRI dataset;
Design of a novel CNN-based architecture for the automatic segmentation of fetal MRI images, with comparable results to state-of-the-art methods.
Section 2 of this paper covers the research methodology, data used, algorithm architecture, hyper-parametric changes, experiments, and statistical analyses.
Section 3 then covers the experimental results and differences between architectures.
Section 4 provides a discussion on the interpretation of the results and the significance of the findings. Finally,
Section 5 summarizes the conclusion and key findings of this paper.
3. Results
The segmentation performances of CASE-Net and the other competitive architectures are shown in
Table 2. CASE-Net achieved a DSC of 87.36%, which was significantly higher than the DSC of U-Net, USE-Net, ATN-Net, and LinkNet by 2.19% (
p = 0.048), 1.29% (
p = 0.034), 5.09% (
p = 0.019), and 4.64% (
p = 0.021), respectively. With a 91.79% recall, CASE-Net also outperformed the other architectures by 2.55% (
p = 0.022), 1.29% (
p = 0.059), 5.09% (
p = 0.078), and 4.64% (
p = 0.058). Nonetheless, CASE-Net achieved a moderate 95.54% precision score, comparable to the other architectures.
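For reference, the three reported metrics can be computed from a pair of binary masks as in the sketch below. It assumes the standard pixel-wise definitions (DSC = 2TP / (2TP + FP + FN), precision = TP / (TP + FP), recall = TP / (TP + FN)); the function name `segmentation_metrics` and the toy masks are hypothetical, and the handling of empty masks is an arbitrary convention chosen for the example.

```python
def segmentation_metrics(truth, predicted):
    """Return (DSC, precision, recall) for two binary masks of equal shape."""
    tp = fp = fn = 0
    for truth_row, pred_row in zip(truth, predicted):
        for t, p in zip(truth_row, pred_row):
            if t and p:
                tp += 1      # fetal pixel correctly segmented
            elif p:
                fp += 1      # background labelled as fetus
            elif t:
                fn += 1      # fetal pixel missed
    # Convention for this sketch: two empty masks agree perfectly (DSC = 1),
    # and undefined precision or recall is reported as 0.
    dsc = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return dsc, precision, recall

# Toy example of over-segmentation: every true pixel is found,
# but two background pixels are also labelled as fetus.
truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
predicted = [
    [1, 1, 1, 0],
    [0, 1, 1, 1],
]
dsc, precision, recall = segmentation_metrics(truth, predicted)
print(f"DSC={dsc:.3f} precision={precision:.3f} recall={recall:.3f}")
# prints "DSC=0.800 precision=0.667 recall=1.000"
```

The toy masks show why a high recall combined with a lower precision is read as over-segmentation: no fetal pixel is missed, but the prediction extends beyond the ground truth.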
The segmentation performances were also calculated in an ablation study for CASE-Net, as shown in
Table 3. CASE-Net achieved a DSC of 87.36%, which was significantly higher than that of CASE-Net with 3 × 3 kernels, without attention mechanisms, and without SE blocks by 3.86% (
p = 0.0257), 3.01% (
p = 0.0304), and 6.17% (
p = 0.0331), respectively. CASE-Net also achieved the lowest loss of 0.005, which was lower by 0.047 (
p = 0.1581), 0.126 (
p = 0.0425), and 0.091 (
p = 0.0138). With a 91.79% recall, CASE-Net also outperformed the other networks by 5.22% (
p = 0.0455), 2.64% (
p = 0.0750), and 10.84% (
p = 0.0421). CASE-Net, however, had a relatively low precision score, which was lower than those of the other networks by 0.93% (
p = 0.0118), 0.72% (
p = 0.0498), and 1.17% (
p = 0.0385), respectively.
Figure 5 visually depicts the segmentation performance of our proposed CASE-Net and the other competitive architectures. As shown, CASE-Net had the highest DSC when compared to U-Net, USE-Net, ATN-Net, and Link-Net. Relative to the other networks, CASE-Net demonstrated the highest DSC and recall values. All networks were calibrated using their optimal loss, optimizers, and learning rates from their respective papers. In addition, each network used the same 5-fold validation, batch size, number of epochs, image augmentations, and dataset.
4. Discussion
Our study presents a novel CNN architecture, CASE-Net, which can achieve accurate 2D whole body fetal MRI segmentations when trained on second and third trimester images. The results show that CASE-Net outperforms other prominent architectures. Specifically, the CASE-Net architecture obtained the highest DSC of 87.36% compared to U-Net, USE-Net, ATN-Net, and Link-Net. These results indicate that a combination of attention mechanisms, SE blocks, and hyper-parametric changes can increase automatic segmentation performance. In addition, CASE-Net demonstrated the highest recall score and moderate precision scores, indicating a low rate of false negatives and a propensity for false positives. The high recall score and comparatively low precision score demonstrate that CASE-Net tends to over-segment images.
Though U-Net is the most popular architecture for biomedical segmentation, the model demonstrated lower DSC, recall, and precision scores when compared to our proposed CASE-Net. However, USE-Net, ATN-Net, and Link-Net architectures demonstrated slightly higher precision but lower recall and DSC scores, indicating that these architectures under-segment images.
We conducted our ablation study (outlined in
Table 3) to determine which individual components were responsible for the relative success of CASE-Net. The first experiment demonstrated that the 3 × 3 kernels had lower performance in all metrics except precision. The second experiment removed the attention mechanism from the network, showing a smaller impact in comparison to changing the kernel sizes. Without the attention mechanism, this architecture demonstrated lower DSC and recall, but higher precision. The last experiment did not use SE blocks; this had the greatest impact on performance, as the DSC decreased to 81.19%. Similar to the previous two experiments, the precision of the model remained higher than that of CASE-Net. It can be seen that all of the networks, other than CASE-Net, tend to under-segment images. However, when all modules are combined, they achieve improved performance.
Embedding attention mechanisms in the U-Net architecture alone does not seem to produce desirable results in fetal MRI segmentation applications. However, when attention mechanisms are paired with SE blocks and integrated into the skip connections, the features received from the contracting pathway are given greater emphasis. The low accuracy obtained from the basic ATN-Net may be attributed to the nature of medical images, as the variation amongst slices may create challenges in emphasizing important features.
Our results show that SE blocks lead to improvement in segmentation performance when added to a DL architecture. Since many architectures do not apply adaptive weights in their models, the localization of important features is not easily determined, leading to weaker performance. USE-Net has a low recall score despite having a high precision score, demonstrating that the model carefully identifies correct segmentations, although it sometimes misses pixels. The discrepancy between recall and precision can be attributed to the SE blocks adding weights to important features, as identified by the algorithm.
Figure 6 illustrates the predicted segmentation of the various architectures that were used. It can be seen qualitatively that U-Net and USE-Net tend to under-segment the image around the fetal neck and spinal cord. ATN-Net and LinkNet under-segment around the fetal head and incorrectly segment the umbilical cord. On the other hand, CASE-Net produces a good balance between over- and under-segmentations and can segment the whole fetal MRI with a moderate degree of accuracy. In the studied architectures, gestational age did not have any effect on segmentation accuracy.
Figure 7 illustrates the pixel-to-pixel comparison of the ground truth with the predicted segmentation masks of the various architectures. It can be seen that CASE-Net has very few FP and FN values compared to the other architectures.
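A pixel-to-pixel comparison of this kind can be sketched as follows, assuming binary masks of equal size. The helper `confusion_map` and the toy masks are hypothetical and only illustrate how each pixel is assigned to the TP, FP, FN, or TN category; they are not the code used to generate the figure.

```python
from collections import Counter

# Mapping from (ground truth, prediction) to the pixel category.
LABELS = {(1, 1): "TP", (0, 1): "FP", (1, 0): "FN", (0, 0): "TN"}

def confusion_map(truth, predicted):
    """Label every pixel of a predicted mask against the ground truth."""
    return [
        [LABELS[(int(bool(t)), int(bool(p)))] for t, p in zip(truth_row, pred_row)]
        for truth_row, pred_row in zip(truth, predicted)
    ]

# Toy 2 x 2 masks: one background pixel is over-segmented (FP)
# and one fetal pixel is missed (FN).
truth = [[0, 1], [1, 1]]
predicted = [[1, 1], [0, 1]]
label_map = confusion_map(truth, predicted)
counts = Counter(label for row in label_map for label in row)
print(label_map)
print(counts["TP"], counts["FP"], counts["FN"], counts["TN"])
```

Colouring the resulting label map by category gives an overlay in which FP regions mark over-segmentation and FN regions mark under-segmentation.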
Previous segmentation studies have focused heavily on frameworks using a modified U-Net architecture. These studies tend to obtain DSC values ranging from 84% to 92%, depending on the application [
16,
17,
18,
19,
20,
21,
22]. Our study obtained DSC values between 87% and 89%, which are comparable. Yet, higher DSC results (i.e., over 90%) were not attained. The slightly lower DSC values in our study are likely due to the increased heterogeneity of the fetal body position and size compared to other biomedical imaging datasets. This leads to increased difficulty in segmentation, resulting in the slightly lower segmentation accuracy presented in this study [
23]. When using our segmented fetal MRI dataset in other segmentation architectures, we did not find a model that yielded higher results than our proposed architecture. We speculate that the lack of emphasis on contextual information in other networks is the reason for our more accurate results.
Our results demonstrated that CASE-Net is a successful architecture that yields promising results. Further improvements to the model would require more computational power. It is likely that concatenating shallow layers with attention gates and SE blocks will further improve segmentation performance, but we were unable to achieve this in our proposed architecture due to limited computational resources. CASE-Net may also benefit from using advanced data augmentation techniques. The added variability would likely improve robustness to translations, rotations, intensity changes, and other forms of variations that occur naturally.
The advantages of the proposed architecture suggest that the model performs well with biomedical datasets. The architecture is proficient in addressing the variability and random occurrences that are especially common in biomedical datasets. The architecture could identify both small and abstract anatomy through the addition of SE blocks and attention mechanisms. However, using this architecture in a non-biomedical application, such as autonomous vehicles, would likely be ineffective when compared to current state-of-the-art models. This is because most non-biomedical datasets typically do not have the large amount of small variability that is present in biomedical datasets.
Similar to our proposed architecture, Schlemper et al. presented attention-gated networks for use in medical images. Their network was applied to 3D-CT abdominal images and 2D fetal ultrasound images, and obtained results that outperformed the base U-Net architecture [
19]. The higher DSC with lower variance is similar to the results that we obtained in our proposed CASE-Net architecture. Their work demonstrated that the attention-gated networks are able to exploit global and local object localization to improve model performance. The authors concluded that their attention-gated mechanisms perform well for tissue/organ identification and localization in fetal ultrasound. The similarity of the datasets, which come from the same domain, likely explains the similar results obtained by the two architectures.
Other fetal segmentation algorithms suffer from the same setbacks of fetal limb motion and position variability. For example, Zhang et al. presented an automated graph-theory solution for full body fetal segmentation and obtained good results, with only 12% error in their fetal body volume estimates. Nonetheless, their segmentation results were not provided, preventing a more comprehensive evaluation of the accuracy. The algorithm in the study by Zhang et al. was designed with only 10 fetuses, while our algorithm was trained, validated, and tested using a total of 34 fetuses, improving our versatility and methodological rigor [
2].
Many of the other segmentation tasks on whole fetal volumes or whole fetal envelopes are done with ultrasound images, which are much more limited because of the restrictions in field of view and resolution, resulting in lower segmentation accuracies ranging from 60% to 70% [
18,
23]. Ravishankar et al. were able to achieve an excellent average DSC of 90% when automatically segmenting the fetal abdomen in ultrasound. The slightly higher DSC in their study may be attributed to the exclusion of the limbs, hands, feet, and fingers, which typically show the highest amount of motion artifacts [
24]. Other high-performing fetal segmentation algorithms were not fully automated [
12]. The errors in our study could be attributed to the challenge of labeling the finer details of the fetal body. Although interactive segmentation algorithms have their own unique benefits, the completely automated approach is ideal for fast-paced, high-throughput applications. For example, automated algorithms are preferable in medical screening for anatomical abnormalities or health concerns. These segmentations and calculations can be done without the need for a medical professional, allowing important abnormalities detected by the algorithm to be flagged for more immediate review by a clinician.