1. Introduction
Radiation therapy (RT) stands as the primary treatment for head and neck (HN) cancer, either used on its own or in conjunction with adjuvant treatments such as surgery and/or chemotherapy [1]. Despite its efficacy in cancer treatment, RT presents challenges, as even partial irradiation of healthy cells can inadvertently harm normal tissues, referred to as organs at risk (OARs), thereby leading to complications [2,3]. Typically, physicians manually delineate tumor and OAR boundaries on Computed Tomography (CT) images; this is a time-consuming and labor-intensive process, since radiologists have to work slice by slice, making any adjustment in treatment planning particularly challenging [4]. The burden is further exacerbated by the repeated and swift adaptations required to the treatment plan, particularly in cases presenting 20–30% tumor-related volumetric changes between RT sessions [5,6]. On this premise, adaptive RT necessitates frequent updates of OARs and target volumes during treatment, making it imperative to delineate these structures quickly and efficiently.
The advent of artificial intelligence (AI) and machine learning has revolutionized the assessment of RT treatment, ranging from computer-aided detection and diagnosis systems to predicting treatment alteration requirements [7,8]. However, many studies still rely on manual segmentation of the parotid glands combined with machine learning to predict the necessity of RT replanning, a step that could be further automated [9,10,11].
Specifically, in automatic segmentation, machine learning has proven highly effective in contouring OARs within a reasonable timeframe, enhancing the safety and efficiency of treatments [12,13,14]. Atlas-based approaches, however, can struggle to generalize across patients, since an atlas may not represent all variations arising from different anatomies and acquisition settings [15]. Recent advances in deep learning (DL) have shown superior performance in auto-segmentation, offering better feature extraction, capturing local relationships, reducing observer variation, and minimizing contouring time [16,17,18,19]. Nevertheless, these studies suffer from limitations that can impede their usage and applicability. Although interactive segmentation can reduce time and errors compared to manual segmentation, fully automatic segmentation can achieve both to a greater degree. In addition, 2D or slice-based segmentation cannot capture 3D anatomical features, which are essential for enhancing a DL model's accuracy. The scarcity of labeled data is another challenge: DL models typically require large amounts of annotated data for training, and obtaining a comprehensive dataset with accurately annotated parotid glands in diverse clinical scenarios can be difficult. Moreover, class imbalance, where certain anatomical structures or pathologies are underrepresented in the dataset, may affect the model's performance and lead to biased results [20].
Several recent studies have evaluated DL-based methods for auto-segmentation, showing promising results. For instance, Gibbons et al. [21] evaluated the accuracy of atlas-based and DL-based auto-segmentation methods for critical OARs in HN, thoracic, and pelvic cancer. Their results showed significant improvements in DL-based segmentation over atlas-based segmentation on both the Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) metrics, while contour adjustment time was significantly reduced compared to manual and atlas-based segmentation. In a similar manner, a recent study [22] compared automatically generated OAR delineations from a deep learning auto-segmentation software against manual contours of the HN region created by experts in the field. The software provided good delineation of OAR volumes, with most structures reporting DSC values higher than 0.85, while the HD values indicated good agreement between the automatic and manual contours. Furthermore, Chen et al. [23] employed a WBNet automatic segmentation method and compared it with three additional algorithms on a large number of whole-body CT images, reporting an average DSC greater than or equal to 0.80 for 39 out of 50 OARs and outperforming the other algorithms. Similarly, Zhan et al. [24] utilized a Weaving Attention U-Net deep learning approach for multi-organ segmentation in HN CT images. Their method integrated convolutional neural networks (CNNs) with a self-attention mechanism and, when compared against other state-of-the-art segmentation algorithms, showed superior or similar performance, generating contours that closely resembled the ground truth for ten OARs, with DSC scores ranging from 0.73 to 0.95. Taking the above into consideration, it is evident that DL-based segmentation methods have the potential to enhance RT treatment for HN cancer patients by improving the precision of OAR delineation. However, it should be noted that DL algorithms are data-driven methods whose results can be influenced by sample imbalance, variance among different patients and anatomical regions, and observer bias. To address these issues, recent studies apply interconnected networks and employ larger datasets from different hospitals to ensure effectiveness across different patient populations [25,26,27].
Our study builds on these advancements, employing an AttentionUNet-based DL segmentation method for HN CT images. This method focuses on targeted structures, suppressing unrelated structures to reduce false positives. We conducted a comprehensive evaluation on three datasets, two public and one private, comparing our method with state-of-the-art DL frameworks for automatic segmentation. The results indicate the high performance of our automatic segmentation approach, which significantly outperforms other DL methods. As an extension of our segmentation framework, we also investigated the efficiency of our model in RT replanning based on volumetric changes in the parotid gland. Our registration framework, aligning the planning CT with the Cone Beam CT (CBCT) scans from different weeks, demonstrated the validity of the registration schema. The mean volumetric percentage difference in parotid gland volume between the first and fifth week of treatment sessions, when comparing automated registration with the ground truth, showed very low values, confirming the effectiveness of our framework in cases requiring replanning during or between RT sessions. Overall, the contributions of this study can be summarized as follows:
A fast and automatic DL segmentation method based on the AttentionUNet is utilized for the segmentation of the parotid glands from HN CT images. The AttentionUNet-based method achieves both a reduction in false positive regions and enhanced segmentation of the parotid glands.
An extensive evaluation of the segmentation method was performed on two public datasets and one private dataset with varying acquisition parameters. Both qualitative and quantitative results support the model’s ability to generalize effectively in images with anatomical and imaging variability.
A registration-based framework was designed and implemented to transfer the segmentation masks produced by the AttentionUNet model to the CBCT images of the following weeks. The results indicate the effectiveness of our framework in cases where replanning is required during or between RT sessions.
The subsequent sections of the paper are organized as follows. Section 2 provides a comprehensive overview of the datasets and the DL methodological framework for segmentation and registration. In Section 3, the results of the proposed schema are presented separately for segmentation and registration. The implications of our methods and results are discussed in Section 4. The entire paper is summarized in Section 5.
3. Results
3.1. Training and Validation Loss
To evaluate the performance of the model during training, we estimated both the error between the model's predictions and the actual target values on the training data (training loss) and the model's ability to generalize to new, unseen data (validation loss). Figure 3 presents the learning curves of the proposed neural network model.
Training was performed for a constant number of epochs (600), selected by observing the trajectory of the loss function, which reached its final values after approximately 300 epochs. The same number of epochs was used across all models and ablation experiments to ensure comparable trajectories. To avoid overfitting, the model with the best DSC value on the validation set was saved for each experiment. The training and validation losses follow very close trajectories, which indicates that overfitting is mitigated and that the model can achieve high accuracy on unseen data.
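For illustration, the following is a minimal sketch of the model-selection rule described above, assuming a PyTorch setup; the loader names, the `dice_coefficient` helper, and the checkpoint path are hypothetical placeholders rather than the study's actual implementation.

```python
import torch

def dice_coefficient(pred, target, eps=1e-6):
    """Dice similarity coefficient for binary masks."""
    inter = (pred * target).sum()
    return ((2 * inter + eps) / (pred.sum() + target.sum() + eps)).item()

def train(model, train_loader, val_loader, loss_fn, optimizer, device, epochs=600):
    """Train for a fixed number of epochs, keeping the checkpoint with the
    best validation DSC (the model-selection rule described above)."""
    best_dsc = 0.0
    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        for images, masks in train_loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(images), masks)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        # Validation pass on unseen data (assumes a single-channel binary output).
        model.eval()
        val_dsc = 0.0
        with torch.no_grad():
            for images, masks in val_loader:
                images, masks = images.to(device), masks.to(device)
                preds = (torch.sigmoid(model(images)) > 0.5).float()
                val_dsc += dice_coefficient(preds, masks)
        val_dsc /= len(val_loader)
        print(f"epoch {epoch}: loss={running_loss / len(train_loader):.4f} "
              f"val_dsc={val_dsc:.4f}")

        # Keep only the weights with the best validation DSC to curb overfitting.
        if val_dsc > best_dsc:
            best_dsc = val_dsc
            torch.save(model.state_dict(), "best_model.pt")
```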
3.2. Segmentation Comparison Experiments
To evaluate model stability, the proposed segmentation DL network is compared with representative state-of-the-art segmentation models (including neural networks with residual connections and transformer-based neural networks), further emphasizing its performance. Due to the small size of the private dataset, the weights calculated for the public datasets were utilized; in this way, the model pre-trained on the public datasets was also tested for its ability to perform well on an external (private) set with different acquisition parameters. In Table 4, a short description of each implementation and the parameters used are provided to enhance the reproducibility of our results.
3.3. Segmentation Quantitative Comparison Analysis
The proposed methodology and the state-of-the-art segmentation networks presented above are compared in terms of DSC (overlap), Recall, Precision, and HD. Results of the 5-fold cross-validation are presented in Table 5 and Table 6 for the public and the private datasets, respectively. Notably, the proposed AttentionUNet-based method achieved the best results in the two main metrics (DSC and HD), while performing comparably on the other two metrics in both datasets.
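For reference, the four reported metrics can be computed for binary 3D masks along the following lines. This is a sketch assuming NumPy/SciPy, with the Hausdorff distance taken over full voxel sets; the exact HD variant and spacing handling used in the study are not detailed here.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def segmentation_metrics(pred, gt, spacing=(1.0, 1.0, 1.0), eps=1e-6):
    """DSC, Recall, Precision, and Hausdorff Distance for binary 3D masks.
    `spacing` converts voxel indices to mm (simplified handling)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    dsc = 2 * tp / (pred.sum() + gt.sum() + eps)
    recall = tp / (gt.sum() + eps)       # fraction of the true gland recovered
    precision = tp / (pred.sum() + eps)  # fraction of the prediction that is correct

    # Symmetric Hausdorff distance between the two voxel sets; extracting
    # surface voxels first would speed this up on large volumes.
    p_pts = np.argwhere(pred) * np.asarray(spacing)
    g_pts = np.argwhere(gt) * np.asarray(spacing)
    hd = max(directed_hausdorff(p_pts, g_pts)[0],
             directed_hausdorff(g_pts, p_pts)[0])
    return dsc, recall, precision, hd
```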
On the public datasets, UNETR achieved the lowest DSC, with a value of 78.82%, and the second highest HD, at 33 mm. Although SwinUNETR achieved a DSC of 81%, its HD value was considerably high (39.75 mm), making it unsuitable for the current radiotherapy application. This could be explained by the fact that transformer-based networks require a large number of training samples to learn to produce effective segmentations. UNet presented a low DSC value of 80.81%, indicating limited ability to capture details in the difficult-to-separate regions of the CT image, but achieved a small HD of 10 mm. SegResNet achieved the second-best results in most metrics, while the proposed AttentionUNet-based method outperformed the others in both DSC and HD. In this regard, it produces segmentations with better overlap with the ground truth and with the smallest distance (mean of 6.24 mm and standard deviation of 2.47 mm).
Regarding the results on the private dataset (Table 6), the weights from training on the public datasets were used as initialization for all models, together with the same cross-validation technique. Due to differences in acquisition parameters between the public and private datasets and the small number of samples in the private dataset, the evaluation metrics presented lower values in comparison. Recall values were very close for all models, while Precision was higher for SegResNet, indicating fewer false positive voxels. As on the public datasets, the AttentionUNet-based method outperformed all other models in overlap and in the distance between prediction and ground truth, with the highest mean DSC and the most stable metrics, as reflected by the small standard deviations. The HD metric was also small (5.35 mm) and stable. Although SegResNet performed comparably on the first three metrics, its large difference in HD indicates lower prediction accuracy compared to the proposed AttentionUNet model.
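In practice, the transfer step amounts to initializing each network with the publicly trained weights before the private-set cross-validation, for example (the checkpoint file name and `model` variable are illustrative):

```python
import torch

# Hypothetical checkpoint produced by training on the unified public datasets.
state = torch.load("attention_unet_public.pt", map_location="cpu")
model.load_state_dict(state)  # initialize before the 5-fold CV on the private set
```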
3.4. Segmentation Qualitative Comparison Analysis
Segmentation examples of the proposed AttentionUNet network and the comparison models are presented in Figure 4a for the public dataset and in Figure 4b for the private dataset. On the public dataset, the AttentionUNet-based model achieved visually better results than the other DL models, which produced slightly rotated masks for the same gland. As such, the proposed framework generated a smoother region and suppressed a fraction of the over-segmented regions produced by the other networks. By contrast, UNet, SwinUNETR, and SegResNet over-segmented a small part of the left parotid gland and concurrently produced less smooth masks. Similarly, SwinUNETR presented a larger difference from the ground truth in the parotid gland.
Concerning the utilization of the private dataset as an external validation, it can be observed that the proposed AttentionUNet model best approximates the ground truth, as it reduces over-segmentation while also producing smoother segmentations. In comparison, UNet failed to outline a large portion of the parotid gland, whereas the other models produced slightly different structures on the left parotid and under-segmented the right one. In this example, although AttentionUNet did not capture the whole structure on the right, it reduced false positives and produced very accurate results on the left.
3.5. Results on the Registration Expansion Framework
Following the segmentation framework, the registration feasibility study was implemented. In Figure 5, an example output from the registration process is presented. "Moving" refers to the image to be aligned with the fixed image, and "transformed" refers to the result of the registration process. In the final two images of each row, the edges of the fixed (reference) image are superimposed on the moving and transformed images to qualitatively assess the performance of the registration method according to the image edges. The first row shows an example of the first registration step, where the pCT image is registered to CBCT1 (week 1), while the second row shows an example of the second registration step, where CBCT1 is registered to CBCT5 (week 5). In Figure 5a, the pCT and CBCT images present a larger difference and do not contain the same number of slices. As a result, the moving and fixed images are far apart, and the edges of the fixed image do not match those of the moving image. Nevertheless, registration brings them to approximately the same space, as can be seen both in the similarity between the fixed and transformed images and in the alignment of the fixed image's edges when superimposed on the transformed image. In Figure 5b, an example pair from the registration between CBCT1 and CBCT5 shows the smaller difference between the two images before registration, as well as the ability of the second registration step to align the images and their edges.
As presented in Figure 5a,b, the transformed image is more similar to the fixed one, and the edges of the fixed image lie closer to the corresponding edges in the transformed image. On the moving image, the edges of the fixed image initially do not match the real edges, while after registration they align with the transformed image.
In Table 7, the metric values for assessing the registration between the two images are presented. Mean values of the Mean Squared Error (MSE), Mutual Information (MI), and Normalized Cross-Correlation (NCC) are calculated for the image pairs before (fixed and moving) and after (fixed and transformed) registration and are reported for both tasks (i.e., pCT to CBCT1 and CBCT1 to CBCT5).
The calculated metrics support the registration accuracy, as the MSE is reduced while NCC and MI are increased in both tasks. Of note, the initial pCT-to-CBCT1 step was performed to enhance the CBCT1-to-CBCT5 step, which was the target of this investigation, providing indices for required RT replanning. As expected, the registration between pCT and CBCT1 was more challenging, as confirmed by the large MSE before registration. After registration, the MSE was reduced and the similarity measures increased, indicating effective alignment between the different modality images. In the second task, the MSE was significantly reduced from 94 to 16, MI increased from 46 to 66, and NCC from 89 to 96, further demonstrating successful registration.
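As a reference point, the three similarity measures in Table 7 can be estimated from intensity arrays as sketched below. This is a NumPy-based sketch; the histogram bin count for the MI estimate is an assumption, not a reported setting.

```python
import numpy as np

def registration_metrics(fixed, moving, bins=64):
    """MSE, NCC, and a histogram-based MI estimate between two intensity
    volumes of the same shape (NumPy arrays)."""
    f = fixed.ravel().astype(float)
    m = moving.ravel().astype(float)

    mse = np.mean((f - m) ** 2)

    # Normalized cross-correlation of the zero-mean intensities.
    fz, mz = f - f.mean(), m - m.mean()
    ncc = (fz * mz).sum() / (np.linalg.norm(fz) * np.linalg.norm(mz))

    # Mutual information from the joint intensity histogram.
    joint, _, _ = np.histogram2d(f, m, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz]))
    return mse, ncc, mi
```

Computing these once for the fixed/moving pair and once for the fixed/transformed pair reproduces the before/after comparison reported in Table 7.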
Subsequently, we calculated the volumetric percentage difference between the first and fifth week of RT, based on the premise that significant volumetric alterations during RT or between RT sessions necessitate planning adaptations. As such, the parotid volume difference between CBCT1 and CBCT5 resulting from the registration tasks was estimated and compared to the ground truth (i.e., the volume difference as annotated by experts).
Table 8 presents the Mean Absolute Error (MAE), the Root Mean Squared Error (RMSE), and the Pearson Correlation Coefficient (PCC) of the volumetric percentage difference between the registration output and the ground truth of the CBCT1/CBCT5 images.
The volumetric percentage difference demonstrated very small values for both MAE and RMSE, accounting for limited divergence from the ground truth. In fact, the MAE value of 7.577 suggests that the mean deviation in percentage difference between the experts' annotation and the registration output was less than 8%. Moreover, the results show a statistically significant positive correlation coefficient of 0.723 (p-value < 0.002), indicating that the variables move in the same direction with relatively high correlation.
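The agreement statistics in Table 8 can be computed as sketched below, assuming one automated and one expert percentage-difference value per patient; the function and argument names are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

def volume_change_agreement(auto_pct, expert_pct):
    """MAE, RMSE, and Pearson correlation between automated and expert
    volumetric percentage differences (one value per patient)."""
    auto_pct = np.asarray(auto_pct, dtype=float)
    expert_pct = np.asarray(expert_pct, dtype=float)
    err = auto_pct - expert_pct
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    pcc, p_value = pearsonr(auto_pct, expert_pct)
    return mae, rmse, pcc, p_value

# Per patient, the percentage difference is 100 * (v_week5 - v_week1) / v_week1,
# with each volume obtained as the foreground voxel count times the voxel volume.
```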
4. Discussion
Automated segmentation in HN cancer is of great significance for RT planning, as it can reduce time and provide crucial information on the anatomical structures of the region of interest. In this study, a DL segmentation framework is proposed, aiming for fast, automatic, and accurate delineation of the parotid glands. Although automated systems for enhancing adaptive radiotherapy exist (e.g., Varian ETHOS, RT-Plan), we propose a low-requirement AI-based decision support system that can provide insight and effectively assist radiotherapy planning [45,46]. The fact that our study is closely connected to the current clinical workflow, combined with further extensive evaluation experiments and technical improvements, could yield a tool applicable to real clinical procedures. In this regard, the utilized AttentionUNet architecture achieved high overall performance on both the public and private datasets, supporting the potential applicability of the proposed method to the clinical workflow.
Regarding the proposed DL model, the AttentionUNet architecture utilized in this study exploits the addition of an attention mechanism to the UNet structure [31]. The UNet-like structure, with five layers in the encoder, enables the network to extract useful context information along the encoder path and then reconstruct the input and its spatial information to produce the final mask in the decoder. In addition, through learnable weights, the network learns to focus on the regions of the image most relevant to the task, in this case the parotid glands. In this way, false positive regions are effectively suppressed, which is particularly valuable in CT images where the structures are not easy to separate [47]. Moreover, the integration of attention gates into the model's design allowed us to incorporate features from deeper encoder layers, introducing better context information into each layer of the decoder [48]. Further enhancement of the segmentation accuracy was accomplished with a combination of Dice and cross-entropy loss, which proved especially useful in the early epochs, when the weights could otherwise not be pushed in the right direction.
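For intuition, a simplified 3D attention gate in the style of Attention U-Net is sketched below in PyTorch; it assumes the gating (decoder) signal has already been upsampled to the skip resolution, which simplifies the original formulation [31], and the channel arguments are illustrative.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate: the decoder (gating) signal re-weights the
    encoder skip features so that voxels unrelated to the parotids are
    suppressed before concatenation in the decoder."""
    def __init__(self, skip_ch: int, gate_ch: int, inter_ch: int):
        super().__init__()
        self.w_x = nn.Conv3d(skip_ch, inter_ch, kernel_size=1)   # skip branch
        self.w_g = nn.Conv3d(gate_ch, inter_ch, kernel_size=1)   # gating branch
        self.psi = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv3d(inter_ch, 1, kernel_size=1),
            nn.Sigmoid(),  # per-voxel attention coefficients in [0, 1]
        )

    def forward(self, x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # x: encoder skip features; g: decoder features at the same spatial size.
        alpha = self.psi(self.w_x(x) + self.w_g(g))
        return x * alpha  # attenuated skip connection
```

For the combined objective, a ready-made option is MONAI's `DiceCELoss`, which sums a Dice term (region overlap) with per-voxel cross-entropy, whose denser gradients help steer the weights in the early epochs.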
For a robust and unbiased evaluation of the segmentation framework, segmentation accuracy was tested on a unified public dataset (consisting of two public subsets) and a private dataset. By including images with different acquisition parameters and settings, we aimed to validate the generalizability of our method, enhancing the model's ability to work effectively under different clinical protocols [49]. The model demonstrated high performance on the public dataset, achieving DSC and HD values of 82.65% and 6.24 mm, respectively. The private dataset was also used as an external validation dataset, where delineation was much more difficult, with variations in acquisition parameters relative to the public training datasets further complicating the task. Nevertheless, the proposed method presented satisfying results that support its ability to generalize to other datasets, achieving a DSC of 78.93% and an HD of 5.35 mm on the private dataset. In this setting, the model pre-trained on the public dataset was applied in a 5-fold cross-validation on the private set, again achieving high metric values, albeit with a small reduction deemed inevitable due to the large differences in acquisition parameters among the datasets [50]. It should be noted that the proposed AttentionUNet-based network outperformed many state-of-the-art DL segmentation methods, both in terms of overlap between the predicted and ground truth masks and in terms of the distance between their edges. On this premise, HD is required to be as low as possible for applicability in RT, where the distance between the true and delineated regions should be short in order to increase tumor coverage and alleviate OAR overdose [51].
Subsequent to the segmentation framework, the predicted masks from the pCT of the private dataset were further utilized in a registration schema, assessing the method's potential use in clinical practice for RT planning adaptations. To achieve this, two image registration tasks were designed and implemented in order to transfer the predicted masks to the CBCT of week 5 of the radiotherapy workflow. The ANTs registration method was utilized to align the pCT with CBCT1, and the calculated transformation was applied to the parotid labels to align them with CBCT1. This intermediate step was used because the difference between the pCT and the CBCT of week 1 is smaller than that of week 5; it allowed the registration to focus on the large differences in acquisition parameters without introducing the large anatomical changes occurring during RT treatment. In the second task, CBCT1 was aligned to CBCT5 (through Fast Symmetric Demons registration) to transform the parotid gland labels and calculate their volume at week 5. Of note, the CBCT images were more similar to each other in terms of acquisition parameters and intensity values, which in turn facilitated the registration procedure. As presented in the registration results section, the proposed workflow was able to approximate the volumetric changes of the parotid glands with high accuracy. Interestingly, the volumetric percentage difference between CBCT1 and CBCT5 indicated an MAE of 7.55 when compared to the experts' observations. This indicates the high performance of the registration process, considering that variations in the registration of cancer CT scans can also occur among different observers. In fact, observers may have individual techniques, preferences, or interpretations when performing registrations, leading to differences in how they align or match the images [21]. The use of automated registration tools can contribute to reducing variability among observers.
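A sketch of the second registration step using SimpleITK's fast symmetric forces Demons filter is shown below; the first step would instead use ANTs (e.g., ANTsPy's `ants.registration`). The iteration count and smoothing value are illustrative, not the study's actual settings.

```python
import SimpleITK as sitk

def demons_cbct1_to_cbct5(fixed, moving, iterations=100, smoothing=1.5):
    """Deformably register CBCT week 1 (moving) onto CBCT week 5 (fixed)."""
    fixed = sitk.Cast(fixed, sitk.sitkFloat32)
    moving = sitk.Cast(moving, sitk.sitkFloat32)
    # Demons assumes comparable intensity distributions; match histograms first.
    moving = sitk.HistogramMatching(moving, fixed)

    demons = sitk.FastSymmetricForcesDemonsRegistrationFilter()
    demons.SetNumberOfIterations(iterations)
    demons.SetStandardDeviations(smoothing)  # Gaussian smoothing of the field
    field = demons.Execute(fixed, moving)
    return sitk.DisplacementFieldTransform(
        sitk.Cast(field, sitk.sitkVectorFloat64))

def warp_parotid_labels(labels, fixed, transform):
    """Propagate the parotid masks with nearest-neighbour interpolation so
    that label values are preserved; week-5 volumes then follow by counting
    foreground voxels and multiplying by the voxel volume."""
    return sitk.Resample(labels, fixed, transform,
                         sitk.sitkNearestNeighbor, 0, labels.GetPixelID())
```

Nearest-neighbour interpolation is the natural choice for the label transfer, since any smoother interpolator would blur the binary parotid masks and bias the volume estimate.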
Limitations and Future Recommendations
Some considerations should be kept in mind when interpreting the results of this study. First and foremost, this study included a relatively small sample size. Although three datasets were combined, the significant variance among patients could limit the generalizability of the findings. As such, the framework proposed in this study may lack the requisite robustness for practical deployment in real-world scenarios; the performance metrics and results, although encouraging, suggest limitations in the model's ability to generalize to the diverse and complex variations encountered in practice. In addition, the segmentation and subsequent registration processes were applied to delineate only the parotid gland, thus training on the shape constraints of a small organ while leaving out the shape characteristics of larger OARs; this might introduce biases and limitations when expanding to other organs. Moreover, the study focused on the algorithmic performance of segmentation and registration, omitting potential limitations associated with their implementation in clinical practice, such as computational resources, training time, and integration with existing medical imaging systems. As a result, caution should be exercised before deploying the DL model in real-world settings, and further refinements or improvements may be necessary to support the adoption of automated DL segmentation systems. In this regard, we aim to expand our framework to involve additional patient cohorts and expert observations to enhance its efficacy and reliability.