Article

Prompt-Based Tuning of Transformer Models for Multi-Center Medical Image Segmentation of Head and Neck Cancer

Numan Saeed, Muhammad Ridzuan, Roba Al Majzoub and Mohammad Yaqub

1 Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi 7909, United Arab Emirates
2 Department of Computer Vision, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi 7909, United Arab Emirates
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Bioengineering 2023, 10(7), 879; https://doi.org/10.3390/bioengineering10070879
Submission received: 14 June 2023 / Revised: 7 July 2023 / Accepted: 13 July 2023 / Published: 24 July 2023
(This article belongs to the Special Issue Artificial Intelligence in Biomedical Imaging)

Abstract

Medical image segmentation is a vital healthcare endeavor requiring precise and efficient models for appropriate diagnosis and treatment. Vision transformer (ViT)-based segmentation models have shown great performance in accomplishing this task. However, to build a powerful backbone, the self-attention block of ViT requires large-scale pre-training data. The present method of modifying pre-trained models entails updating all or some of the backbone parameters. This paper proposes a novel fine-tuning strategy for adapting a pretrained transformer-based segmentation model to data from a new medical center. This method introduces a small number of learnable parameters, termed prompts, into the input space (less than 1% of model parameters) while keeping the rest of the model parameters frozen. Extensive studies employing data from new unseen medical centers show that prompt-based fine-tuning of medical segmentation models provides excellent performance on the new-center data with a negligible drop on the old centers. Additionally, our strategy delivers great accuracy with minimal re-training on new-center data, significantly decreasing the computational and time costs of fine-tuning pre-trained models. Our source code will be made publicly available.

1. Introduction

Recently, several novel segmentation models have been proposed to assist in medical image analysis and understanding, leading to faster and more accurate treatment planning [1,2,3]. Many of these models are transformer-based and demonstrate excellent performance on several medical datasets. Transformers are a class of neural network architectures distinguished chiefly by their heavy use of the attention mechanism [4]. In particular, vision transformers (ViTs) [5] have demonstrated their ability in 3D medical image segmentation [6,7]. However, ViTs intrinsically lack image-specific inductive bias, and their favorable scaling behavior emerges only with large datasets and large model capacity.
On the other hand, medical datasets are limited in size due to time-consuming and expensive expert annotations, which hinders the use of powerful transformer models at their full capacity. A common approach to handling the limited data size in the medical domain is transfer learning [8]. Multiple studies have exploited pretrained networks for different downstream tasks such as classification [9], segmentation [10], and progression prediction [11]. This technique reuses the weights or parameters of ViTs already trained on different but related tasks. More specifically, models are first pretrained on a different large dataset; the pretrained weights act as informed initializations of the model [12,13,14]. The pretrained model is then fine-tuned on the target dataset, yielding faster training and a more generalizable model.
However, the limited size of medical datasets is not the only challenge; medical datasets are sourced from different medical centers that use different machines and acquisition protocols, leading to further heterogeneity in the acquired data [15,16]. As a result, a model trained on data obtained from specific medical centers might fail to perform well on data obtained from a new medical center; see Figure 1 (Scenario 1). Conventionally, transfer learning can be used to adapt the pretrained model to the new medical center's data. One such effective adaptation strategy is partial/full fine-tuning, in which some/all of the parameters of the pretrained model are fine-tuned on the new center's data; see Figure 1 (Scenario 2). However, directly fine-tuning a pretrained transformer model on a new center's data can lead to overfitting (since datasets from any new center are typically small) and catastrophic forgetting (loss of the knowledge learned from the previous centers) [17,18]. Moreover, this strategy requires storing and deploying a separate copy of the backbone parameters for every newly acquired medical center dataset. This is costly and infeasible if the end solution is regularly deployed at new medical centers or if the acquisition protocol and/or machines in an existing center change. This infeasibility is particularly prominent in transformer-based models, as they are significantly larger than their convolutional neural network (CNN) counterparts. Another possibility is to re-train the model on samples from both the old and new centers' data and re-deploy it for inference; see Figure 1 (Scenario 3). This scenario is computationally expensive and infeasible due to the same pitfalls as Scenario 2.
In this work, inspired by [19,20,21], we propose a prompt-based method for fine-tuning ViTs on new medical centers' data. It is important to note that previous studies have mainly focused on large language models [19,20] and natural images [21], whereas our research centers on utilizing prompt-based fine-tuning to tackle medical image segmentation tasks. More specifically, we address multi-class segmentation of cancer lesions with multi-center data. Instead of altering or fine-tuning the pretrained transformer, we introduce center-specific learnable token parameters, called prompts, in the input space of the segmentation model. Only the prompts and the output convolutional layer are learnable during the fine-tuning of the model on the new center's data; the rest of the pre-trained transformer model is frozen. The current deployment scenarios as well as our proposed approach (Scenario 4) are depicted in Figure 1.
We show that this method can achieve high accuracy on new centers' data with a negligible loss in accuracy on the old centers, in contrast to full or partial fine-tuning techniques, where the model accuracy on the old-center data is compromised. The main contributions of this work are as follows:
  • We propose a new prompt-based fine-tuning technique for the transformer-based medical image segmentation models that reduces the fine-tuning time and the number of learnable parameters (less than 1% of the model parameters) to be stored for the new medical center.
  • The proposed method achieves accuracy equivalent to the full fine-tuning technique on new-center data while largely preserving the accuracy on the old-center data, which full fine-tuning compromises.
  • We showcase the efficacy of the proposed method on multi-class segmentation of head and neck cancer tumors using multi-channel computed tomography (CT) and positron emission tomography (PET) scans of patients obtained from multi-center (seven centers) sources.

2. Methodology

The quality and distribution of the data collected by different medical centers can vary considerably due to differences in imaging protocols, equipment, and patient populations. This heterogeneity is a barrier to developing precise and robust models that generalize optimally to new medical center data. In this section, we describe a novel tuning technique, called prompt-based tuning, for adapting transformer-based medical image segmentation models, which overcomes the pitfalls of conventional fine-tuning techniques. Prompt-based fine-tuning injects a small number of learnable parameters into the transformer's input space and keeps the backbone of the trained model frozen during the downstream training stage. The overall framework is presented in Figure 2. We demonstrate two variants of prompt-based tuning, shallow and deep, and compare their performance to conventional fine-tuning methods such as partial and full fine-tuning. Below, we describe the two prompt-based tuning methods and highlight the differences between them.

2.1. Shallow Prompt Tuning

In shallow prompt fine-tuning, a set of p continuous prompts of dimension d is introduced in the input space after the embedding layer. These prompts are concatenated with the token embeddings of the volumetric patches of an input image $x \in \mathbb{R}^{H \times W \times D \times C}$, where H, W, D, and C are the height, width, depth, and channels of the 3D image, respectively. K × K × K represents the dimensions of each patch, and $n = HWD/K^{3}$ is the number of patches extracted. The embedding layer projects these patches to a dimension d. The class token is dropped from the ViT [5] as the experiments are for a segmentation task. The resulting concatenated prompts and embeddings are fed to a transformer encoder consisting of L layers, following the same pipeline as the original ViT [5], with normalization, multi-head self-attention (MSA), and a multi-layer perceptron. The decoder uses only the image patch embeddings as inputs, and the prompt embeddings are discarded. Shallow prompt-based fine-tuning is formulated as:
$$ x_0 = \mathrm{Embedding}(x), \qquad x_0 \in \mathbb{R}^{n \times d} $$
$$ [U_1, x_1] = \mathrm{Encoder}_1([P, x_0]), \qquad P \in \mathbb{R}^{p \times d} $$
$$ [U_i, x_i] = \mathrm{Encoder}_i([U_{i-1}, x_{i-1}]), \qquad i = 2, \ldots, L $$
$$ Y_{\mathrm{seg}} = \mathrm{Decoder}(\mathrm{ConvTrans3D}(x_i)), \qquad i = 3, 6, 9, 12 $$
where P is the prompt matrix and ConvTrans3D refers to 3D transposed convolution.
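To make the shallow variant concrete, the following is a minimal PyTorch sketch of how learnable prompts could be prepended to the patch embeddings of a frozen transformer encoder. The class and argument names (ShallowPromptedEncoder, num_prompts, skip_indices) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ShallowPromptedEncoder(nn.Module):
    """Frozen ViT-style encoder with p learnable prompts prepended to the patch tokens
    (illustrative sketch, not the authors' released code)."""

    def __init__(self, embed_layer: nn.Module, encoder_layers: nn.ModuleList,
                 num_prompts: int = 50, embed_dim: int = 768,
                 skip_indices=(3, 6, 9, 12)):
        super().__init__()
        self.embed_layer = embed_layer        # patch-embedding layer (frozen)
        self.encoder_layers = encoder_layers  # L transformer blocks (frozen)
        self.skip_indices = skip_indices      # layers whose outputs feed the decoder
        # prompt matrix P of shape (p, d): the only new encoder parameters
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, embed_dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)
        for p in list(self.embed_layer.parameters()) + list(self.encoder_layers.parameters()):
            p.requires_grad = False           # backbone stays frozen

    def forward(self, x: torch.Tensor):
        tokens = self.embed_layer(x)                                  # x_0: (B, n, d)
        z = torch.cat([self.prompts.expand(tokens.shape[0], -1, -1), tokens], dim=1)
        skips = []
        for i, layer in enumerate(self.encoder_layers, start=1):
            z = layer(z)                                              # [U_i, x_i]
            if i in self.skip_indices:
                skips.append(z[:, self.prompts.shape[1]:, :])         # pass only x_i onward
        return skips  # the decoder consumes these hidden states (prompt tokens discarded)
```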

2.2. Deep Prompt Tuning

In deep prompt fine-tuning, prompts can be introduced in the input space of each transformer layer or of a subset of layers. In our implementation, we add the deep prompts after each skip connection layer:
$$ [\,\_\,, x_i] = \mathrm{Encoder}_i([P_{i-1}, x_{i-1}]), \qquad i = 1, \ldots, L $$
where P_{i-1} is the learnable prompt matrix inserted before the i-th encoder layer and the underscore denotes the discarded prompt outputs.
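A minimal sketch of the deep variant is given below, assuming for simplicity that an independent prompt matrix is inserted before every layer (the paper restricts this to a subset of layers); the names are again illustrative placeholders.

```python
import torch
import torch.nn as nn

class DeepPromptedEncoder(nn.Module):
    """Frozen ViT-style encoder with independent learnable prompts per layer (illustrative sketch)."""

    def __init__(self, embed_layer: nn.Module, encoder_layers: nn.ModuleList,
                 num_prompts: int = 50, embed_dim: int = 768):
        super().__init__()
        self.embed_layer = embed_layer
        self.encoder_layers = encoder_layers
        # one prompt matrix P_{i-1} per transformer layer; the only new parameters
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.zeros(1, num_prompts, embed_dim)) for _ in encoder_layers]
        )
        for p in self.prompts:
            nn.init.trunc_normal_(p, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.embed_layer(x)                                   # (B, n, d)
        for layer, prompt in zip(self.encoder_layers, self.prompts):
            z = torch.cat([prompt.expand(tokens.shape[0], -1, -1), tokens], dim=1)
            z = layer(z)
            tokens = z[:, prompt.shape[1]:, :]   # keep patch tokens; discard prompt outputs ("_")
        return tokens
```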

3. Experiments

We use the state-of-the-art transformer-based segmentation models, UNETR [6] and Swin-UNETR [22]. In addition, we compare the two variants of the proposed method to partial and full fine-tuning, two prevalent transfer learning protocols used in medical imaging.

3.1. Dataset

The dataset used in this work is multi-center, multi-class, and multi-modal. This dataset comprises head and neck cancer patient scans collected from seven centers. The data consist of CT and PET scans, as well as electronic health records (EHR) of each patient. The PET volume is registered with the CT volume to a common origin, although they each have varying sizes and resolutions. The CT sizes range from (128, 128, 67) to (512, 512, 736), while the PET sizes range from (128, 128, 66) to (256, 256, 543) voxels. The CT resolutions range from (0.488, 0.488, 1.00) to (2.73, 2.73, 2.80), while the PET resolutions range from (2.73, 2.73, 2.00) to (5.47, 5.47, 5.00) mm in the x, y, and z directions. Some scans are of the head and neck regions, while others contain the full body of the patients.
As shown in Figure 3, the PET/CT scans are in the NIfTI format. They have been resampled to 1 × 1 × 1 mm³ isotropic resolution and cropped to a dimension of 176 × 176 × 176 voxels around the primary tumor and lymph nodes. The CT HU values are clipped to a range of −200 to 200, while the PET is clipped to a maximum of 5 standard uptake values (SUV).
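For illustration, the intensity clipping and channel stacking described above can be written in a few lines of NumPy. This is a sketch under the assumption that the CT and PET volumes have already been resampled to 1 mm isotropic resolution and cropped to 176 × 176 × 176 voxels; it is not the official HECKTOR preprocessing code.

```python
import numpy as np

def clip_and_stack(ct: np.ndarray, pet: np.ndarray) -> np.ndarray:
    """Clip CT/PET intensities and stack them into a 2-channel volume (illustrative sketch)."""
    ct = np.clip(ct, -200, 200)            # clip CT Hounsfield units to [-200, 200]
    pet = np.clip(pet, None, 5.0)          # cap PET at 5 standard uptake values (SUV)
    return np.stack([ct, pet], axis=0)     # shape: (2, 176, 176, 176)
```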
The dataset provides, for each patient, ground-truth segmentation masks of the primary gross tumor volumes (GTVp) and nodal gross tumor volumes (GTVn), together with other clinical information. The annotations were made by medical professionals at the respective centers and are provided with the dataset. The dataset is publicly available on the MICCAI 2022 HEad and neCK TumOR (HECKTOR) challenge website [23]. The complete dataset consists of 524 samples. The detailed distribution of the dataset across the different centers is listed in Table 1, along with the type of scanner used to acquire the scans.

3.2. Experimental Setup

The dataset from each of the seven centers is first split into training and test sets with a 70:30 ratio for a fair comparison. In all experiments, the model is first pre-trained on the training data of six centers and then fine-tuned on the seventh center's training data. We evaluate the performance of the model on (1) the seventh center's test set (new center) and (2) the six centers' test sets (old centers). We compare both metrics for the following fine-tuning techniques, as shown in Figure 4.
No fine-tuning: The pre-trained model is used directly to infer the test samples without any fine-tuning.
Partial fine-tuning: This technique involves fine-tuning the pre-trained model’s last decoder block using the seventh center’s training set.
Full fine-tuning: This technique involves fine-tuning the entire pre-trained model using the seventh center’s training set.
Shallow prompt fine-tuning: This is a variant of prompt-based fine-tuning, where the prompts are introduced only in the input space. Only the prompts and the final convolutional layer are fine-tuned using the seventh center’s training set, while the rest of the model is frozen.
Deep prompt fine-tuning: This technique is similar to shallow prompt fine-tuning, except that prompts are introduced at each transformer layer, so every layer has its own new trainable prompts. The prompts and the final convolutional layer are fine-tuned using the seventh center's training set.
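The practical difference between these regimes lies in which parameters receive gradients. The sketch below toggles requires_grad on a generic PyTorch segmentation model; the attribute names (model.decoder, model.prompts, model.out_conv) are hypothetical placeholders and not the actual UNETR/Swin-UNETR module names.

```python
import torch.nn as nn

def configure_finetuning(model: nn.Module, mode: str) -> None:
    """Freeze/unfreeze parameters according to the chosen fine-tuning regime (illustrative sketch)."""
    for p in model.parameters():          # start with everything frozen
        p.requires_grad = False

    if mode == "full":                    # fine-tune the entire pre-trained model
        for p in model.parameters():
            p.requires_grad = True
    elif mode == "partial":               # only the last decoder block is updated
        for p in model.decoder[-1].parameters():
            p.requires_grad = True
    elif mode in ("shallow_prompt", "deep_prompt"):
        for p in model.prompts.parameters():   # learnable prompt tokens
            p.requires_grad = True
        for p in model.out_conv.parameters():  # final convolutional layer
            p.requires_grad = True
    # mode == "none": direct inference, nothing is trainable
```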

3.3. Implementation Details

We implement all our models using the PyTorch framework and train them on a single NVIDIA Tesla A6000 GPU. The details of the experimental settings for all fine-tuning techniques are listed in Appendix A Table A1.
All images are aligned to the same 3D orientation (anterior–posterior, right–left, and inferior–superior) during training and testing. The CT/PET scans are concatenated to form a 2-channel input, with their intensity values independently normalized based on their respective means and standard deviations. The training augmentations applied to the CT/PET scans include extracting four random crops of size 96 × 96 × 96, each equally likely to be centered on foreground (primary tumor or lymph node) voxels or background voxels. The images are randomly flipped in the x, y, and z directions with a probability of 0.2 and are further rotated by 90 degrees in the x and y directions up to 3 times, with a probability of 0.2. These augmentations aim to create more diverse and representative training data, which can help to improve the performance and generalization of deep learning models for medical image analysis tasks. All pre-processing and augmentation details are listed in Appendix A Table A2.
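A rough equivalent of this augmentation pipeline, assuming the MONAI transforms library with dictionary-based 3D transforms, is sketched below; the exact transform choices and key names are assumptions rather than the authors' implementation.

```python
from monai.transforms import (
    Compose, NormalizeIntensityd, RandCropByPosNegLabeld, RandFlipd, RandRotate90d
)

# Illustrative training-augmentation pipeline; parameters follow Section 3.3 / Table A2,
# but this is not necessarily the authors' exact code.
train_transforms = Compose([
    # channel-wise z-score normalization of the 2-channel CT/PET input
    NormalizeIntensityd(keys=["image"], channel_wise=True),
    # four random 96^3 crops, equally likely to be centered on foreground or background
    RandCropByPosNegLabeld(keys=["image", "label"], label_key="label",
                           spatial_size=(96, 96, 96), pos=1, neg=1, num_samples=4),
    # random flips along x, y, z with probability 0.2 each
    RandFlipd(keys=["image", "label"], prob=0.2, spatial_axis=0),
    RandFlipd(keys=["image", "label"], prob=0.2, spatial_axis=1),
    RandFlipd(keys=["image", "label"], prob=0.2, spatial_axis=2),
    # random 90-degree rotations (up to 3 times) in the x-y plane with probability 0.2
    RandRotate90d(keys=["image", "label"], prob=0.2, max_k=3, spatial_axes=(0, 1)),
])
```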

4. Results

Table 2 presents the results of fine-tuning the pre-trained UNETR and Swin-UNETR on the old and new medical center datasets. We conduct our evaluations using five-fold cross-validation with a total of 290 experiments. The results of all folds for all centers can be found in the Supplementary Material. We use the Dice score [24] to evaluate segmentation performance in our experiments; a minimal sketch of this metric is given after the list below. We can observe the following:
1. All the different fine-tuning techniques yield better performance for the new centers than direct inference on the pre-trained models.
2. Shallow prompt-based fine-tuning achieves a higher or comparable Dice score on the new-center data, with nearly the same number of learnable parameters as partial fine-tuning (see Table 3). However, shallow prompts outperform partial and full fine-tuning techniques on the old-center data for all seven centers.
3. Deep prompt-based fine-tuning achieves the same Dice score as full fine-tuning on the new-center data but with significantly fewer learnable parameters. In addition, deep prompt-based fine-tuning outperforms the full fine-tuning on old-center data for all seven centers. Thus, even if the storage of model weights is not a concern, prompt-based fine-tuning is still a promising approach for fine-tuning models as it retains more knowledge related to old centers.
4. The prompt-based fine-tuning of Swin-UNETR exhibits a similar pattern to that of UNETR. However, the loss in performance on old-center data for the conventional fine-tuning methods is less prominent for some centers compared to that of UNETR. This can be explained by the inductive biases in Swin-UNETR, which employs MSA within local shifted windows and merges patch embeddings at deeper layers. Swin-UNETR requires further optimization with regard to prompt position to further improve its performance.
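As referenced above, the Dice score underlying the reported numbers reduces, for a single binary mask, to the computation below. This is a generic soft-Dice sketch for illustration, not the authors' or the HECKTOR challenge's exact evaluation code; per-class scores for GTVp and GTVn are computed separately and then aggregated (cf. Table 2).

```python
import torch

def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice coefficient for one binary mask pair (illustrative sketch)."""
    # pred and target are binary {0, 1} volumes of identical shape
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```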
Table 3. Total number of learnable parameters for different fine-tuning techniques.

Model | None | Partial | Full | Shallow Prompts | Deep Prompts
UNETR | - | 0.025 M | 96 M | 0.038 M | 0.15 M
Swin-UNETR | - | 0.055 M | 62 M | 0.073 M | -

5. Discussion

This work introduces a new method for fine-tuning transformer-based medical segmentation models on new-center data. Our method is more efficient than conventional approaches, requiring fewer parameters at a lower computational cost while achieving the same or better performance on new-center data when compared to conventional methods (Table A3). We show superior performance for prompt-based fine-tuning compared to other techniques, achieving a statistically significant increase in the Dice score for old centers. We note the difference in performance between CHUP and CHUS, which have a similar number of samples but different acquisition machines and origins. CHUP exhibits a larger drop in performance on the old centers than CHUS (nearly 8% in CHUP vs. 1% in CHUS for partial and full fine-tuning). This is likely due to the larger dataset distribution shift in CHUP compared to the rest of the centers. However, if shallow or deep prompt-based fine-tuning is used, the drop is only 2–3%. We perform a Wilcoxon signed-rank test [25] to assess whether the deep prompt-based tuning of medical segmentation models is significantly better than other fine-tuning techniques on old- and new-center data (the null hypothesis H_0 states that the segmentation performance of deep prompt-based fine-tuning is statistically the same as that of the other techniques; the alternative hypothesis H_1 states that the deep prompt-based technique outperforms the other methods). Table 4 presents the results of each test; it can be observed that deep prompt-based fine-tuning outperforms the full and partial fine-tuning techniques on the old centers' data. Similarly, it outperforms the partial fine-tuning and shallow prompt-based techniques on the new-center data. However, the test fails on the new center's data for full fine-tuning. Thus, we proceed to perform a two-tailed t-test and confirm that the performances of deep prompt-based fine-tuning and full fine-tuning on new-center data are statistically the same (p-value < 0.05).
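As an illustration of the statistical procedure described above, the sketch below applies SciPy's Wilcoxon signed-rank test with a one-sided alternative (H_1: deep prompt-based fine-tuning outperforms the other method) followed by a paired two-tailed t-test on matched Dice scores. The function and variable names are placeholders; this is not the authors' analysis script.

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

def compare_to_deep_prompt(deep_prompt_dice: np.ndarray, other_dice: np.ndarray, alpha: float = 0.05):
    """Paired significance tests between deep prompt-based fine-tuning and another method
    (illustrative sketch on matched per-fold/per-center Dice scores)."""
    # one-sided Wilcoxon signed-rank test: H1 is that deep prompt scores are greater
    _, p_wilcoxon = wilcoxon(deep_prompt_dice, other_dice, alternative="greater")
    # two-tailed paired t-test used as a follow-up when the Wilcoxon test is inconclusive
    _, p_ttest = ttest_rel(deep_prompt_dice, other_dice)
    return {"wilcoxon_better": p_wilcoxon < alpha, "p_wilcoxon": p_wilcoxon, "p_ttest": p_ttest}
```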
In our experiments, we observed that the extra learnable prompts at deeper layers in the deep prompt-based fine-tuning improve the performance compared to shallow prompt-based fine-tuning, which only inserts prompts in the input space after the patch embedding layer. We present the results of ablating different prompt positions and prompt numbers in Table A11, Table A12 and Table A13. Our findings indicate that their specific position does not significantly influence the model’s performance when the number of prompts is fixed. However, for a fixed number of prompts distributed across various layers, incorporating prompts into the skip connection layers adversely affects the model’s performance, while their exclusion leads to performance improvements, as shown in Table A12. Furthermore, the results reveal that increasing the number of prompts initially yields improvements in performance. However, there is a threshold beyond which the model tends to become overparameterized, resulting in a degradation of its performance. These results serve as motivation for our choice to position the deep prompts after the skip connection layers in our design. This suggests that adding too many prompts in the deeper layers can over-parameterize the model, which may result in overfitting on new-center data. Further studies will be conducted to quantify the effect of the number and position of the prompts.

6. Conclusions

We propose a prompt-based fine-tuning framework for the medical image segmentation problem. This method takes advantage of the ability of transformers to handle a variable number of tokens at the input and in the deeper layers. We validate our proposed method by training transformer-based segmentation models on head and neck PET/CT scans and comparing our results with conventional fine-tuning techniques. Although we were able to show the efficacy of the proposed method on medical image segmentation problems, further investigation is needed to study its scalability to other transformer-based segmentation models. In addition, investigation of prompt-based learning in different tasks, such as classification and prognosis, is needed to assess its efficacy, along with a comparison of its performance with domain generalization methods.

Author Contributions

Conceptualization, N.S., M.R. and M.Y.; methodology, N.S., M.R. and M.Y.; software, N.S., M.R., R.A.M. and M.Y.; validation, N.S., M.R., R.A.M. and M.Y.; formal analysis, N.S. and M.R.; investigation, N.S., M.R. and R.A.M.; resources, M.Y.; data curation, R.A.M.; writing—original draft preparation, N.S. and M.R.; writing—review and editing, N.S., M.R., R.A.M. and M.Y.; visualization, N.S. and R.A.M.; supervision, M.Y.; project administration, N.S. and M.R.; and funding acquisition, M.Y. All authors have read and agreed to the published version of the manuscript.

Funding

M.Y., N.S., M.R. and R.A.M. were funded by an MBZUAI research grant (AI8481000001).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data used in this study were obtained from the Head and Neck Tumor Segmentation and Outcome Prediction in the PET/CT Images challenge [23].

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
CT    Computed tomography
PET   Positron emission tomography
ViT   Vision transformer
CNN   Convolutional neural networks
MSA   Multi-head self-attention

Appendix A

Appendix A.1. Experimental Settings and Augmentations

Table A1. Experimental settings for the different fine-tuning techniques.

Hyperparameter | Full | Shallow Prompt | Deep Prompt
Optimizer | AdamW | SGD | SGD
Learning rate | 1 × 10^-5 | 0.05 | 0.05
Weight decay | 1 × 10^-3 | 0 | 0
Learning rate scheduler | - | cosine decay | cosine decay
Total epochs | 100 | 100 | 100
Batch size | 3 | 3 | 3
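A minimal sketch of how the settings in Table A1 could be instantiated in PyTorch is given below; this is illustrative rather than the authors' training script, and it assumes the trainable parameters have already been selected according to the chosen fine-tuning regime.

```python
import torch

def build_optimizer(model: torch.nn.Module, mode: str, total_epochs: int = 100):
    """Build the optimizer and scheduler following Table A1 (illustrative sketch)."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    if mode == "full":
        # full fine-tuning: AdamW, lr = 1e-5, weight decay = 1e-3, no LR scheduler
        optimizer = torch.optim.AdamW(trainable, lr=1e-5, weight_decay=1e-3)
        scheduler = None
    else:
        # shallow/deep prompt fine-tuning: SGD, lr = 0.05, cosine decay over all epochs
        optimizer = torch.optim.SGD(trainable, lr=0.05, weight_decay=0.0)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_epochs)
    return optimizer, scheduler
```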
Table A2. Preprocessing and augmentation details.

Augmentation | Axis | Probability | Size
Orientation | PLS | - | -
CT/PET concatenation | 1 | - | -
Normalization | - | - | -
Random crop | - | 0.5 | 96 × 96 × 96
Random flip | x, y, z | 0.2 | -
Rotate by 90° (up to 3×) | x, y | 0.2 | -
Table A3. Comparison of training time and GPU consumption between prompt and non-prompt fine-tuning methods.

Fine-Tuning | Runtime (min) | GPU Consumption (GB)
Partial | 75 | 15.060
Full | 101 | 41.763
Shallow prompt | 76 | 19.275
Deep prompt | 78 | 19.361

Appendix A.2. Five-Fold Results per Center

Table A4. Five-fold results for UNETR on CHUP center.

Fold | No Finetuning | Partial Finetuning | Full Finetuning | Shallow Prompt | Deep Prompt
1 | 0.6112 | 0.6813 | 0.6307 | 0.6302 | 0.6344
2 | 0.7442 | 0.7917 | 0.7596 | 0.7519 | 0.7449
3 | 0.6399 | 0.6627 | 0.7285 | 0.6910 | 0.6926
4 | 0.6919 | 0.7241 | 0.7551 | 0.7417 | 0.7518
5 | 0.6663 | 0.7165 | 0.7753 | 0.7523 | 0.7751
μ ± σ | 0.6708 ± 0.0509 | 0.7153 ± 0.0496 | 0.7298 ± 0.0579 | 0.7134 ± 0.0529 | 0.7198 ± 0.0564
Table A5. Five-fold results for UNETR on CHUS center.

Fold | No Finetuning | Partial Finetuning | Full Finetuning | Shallow Prompt | Deep Prompt
1 | 0.7861 | 0.8061 | 0.8058 | 0.8035 | 0.8150
2 | 0.7825 | 0.7975 | 0.7884 | 0.7886 | 0.7842
3 | 0.6875 | 0.6981 | 0.6906 | 0.6903 | 0.6912
4 | 0.7947 | 0.8196 | 0.8125 | 0.8103 | 0.8176
5 | 0.7987 | 0.8047 | 0.8300 | 0.8107 | 0.8075
μ ± σ | 0.7699 ± 0.0465 | 0.7852 ± 0.0493 | 0.7855 ± 0.0551 | 0.7807 ± 0.0513 | 0.7831 ± 0.053
Table A6. Five-fold results for UNETR on CHUM center.

Fold | No Finetuning | Partial Finetuning | Full Finetuning | Shallow Prompt | Deep Prompt
1 | 0.8009 | 0.8017 | 0.8089 | 0.8086 | 0.8133
2 | 0.7387 | 0.7343 | 0.7414 | 0.7410 | 0.7392
3 | 0.7626 | 0.7559 | 0.7541 | 0.7702 | 0.7733
4 | 0.7668 | 0.7690 | 0.7675 | 0.7697 | 0.7629
5 | 0.7882 | 0.7995 | 0.8077 | 0.7981 | 0.8048
μ ± σ | 0.7714 ± 0.0241 | 0.7721 ± 0.0288 | 0.7759 ± 0.0309 | 0.7775 ± 0.0266 | 0.7799 ± 0.0294
Table A7. Five-fold results for UNETR on CHUV center.

Fold | No Finetuning | Partial Finetuning | Full Finetuning | Shallow Prompt | Deep Prompt
1 | 0.6368 | 0.6496 | 0.6633 | 0.6555 | 0.6454
2 | 0.7516 | 0.7631 | 0.7667 | 0.7609 | 0.7682
3 | 0.7832 | 0.8014 | 0.8013 | 0.8091 | 0.8063
4 | 0.8243 | 0.8340 | 0.8422 | 0.8339 | 0.8413
5 | 0.6645 | 0.6812 | 0.6959 | 0.6910 | 0.7043
μ ± σ | 0.7321 ± 0.0793 | 0.7459 ± 0.0784 | 0.7539 ± 0.0738 | 0.7501 ± 0.0759 | 0.7531 ± 0.0788
Table A8. Five-fold results for UNETR on MDA center.

Fold | No Finetuning | Partial Finetuning | Full Finetuning | Shallow Prompt | Deep Prompt
1 | 0.7258 | 0.7284 | 0.7750 | 0.7492 | 0.7571
2 | 0.7384 | 0.7457 | 0.7736 | 0.7464 | 0.7538
3 | 0.6843 | 0.6970 | 0.7159 | 0.6979 | 0.7000
4 | 0.7426 | 0.7500 | 0.7584 | 0.7504 | 0.7515
5 | 0.7207 | 0.7279 | 0.7434 | 0.7258 | 0.7271
μ ± σ | 0.7224 ± 0.0231 | 0.7298 ± 0.0209 | 0.7533 ± 0.0245 | 0.7339 ± 0.0225 | 0.7379 ± 0.0243
Table A9. Five-fold results for UNETR on HGJ center.

Fold | No Finetuning | Partial Finetuning | Full Finetuning | Shallow Prompt | Deep Prompt
1 | 0.7887 | 0.8024 | 0.8062 | 0.7994 | 0.7955
2 | 0.7511 | 0.7534 | 0.7622 | 0.7543 | 0.7566
3 | 0.8100 | 0.8031 | 0.8075 | 0.8125 | 0.8114
4 | 0.8035 | 0.8225 | 0.8262 | 0.8148 | 0.8191
5 | 0.7852 | 0.7935 | 0.7883 | 0.7878 | 0.7739
μ ± σ | 0.7877 ± 0.0229 | 0.7949 ± 0.0255 | 0.7981 ± 0.0241 | 0.7938 ± 0.0246 | 0.7913 ± 0.0259
Table A10. Five-fold results for UNETR on HMR center.

Fold | No Finetuning | Partial Finetuning | Full Finetuning | Shallow Prompt | Deep Prompt
1 | 0.6632 | 0.6903 | 0.7132 | 0.7201 | 0.7453
2 | 0.7001 | 0.7188 | 0.7265 | 0.7076 | 0.7194
3 | 0.6796 | 0.6797 | 0.6933 | 0.6829 | 0.6854
4 | 0.5926 | 0.6229 | 0.6400 | 0.6299 | 0.6264
5 | 0.7203 | 0.7817 | 0.7762 | 0.7554 | 0.7632
μ ± σ | 0.6712 ± 0.0489 | 0.6987 ± 0.0580 | 0.7098 ± 0.0496 | 0.6992 ± 0.0467 | 0.708 ± 0.0542

Appendix A.3. Ablation for Prompt Position and Number of Prompts

Table A11. Effect of changing the position of the concatenated prompts on the performance of the model on Fold 1 of the CHUP center using UNETR.

Position | Avg Dice | P-Tumor | Lymph
Shallow | 0.6302 | 0.7778 | 0.4827
1 | 0.6307 | 0.7793 | 0.4820
2 | 0.6305 | 0.7775 | 0.4834
3 | 0.6303 | 0.7765 | 0.4840
4 | 0.6306 | 0.7774 | 0.4837
5 | 0.6293 | 0.7775 | 0.4811
6 | 0.6306 | 0.7792 | 0.4820
7 | 0.6295 | 0.7785 | 0.4806
8 | 0.6303 | 0.7783 | 0.4823
9 | 0.6306 | 0.7789 | 0.4823
10 | 0.6304 | 0.7785 | 0.4822
11 | 0.6304 | 0.7789 | 0.4819
12 | 0.6303 | 0.7786 | 0.4819
Table A12. Comparing the model performance on Fold 1 of the CHUP center with prompts on skip connections vs. no prompts on skip connections.

Prompts on Skip Connections | Avg Dice | P-Tumor | Lymph
No | 0.6342 | 0.7753 | 0.4931
Yes | 0.6289 | 0.7778 | 0.4810
Table A13. Effect of changing the number of concatenated prompts on the performance of the model on Fold 1 of the CHUP center using UNETR.

Number of Prompts | Avg Dice | P-Tumor | Lymph
10 | 0.6292 | 0.7781 | 0.4802
30 | 0.6291 | 0.7766 | 0.4816
50 | 0.6302 | 0.7778 | 0.4827
70 | 0.6307 | 0.7788 | 0.4827
90 | 0.6300 | 0.7774 | 0.4827
100 | 0.6294 | 0.7774 | 0.4815

References

  1. Alalwan, N.; Abozeid, A.; ElHabshy, A.; Alzahrani, A. Efficient 3D Deep Learning Model for Medical Image Semantic Segmentation. Alex. Eng. J. 2021, 60, 1231–1239. [Google Scholar] [CrossRef]
  2. Valanarasu, J.M.J.; Oza, P.; Hacihaliloglu, I.; Patel, V.M. Medical Transformer: Gated Axial-Attention for Medical Image Segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2021, Strasbourg, France, 27 September–1 October 2021; de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 36–46. [Google Scholar]
  3. Zhou, H.; Guo, J.; Zhang, Y.; Yu, L.; Wang, L.; Yu, Y. nnFormer: Interleaved Transformer for Volumetric Segmentation. arXiv 2021, arXiv:2109.03201. [Google Scholar]
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.u.; Polosukhin, I. Attention is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  5. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  6. Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. UNETR: Transformers for 3D Medical Image Segmentation. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; IEEE Computer Society: Los Alamitos, CA, USA, 2022; pp. 1748–1758. [Google Scholar] [CrossRef]
  7. Yan, Q.; Liu, S.; Xu, S.; Dong, C.; Li, Z.; Shi, J.Q.; Zhang, Y.; Dai, D. 3D Medical image segmentation using parallel transformers. Pattern Recognit. 2023, 138, 109432. [Google Scholar] [CrossRef]
  8. Yu, X.; Wang, J.; Hong, Q.Q.; Teku, R.; Wang, S.H.; Zhang, Y.D. Transfer learning for medical images analyses: A survey. Neurocomputing 2022, 489, 230–254. [Google Scholar] [CrossRef]
  9. Yu, Y.; Lin, H.; Meng, J.; Wei, X.; Guo, H.; Zhao, Z. Deep Transfer Learning for Modality Classification of Medical Images. Information 2017, 8, 91. [Google Scholar] [CrossRef] [Green Version]
  10. Karimi, D.; Warfield, S.K.; Gholipour, A. Transfer learning in medical image segmentation: New insights from analysis of the dynamics of model parameters and learned representations. Artif. Intell. Med. 2021, 116, 102078. [Google Scholar] [CrossRef]
  11. Wardi, G.; Carlile, M.; Holder, A.; Shashikumar, S.; Hayden, S.R.; Nemati, S. Predicting Progression to Septic Shock in the Emergency Department Using an Externally Generalizable Machine-Learning Algorithm. Ann. Emerg. Med. 2021, 77, 395–406. [Google Scholar] [CrossRef] [PubMed]
  12. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G.E. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709. [Google Scholar]
  13. Chen, X.; Fan, H.; Girshick, R.B.; He, K. Improved Baselines with Momentum Contrastive Learning. arXiv 2020, arXiv:2003.04297. [Google Scholar]
  14. Lin, K.; Heckel, R. Vision Transformers Enable Fast and Robust Accelerated MRI. In Proceedings of the 5th International Conference on Medical Imaging with Deep Learning, Zurich, Switzerland, 6–8 July 2022; Konukoglu, E., Menze, B., Venkataraman, A., Baumgartner, C., Dou, Q., Albarqouni, S., Eds.; 2022; Volume 172, pp. 774–795. [Google Scholar]
  15. Glocker, B.; Robinson, R.; Castro, D.C.; Dou, Q.; Konukoglu, E. Machine Learning with Multi-Site Imaging Data: An Empirical Study on the Impact of Scanner Effects. arXiv 2019, arXiv:1910.04597. [Google Scholar] [CrossRef]
  16. Ma, Q.; Zhang, T.; Zanetti, M.V.; Shen, H.; Satterthwaite, T.D.; Wolf, D.H.; Gur, R.E.; Fan, Y.; Hu, D.; Busatto, G.F.; et al. Classification of multi-site MR images in the presence of heterogeneity using multi-task learning. Neuroimage Clin. 2018, 19, 476–486. [Google Scholar] [CrossRef] [PubMed]
  17. Barone, A.V.M.; Haddow, B.; Germann, U.; Sennrich, R. Regularization techniques for fine-tuning in neural machine translation. arXiv 2017, arXiv:1707.09920. [Google Scholar]
  18. Kumar, A.; Raghunathan, A.; Jones, R.; Ma, T.; Liang, P. Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution. arXiv 2022, arXiv:2202.10054. [Google Scholar] [CrossRef]
  19. Lester, B.; Al-Rfou, R.; Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv 2021, arXiv:2104.08691. [Google Scholar]
  20. Li, X.L.; Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv 2021, arXiv:2101.00190. [Google Scholar]
  21. Jia, M.; Tang, L.; Chen, B.C.; Cardie, C.; Belongie, S.; Hariharan, B.; Lim, S.N. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 709–727. [Google Scholar]
  22. Hatamizadeh, A.; Nath, V.; Tang, Y.; Yang, D.; Roth, H.R.; Xu, D. Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images. In Proceedings of the Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Granada, Spain, 16 September 2018; Crimi, A., Bakas, S., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 272–284. [Google Scholar]
  23. Oreiller, V.; Andrearczyk, V.; Jreige, M.; Boughdad, S.; Elhalawani, H.; Castelli, J.; Vallières, M.; Zhu, S.; Xie, J.; Peng, Y.; et al. Head and neck tumor segmentation in PET/CT: The HECKTOR challenge. Med. Image Anal. 2022, 77, 102336. [Google Scholar] [CrossRef] [PubMed]
  24. Bertels, J.; Eelbode, T.; Berman, M.; Vandermeulen, D.; Maes, F.; Bisschops, R.; Blaschko, M.B. Optimizing the Dice Score and Jaccard Index for Medical Image Segmentation: Theory & Practice. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2019: 22nd International Conference, Shenzhen, China, 13–17 October 2019. [Google Scholar]
  25. Wilcoxon, F. Individual Comparisons by Ranking Methods. In Breakthroughs in Statistics: Methodology and Distribution; Springer: New York, NY, USA, 1992; pp. 196–202. [Google Scholar]
Figure 1. The four different scenarios of using the deployed deep learning model with the old and new medical centers’ data. In (Scenario 1), the new-center data is directly inferred through the deployed model trained on old-center data (no finetuning). In (Scenario 2), the model is fully or partially finetuned on the new-center data before being deployed for inference. In (Scenario 3), the model is retrained using both old- and new-center data before deployment. Our proposed method (Scenario 4) utilizes the data solely from the new center to finetune only the prompt while keeping the trained model frozen and then deploying it.
Figure 2. Overview of the proposed method. Learnable prompts are appended to the embedded tokens in the input space and passed through the transformer encoder but not the decoder during the fine-tuning. In deep prompt-based fine-tuning, the learnable prompts are replaced by new prompts after each transformer layer.
Figure 3. A sample of images from the dataset [23]. (a,b) depict the original CT and PET scans, respectively. (c,d) show the cropped CT and PET scans, and (e) shows the cropped ground truth mask.
Figure 4. Illustrations of the different fine-tuning methods, including partial and full fine-tuning (conventional) as well as shallow and deep prompt-based (proposed).
Table 1. Dataset origin and distribution.

Center | City, Country | PET/CT Scanner | Number of Samples
HGJ | Montreal, Canada | Discovery ST, GE Healthcare | 55
CHUS | Sherbrooke, Canada | GeminiGXL 16, Philips | 72
HMR | Montreal, Canada | Discovery STE, GE Healthcare | 18
CHUM | Montreal, Canada | Discovery STE, GE Healthcare | 56
CHUV | Vaud, Switzerland | Discovery D690 TOF, GE Healthcare | 53
CHUP | Poitiers, France | Biograph mCT 40 ToF, Siemens | 72
MDA | Texas, USA | Discovery HR, RX, ST, and STE (GE Healthcare) | 197
Table 2. Aggregated five-fold Dice scores of GTVp and GTVn using different fine-tuning techniques with UNETR and Swin-UNETR. Each cell reports the Dice score on the old centers, followed by the new center (μ ± σ).

Model | Center | None: Old / New (μ ± σ) | Partial: Old / New (μ ± σ) | Full: Old / New (μ ± σ) | Shallow Prompts: Old / New (μ ± σ) | Deep Prompts: Old / New (μ ± σ)
UNETR | CHUP | 0.7869 / 0.6708 ± 0.0529 | 0.7027 / 0.7153 ± 0.0496 | 0.7048 / 0.7298 ± 0.0579 | 0.7507 / 0.7134 ± 0.0529 | 0.7644 / 0.7198 ± 0.0565
UNETR | CHUS | 0.7665 / 0.7699 ± 0.0465 | 0.7574 / 0.7852 ± 0.0493 | 0.7574 / 0.7855 ± 0.0551 | 0.7683 / 0.7807 ± 0.0513 | 0.7688 / 0.7831 ± 0.0530
UNETR | HGJ | 0.7674 / 0.7877 ± 0.0229 | 0.7479 / 0.7949 ± 0.0255 | 0.7439 / 0.7981 ± 0.0241 | 0.7607 / 0.7938 ± 0.0246 | 0.7639 / 0.7913 ± 0.0259
UNETR | MDA | 0.7659 / 0.7224 ± 0.0231 | 0.7635 / 0.7298 ± 0.0209 | 0.7609 / 0.7533 ± 0.0245 | 0.7667 / 0.7339 ± 0.0225 | 0.7657 / 0.7379 ± 0.0243
UNETR | CHUV | 0.7704 / 0.7321 ± 0.0793 | 0.7622 / 0.7459 ± 0.0784 | 0.7665 / 0.7539 ± 0.0738 | 0.7723 / 0.7501 ± 0.0759 | 0.7724 / 0.7531 ± 0.0788
UNETR | CHUM | 0.7765 / 0.7714 ± 0.0241 | 0.7584 / 0.7721 ± 0.0288 | 0.7623 / 0.7759 ± 0.0310 | 0.7734 / 0.7775 ± 0.0266 | 0.7753 / 0.7799 ± 0.0294
UNETR | HMR | 0.7731 / 0.6712 ± 0.0489 | 0.7629 / 0.6987 ± 0.0580 | 0.7726 / 0.7099 ± 0.0496 | 0.7769 / 0.6992 ± 0.0467 | 0.7760 / 0.7080 ± 0.0542
Swin-UNETR | CHUS | 0.7584 / 0.7695 ± 0.0519 | 0.7569 / 0.7890 ± 0.0520 | 0.7541 / 0.7905 ± 0.0458 | 0.7613 / 0.7797 ± 0.0498 | - / -
Swin-UNETR | CHUM | 0.7763 / 0.7684 ± 0.0328 | 0.7642 / 0.7685 ± 0.0370 | 0.7667 / 0.7706 ± 0.0359 | 0.7719 / 0.7698 ± 0.0363 | - / -
Swin-UNETR | CHUP | 0.7835 / 0.6609 ± 0.0602 | 0.6960 / 0.7373 ± 0.0672 | 0.7026 / 0.7419 ± 0.0592 | 0.7320 / 0.7136 ± 0.0662 | - / -
Swin-UNETR | MDA | 0.7616 / 0.7291 ± 0.0257 | 0.7644 / 0.7413 ± 0.0264 | 0.7541 / 0.7522 ± 0.0265 | 0.7590 / 0.7352 ± 0.0263 | - / -
Table 4. Wilcoxon signed-rank test on whether deep prompt-based fine-tuning of UNETR performs better than the other methods.

 | Partial? | Full? | Shallow?
Is the performance of deep prompt-based fine-tuned models on old centers statistically better than | | |
Is the performance of deep prompt-based fine-tuned models on new centers statistically better than | | |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
