1. Introduction
In 1853, John Adams, a surgeon at the London Hospital, diagnosed a cirrhosis of the prostate gland as an orphan disease, the first described case of Prostate Cancer (PCa). In 2020, the disease was responsible for 7.3% of all cancer deaths in men and was the second most frequent malignancy [
1]. The prostate gland is about the size of a walnut and located in the pelvis surrounding the prostatic urethra and below the bladder. Usually, PCa originates from the peripheral zone of the prostate, adjacent to the rectum [
2]. Asymptomatic in its early stages, PCa is usually diagnosed by a Digital Rectal Examination (DRE) and a Prostate Specific Antigen (PSA) blood test.
The aggressiveness of PCa is quantified by the Gleason Grading System [
3]. Based on the glandular architecture of cells, the pathologist assigns a grade of 1 if prostate cells are uniformly packed, up to a grade of 5 depending on pattern irregularity. The most predominant pattern and the second most prevalent are identified and graded accordingly, and finally summed to obtain the Gleason Score (GS), proportional to PCa aggressiveness [
4]. In 2014, Epstein et al. [
5] proposed a new Gleason grading system after several studies showed that GS 7 = 4 + 3 had a worse prognosis than GS 7 = 3 + 4. A deeper stratification by Grade Group (GG) was then possible, including the most likely prognosis. The low/very low Risk Group (RG) is assigned to patients with a GS ≤ 6. The intermediate RG includes patients classified with a GS of 7, with both favourable (3+4) and unfavourable (4+3) prognosis. Finally, the high/very high RG includes patients with a GS ≥ 8.
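The stratification described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the GS thresholds (at most 6 for low risk, at least 8 for high risk) and the Grade Group mapping follow the commonly used five-tier convention and are assumptions here, and a GS of 7 is assumed to arise only as 3+4 or 4+3.

```python
def grade_group(primary: int, secondary: int) -> int:
    """Map the two Gleason patterns (each graded 1-5) to a Grade Group (1-5)."""
    for pattern in (primary, secondary):
        if not 1 <= pattern <= 5:
            raise ValueError("Gleason patterns range from 1 to 5")
    score = primary + secondary  # Gleason Score (GS)
    if score <= 6:
        return 1
    if score == 7:
        # Assumption: GS 7 is either 3+4 (GG 2) or 4+3 (GG 3).
        return 2 if primary == 3 else 3
    if score == 8:
        return 4
    return 5  # GS 9 or 10


def risk_group(primary: int, secondary: int) -> str:
    """Three-fold risk stratification with the favourable/unfavourable split at GS 7."""
    group = grade_group(primary, secondary)
    if group == 1:
        return "low/very low"
    if group == 2:
        return "intermediate (favourable)"
    if group == 3:
        return "intermediate (unfavourable)"
    return "high/very high"


for patterns in [(3, 3), (3, 4), (4, 3), (4, 5)]:
    print(patterns, grade_group(*patterns), risk_group(*patterns))
```

The two intermediate labels correspond to the favourable (3+4) and unfavourable (4+3) cases that this work aims to separate.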
GS plays an important role in the determination of the type of treatment to follow. A low-grade cancer with a GS 6, together with a low PSA level and a small tumor may be an indication for active surveillance only. PCa evaluation takes into account the PSA levels, GS, TNM, patient history and physical examinations in order to provide a decision baseline for proper treatment and success prediction. Current guidelines suggest External Beam Radiotherapy (EBRT) as a curative option for localised and locally advanced disease and as a palliative option for metastatic low-volume disease [
6]. The effectiveness of the treatment usually relies on the monitoring of PSA blood levels, the current reference biological test. Although a high value is associated with an increased risk of PCa, it is not PCa specific. High values of PSA are also associated with Benign Prostatic Hyperplasia (BPH), an enlarged prostate gland [
7]. With the wide availability of PSA tests and the known long-term effects of definitive therapy, overdiagnosis and overtreatment have become a major issue. PCa treatments may cause sexual dysfunction, infertility, bowel and urinary problems [
6], and so more conservative approaches such as active surveillance or watchful waiting have been adopted [
8]. These are valid options, even for intermediate risk patients with a favourable prognosis (GS = 3 + 4).
During the EBRT workflow, many imaging modalities are available. Computed Tomography (CT) provides the Hounsfield Unit (HU) values critical for dose estimations. For PCa, Magnetic Resonance Imaging (MRI), providing superior soft-tissue resolution, is used for volume delineation. Positron Emission Tomography (PET) provides metabolic insights into tumour cells and, finally, Cone Beam Computed Tomography (CBCT), acquired during EBRT sessions, is used for patient positioning and setup verifications. Quantitative analysis of medical images may have a prognostic power similar to that of phenotypes and gene protein signatures [
9]. This is the hypothesis behind radiomics (the extraction of features from radiographic images using data-characterization algorithms). As an emerging field in medicine, radiomics provides the quantification of phenotypic characteristics in medical imaging [
10]. Traditionally, image analysis and characterization of shape, texture or patterns is performed by highly trained human observers, but radiomics can provide quantitative image analysis without inter/intra observer variability.
Radiomic studies have recently triggered the interest of the research community, and for PCa are mainly focused on several predictive outcomes such as staging, grading, detection, Biochemical Recurrence (BCR) or aggressiveness. Furthermore, the most used imaging modality is MRI, which is usually performed in the initial stages and is critical for volume delineation. With the predictive and phenotypic power of radiomics, other imaging modalities besides MRI may provide valuable insights. Mendes et al. [
11] evaluated CT based radiomics to predict PCa aggressiveness with promising results. In a novel attempt, Bosetti et al. [
12] evaluated the use of CBCT radiomics to address tumour staging, GS, PSA, RG and BCR. Monitoring and classifying the outcome prognosis during treatment may help to avoid extra-invasive procedures or another MRI. This work intends to evaluate a model in borderline favourable vs. unfavourable (3+4 vs. 4+3) PCa cases, providing a tool that may trigger a more conservative approach, avoiding over-treatment and reducing the side effects of radiation exposure.
Following this introductory section is a summary of some of the work done in PCa radiomics using multiple imaging modalities. The idea is to gather information on the most-used feature selection and classification techniques.
Section 3 describes the dataset and the methods used to build the evaluated pipelines.
Section 4 presents the obtained results, highlighting the six best pipelines (from 98 evaluated). Finally,
Section 5 presents the main conclusions drawn.
2. Related Work
Radiomics, the extraction of quantitative features from medical images using data characterization algorithms, has the potential to provide more relevant information, improve decision outcomes and avoid overdiagnosis and overtreatment. A full radiomic study follows a pipeline initially proposed by Lambin et al. [
13] involving several steps, as exemplified in
Figure 1.
Many imaging modalities are of great value in screening PCa and improving diagnosis and prognosis outcomes [
14]. MRI provides superior soft-tissue contrast resolution when compared to other imaging modalities. It is the imaging modality selected by the Prostate Imaging Reporting and Data System (PIRADS). Most radiomic studies are focused on MRI with the PCa clinical significance as a model endpoint [
15,
16,
17,
18,
19,
20,
21]. The combination of T1 and T2 weighted sequences (multi-parametric Magnetic Resonance Imaging (mpMRI)) allows us to overcome the poor correlation between MRI signal intensities and tissue properties. With this in mind, several authors evaluated the use of MRI for PCa patient stratification with promising results [
16,
17,
18,
19], although the introduction of clinical outcomes such as PSA or GS introduced some issues [
20]. Abraham and Nair [
22] topped the PROSTATEx-2 2017 challenge with a quadratic-weighted kappa score of 0.2772, developing a new feature selection method for PCa aggressiveness assessment. Algohary et al. [
23] sought a model to evaluate Intensity Modulated Radiation Therapy (IMRT) PCa treatment responses using T2w and Apparent Diffusion Coefficient (ADC) maps in an attempt to personalize the PCa treatment evaluation framework.
PET, with the addition of a radio-tracer, provides insights into the pathological responses to some types of cancers and is a viable tool for diagnosing, staging and grading. For radiomic PCa studies, researchers focused on evaluating lymph node involvement, metastasis, GS and extra-capsular extension [
24]. Alongi et al. [
25] evaluated tumour heterogeneity with 18F-Cho-PET/CT radiomics and introduced a novel feature selection method (a mixed descriptive-inferential sequential approach).
CT seems to be a poor candidate for radiomic studies since it lacks metabolic manifestation and soft-tissue contrast. However, the spatial distribution provided by the CT could be used as a virtual biopsy for patient risk stratification [
26]. In a recent work, Mendes et al. [
11] evaluated the use of CT based radiomics for PCa aggressiveness assessment. With a dataset of 44 PCa patients, Mendes et al. [
11] extracted features using pyradiomics [
10] and Local Image Features Extraction (LIFEx) [
27]. Unable to find a radiomic signature for RG stratification, they used Principal Component Analysis (PCA) and evaluated several kernels to build a model with a Support Vector Machine (SVM). The best results were obtained with pyradiomics with a maximum Area Under the Receiver Operating Characteristic (AUROC) value of 0.88 for both low/very low and high/very high RG.
CBCT is used for patient positioning verification procedures before EBRT treatment and therefore is freely available. Bosetti et al. [
12] were the first to study the use of CBCT radiomics to build models predicting tumour staging, GS, PSA levels, risk category and biochemical recurrence with promising results. In this work, we intend to evaluate the use of CBCT radiomics to distinguish between favourable and unfavourable PCa cases. This is a previously unevaluated scenario that may provide a tool for monitoring EBRT treatment effectiveness.
3. Materials and Methods
This work is a retrospective piece of research that uses treatment plans available at Instituto Português de Oncologia do Porto Francisco Gentil (IPO-PORTO). All patients had an initial CT scan, required for dose estimation and volume delineations, and a CBCT of the first EBRT treatment session, used for patient positioning setup verification. The full analysis was performed on an Intel(R) Core(TM) i7-6500U CPU@2.50GHz 2.60 GHz, with 16 GB of RAM and an Nvidia GeForce 930M (2 GB DDR3).
3.1. Study Population
This study includes a subset of patients from a dataset containing CT and CBCT images from 70 patients. All studies ranged from 2019 to 2021 with curative intent, and patients were between 51 and 89 years old. CT images were acquired in a 16 slice CT scanner from General Electric, GE Optima 580 available at the IPO-PORTO for EBRT, acquired at 120 kVp, with 2.500 mm of slice thickness, automatic tube current modulation, a pixel spacing of
and 16 bits of pixel depth. For the same patients, CBCT images were acquired from the on-board imaging devices installed in NOVALIS, TRUEBEAM and TRILOGY Medical Linear Accelerators (LINACs) from Varian. Acquisition settings depend on the LINAC and are shown in
Table 1.
Clinical information such as the GS, PSA levels, TNM and also staging, when available, was collected. The dataset was grouped following a 3-fold GS risk group stratification as proposed by Epstein et al. [
5]. The selected subset included only patients classified as intermediate risk. The idea was to develop a tool capable of distinguishing favourable and unfavourable clinical outcomes and also a baseline for an EBRT effectiveness assessment. Following these criteria, this study included 22 patients with a favourable outcome (GS = 3 + 4 and GG = 2) and 24 patients with an unfavourable outcome (GS = 4 + 3 and GG = 3), as shown in
Table 2 (in bold font).
3.2. Image Registration
In EBRT, in the treatment planning stage, tumor and tissue related volumes are defined by the International Commission on Radiation Units and Measurements (ICRU). The Gross Tumor Volume (GTV) is the gross demonstrable extent and location of the tumor which may also include metastatic regional nodes and distant metastasis if they are indistinguishable from the primary tumor. The Clinical Target Volume (CTV) is a volume that contains a GTV and a margin that reflects the probability of subclinical disease occurrence. The prescribed dose must be delivered to the CTV, plus a clinically acceptable margin that may include organ motion and setup variations [
28].
The initial CT scan is essential for the EBRT planning systems to provide a baseline for volume delineation and the HU values needed for dose estimations. For PCa, an MRI or PET can also be considered in this task. All the volumes were drawn by medical experts, either manually or using semi-automatic tools from the EBRT planning system available at the institution. During the treatment, a CBCT is usually acquired for patient positioning verification, in other words, to verify that the prostate is within the CTV. This visual inspection is validated by a medical doctor. In this work, the included CBCT was the one from the first treatment session.
To transpose the structures defined in the CT to the CBCT, an elastic registration was performed using the Elastix toolbox [
29,
30] available as an extension from 3DSlicer [
31]. The fixed volume was the CT, while the moving volume was the CBCT. The extension computed the displacement field, which was then applied to the CBCT to transform the volume.
Figure 2a shows an example CT image, and
Figure 2b the registered CBCT and the manually defined CTV structure on both.
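To illustrate what applying a displacement field to the moving volume means, the toy example below warps a 2D array with nearest-neighbour sampling using NumPy. It is not the Elastix algorithm, which optimises a deformable transform and interpolates intensities; the function name, the synthetic images and the constant displacement field are all assumptions made for illustration.

```python
import numpy as np


def warp_nearest(moving, displacement):
    """Resample `moving` on the fixed grid: out[x] = moving[x + u(x)].

    `displacement` has shape (ndim, *moving.shape) and is expressed in voxels.
    """
    grid = np.indices(moving.shape)
    coords = np.rint(grid + displacement).astype(int)
    for axis, size in enumerate(moving.shape):
        # Clamp sampling positions that fall outside the moving image.
        coords[axis] = np.clip(coords[axis], 0, size - 1)
    return moving[tuple(coords)]


# Toy "CT" (fixed) with a bright block and a "CBCT" (moving) shifted by two columns.
fixed = np.zeros((8, 8))
fixed[2:5, 1:4] = 1.0
moving = np.roll(fixed, shift=2, axis=1)

# Hypothetical displacement field: every fixed voxel samples two columns to the right.
displacement = np.zeros((2, 8, 8))
displacement[1] = 2.0

registered = warp_nearest(moving, displacement)

# A mask drawn on the fixed grid (standing in for the CTV) can now index the registered volume.
ctv_mask = fixed > 0
mean_inside_ctv = registered[ctv_mask].mean()
print(np.array_equal(registered, fixed), mean_inside_ctv)
```

Because the moving volume is resampled onto the fixed grid, a structure delineated on the CT can be reused on the registered CBCT without redrawing it.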
3.3. Feature Extraction
The extraction of radiomic features from the CTV was performed using the python library pyradiomics [
10], a highly tested and maintained open-source platform and also available as an extension from 3DSlicer. Most PyRadiomics features are in compliance with the Image Biomarker Standardisation Initiative (IBSI), an independent international collaboration that aims at standardizing the extraction of image biomarkers for high-throughput quantitative image analysis (radiomics) [
32]. A total of 107 features were extracted volume-wise with the default settings: only the original images (no filter applied), minimum Region Of Interest (ROI) dimensions of 2, a pad distance of 5, a bin width of 25 and a cubic voxel size of 1 mm, since the voxel size will vary as it depends on the acquisition settings of each LINAC, as shown in
Table 1.
Table 3 shows the radiomic feature classes extracted from the dataset.
The extracted features were then saved to a tab-separated values file for later processing in a python environment.
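A minimal sketch of this extraction step is given below, assuming the pyradiomics package is installed and that the image and mask are available as files readable by it. The setting keys are PyRadiomics parameter names matching the values quoted above; the helper names, the file paths passed by the caller and the column layout of the output are hypothetical.

```python
import csv
import io

# Extraction settings quoted in the text, expressed with PyRadiomics setting keys.
SETTINGS = {
    "binWidth": 25,
    "padDistance": 5,
    "minimumROIDimensions": 2,
    "resampledPixelSpacing": [1, 1, 1],  # cubic 1 mm voxels
}


def extract_features(image_path, mask_path, settings=SETTINGS):
    """Extract features from the original image only (no filters applied)."""
    from radiomics import featureextractor  # lazy import: requires pyradiomics

    extractor = featureextractor.RadiomicsFeatureExtractor(**settings)
    extractor.disableAllImageTypes()
    extractor.enableImageTypeByName("Original")
    result = extractor.execute(image_path, mask_path)
    # Keep the feature values and drop the "diagnostics_*" entries.
    return {key: value for key, value in result.items() if key.startswith("original_")}


def rows_to_tsv(rows):
    """Serialise a list of per-patient dictionaries as tab-separated text."""
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Usage with a hand-written row standing in for one extraction result.
example = rows_to_tsv([{"case": "a", "original_firstorder_Skewness": 0.5}])
print(example)
```

In practice each call to `extract_features` would contribute one row, and the resulting text would be written to the tab-separated file mentioned above.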
3.4. Feature Selection
For high-dimensional datasets, feature selection plays a crucial role, improving performance and estimator accuracy scores [
33]. Current radiomic studies use several feature selection methods for model building. In this study, seven of the most used methods found in the literature were selected to build a classification pipeline for radiomics in CBCT.
The variance threshold provides a baseline approach by removing all zero-variance features. Univariate methods were also used by several authors such as Bernatz et al. [
18], Li et al. [
19], Bleker et al. [
21], Bourbonne et al. [
34], Cysouw et al. [
24], and Chen et al. [
35], who also used the Analysis Of Variance (ANOVA) method with good results. Another method is Recursive Feature Elimination (RFE) with an SVM estimator [
16,
36]. The goal of RFE is to select features by recursively considering smaller and smaller sets [
33]. An approach also evaluated is the introduction of cross-validation in the RFE. A different approach is to select the features that maximize an estimator’s metric (selected from the model). Wildeboer et al. [
37] considered an SVM and Abdollahi et al. [
38] a Linear Regression (LR) model. The selected feature selection methods and the parameters used during training are shown in
Table 4.
To implement these methods, the scikit-learn library was used [
33] and the chosen parameters are the default ones. Each of the feature selection methods was evaluated with different classifiers (see
Section 3.5) in multiple pipelines. The idea is not to evaluate the best feature selection method or classifier but the pipeline or pipelines that best suit this particular dataset and the task of distinguishing favourable/unfavourable prognoses for EBRT PCa patients from CBCT images. Furthermore, no parameter optimization was performed but that will be considered for future work.
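As a rough sketch, the seven feature selection methods could be instantiated with scikit-learn defaults as follows. The exact list and parameters are those of Table 4; the choices below are inferred from the text and should be read as assumptions, as are the dictionary keys and the synthetic data (46 samples, 107 features) used to make the example runnable.

```python
import numpy as np
from sklearn.feature_selection import (
    RFE,
    RFECV,
    SelectFromModel,
    SelectKBest,
    SelectPercentile,
    VarianceThreshold,
    f_classif,
)
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC, SVR, LinearSVC


def make_selectors():
    """Seven selectors with default parameters (assumed reading of Table 4)."""
    return {
        "variance": VarianceThreshold(threshold=0.0),
        "kbest_anova": SelectKBest(f_classif),
        "percentile_anova": SelectPercentile(f_classif),
        "rfe_svc": RFE(SVC(kernel="linear")),
        "rfecv_svr": RFECV(SVR(kernel="linear")),
        "model_lr": SelectFromModel(LinearRegression()),
        "model_svc": SelectFromModel(LinearSVC()),
    }


# Synthetic stand-in for the radiomic table: 22 favourable and 24 unfavourable cases.
rng = np.random.default_rng(0)
X = rng.normal(size=(46, 107))
y = np.array([0] * 22 + [1] * 24)
X[y == 1, :5] += 1.0  # make the first five features mildly informative

n_selected = {
    name: selector.fit(X, y).transform(X).shape[1]
    for name, selector in make_selectors().items()
}
print(n_selected)
```

With default parameters the univariate selectors keep a fixed number or fraction of features, whereas the model-based selectors keep the features whose coefficient magnitude exceeds the mean, so the number of retained features differs considerably between methods.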
3.5. Model Building
The number of extracted features can increase dramatically in radiomics and can even surpass the number of samples, thereby reducing effectiveness and increasing the probability of an overfitting scenario. Thus, feature selection is critical, but once features are chosen or recombined, the classifier or prediction model is ready to be developed.
The most used methods are SVM [
16,
17,
18,
20,
37,
39,
40,
41,
42,
43] and LR [
12,
15,
17,
19,
35,
36,
41,
42,
44,
45], followed by Random Forest (RF) [
17,
18,
21,
35,
38,
40,
41,
43]. Castillo T et al. [
41], Stanzione et al. [
43] also included Bayesian Network (BN) estimators while Woźnicki et al. [
17], Bleker et al. [
21] used Extreme Gradient Boosting (XGB). Finally, Abdollahi et al. [
38], Stanzione et al. [
43] used Tree Based (TB) and K-nearest Neighbors (KNN) classifiers, and the only approach using a Neural Network (NN) was from Bernatz et al. [
18]. In order to add more diversity, a bagging strategy was applied with a Support Vector Classifier (SVC) and a TB estimator (bagging fits the base classifier on random subsets and aggregates predictions [
33]).
Table 5 shows the 14 estimators evaluated.
The chosen parameters are the default ones provided by the scikit-learn library. To build the pipelines, features were first standardized (removing the mean and scaling to unit variance calculated on the training dataset). Each of the seven feature selection methods was combined with each of the 14 classifiers for a total of 98 classification pipelines. For comparison, the AUROC, accuracy scores and corresponding standard deviations were computed following a stratified (in order to preserve class distribution) 5-fold cross validation scheme and a training/validation split of 0.75/0.25.
Figure 3 summarizes the followed methodology.
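The pipeline construction and evaluation can be sketched with scikit-learn as shown below. Only two selectors and four classifiers are included to keep the example short, the data are synthetic, and the function and key names are hypothetical. The sketch uses the stratified 5-fold scheme only and does not reproduce the additional training/validation split.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.feature_selection import SelectPercentile, VarianceThreshold, f_classif
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def evaluate_pipelines(X, y, selectors, classifiers, n_splits=5, seed=0):
    """Cross-validate every selector/classifier pair and collect mean and std scores."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for s_name, selector in selectors.items():
        for c_name, classifier in classifiers.items():
            # Standardisation sits inside the pipeline, so it is fitted on training folds only.
            pipe = Pipeline([
                ("scale", StandardScaler()),
                ("select", selector),
                ("classify", classifier),
            ])
            scores = cross_validate(pipe, X, y, cv=cv, scoring=("roc_auc", "accuracy"))
            results[(s_name, c_name)] = {
                "auroc_mean": scores["test_roc_auc"].mean(),
                "auroc_std": scores["test_roc_auc"].std(),
                "accuracy_mean": scores["test_accuracy"].mean(),
                "accuracy_std": scores["test_accuracy"].std(),
            }
    return results


# Synthetic stand-in for the 46-patient, 107-feature table.
rng = np.random.default_rng(0)
X = rng.normal(size=(46, 107))
y = np.array([0] * 22 + [1] * 24)
X[y == 1, :5] += 1.0

selectors = {
    "variance": VarianceThreshold(threshold=0.0),
    "percentile": SelectPercentile(f_classif),
}
classifiers = {
    "svc": SVC(),
    "gaussian_nb": GaussianNB(),
    "adaboost": AdaBoostClassifier(),
    "bagging_svc": BaggingClassifier(SVC()),
}

results = evaluate_pipelines(X, y, selectors, classifiers)
for key, summary in results.items():
    print(key, round(summary["auroc_mean"], 2), round(summary["accuracy_mean"], 2))
```

Keeping the scaler and the selector inside the `Pipeline` object means both are refitted within each training fold, which avoids leaking information from the held-out fold into feature selection.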
4. Results and Discussion
The evaluation of the 98 pipelines was performed in a python environment, computing the AUROC and accuracy scores.
Figure 4 shows the obtained AUROC.
Figure 5 shows the obtained accuracy scores, and both present the corresponding mean standard deviations for the five fold cross-validation scheme (lighter colors mean a higher AUROC value as indicated in the color bar scale).
The pipelines were built with a very specific subset and goal - to distinguish favourable and unfavourable cases from PCa patients classified as intermediate risk. The specificity of this approach may provide EBRT with a tool capable of monitoring the true effectiveness of the treatment. An initial unfavourable outcome may become, during treatment, a favourable case, possibly leading to adjustments in the treatment workflow.
From the obtained results, a few pipelines do present good performance. In radiomic studies, there is always an issue with the reproducibility of the features. In the future, other CBCTs of each patient will be included in order to overcome this issue, as already performed by Bosetti et al. [
12].
Still, results seem to suggest that some classifiers are not suited to this particular task. The HistGradientBoosting obtained an AUROC of 0.50 for every feature selection method. From the evaluated feature selection methods, the KBest (ANOVA) and RFECV (SVR) provided poor performances. On the other hand, the percentile feature selection method seems to have an overall decent performance, obtaining its best results when combined with an AdaBoost classifier with an AUROC of and its worst when combined with an LR classifier with an AUROC of .
Considering a threshold value of AUROC of 0.79, six pipelines were capable of obtaining good performance. Selecting features from an LR model and combining it with a Bagging (SVC) classifier or GaussianNB provided a pipeline with an AUROC value of
and
, respectively. Furthermore, the univariate selection method combined with an SVC obtained an AUROC of
and, with an ExtraTrees, of
. The Adaboost classifier combined with the baseline feature selection method, the variance threshold, provided an AUROC of
, and when combined with the percentile feature selection method, an AUROC of
.
Table 6 highlights the results obtained by the selected pipelines.
The presented values are the mean values obtained for the five folds in the cross-validation. For these six pipelines, a deeper analysis was performed in order to evaluate the performance of each.
Figure 6 presents the AUROC curves and corresponding values for each fold as well as the mean value. Furthermore, in grey, are the
standard deviation curves and, in dashed red, the 0.5 AUROC line.
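Per-fold ROC curves with a mean curve and a standard deviation band, of the kind described above, can be computed by interpolating each fold's curve onto a common false-positive-rate grid. The sketch below assumes a classifier exposing `decision_function`, uses synthetic data, and the function name and band definition (one standard deviation around the mean) are assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def fold_roc_curves(model, X, y, n_splits=5, seed=0, n_points=101):
    """Return the FPR grid, mean TPR, TPR standard deviation and per-fold AUROC."""
    grid = np.linspace(0.0, 1.0, n_points)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    tprs, aucs = [], []
    for train, test in cv.split(X, y):
        fitted = clone(model).fit(X[train], y[train])
        scores = fitted.decision_function(X[test])
        fpr, tpr, _ = roc_curve(y[test], scores)
        interpolated = np.interp(grid, fpr, tpr)
        interpolated[0] = 0.0  # every ROC curve starts at the origin
        tprs.append(interpolated)
        aucs.append(auc(fpr, tpr))
    tprs = np.asarray(tprs)
    mean_tpr = tprs.mean(axis=0)
    mean_tpr[-1] = 1.0  # every ROC curve ends at (1, 1)
    return grid, mean_tpr, tprs.std(axis=0), np.asarray(aucs)


rng = np.random.default_rng(0)
X = rng.normal(size=(46, 107))
y = np.array([0] * 22 + [1] * 24)
X[y == 1, :5] += 1.0

model = make_pipeline(StandardScaler(), SVC())
grid, mean_tpr, std_tpr, aucs = fold_roc_curves(model, X, y)

# Band limits that could be drawn around the mean curve.
upper = np.minimum(mean_tpr + std_tpr, 1.0)
lower = np.maximum(mean_tpr - std_tpr, 0.0)
print(aucs.round(2), round(auc(grid, mean_tpr), 2))
```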
In binary classification models, other metrics may provide valuable information on the performance of a classifier. The precision is defined as the ratio of the number of true positives to the sum of true positives and false positives. It is also referred to as the positive predictive power, that is, the ability of the model to predict true positives, which, in this work, are the unfavourable cases (GG = 3). The recall (sensitivity) is calculated as the ratio of the number of true positives to the sum of true positives and false negatives. The f1-score is the harmonic mean of the precision and recall, while the support is the number of occurrences of each class in the ground truth labels [
33].
Table 7 presents these parameters for the selected pipelines and the mean number of features selected in each fold in the cross-validation.
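The definitions above translate directly into code. The following pure-Python sketch computes the four quantities for one class; the function name and the toy labels (2 for favourable, 3 for unfavourable) are illustrative only and do not correspond to the study data.

```python
def classification_summary(y_true, y_pred, positive):
    """Precision, recall, f1-score and support for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = sum(t == positive for t in y_true)
    return {"precision": precision, "recall": recall, "f1": f1, "support": support}


# Toy ground truth and predictions using the Grade Group as class label.
y_true = [3, 3, 3, 2, 2]
y_pred = [3, 3, 2, 3, 2]

unfavourable = classification_summary(y_true, y_pred, positive=3)
favourable = classification_summary(y_true, y_pred, positive=2)
print(unfavourable)
print(favourable)
```

Reporting the metrics per class, as in this sketch, exposes cases where a pipeline performs well on one class only.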
The threshold = 0, in the variance threshold method, is not enough to perform any feature selection because the preprocessing standardization removes the mean and scales features to unit variance. The AdaBoost classifier fits additional copies of the classifier on the same dataset and re-adjusts the instance weights accordingly. This behaviour may overcome the lack of feature selection when both are combined. Except for the percentile-AdaBoost combination, most pipelines present a low recall value for Class 2 (favourable), returning very few results. The model (LR)-bagging (SVC) pipeline has a high precision for Class 3 (unfavourable), suggesting those few results are well classified. This combination, however, has the opposite behaviour for Class 3. A high recall returns many results, but a low precision means those results are poorly classified.
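The interaction between standardization and a zero variance threshold can be checked with a small experiment. The example below uses synthetic features with very different raw variances; the variable names and values are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

# Three synthetic features whose raw variances differ by several orders of magnitude.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=[0.01, 1.0, 100.0], size=(46, 3))

X_scaled = StandardScaler().fit_transform(X)
scaled_variances = X_scaled.var(axis=0)

# After standardisation every non-constant feature has unit variance,
# so a threshold of 0 cannot remove any of them.
selector = VarianceThreshold(threshold=0.0)
X_selected = selector.fit_transform(X_scaled)

print(X.var(axis=0).round(4))
print(scaled_variances.round(4))
print(X_selected.shape)
```

Under this ordering of steps the variance threshold acts as a pass-through, so a pipeline using it effectively hands the full standardized feature set to the classifier.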
The bagging (SVC) classifier aggregates the predictions of several SVCs, reducing the variance of the final output and improving accuracy. When using an SVC classifier, we obtained an accuracy of
, but with a bagging strategy of
. The GaussianNB classifier assumes a Gaussian likelihood with a naive assumption of pair-wise features’ conditional independence. By reducing the number of features to 43, we are also eliminating highly correlated features, a factor that degrades the classifier performance. The decision tree creates a piecewise constant approximation to the decision curve, but it is known to overfit when the entire feature set is used. The ExtraTrees classifier introduces randomized decision trees on several subsets, allowing control of the overfitting and increasing accuracy [
33].
With the percentile feature selection method and an AdaBoost classifier, the obtained results seem satisfactory, with relatively high values of precision and recall for both classes. The main advantage of this model is that the number of features used for classification is reduced to 11. This may be a step further in providing explainability and interpretability. In fact, the features selected in each fold of the cross-validation vary, although some are frequently selected.
Figure 7 shows the features that were more frequently selected.
All evaluated feature selection methods select the skewness. Complexity features are selected using a univariate or percentile approach, while the zone variance, busyness and surface to volume ratio are selected when learning from an LR model. The percentile-AdaBoost pipeline used the complexity and skewness features in the cross-validation. Increasing the number of folds in the cross-validation scheme may provide more valuable insights.
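One way to obtain such selection frequencies is to refit the selector on each training fold and count how often each feature is retained. The sketch below does this for the percentile method on synthetic data; the function name, the feature names and the single strongly informative feature are hypothetical.

```python
from collections import Counter

import numpy as np
from sklearn.base import clone
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def selection_frequency(X, y, feature_names, selector, n_splits=5, seed=0):
    """Count in how many cross-validation folds each feature is selected."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    names = np.asarray(feature_names)
    counts = Counter()
    for train, _ in cv.split(X, y):
        pipe = make_pipeline(StandardScaler(), clone(selector)).fit(X[train], y[train])
        mask = pipe.steps[-1][1].get_support()
        counts.update(names[mask].tolist())
    return counts


rng = np.random.default_rng(0)
X = rng.normal(size=(46, 107))
y = np.array([0] * 22 + [1] * 24)
X[y == 1, 0] += 3.0  # one strongly informative synthetic feature

feature_names = ["informative"] + [f"noise_{i}" for i in range(1, 107)]
counts = selection_frequency(X, y, feature_names, SelectPercentile(f_classif))
print(counts.most_common(5))
```

Features selected in every fold are the natural candidates for a closer look, while features that appear in a single fold are more likely to reflect sampling noise.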
5. Conclusions
For PCa, EBRT is a curative option for localized and locally advanced disease and a palliative option for metastatic low-volume disease [
6]. Currently, the only triggers for recurrence or treatment effectiveness monitoring are the PSA blood test or redoing a Transrectal Ultrasound Guided Biopsy (TRUS). With radiomics, a quantitative analysis of medical images may have similar prognostic power to phenotypes and gene protein signatures [
9]. Interest in radiomics has been increasing and for PCa, it has been mainly focused on MRI in the initial staging and grading. However, during the EBRT, CBCTs are freely available, as they are used for patient positioning and setup verifications.
The value of CBCT-based radiomics was evaluated to distinguish between favourable and unfavourable prognosis for patients initially classified as intermediate risk. Such a tool may provide added value to monitor and trigger possible changes in EBRT outcomes. Following the current methods of feature selection and classification for PCa radiomics, 98 pipelines were evaluated. The results seem to suggest that selecting features from an LR model, combined with a bagging (SVC) classifier, provided good performance. Although using 43 features, it lacks the potential to offer explainability. In this sense, a better approach seems to be using a percentile feature selection method and an AdaBoost classifier. This pipeline presents an AUROC of and an accuracy of and high values of precision and recall, being the most balanced of the evaluated pipelines.
The skewness seems to be the most frequently selected feature, considering each fold in the cross-validation scheme. Although its true importance is yet to be evaluated, the fact that it was selected in every fold of every feature selection method suggests it may provide some insights. Furthermore, the fact that only one CBCT was considered for each patient did not allow the evaluation of feature reproducibility.
Although the obtained results are promising, some improvements need to be made for a deeper evaluation. A grid-search cross-validation may provide fine-tuning of parameters and improved results. Furthermore, the used subset is quite small, and may benefit in the future from the inclusion of high-risk patients and the assessment of other models.
Despite these limitations, CBCT-based radiomics may provide a baseline for an EBRT effectiveness assessment framework during ongoing treatment, improving outcomes and lowering recurrence rates.