Article

Deep Reinforcement Learning Method for 3D-CT Nasopharyngeal Cancer Localization with Prior Knowledge

Guanghui Han, Yuhao Kong, Huixin Wu and Haojiang Li
1 School of Information Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450046, China
2 School of Biomedical Engineering, Sun Yat-sen University, Shenzhen 518107, China
3 State Key Laboratory of Oncology in South China, Sun Yat-sen University Cancer Center, Guangzhou 510060, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2023, 13(14), 7999; https://doi.org/10.3390/app13147999
Submission received: 28 May 2023 / Revised: 4 July 2023 / Accepted: 6 July 2023 / Published: 8 July 2023

Abstract

Fast and accurate lesion localization is an important step in medical image analysis. Current supervised deep learning methods have obvious limitations in radiology applications because they require large numbers of manually annotated images. In response to these issues, we introduce a deep reinforcement learning (DRL)-based method to locate nasopharyngeal carcinoma lesions in 3D-CT scans. The proposed method uses prior knowledge to guide the agent, reasonably reducing the search space and promoting the convergence rate of the model. Furthermore, a multi-scale processing technique is used to promote the localization of small objects. We trained the proposed model with 3D-CT scans of 50 patients and evaluated it with 3D-CT scans of 30 patients. The experimental results showed that the proposed model is robust and that its average localization error is more than 1 mm lower than that of DQN models in recent studies, despite using a smaller dataset. The proposed model can effectively locate the lesion area of nasopharyngeal carcinoma in 3D-CT scans.

1. Introduction

Artificial intelligence has developed rapidly in recent years. Applying AI's sensory cognition and deep learning technology to medical imaging, so as to improve the accuracy and efficiency of radiologists' diagnoses and reduce the misdiagnosis rate, is an important development direction for the medical imaging industry. In medical image analysis, effective localization of the lesion area is a key step for subsequent diagnosis, treatment, and other processes. This requires a robust and accurate method to locate feature points. Traditional supervised deep learning (SDL) has several drawbacks in this regard: it requires a large amount of manually annotated data, which is costly to obtain, and it is very sensitive to subtle variations in image quality and lacks robustness. Goodfellow et al. showed that adding only a small amount of imperceptible image noise can cause deviations in SDL predictions [1]. Finally, traditional supervised learning algorithms have poor transparency, and their principles are difficult for doctors to understand [2].
To address the aforementioned shortcomings, reinforcement learning (RL) offers some solutions. First, since the RL learning environment has a richer structure, an RL model can achieve good results on fewer annotated images [3]. Second, the learning goal in RL is provided through a reward system, which makes training more stable. Finally, the reward structure provides the rationale for why the algorithm makes certain predictions, and it gives imaging experts an opportunity to exploit their domain knowledge to help craft algorithms [3]. Nasopharyngeal carcinoma is a malignant tumor with a high incidence in China. Studies [4] have shown that cases in China account for 38.29% and 40.14% of global nasopharyngeal carcinoma incidence and mortality, respectively, with a 5-year relative survival rate of 43.8%. Li et al. [5] investigated a new potential mechanism for the increased risk of distant metastasis in nasopharyngeal carcinoma patients in order to achieve more accurate personalized management of nasopharyngeal carcinoma cases. Therefore, for the problems faced by medical image analysis and nasopharyngeal carcinoma analysis, reinforcement learning is a promising research direction.
Related work: Criminisi et al. [6] proposed a regression-forest-based method for detecting anatomical landmarks in whole-body CT scans. Although this method is fast, its accuracy is low when dealing with large organs. Gauriau et al. [7] extended the work of Criminisi et al. [6] by adding statistical shape priors obtained through cascaded regression from segmentation masks. Ma et al. [8] proposed a novel approach to automatically align the range image of the patient with preoperative CT scans, leveraging the contextual information of medical images to resolve data ambiguities and improve robustness.
In order to address the limitations of previous detection work, autonomous feature detection and localization based on reinforcement learning have received increasing attention in recent years. Jie et al. [9] showed that DRL can be used for object detection and localization. Maicas et al. [10] proposed a method for detecting breast lesions in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI). Dou et al. [11] used a DRL agent to locate the standard transthalamic (TT) and transcerebellar (TC) planes in three-dimensional ultrasound volumes of the fetal head; they exploited prior information about the standard planes from an atlas and provided the agent with an effective warm starting point. Ghesu et al. [12] proposed a learning method that follows a different paradigm by modeling both appearance and the parameter search strategy as a behavioral task for an artificial agent. This method combines the advantages of behavioral learning, achieved through reinforcement learning, and effective hierarchical feature extraction, achieved through deep learning. Experiments showed that, given a set of annotated images, the agent can learn the optimal path to converge to the marked position. To address artifacts in 3D fetal MRI, Zhang et al. [13] deployed 15 DRL-based agents to detect 15 feature landmarks simultaneously and set additional rewards based on the distance between the agents and the fetal body nodes to improve detection accuracy. Pesce et al. [14] studied how to locate lung lesions on chest radiographs; in one of their methods, reinforcement learning is used to train a recurrent attention model with annotation feedback (RAMAF) that observes short sequences of image patches. Zhang et al. [15] proposed a deep reinforcement learning method for tracking and locating blood vessels in 3D images. Instead of an exhaustive search, they used artificial agents that interact with their surroundings and collect rewards from the interactions. Stember et al. [3] trained a DQN on 70 T1-weighted two-dimensional image slices to predict the location of brain tumors. Their study showed that supervised deep learning severely overfits the training set and has low accuracy, while reinforcement learning can make meaningful predictions from very small datasets. Building on this work, Stember et al. [16] trained a Deep Q-network on 30 2D image slices from the BraTS brain tumor database, which showed good accuracy (70%) on the testing set compared with a supervised deep learning network (11%). Navarro et al. [17] proposed a DRL method for organ localization in CT as an alternative to exhaustive or region-proposal search strategies, which require large amounts of annotated data.
Another research direction is the detection of lesion areas and landmarks. Jain et al. [18] proposed a multi-landmark detection method to improve time efficiency and enhance robustness to missing data. Liu et al. [19] proposed using DRL models for lung cancer detection and tested multiple deep reinforcement learning models, such as DQN; combining DRL methods for the early detection and diagnosis of lung cancer can significantly improve treatment effectiveness. Ali et al. [20] studied a reinforcement learning model based on deep artificial neural networks for the early detection of lung nodules in chest CT scans. Alansary et al. [21] proposed a fully automatic method for identifying standardized view planes in 3D image acquisition and evaluated it experimentally on isotropic brain and cardiac MRI images. Al and Yun [22] followed the formulation of Ghesu et al. [23] and learned an RL agent for landmark localization in 3D medical images; in addition, to speed up learning and achieve more robust localization, multiple partial policies are learned on different sub-action spaces instead of a single complex policy on the original action space. Ghesu et al. [24] reformulated the detection problem as a behavioral learning task for an artificial agent, using the capabilities of DRL and scale-space theory to learn the optimal search strategy for finding anatomical structures based on multi-scale image information for 3D landmark detection. They conducted experimental evaluations on 3D-CT scans, and the results showed good accuracy and robustness. Leroy et al. [25] proposed a novel communicative multi-agent reinforcement learning (C-MARL) system to automatically detect landmarks in medical scans. Their C-MARL architecture takes as input a tensor of size (number of agents) × 4 × 45 × 45 × 45 and consists of four 3D convolutional and three 3D max-pooling layers, followed by four fully connected (FC) layers; each agent has its own FC layers, but the convolutional layers are shared among all agents. They evaluated the approach on two brain imaging datasets from adult magnetic resonance imaging (MRI) and fetal ultrasound scans.
Contributions: (1) This article proposes a deep reinforcement learning method that can effectively locate feature points of nasopharyngeal carcinoma lesions, thereby achieving detection of the nasopharyngeal carcinoma lesion area. (2) We performed preprocessing operations, such as organizing and resampling the patient data, to make it suitable for experimental training. (3) In the experiments, we enriched the agent's search mechanism, adopted a multi-scale search strategy, designed an initial position for the agent, and simplified the neural network. Moreover, we compared and evaluated our model against other relevant DQN models; the experimental results show that our model has good accuracy and robustness on a limited dataset, with an average localization error of approximately 3.5 mm.

2. Background

RL is inspired by behavioral psychology and neuroscience. An RL agent learns a policy by interacting with the environment $e$. In each state $s$, the agent makes a single decision from a set of discrete actions and selects an action $a$. Each valid action selection generates a related scalar reward, defined as the reward signal $r$. The main goal of the agent is to learn a policy $\pi$ that maximizes not only immediate rewards but also cumulative future rewards. Because of the similarities in the interactions between agents and environments across tasks, reinforcement learning can be considered a universal learning framework that may help solve problems of general artificial intelligence.
Q-Learning: The policy is obtained through the action-value function $Q(s,a)$, defined as the sum of expected discounted future rewards $\mathbb{E}[r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} \mid s, a]$; this function can be expanded recursively through the Bellman equation [26] and solved iteratively:
$$Q_{i+1}(s,a) = \mathbb{E}\big[\, r + \gamma \max_{a'} Q_i(s', a') \,\big] \quad (1)$$
where $\gamma \in (0,1)$ is a discount factor that weights future rewards accordingly and represents the uncertainty in the agent's environment; $a'$ and $s'$ denote the next action to be taken and the state reached after taking that action, respectively. The optimal policy $\pi^*$ of the agent is obtained from the optimal action-value function:
$$Q^*(s,a) = \max_{\pi} \mathbb{E}\big[\, R_t \mid s_t = s, a_t = a, \pi \,\big], \qquad \pi^*(s) = \operatorname*{argmax}_{a \in A} Q^*(s,a) \quad (2)$$
Deep Q-Learning: Mnih et al. [27] proposed the Deep Q-Network (DQN). It introduces a target network $Q'(s, a; \theta^-)$ alongside the Q-network $Q(s, a; \theta)$. The target network has the same structure and the same initial weights as the Q-network. Every $c$ training iterations, the parameters of the Q-network are copied to the target network $Q'$, thereby reducing overestimation of the Q-value. The loss function of the DQN is defined as:
$$L(\theta) = \mathbb{E}_{s,a,r,s'}\Big[\big( r + \gamma \max_{a'} Q_{target}(s', a'; \theta^-) - Q_{net}(s, a; \theta) \big)^2\Big] \quad (3)$$
where $\gamma$ is the discount factor determining the agent's horizon, $\theta$ are the parameters of the Q-network, and $\theta^-$ are the parameters of the target network.
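For concreteness, the following is a minimal PyTorch sketch of this loss computation. It is an illustration only, not the training code of this work; the batch layout, the `done` mask, and the variable names are assumptions.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.9):
    # batch: tensors of states (B, 4, 45, 45, 45), actions (B,), rewards (B,),
    # next_states (B, 4, 45, 45, 45) and done flags (B,) -- assumed layout
    states, actions, rewards, next_states, dones = batch

    # Q_net(s, a; theta) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # r + gamma * max_a' Q_target(s', a'; theta^-), target network held fixed
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * next_q * (1.0 - dones)

    # squared Bellman error of Equation (3)
    return F.mse_loss(q_sa, target)
```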
In recent years, researchers have gradually applied DRL algorithms to medical image analysis. Ghesu's team [12,24,28] applied DRL to the anatomical landmark detection problem in medical images, mainly addressing the difficulties of multi-scale and incomplete data. Alansary's team [21,25,29,30] used DRL methods for anatomical landmark detection and automatic view planning and introduced the multi-agent approach into their work. Other researchers [10,31] have applied DRL to further medical image analysis problems, such as the localization of human organs (e.g., the breast and pancreas). Building on this existing work, we tried to use DRL to solve the problem of cancer localization in nasopharyngeal carcinoma imaging. However, cancer localization differs from anatomical landmark detection in two respects: (1) there is no multi-target localization problem, so directly using the multi-agent method is not appropriate; and (2) the size variation of the cancer area is greater than that of a single anatomical landmark. To explore a cancer localization approach suitable for the nasopharyngeal carcinoma 3D-CT scenario, we improve the existing methods in this paper.

3. Method

The framework of reinforcement learning consists of many basic elements, which are defined in this work as follows:
State: The environment e is a three-dimensional CT scan, where each state is defined as a region of interest (ROI) centered around the agent. In order to improve the robustness of the experiment, the latest 4 frame states are stored in the frame history.
A DQN is difficult to converge in the early stage of training because the agent usually relies on trial and error within a large exploration space. In this paper, we introduce prior knowledge to guide the agent, reasonably reducing the search space and promoting the convergence rate of the model. Specifically, instead of setting the starting point randomly, we set it based on the average relative location of the nasopharyngeal carcinoma lesion. The average relative position of the target lesion in the CT scan volumes was calculated from the training set (i.e., 50 CT scans) and used as the agent's starting position for searching in both training and inference.
The starting search point for the agent at the beginning of each episode was calculated on the training set. For $N$ CT scans $I_1, I_2, I_3, \dots, I_N$, we define $r(d)$ as:
$$r(d) = \frac{1}{N} \sum_{k=1}^{N} \frac{gt(I_k)_d}{size(I_k)_d} \quad (4)$$
where $size(I_k)_d$ denotes the size (i.e., the number of voxels) of the $k$-th CT scan in dimension $d$, and $gt(I_k)_d$ denotes the coordinate of the ground-truth feature point in dimension $d$ of the $k$-th CT scan. The starting position $p_0$ of the agent is:
$$p_0 = r \odot size[I] \quad (5)$$
In Formula (5), $\odot$ denotes element-wise multiplication for each dimension. Compared with randomly generated starting points, this method reduces the error while improving training robustness.
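As an illustration of Formulas (4) and (5), the short NumPy sketch below shows how such a prior starting point could be computed; the array names and the rounding to integer voxel indices are assumptions, not the exact preprocessing code of this work.

```python
import numpy as np

def average_relative_position(gt_coords, volume_sizes):
    """Formula (4): mean ratio of ground-truth coordinate to volume size
    per dimension over the N training scans; inputs are (N, 3) arrays."""
    return np.mean(np.asarray(gt_coords, dtype=float) /
                   np.asarray(volume_sizes, dtype=float), axis=0)

def starting_position(r, volume_size):
    """Formula (5): element-wise product of r and the scan size,
    rounded to the nearest voxel index."""
    return np.round(r * np.asarray(volume_size, dtype=float)).astype(int)

# Hypothetical usage on a resampled scan of 539 x 539 x 327 voxels:
# r = average_relative_position(train_gt, train_sizes)
# p0 = starting_position(r, (539, 539, 327))
```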
Actions: The agent interacts with the environment through actions $a \in A$, where the action space $A$ consists of six actions in the Cartesian coordinate system, namely $\{\pm x, \pm y, \pm z\}$: $+$ means moving one unit along the positive direction of the corresponding axis at the current scale, and $-$ means moving one unit along the negative direction. We adopted a multi-scale search strategy with three scale levels: 3 mm, 2 mm, and 1 mm. Initially, the agent searches at the maximum scale. As the search progresses and the agent approaches the target, the search step size and ROI size are reduced until the agent reaches the termination state at the minimum scale.
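A minimal sketch of this six-action space and the scaled step is given below, assuming the step length simply equals the current scale; the exact step handling in the implementation may differ.

```python
# Six actions {+x, -x, +y, -y, +z, -z}; the step length is the current scale.
ACTIONS = [(+1, 0, 0), (-1, 0, 0),
           (0, +1, 0), (0, -1, 0),
           (0, 0, +1), (0, 0, -1)]
SCALES = [3, 2, 1]  # coarse-to-fine search scales after resampling to 1 mm voxels

def apply_action(position, action_idx, scale):
    """Move one (scaled) unit along the chosen axis."""
    dx, dy, dz = ACTIONS[action_idx]
    x, y, z = position
    return (x + dx * scale, y + dy * scale, z + dz * scale)
```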
Reward: The reward function $R_D$ drives the agent's actions; in this work, we define it as the difference between the Euclidean distances of two consecutive states to the target point:
$$R_D = D(p_{i-1}, p_t) - D(p_i, p_t) \quad (6)$$
$$R = \begin{cases} +1, & R_D > 0 \ \text{(the agent approaches the target)} \\ -1, & R_D < 0 \ \text{(the agent moves away from the target)} \end{cases} \quad (7)$$
Here, $D$ is the Euclidean distance between two points, $p_i$ is the location of the agent at step $i$, and $p_t$ is the ground-truth position of the target. According to the reward function, if the agent moves away from the target, the system gives a reward signal of −1 as a punishment; otherwise, it gives a positive reward signal.
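The reward of Formulas (6) and (7) can be written directly as a small function; treating a zero distance change as a penalty is an assumption here, since the tie case is not specified above.

```python
import math

def step_reward(prev_pos, curr_pos, target_pos):
    """+1 if the move reduced the Euclidean distance to the target, -1 otherwise."""
    r_d = math.dist(prev_pos, target_pos) - math.dist(curr_pos, target_pos)
    return 1.0 if r_d > 0 else -1.0
```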
Terminal Condition: In this experiment, the target feature point was set as the physical center point of the ground truth of the lesion area. In the training phase, the agent terminates the search when it oscillates at the minimum scale or reaches the maximum number of steps (200 in our experiments). If the agent oscillates at a larger scale, it proceeds to the next smaller scale level and continues the search. Oscillation at the minimum scale usually occurs near the target point, so its occurrence typically means that the agent has found the target, i.e., it has learned an appropriate policy for locating the cancer area and can find the lesion target on unlabeled data. In the testing phase, the agent also has two termination modes: either position oscillation occurs or a preset maximum number of iterations is reached. However, in the testing phase, the agent does not use the experts' annotated information, which is only used to evaluate the inference results.
Network Architecture: Our network structure is based on that proposed by Leroy et al. [25], with corresponding modifications. The network in this work consists of four 3D convolutional layers, three max-pooling layers, and three fully connected layers. The output size of the last fully connected layer equals the size of the action space. Compared with the original network, the modified network is more lightweight, which accelerates the convergence of model training and improves training effectiveness and robustness in our task scenario.
Figure 1 shows our network structure. The input is a 4 × 45 × 45 × 45 tensor, and the output consists of 6 Q-values (one per agent action); the agent then selects the action with the highest Q-value to search for the feature point.
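The PyTorch sketch below mirrors the described topology (four 3D convolutions, three max-pooling layers, three fully connected layers, a 4 × 45 × 45 × 45 input, and six Q-values as output). Channel widths, kernel sizes, and hidden-layer sizes are assumptions, not the exact values used in this work.

```python
import torch
import torch.nn as nn

class QNetwork3D(nn.Module):
    def __init__(self, frame_history=4, num_actions=6):
        super().__init__()
        # Four 3D convolutions interleaved with three max-pooling layers
        self.features = nn.Sequential(
            nn.Conv3d(frame_history, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                                  # 45 -> 22
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                                  # 22 -> 11
            nn.Conv3d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                                  # 11 -> 5
            nn.Conv3d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Three fully connected layers; the last outputs one Q-value per action
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 5 * 5 * 5, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, x):  # x: (batch, 4, 45, 45, 45)
        return self.head(self.features(x))

# q_values = QNetwork3D()(torch.zeros(1, 4, 45, 45, 45))  # shape (1, 6)
```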
Experience Replay: Because the input data consist of consecutive frames, there is a significant correlation between data samples, which is not conducive to the learning of neural networks. Therefore, experience replay is introduced, which stores past transitions $(s, a, r, s')$. We used the method of reference [32] to sample experienced transitions from the replay memory for learning, which improves data utilization. More importantly, this alleviates the problems of correlated data and non-stationary distributions of the experience, making training more stable.
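A replay buffer in this spirit is sketched below; the capacity and the uniform sampling are simplifying assumptions for illustration (reference [32] describes prioritized sampling).

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s', done) transitions and samples mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # sampling across episodes breaks the correlation between consecutive frames
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```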

4. Experiments

Dataset: The CT scan dataset of nasopharyngeal carcinoma in the experiment was obtained from the Sun Yat-sen University Cancer Center (SYSUCC), which covered the patient’s area from the chest to the head. These CT scan volumes came from the CT localization examination stage of patients with nasopharyngeal carcinoma before further radiotherapy. Such a CT localization examination is a routine procedure in nasopharyngeal cancer radiotherapy, without additional risk and burden to patients. In addition, this study was retrospective in nature, and the original patient data were anonymized and not disclosed. This study is part of another large research project [5] that received ethical approval from the Institutional Review Board of SYSUCC (approval number: B2019-222-01).
The original image dimensions of each CT scan (patient) were 512 × 512 × Z, with Z ∈ (87, 132); Z denotes the number of slices along the superior–inferior direction and differs between CT scans. The voxel spacing is anisotropic. In the preprocessing stage, we mainly used SimpleITK (https://simpleitk.org/, accessed on 1 April 2023), a library commonly used to handle common medical image formats, to process the images. We resampled the experimental images to achieve isotropy. After processing, the voxel spacing and image dimensions are changed, as shown in Table 1. Figure 2 shows the effect of the processed image. The software we used for visualizing the data is ITK-Snap [33]. The dataset contains 80 CT scans from 80 patients, i.e., one CT scan per patient. To demonstrate that reinforcement learning can learn from limited data, we trained with 50 CT scans (from 50 patients), and the remaining 30 CT scans (from the other 30 patients) were used for testing. For the test set, the labels were used only for algorithm evaluation, not for the inference procedure.
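As an example of the resampling step, the SimpleITK call below converts a scan to isotropic 1 × 1 × 1 mm spacing; the linear interpolator, the function wrapper, and the file name are assumptions rather than the exact preprocessing script of this study.

```python
import SimpleITK as sitk

def resample_to_isotropic(image, new_spacing=(1.0, 1.0, 1.0)):
    """Resample a CT volume to isotropic voxel spacing."""
    old_spacing = image.GetSpacing()
    old_size = image.GetSize()
    # keep the same physical extent: new_size = old_size * old_spacing / new_spacing
    new_size = [int(round(sz * sp / nsp))
                for sz, sp, nsp in zip(old_size, old_spacing, new_spacing)]
    return sitk.Resample(image, new_size, sitk.Transform(), sitk.sitkLinear,
                         image.GetOrigin(), new_spacing, image.GetDirection(),
                         0, image.GetPixelID())

# iso = resample_to_isotropic(sitk.ReadImage("patient_001_ct.nii.gz"))
```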
Train: To demonstrate the effectiveness of this work, we conducted experiments on three-dimensional nasopharyngeal carcinoma CT scans and analyzed the experimental results. The experimental process is shown in Figure 3.
The red dots in Figure 3 represent the target feature points, the blue dots represent the agent position, the yellow box represents the search scale size, and the size of the red area around the feature points represents the distance between the agent and the feature points in the z-axis direction. This figure shows the state of the agent search during the training process. From Figure 3a–c, it can be seen that the agent is searching at the maximum scale. From Figure 3d–f, it can be seen that after oscillation occurs (during approaching the target), the agent starts to reduce the search step scale until the minimum scale, and at this point, the agent and feature point positions basically coincide.
The actions of the agent during training follow an ε-greedy strategy, with ε gradually decreasing from 1 to 0.1, which means that the agent gradually shifts from exploration to exploitation as training proceeds. The agent starts the search from a calculated starting point (explained in Section 3) and samples a 45 × 45 × 45 voxel ROI around that point. The episode ends when the agent oscillates around the target at the minimum scale or reaches a predefined maximum number of steps during the search procedure. The baseline model was based on the model of Leroy et al. [25]; we transformed this multi-agent method into a single-agent method for our nasopharyngeal carcinoma dataset and took it as the baseline model for this work. Our improvement comparisons are based on the baseline model results. As described in Section 3, we optimized the experimental model for our task scenario.
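The ε-greedy choice during training can be sketched as follows; the per-episode linear decay mirrors the schedule described later in this section (ε decaying from 1 to 0.1 with step δ), while the random-number handling is an assumption.

```python
import random
import torch

def decayed_epsilon(episode, eps_start=1.0, eps_min=0.1, delta=1e-4):
    """Linear decay from exploration to exploitation: eps = eps_start - delta * episode."""
    return max(eps_min, eps_start - delta * episode)

def select_action(q_net, state, epsilon, num_actions=6):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())
```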
Figure 4 shows a comparison of the two models. The horizontal axis represents the training time (i.e., episodes), and the vertical axis represents the mean error distance between the agent and the target feature point. The models were trained on the same dataset, and it can be seen that our model has a lower error, converging at approximately 5500–6000 episodes to reach the minimum error distance. Moreover, its training time was one-third shorter than that of the baseline model. Meanwhile, Figure 5 compares the loss of the two models during training. Although the loss of our model fluctuates at the beginning of training, it stabilizes overall and is better than that of the original model, indicating that our model has better robustness.
Test: The test set came from different patients than the training set. During testing, the agent uses a greedy strategy with ε = 0, which means that it performs exploitation only, without exploration. We used the oscillation property to terminate the search procedure. Specifically, during inference, the agent follows the previously learned optimal policy to search the space until oscillation occurs (i.e., it constantly alternates between two adjacent states or stays in the same state). Following Alansary et al. [29], whose research showed that Q-values are lower the closer the agent is to the target point and higher the farther it is from the target, we chose a state with a relatively low Q-value as the termination state. If oscillation does not occur, the search procedure is terminated when the agent reaches the maximum number of steps.
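One simple way to implement this termination cue is sketched below: the agent is considered to oscillate when its recent positions alternate between at most two states, and the visited state with the lowest Q-value is returned as the prediction. The window length and the exact selection rule are illustrative assumptions.

```python
def is_oscillating(position_history, window=4):
    """True if the last `window` positions contain at most two distinct states."""
    if len(position_history) < window:
        return False
    return len(set(position_history[-window:])) <= 2

def termination_state(position_history, q_value_history):
    """Pick the visited state with the lowest Q-value as the final prediction,
    following the observation of Alansary et al. [29]."""
    best = min(range(len(q_value_history)), key=lambda i: q_value_history[i])
    return position_history[best]
```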
Result: We conducted the experiment on a dataset consisting of CT scans of 30 patients, on which we also performed resampling and other preprocessing operations. During inference, the agent samples the environment, takes actions, and updates its state in turn until it reaches the termination state.
We compared our model against previous single-agent landmark-detection DQN models. The first two comparison models are the DQN models of Alansary et al. [29] and Vlontzos et al. [30], which were used for landmark detection in MRI images, such as those of the brain and heart. Table 2 shows the experimental results; the error distance reports the mean and spread of the distance error. Our model has a small error, with an average of approximately 3.52 mm, and good robustness, with a standard deviation of approximately 1.18 mm. Although the types of medical images used differ, the average error of our model is more than 1 mm lower. This indicates that the agent has learned an appropriate policy and can find the lesion target on unlabeled data, which demonstrates the effectiveness of the model. At the same time, Alansary et al. [29] used scans of 72 patients for training, while the training sets of Vlontzos et al. [30] for different detection targets contained scans of 728, 364, and 51 patients, respectively. Compared with these methods, our model requires much less data.
In addition, to evaluate the effectiveness of the improvement work, we conducted an ablation experiment. As shown in Table 3, we compared and evaluated the baseline model, the model with only network improvement (Model A), the model with only prior knowledge (Model B), and our final model. The results show that the fusion of prior knowledge can significantly reduce the error distance and improve the effectiveness of the model inference evaluation. This proves the effectiveness of our work.
We also compared and evaluated the hyperparameters used in the experiment to ensure that the model achieves its best performance. We trained five different combinations of hyperparameters and compared the resulting error and average loss. The results are shown in Table 4, where γ is the discount factor controlling the weight of future rewards (see Section 2), and δ is the step size from exploration to exploitation, i.e., the amount by which the exploration rate decays in each episode (ε ← ε − δ). We first fixed γ and varied δ, and then fixed the best-performing δ and varied γ. The results show that the model performs best with γ = 0.9 and δ = 1 × 10⁻⁴.
Implementation environment: The project was developed using the deep learning framework PyTorch v1.7.1 and the programming language Python v3.8. The training and testing of the model were performed on the Ubuntu 18.04.6 operating system with 128 GB of RAM. We also used an NVIDIA GeForce RTX 3090 card (24 GB, CUDA v11.2) for training acceleration; each episode took approximately 3 s, and training a model configuration took approximately 6 h.

5. Discussion and Conclusions

We propose a single-agent reinforcement learning method based on small-sample data for localizing the nasopharyngeal carcinoma lesion area in 3D-CT scans. Because the cancerous area lies in approximately the same location across patients, we found that the baseline model's agent explored useless space when searching, which affected the accuracy and speed of localization to some extent. Therefore, prior knowledge is added to our method, which reasonably reduces the search space required by the agent and speeds up model convergence. Specifically, the average relative position of the target lesion in the CT scan volumes is calculated from the training set and used as the agent's starting position for searching in both training and inference. At the same time, we use a multi-scale processing technique to facilitate the localization of small targets.
In the experimental section, we compared our model with a previous single-agent method and with the baseline model. The results show that the proposed model has good robustness and accuracy: while using a smaller dataset, its average error is more than 1 mm lower than that of the DQN models in recent studies. This indicates that the model can effectively locate the nasopharyngeal carcinoma lesion area in 3D-CT scans. We also conducted ablation experiments, whose results show that incorporating prior knowledge indeed improves the accuracy of the model, confirming the effectiveness of our work.
Future work: On the basis of existing DQN models, we may improve the DRL model performance with Double DQN or Dueling DQN and systematically evaluate the advantages and disadvantages of different models. In addition, the dataset in this experiment is limited in size, and more data should be obtained later in order to improve the accuracy of prior knowledge results and further improve the robustness of the model. Finally, the proposed method can be combined with the medical image segmentation method to form a two-stage nasopharyngeal carcinoma focal segmentation model so as to achieve more accurate extraction of focal areas.

Author Contributions

Conceptualization, H.W. and H.L.; Formal analysis, G.H. and Y.K.; Funding acquisition, G.H.; Methodology, G.H., H.W. and H.L.; Project administration, G.H. and H.L.; Resources, H.W. and H.L.; Supervision, H.W. and H.L.; Validation, G.H. and Y.K.; Visualization, G.H. and Y.K.; Writing—original draft, G.H. and Y.K.; Writing—review and editing, G.H., Y.K., H.W. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (61901533), the Shenzhen Fundamental Research Program of China (JCYJ20190807154601663), and the High-level Talents Research Project of NCWU (202101002).

Institutional Review Board Statement

This retrospective study was conducted in accordance with the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards. Ethics approval was obtained from the Institutional Review Board of SYSUCC (approval number: B2019-222-01). Furthermore, according to Article 32 of the "Measures for the Ethical Review of Life Science and Medical Research Involving Humans" (national legislation of China [34]), research using anonymized information data can be exempted from ethical review.

Informed Consent Statement

Owing to the retrospective nature and anonymized information used in this study, the requirement for informed consent was waived by our Institutional Review Board (approval number: B2019-222-01) and the national legislation of China [34].

Data Availability Statement

The CT scans used to support the findings of this study are available from the corresponding author upon request. The data are not publicly available due to patient privacy.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv 2014, arXiv:1412.6572. [Google Scholar]
  2. Buhrmester, V.; Münch, D.; Arens, M. Analysis of explainers of black box deep neural networks for computer vision: A survey. Mach. Learn. Knowl. Extr. 2021, 3, 966–989. [Google Scholar] [CrossRef]
  3. Stember, J.; Shalu, H. Deep reinforcement learning to detect brain lesions on MRI: A proof-of-concept application of reinforcement learning to medical images. arXiv 2020, arXiv:2008.02708. [Google Scholar]
  4. Zeng, H.M.; Zheng, R.S.; Guo, Y.M.; Zhang, S.; Zou, X.; Wang, N.; Zhang, L.; Tang, J.; Chen, J.; Wei, K.; et al. Cancer survival in China, 2003–2005: A population-based study. Int. J. Cancer 2015, 136, 1921–1930. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  5. Li, H.; Cao, D.; Li, S.; Chen, B.; Zhang, Y.; Zhu, Y.; Luo, C.; Lin, W.; Huang, W.; Ruan, G.; et al. Synergistic Association of Hepatitis B Surface Antigen and Plasma Epstein-Barr Virus DNA Load on Distant Metastasis in Patients with Nasopharyngeal Carcinoma. JAMA Netw. Open 2023, 6, e2253832. [Google Scholar] [CrossRef] [PubMed]
  6. Criminisi, A.; Shotton, J.; Robertson, D.; Konukoglu, E. Regression forests for efficient anatomy detection and localization in ct studies. In Proceedings of the International MICCAI Workshop on Medical Computer Vision, Beijing, China, 20 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 106–117. [Google Scholar]
  7. Gauriau, R.; Cuingnet, R.; Lesage, D.; Bloch, I. Multi-organ localization combining global-to-local regression and confidence maps. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Boston, MA, USA, 14–18 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 337–344. [Google Scholar]
  8. Ma, K.; Wang, J.; Singh, V.; Tamersoy, B.; Chang, Y.-J.; Wimmer, A.; Chen, T. Multimodal image registration with deep context reinforcement learning. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Quebec City, QC, Canada, 11–13 September 2017; Springer: Cham, Switzerland, 2017; pp. 240–248. [Google Scholar]
  9. Jie, Z.; Liang, X.; Feng, J.; Jin, X.; Lu, W.; Yan, S. Tree-Structured Reinforcement Learning for Sequential Object Localization. arXiv 2016, arXiv:1703.02710. [Google Scholar]
  10. Maicas, G.; Carneiro, G.; Bradley, A.P.; Nascimento, J.C.; Reid, I. Deep reinforcement learning for active breast lesion detection from dce-mri. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Quebec City, QC, Canada, 11–13 September 2017. [Google Scholar]
  11. Dou, H.; Yang, X.; Qian, J.; Xue, W.; Ni, D. Agent with Warm Start and Active Termination for Plane Localization in 3D Ultrasound. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Shenzhen, China, 13–17 October 2019. [Google Scholar]
  12. Ghesu, F.C.; Georgescu, B.; Mansi, T.; Neumann, D.; Hornegger, J.; Comaniciu, D. An artificial agent for anatomical landmark detection in medical images. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Athens, Greece, 17–21 October 2016; Springer: Cham, Switzerland, 2016; pp. 229–237. [Google Scholar]
  13. Zhang, M.; Xu, J.; Abaci Turk, E.; Grant, P.E.; Golland, P.; Adalsteinsson, E. Enhanced detection of fetal pose in 3D MRI by deep reinforcement learning with physical structure priors on anatomy. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Lima, Peru, 4–8 October 2020; Springer: Cham, Switzerland, 2020; pp. 396–405. [Google Scholar]
  14. Pesce, E.; Ypsilantis, P.P.; Withey, S.; Bakewell, R.; Goh, V.; Montana, G. Learning to detect chest radiographs containing lung nodules using visual attention networks. arXiv 2017, arXiv:1712.00996. [Google Scholar]
  15. Zhang, P.; Wang, F.; Zheng, Y. Deep Reinforcement Learning for Vessel Centerline Tracing in Multi-modality 3D Volumes. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Granada, Spain, 16–20 September 2018; Springer: Cham, Switzerland, 2018. [Google Scholar]
  16. Stember, J.N.; Shalu, H. Reinforcement learning using Deep Q Networks and Q learning accurately localizes brain tumors on MRI with very small training sets. BMC Med. Imaging 2022, 22, 224. [Google Scholar] [CrossRef] [PubMed]
  17. Navarro, F.; Sekuboyina, A.; Waldmannstetter, D.; Peeken, J.C.; Combs, S.E.; Menze, B.H. Deep reinforcement learning for organ localization in CT. In Proceedings of the Third Conference on Medical Imaging with Deep Learning, Montréal, QC, Canada, 6–9 July 2020; pp. 544–554. [Google Scholar]
  18. Jain, A.; Powers, A.; Johnson, H.J. Robust automatic multiple landmark detection. In Proceedings of the 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), Iowa City, IA, USA, 3–7 April 2020; pp. 1178–1182. [Google Scholar]
  19. Liu, Z.; Yao, C.; Yu, H.; Wu, T. Deep reinforcement learning with its application for lung cancer detection in medical Internet of Things. Future Gener. Comput. Syst. 2019, 97, 1–9. [Google Scholar] [CrossRef]
  20. Ali, I.; Hart, G.R.; Gunabushanam, G.; Liang, Y.; Muhammad, W.; Nartowt, B.; Kane, M.; Ma, X.; Deng, J. Lung nodule detection via deep reinforcement learning. Front. Oncol. 2018, 8, 108. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  21. Alansary, A.; Le Folgoc, L.; Vaillant, G.; Oktay, O.; Li, Y.; Bai, W.; Passerat-Palmbach, J.; Guerrero, R.; Kamnitsas, K.; Hou, B.; et al. Automatic View Planning with Multi-scale Deep Reinforcement Learning Agents. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Granada, Spain, 16–20 September 2018. [Google Scholar]
  22. Al, W.A.; Yun, D., II. Partial Policy-Based Reinforcement Learning for Anatomical Landmark Localization in 3D Medical Images. IEEE Trans. Med. Imaging 2020, 39, 1245–1255. [Google Scholar]
  23. Ghesu, F.C.; Georgescu, B.; Grbic, S.; Maier, A.K.; Hornegger, J.; Comaniciu, D. Robust Multi-scale Anatomical Landmark Detection in Incomplete 3D-CT Data. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Quebec City, QC, Canada, 11–13 September 2017. [Google Scholar]
  24. Ghesu, F.-C.; Georgescu, B.; Zheng, Y.; Grbic, S.; Maier, A.; Hornegger, J.; Comaniciu, D. Multi-scale deep reinforcement learning for real-time 3D-landmark detection in CT scans. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 41, 176–189. [Google Scholar] [CrossRef] [PubMed]
  25. Leroy, G.; Rueckert, D.; Alansary, A. Communicative Reinforcement Learning Agents for Landmark Detection in Brain Images. In Proceedings of the Machine Learning in Clinical Neuroimaging and Radiogenomics in Neuro-Oncology: Third International Workshop, MLCN 2020, and Second International Workshop, RNO-AI, Lima, Peru, 4–8 October 2020. [Google Scholar]
  26. Bellman, R. Dynamic Programming; Courier Corporation: Chelmsford, MA, USA, 2013. [Google Scholar]
  27. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529. [Google Scholar] [CrossRef] [PubMed]
  28. Ghesu, F.C.; Georgescu, B.; Grbic, S.; Maier, A.; Hornegger, J.; Comaniciu, D. Towards Intelligent Robust Detection of Anatomical Structures in Incomplete Volumetric Data. Med. Image Anal. 2018, 48, 203–213. [Google Scholar] [CrossRef]
  29. Alansary, A.; Oktay, O.; Li, Y.; Folgoc, L.L.; Hou, B.; Vaillant, G.; Kamnitsas, K.; Vlontzos, A.; Glocker, B.; Kainz, B.; et al. Evaluating reinforcement learning agents for anatomical landmark detection. Med. Image Anal. 2019, 53, 156–164. [Google Scholar] [CrossRef] [Green Version]
  30. Vlontzos, A.; Alansary, A.; Kamnitsas, K.; Rueckert, D.; Kainz, B. Multiple landmark detection using multi-agent reinforcement learning. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Shenzhen, China, 13–17 October 2019; Springer: Cham, Switzerland, 2019; pp. 262–270. [Google Scholar]
  31. Man, Y.; Huang, Y.; Feng, J.; Li, X.; Wu, F. Deep Q learning driven CT pancreas segmentation with geometry-aware U-Net. IEEE Trans. Med. Imaging 2019, 38, 1971–1980. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  32. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. arXiv 2015, arXiv:1511.05952. [Google Scholar]
  33. Yushkevich, P.A.; Piven, J.; Hazlett, H.C.; Smith, R.G.; Ho, S.; Gee, J.C.; Gerig, G. User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability. NeuroImage 2006, 31, 1116–1128. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  34. National Health Commission of the PRC; Ministry of Education; Ministry of Science and Technology; Bureau of Traditional Chinese Medicine. Measures for Ethical Review of Life Sciences and Medical Research Involving Human Subjects. Available online: https://www.gov.cn/zhengce/zhengceku/2023-02/28/content_5743658.htm (accessed on 1 April 2023).
Figure 1. Schematic diagram of the network structure of this experiment.
Figure 2. Schematic diagram of data processing.
Figure 3. Visualization diagram of the experimental process. The figure shows the agent search process. Starting from the starting position after adding prior knowledge, the agent searches at the maximum scale in (a–c), oscillations occur in (d–f), and the agent gradually reduces the search step size to the minimum.
Figure 4. Comparison of model training effectiveness before and after optimization.
Figure 5. Comparison of model training losses before and after optimization.
Table 1. Comparison of image attributes before and after preprocessing for a specific CT scan.

                         Dimensions          Voxel Spacing (mm)
Before preprocessing     512 × 512 × 109     1.05273 × 1.05273 × 3
After preprocessing      539 × 539 × 327     1 × 1 × 1
Table 2. Experimental results and comparison.

Method                    Error Distance (mm)
Alansary et al. [29]      3.66 ± 2.11
Vlontzos et al. [30]      4.47 ± 2.64
Baseline model            5.26 ± 1.15
Ours                      3.52 ± 1.18
Table 3. Ablation experiment.

Method            Error Distance (mm)
Baseline model    5.26 ± 1.15
Model A           4.59 ± 1.56
Model B           4.54 ± 1.31
Ours              3.52 ± 1.18
Table 4. Model performance under different hyperparameters.

γ       δ           Error Distance (mm)    Average Loss
0.9     2 × 10⁻⁴    4.36 ± 1.22            0.269
0.9     1 × 10⁻⁴    3.52 ± 1.18            0.134
0.9     5 × 10⁻⁵    5.06 ± 1.94            0.332
0.99    1 × 10⁻⁴    26.45 ± 2.44           16.08
0.8     1 × 10⁻⁴    4.19 ± 0.92            0.147
