1. Introduction
Retinopathy of prematurity (ROP) is a potentially blinding disease of preterm infants, caused by pathological neovascularization in the retinal fundus [
1]. ROP continues to be a major, preventable cause of blindness and visual impairment in children both in developing and developed countries [
ROP occurs in babies born prematurely before 32 weeks of gestation and with low birth weight (less than 1.5 kg) [
3,
4]. Globally, 19 million children are estimated to suffer from visual impairment [
5]. Over 1.84 million of these children were likely to have developed ROP at any stage, of which approximately 11% would have become totally blind or severely visually impaired and 7% would have developed mild/moderate visual impairment because of ROP [
6]. The incidence of ROP in developed and developing countries is estimated to be 9% and 12%, respectively. ROP can progress from mild to severe stages [
7]. Abnormal growth of the retinal blood vessels is observed in ROP-affected infants. Blindness can also occur because of retinal detachment, unless treated in the initial stages [
2]. Laser treatment, anti-VEGF therapy, surgical treatment, or treatment with drugs have proven to be effective in treating ROP [
8,
9,
10]. ROP is categorized from mild to severe (Stage 1 to Stage 5), depending on the severity [
4,
11,
12]. In brief, Stage 1 is the initial stage, in which a thin, flat, whitish line known as the demarcation line appears, separating the vascularized and avascular regions of the retina. This demarcation line prevents the supply of blood to the outer edges of the retina. In Stage 2, the thin demarcation line broadens into a raised ridge and changes in color from white to pinkish. In Stage 3, the ridge increases in dimension, and new abnormal blood vessels grow from it (
Figure 1). In Stage 4, partial retinal detachment occurs, which may result in complete retinal detachment. Finally, in Stage 5, the person may become blind or suffer from permanent loss of vision [
4,
12].
Studies have shown that the condition of an infant with Stage 2 ROP may improve without treatment. However, if the disease has progressed to Stage 3, diagnosis and treatment are crucial to prevent the disease from progressing to later stages. Various strategies for treating ROP are available [
13,
14]. Regular screening of preterm infants is crucial because ROP can be associated with sequelae such as astigmatism, myopia, glaucoma, cataracts, anisometropia, amblyopia, strabismus, and retinal detachment. ROP can be detected by either pediatric ophthalmologists or retinopathy specialists. However, while the number of ROP cases is on the rise, the number of ophthalmologists capable of ROP screening is on the decline [
15,
16]. In rural areas, in particular, the detection of ROP is not easy owing to a lack of ROP specialists. Approximately 36% of neonatologists in the USA were unable to transfer children with ROP to a neonatal intensive care unit for screening owing to a lack of specialists at the care unit [
17]. Alternative strategies such as telemedicine and computer-aided diagnosis (CAD) must therefore be adopted to diagnose ROP. Telemedicine has been found to be effective in the diagnosis of ROP [
18], and the CAD of ocular diseases has made considerable progress; data reveal its high potential for future breakthroughs [
19,
20].
The use of artificial intelligence (AI) in the field of medicine has increased in recent years owing to advancements in AI technologies. Deep learning models have made remarkable progress in medical diagnosis and have been applied to various computer vision tasks, including image classification, object detection, image segmentation, and disease diagnosis. Owing to advancements in deep network architectures and access to big data, the use of AI has been proposed to reduce the burden on medical experts. Traditional machine learning algorithms such as logistic regression, support vector machines, and fuzzy decision trees have been used for image recognition and classification. However, these approaches require additional techniques such as feature extraction and dimensionality reduction, which are time-consuming. Moreover, the conversion of the image matrix to a one-dimensional vector leads to the loss of critical spatial information, which can lower the performance of the models. In a convolutional neural network (CNN), by contrast, classification is accomplished by extracting features from raw input images with the convolutional layers, followed by dimensionality reduction with the pooling layers.
Transfer learning is a useful concept in CNNs, in which knowledge acquired on one task is applied to a different but related problem. Pretrained models such as VGG16 and InceptionV3 have been trained on rich data sources such as ImageNet, which contains 1.2 million natural images across 1000 categories [
21]. These models were built from scratch using substantial computational resources and have learned features such as edges, shapes, lighting, rotation, and spatial information. This knowledge is useful for extracting features from images in a different domain. The availability of vast training datasets is essential for a model to achieve high performance; training on a small dataset may lead to underperformance or overfitting, which can be mitigated by transfer learning. Transfer learning is thus particularly useful in classification tasks; it improves the generalization ability of a model when the training dataset is small (not even in the thousands) [
22]. This strategy is useful for classifying images and predicting disease where the dataset is small, such as the dataset used in the present study.
CNNs have been used in image classification, and since 2012, they have exhibited high performance in the diagnosis of diseases [
23]. CNNs have been successfully used in the diagnosis of lung cancer [
9], glioma [
24], pneumonia [
25], skin cancer [
26], brain tumor [
27], and other medical conditions [
28]. Recently, deep learning was also used for the accurate diagnosis of the COVID-19 symptoms by using CT images [
29]. Deep learning has also been used for the diagnosis of eye diseases such as diabetic retinopathy [
30,
31] and glaucoma [
32], which are eye diseases associated with ROP. Studies have developed a deep learning algorithm for the automated diagnosis of plus disease by using fundus images [
33]. Transfer learning has been used to pretrain models for classifying ROP images [
34]. Studies have also employed a CAD system for plus disease and the measurement of tortuosity from retinal fundus images [
35]. Owing to the excellent results achieved with CNNs in the medical image processing field, researchers proposed a novel CNN architecture for diagnosing plus disease in ROP by using a pretrained GoogLeNet to visualize feature maps of pathologies learned directly from the data [
36]. The field has advanced with the use of two CNN methods to diagnose plus disease in ROP [
37]. Recently, ROP was screened using deep neural networks (DNNs) [
38,
39,
40]. In these studies, retinal fundus images were used to train and test models for detecting ROP.
A robust and reliable automated ROP detection system is currently required to diagnose ROP in its initial stages of development. To this end, the present study aimed to achieve high accuracy in the diagnosis of ROP by using RetCam fundus images captured from preterm infants, with the goal of providing a CAD system for clinical use. We applied transfer learning to deep CNN models and first identified whether an eye condition was normal (NOROP) or abnormal (ROP); abnormal cases were then classified by severity as either mild-ROP or severe-ROP. Because blindness due to ROP can be prevented through early diagnosis, detection in the initial stages of the disease is crucial both for administering proper treatment to prematurely born infants and for understanding the progression of the disease. This study presents an automated diagnosis of ROP using various classification models, and our findings have the potential to assist ophthalmologists in diagnosing the disease at an early stage.
In this study, we applied transfer learning to deep CNN models and compared their capabilities in the detection of ROP by using retinal fundus images. We aimed to determine the absence or presence of ROP (NOROP or ROP) in a preterm infant as well as the severity of the disease (mild-ROP or severe-ROP). We used five pretrained models with different architectures, namely VGG19, VGG16, InceptionV3, DenseNet, and MobileNet. The major contributions of the study are as follows:
We investigated a large variety of backbone models of different architectures; these models differed in the number of convolutional layers they had. The pretrained models and their number of convolutional layers are listed in
Table 1.
We comprehensively explored different backbone architectures in terms of performance. We demonstrated significant variation in performance across backbone models.
Given the variation in performance across the different backbones in this domain, our work highlights the necessity of careful backbone model selection and provides clear benchmarks to assist it.
We achieved the optimal results with the VGG19 model in terms of classifying ROP and NOROP and identifying the severity of ROP with high sensitivity and specificity.
We performed 5-fold cross-validation on the datasets to evaluate the performance of the VGG19 model.
The rest of the paper is organized as follows: In
Section 2, we provide a brief description of the dataset, an overview of the training of classification models, and the evaluation method. In
Section 3, we present our approach and its results on the performance of the classification models in the diagnosis of ROP, along with a discussion. In
Section 4, we summarize our findings, draw some conclusions, and state directions for future work.
2. Materials and Methods
In this study, we aimed to predict the occurrence of ROP in preterm infants’ eyes. We examined retinal fundus images of the patients’ eyes to determine the absence or presence of ROP.
2.1. Dataset
All the fundus images were captured by expert technicians using the RetCam imaging system (Clarity Medical System, Pleasanton, CA, USA). The datasets were procured from the neonatal intensive care units of (1) Chang Gung Memorial Hospital, Linkou, Taiwan, and (2) Osaka Women’s and Children’s Hospital, Japan. Both are specialized hospitals that have provided ROP screening services for several years. A total of 5–22 images were collected during each ROP screening session, and the dataset from each patient was classified into one of two eye cases: NOROP or ROP. The patients’ demographic datasets were captured before July 2019. To be included in this study, the patients had to satisfy at least one of the following criteria: born before 37 weeks of gestation and/or weighing ≤ 1500 g at birth.
2.2. Image Labeling
Three senior ophthalmologists who had over 10 years of experience working with patients with ROP were involved in the study. These experts labeled the fundus images as NOROP (normal/without disease) or ROP (with the disease) according to the guidelines set by the International Classification of Retinopathy of Prematurity. Furthermore, the different stages of ROP were classified as Stage 1, Stage 2, and Stage 3. The three ophthalmologists first labeled the images independently; the labels were then compared to identify any inconsistency in the labeling process (i.e., whether a particular image had been assigned different labels by the experts). Any inconsistently labeled images were subsequently discussed collectively by the experts, and a consensus label was assigned. The ophthalmologists defined the severity of the disease as mild-ROP if the eye cases belonged to Stage 1 or Stage 2 ROP, or severe-ROP if the eye cases belonged to Stage 3 ROP. A description of the different ROP stages can be found in the literature [
11,
41].
First, the present study aimed to identify from fundus images whether an infant had ROP. Then, the images that indicated the presence of ROP were further classified as mild-ROP or severe-ROP. All the different test cases from the patients were manually labeled by these experts and compared with the DNN model predictions.
2.3. Dataset Description and Preprocessing
The resolutions of the fundus images of infants’ eyes were 1600 × 1200 for the Taiwanese dataset and 640 × 480 for the Japanese dataset. A total of 6500 images of the left and right eyes of 210 infants were collected. We used data of 106 patients for training the ROP/NOROP model. Unclear, blurred, or dark images were omitted from the analysis. We considered fundus images showing the different stages of ROP in the same infant for analysis, ensuring that no patient appeared in both the training dataset and the test dataset.
2.3.1. Image Normalization
All the data were first subjected to preprocessing before running the classification models. The preprocessing step included resizing the images to 224 × 224 × 3 (height × width × color channels) for the MobileNet, DenseNet, VGG16, and VGG19 models and 299 × 299 × 3 for the InceptionV3 model. The images were loaded using OpenCV, resized, and converted to NumPy arrays. The input images were then normalized by rescaling their pixel values to the range 0–1, i.e., dividing all pixel values by the maximum pixel value of 255.
Using the Keras ImageDataGenerator preprocessing API, we performed data augmentation, and the convolutional layers were loaded with pretrained weights.
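The loading and normalization steps above can be sketched as follows. This is an illustrative sketch only: the study’s pipeline uses OpenCV for resizing, whereas here a nearest-neighbour NumPy resize is substituted so the example is self-contained, and the dummy array stands in for a real fundus image.

```python
import numpy as np

def normalize_image(img, target_size=(224, 224)):
    """Resize an image and rescale pixel values from [0, 255] to [0, 1].
    (The actual pipeline resizes with OpenCV, e.g. cv2.resize(img, target_size);
    a nearest-neighbour index-based resize is used here to avoid the dependency.)"""
    h, w = img.shape[:2]
    rows = np.arange(target_size[1]) * h // target_size[1]
    cols = np.arange(target_size[0]) * w // target_size[0]
    resized = img[rows][:, cols]
    # Normalize by the maximum pixel value of 255
    return resized.astype(np.float32) / 255.0

# Dummy 640 x 480 RGB image standing in for a RetCam fundus image
dummy = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
out = normalize_image(dummy)
print(out.shape)  # (224, 224, 3)
```

For the InceptionV3 model, the same function would be called with `target_size=(299, 299)`.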
2.3.2. Data Augmentation
Training a model with a small amount of data can lead to overfitting. To overcome this issue, we employed data augmentation to create new retinal fundus images from the existing training dataset. The augmentation techniques used in this study included rotation_range [−3, 3], width_shift_range [−0.1, 0.1], height_shift_range [−0.1, 0.1], zoom_range [0.85, 1.15], and horizontal_flip. The training dataset was augmented seven times, resulting in a total of 18,808 images for training. From our initial tests, as expected, data augmentation increased the prediction accuracies on the ROP and NOROP datasets for the VGG19 and VGG16 models. Hence, we applied the augmentation techniques to all the classification models used in the present study.
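As a rough illustration of these augmentations, the sketch below implements the width/height shifts and the horizontal flip in plain NumPy; rotation and zoom, which the Keras ImageDataGenerator handles in the actual pipeline, are omitted for brevity, and the input image is a hypothetical dummy array.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, rng):
    """Apply random shifts and a horizontal flip, mirroring a subset of the
    ImageDataGenerator settings used in the study (shifts zero-fill the gap)."""
    h, w = img.shape[:2]
    out = img.copy()
    # width_shift_range [-0.1, 0.1]
    dx = int(rng.uniform(-0.1, 0.1) * w)
    out = np.roll(out, dx, axis=1)
    if dx > 0:
        out[:, :dx] = 0
    elif dx < 0:
        out[:, dx:] = 0
    # height_shift_range [-0.1, 0.1]
    dy = int(rng.uniform(-0.1, 0.1) * h)
    out = np.roll(out, dy, axis=0)
    if dy > 0:
        out[:dy, :] = 0
    elif dy < 0:
        out[dy:, :] = 0
    # horizontal_flip with probability 0.5
    if rng.random() < 0.5:
        out = out[:, ::-1]
    return out

img = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
# Augmenting each training image seven times, as in the study
augmented = [augment(img, rng) for _ in range(7)]
print(len(augmented), augmented[0].shape)  # 7 (224, 224, 3)
```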
Blurry, overly bright, or otherwise unclear images were filtered out from the image datasets of all the patients. For the NOROP training dataset, we selected 108 eye cases from 54 patients and obtained a total of 1222 images. For the ROP training dataset, which included Stage 1, Stage 2, and Stage 3 cases, we randomly selected a similar number of patients and images to balance the NOROP dataset. Overall, for the ROP training dataset, we selected 159 eye cases from 52 patients and obtained 1129 images. For testing the accuracy of the classification models, data from 25 patients were used. Details on the number of patients and eye cases used in the training set and test set are presented in
Table 2. The training set and test set for identifying the severity of the disease as mild-ROP or severe-ROP are presented in
Table 3.
2.4. Classification Model Training
We applied transfer learning to the models for ROP identification and ROP severity classification. Transfer learning was performed by freezing the initial layers of the pretrained model and replacing the original three fully connected (FC) layers with new FC layers, the last of which served as the classification layer. Only the weights of the convolutional layers were copied, rather than the weights of the entire network including the FC layers. An illustration of the model is shown in
Figure 2. In the present study, we determined the appropriate FC layer parameters by testing different layer sizes between 100 and 600 to obtain optimal results (i.e., high accuracy on the validation set, a low error rate, and no overfitting). The results of the model comparison are shown as confusion matrices.
In the present study, we covered a large variety of backbone models by selecting them from different architecture types, such as the VGG family (VGG11, VGG13, VGG16, and VGG19), the MobileNet group (MobileNet, MobileNetV2, ShuffleNet, and FD-MobileNet), the Inception family (InceptionV1, InceptionV2, and InceptionV3), and the DenseNet group (DenseNet, HarDNet, and S-Net). We selected two models from the VGG group and one each from the remaining groups. In total, we selected five different classification models, each with a different number of layers (ranging from 13 to 103). The models VGG19 and VGG16 [
42], which belong to the same family, have 16 and 13 convolutional layers, respectively. The InceptionV3 [
43], DenseNet [
44], and MobileNet [
45] models have 48, 28, and 103 convolutional layers, respectively. All five DNN models were selected to achieve our primary aim of identifying ROP. The performance of the models was then evaluated, and the two models that exhibited the best performance were chosen for identifying the severity of the disease. Our method included loading the weights of the pretrained model provided by Keras. We added our classifiers by replacing the FC layers of the model with four dense layers and fine-tuned them. In the VGG16 and VGG19 models, the first and second FC layers each had a size of 200, and the corresponding dropout layers had a 50% drop rate. The third FC layer had a size of 64, and the third dropout layer had a 50% drop rate. A softmax layer was stacked at the end, taking the FC layer output to classify each fundus image as ROP or NOROP. In the InceptionV3, MobileNet, and DenseNet models, the first and second FC layers had sizes of 100 and 64, respectively. The first, second, and third dropout layers had a drop rate of 50%, and the final layer was a softmax layer. In identifying the severity of ROP (mild-ROP or severe-ROP), the optimal results were obtained with two FC layers, with the third layer as the classification layer. Here, the first and second FC layers had sizes of 512 and 200, respectively, and the dropout layers had a 50% drop rate. The Adam optimizer was used at a learning rate of 2 × 10⁻⁵, categorical cross-entropy was used as the loss function, and the batch size was set to 10.
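As an illustration of the VGG16/VGG19 classifier head described above, the following NumPy sketch runs a single forward pass through FC layers of sizes 200, 200, and 64 followed by a softmax over the two classes. The weights are random (untrained), the ReLU activations are our assumption (the activations are not stated in the text), and the dropout layers are omitted because they are inactive at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, n_out, rng):
    """One fully connected layer with random (untrained) weights."""
    w = rng.normal(0, 0.05, size=(x.shape[-1], n_out))
    b = np.zeros(n_out)
    return x @ w + b

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Flattened features from the frozen VGG backbone
# (7 x 7 x 512 = 25088 for a 224 x 224 input); random values here.
features = rng.normal(size=(1, 25088))

# FC head for the VGG16/VGG19 models: 200 -> 200 -> 64 -> softmax(2)
x = np.maximum(dense(features, 200, rng), 0)   # FC1 + ReLU (assumed)
x = np.maximum(dense(x, 200, rng), 0)          # FC2 + ReLU (assumed)
x = np.maximum(dense(x, 64, rng), 0)           # FC3 + ReLU (assumed)
probs = softmax(dense(x, 2, rng))              # ROP vs. NOROP probabilities

print(probs.shape)  # (1, 2)
```

In the trained model, the class with the higher softmax probability determines whether the image is labeled ROP or NOROP.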
2.5. Model Evaluation
The findings of the classification models are represented as a confusion matrix. In binary classification, a confusion matrix summarizes the numbers of instances/cases as true positives (TPs—instances correctly predicted as the class of interest), true negatives (TNs—instances correctly predicted as not belonging to the class of interest), false positives (FPs—instances assigned to the class of interest although they do not belong to it), and false negatives (FNs—instances that belong to the class of interest but were assigned to the complementary class). A conventional illustration of the confusion matrix is given in
Figure 3.
We evaluated and compared the performance of the five models by calculating the sensitivity, specificity, precision, accuracy, true positive rate, and false positive rate using the equations given below. In brief, sensitivity refers to the percentage of actual positive cases that are correctly identified by the classification model on the test cases, whereas specificity refers to the percentage of actual negative cases that are correctly identified [
46]. Precision is the percentage of instances predicted as positive that actually belong to the positive class [
47].
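These measures follow directly from the confusion-matrix counts; a minimal sketch, using hypothetical counts rather than any results from this study:

```python
def metrics(tp, fn, fp, tn):
    """Compute the evaluation measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)           # true positive rate (recall)
    specificity = tn / (tn + fp)           # true negative rate
    precision   = tp / (tp + fp)           # positive predictive value
    accuracy    = (tp + tn) / (tp + fn + fp + tn)
    fpr         = fp / (fp + tn)           # false positive rate = 1 - specificity
    return sensitivity, specificity, precision, accuracy, fpr

# Hypothetical counts for illustration only
sens, spec, prec, acc, fpr = metrics(tp=45, fn=5, fp=10, tn=40)
print(sens, spec, round(prec, 3), acc, fpr)  # 0.9 0.8 0.818 0.85 0.2
```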
Figure 4 shows the schematic of the entire workflow of the classification process. The entire dataset was first divided into training and test datasets. These datasets then underwent preprocessing and normalization. The preprocessed training data were then subjected to augmentation. After model testing and hyperparameter tuning to obtain the optimal results on the validation dataset, the model was deployed on the test dataset for binary classification. The model performance was then evaluated in terms of prediction accuracy, sensitivity, and specificity. Additionally, we calculated the area under the curve (AUC) to evaluate the performance of the models and examined the errors made by the different models in classifying the stages of ROP.
2.6. 5-Fold Cross-Validation
Cross-validation was used to obtain a more reliable estimate of model performance by using the training and test samples multiple times. We evaluated the performance of the VGG19 model through 5-fold cross-validation on the ROP/NOROP data. All the patients’ datasets were combined and divided into 5 folds: 80% (4 folds) served as the training dataset and 20% (1 fold) as the test dataset, and the performance of the model was validated on each of the 5 folds. To train the ROP and NOROP data, 75 and 78 patients were used, respectively; the numbers of eye cases were 150 and 218, respectively. Each fold contained the data of at least 15 patients. Similarly, we performed 5-fold cross-validation on the mild-ROP and severe-ROP patient data: 70 and 65 patients were used to train the mild-ROP and severe-ROP data, respectively; the numbers of eye cases were 209 and 130, respectively.
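A patient-level split of this kind, in which folds are formed over patients rather than individual images so that no patient’s images appear in both the training and test partitions, can be sketched as follows. The patient IDs below are hypothetical, and the exact fold assignment used in the study may differ.

```python
import random

def patient_level_folds(patient_ids, k=5, seed=0):
    """Split unique patients into k folds so that no patient's images can
    appear in both the training and test partitions of any fold."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

# Hypothetical patient IDs for illustration (75 patients, as in the ROP set)
patients = [f"P{i:03d}" for i in range(75)]
folds = patient_level_folds(patients, k=5)

for test_fold in folds:
    train = [p for f in folds if f is not test_fold for p in f]
    assert not set(train) & set(test_fold)  # no patient overlap
print([len(f) for f in folds])  # [15, 15, 15, 15, 15]
```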