Article
Peer-Review Record

Blended Multi-Modal Deep ConvNet Features for Diabetic Retinopathy Severity Prediction

Electronics 2020, 9(6), 914; https://doi.org/10.3390/electronics9060914
by Jyostna Devi Bodapati 1, Veeranjaneyulu Naralasetti 2, Shaik Nagur Shareef 1, Saqib Hakak 3, Muhammad Bilal 4, Praveen Kumar Reddy Maddikunta 5 and Ohyun Jo 6,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 3 May 2020 / Revised: 25 May 2020 / Accepted: 28 May 2020 / Published: 30 May 2020
(This article belongs to the Special Issue Computational Intelligence in Healthcare)

Round 1

Reviewer 1 Report

This is an interesting paper that proposes different fusion methods to blend multi-modal features for Diabetic Retinopathy (DR) severity prediction, which is valuable in practice. The paper makes several main contributions. First, it blends the prominent features extracted from multiple pre-trained ConvNets to obtain a stronger representation of fundus images. Second, in the design of the deep neural network architecture for prediction, dropout is introduced at the input layer to prevent the model from over-fitting. Finally, compared with existing methods, the prediction accuracy of the proposed method is improved. Several suggestions follow that could further strengthen this paper.
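[Editor's note] For readers unfamiliar with the input-layer dropout idea mentioned above, a minimal sketch is shown below. The layer sizes, dropout rates, and optimizer are illustrative assumptions, not the authors' exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(input_dim: int, num_classes: int = 5) -> tf.keras.Model:
    """DNN over pre-extracted deep features, with dropout at the input layer."""
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dropout(0.5),                 # dropout applied directly to the input features
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_classifier(input_dim=4096)     # e.g. a 4096-d blended feature vector
```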

 

  1. It is noted that your manuscript needs careful editing by someone with expertise in technical English, paying particular attention to grammar, spelling, and sentence structure, so that the goals and results of the study are clear to the reader.

 

  2. In Section 1 “INTRODUCTION”, two tasks of DR diagnosis (checking the presence or absence of DR and identifying the severity level of the DR disease) are stated. There should therefore be more explanation of why the severity levels in task 2 include “No DR”.

 

  3. In Section 3 “PROPOSED METHODOLOGY”, pre-trained convolutional neural network architectures are used to extract features from color fundus images. A description of how the pre-trained architectures are obtained should therefore be given so that readers can reproduce the methods of this paper.

 

  4. In Section 3.3 “USING MULTI-MODAL DEEP FEATURES TO TRAIN THE MODEL”, only the features extracted from the VGG16 and Xception Nets are selected for blending. However, in Section 3.2 “USING UNI-MODAL DEEP FEATURES TO TRAIN THE MODEL”, VGG16, Xception, NASNet and InceptionResNetV2 are all used to extract features. More explanation of the ConvNet selection is therefore needed. In addition, it is also important to explain why 1-D max pooling and average cross pooling are chosen to blend the features in Section 3.3, so as to strengthen the rationale for the proposed method.

 

  5. In Section 4 “EXPERIMENTAL RESULTS”, the severity-level prediction results could be presented more intuitively, for example by drawing the confusion matrix to facilitate reading. Furthermore, it is also recommended to discuss the results from the perspective of their practical significance, thereby highlighting the contributions of this paper.

 

  6. In the reference section, the authors may consider adding some recent literature. For example, “Real-Time Fault Diagnosis for Gas Turbine Blade Based on Output-Hidden Feedback Elman Neural Network” (2018); “An ensemble framework based on convolutional bi-directional LSTM with multiple time windows for remaining useful life estimation” (2020).

 

Overall, the paper is interesting, well-motivated and developed with a complete experimental study. I believe that if the authors make a serious effort to account for all the comments provided, the paper will be ready for publication. I deem this paper a Minor Revision.

Author Response

We would like to thank the Editor and reviewers for their helpful comments and suggestions. We have updated the manuscript to address all the comments supplied by the reviewers. We feel that by incorporating these suggestions, the quality of the paper has improved substantially. The material below addresses each issue raised by the editors and the reviewers.

Comment: This is an interesting paper that proposes different fusion methods to blend multi-modal features for Diabetic Retinopathy (DR) severity prediction, which is valuable in practice. The paper makes several main contributions. First, it blends the prominent features extracted from multiple pre-trained ConvNets to obtain a stronger representation of fundus images. Second, in the design of the deep neural network architecture for prediction, dropout is introduced at the input layer to prevent the model from over-fitting. Finally, compared with existing methods, the prediction accuracy of the proposed method is improved. Several suggestions follow that could further strengthen this paper.

Response:

We thank the reviewer and appreciate the positive feedback.

 

Comment 1: It is noted that your manuscript needs careful editing by someone with expertise in technical English, paying particular attention to grammar, spelling, and sentence structure, so that the goals and results of the study are clear to the reader.

Response: As suggested by the reviewer, we have edited the entire manuscript, taking care to improve the technical English.

 

Comment 2: In Section 1 “INTRODUCTION”, two tasks of DR diagnosis (checking the presence or absence of DR and identifying the severity level of the DR disease) are stated. There should therefore be more explanation of why the severity levels in task 2 include “No DR”.

Response: As suggested by the reviewer, we have addressed this issue in Section 4.4 by including a justification of why the severity levels in task 2 include “No DR”.

 

Comment 3: In Section 3 “PROPOSED METHODOLOGY”, pre-trained convolutional neural network architectures are used to extract features from color fundus images. A description of how the pre-trained architectures are obtained should therefore be given so that readers can reproduce the methods of this paper.

Response: As suggested by the reviewer, we have addressed this issue in Section 3.1 by including a description of how the pre-trained architectures are used to extract deep features from images.

 

Comment 4: In Section 3.3 “USING MULTI-MODAL DEEP FEATURES TO TRAIN THE MODEL”, only the features extracted from the VGG16 and Xception Nets are selected for blending. However, in Section 3.2 “USING UNI-MODAL DEEP FEATURES TO TRAIN THE MODEL”, VGG16, Xception, NASNet and InceptionResNetV2 are all used to extract features. More explanation of the ConvNet selection is therefore needed. In addition, it is also important to explain why 1-D max pooling and average cross pooling are chosen to blend the features in Section 3.3, so as to strengthen the rationale for the proposed method.

Response: To address this issue, we have explained the rationale behind choosing VGG16 and Xception. Figure 5 is included to support our claims.

 

Comment 5: In Section 4 “EXPERIMENTAL RESULTS”, the severity-level prediction results could be presented more intuitively, for example by drawing the confusion matrix to facilitate reading. Furthermore, it is also recommended to discuss the results from the perspective of their practical significance, thereby highlighting the contributions of this paper.

Response: As suggested by the reviewer, a confusion matrix is included as Figure 8 to facilitate reading.
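[Editor's note] Confusion matrices of this kind can be rendered with scikit-learn. The sketch below is a generic illustration assuming the five standard DR severity labels and placeholder predictions; it does not reproduce the actual Figure 8 data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Placeholder labels/predictions for the five severity classes (0 = No DR, ...).
y_true = np.array([0, 1, 2, 3, 4, 0, 2, 2])
y_pred = np.array([0, 1, 2, 2, 4, 0, 2, 1])

labels = ["No DR", "Mild", "Moderate", "Severe", "Proliferative"]
cm = confusion_matrix(y_true, y_pred, labels=list(range(5)))
ConfusionMatrixDisplay(cm, display_labels=labels).plot(cmap="Blues")
plt.show()
```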

 

Comment 6: In the reference section, the authors may consider adding some recent literature. For example, “Real-Time Fault Diagnosis for Gas Turbine Blade Based on Output-Hidden Feedback Elman Neural Network” (2018); “An ensemble framework based on convolutional bi-directional LSTM with multiple time windows for remaining useful life estimation” (2020).

Response: As suggested by the reviewer, both papers are now cited in the manuscript.

 

Comment: Overall, the paper is interesting, well-motivated and developed with a complete experimental study. I believe that if the authors make a serious effort to account for all the comments provided, the paper will be ready for publication. I deem this paper a Minor Revision.

Response: We thank the reviewer for the valuable inputs. As per the suggestions given by the reviewer, we have updated the manuscript. All the comments by Reviewer 1 have been addressed.

 

 

Reviewer 2 Report

This article presents a deep learning approach combining multiple pre-trained convolutional network models for image-based automatic identification and classification of severity levels of Diabetic Retinopathy (DR). Features are extracted from color photographs of the retina by multiple pre-trained Convolutional Networks (CN), then pooled. The resulting features are fed into a Deep Learning Neural Network (DLNN) trained to detect and classify severity levels of DR. Benchmarking shows that pooling yields a better representation of critical features compared with any single-CN approach. The DLNN is suggested to yield better classification in terms of Kappa scores, though not so much in terms of classification accuracy, in comparison with conventional machine learning. The multi-modal fusion allows the proposed classification model to converge more rapidly compared with models that use deep features from any single CN. Unfortunately, the rationale underlying the methodological choices made for this study is not made clear in the text, which simply lists what was done. It remains largely unclear what exactly the supposedly "prominent diagnostic features" (i.e. the image criteria and their clinical relevance) that clinicians look for in fundus images of DR would be, and why the proposed method is a particularly good choice for extracting them. The manuscript needs extensive major and minor revisions before it can be recommended for publication.

Major

The introduction refers to the current state of the art and is acceptably well written. The goal of this paper is to present a simple and robust model for classifying the severity of DR on the basis of critical features in colored fundus images, with the objective of extracting the most descriptive and discriminative features. To accomplish this goal, the authors use deep features extracted by several pre-trained CNN architectures, then use them to train a Deep Neural Network with automatic selective dropout in the early processing layers.

The methodology part requires substantial revisions to clarify the rationale underlying the choices made by the authors; for example:
- lines 176-177: "features extracted from pre-trained deep models are proved to be better than the hand crafted features as these models can represent the images efficiently" - this sentence is not clear, and such lack of clarity persists in the text later on. Please define with precision what you understand by "pre-trained deep models", "hand crafted features", and "represent the images efficiently".

The rationale for a presumed advantage of using combinations of different features from different pre-trained CN or CNN architectures is acceptable, but not well enough explained later on in the relevant text parts. For example:

- in lines 192-201: "We extracted the most significant features of color fundus images using various pre-trained models (VGG16, NASNet, Xception and InceptionResNetV2). We reshaped the images according to the fixed input dimensions accepted by these models. For each fundus image we extract deep features, as follows: using VGG16, we collect features from the first fully connected layer (fc1) and get a feature vector of 4096 dimensions. Using the same VGG16, we collect features from the second fully connected layer (fc2) and get a feature vector of 4096 dimensions. NASNet has a huge number of layers compared with the other models, and the length of the feature vector is 4032 after applying a global average pooling operation at the end. The global average pooling layer of the Xception architecture is used to extract 2048 features. Using the InceptionResNetV2 architecture we obtain 1536 features from global average pooling." This whole text relating to the methodology of the approach proposed by the authors does not provide a clear rationale for the choice of these network architectures; it simply lists them. This is not enough. The authors need to explain why the choices made here are pertinent for extracting different, clinically complementary and diagnostically relevant image features. The description as given here is void of either clinically driven or theory-driven insights and appears free of any particular hypothesis. The reader gets the impression that the authors went "fishing in the dark" on the basis of an apparently rational idea (combine several deep feature models instead of using a single one and see what it gives) and got lucky. The methodological choices made need to be justified on the basis of a clear rationale and explicit criteria. What is clinically relevant in the images, and how do the CNN combinations used here allow extracting the clinically relevant image information?
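[Editor's note] The quoted extraction pipeline maps naturally onto the pre-trained models shipped with Keras. The sketch below is an illustrative reconstruction under assumed defaults (ImageNet weights, standard input sizes, a placeholder image path), not the authors' released code; NASNet and InceptionResNetV2 follow the same global-average-pooling pattern as Xception.

```python
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.applications import VGG16, Xception
from tensorflow.keras.applications.vgg16 import preprocess_input as vgg_pre
from tensorflow.keras.applications.xception import preprocess_input as xcp_pre
from tensorflow.keras.preprocessing import image

# VGG16 with its classifier head kept, so the fc1/fc2 layers are available.
vgg = VGG16(weights="imagenet", include_top=True)
vgg_fc1 = Model(vgg.input, vgg.get_layer("fc1").output)   # 4096-d features
vgg_fc2 = Model(vgg.input, vgg.get_layer("fc2").output)   # 4096-d features

# Xception without its head; pooling="avg" applies global average pooling (2048-d).
xcp = Xception(weights="imagenet", include_top=False, pooling="avg")

def load_batch(path, size):
    """Resize a fundus image to the model's fixed input size and add a batch axis."""
    img = image.load_img(path, target_size=size)
    return np.expand_dims(image.img_to_array(img), axis=0)

x224 = vgg_pre(load_batch("fundus.png", (224, 224)))   # placeholder image path
x299 = xcp_pre(load_batch("fundus.png", (299, 299)))

f_fc1 = vgg_fc1.predict(x224)   # shape (1, 4096)
f_fc2 = vgg_fc2.predict(x224)   # shape (1, 4096)
f_xcp = xcp.predict(x299)       # shape (1, 2048)
```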

This lack of clarity occurs again later on:

- in lines 218-220: "in the case of multi-modal approaches, we need a fusion module as the algorithm takes multiple deep features extracted from VGG16 (fc1), VGG16 (fc2), and Xception Nets". Again, it remains unclear to which diagnostically relevant image criteria the fusion algorithm is applied, and therefore the presumed advantage of the proposed multi-modal fusion approach in comparison with what the authors call "hand crafted" feature extraction cannot be understood. Without a clear rationale in the text describing the logical choices underlying the methodology adopted here, the figures provided for illustration are not useful.
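[Editor's note] For concreteness, the two fusion operators named in the review (1-D max pooling and average cross pooling) can be sketched as element-wise reductions over equal-length feature vectors. The snippet below is an assumption about how such a fusion module might combine the VGG16(fc1), VGG16(fc2), and Xception features, not the paper's exact definition; the random vectors are placeholders for real deep features.

```python
import numpy as np

# Placeholder feature vectors standing in for real deep features.
f_fc1 = np.random.rand(4096)   # VGG16 fc1
f_fc2 = np.random.rand(4096)   # VGG16 fc2
f_xcp = np.random.rand(2048)   # Xception global-average-pooled features

# Element-wise reductions across the two equal-length VGG16 vectors.
stacked    = np.stack([f_fc1, f_fc2])     # shape (2, 4096)
max_pooled = stacked.max(axis=0)          # 1-D max pooling across modalities
avg_pooled = stacked.mean(axis=0)         # average cross pooling

# Blend with the Xception features by concatenation (6144-d fused vectors).
blended_max = np.concatenate([max_pooled, f_xcp])
blended_avg = np.concatenate([avg_pooled, f_xcp])
```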

The significance of the results from this study, i.e. the advantage of the method proposed in comparison with others, cannot be assessed without a clear justification of the methodology and the aims pursued by the choices made as stated.

Minor

Essential revision of the English needs to be performed on the whole manuscript text; just to give a few examples from the abstract only:

  • "DR is one of the leading sources for visual impairment and blindness across the world" needs to be rephrased into "...is one of the major causes of visual impairment..."
  • "requires" instead of "require experienced clinicians"
  • "...each ConVnet extracts different features.." instead of "extract"

There are many more errors of this type, in the abstract and in the text later on; I recommend seeking advice from a native English-speaking colleague to fix this problem.

Author Response

We would like to thank the Editor and reviewers for their helpful comments and suggestions. We have updated the manuscript to address all the comments supplied by the reviewers. We feel that by incorporating these suggestions, the quality of the paper has improved substantially. The material below addresses each issue raised by the editors and the reviewers.

Comment 1: The introduction refers to the current state of the art and is acceptably well written. The goal of this paper is to present a simple and robust model for classifying the severity of DR on the basis of critical features in colored fundus images, with the objective of extracting the most descriptive and discriminative features. To accomplish this goal, the authors use deep features extracted by several pre-trained CNN architectures, then use them to train a Deep Neural Network with automatic selective dropout in the early processing layers.

Response: We would like to thank the reviewer for the careful and thorough reading of this manuscript and for the thoughtful comments and constructive suggestions, which have helped to improve its quality. We also thank the reviewer for the positive feedback on our work.

Comment 2: The methodology part requires substantial revisions to clarify the rationale underlying the choices made by the authors; for example: - lines 176-177: "features extracted from pre-trained deep models are proved to be better than the hand crafted features as these models can represent the images efficiently" - this sentence is not clear, and such lack of clarity persists in the text later on. Please define with precision what you understand by "pre-trained deep models", "hand crafted features", and "represent the images efficiently".

Response: To address this issue raised by the reviewer, a detailed explanation of hand-crafted features is provided in Section 2 (lines 108-112).

Comment 3: The rationale for a presumed advantage of using combinations of different features from different pre-trained CN or CNN architectures is acceptable, but not well enough explained later on in the relevant text parts. For example:

- in lines 192-201: "We extracted the most significant features of color fundus images using various pre-trained models (VGG16, NASNet, Xception and InceptionResNetV2). We reshaped the images according to the fixed input dimensions accepted by these models. For each fundus image we extract deep features, as follows: using VGG16, we collect features from the first fully connected layer (fc1) and get a feature vector of 4096 dimensions. Using the same VGG16, we collect features from the second fully connected layer (fc2) and get a feature vector of 4096 dimensions. NASNet has a huge number of layers compared with the other models, and the length of the feature vector is 4032 after applying a global average pooling operation at the end. The global average pooling layer of the Xception architecture is used to extract 2048 features. Using the InceptionResNetV2 architecture we obtain 1536 features from global average pooling." This whole text relating to the methodology of the approach proposed by the authors does not provide a clear rationale for the choice of these network architectures; it simply lists them. This is not enough. The authors need to explain why the choices made here are pertinent for extracting different, clinically complementary and diagnostically relevant image features. The description as given here is void of either clinically driven or theory-driven insights and appears free of any particular hypothesis. The reader gets the impression that the authors went "fishing in the dark" on the basis of an apparently rational idea (combine several deep feature models instead of using a single one and see what it gives) and got lucky. The methodological choices made need to be justified on the basis of a clear rationale and explicit criteria. What is clinically relevant in the images, and how do the CNN combinations used here allow extracting the clinically relevant image information?

Response: To address this issue, we have justified the selection of VGG16 and Xception for deep feature extraction. Sufficient details are provided in Section 3.3 (lines 205 to 215), and Figure 5 is included to support the claims.

Comment 4: This lack of clarity occurs again later on:

- in lines 218-220: "in the case of multi-modal approaches, we need a fusion module as the algorithm takes multiple deep features extracted from VGG16 (fc1), VGG16 (fc2), and Xception Nets". Again, it remains unclear to which diagnostically relevant image criteria the fusion algorithm is applied, and therefore the presumed advantage of the proposed multi-modal fusion approach in comparison with what the authors call "hand crafted" feature extraction cannot be understood. Without a clear rationale in the text describing the logical choices underlying the methodology adopted here, the figures provided for illustration are not useful.

Response: To address this issue raised by the reviewer, a detailed explanation of hand-crafted features is provided in Section 2 (lines 108-112). The ambiguity in lines 218-220 is addressed in Section 3 (lines 195-212).

Comment 5: The significance of the results from this study, i.e. the advantage of the method proposed in comparison with others, cannot be assessed without a clear justification of the methodology and the aims pursued by the choices made as stated.

Response: Justification of the model choices is explained in Section 3.3 (lines 205-212).

Comment 6: Essential revision of the English needs to be performed on the whole manuscript text; just to give a few examples from the abstract only:

"DR is one of the leading sources for visual impairment and blindness across the world" needs to be rephrased into "...is one of the major causes of visual impairment..."

"requires" instead of "require experienced clinicians"

"...each ConVnet extracts different features.." instead of "extract"

There are many more errors of this type, in the abstract and in the text later on; I recommend seeking advice from a native English-speaking colleague to fix this problem.

Response: 

As suggested by the reviewer, we have edited the entire manuscript, taking care to improve the linguistic quality of the paper along with the technical details.

All the suggestions and inputs given by the reviewer have been considered, and the manuscript has been updated accordingly. We thank the reviewer for the detailed and thoughtful comments, which have helped to improve the manuscript.

 

Reviewer 3 Report

The present manuscript deals with the problem of Diabetic Retinopathy severity prediction, addressed with a combination of features extracted by pre-trained deep learning models and traditional machine learning algorithms fed with these features. Experimental results carried out on a challenge dataset show the effectiveness of the proposed approach. Unfortunately, the authors provide only a marginal, incremental contribution to the current state of the art on the topic, which is already populated by a large corpus of studies using similar approaches. There is no substantial novelty. In addition, the manuscript should have been more carefully proofread before submission, as it contains several errors and typos.

Author Response

We would like to thank the Editor and reviewers for their helpful comments and suggestions. We have updated the manuscript to address all the comments supplied by the reviewers. We feel that by incorporating these suggestions, the quality of the paper has improved substantially. The material below addresses each issue raised by the editors and the reviewers.

Comment: The present manuscript deals with the problem of Diabetic Retinopathy severity prediction, addressed with a combination of features extracted by pre-trained deep learning models and traditional machine learning algorithms fed with these features. Experimental results carried out on a challenge dataset show the effectiveness of the proposed approach. Unfortunately, the authors provide only a marginal, incremental contribution to the current state of the art on the topic, which is already populated by a large corpus of studies using similar approaches. There is no substantial novelty. In addition, the manuscript should have been more carefully proofread before submission, as it contains several errors and typos.

Response: 

We thank the reviewer for acknowledging the effectiveness of the proposed approach.

To address the comments given by the reviewer, we have edited the entire manuscript, taking care to improve the linguistic quality of the paper and to add sufficient technical details.

Round 2

Reviewer 2 Report

The authors have resubmitted a carefully revised version of their manuscript and addressed all my previous concerns; the paper can now be recommended for publication. Please check again for remaining spelling mistakes, for example (in the legend of Figure 1) "affected" instead of "effected"; a few more small mistakes of a similar kind are detectable subsequently in the text. The publisher's English editing team could perhaps deal with this.

Reviewer 3 Report

The manuscript has substantially improved, and the contribution it provides to the existing literature can now be better appreciated. As future work, you may discuss the approach followed in a recent work I read (https://doi.org/10.1016/j.patrec.2019.08.018), in which the authors fuse the features coming from ConvNets by using an ensemble strategy.
