Article
Peer-Review Record

Effect of Missing Data Types and Imputation Methods on Supervised Classifiers: An Evaluation Study

Big Data Cogn. Comput. 2023, 7(1), 55; https://doi.org/10.3390/bdcc7010055
by Menna Ibrahim Gabr 1,*,†, Yehia Mostafa Helmy 1 and Doaa Saad Elzanfaly 2,3,†
Submission received: 13 February 2023 / Revised: 11 March 2023 / Accepted: 15 March 2023 / Published: 22 March 2023
(This article belongs to the Special Issue Machine Learning in Data Mining for Knowledge Discovery)

Round 1

Reviewer 1 Report

 

This paper presents a study of the effect of missing values on binary and multiclass classification, based on experimental analysis. The analysis focuses on six datasets.

No simulation study or critical analysis of the imputation methods is presented, and the comparative advantages and disadvantages of the methods are not discussed. The authors state in lines 213-214 that "In Groups B and C, each experiment is executed three times, as each one gives different performance, and the average performance is reported." Why three? Why not 500 or 1,000 repetitions, which would capture the variability and approximate a simulation study, albeit one always based on a pre-established dataset?
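A minimal sketch of the repeated-run evaluation the reviewer has in mind, assuming a hypothetical run_experiment() that performs one full cycle of missing-value injection, imputation, training, and scoring; repeating it hundreds of times allows the mean and spread of performance to be reported rather than an average of only three runs.

```python
import numpy as np

def run_experiment(seed):
    """Hypothetical: inject missing values, impute, train a classifier,
    and return a performance score for one run (details omitted)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(0.70, 0.90)  # placeholder score for illustration

n_runs = 1000  # e.g., 500 or 1000 repetitions instead of 3
scores = np.array([run_experiment(seed) for seed in range(n_runs)])
print(f"mean = {scores.mean():.3f}, std = {scores.std(ddof=1):.3f}")
```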

Several technical aspects are not explained in the article, such as the five diverse classifiers; they are only mentioned, without any critical analysis.

The various graphs presented are not informative for the reader, as no information or results can be extracted from them (Figures 2-5).

Although the study is of interest to researchers wishing to apply these simple approaches, the paper does not present any methodological innovation.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 2 Report

Please see the review in the separate file

Comments for author File: Comments.pdf

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 3 Report

The paper deals with an important task. The authors assessed the effect of incomplete datasets on the performance of five classification models.

In general, the paper is good, but I have several suggestions:

1. The Introduction section should be extended with the motivation for this work.

2. It would be good to state the main contributions of this paper in the Introduction section.

3. The Related Works section should be extended with similar works. It would also be good to take into account the research in DOI 10.1016/j.jestch.2020.10.005.

4. The authors should justify their choice of missing-value imputation techniques in this paper.

5. Table 1 should be extended with the imbalance ratio of the datasets used, as it affects the accuracy of the ML methods (a way to compute this ratio is sketched after this list).

6. The quality of Figure 1 should be improved.

7. Figures 2-5 should be larger; at their current size, readers cannot see anything in them.
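A possible way to compute the imbalance ratio mentioned in point 5 (majority-class count divided by minority-class count); the toy target below is purely illustrative, not one of the paper's datasets.

```python
import pandas as pd

def imbalance_ratio(labels):
    """Ratio of the majority-class count to the minority-class count."""
    counts = pd.Series(labels).value_counts()
    return counts.max() / counts.min()

# Example with a toy binary target (90 negatives, 10 positives -> ratio 9.0)
toy_target = [0] * 90 + [1] * 10
print(imbalance_ratio(toy_target))
```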

 

 

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 4 Report

This paper seeks to address the effect of missing values in datasets on different classification models and evaluates their performance using various metrics such as MCC, F1, and accuracy. A quantitative comparison between several classical machine learning models is presented. The authors draw a few conclusions regarding the imputation approaches used in training the models and the appropriate metrics to use in the case of unbalanced datasets.
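As a quick illustration of why accuracy alone can be misleading on unbalanced datasets while MCC and F1 are more informative, a small sketch with toy labels (not the paper's data):

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Toy imbalanced ground truth: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

print("accuracy:", accuracy_score(y_true, y_pred))        # 0.95, looks good
print("F1:", f1_score(y_true, y_pred, zero_division=0))   # 0.0
print("MCC:", matthews_corrcoef(y_true, y_pred))          # 0.0
```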

There are a few comments that the authors may need to address:

1. A major factor affecting the performance of machine learning models is model capacity. It is well known that deep learning models such as convolutional neural networks or Transformers have a much larger capacity than the classical ML models used in the paper, albeit at the cost of higher computational complexity. It would be of much interest if the authors could carry out numerical experiments using a CNN model such as VGG-16, which should be less sensitive to the data pattern or imputation methods, especially when using certain loss functions such as focal loss.

2. The datasets used in the experiments are very small, which leaves much room for improvement. The authors are advised to use larger datasets, for instance those available from Kaggle competitions, to verify their findings.

3. The authors did not state what data augmentation (DA) methods they used in the experiments. It is known that DA has a significant impact on performance, e.g., by introducing small perturbations to the training data to make datasets more balanced.

4. The parameters of ML models can be optimized using random search or grid search methods. Such fine-tuning has a visible effect on models with a larger number of parameters, e.g., RF models. Did the authors use such optimization methods in their experiments?
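A minimal sketch of the random-search tuning referred to in point 4, using scikit-learn's RandomizedSearchCV on a random forest; the parameter ranges and synthetic data are assumptions for illustration, not the paper's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic toy data standing in for one of the evaluated datasets.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_distributions = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=20,          # number of sampled hyperparameter configurations
    scoring="f1",       # or "matthews_corrcoef" for imbalanced data
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```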

 

 

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The text has been improved according to the comments.

Reviewer 4 Report

The authors have responded adequately to my comments. 
