Article
Peer-Review Record

Development and Internal Validation of an Interpretable Machine Learning Model to Predict Readmissions in a United States Healthcare System

Informatics 2023, 10(2), 33; https://doi.org/10.3390/informatics10020033
by Amanda L. Luo 1, Akshay Ravi 2, Simone Arvisais-Anhalt 3, Anoop N. Muniyappa 2, Xinran Liu 2,*,† and Shan Wang 1,4,†
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 30 January 2023 / Revised: 10 March 2023 / Accepted: 17 March 2023 / Published: 27 March 2023
(This article belongs to the Special Issue Feature Papers in Medical and Clinical Informatics)

Round 1

Reviewer 1 Report

Few questions:

1) How was class imbalance handled?

2) How is "days since last admission" calculated? Care is needed to avoid data leakage during training.

3) The majority of discharges happen during midday (11 a.m.-2 p.m.), so how is bias corrected for the discharge time of day?

4) In iteration 3, the validation set falls within the first phase of COVID-19 and the test set within the second phase. Since the training set contained no COVID-19 feature or diagnosis, how was this accounted for during model validation?

5) Were encounters with a discharge disposition of AMA eliminated from the dataset?

 

Author Response

Thank you for taking the time to review our paper. We appreciate the time and thought you contributed to improving it, and we hope our edits and responses address your questions and concerns. Please let us know if you have any further questions or concerns, and we would be happy to address them in the next round of review.

1) How was class imbalance handled?

Response: The main method we used to address class imbalance was the scale_pos_weight parameter, which is native to the XGBoost model and balances the weights of the minority and majority classes. We set this parameter to the recommended value: the ratio of our majority class to our minority class. We did not employ additional methods, as our results on both the validation and test sets were good.
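For illustration, here is a minimal sketch (not the authors' code; the synthetic data and variable names are assumptions) of setting scale_pos_weight to the majority-to-minority ratio described above:

```python
# A sketch, not the authors' code: synthetic data stands in for the cohort.
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, weights=[0.85, 0.15], random_state=0)

n_negative = int((y == 0).sum())  # majority class (no 30-day readmission)
n_positive = int((y == 1).sum())  # minority class (30-day readmission)

model = xgb.XGBClassifier(
    objective="binary:logistic",
    scale_pos_weight=n_negative / n_positive,  # recommended majority/minority ratio
    eval_metric="aucpr",
)
model.fit(X, y)
```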

2) How is "days since last admission" calculated? Care is needed to avoid data leakage during training.

Response: We completely agree about being careful about data leakage. During training, we made a concerted effort to ensure that all data used for training would be available at the time the model would be applied in practice. We reviewed our code for this feature and can confirm that it calculates the difference in days between the patient's admission for the current encounter and that same patient's discharge from their most recent prior hospitalization.
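A minimal sketch of this calculation, with assumed column names rather than the authors' actual schema, using only the previous encounter's discharge time so no post-admission information leaks in:

```python
# Assumed column names; the previous discharge is the only prior-encounter
# information used, so the feature is available at admission time.
import pandas as pd

encounters = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "admit_time": pd.to_datetime(["2020-01-01", "2020-03-15", "2020-02-01"]),
    "discharge_time": pd.to_datetime(["2020-01-05", "2020-03-20", "2020-02-03"]),
}).sort_values(["patient_id", "admit_time"])

# Previous discharge for the same patient; NaT (missing) for first admissions.
prev_discharge = encounters.groupby("patient_id")["discharge_time"].shift(1)
encounters["days_since_last_admission"] = (encounters["admit_time"] - prev_discharge).dt.days
```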

3) The majority of discharges happen during midday (11 a.m.-2 p.m.), so how is bias corrected for the discharge time of day?

Response: Thank you for this question. We are not certain we understand it correctly, but will do our best to answer it; please correct us if we have misinterpreted it. We infer that by "bias" you mean that times of day with few samples might be more sensitive to random noise. One option would be to create a categorical feature that captures the timeframe of discharge (e.g., morning, afternoon, night). However, we are not sure this is necessary. We are less interested in bias among and between the features themselves and more interested in whether those combinations of features lead to good, consistent predictions. Our algorithm performed well on both the validation and test sets, which suggests that even if there were bias related to the discharge time-of-day feature (in terms of data scarcity during certain periods), it did not affect the model's ability to make good predictions.
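As a hypothetical illustration of the categorical time-of-day feature mentioned above (bucket boundaries and labels are assumptions, not part of the study):

```python
# Hypothetical bucketing of discharge time of day into categorical bins.
import pandas as pd

discharge_times = pd.Series(pd.to_datetime([
    "2020-01-05 11:30", "2020-03-20 02:15", "2020-02-03 18:45",
]))

time_of_day = pd.cut(
    discharge_times.dt.hour,
    bins=[0, 6, 12, 18, 24],                              # assumed boundaries
    labels=["night", "morning", "afternoon", "evening"],  # assumed labels
    right=False,  # e.g. hour 11 -> "morning", hour 18 -> "evening"
)
```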

4) In iteration 3, the validation set falls within the first phase of COVID-19 and the test set within the second phase. Since the training set contained no COVID-19 feature or diagnosis, how was this accounted for during model validation?

Response: The dataset we used was de-identified. As part of the de-identification process, dates related to a specific patient were randomly shifted 1-365 days into the past. As a result, the date ranges in our study do not line up with the expected phases and waves of COVID-19. We have added this in Section 2.1, as well as in the limitations section. For this reason, COVID-19-related diagnoses can appear in the training, validation, or test sets as values of the following categorical features: primarychiefcomplaintname, primaryeddiagnosisname, principalproblemdiagnosisname. If a specific COVID-19-related diagnosis is not present in the training set, it is transformed to a bucket value of "other" in the validation and test sets if present there. We acknowledge that the distribution of COVID-19 diagnoses across the training, validation, and test sets will therefore not be representative, but unfortunately this is not something we can correct for given the nature of the de-identified dataset we used. Overall, though, diagnoses were not among the most important features in our model, so we anticipate the impact to be low.
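A minimal sketch, with a hypothetical helper and toy values, of mapping diagnosis values that are absent from the training split to an "other" bucket, as described above:

```python
# Hypothetical helper: values unseen in the training split become "other".
import pandas as pd

def bucket_unseen(train_col: pd.Series, other_col: pd.Series) -> pd.Series:
    known = set(train_col.dropna().unique())
    return other_col.where(other_col.isin(known), "other")

train_dx = pd.Series(["pneumonia", "heart failure", "copd"])
test_dx = pd.Series(["pneumonia", "covid-19"])     # covid-19 unseen in training
print(bucket_unseen(train_dx, test_dx).tolist())   # ['pneumonia', 'other']
```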

5) Were encounters with a discharge disposition of AMA eliminated from the dataset?

Response: AMAs were not removed in this initial study. We acknowledge that AMAs are excluded from CMS's definition of readmissions and that ideally they should be excluded. However, there were only 1365 instances of AMA out of ~150,000 encounters (<1%) in our dataset. In our work to operationalize this prototype model in clinical practice, we have excluded AMAs, and we plan to publish an updated paper in the future. We have added this to the limitations section of the paper.

Reviewer 2 Report

The paper proposed a machine learning model to predict all-cause 30-day adult readmissions in a United States healthcare system using data available within 24 hours of discharge. The paper compared four different algorithms and selected XGBoost as the final model, based on predictive performance and training time. Moreover, feature importances were extracted using SHAP.

The contribution is relevant to the journal as it presents an experimental paper regarding the application of existing machine learning models to a current challenge in the field of healthcare. 

The contribution is an incremental improvement over existing works in the literature, and the main contribution is the development of a model suited to a cohort of patients from the United States healthcare system.

The paper is overall well written, although the experiments need to be improved and better described.

1. Feature selection (and missing-value imputation) was performed before splitting the train and test sets. This is not correct; the test set cannot be involved in any part of the model-building process. Feature selection should be embedded in the 3-fold cross-validation, per fold, and should be based only on the training set.

2. The feature selection step is not clearly explained. Which baseline method was employed to perform feature selection? If it is gradient boosting, it is natural that the performance of that model is better than the others, as feature selection was tailored to it.

3. The use of word embeddings is not clear: the authors state that they do not provide better performance, yet they are included in the final model. What is the motivation for including them in the final model? Some discussion needs to be added, also regarding the feature importance of these embeddings in the final model.

4. Given the high number of features, I would expect the use of lasso/elastic net instead of plain logistic regression. This would be a fairer comparison.

5. Which loss function was minimized for XGBoost? What is the difference between the gradient boosting and XGBoost models mentioned in the experiments? Mention the parameters for each method, or add a reference to the library if default parameters were used.

Author Response

Thank you for taking the time to review our paper. We appreciate the time and thought you contributed to improving it, and we hope our edits and responses address your questions and concerns. Please let us know if you have any further questions or concerns, and we would be happy to address them in the next round of review.

1. Feature selection (and missing-value imputation) was performed before splitting the train and test sets. This is not correct; the test set cannot be involved in any part of the model-building process. Feature selection should be embedded in the 3-fold cross-validation, per fold, and should be based only on the training set.

Response: Thank you for bringing this up. We completely agree with this comment and can confirm that this is the methodology we followed (the test set was not used for model building or optimization). We have clarified this in Section 2.3 of the paper to make the process clearer.
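For illustration, a generic sketch (not the authors' pipeline; it uses synthetic data and a simple univariate selector) of keeping feature selection inside each cross-validation fold so that held-out data never influences it:

```python
# Generic illustration: the selector and model are fit on the training fold only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=2000, n_features=50, weights=[0.85, 0.15], random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),   # fit within each training fold
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=3, shuffle=True, random_state=0).split(X, y):
    pipe.fit(X[train_idx], y[train_idx])              # selection + model: training fold only
    probs = pipe.predict_proba(X[val_idx])[:, 1]      # scoring: held-out fold
    scores.append(average_precision_score(y[val_idx], probs))
print(np.mean(scores))  # mean AUC-PR across folds
```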

2. The feature selection step is not clearly explained. Which baseline method was employed to perform feature selection? If it is gradient boosting, it is natural that the performance of that model is better than the others, as feature selection was tailored to it.

Response: Feature selection was done through the drop-column feature importance method detailed at the end of Section 2.3, with AUC-PR as our evaluation metric. We have edited this section to make it clearer that this paragraph addresses feature selection.
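A hedged sketch of drop-column feature importance scored with AUC-PR; the model settings, synthetic data, and feature names are illustrative assumptions, not the configuration used in the paper:

```python
# Illustrative drop-column importance: retrain without each feature and
# measure the drop in AUC-PR on a held-out split.
import pandas as pd
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(10)])
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

def val_aucpr(train_X, val_X):
    model = xgb.XGBClassifier(objective="binary:logistic").fit(train_X, y_tr)
    return average_precision_score(y_val, model.predict_proba(val_X)[:, 1])

baseline = val_aucpr(X_tr, X_val)
importance = {
    col: baseline - val_aucpr(X_tr.drop(columns=col), X_val.drop(columns=col))
    for col in X.columns  # larger decrease in AUC-PR = more important feature
}
```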

3. The use of word embeddings is not clear: the authors state that they do not provide better performance, yet they are included in the final model. What is the motivation for including them in the final model? Some discussion needs to be added, also regarding the feature importance of these embeddings in the final model.

Response: Thank you for pointing this out; how we wrote this was not sufficiently clear. There are five diagnosis-related features. For each of these features we tried two different methodologies: one using word embeddings, and the other using the most common diagnosis names within each feature as categories. When we dropped the word-embedding columns but kept the categorical diagnosis columns, we found a negligible change in performance, and so decided to drop the word-embedding columns entirely from the final model. Diagnosis-related features are therefore included in the final model as categorical features, not as word embeddings. We have clarified this in Section 2.3, as well as in the results and discussion sections.
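A minimal sketch, with a hypothetical helper and toy diagnosis names, of encoding a diagnosis feature by keeping its most common values as categories and collapsing the rest:

```python
# Hypothetical helper: keep the n most frequent diagnosis names as categories.
import pandas as pd

def top_n_categories(col: pd.Series, n: int = 20) -> pd.Series:
    top = col.value_counts().nlargest(n).index
    return col.where(col.isin(top), "other").astype("category")

dx = pd.Series(["chf", "chf", "copd", "copd", "pneumonia", "rare diagnosis"])
print(top_n_categories(dx, n=2).tolist())
# ['chf', 'chf', 'copd', 'copd', 'other', 'other']
```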

4. Given the high number of features, I would expect the use of lasso/elastic net instead of plain logistic regression. This would be a fairer comparison.

Response: Thank you for this comment. We agree with this in principle and have added it as a limitation in our limitations section. Unfortunately, the author who wrote the original code is no longer with our organization and no longer has access to the data and code, so it would take considerable time to add this to the paper. However, even with lasso/elastic net, it is unlikely that the performance would match that of newer tree-based models or significantly change our results.
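For reference, a minimal sketch (not part of the paper's experiments) of the elastic-net logistic regression baseline the reviewer suggests, using scikit-learn's saga solver:

```python
# Sketch of an elastic-net-penalized logistic regression baseline.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=100, weights=[0.85, 0.15], random_state=0)

elasticnet_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        penalty="elasticnet",
        solver="saga",              # the sklearn solver that supports elastic net
        l1_ratio=0.5,               # blend of L1 and L2 regularization
        C=1.0,
        class_weight="balanced",    # handles the class imbalance
        max_iter=5000,
    ),
)
elasticnet_lr.fit(X, y)
```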

5. Which loss function was minimized for XGBoost? What is the difference between the gradient boosting and XGBoost models mentioned in the experiments? Mention the parameters for each method, or add a reference to the library if default parameters were used.

Response: We used the standard loss function in XGBoost for binary classification, the binary cross-entropy (log loss) function. XGBoost is an optimized implementation of gradient boosting that speeds up the search for split points when creating a new branch, enabling much quicker execution times. Furthermore, XGBoost incorporates regularization, which improves model performance and generalization. Compared to the gradient boosting model, XGBoost also exposes additional parameters, such as "scale_pos_weight". We have added the parameters we used in our models to Appendix B.
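An illustrative instantiation of the settings discussed here; the parameter values are placeholders, not the values reported in Appendix B:

```python
# Placeholder parameter values for illustration only (see Appendix B for the
# values actually used in the study).
import xgboost as xgb

model = xgb.XGBClassifier(
    objective="binary:logistic",  # binary cross-entropy (log loss)
    reg_alpha=0.1,                # L1 regularization on leaf weights
    reg_lambda=1.0,               # L2 regularization on leaf weights
    scale_pos_weight=9.0,         # class-imbalance weighting (see Reviewer 1, Q1)
    n_estimators=300,
    learning_rate=0.1,
)
```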

Reviewer 3 Report

Interesting work, I would like to thank the authors for this contribution. The methodology is largely clear, and the manuscript is well written. It is also reassuring to observe that the authors have acknowledged potential limitations. However, I would like to offer some suggestions for improvement in the next version, please.

 

(1)

Additional clarification is necessary regarding the motivation for this study. I am curious if the models developed in this research could be feasibly deployed within the US healthcare system, and if so, it would be advantageous to highlight this point. This would further emphasize the significance of the study's outcomes.

(2)

With regard to the related work, it would be beneficial to reference studies that have made efforts to combine structured data with text to predict hospital admissions or readmissions. One such example would be: https://doi.org/10.1109/BigData50022.2020.9378073

Including these references should contribute to a more comprehensive understanding of the existing literature related to the methodology applied by the study.

 

(3)

As for future work, it would be worthwhile to explore more advanced models for generating word embeddings from the text notes. For instance, there are several BERT models available that could be utilized for this purpose. BERT models have gained popularity in NLP tasks due to their ability to capture contextual relationships between words in a sentence. Incorporating these more sophisticated models could potentially improve the accuracy and effectiveness of the study's results.

Author Response

Thank you for taking the time to review our paper. We appreciate the time and thought you contributed to improving it, and we hope our edits and responses address your questions and concerns. Please let us know if you have any further questions or concerns, and we would be happy to address them in the next round of review.

(1) Additional clarification is necessary regarding the motivation for this study. I am curious if the models developed in this research could be feasibly deployed within the US healthcare system, and if so, it would be advantageous to highlight this point. This would further emphasize the significance of the study's outcomes.

Response: Thank you for your kind support. Yes, the goal of this initial work was to build a proof-of-concept model using de-identified data. We have already obtained approval and are in the process of recreating this model using live EHR data, with the goal of implementing it to help determine which patients should receive which types of post-discharge outreach to decrease 30-day hospital readmissions. We have added this information to the background and discussion sections of the paper.

(2) With regard to the related work, it would be beneficial to reference studies that have made efforts to combine structured data with text to predict hospital admissions or readmissions. One such example would be: https://doi.org/10.1109/BigData50022.2020.9378073

Including these references should contribute to a more comprehensive understanding of the existing literature related to the methodology applied by the study.

Response: Thank you for sharing this paper; we have added it as a reference.

(3) As for future work, it would be worthwhile to explore more advanced models for generating word embeddings from the text notes. For instance, there are several BERT models available that could be utilized for this purpose. BERT models have gained popularity in NLP tasks due to their ability to capture contextual relationships between words in a sentence. Incorporating these more sophisticated models could potentially improve the accuracy and effectiveness of the study's results.

Response: Thank you for suggesting this; it is a good idea. We have added it to the discussion section as possible future work.
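For future reference, a minimal sketch of extracting note embeddings with a clinical BERT model via the Hugging Face transformers library; the model name and note text are assumptions for illustration:

```python
# Assumed model choice for illustration; any BERT-family encoder would work.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "emilyalsentzer/Bio_ClinicalBERT"   # assumption: a clinical-domain BERT
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

notes = ["Patient discharged home in stable condition with oral antibiotics."]
inputs = tokenizer(notes, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
note_embeddings = outputs.last_hidden_state[:, 0, :]  # [CLS] vector per note
```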
