Article
Peer-Review Record

A Machine Learning Approach to Predict Watershed Health Indices for Sediments and Nutrients at Ungauged Basins

Water 2023, 15(3), 586; https://doi.org/10.3390/w15030586
Reviewer 1: Tapas Karmaker
Reviewer 2: Anonymous
Received: 21 December 2022 / Revised: 25 January 2023 / Accepted: 28 January 2023 / Published: 2 February 2023
(This article belongs to the Special Issue Water Quality Assessment and Modelling)

Round 1

Reviewer 1 Report

The authors have presented an approach to determine watershed health (WH) using ML techniques, specifically for ungauged watersheds. The manuscript is well organised and can be of interest to the scientific community. References are appropriate. There are still the same errors in several places from page 9 onwards. My specific comments are as follows: the authors mention that the model works quite well with sufficient training data. So, how can this approach be satisfactorily used for ungauged watersheds for WH? The authors should also highlight the novelty in the approach.

 

Author Response

Response to Comments from Reviewer 1

The authors mention that the model works quite well with sufficient training data. So how can this approach be satisfactorily used for the ungauged watershed for WH? The authors should highlight the novelty in the approach.

We thank the reviewer for asking for this clarification and for encouraging us to highlight the novelty of this work.

Watersheds where water quality and streamflow data are not collected at their outlets are referred to as ungauged watersheds.

The machine learning models used in the study are first developed and trained using data available at gauging and sampling stations. The trained models are then used to predict watershed health with respect to suspended sediments and nutrients at ungauged outlets. The inputs and attributes used for prediction are similar to those used to construct the models, such as precipitation, temperature, land use, soil properties, and watershed characteristics, all of which are available over these ungauged watersheds. That said, our study found that the machine learning models predicted watershed health more accurately when sufficient training data were available. The accuracy of the models was verified at stations that were part of the test set.
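The train-at-gauged, predict-at-ungauged workflow described above can be sketched as follows. This is only an illustrative outline, not the code used in the study: the data are synthetic, the four attribute columns and the array names (`X_gauged`, `wh_gauged`, `X_ungauged`) are hypothetical, and the target is simply a made-up quantity kept in [0, 1] to stand in for a WH index.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical attribute columns, e.g. precipitation, temperature,
# land-use fraction and a soil index, all rescaled to [0, 1].
n_gauged, n_ungauged, n_attrs = 60, 5, 4
X_gauged = rng.random((n_gauged, n_attrs))

# Synthetic stand-in for the reference WH index at gauged outlets.
wh_gauged = np.clip(
    0.6 * X_gauged[:, 0] + 0.4 * X_gauged[:, 2] + rng.normal(0, 0.02, n_gauged),
    0.0,
    1.0,
)

# Step 1: train on watersheds where observations exist.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_gauged, wh_gauged)

# Step 2: predict at ungauged outlets, where only the attributes are known.
X_ungauged = rng.random((n_ungauged, n_attrs))
wh_ungauged = model.predict(X_ungauged)
print(wh_ungauged)
```

The point of the sketch is that prediction needs only the attribute matrix for the ungauged outlets; no streamflow or water quality observations are required there.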

The following lines in the Conclusion section of the original manuscript highlight this finding:

“Individual ML models perform well when there were enough data during the training phase (e.g. SSC) over the calibration watersheds. However, when data were limited (e.g., Nitrogen and Orthophosphate), individual model performance dropped, and ensemble averaging techniques helped to boost the performance.”

See lines 607-609 in the Summary and Conclusions section of the manuscript.
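The ensemble averaging mentioned in the quoted conclusion can be illustrated with a minimal sketch. It assumes an unweighted mean across models at each station, which is an assumption made here for illustration; the weighting scheme is not stated in this response. The model names and prediction values are made up.

```python
from statistics import mean


def ensemble_average(predictions_by_model):
    """Average the predictions of several models, station by station.

    predictions_by_model maps a model name to a list of predictions,
    one per station, with stations in the same order for every model.
    """
    per_station = zip(*predictions_by_model.values())
    return [mean(values) for values in per_station]


# Hypothetical WH predictions from three models at three stations.
predictions = {
    "random_forest": [0.60, 0.40, 0.80],
    "adaboost": [0.50, 0.50, 0.70],
    "gradient_boosting": [0.70, 0.30, 0.90],
}

averaged = ensemble_average(predictions)
print(averaged)
```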

The novelty of this study is that it is the first to establish the utility of machine learning models for estimating watershed health in ungauged basins using the WH index (Mallya et al., 2018). We chose three major Midwest river basins in the US to demonstrate this concept.


Reviewer 2 Report

The manuscript is aligned with the journal's aims and scope. I read the paper with interest and think upon revision, the manuscript has the potential for good readership and impact. However, several key issues require to be addressed. Following the below revisions the manuscript can be reconsidered for publication. 

Abstract: you need to add some quantitative figures with regard to your results. The key contribution and significance of the work can be better highlighted in the abstract. 

keywords: you need additional keywords which describe the exact ML techniques used for the study. 

Introduction: follows a logical structure, but the literature review is not complete and some new references can enhance the discussions. L35-38, when describing pollution discharge to environments e.g. nearshore and coastal area (10.9753/icce.v34.waves.49). When describing agricultural activities and fertilizers, not only streams will be polluted but also the intensified level of N in the streams can pose risks to public health and ecosystem function, including carcinogenic risks to human lives (10.1016/j.jclepro.2022.132432; 10.1016/j.jtice.2021.01.030), which can be included in the discussions.  
L54-63, when discussing ML-based models, recent literature should be included, for example when discussing the following point in L54-63 you should include recent literature: sediment transport modeling (10.3390/hydrology9020036), ecological modeling (10.3390/hydrology10010016), water quality assessment (10.1038/s41598-022-08417-4), flood prediction (10.1016/j.watres.2022.119100), and ML-based satellite imagery analysis (doi.org/10.1155/2022/8451812).
The study objectives need to be strengthened; you need to be clear about the specific ML techniques used and justify their appropriateness (i.e. OBJ1).
L83-85 can be deleted as the subheadings are detailed enough and the text is not contributing to the manuscript.

2. Study area and datasets used: 'used' should be deleted in the subheading title. 
L88 -  (see Error! Reference source not found.) needs correction.
Provide links to the dataset used for the case study area. 

3. Methodology: some of the Machine Learning techniques are not well-described, e.g. 3.2.4 and 3.2.5. I appreciate that you don't want to end up with a very long manuscript, but you may give reference to key literature related to the model for a better understanding of the model. 
Where possible, try to use more up-to-date references rather than e.g. Friedman et al., 2001.
Your ML model section lacks detail about execution and how these models were adopted. You also need to talk about your methods for uncertainty quantification, how the models were evaluated, and with what statistical indexes. 

Results: lots of references are in the form of '(see Error! Reference source not found.)'
Statistical analyses of the models are missing. You need to provide tables showing the indexes used and a comparison of statistics between different models. 
In terms of structure, the results section can benefit from subheadings, so the author can see the results of different models in separate sub-sections. This makes it easier to comprehend and evaluate the study.  

Conclusions: should highlight study significance and new knowledge/ contributions. You also need to elaborate on the uncertainty quantification of the results and the limitations of your ML models, e.g. ML will predict poorly for data outside the range of training/ calibration. 

Finally, upon revision of the above points, you are required to carefully proofread the manuscript and improve the writing. 

Author Response

Response to Comments from Reviewer 2

The manuscript is aligned with the journal's aims and scope. I read the paper with interest and think upon revision, the manuscript has the potential for good readership and impact. However, several key issues require to be addressed. Following the below revisions the manuscript can be reconsidered for publication. 

We thank the reviewer for taking the time to go through the manuscript and for providing constructive feedback to improve it. We address the points raised by the reviewer below.

Abstract: you need to add some quantitative figures with regard to your results. The key contribution and significance of the work can be better highlighted in the abstract. 

In the revision, the goodness-of-fit statistic for the ML models during the testing stage is reported in the abstract. The novelty of the work, “this is first such study where machine learning models are used to estimate watershed health using WH metrics (Mallya et al., 2018)”, is also included. See lines 607-609.

keywords: you need additional keywords which describe the exact ML techniques used for the study. 

We have now added additional keywords to describe the ML techniques used in the study.

Introduction: follows a logical structure, but the literature review is not complete and some new references can enhance the discussions. L35-38, when describing pollution discharge to environments e.g. nearshore and coastal area (10.9753/icce.v34.waves.49). When describing agricultural activities and fertilizers, not only streams will be polluted but also the intensified level of N in the streams can pose risks to public health and ecosystem function, including carcinogenic risks to human lives (10.1016/j.jclepro.2022.132432; 10.1016/j.jtice.2021.01.030), which can be included in the discussions.  

To improve the literature review, we have included new references (Lines 54-59):

Fecal indicator bacteria such as Escherichia coli can be part of runoff pollution and contaminate waters in the nearshore and coastal regions (Abolfathi and Pearson, 2014). When the levels of Nitrate + Nitrite are above permissible limits in the streams they can pose both non-carcinogenic and carcinogenic risks to public health and affect general ecosystem functioning (Noori et al., 2022a).

References:

Abolfathi, S. and Pearson, J.M., 2014. Solute dispersion in the nearshore due to oblique waves. In Proceedings of 14th International Conference on Coastal Engineering, pp. 2156-1028.

Noori, R., Farahani, F., Jun, C., Aradpour, S., Bateni, S.M., Ghazban, F., Hosseinzadeh, M., Maghrebi, M., Naseh, M.R.V. and Abolfathi, S., 2022a. A non-threshold model to estimate carcinogenic risk of nitrate-nitrite in drinking water. Journal of Cleaner Production, 363, p.132432.

L54-63, when discussing ML-based models, recent literature should be included, for example when discussing the following point in L54-63 you should include recent literature: sediment transport modeling (10.3390/hydrology9020036), ecological modeling (10.3390/hydrology10010016), water quality assessment (10.1038/s41598-022-08417-4), flood prediction (10.1016/j.watres.2022.119100), and ML-based satellite imagery analysis (doi.org/10.1155/2022/8451812).

We have added recent literature to our Introduction section as suggested by the reviewer (Lines 74-83).

“sediment transport modeling (Bhattacharya, B. et al., 2007; Noori et al., 2022b; Sharafati et al., 2020), ecological modeling (Cutler et al., 2007; Džeroski, 2001; Malekmohammadi et al., 2023; Tuia et al., 2022; Vincenzi et al., 2011), water quality assessment (Ahmed et al., 2019; Azrour et al., 2022; Ghiasi et al., 2022; Hollister et al., 2016; Khullar and Singh, 2022; Kim et al., 2014; Lee et al., 2018; Mohammadpour et al., 2015; Nasir et al., 2022; Qianqian and Ying, 2015; Singh et al., 2017, 2011; Tan et al., 2012; Walley and Džeroski, 1996; Walsh et al., 2017) , flood prediction (Donnelly et al., 2022; Mosavi et al., 2018), and ML-based satellite image analysis (McAllister et al., 2022; Yeganeh-Bakhtiary et al., 2022).”

References:

Donnelly, J., Abolfathi, S., Pearson, J., Chatrabgoun, O. and Daneshkhah, A., 2022. Gaussian process emulation of spatio-temporal outputs of a 2D inland flood model. Water Research, 225, p.119100.

Khullar, S. and Singh, N., 2022. Water quality assessment of a river using deep learning Bi-LSTM methodology: forecasting and validation. Environmental Science and Pollution Research, 29(9), pp.12875-12889.

Ghiasi, B., Noori, R., Sheikhian, H., Zeynolabedin, A., Sun, Y., Jun, C., Hamouda, M., Bateni, S.M. and Abolfathi, S., 2022. Uncertainty quantification of granular computing-neural network model for prediction of pollutant longitudinal dispersion coefficient in aquatic streams. Scientific Reports, 12(1), pp.1-15.

Noori, R., Ghiasi, B., Salehi, S., Esmaeili Bidhendi, M., Raeisi, A., Partani, S., Meysami, R., Mahdian, M., Hosseinzadeh, M. and Abolfathi, S., 2022b. An Efficient Data Driven-Based Model for Prediction of the Total Sediment Load in Rivers. Hydrology, 9(2), p.36.

Sharafati, A., Haji Seyed Asadollah, S.B., Motta, D. and Yaseen, Z.M., 2020. Application of newly developed ensemble machine learning models for daily suspended sediment load prediction and related uncertainty analysis. Hydrological Sciences Journal, 65(12), pp.2022-2042.

Malekmohammadi, B., Uvo, C.B., Moghadam, N.T., Noori, R. and Abolfathi, S., 2023. Environmental Risk Assessment of Wetland Ecosystems Using Bayesian Belief Networks. Hydrology, 10(1), p.16.

McAllister, E., Payo, A., Novellino, A., Dolphin, T. and Medina-Lopez, E., 2022. Multispectral satellite imagery and machine learning for the extraction of shoreline indicators. Coastal Engineering, p.104102.

Tuia, D., Kellenberger, B., Beery, S., Costelloe, B.R., Zuffi, S., Risse, B., Mathis, A., Mathis, M.W., van Langevelde, F., Burghardt, T. and Kays, R., 2022. Perspectives in machine learning for wildlife conservation. Nature communications, 13(1), pp.1-15.

Yeganeh-Bakhtiary, A., EyvazOghli, H., Shabakhty, N., Kamranzad, B. and Abolfathi, S., 2022. Machine Learning as a Downscaling Approach for Prediction of Wind Characteristics under Future Climate Change Scenarios. Complexity, 2022.


The study objectives need to be strengthened; you need to be clear about the specific ML techniques used and justify their appropriateness (i.e. OBJ1).

We have provided this clarification. Please see lines 94-98.

ML models, namely Random Forest, AdaBoost, Gradient Boosting Regressor, and Bayesian Ridge regression, were chosen in this study because these models do not make any assumptions about input data distributions, they work well with high-dimensional datasets, and they avoid overfitting by using random combinations of predictor variables to develop an uncorrelated set of models.


L83-85 can be deleted as the subheadings are detailed enough and the text is not contributing to the manuscript.

We have removed the following lines from the manuscript:

“The remainder of this paper is organized as follows: Study area and datasets used are introduced in Section 2. In Section 3, the machine learning models are described. In Section 4 we present the results and discuss the findings, followed by conclusions in Section 5.”

2. Study area and datasets used: 'used' should be deleted in the subheading title. 
L88 -  (see Error! Reference source not found.) needs correction.
Provide links to the dataset used for the case study area. 

  • ‘used’ has been removed from the subheading title.
  • We were not able to replicate or see the error on L88. The error may have arisen during the PDF conversion process in the editorial manager. We will monitor this closely when submitting revisions.
  • Links to the datasets have been provided (Lines 114-125).


3. Methodology: some of the Machine Learning techniques are not well-described, e.g. 3.2.4 and 3.2.5. I appreciate that you don't want to end up with a very long manuscript, but you may give reference to key literature related to the model for a better understanding of the model. 
Where possible, try to use more up-to-date references rather than e.g. Friedman et al., 2001.

We have updated Sections 3.2.4 and 3.2.5. See lines 255-287.

We have also updated the citation Friedman et al., 2001 to a more recent 2009 edition.

References:

Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics, pp.1189-1232.

Tipping, M.E., 2001. Sparse Bayesian learning and the relevance vector machine. Journal of machine learning research, June 2001, pp.211-244.

Hastie, T., Tibshirani, R., and Friedman, J.H., 2009. The elements of statistical learning: data mining, inference, and prediction (Vol. 2, pp. 1-758). New York: springer.



Your ML model section lacks detail about execution and how these models were adopted. You also need to talk about your methods for uncertainty quantification, how the models were evaluated, and with what statistical indexes. 

Machine learning models from the Scikit-learn machine learning toolbox for Python (Pedregosa et al., 2011) were used in the study. For models that use decision trees as weak learners, i.e., Random Forests, AdaBoost, and Gradient Boosting Regressor, the value of the max_features parameter, which is the number of predictor variables to consider when looking for the best split of each decision tree, was found to be optimum at  using the grid search approach (Geurts et al., 2006). Here  refers to the total number of attributes or explanatory variables in the input dataset. See Lines 283-287.
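A grid search over max_features of the kind described above could look like the following sketch. It is not the study's configuration: the data are synthetic, the candidate values, the number of trees and the 3-fold cross-validation are assumptions chosen for illustration, and only two of the tree-based models are shown. AdaBoost is omitted because in scikit-learn its max_features setting belongs to the underlying tree estimator rather than to the booster itself.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)

# Synthetic stand-in for the attribute matrix and the target values.
X = rng.random((80, 6))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.05, 80)

# Candidate settings (assumed): None means "use all attributes".
param_grid = {"max_features": ["sqrt", "log2", 0.5, None]}

estimators = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

best_settings = {}
for name, estimator in estimators.items():
    # Cross-validated search; the estimator's default score is used.
    search = GridSearchCV(estimator, param_grid, cv=3)
    search.fit(X, y)
    best_settings[name] = search.best_params_["max_features"]

print(best_settings)
```

The selected value depends on the data, so the output of this sketch says nothing about the optimum reported in the manuscript.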

Uncertainty quantification in the study is presented through the computation of a 90% prediction interval for each model. See updated Figure 5 and lines 411-430.
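One simple way to obtain a 90% interval from a set of predictions is to take the 5th and 95th percentiles of that set. The sketch below shows this under the assumption that a collection of predictions for one station is available, for example from the individual trees of an ensemble; this response does not state how the intervals in the manuscript were computed, so the helper names and the approach are illustrative only.

```python
def percentile(sorted_values, q):
    """Percentile q (0 to 100) of an already sorted list, by linear interpolation."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def prediction_interval(member_predictions, coverage=0.90):
    """Central interval that contains the given share of the member predictions."""
    ordered = sorted(member_predictions)
    tail = (1.0 - coverage) / 2.0 * 100.0
    return percentile(ordered, tail), percentile(ordered, 100.0 - tail)


# Hypothetical predictions for one station: 0.00, 0.01, ..., 1.00.
members = [i / 100 for i in range(101)]
low, high = prediction_interval(members)
print(low, high)
```

A wider interval at a station then indicates larger disagreement among the member predictions.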

Models were evaluated using the goodness-of-fit statistic, computed separately for the training and testing sets. The values are presented in Table 1, and figures are presented in the supplementary section (Figures S8 to S11).

Reference:

Geurts, P., Ernst, D. and Wehenkel, L., 2006. Extremely randomized trees. Machine learning, 63(1), pp.3-42.

 

Results: lots of references are in the form of '(see Error! Reference source not found.)'

We were not able to replicate this error. We suspect it is due to incompatible versions of the Word/PDF processor and that it arose during the PDF conversion process in the editorial manager. We will work closely with the journal’s editing department to avoid such errors, broken links, and broken references.


Statistical analyses of the models are missing. You need to provide tables showing the indexes used and a comparison of statistics between different models. 

The goodness-of-fit statistic for the training and test sets is presented for all ML models in Table 1 of the manuscript. Figures S8 to S11 in the supplementary section compare the reference WH metric (with respect to SSC) with the WH predicted by the individual ML models at all the test sites.

In terms of structure, the results section can benefit from subheadings, so the author can see the results of different models in separate sub-sections. This makes it easier to comprehend and evaluate the study.  

We have now created separate sub-sections as suggested by the reviewer. Sections 4.1.2.1 and 4.1.2.2 have been added.

Conclusions: should highlight study significance and new knowledge/ contributions. You also need to elaborate on the uncertainty quantification of the results and the limitations of your ML models, e.g. ML will predict poorly for data outside the range of training/ calibration. 

In this study, uncertainty in the results of the ML models is presented in the form of prediction intervals. Judged by the 90% prediction intervals at the test stations, all ML models except the Bayesian Ridge regressor performed well. The 90% prediction intervals were found to be wider for the nutrient models, which could be attributed to the smaller number of training samples. See lines 629-634.

The WH metric predictions obtained from ML models are dependent on the quality of input data. However, in this study we did not analyze the sensitivity of the results to variations in input data values or in the event of missing data. Also, if the input data to ML models are outside the training data range, the output from the ML model would exhibit greater uncertainty. See lines 635-638.
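The limitation about inputs outside the training data range can be checked before prediction. The following sketch flags attributes of a new watershed that fall outside the minimum and maximum seen during training. The function names, attribute names and values are hypothetical, and a per-attribute range check is only one simple way to detect extrapolation.

```python
def training_ranges(rows):
    """Return a (min, max) pair for each attribute column of the training rows."""
    return [(min(column), max(column)) for column in zip(*rows)]


def out_of_range_attributes(sample, ranges, names):
    """List the attributes of one sample that lie outside the training range."""
    return [
        name
        for value, (low, high), name in zip(sample, ranges, names)
        if value < low or value > high
    ]


# Hypothetical attributes: annual precipitation (mm), mean temperature (deg C),
# and forested fraction of the watershed.
names = ["precipitation", "temperature", "forest_fraction"]
train = [
    [800, 10, 0.2],
    [1200, 14, 0.6],
    [1000, 12, 0.4],
]
ranges = training_ranges(train)

# Precipitation of 1500 mm exceeds the training maximum of 1200 mm.
flagged = out_of_range_attributes([1500, 11, 0.5], ranges, names)
print(flagged)
```

Predictions for watersheds with flagged attributes would then be treated with extra caution, in line with the limitation stated above.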

 

Finally, upon revision of the above points, you are required to carefully proofread the manuscript and improve the writing.

We have carefully proofread the manuscript so that some of the highlighted issues such as grammar, broken links/references are fixed. We have also divided the manuscript into subsections to make it easier to read and comprehend the study.


Round 2

Reviewer 2 Report

 

The authors have revised the manuscript according to the comments. The work is very informative and provides a comprehensive overview of the existing ML-based options for predicting sediment and nutrients in ungauged basins. I think the manuscript is now ready for publication. However, language edits and proofreading will be required. Also, some of the references do not follow standard formatting. I think adding a DOI for every reference will be necessary to ensure easy access to the literature for readers. 
