# Advanced Machine Learning Techniques to Improve Hydrological Prediction: A Comparative Analysis of Streamflow Prediction Models


## Abstract

The predictive performance of eleven machine learning models was compared using error metrics and coefficient of determination (R^{2}) values for both the training and testing datasets. This was accomplished through an in-depth investigation and model comparison. In contrast to CatBoost, XGBoost and LGBM produced a higher percentage of data points with prediction errors exceeding 35% for moderate inflow values above 10,000. CatBoost established itself as a reliable method for hydrological time-series modelling, handling both categorical and continuous variables with ease and thereby greatly enhancing prediction accuracy. The results of this study highlight the value and promise of widely used machine learning algorithms in hydrology and offer valuable insights for academics and industry professionals.

## 1. Introduction

#### 1.1. Literature Review

#### 1.1.1. Traditional Methods for River Inflow Prediction

#### 1.1.2. Machine Learning Approaches for River Inflow Prediction

- Comparative Evaluation: the study provides a comprehensive comparative evaluation of multiple machine learning models for predicting river inflow. While previous studies have explored individual models, this research systematically compares the performance of CatBoost, ElasticNet, KNN, Lasso, LGBM, Linear Regression, MLP, Random Forest, Ridge, SGD, and XGBoost. Such a comprehensive comparative analysis is novel in the context of river inflow prediction.
- Time Series Analysis: the study specifically focuses on time series analysis for river inflow prediction. Time series data present unique challenges, due to temporal dependencies. By applying different machine learning techniques to this specific domain, the research contributes to the advancement of time series prediction methodologies in the context of water resource management.
- Application to River Inflow Prediction: while machine learning models have been applied in various domains, their application to river inflow prediction is of significant importance for water resource management. Predicting river inflow accurately is crucial for making informed decisions regarding water allocation, flood management, and hydropower generation.
- Performance Evaluation on Multiple Datasets: the study evaluates the performance of the models on multiple datasets, including training, validation, and testing data. This comprehensive evaluation provides a robust assessment of the models’ performance and their ability to generalize to unseen data, contributing to the understanding of their efficacy in real-world scenarios.

#### 1.2. Objectives of the Study

## 2. Methodology and Methods

Standard error metrics (MAE, MSE, RMSE, RMSPE), together with the coefficient of determination (R^{2}), are used to quantify the models’ performance.

#### 2.1. CatBoostRegressor Algorithm

#### 2.2. k-Nearest Neighbors

- Prepare the training data with input features and target values.
- Determine the value of k, the number of nearest neighbors to consider.
- Calculate the distance between the new data point and the training data points.
- Select the k nearest neighbors, based on the distances.
- Average the target values of the k nearest neighbors and use this average as the predicted value for the new data point.
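The steps above can be sketched with scikit-learn’s `KNeighborsRegressor`; this is a minimal illustration on synthetic data, not the study’s actual configuration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic lagged-flow feature and target (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 1))          # e.g. previous-day inflow
y = 2.0 * X[:, 0] + rng.normal(0, 5, size=200)  # noisy linear response

# k = 5 neighbors; the prediction is the mean target of the 5 closest points.
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X, y)

pred = knn.predict([[50.0]])  # estimate the target for a new data point
```

Distance here defaults to Euclidean; for multi-feature inflow data the features would normally be scaled first, since kNN is sensitive to feature magnitudes.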

#### 2.3. Light Gradient-Boosting Machine Regressor (LGBM)

#### 2.4. Linear Regression (LR)

#### 2.5. Multilayer Perceptron

- (a) Initialization: assign random weights to the connections between the neurons.
- (b) Input layer: take in the input data and pass them to the first hidden layer.
- (c) Hidden layers: each hidden-layer neuron computes the weighted sum of its inputs using the current weights and then applies an activation function (such as a sigmoid) to the sum.
- (d) Output layer: the neurons in the output layer compute the same weighted sum and activation function as the neurons in the hidden layers.
- (e) Output: the MLP’s final output is read from the neurons in the output layer.
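The forward pass described in steps (a)–(e) can be written out directly in NumPy; the layer sizes and the use of a sigmoid in every layer are illustrative assumptions, not the study’s tuned architecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)

# (a) Initialization: random weights (and zero biases) for each connection.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)  # input (3) -> hidden (4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # hidden (4) -> output (1)

x = np.array([0.2, 0.5, 0.1])  # (b) input layer receives the data

h = sigmoid(x @ W1 + b1)       # (c) hidden layer: weighted sum + activation
y_hat = sigmoid(h @ W2 + b2)   # (d) output layer: same computation

# (e) y_hat holds the MLP's final output.
```

Training would then adjust `W1, b1, W2, b2` by backpropagating the prediction error, which is what libraries such as scikit-learn’s `MLPRegressor` automate.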

#### 2.6. Random Forest

- Random Subset Selection: a random subset of data points is chosen from the training set. This subset typically contains a fraction of the total data points, denoted by ‘p’.
- Construction of a Decision Tree: using the subset of data points that was chosen, a decision tree is built. This procedure is repeated using various subsets of the data for a total of ‘N’ trees.
- Prediction Aggregation: each of the ‘N’ decision trees predicts the value of the target variable for a new data point. The outcomes of all the predictions from the trees are averaged to provide the final forecast.
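The three steps above map directly onto scikit-learn’s `RandomForestRegressor`, where ‘N’ corresponds to `n_estimators` and the random subset selection is handled by bootstrap sampling. A hedged sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 2))
y = X[:, 0] * 3.0 + X[:, 1] + rng.normal(0, 0.5, size=300)

# N = 100 trees, each grown on a bootstrap subset of the data;
# the forest's prediction averages the individual tree predictions.
rf = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=0)
rf.fit(X, y)

pred = rf.predict([[5.0, 2.0]])  # averaged over all 100 trees
```

The averaging across decorrelated trees is what reduces the variance of any single decision tree, which is why random forests resist overfitting better than individual trees.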

#### 2.7. Lasso

#### 2.8. Ridge

#### 2.9. ElasticNet
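Lasso, Ridge, and ElasticNet differ only in the penalty added to the least-squares loss: L1, L2, and a mix of the two, respectively. A brief scikit-learn sketch contrasting the three on the same synthetic data (the `alpha` values are illustrative, not the study’s tuned settings):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))
# Only the first two features actually matter.
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)                    # L1: drives coefficients to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)                    # L2: shrinks coefficients toward 0
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # blend of L1 and L2
```

Inspecting `lasso.coef_` shows the three irrelevant coefficients driven to zero, while `ridge.coef_` keeps small non-zero values for all features; ElasticNet sits between the two behaviors.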

#### 2.10. Stochastic Gradient Descent (SGD) Regressor

#### 2.11. Extreme Gradient-Boosting Regression Model (XGBoost)

- Parameter selection: the first step is choosing the XGBoost model’s parameters, such as the learning rate, the number of trees, the maximum depth, and the feature fraction. These variables can be tuned to improve performance and regulate how the model behaves.
- Model construction and training: the XGBoost model is built as an ensemble of decision trees. Each tree is fitted using a gradient-based optimization technique that minimizes the loss function. The ensemble is expanded tree by tree throughout the training phase, and predictions are updated in line with the gradients of the loss function.
- Prediction: after training, the model may be used to make predictions on new data points. XGBoost combines the predictions from each tree in the ensemble to obtain the final regression prediction; the particular method of combination is determined by the loss function used.
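The training loop described above can be illustrated from scratch for squared-error loss, where each new tree is fitted to the negative gradient (the residuals) of the current ensemble. This is a generic gradient-boosting sketch, not XGBoost’s actual regularized implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) * 5.0 + rng.normal(0, 0.3, size=300)

learning_rate, n_trees, max_depth = 0.1, 100, 3  # typical tunable parameters
pred = np.full_like(y, y.mean())                 # start from the mean prediction
trees = []

for _ in range(n_trees):
    residuals = y - pred                          # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(X, residuals)                        # fit a tree to the gradients
    pred += learning_rate * tree.predict(X)       # expand the ensemble
    trees.append(tree)

mse = float(np.mean((y - pred) ** 2))             # training error after boosting
```

XGBoost adds second-order gradient information, shrinkage, and explicit regularization of tree complexity on top of this basic loop, which is why it typically fits faster and generalizes better than the naive version shown here.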

## 3. Model Training and Validation

- 1. Data Split: the data are divided into a training set, a validation set, and a test set. The model is trained on the training set. The validation set is used to tune the model and assess its performance during training, whereas the test set is used to measure the trained model’s final performance on unseen data.
- 2. Model Selection: choose the most suitable model architecture or machine learning technique for the task at hand. The kind of data, the task (classification, regression, etc.), and the available resources all factor into model selection.
- 3. Model Training: fit the selected model on the training dataset. During training, the model parameters are iteratively adjusted to minimize a chosen loss or error function: training data are fed into the model, predictions are generated and compared to the actual values, and the parameters are updated according to the computed errors. This procedure continues until a convergence criterion is satisfied or a set number of epochs is reached.
- 4. Model Evaluation: assess the trained model on the validation dataset by comparing its predictions to the actual values. Several assessment measures are employed, including mean absolute error (MAE), mean squared error (MSE), root mean square error (RMSE), root mean square percentage error (RMSPE), and the coefficient of determination (R^{2}) [75].
- 5. Iterative Refinement: modify the model architecture or data-preparation steps based on the evaluation findings. This iterative procedure continues until suitable performance is attained.
- 6. Final Assessment: after the model has been tuned, its performance is evaluated on the test dataset, which simulates unseen data. This offers an unbiased assessment of how well the model performs in realistic situations.
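Steps 1–6 can be sketched with scikit-learn; the dataset, model choice, and 70/15/15 split ratios here are placeholders, not the study’s actual setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.2, size=500)

# Step 1: split into training (70%), validation (15%), and test (15%) sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Steps 2-3: select and train a model.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Step 4: evaluate on the validation set.
val_pred = model.predict(X_val)
mae = mean_absolute_error(y_val, val_pred)
rmse = float(np.sqrt(mean_squared_error(y_val, val_pred)))
r2 = r2_score(y_val, val_pred)

# Step 6: final, untouched assessment on the test set.
test_r2 = r2_score(y_test, model.predict(X_test))
```

For time-series data such as river inflow, a chronological split (or `TimeSeriesSplit`) would normally replace the random split shown here, so that the model is never validated on data that precedes its training period.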

## 4. Study Area, Data Collection and Preprocessing

#### 4.1. Study Area

#### 4.2. Data Collection

#### 4.3. Techniques for Preprocessing Data

The seasonal component is measured in m^{3}/s and demonstrates that the river flow displays significant patterns and changes throughout the year. Understanding seasonality can aid in forecasting future flow patterns and preparing for the demands placed on water resources during particular seasons. The residuals are the differences between the observed values and those predicted by the trend and seasonality components. They represent the random and unpredictable variations in river flow that neither trend nor seasonality can account for. Analyzing the residuals reveals any remaining anomalies or out-of-the-ordinary events in the data, allows the goodness of fit of the trend and seasonality components to be assessed, and highlights any other variables affecting the river flow.
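The split into trend, seasonality, and residual components can be sketched with pandas using a classical additive decomposition (rolling-mean trend, day-of-year seasonality). This is a simplified stand-in for a dedicated routine such as statsmodels’ `seasonal_decompose`:

```python
import numpy as np
import pandas as pd

# Synthetic daily flow series (m^3/s) with trend + annual seasonality + noise.
idx = pd.date_range("2000-01-01", periods=4 * 365, freq="D")
rng = np.random.default_rng(0)
t = np.arange(len(idx))
flow = 500 + 0.05 * t + 300 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 20, len(idx))
series = pd.Series(flow, index=idx)

# Trend: centred 365-day rolling mean.
trend = series.rolling(window=365, center=True).mean()

# Seasonality: mean of the detrended series for each day of the year.
detrended = series - trend
seasonal = detrended.groupby(detrended.index.dayofyear).transform("mean")

# Residuals: what neither trend nor seasonality explains.
residual = series - trend - seasonal
```

On real inflow data the residual series is where anomalies such as flood pulses show up, exactly as described above.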

#### 4.3.1. Creating Lagged Features
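Lagged features shift past observations into the current row so the model can learn temporal dependencies. A pandas sketch (the number of lags is illustrative):

```python
import pandas as pd

flow = pd.DataFrame(
    {"flow": [120.0, 150.0, 130.0, 170.0, 160.0, 140.0]},
    index=pd.date_range("2019-01-01", periods=6, freq="D"),
)

# Previous days' flows become new predictor columns.
for lag in (1, 2, 3):
    flow[f"flow_lag{lag}"] = flow["flow"].shift(lag)

# Rows whose lags reach before the start of the record are dropped.
flow = flow.dropna()
```

After this step, each row pairs the current flow (the target) with the flows of the preceding days (the features), turning the forecasting task into ordinary supervised regression.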

#### 4.3.2. Date Feature Engineering
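Calendar attributes extracted from the timestamp give the models access to seasonal structure. A pandas sketch (the chosen attributes are illustrative):

```python
import pandas as pd

df = pd.DataFrame(index=pd.date_range("1980-06-15", periods=3, freq="D"))

# Decompose the date index into numeric calendar features.
df["year"] = df.index.year
df["month"] = df.index.month
df["day"] = df.index.day
df["dayofweek"] = df.index.dayofweek
df["dayofyear"] = df.index.dayofyear
```

Tree-based models such as CatBoost and LGBM can use these integer features directly; linear models may additionally benefit from cyclic (sine/cosine) encodings of month and day of year.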

#### 4.3.3. One-Hot Encoding
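One-hot encoding turns a categorical feature into binary indicator columns, one per category. A pandas sketch, where `season` is a hypothetical column name:

```python
import pandas as pd

df = pd.DataFrame({"season": ["monsoon", "winter", "summer", "monsoon"]})

# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["season"])
```

This makes categorical information usable by models that require numeric input (e.g. linear regression, MLP); CatBoost, by contrast, can consume categorical columns natively.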

## 5. Model Preparation

## 6. Results and Discussion

#### 6.1. Performance Metrics of Training Data

The performance of each model on the training data is quantified using MAE, MSE, RMSE, RMSPE, and R^{2}. Higher R^{2} values indicate a better fit of the model to the data, while lower MAE, MSE, RMSE, and RMSPE values denote superior performance. A comparison of the models in Table 2 reveals that CatBoost, XGBoost, and RF demonstrate improved prediction accuracy and model fit on the training data, owing to their lower MAE, MSE, RMSE, and RMSPE values and high R^{2}. ElasticNet, KNN, Lasso, LR, MLP, Ridge, and SGD perform less effectively on the training data, with lower R^{2} and higher MAE, MSE, RMSE, and RMSPE values. LGBM also performs well, exhibiting relatively low values across all the criteria. The models with the lowest errors (MAE, MSE, RMSE, RMSPE), the highest R^{2}, and the best performance on the training data are CatBoost, XGBoost, and RF. These models fit the training data well and have excellent predictive capabilities. It is crucial to note that a model’s performance on training data might not necessarily generalize to new data; therefore, further assessment using the validation and test data is necessary to select the most suitable model for prediction tasks.

#### 6.2. Performance Metrics of Validation Data

Model performance on the validation data is likewise assessed using MAE, MSE, RMSE, RMSPE, and R^{2}. After reviewing these results, the following conclusions can be drawn: LGBM, Lasso, MLP, and Ridge perform better on the validation data, having comparatively lower MAE, MSE, RMSE, and RMSPE values and higher R^{2}. CatBoost, ElasticNet, LR, RF, SGD, and XGBoost also exhibit acceptable performance, with moderate metric values. KNN performs poorly on the validation data, with higher MAE, MSE, RMSE, and RMSPE values and lower R^{2}. LGBM, Lasso, MLP, and Ridge thus outperform the other models on the validation data; their consistently lower errors (MAE, MSE, RMSE, and RMSPE) and improved R^{2} on the validation set indicate better model fit and prediction accuracy. However, performance on the validation data may not generalize to new data, so additional evaluation on a separate test set is required.

#### 6.3. Performance Metrics of Testing Data

Model performance on the testing data is assessed using the same metrics. The following findings emerge: with lower MAE, MSE, RMSE, and RMSPE values and greater R^{2}, LGBM, CatBoost, and MLP demonstrate improved performance on the test data. ElasticNet, Lasso, RF, Ridge, and XGBoost exhibit acceptable performance, with modest metric values. The MAE, MSE, RMSE, and RMSPE values for KNN, LR, and SGD are comparatively higher, and their R^{2} values lower, indicating poor performance on the testing data. LGBM, CatBoost, and MLP perform better on the testing data than the other models: they consistently achieve lower errors (MAE, MSE, RMSE, RMSPE), greater R^{2}, and a better model fit on the testing set, all of which indicate enhanced prediction accuracy.

#### 6.4. Comparison of the Models

- Training Data: XGBoost has the highest R^{2} and the lowest MAE, MSE, RMSE, and RMSPE values, indicating the best performance on the training data. The time series prediction for XGBoost is shown in Figure 5, where predicted streamflow inflows are depicted alongside the actual data. As the figure shows, the XGBoost model largely captures the fundamental patterns and fluctuations in streamflow across the dataset.
- Validation Data: the LGBM model has the highest R^{2} and the lowest MAE, MSE, RMSE, and RMSPE values, demonstrating the best performance on the validation data. The time series prediction for LGBM against the actual data is shown in Figure 6.
- Testing Data: LGBM has the highest R^{2} and the lowest MAE, MSE, and RMSE values, showing the best performance on the testing data.

#### 6.5. Limitations of the Study

- (a) One limitation of our research is the reliance on a specific dataset from the Garudeshwar gauging station. The findings may be limited to this particular location and may not directly apply to other river systems. Future studies should consider incorporating data from multiple gauging stations or rivers to validate the performance of the models across different regions.
- (b) Another limitation is the time frame of the dataset, which spans from 1980 to 2019. Although this provides a substantial historical perspective, it may not capture recent changes or evolving patterns in river inflow. Incorporating more up-to-date data would enhance the accuracy and relevance of the predictions.
- (c) Additionally, the study focused primarily on machine learning models and did not consider other factors that could influence river inflow, such as climate change, land use changes, or anthropogenic activities. Incorporating these factors into the modeling process may provide a more comprehensive understanding of the dynamics of river inflow.
- (d) Lastly, the performance of the models may be influenced by the quality and completeness of the data. Data quality issues, such as measurement errors, could impact the accuracy of the predictions. Future research should address data preprocessing and quality-control techniques to mitigate such limitations.

## 7. Conclusions

CatBoost achieved high R^{2} values on both the training and validation data, demonstrating a strong fit and accurately capturing the variation in the target variable. It also performed well on the testing data, with relatively low MAE and RMSE values. LGBM likewise performed well across all three datasets, achieving competitive MAE, MSE, RMSE, and R^{2} results on the training and validation data and reasonable MAE and RMSE on the testing data. LGBM, renowned for its efficient gradient-boosting implementation and its ability to handle large datasets and capture intricate correlations, showcased these strengths in this study. Results from XGBoost were encouraging, especially on the training and validation data: it achieved the lowest MAE, MSE, RMSE, and RMSPE values on the training set, demonstrating an excellent fit, and displayed reasonably low MAE and RMSE on the validation data, indicating strong generalization. However, it performed somewhat worse than CatBoost and LGBM in terms of R^{2} scores on the testing data. Based on careful investigation and comparison of the error plots, CatBoost was determined to have the best performance among the models: it performed optimally on the test data, demonstrating its ability to make accurate predictions on new, unseen data. Future studies should explore ensemble approaches, which combine the strengths of multiple models to enhance prediction accuracy. Incorporating domain knowledge and additional pertinent factors may also improve model performance. To maintain the efficacy of these models in hydrological forecasting, they will need to be updated continuously with fresh data.

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

- El-Shafie, A.; Taha, M.R.; Noureldin, A. A neuro-fuzzy model for inflow forecasting of the Nile river at Aswan high dam. Water Resour. Manag. **2007**, 21, 533–556.
- Stakhiv, E.; Stewart, B. Needs for Climate Information in Support of Decision-Making in the Water Sector. Procedia Environ. Sci. **2010**, 1, 102–119.
- Kumar, V.; Yadav, S.M. Multi-objective reservoir operation of the Ukai reservoir system using an improved Jaya algorithm. Water Supply **2022**, 22, 2287–2310.
- Chabokpour, J.; Chaplot, B.; Dasineh, M.; Ghaderi, A.; Azamathulla, H.M. Functioning of the multilinear lag-cascade flood routing model as a means of transporting pollutants in the river. Water Supply **2020**, 20, 2845–2857.
- Venkataraman, K.; Tummuri, S.; Medina, A.; Perry, J. 21st century drought outlook for major climate divisions of Texas based on CMIP5 multimodel ensemble: Implications for water resource management. J. Hydrol. **2016**, 534, 300–316.
- Hanak, E.; Lund, J.R. Adapting California’s water management to climate change. Clim. Chang. **2012**, 111, 17–44.
- Sharma, K.V.; Kumar, V.; Singh, K.; Mehta, D.J. LANDSAT 8 LST Pan sharpening using novel principal component based downscaling model. Remote Sens. Appl. Soc. Environ. **2023**, 30, 100963.
- Cho, K.; Kim, Y. Improving streamflow prediction in the WRF-Hydro model with LSTM networks. J. Hydrol. **2022**, 605, 127297.
- Nearing, G.S.; Kratzert, F.; Sampson, A.K.; Pelissier, C.S.; Klotz, D.; Frame, J.M.; Prieto, C.; Gupta, H.V. What Role Does Hydrological Science Play in the Age of Machine Learning? Water Resour. Res. **2021**, 57, e2020WR028091.
- Liang, J.; Li, W.; Bradford, S.; Šimůnek, J. Physics-Informed Data-Driven Models to Predict Surface Runoff Water Quantity and Quality in Agricultural Fields. Water **2019**, 11, 200.
- Dinic, F.; Singh, K.; Dong, T.; Rezazadeh, M.; Wang, Z.; Khosrozadeh, A.; Yuan, T.; Voznyy, O. Applied Machine Learning for Developing Next-Generation Functional Materials. Adv. Funct. Mater. **2021**, 31, 2104195.
- Clark, M.P.; Fan, Y.; Lawrence, D.M.; Adam, J.C.; Bolster, D.; Gochis, D.J.; Hooper, R.P.; Kumar, M.; Leung, L.R.; Mackay, D.S.; et al. Improving the representation of hydrologic processes in Earth System Models. Water Resour. Res. **2015**, 51, 5929–5956.
- Legesse, D.; Vallet-Coulomb, C.; Gasse, F. Hydrological response of a catchment to climate and land use changes in Tropical Africa: Case study South Central Ethiopia. J. Hydrol. **2003**, 275, 67–85.
- Yang, S.; Wan, M.P.; Chen, W.; Ng, B.F.; Dubey, S. Model predictive control with adaptive machine-learning-based model for building energy efficiency and comfort optimization. Appl. Energy **2020**, 271, 115147.
- Wang, Z.; Yang, W.; Liu, Q.; Zhao, Y.; Liu, P.; Wu, D.; Banu, M.; Chen, L. Data-driven modeling of process, structure and property in additive manufacturing: A review and future directions. J. Manuf. Process. **2022**, 77, 13–31.
- Hernández-Rojas, L.F.; Abrego-Perez, A.L.; Lozano Martínez, F.E.; Valencia-Arboleda, C.F.; Diaz-Jimenez, M.C.; Pacheco-Carvajal, N.; Garcia-Cardenas, J.J. The Role of Data-Driven Methodologies in Weather Index Insurance. Appl. Sci. **2023**, 13, 4785.
- Feng, D.; Lawson, K.; Shen, C. Mitigating Prediction Error of Deep Learning Streamflow Models in Large Data-Sparse Regions With Ensemble Modeling and Soft Data. Geophys. Res. Lett. **2021**, 48, e2021GL092999.
- San, O.; Rasheed, A.; Kvamsdal, T. Hybrid analysis and modeling, eclecticism, and multifidelity computing toward digital twin revolution. GAMM-Mitt. **2021**, 44, e202100007.
- Aliashrafi, A.; Zhang, Y.; Groenewegen, H.; Peleato, N.M. A review of data-driven modelling in drinking water treatment. Rev. Environ. Sci. Bio/Technol. **2021**, 20, 985–1009.
- Singh, K.; Singh, B.; Sihag, P.; Kumar, V.; Sharma, K.V. Development and application of modeling techniques to estimate the unsaturated hydraulic conductivity. Model. Earth Syst. Environ. **2023**.
- Yang, D.; Chen, K.; Yang, M.; Zhao, X. Urban rail transit passenger flow forecast based on LSTM with enhanced long-term features. IET Intell. Transp. Syst. **2019**, 13, 1475–1482.
- Nagar, U.P.; Patel, H.M. Development of Short-Term Reservoir Level Forecasting Models: A Case Study of Ajwa-Pratappura Reservoir System of Vishwamitri River Basin of Central Gujarat. In Hydrology and Hydrologic Modelling—HYDRO 2021; Timbadiya, P.V., Patel, P.L., Singh, V.P., Sharma, P.J., Eds.; Springer: Singapore, 2023; pp. 261–269.
- Mehta, D.J.; Eslamian, S.; Prajapati, K. Flood modelling for a data-scarce semi-arid region using 1-D hydrodynamic model: A case study of Navsari Region. Model. Earth Syst. Environ. **2022**, 8, 2675–2685.
- Gangani, P.; Mangukiya, N.K.; Mehta, D.J.; Muttil, N.; Rathnayake, U. Evaluating the Efficacy of Different DEMs for Application in Flood Frequency and Risk Mapping of the Indian Coastal River Basin. Climate **2023**, 11, 114.
- Omukuti, J.; Wanzala, M.A.; Ngaina, J.; Ganola, P. Develop medium- to long-term climate information services to enhance comprehensive climate risk management in Africa. Clim. Resil. Sustain. **2023**, 2, e247.
- Kumar, V.; Yadav, S.M. A state-of-the-art review of heuristic and metaheuristic optimization techniques for the management of water resources. Water Supply **2022**, 22, 3702–3728.
- Rivera-González, L.; Bolonio, D.; Mazadiego, L.F.; Valencia-Chapi, R. Long-Term Electricity Supply and Demand Forecast (2018–2040): A LEAP Model Application towards a Sustainable Power Generation System in Ecuador. Sustainability **2019**, 11, 5316.
- Singh, D.; Vardhan, M.; Sahu, R.; Chatterjee, D.; Chauhan, P.; Liu, S. Machine-learning- and deep-learning-based streamflow prediction in a hilly catchment for future scenarios using CMIP6 GCM data. Hydrol. Earth Syst. Sci. **2023**, 27, 1047–1075.
- Mohammadi, B. A review on the applications of machine learning for runoff modeling. Sustain. Water Resour. Manag. **2021**, 7, 98.
- Ibrahim, K.S.M.H.; Huang, Y.F.; Ahmed, A.N.; Koo, C.H.; El-Shafie, A. Forecasting multi-step-ahead reservoir monthly and daily inflow using machine learning models based on different scenarios. Appl. Intell. **2023**, 53, 10893–10916.
- Rajesh, M.; Anishka, S.; Viksit, P.S.; Arohi, S.; Rehana, S. Improving Short-range Reservoir Inflow Forecasts with Machine Learning Model Combination. Water Resour. Manag. **2023**, 37, 75–90.
- Cai, H.; Liu, S.; Shi, H.; Zhou, Z.; Jiang, S.; Babovic, V. Toward improved lumped groundwater level predictions at catchment scale: Mutual integration of water balance mechanism and deep learning method. J. Hydrol. **2022**, 613, 128495.
- Jiang, S.; Zheng, Y.; Wang, C.; Babovic, V. Uncovering Flooding Mechanisms Across the Contiguous United States Through Interpretive Deep Learning on Representative Catchments. Water Resour. Res. **2022**, 58, e2021WR030185.
- Herath, H.M.V.V.; Chadalawada, J.; Babovic, V. Hydrologically informed machine learning for rainfall–runoff modelling: Towards distributed modelling. Hydrol. Earth Syst. Sci. **2021**, 25, 4373–4401.
- Chadalawada, J.; Herath, H.M.V.V.; Babovic, V. Hydrologically Informed Machine Learning for Rainfall-Runoff Modeling: A Genetic Programming-Based Toolkit for Automatic Model Induction. Water Resour. Res. **2020**, 56, e2019WR026933.
- Lima, C.H.R.; Lall, U. Spatial scaling in a changing climate: A hierarchical bayesian model for non-stationary multi-site annual maximum and monthly streamflow. J. Hydrol. **2010**, 383, 307–318.
- Turner, S.W.D.; Marlow, D.; Ekström, M.; Rhodes, B.G.; Kularathna, U.; Jeffrey, P.J. Linking climate projections to performance: A yield-based decision scaling assessment of a large urban water resources system. Water Resour. Res. **2014**, 50, 3553–3567.
- Ab Razak, N.H.; Aris, A.Z.; Ramli, M.F.; Looi, L.J.; Juahir, H. Temporal flood incidence forecasting for Segamat River (Malaysia) using autoregressive integrated moving average modelling. J. Flood Risk Manag. **2018**, 11, S794–S804.
- Banihabib, M.E.; Bandari, R.; Valipour, M. Improving Daily Peak Flow Forecasts Using Hybrid Fourier-Series Autoregressive Integrated Moving Average and Recurrent Artificial Neural Network Models. AI **2020**, 1, 263–275.
- Demirel, M.C.; Venancio, A.; Kahya, E. Flow forecast by SWAT model and ANN in Pracana basin, Portugal. Adv. Eng. Softw. **2009**, 40, 467–473.
- Chen, J.; Wu, Y. Advancing representation of hydrologic processes in the Soil and Water Assessment Tool (SWAT) through integration of the TOPographic MODEL (TOPMODEL) features. J. Hydrol. **2012**, 420–421, 319–328.
- Yaseen, Z.M.; El-shafie, A.; Jaafar, O.; Afan, H.A.; Sayl, K.N. Artificial intelligence based models for stream-flow forecasting: 2000–2015. J. Hydrol. **2015**, 530, 829–844.
- Dong, N.; Guan, W.; Cao, J.; Zou, Y.; Yang, M.; Wei, J.; Chen, L.; Wang, H. A hybrid hydrologic modelling framework with data-driven and conceptual reservoir operation schemes for reservoir impact assessment and predictions. J. Hydrol. **2023**, 619, 129246.
- Kumar, V.; Sharma, K.V.; Caloiero, T.; Mehta, D.J.; Singh, K. Comprehensive Overview of Flood Modeling Approaches: A Review of Recent Advances. Hydrology **2023**, 10, 141.
- Ikram, R.M.A.; Ewees, A.A.; Parmar, K.S.; Yaseen, Z.M.; Shahid, S.; Kisi, O. The viability of extended marine predators algorithm-based artificial neural networks for streamflow prediction. Appl. Soft Comput. **2022**, 131, 109739.
- Ni, L.; Wang, D.; Wu, J.; Wang, Y.; Tao, Y.; Zhang, J.; Liu, J. Streamflow forecasting using extreme gradient boosting model coupled with Gaussian mixture model. J. Hydrol. **2020**, 586, 124901.
- Meresa, H. Modelling of river flow in ungauged catchment using remote sensing data: Application of the empirical (SCS-CN), Artificial Neural Network (ANN) and Hydrological Model (HEC-HMS). Model. Earth Syst. Environ. **2019**, 5, 257–273.
- Adnan, R.M.; Kisi, O.; Mostafa, R.R.; Ahmed, A.N.; El-Shafie, A. The potential of a novel support vector machine trained with modified mayfly optimization algorithm for streamflow prediction. Hydrol. Sci. J. **2022**, 67, 161–174.
- Meng, E.; Huang, S.; Huang, Q.; Fang, W.; Wu, L.; Wang, L. A robust method for non-stationary streamflow prediction based on improved EMD-SVM model. J. Hydrol. **2019**, 568, 462–478.
- Noori, R.; Karbassi, A.R.; Moghaddamnia, A.; Han, D.; Zokaei-Ashtiani, M.H.; Farokhnia, A.; Gousheh, M.G. Assessment of input variables determination on the SVM model performance using PCA, Gamma test, and forward selection techniques for monthly stream flow prediction. J. Hydrol. **2011**, 401, 177–189.
- Tyralis, H.; Papacharalampous, G.; Langousis, A. A Brief Review of Random Forests for Water Scientists and Practitioners and Their Recent History in Water Resources. Water **2019**, 11, 910.
- Tyralis, H.; Papacharalampous, G.; Langousis, A. Super ensemble learning for daily streamflow forecasting: Large-scale demonstration and comparison with multiple machine learning algorithms. Neural Comput. Appl. **2021**, 33, 3053–3068.
- Song, Z.; Xia, J.; Wang, G.; She, D.; Hu, C.; Hong, S. Regionalization of hydrological model parameters using gradient boosting machine. Hydrol. Earth Syst. Sci. **2022**, 26, 505–524.
- Akbarian, M.; Saghafian, B.; Golian, S. Monthly streamflow forecasting by machine learning methods using dynamic weather prediction model outputs over Iran. J. Hydrol. **2023**, 620, 129480.
- Luo, P.; Luo, M.; Li, F.; Qi, X.; Huo, A.; Wang, Z.; He, B.; Takara, K.; Nover, D.; Wang, Y. Urban flood numerical simulation: Research, methods and future perspectives. Environ. Model. Softw. **2022**, 156, 105478.
- Kumar, V.; Azamathulla, H.M.; Sharma, K.V.; Mehta, D.J.; Maharaj, K.T. The State of the Art in Deep Learning Applications, Challenges, and Future Prospects: A Comprehensive Review of Flood Forecasting and Management. Sustainability **2023**, 15, 10543.
- Niu, W.; Feng, Z. Evaluating the performances of several artificial intelligence methods in forecasting daily streamflow time series for sustainable water resources management. Sustain. Cities Soc. **2021**, 64, 102562.
- Bhasme, P.; Vagadiya, J.; Bhatia, U. Enhancing predictive skills in physically-consistent way: Physics Informed Machine Learning for Hydrological Processes. arXiv **2021**, arXiv:2104.11009.
- Souza, D.P.M.; Martinho, A.D.; Rocha, C.C.; da S. Christo, E.; Goliatt, L. Hybrid particle swarm optimization and group method of data handling for short-term prediction of natural daily streamflows. Model. Earth Syst. Environ. **2022**, 8, 5743–5759.
- Martinho, A.D.; Saporetti, C.M.; Goliatt, L. Approaches for the short-term prediction of natural daily streamflows using hybrid machine learning enhanced with grey wolf optimization. Hydrol. Sci. J. **2023**, 68, 16–33.
- Haznedar, B.; Kilinc, H.C.; Ozkan, F.; Yurtsever, A. Streamflow forecasting using a hybrid LSTM-PSO approach: The case of Seyhan Basin. Nat. Hazards **2023**, 117, 681–701.
- Hao, R.; Bai, Z. Comparative Study for Daily Streamflow Simulation with Different Machine Learning Methods. Water **2023**, 15, 1179.
- Bakhshi Ostadkalayeh, F.; Moradi, S.; Asadi, A.; Moghaddam Nia, A.; Taheri, S. Performance Improvement of LSTM-based Deep Learning Model for Streamflow Forecasting Using Kalman Filtering. Water Resour. Manag. **2023**, 37, 3111–3127.
- Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 2–8 December 2018; Volume 31, pp. 6638–6648.
- Kramer, O. K-Nearest Neighbors. In Dimensionality Reduction with Unsupervised Nearest Neighbors; Springer: Berlin/Heidelberg, Germany, 2013; Volume 51, pp. 13–23.
- Fan, J.; Ma, X.; Wu, L.; Zhang, F.; Yu, X.; Zeng, W. Light Gradient Boosting Machine: An efficient soft computing model for estimating daily reference evapotranspiration with local and external meteorological data. Agric. Water Manag. **2019**, 225, 105758.
- Su, X.; Yan, X.; Tsai, C.-L. Linear regression. Wiley Interdiscip. Rev. Comput. Stat. **2012**, 4, 275–294.
- Gardner, M.; Dorling, S. Artificial neural networks (the multilayer perceptron)—A review of applications in the atmospheric sciences. Atmos. Environ. **1998**, 32, 2627–2636.
- Biau, G.; Scornet, E. A random forest guided tour. TEST **2016**, 25, 197–227.
- Luo, X.; Chang, X.; Ban, X. Regression and classification using extreme learning machine based on L1-norm and L2-norm. Neurocomputing **2016**, 174, 179–186.
- McDonald, G.C. Ridge regression. Wiley Interdiscip. Rev. Comput. Stat. **2009**, 1, 93–100.
- Ryali, S.; Chen, T.; Supekar, K.; Menon, V. Estimation of functional connectivity in fMRI data using stability selection-based sparse partial correlation with elastic net penalty. Neuroimage **2012**, 59, 3852–3861.
- Song, S.; Chaudhuri, K.; Sarwate, A.D. Stochastic gradient descent with differentially private updates. In Proceedings of the 2013 IEEE Global Conference on Signal and Information Processing, Austin, TX, USA, 3–5 December 2013; pp. 245–248.
- Sheridan, R.P.; Wang, W.M.; Liaw, A.; Ma, J.; Gifford, E.M. Extreme Gradient Boosting as a Method for Quantitative Structure–Activity Relationships. J. Chem. Inf. Model. **2016**, 56, 2353–2360.
- Chadalawada, J.; Babovic, V. Review and comparison of performance indices for automatic model induction. J. Hydroinform. **2019**, 21, 13–31.

| Statistic | Flow |
|---|---|
| Mean | 784.8985221 |
| Standard Error | 18.28637548 |
| Median | 184.0000428 |
| Mode | 23.19005239 |
| Standard Deviation | 2210.307722 |
| Sample Variance | 4,885,460.225 |
| Kurtosis | 128.7110287 |
| Skewness | 8.786730848 |
| Range | 60,640.72647 |
| Minimum | 1.270052203 |
| Maximum | 60,641.99652 |
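The statistics above (mean well above the median, skewness near 8.8, excess kurtosis near 128.7) describe a heavily right-skewed inflow series. As a minimal sketch of how such a summary can be reproduced, the snippet below computes the same quantities with pandas; the lognormal draw is a synthetic stand-in for the gauge record, which is not reproduced here, and the specific parameters are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the daily inflow record; a lognormal draw
# mimics the heavy right skew visible in the published summary table.
rng = np.random.default_rng(42)
flow = pd.Series(rng.lognormal(mean=5.2, sigma=1.5, size=7000), name="Flow")

stats = {
    "Mean": flow.mean(),
    "Standard Error": flow.sem(),        # std / sqrt(n)
    "Median": flow.median(),
    "Standard Deviation": flow.std(),    # sample std (ddof=1)
    "Sample Variance": flow.var(),       # sample variance (ddof=1)
    "Kurtosis": flow.kurtosis(),         # excess kurtosis (normal = 0)
    "Skewness": flow.skew(),
    "Range": flow.max() - flow.min(),
    "Minimum": flow.min(),
    "Maximum": flow.max(),
}
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Note that pandas reports *excess* kurtosis (zero for a normal distribution), which matches the spreadsheet-style convention apparently used in the table above.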

| Sr No. | Model | MAE_Train | MSE_Train | RMSE_Train | RMSPE_Train | R^{2}_Train |
|---|---|---|---|---|---|---|
| 1 | CatBoost | 124.89 | 131,672.45 | 362.87 | 150.28 | 0.98 |
| 2 | ElasticNet | 414.90 | 2,304,350.42 | 1518.01 | 853.11 | 0.61 |
| 3 | KNN | 320.95 | 1,773,732.98 | 1331.82 | 310.48 | 0.70 |
| 4 | Lasso | 327.18 | 1,923,781.45 | 1387.00 | 568.25 | 0.67 |
| 5 | LGBM | 215.89 | 863,329.16 | 929.16 | 256.82 | 0.85 |
| 6 | LR | 434.94 | 1,979,323.29 | 1406.88 | 1005.55 | 0.67 |
| 7 | MLP | 298.63 | 1,599,712.13 | 1264.80 | 276.29 | 0.73 |
| 8 | RF | 117.58 | 332,086.13 | 576.27 | 295.72 | 0.94 |
| 9 | Ridge | 330.27 | 1,923,316.06 | 1386.84 | 584.78 | 0.68 |
| 10 | SGD | 366.52 | 1,973,385.04 | 1404.77 | 980.74 | 0.67 |
| 11 | XGBoost | 75.04 | 38,693.90 | 196.71 | 142.99 | 0.99 |
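The metric columns used throughout these tables (MAE, MSE, RMSE, RMSPE, and R^{2}) can be computed as sketched below. The first three and R^{2} are available in scikit-learn; RMSPE is not, so a small helper is defined here as an assumption, since the paper does not reproduce its exact formula. The observed and predicted arrays are illustrative, not the study's data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def rmspe(y_true, y_pred):
    """Root mean square percentage error, in percent.
    Hypothetical helper: one common definition, assumed here."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))

# Illustrative observed and predicted inflows (not the study's data).
y_true = np.array([120.0, 450.0, 80.0, 2300.0, 15.0])
y_pred = np.array([135.0, 430.0, 95.0, 2100.0, 18.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = float(np.sqrt(mse))  # RMSE is simply the square root of MSE
print(f"MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f} "
      f"RMSPE={rmspe(y_true, y_pred):.2f}% R2={r2_score(y_true, y_pred):.2f}")
```

Because RMSPE divides each error by the observed value, it penalizes relative error on small inflows heavily, which helps explain why RMSPE and RMSE can rank the models differently in the tables.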

| Sr No. | Model | MAE_Val | MSE_Val | RMSE_Val | RMSPE_Val | R^{2}_Val |
|---|---|---|---|---|---|---|
| 1 | CatBoost | 261.90 | 1,430,686.30 | 1196.11 | 346.56 | 0.65 |
| 2 | ElasticNet | 385.08 | 1,555,769.49 | 1247.30 | 778.53 | 0.61 |
| 3 | KNN | 329.22 | 1,960,894.83 | 1400.32 | 446.31 | 0.51 |
| 4 | Lasso | 293.32 | 1,156,911.27 | 1075.60 | 538.62 | 0.71 |
| 5 | LGBM | 243.10 | 1,181,938.31 | 1087.17 | 287.91 | 0.71 |
| 6 | LR | 393.23 | 1,194,250.83 | 1092.82 | 992.99 | 0.70 |
| 7 | MLP | 249.45 | 1,069,732.66 | 1034.28 | 307.27 | 0.73 |
| 8 | RF | 259.75 | 1,386,585.60 | 1177.53 | 368.38 | 0.66 |
| 9 | Ridge | 296.56 | 1,157,972.15 | 1076.09 | 579.68 | 0.71 |
| 10 | SGD | 345.98 | 1,183,130.23 | 1087.72 | 908.38 | 0.71 |
| 11 | XGBoost | 264.54 | 1,349,874.60 | 1161.84 | 419.95 | 0.67 |

| Sr No. | Model | MAE_Test | MSE_Test | RMSE_Test | RMSPE_Test | R^{2}_Test |
|---|---|---|---|---|---|---|
| 1 | CatBoost | 108.24 | 135,853.97 | 368.58 | 327.13 | 0.66 |
| 2 | ElasticNet | 267.84 | 195,282.23 | 441.91 | 1308.04 | 0.52 |
| 3 | KNN | 163.42 | 257,940.28 | 507.88 | 1067.24 | 0.36 |
| 4 | Lasso | 183.20 | 141,977.14 | 376.80 | 959.14 | 0.65 |
| 5 | LGBM | 105.68 | 115,456.65 | 339.79 | 332.76 | 0.71 |
| 6 | LR | 292.27 | 209,780.42 | 458.02 | 1424.00 | 0.48 |
| 7 | MLP | 131.03 | 123,120.76 | 350.89 | 466.30 | 0.69 |
| 8 | RF | 123.84 | 152,710.94 | 390.78 | 831.76 | 0.62 |
| 9 | Ridge | 187.82 | 146,634.81 | 382.93 | 996.15 | 0.64 |
| 10 | SGD | 252.24 | 195,665.92 | 442.34 | 1451.56 | 0.51 |
| 11 | XGBoost | 129.03 | 171,242.26 | 413.81 | 1102.39 | 0.58 |
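The three tables come from the same comparative loop: fit each candidate model on the training split, then score it on held-out data with identical metrics. A minimal sketch of that loop, on synthetic lagged-inflow features and restricted to the scikit-learn subset of the eleven models (CatBoost, LGBM, and XGBoost are external dependencies and are omitted here), might look as follows; the data, lags, and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic lagged-inflow features (e.g. inflow at lags t-1..t-3)
# and a target that depends linearly on them, plus noise.
rng = np.random.default_rng(0)
n = 500
X = rng.lognormal(5.0, 1.0, size=(n, 3))
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(0, 5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "KNN": KNeighborsRegressor(),
    "Lasso": Lasso(),
    "RF": RandomForestRegressor(random_state=0),
    "Ridge": Ridge(),
}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = {
        "MAE": mean_absolute_error(y_te, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_te, pred))),
        "R2": r2_score(y_te, pred),
    }
for name, m in results.items():
    print(f"{name}: MAE={m['MAE']:.2f} RMSE={m['RMSE']:.2f} R2={m['R2']:.2f}")
```

Swapping in `catboost.CatBoostRegressor`, `lightgbm.LGBMRegressor`, or `xgboost.XGBRegressor` requires no change to the loop, since all three expose the same `fit`/`predict` interface.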

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Kumar, V.; Kedam, N.; Sharma, K.V.; Mehta, D.J.; Caloiero, T.
Advanced Machine Learning Techniques to Improve Hydrological Prediction: A Comparative Analysis of Streamflow Prediction Models. *Water* **2023**, *15*, 2572.
https://doi.org/10.3390/w15142572
