Predicting Flood Hazards in the Vietnam Central Region: An Artificial Neural Network Approach

Pham Quang, Minh; Tallam, Krti

doi:10.3390/su141911861


by Minh Pham Quang ¹,* and Krti Tallam ²,*

¹ VNU-HCM High School for the Gifted, Ho Chi Minh City 70000, Vietnam

² Department of Biology, Stanford University, Stanford, CA 94305, USA

* Authors to whom correspondence should be addressed.

Sustainability 2022, 14(19), 11861; https://doi.org/10.3390/su141911861

Submission received: 1 August 2022 / Revised: 6 September 2022 / Accepted: 13 September 2022 / Published: 21 September 2022

(This article belongs to the Special Issue Applications of Machine Learning and Big Data Analytics for Environmental Sustainability)


Abstract:

Flooding has negatively impacted Vietnam’s agriculture, economy, and infrastructure with increasing intensity because of climate change. Flood hazards in Vietnam are difficult to combat, as the country is densely covered with rivers and canals. While there are attempts to lessen the damage through hazard mitigation policies, such as early evacuation warnings, these attempts rely heavily on short-term traditional statistical models and physical hydrology modeling, which provide suboptimal results. The current situation is caused by the fragmented approach of the Vietnamese government and underscores the need for more centralized and robust flood prediction systems. Local governments need to employ their own prediction models, which often lack the capacity to draw key insights from limited flood occurrences. Given the robustness of machine learning, especially in low-data settings, in this study we introduce an artificial neural network model that aims to create long-term forecasts and compare it with other machine learning approaches. We trained the models using different variables evaluated under three characteristics: climatic, hydrological, and socio-economic. We found that our artificial neural network model performed substantially better, both in performance metrics (91% accuracy) and relative to other models, and can predict flood hazards well in the long term.

Keywords:

flood risk assessment; artificial neural networks; natural hazards; machine learning; flood forecasting

1. Introduction

Situated in Southeast Asia, Vietnam has been affected by natural hazards, in particular floods. Vietnam ranked sixth among the countries with the highest climate risk during the period from 1999 to 2018 [1] and fifth among the countries most prone to flood risk [2]. With climate change intensifying extreme weather patterns, Vietnam has been increasingly impacted by more erratic storms and floods. In particular, the 2020 monsoon season brought intense flooding, causing 100 deaths, flooding thousands of homes in the Hue province, central Vietnam, and destroying many crops and much infrastructure [3]. Therefore, it is of interest for Vietnam’s policymakers to develop a highly efficient flood mitigation system.

Flood risk is a highly complex factor which involves both natural and socio-economic elements [4,5]. Flood risk can be defined as the probability of being exposed to potential flood hazards. These hazards can come in the form of fluvial floods (river floods), pluvial floods (rainfall floods), and coastal floods. They are triggered by natural processes such as torrential rain, typhoons, and storm surges. Moreover, urbanization in Vietnamese cities has increased the vulnerability of citizens to urban flooding [6]. Given that Vietnam is exposed to a variety of flood hazards, we considered all types of flood hazards as flood risks in this study. As a result, policymakers need to identify interactions between natural risks and other societal risk factors in response to flood damage [7]. Vietnam lists flood control management as one of its national-level objectives under the Law on Natural Disaster Prevention and Control [8]. Vietnam also places great emphasis on employing digital information to create data-driven environmental policies. With the growing availability of data and data analysis tools [9], Vietnam has developed its own national disaster database, the Damage and Needs Assessment system (DANA), to better assess the damages caused by flood disasters [10]. The Vietnamese government has also developed a variety of strategies to control and mitigate flood risk, including preventive measures such as building dikes and dams to divert river flow, and early warning communication systems for evacuation. Mitigation measures include livelihood diversification strategies for households in flood-prone areas, as well as other safety nets [11]. However, current efforts remain insufficient to drastically reduce flood risk and vulnerability [12]. High-quality data analysis tools are limited in Vietnamese policymaking, a problem exacerbated by a lack of research into the development of predictive modeling in Vietnam.

Current research in Vietnam faces some limitations. For instance, there is a lack of research addressing socio-economic parameters in evaluating and analyzing flood risk and prediction [13,14,15]. Additionally, current research often employs traditional statistical models instead of state-of-the-art machine learning models, and therefore inherits the limited ability of traditional models to simulate natural processes. These limitations often hamper the ability of Vietnamese policymakers to make well-informed decisions on flood risk assessment [16]. Given the growing efficiency of machine learning (ML) algorithms in determining relationships between complex variables in flood hazards, ML provides an effective tool for developing mitigation and first-response policies for flood risk. ML predictive models and artificial neural networks (ANNs) have been deployed for many types of natural hazards, such as forest fires [17], earthquakes [18], and thunderstorms [19], in different contexts. Predictive ML models provide new knowledge about the relationship between flooding and environmental variables and provide accurate predictions of the occurrence of floods. These predictive models are often characterized as highly accurate and computationally scalable compared to their traditional statistical modeling equivalents [20]. While flood hazard data are increasing, as seen in the DANA database, the available statistical modeling has only been shown to be capable of indicating a strong causal relationship between natural factors and floods. This is insufficient for Vietnamese decision makers’ flood mitigation strategies, such as risk assessment and rapid flood response.

In the current literature, there are multiple approaches to flood prediction. Some rely on physical simulation, such as MIKE FLOOD [21]. Such methods, while they can indicate flood hazards, are unable to generalize to larger areas, because they require substantial climatic and hydrological data which can only be obtained in limited areas. Others have attempted to predict flood hazards by simulating water runoff using meta-heuristic techniques [22]. Many authors create complex hydrological simulations based on different natural parameters and techniques. These techniques require heavy fine-tuning, which inhibits the generality of the models, have intensive data and computational requirements, and need expert knowledge to perform well. Aside from physical simulation models, researchers have attempted to predict flooding using data-driven statistical models, including certainty factors, logistic regression ensemble methods [23], and ARIMA [24]. However, many studies share similar drawbacks, such as a lack of generality, which requires different approaches for different areas, and a need for long-term data, in many cases upwards of 100 years, which are unavailable in many areas. Moreover, studies have been conducted to integrate different forms of data such as GIS and remote sensing. These studies integrate various geospatial features from LANDSAT and DEM imagery into traditional methods such as logistic regression [25,26] or MCDM–AHP models [27]. While geospatial variables can deliver promising results, there are several drawbacks in the current research. High-quality GIS data are often limited, which hampers model performance. Moreover, GIS data often lack the temporal components necessary for many models and cannot be used to detect patterns that are direct causes of flooding [14].

A current field of research interest is the application of ML algorithms to flood forecasting. ML algorithms are promising because they provide distinctive advantages over traditional models, including the ability to handle non-linear relationships and to integrate a wide range of parameters, including socio-economic variables [28]. Research has been conducted using tree-based ensemble ML algorithms such as AdaBoost [29], bagging [30] and stacking LWLR-RF [31], and random forest [32]. Additional studies have been conducted using Support Vector Machine (SVM) [33], Genetic Algorithm Rule-Set Production [34], and Quick Unbiased Efficient Statistical Tree [34]. These methods offer promising results in the field of flood prediction and forecasting.

There are still gaps in the scientific community's application of ML to flooding. The recent advancements in ANNs are still underutilized, as little research has been conducted on applying ANNs to predict flood hazards in Southeast Asia, such as in Vietnam. This forces local governments to rely on limited information and traditional statistical models instead of state-of-the-art predictive models, especially ANNs, which can provide specific predictions and identify key relationships [13]. Finally, another critical gap in flood prediction research is the lack of consideration for socio-economic variables in many ML studies. The few papers that do include socio-economic parameters only work with precise geospatial maps, which are extremely limited in the context of Vietnam and other economically developing countries [7,15].

This study is our attempt at resolving these gaps. Here, we explore whether ML algorithms, particularly ANNs, can provide accurate and accessible predictive models for long-term flood forecasting. We aimed to develop highly accurate predictive models for flood hazards that are easy to deploy. We analyze how ANNs perform in the context of flooding in Vietnam and in low-information settings, and how socio-economic variables affect flood model performance. Our methods provide a novel approach to flood predictive modeling in Vietnam, as we introduce the use of ANNs in long-term flood prediction and perform a comparative study of the accuracy of different ML models. We also consider socio-economic features that have not been evaluated by previous studies using machine learning [3,13,14] and evaluate their significance.

2. Materials and Methods

We present our findings in this study as follows: we first introduce our data collection and cleaning methodologies. We compile our data into training and validation datasets through standard random 80:20 splitting. The choice of this split is based on previous studies showing generally optimal results with this ratio [35]. Other split ratios could be considered in the future. We present ML predictive models trained on the training dataset and compare their respective accuracy on the validation dataset. Finally, we highlight the accuracy of ANNs when compared with other ML algorithms. We call for more fine-resolution flood hazard data for flood prediction and management in Vietnam.

2.1. Study Region

We focus our study on 15 provinces, totaling 139,000 km², in central Vietnam. The approximate geographic location of the region is latitudinally from 10°33′ N to 19°50′ N and longitudinally from 104°92′ E to 106°47′ E. More specifically, selected provinces such as Da Nang, Thua Thien Hue, Nha Trang, Binh Dinh, and Nghe An were chosen for analysis based on increased availability of data.

Central Vietnam is characterized by a long coastline and heavy rainfall. As such, the area often faces coastal and flash flooding. A combination of geographical and climatic features, including mountainous terrain, rainfall upwards of 2000 mm per year, and tropical storms, has made the region a hotspot for floods. With an average of more than ten major floods per year, central Vietnam is affected by floods in terms of socio-economic development, agricultural and infrastructural damage, and human fatalities. Given the frequency and impact of flooding in the region and the availability of flood data, we decided that these 15 provinces would provide suitable ground for our models.

2.2. Data Collection

2.2.1. Data Characteristics

We collected data based on three evaluation characteristics: climate, hydrology, and socio-economics. Climate features include monthly precipitation, monthly rainfall, and monthly temperature. Hydrological features include monthly river flow volume and average river velocity of major rivers in the central region of Vietnam. Socio-economic features include population density and urbanization percentage in the central region of Vietnam. We also collected the monthly number of flooding events and monthly estimated flood damages in the area. The selection of natural variables such as rainfall and precipitation was based on previous work indicating a strong relationship between these variables and flooding [12]. The variables were presented in time-series format, with weather, meteorological, and hydrological stations chosen from specific locations in central Vietnam, including Da Lat, Da Nang, Hue, Nha Trang, Playku, Qui Nhon, and Vinh.

Monthly precipitation and rainfall are crucial in affecting the likelihood of flooding. Rain is a major contributor to the accumulation of water in rivers and as such is proportional to the probability of river runoff causing flooding. Heavy torrential downpours are also likely to cause floods, as they introduce excessive quantities of water, which exceed the river’s holding capacity over a given time period. This results in severe damage to local infrastructure [36].

Monthly temperature influences rainfall and soil absorption. Higher temperature means more water will evaporate, causing more rainfall. The inverse is true at lower temperatures: soil contains less moisture and absorbs less water. This allows water to move freely in rivers, causing rapid river flow and flood risk [37].

Monthly river volume indicates the amount of water passing through a portion of the river. The intensity of a flood event is based on the excess water volume over the area, causing heavier river runoff and flood events [38].

River flow velocity is the speed at which water passes through a specific portion of the river on average. The intensity of velocity correlates with the likelihood of high-risk floods, as less time is given for locals to evacuate and defend against flood hazards [37].

Population density plays a vital role in understanding the extent of damages and likelihood of high-profile floods in the area. High population densities exacerbate flood damages due to the large number of homes and the value of infrastructure. Increased human populations also mean more human activity, which can often directly alter the flow of water and subsequently increase the probability of flooding [39].

Urbanization percentage directly affects flood risk probability of the area. In more urban areas, buildings are made with more flood-resistant materials and the government deploys better flood mitigation policies in cities to decrease flood risk. In urban areas, there are more dikes, dams, and other flood preventive measures, reducing the probability of floods [40].

We also identified time-series data as the most suitable for training ML and ANN models and making predictions with them. Flooding is heavily seasonal in Vietnam and is influenced by previous trends and weather patterns. The dataset would be representative of the more important variables for predictive ML models and be used for training and validating our models.

2.2.2. Data Collection

We collected the climate, hydrology, and socio-economic data from the General Statistics Office of Vietnam. These data points were collected from various Vietnamese governmental agencies and meteorological stations and presented by the General Statistics Office of Vietnam. For climatic and socio-economic data, the observations were only available as aggregates for each province. This meant that the specific locations of the stations in each province are not given in the dataset and only the averages of the values recorded by all stations were available. Furthermore, for hydrological data, only the rivers on which the observations were conducted were given. This required hand-labeling the hydrological observations with the specific provinces in the dataset. Only provinces in central Vietnam were chosen, while the rest were filtered out. The rationale was that flood hazard data are available in the form of time series for the central region of Vietnam and that the central region is often exposed to flood hazards [6].

For the monthly number of flooding events and monthly estimated flood damages, data were compiled using the EM-DAT and UN DesInventar flood risk assessment datasets. These datasets include dates of flood hazards, their locations, and estimated economic damages in VND.

We collected a total of 288 observations based on flood events in the central region of Vietnam during the period of 2002–2020. We joined our flood hazard event dataset with the corresponding monthly data to label each monthly instance according to whether a flood hazard occurred, thereby creating a labeled dataset for supervised machine learning.

2.2.3. Data Cleaning

When collecting our dataset, we discovered that our data on climate, hydrology, and socio-economic factors had missing data points. We imputed the missing values using the arithmetic average method, replacing each missing data point with the average of all other available data points, across time, for the same feature. By filling each missing entry with the feature average, we ensured that the predictive model would still be able to generalize and predict on unseen cases [41]. Of the 288 observations, 10 contained missing data points, which were synthetically generated in this way.
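The mean-based filling described above can be sketched in a few lines. This is a minimal illustration, not the study's own code: it assumes missing values are encoded as NaN and that each column holds one feature. The function name `fill_with_column_mean` and the sample values are hypothetical.

```python
import numpy as np

def fill_with_column_mean(data):
    """Replace NaN entries with the mean of the observed values in the same column."""
    data = np.asarray(data, dtype=float)
    col_means = np.nanmean(data, axis=0)        # per-feature mean, ignoring NaN
    filled = data.copy()
    rows, cols = np.where(np.isnan(filled))     # positions of the missing entries
    filled[rows, cols] = col_means[cols]
    return filled

# Hypothetical monthly records: [precipitation (mm), temperature (deg C)]
sample = np.array([[120.0, 25.0],
                   [np.nan, 27.0],
                   [180.0, np.nan]])
filled = fill_with_column_mean(sample)
```

Here the missing precipitation value becomes the mean of 120 and 180, and the missing temperature becomes the mean of 25 and 27. Mean filling ignores seasonality, so it is best suited to a small number of gaps, as in this dataset.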

We also discovered that the variables were recorded on varying timescales, which made it challenging to train effectively on the dataset. Therefore, we converted the dataset to a monthly frequency by assigning each incident to the corresponding month. The datasets were originally split across different files; hence, we merged them into a single flood-incidence dataset.

2.3. Data Augmentation and Model Implementation

2.3.1. Experimental Setup and Software Materials

All data processing and analysis were conducted using Python 3.9 on Google Colab. NumPy was employed to clean and process the dataset. Scikit-learn was used to implement SVM and K-Nearest Neighbor (KNN). The artificial neural network (ANN) was implemented using Keras and TensorFlow, Google's deep learning library [42].

We developed our models using Python and TensorFlow, instead of other tools such as Weka, to ensure ease of access and generalizability. This was done so that our model integrates more easily into other modeling pipelines in the future (modeling software such as Weka does not provide the code for its models’ implementation).

2.3.2. Dataset Rebalancing

During the experimental phase of developing predictive ML models, we determined that early experimental ML models were more accurate at predicting months without flood (the majority class) than months with flood (the minority class), while our focus was on flood prediction. After reanalyzing the dataset, we concluded that the bias was caused by an imbalance between our majority and minority classes, which skewed our predictions towards the majority class. To counteract the imbalance in our dataset, we employed the SMOTE + ENN algorithm to generate synthetic data points and balance the majority and minority classes. The SMOTE algorithm produces a balanced class distribution by creating new minority class instances from the preexisting minority class data. For each sample in the minority class, a synthetic training sample is generated by randomly selecting one of its K-nearest minority class neighbors and choosing a random position on the line segment between the two points in feature space [43]. We noted that by employing only SMOTE our classifier may overfit our training data, causing poor performance on testing data. This is because SMOTE introduces noise and causes class clusters to invade each other’s space, forcing models to create complex decision boundaries which have poor generalizability. We addressed this issue by under-sampling the majority class using the ENN algorithm. The ENN algorithm removes an instance of the majority class if its nearest neighbors misclassify it. By combining the SMOTE and ENN algorithms, classifier models can infer and make accurate predictions with less bias and overfitting [44].

After balancing our dataset using SMOTE + ENN, we obtained a new dataset of 388 observations with an equal number of observations in the majority and minority classes.
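To make the SMOTE step concrete, the sketch below generates a single synthetic minority point by interpolating between a minority sample and one of its nearest minority neighbors. It is an illustration under stated assumptions, not the implementation used in the study: the function name `smote_sample`, the default `k=5`, and the toy points are hypothetical, and the ENN cleaning step is omitted. It assumes at least two minority samples.

```python
import numpy as np

def smote_sample(minority, k=5, rng=None):
    """Create one synthetic minority point by interpolating toward a near neighbor."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    i = rng.integers(len(minority))                   # pick a random minority sample
    dists = np.linalg.norm(minority - minority[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]            # k nearest, skipping the sample itself
    j = rng.choice(neighbors)                         # pick one of those neighbors at random
    gap = rng.random()                                # random position in [0, 1)
    return minority[i] + gap * (minority[j] - minority[i])

# Hypothetical minority-class points in a 2-D, min-max-scaled feature space
flood_months = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sample(flood_months, k=2, rng=0)
```

Because the new point lies on the segment between two existing minority points, it always stays inside the region already spanned by that class. In practice a library resampler is more convenient; for example, imbalanced-learn provides a combined `SMOTEENN` class, although the source does not state which implementation was used.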

2.3.3. Data Preprocessing

We normalized all our variables using Min-Max normalization [45] on our dataset to ensure each data point was scaled to the same range from 0 to 1. The Min-Max normalization function is defined as Equation (1):

x_scaled = (x − x_min)/(x_max − x_min)    (1)

where x is the variable being scaled, x_max is the maximum value of x over all observations of the variable, x_min is the minimum value of x over all observations of the variable, and x_scaled is our result. We scaled our dataset because some variables were measured using different scales, such as mm for rainfall or m², m³, or seconds for river flow discharge. We normalized our variables so that variables measured in different ranges have equally weighted contributions towards the final model, avoiding bias.
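As a worked example of Equation (1), the following sketch scales one feature to the range 0 to 1. The function name `min_max_scale` and the rainfall values are hypothetical; the guard for a constant feature is an added assumption, since Equation (1) is undefined when x_max equals x_min.

```python
import numpy as np

def min_max_scale(x):
    """Scale a 1-D feature to [0, 1] following Equation (1)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:                  # constant feature: avoid division by zero
        return np.zeros_like(x)
    return (x - x_min) / (x_max - x_min)

rainfall_mm = [50.0, 200.0, 125.0]      # hypothetical monthly rainfall values
scaled = min_max_scale(rainfall_mm)     # 50 -> 0.0, 200 -> 1.0, 125 -> 0.5
```

A common practice is to compute x_min and x_max on the training data only and reuse them for validation data, so that no information from the held-out set enters the scaling.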

We created a validation dataset as a method to evaluate the performance of each of our models and fine-tune the hyperparameters of our models. The validation is hidden from the model at the training period, which allows us to obtain an unbiased evaluation of our models’ performance. We divided our dataset into training dataset and validation dataset. The dataset was divided such that 80% of the dataset was used for training; 20% was used for validation. There was no overlap between the training dataset and validation dataset.
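A random 80:20 split with no overlap can be sketched as follows. This is a minimal illustration rather than the study's code: the function name `train_val_split`, the seed, and the toy arrays are assumptions.

```python
import numpy as np

def train_val_split(X, y, val_fraction=0.2, seed=0):
    """Shuffle the row indices once, then hold out `val_fraction` of them for validation."""
    X, y = np.asarray(X), np.asarray(y)
    order = np.random.default_rng(seed).permutation(len(X))
    n_val = int(round(len(X) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], X[val_idx], y[train_idx], y[val_idx]

# Hypothetical dataset: 100 rows, each identified by its own index
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) % 2
X_train, X_val, y_train, y_val = train_val_split(X, y)
```

Because the validation rows are drawn from a single permutation of the indices, every row lands in exactly one of the two sets.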

2.3.4. Machine Learning Algorithms

We framed our prediction task as a binary classification problem. We labeled months with flood as 1 and months without flood as 0. We determined that binary classification is the most suitable formulation of our problem, as we were only interested in predicting flood events. Binary classification algorithms are a set of supervised learning algorithms that predict only binary outcomes. We chose a variety of algorithms, including ANNs, SVM, KNN, and Linear Regression, to train on our dataset. We chose these algorithms because they represent state-of-the-art ML predictive modeling techniques and provide powerful methods to predict flooding. While our datasets are small compared to datasets frequently used in ML, we determined that the ML algorithms we chose work well in low-information settings [46,47]. ML models make better use of data to identify hidden patterns, as they can extract complex features through their autonomous learning process. These models provide more accurate results than traditional mathematical modeling, which draws strict rules from human expertise. We focus on ANNs in this study, as they perform better for our goals than the other algorithms.

Artificial Neural Network

ANN models are characterized by their ability to learn from data and extract information from key features [20]. They provide insight into the relationship between important variables causing floods, while making predictions on the likelihood of flooding events. They provide a method for policymakers to make better decisions by analyzing and extracting more information from the dataset while reducing policymakers’ human bias.

ANNs consist of different layers and are tuned using hyperparameters. The dataset is passed through the input layer and hidden layers. The hidden layers consist of layers of neurons, which feed into each other through a set of connections with different weights. We used the Rectified Linear Unit (ReLU) as the activation function of the hidden neurons to mitigate the vanishing gradient problem and decrease our computation cost [48]. We used the sigmoid function as our final layer activation function because it has a smooth gradient and produces a result in the range of 0 to 1, which is suitable for the binary classification problem [49]. We measured the performance of our ANNs through the loss function. The loss function measures the difference between predicted outcomes and actual outcomes, providing a way to update the ANNs’ weights and improve their accuracy. We chose Binary Cross-Entropy, as our problem deals with binary classification and this loss function also works well with the sigmoid function. ANNs update their weights and learning rates using an optimizer. Optimizers are especially important, as they drive the model towards the lowest possible loss and the most accurate prediction. We chose the Adam optimizer for the model, as it generally performs better than most optimizers and requires much less computational resources.

We also tuned the other hyperparameters of the model to optimize its performance. These hyperparameters include the number of epochs (the number of passes over the training dataset) and the dropout rate (the probability with which a neuron is ignored during training) [50]. We determined the best set of hyperparameters through trial and error. We recorded the metrics obtained with each hyperparameter setting and then selected the best performing model. For the number of epochs, we tested several values between 5 and 400. We found that with an increasing number of epochs, the loss decreases and the performance of the model increases. Our model stabilized at 150 epochs, with diminishing returns afterwards. We also experimented with different dropout rates and determined that a dropout rate of 0.5 was most suitable to ensure minimal overfitting in the model.
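The components named above (ReLU hidden units, a sigmoid output, and the Binary Cross-Entropy loss) can be illustrated with a small NumPy forward pass. This sketch is not the Keras model used in the study: the single hidden layer, its width of 16, the random weights, and the input values are all assumptions, and training with Adam and dropout is left out.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """One hidden ReLU layer followed by a single sigmoid output unit."""
    hidden = relu(X @ W1 + b1)
    return sigmoid(hidden @ W2 + b2).ravel()      # flood probability per row

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; eps keeps log() away from zero."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

rng = np.random.default_rng(0)
n_features, n_hidden = 7, 16                      # 7 features as in the study; 16 is assumed
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)

X = rng.random((4, n_features))                   # four hypothetical min-max-scaled months
y = np.array([0, 1, 0, 1])                        # 1 = flood month, 0 = no flood
probs = forward(X, W1, b1, W2, b2)
loss = binary_cross_entropy(y, probs)
```

The sigmoid keeps every output strictly between 0 and 1, so it can be read as a flood probability and passed directly to the loss. An optimizer such as Adam would then adjust `W1`, `b1`, `W2`, and `b2` to reduce `loss` over successive epochs.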

Other Algorithms

We additionally developed a series of comparative ML prediction models to assess the strength of our ANN model's performance. Each model was trained using the same dataset and under the same testing conditions. We chose these algorithms because they work well with the binary classification problem and provide a good baseline against which to discuss the results of our ANN model [47,51].

K-Nearest Neighbor (KNN) is used for a variety of problems in ML. It is relatively simple to implement and is computationally efficient, only needing to calculate the relationship between an observation and “k” of its nearest neighbors. The algorithm works by computing the Euclidean distance from the observation to each of its nearest neighbors and then classifying the observation through a plurality vote weighted by distance. K here is defined as the number of neighbors being weighted in calculating the cost of each sample point [52]. We tested different values to find the optimal k for our model and found that k = 7 provides the best result overall. Table 1 shows the accuracy for each ‘k’ tested for the model.
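The distance-weighted vote described above can be sketched as follows. This is an illustrative implementation, not the scikit-learn model used in the study: the function name `knn_predict`, the inverse-distance weighting, and the toy points are assumptions, while the default k = 7 follows the value reported above.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=7):
    """Classify x_new by a distance-weighted vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    dists = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    weights = 1.0 / (dists[nearest] + 1e-9)       # closer neighbors count more
    votes = {}
    for label, w in zip(y_train[nearest], weights):
        votes[int(label)] = votes.get(int(label), 0.0) + w
    return max(votes, key=votes.get)              # label with the largest total weight

# Hypothetical scaled observations: label 1 = flood month, 0 = no flood
X_toy = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
y_toy = [0, 0, 0, 1, 1, 1]
label_low = knn_predict(X_toy, y_toy, [0.05, 0.05], k=3)
label_high = knn_predict(X_toy, y_toy, [0.95, 0.95], k=3)
```

An odd k such as 7 reduces the chance of tied votes in a two-class problem, and the distance weights break most of the remaining ties.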

Support Vector Machine (SVM) is a machine learning algorithm designed for classification and regression problems. It works by creating a best-fit hyperplane or hyperplanes between the groups of inputs. An SVM model determines the hyperplane by maximizing the margin between the classes. However, this can only be done directly if the inputs are linearly separable; for non-linear inputs, the problem is resolved via a kernel. A kernel is a function which allows the model to draw the hyperplane on a non-linear dataset by implicitly mapping the data into a higher-dimensional (for some kernels, infinite-dimensional) feature space. It ensures that the algorithm can find the right hyperplane most of the time [53]. For our model, we chose the Radial Basis Function (RBF) as our kernel, as it gave the best performance overall.
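For reference, the RBF kernel measures similarity between two observations as K(x, y) = exp(−γ‖x − y‖²). The sketch below evaluates it directly; the function name `rbf_kernel` and the value of `gamma` are assumptions, since the source does not report the kernel parameters.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2); equals 1 for identical points."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

same = rbf_kernel([0.2, 0.8], [0.2, 0.8])     # identical points
near = rbf_kernel([0.0, 0.0], [0.1, 0.0])     # close points, similarity near 1
far = rbf_kernel([0.0, 0.0], [3.0, 0.0])      # distant points, similarity near 0
```

The similarity decays smoothly with distance, which is what lets the SVM draw curved decision boundaries in the original feature space. In scikit-learn, this kernel is selected with `SVC(kernel="rbf")`.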

3. Results

3.1. Feature Importance

Figure 1 shows the contribution of each of the seven conditioning features considered in the study. The ranking is generated through an SVM model and indicates the weight of each variable in the training process of each ML model. One variable from each of the three evaluation characteristics ranks among the top three most important features (precipitation, river flow volume, population density). It is worth noting that the contribution index of precipitation ranked first and is almost double the value of the second most important variable in the model. The remaining variables are rainfall, river flow speed, urbanization percentage, and temperature. Temperature was the least important contributing factor.

Evaluating each variable's importance is a vital step in understanding the flood prediction problem. It gives insights into which factors require attention and which are most useful in the training process of ML models. By looking at the importance of each variable, we can understand more about the sensitivity of the ML model to each feature. It is interesting to note that while precipitation is ranked first in importance, rainfall is only ranked fourth. One possible explanation is that precipitation has a strong correlation with rainfall, determining its frequency and intensity while influencing other flood-related factors such as moisture and river runoff, thus justifying its high importance.

Another notable fact is that population density is ranked third most important in our study. To our knowledge, socio-economic characteristics are often overlooked in many other similar studies in Vietnam, which limits the ability of models to gain useful information for prediction. This suggests that future studies should include more socio-economic factors, as they provide crucial knowledge for the training of ML models.

3.2. Training and Validation of ANN Model

Figure 2 and Figure 3 show the accuracy and loss curves of our ANN model. Here, the most accurate model with the selected hyperparameters is presented. Other ANN models provided results that were worse than or similar to those of the ANN presented here. We can see an overall trend in the accuracy and loss curves of both the training and validation datasets, which reach near-optimal prediction performance at epoch 150, with diminishing returns onwards. The curves for the training dataset were not as stable as those for the validation dataset. This is attributed to the relatively limited amount of information available for the ANN models to learn from. However, the model still performs well on validation, indicating the ability of the ANN model to learn and generalize from complex flood data.

Looking at the accuracy curve of the ANN, we notice a steady improvement in the model after 400 epochs (the accuracy curve stabilized at around 0.55 until epoch 150, and then plateaued at 0.91). While the training accuracy was slightly higher than the validation accuracy, it was also more unstable. Although such instability can be a sign of overfitting, the validation curve remained relatively stable. The difference in accuracy between the validation and training datasets stays within 5% for most of the training period, which suggests that the model generalizes well when predicting flood hazards. The model shows promising potential to extract information for flood prediction, as its accuracy increases over successive passes through the dataset.

The loss curve follows a similar trend to the accuracy curve. The training loss started at almost 1.0, which is quite high, and the model was then able to reduce it substantially through optimization. The loss curves for both the training and validation datasets decrease significantly until epoch 150 and then plateau. At the end of the training period, the loss reached a value of 0.1, much smaller than the starting value. As with the accuracy curves, the validation curve is more stable than the training curve, while the difference between the two remains relatively small. This behavior is indicative of the strengths of an ANN model: the ability to optimize prediction results while still performing well in unseen situations.
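The two observations made about the curves (where they flatten, and how close training and validation stay) can be checked programmatically. The following is a rough sketch: the plateau definition (`window`, `tolerance`), both function names, and the synthetic curves are assumptions made for illustration and do not reproduce the study's training history.

```python
def plateau_epoch(values, window=10, tolerance=0.01):
    """Return the first epoch from which the curve varies by less than
    `tolerance` over the next `window` epochs, or None if it never flattens."""
    for start in range(len(values) - window):
        segment = values[start:start + window + 1]
        if max(segment) - min(segment) < tolerance:
            return start
    return None

def fraction_within_gap(train, validation, max_gap=0.05):
    """Share of epochs where |train - validation| is at most `max_gap`."""
    close = sum(1 for t, v in zip(train, validation) if abs(t - v) <= max_gap)
    return close / len(train)

# Synthetic, hypothetical curves: validation accuracy rises linearly and
# saturates at 0.91; training accuracy sits slightly above it.
val_acc = [min(0.91, 0.55 + 0.0024 * epoch) for epoch in range(400)]
train_acc = [min(1.0, v + 0.02) for v in val_acc]

print("plateau starts near epoch:", plateau_epoch(val_acc))
print("share of epochs within a 5% gap:", fraction_within_gap(train_acc, val_acc))
```

A check like this could also serve as an early-stopping criterion, although whether early stopping was used in the study is not stated.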

3.3. ML Models Performance Comparison

We evaluated the performance of each of our models using accuracy, precision, recall, and f1-score. We chose these metrics because we want to evaluate how well the models predict flood outcomes. These metrics are calculated from the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) outcomes of each model's predictions on our validation dataset. Each metric is defined by Equations (2)–(5):

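$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\text{f1-score} = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}} \tag{5}$$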

The equations above quantify how well a classifier performs in a flood prediction problem. The primary metrics used to evaluate model performance are f1-score and accuracy.
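Equations (2)–(5) translate directly into a few lines of code. The following is a minimal sketch; the function name `classification_metrics`, the example counts, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration and are not taken from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision, and f1-score from confusion counts.

    Follows Equations (2)-(5). A metric whose denominator is zero is
    reported as 0.0 (an illustrative convention, not the study's).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}

# Hypothetical confusion counts for a flood / no-flood classifier.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```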

Accuracy is the most frequently used metric in classification problems. It measures the proportion of correct predictions made by the model. However, relying on accuracy alone has shortcomings, most notably on an imbalanced dataset, where accuracy rewards a model for focusing on the majority class even if it fails to classify the minority class. Models trained on initial, imbalanced versions of the dataset often had trouble classifying the minority class, which led to poor results in predicting flood hazards. This problem was later resolved using resampling and rebalancing algorithms, but it also indicates the need for additional metrics to complement accuracy [54].

The f1-score is the harmonic mean of precision and recall and thus incorporates both metrics into a single score. Using the f1-score balances recall and precision and captures aspects of both. The f1-score is often used on imbalanced datasets to analyze the performance of each model more accurately. For this reason, the f1-score can be used to compare different ML models [54].
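To make the difference between accuracy and f1-score concrete, consider a hypothetical validation set of 100 observations, 10 of which are floods (these counts are illustrative assumptions, not results from this study). Suppose a classifier yields TP = 2, FN = 8, FP = 2, and TN = 88. Applying Equations (2)–(5) gives:

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{2 + 88}{2 + 88 + 2 + 8} = 0.90, \\
\text{Recall}    &= \frac{2}{2 + 8} = 0.20, \\
\text{Precision} &= \frac{2}{2 + 2} = 0.50, \\
\text{f1-score}  &= \frac{2 \times (0.50 \times 0.20)}{0.50 + 0.20} \approx 0.29.
\end{aligned}
```

Accuracy looks strong at 0.90, yet the classifier misses 8 of the 10 floods, which the f1-score of about 0.29 makes visible.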

We can calculate each of these metrics from each model's confusion matrix in Figure 4.

We trained our models based on our training dataset and tested each model on our validation dataset. The results here showcase the optimal combination of parameters.

We also note the different training times of the models. Training the ANN took 5 min, compared with 5 s for KNN and 10 s for SVM. The much longer processing time for the ANN can be attributed to the fundamentally different way in which ANNs are trained and generate predictions compared with KNN and SVM. The better performance of the ANN over KNN and SVM is due to its training process, its ability to cope with noisy data, and its lower need for data augmentation. The ANN model is also able to distinguish between flood and no-flood classes with higher accuracy than the other two models because of its ability to discover hidden patterns in the data.
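Part of the gap in training time comes from how little work a KNN classifier does at training time: in its basic brute-force form it only stores the data and defers all computation to prediction, whereas an ANN iterates over the data for many epochs to adjust its weights. The sketch below is a generic illustration; the class `KNNClassifier`, the value of `k`, and the toy features are assumptions and do not reproduce the study's implementation.

```python
import math
from collections import Counter

class KNNClassifier:
    """Minimal brute-force k-nearest-neighbours classifier."""

    def __init__(self, k=3):
        self.k = k
        self.rows = []
        self.labels = []

    def fit(self, rows, labels):
        """'Training' only stores the data; no weights are optimized."""
        self.rows = list(rows)
        self.labels = list(labels)
        return self

    def predict(self, row):
        """Majority vote among the k stored rows nearest to `row` (Euclidean distance)."""
        nearest = sorted(
            zip(self.rows, self.labels),
            key=lambda pair: math.dist(pair[0], row),
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

# Hypothetical toy data: (monthly_precipitation_mm, river_velocity_m_s); 1 = flood.
train_rows = [(300, 2.5), (320, 3.0), (280, 2.8), (60, 0.9), (80, 1.1), (50, 0.7)]
train_labels = [1, 1, 1, 0, 0, 0]

knn = KNNClassifier(k=3).fit(train_rows, train_labels)
print(knn.predict((310, 2.7)))
print(knn.predict((70, 1.0)))
```

In practice the features would usually be scaled before computing distances, since otherwise the variable with the largest numeric range dominates the vote.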

We present our results in Table 2. Both the SVM and KNN have average results on the validation dataset, with accuracy at 74% and 79%, respectively. The two models also perform poorly in predicting the minority class (when a flood does happen), indicating that they cannot extract information from the dataset as well as the ANN model. It is interesting to note that rebalancing the dataset improved the performance of the ANN models only, while the SVM and KNN models did not see significant improvement. One possible explanation is that, given the limited size of the dataset, these models were unable to process the information well enough to produce a high-performing predictive model.
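The rebalancing algorithm is not named in this section, so the sketch below uses plain random oversampling as a stand-in: minority-class rows are duplicated at random until every class is as large as the biggest one. The function name `random_oversample` and the toy data are assumptions for illustration.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until all
    classes match the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical imbalanced toy data: six no-flood months (0) and two flood months (1).
rows = [(60,), (80,), (50,), (90,), (70,), (40,), (310,), (290,)]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

Resampling of this kind is normally applied to the training split only, so that the validation data keep the original class distribution.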

Overall, the results show that the ANN has an accuracy of 91% and performs significantly better on every metric (accuracy, precision, recall, and f1-score) than the SVM and KNN models. We note that while the ANN is better at predicting flood events, there is only marginal improvement in classifying no-flood events. We also note the improvement of the ANN in predicting true positive floods compared with the other models. While the other models fail to make accurate predictions when classifying flooding conditions, the ANN allows for accurate classification of both classes. The performance of the ANN can be attributed to data augmentation, dataset rebalancing, and hyperparameter tuning. The ANN benefitted from the dataset rebalancing algorithms and can extract information and produce balanced predictions. Hyperparameters, such as dropout, have also helped ensure that the ANN does not overfit and remains stable. Thus, the ANN can be considered the best-performing model overall, as it is better on all of our evaluation metrics.
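As a reminder of how dropout limits overfitting, the sketch below implements the standard "inverted dropout" operation on a list of activations: during training each unit is zeroed with probability `rate` and the survivors are rescaled, while at inference the activations pass through unchanged. The function name, the rate of 0.5, and the example activations are hypothetical; the dropout rate used in the study is not given in this section.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate` during
    training and rescale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(42)
hidden = [0.8, 0.1, 0.5, 0.9, 0.3, 0.7]  # hypothetical hidden-layer activations
print(dropout(hidden, rate=0.5, rng=rng))                  # training: some units zeroed
print(dropout(hidden, rate=0.5, rng=rng, training=False))  # inference: unchanged
```

Rescaling by `1 / (1 - rate)` keeps the expected activation the same in training and inference, so no adjustment is needed when the model is deployed.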

3.4. Potential Applications of ANN Model

The ANN model presented in this study can be employed in alert systems and long-term hazard prediction. The model requires only small amounts of data at relatively coarse time intervals, such as monthly precipitation and average river velocity, which are more readily available. This is particularly useful in the context of Vietnam, where stations have difficulty gathering data for more data-intensive models. The data we used were one-dimensional (i.e., rainfall and environmental data) because two-dimensional data are challenging to obtain in the context of central Vietnam. There is therefore further potential for incorporating two-dimensional data in future model iterations.

Further extending this point, the ANN model can successfully predict multiclass flooding events in central Vietnam. The model can sift through a variety of data features and deal with different types of flood hazards from different sources in the region. It can create highly accurate predictions at monthly intervals over 12-month periods and is able to capture patterns in the occurrence of flood hazard events in the region.

Given the increasing climatic complexity of natural and hydrological confluences in central Vietnam, it is imperative that ML models be employed for flood forecasting. Flood hazard alert systems in Vietnam face problems because data from river stations are often missing or poorly structured. The ability to forecast flood events at larger monthly intervals, instead of daily intervals, allows policymakers to identify key areas that require more attention and more specialized flood prediction models.

Another key feature of the ANN model is its ability to incorporate different conditioning factors, such as socio-economic factors. Many physical and statistical models are developed with a focus only on natural and hydrological elements, without considering the role of human activities in flood hazards. As many of them simulate water processes, it is difficult for them to incorporate the effect of housing near rivers or man-made flood prevention structures. The ANN model overcomes this problem because it operates without simulating the environment, focusing instead on discovering patterns in flood environment data.

Incorporating the ANN model in the central region of Vietnam can substantially help government officials develop specific mitigation and prevention policies for the region. The model worked well in the low-information context of Vietnam: it requires no satellite or image data and can integrate many types of parameters. With the proposed predictive model, general trends of flood events over a long period of time can be produced, generating valuable insights into how often flood hazards occur in the region. Given the robustness and accessibility of the model, it can be used in different contexts to alert policymakers, years in advance, to whether action will be needed to prepare for potential flood hazards.

4. Discussion

This study presented new predictive ML models using ANNs and compared them with KNN and SVM models for evaluating flooding events in Vietnam. We introduced new evaluation variables that factor in the socio-economic aspect of flood risk in Vietnam and employed a more accessible experimental setup. We demonstrated the superior performance of our ANNs compared with the other models in identifying flood conditions. The ANN model presented is a state-of-the-art long-term forecasting model that can be deployed in data-limited areas, which is useful in the context of central Vietnam. Our methods also have potential applications in other hazard-prone regions of Vietnam that face similar difficulties with data availability.

Identifying and accurately predicting the occurrence of floods provides a helpful tool for policymakers in flood risk prevention and mitigation policy. A long-term flood prediction model gives government officials valuable patterns and insights for planning and building flood mitigation policies. With simulation and traditional models, imminent flood events can currently be forecast accurately 12 h in advance, albeit with uncertainties.

While such short-term models might raise citizens' awareness of incoming hazards and help officials plan evacuations, their effectiveness remains limited in the context of central Vietnam. Given the inaccessibility of the region and the lack of effective communication methods, evacuation is often difficult, which can lead to high human fatalities and severe economic damage. Another drawback of the models currently in place in the region is that limited data velocity hampers their performance: some stations cannot provide enough hydrological data daily, even though most short-term forecasting models rely on such information. This necessitates the use of longer-term models to predict general trends in flood hazards. These models can predict early-stage flood hazards months in advance, allowing more time-intensive policies, such as the construction of flood prevention infrastructure, to be deployed. Through the development of ANN models, flooding trends over many months can be incorporated into risk management and mitigation plans. This study introduced models able to create yearlong forecasts, affording policymakers the flexibility and time to enhance their flood-combating policies. The baseline ANN model presented here performs well in low-information settings without requiring special optimization methods. This means that the model can be continuously retrained and deployed on a shorter timeframe and can be quickly integrated into different environments in the region. The design of ANNs also allows input features to be added without substantial additional effort. Given the decentralized nature of data currently held by Vietnam's natural and hydrological stations, this feature allows for high tolerance of differing data availability while still maintaining performance and accuracy. The current model can also be easily extended, which opens a path for further research based on the ANN model presented here.

It is important to note the role of input parameters in the performance of the models. Seven factors spanning different areas were considered in the study. Feature selection plays an important role, as the features should be representative of the flooding situation in the region being observed. In general, floods can be categorized into three types: coastal floods, fluvial (river) floods, and pluvial (rainfall) floods. The primary differences between the types lie in their causes. Fluvial floods are caused by excess water from rainfall over extended periods, when the volume of river flow exceeds river capacity. Such floods also happen because of the sudden failure or destruction of flood prevention structures, such as dikes and dams, which generally relates to socio-economic factors in Vietnam's central region. A higher riverside population density is likely to correlate positively with better-maintained flood protection measures. Pluvial floods, on the other hand, are often the result of rapid downpours during heavy rainfall within a short time. This type of flooding happens on the ground surface in urban settings as well as in higher-elevation areas. Rainfall directly causes such floods, and in the case of urban flooding, the extent of flooding can be worsened by inefficiencies in the drainage systems of Vietnamese cities. Precipitation also plays a major role. Coastal flooding, in the context of Vietnam, comes in the form of extreme weather events such as tropical cyclones. These events both damage the flood prevention structures in place and bring large amounts of rainfall to the region, causing widespread flooding. The different types of flooding indicate the complexity of the flood prevention problems that policymakers face. As traditional modeling is typically based on one type of flooding and on modeling the physical hydrological processes, such models often underperform when determining flood probabilities, as they lack data redundancy and complexity. The method developed in this study was built with the different flood hazards of the region in mind, and each feature is indicative of multiple flood types in the region. Every factor selected provides information to the ANN model and allows it to make meaningful predictions.

Precipitation is the most important factor in the study. This is in line with the context of central Vietnam, where floods often occur due to intense rainfall events, river flow, and river speed, all of which are correlated with precipitation. River flow volume and population density also play important roles in the model. Flow volume directly relates to whether water levels will exceed the natural limits of rivers and cause flooding. It is an important hydrological measure and often conditions water levels and water runoff in the region.

Population density is an often overlooked measure for flood hazards. Our study found that it ranked third in importance, which could be explained by the generally higher level of human activity directed towards the construction of flood prevention infrastructure in densely populated areas. Rivers and natural flood prevention can also be negatively affected by human activities. Human-made dams and deforestation exacerbate the extent of flooding in the region by reducing structures that hold excess river water and rainwater, such as canals and forests. The importance of the features presented in the study reinforces the need to select good features. The ANN model, through the flood-incident dataset of the region, was able to create accurate long-term forecasts of flood occurrence trends in the region.

We acknowledge that other variables that should have been included were left out because of limited data availability. The most important of these relate to topography, including the slope and soil type of the region, which condition the speed of river flow and water volume. Thus, future research should consider a wider variety of features in areas such as topography and socio-economics, further enhancing the proposed model’s generalizability and accuracy when implemented in real-world scenarios.

Given the growing impact of anthropogenic influences, changes in environmental variables have become more frequent and often sporadic in nature. This has led to unpredictable rainfall patterns in Vietnam and a general increase in humidity [28], with important consequences for long-term climate forecasting. Uncertainties in weather patterns have affected the performance of short-term simulation models and hinder warning systems for evacuation planning. Given that changes in flood hazard patterns can develop years in advance, long-term forecasting models can be vital for policymakers. They can provide important support for the development of more accurate short-term models by adding insight into flood patterns, while requiring much less data. Furthermore, the algorithms used in this study have much lower computational costs and are easily transferable. Additional parameters can be introduced into the model and readily used to develop meaningful predictions. Accurate predictions allow for more effective use of resources to develop better flood mitigation and prevention structures and models. This would result in a general decrease in flood casualties and economic damage in the region.

Flood hazard events in Vietnam are complex natural processes influenced by many variables, which are often missing due to constraints at meteorological stations. The ANN models we proposed fulfill a need for long-term forecasting among policymakers, who often rely only on short-term classification, hydrologic, and hydraulic models for detecting flooded areas. Our models are computationally efficient and accurate without the need for costly human-annotated satellite images, which are inaccessible in the context of Vietnam. Our models can work with open-access datasets that involve only one-dimensional data points, lowering the barriers to model integration across Southeast Asia. We should also note that while feature selection is important for the computational efficiency of the model, our models do not require specialized knowledge from domain experts to extend the model inputs. This allows for more experimentation by different parties and for further dissemination of the model. Compared with conventional flood hazard predictive models, our proposed methods are more generalizable, as they incorporate a wider domain of variables, such as socio-economic variables. We are currently working on applying topographic data and river geographical features to develop more robust classifiers that differentiate between risk classes. Another potential research extension is to develop flood risk models. Our current model only allows for hazard estimation in the region. Flood risk would indicate the potential impact of flood events on the economy, people, and the environment, informing planners which high-risk areas to prioritize. This opens potential opportunities for further study.

Overall, ANNs deliver better flood prediction results than the other models. Their performance can be attributed to their ability to predict flood hazards without the need to model complex hydrological processes. They can be trained on any relevant dataset and discover patterns that are hard to detect from observation alone. They do not require the user to establish causal relationships between variables; rather, they provide knowledge about these relationships. Given its accuracy and simplicity, the model can potentially be integrated into an automatic meteorological system to become a real-time flood prediction model. We also developed a highly accurate ANN that was implemented for the first time in the context of central Vietnam. The model was able to deliver considerable performance in predicting flood hazards in the absence of any geospatial variables. This indicates that our methods are not dependent on geographic settings and can generalize to a wider area [55]. We also found that socio-economic variables can provide important information for models to extract and use in generating predictions, regardless of geospatial accuracy. This is of interest to the international scientific community, as few studies have employed socio-economic variables without considering their geospatial location. This suggests that socio-economic variables can be used independently of geospatial variables and thus can be easily integrated into future ML models without the need for scarce GIS or remote sensing data.

Our work is not without its limitations, and discussing them points to future research opportunities. The primary limitation of this work is the small dataset size due to data availability, especially when compared with other datasets used for ANN model training. This hinders our ability to deploy other neural networks, such as Recurrent Neural Networks and Convolutional Neural Networks, as they are data-intensive and known to overfit on small datasets [56]. It is also possible to improve the ANNs by further fine-tuning the hyperparameters. Another major limitation of this study is the lack of geospatial features in our dataset. Our dataset was limited in spatial accuracy because the exact location of each station was not available. Moreover, given the temporal nature of the dataset, we did not explore how geospatial data would affect model performance [46]. While our models were able to generalize over a much wider area, they were unable to provide predictions as spatially accurate as those of other models. This also meant that we did not explore applications of our model in several other areas where flood risk analysis using ML could be utilized, such as susceptibility mapping [3,57]. However, given the general lack of data and the limited accuracy of the image-based and satellite-based data available, the results derived from such models would be less useful for authorities making informed decisions. This was the reason behind our decision to develop predictive models, as they provide a more robust system for local governments to use in informing their decisions. Nevertheless, we acknowledge the usefulness of flood risk mapping for local governments in determining key affected areas. We hope to analyze satellite data in more detail in the future when the accuracy and availability of information improve, helping to create better flood susceptibility maps.

We believe that new advances in ML should be utilized to help materialize gains in improving flood mitigation tools and allow experts to gain quick access to new insights in the face of climate change. We call for more collective open-access flood information datasets for Vietnam and, ultimately, for Southeast Asia and the world. We hope that our approaches will allow for further research in this space that improves upon our results. It is of utmost importance that we gain a deeper understanding of flood hazards because, by doing so, we will make decisions that better manage flood hazard risks and save countless lives.

5. Conclusions

In this study, ML models, specifically ANNs, SVM, and KNN, were deployed to predict flood hazards in the central region of Vietnam. The dataset comprised a total of 288 observations from the region over the period 2002–2020, each with parameters covering climatic, hydrological, and socio-economic characteristics. Results from the models were validated using confusion matrices with measures such as accuracy, precision, and f1-score.

ANNs show the best performance in all metrics, followed by SVM and KNN. The results indicate ANNs' ability to extract information from a range of variables in constrained environments to create flood hazard predictions. Further research could enhance ANN performance by developing better datasets with more observations or better feature selection, or by employing more advanced data augmentation techniques or hyperparameter tuning. The research has indicated that socio-economic variables do have an impact on ML model performance regardless of geographical context, which calls for further research to include these variables in future settings. This study showcases the usefulness of ANNs and ML models in generating insights for the flood mitigation strategies of local authorities and calls for more research in the field.

Author Contributions

Conceptualization, K.T. and M.P.Q.; methodology, M.P.Q. and K.T.; software, M.P.Q.; formal analysis, M.P.Q.; validation, M.P.Q. and K.T.; investigation, M.P.Q. and K.T.; resources, M.P.Q. and K.T.; data curation, M.P.Q. and K.T.; writing—original draft preparation, M.P.Q.; writing—review and editing, K.T.; supervision, K.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study can be accessed via these links: https://www.gso.gov.vn/en/homepage/, (accessed on 8 July 2022), https://www.desinventar.net/, (accessed on 8 July 2022). A compilation of the dataset used can be accessed at: https://github.com/Mandolaro/flood-data, (accessed on 14 August 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

Künzel, V.; Schäfer, L.; Winges, M. Global Climate Risk Index 2020. Available online: https://www.germanwatch.org/en/17307 (accessed on 22 June 2022).
Smith, M.P.; Ricker, M.; Prütz, R.; Anand, M.; Lehner, B.; Flörke, M.; Wimmer, F.; Mann, H.; Weller, D.; Mucke, P.; et al. Global Assessment of Current and Future River Flooding and the Role of Nature-Based Solutions for Risk Management; Summary Report; The Nature Conservancy: Berlin, Germany, 2021.
Ha, M.C.; Vu, P.L.; Nguyen, H.D.; Hoang, T.P.; Dang, D.D.; Dinh, T.B.H.; Şerban, G.; Rus, I.; Brețcan, P. Machine Learning and Remote Sensing Application for Extreme Climate Evaluation: Example of Flood Susceptibility in the Hue Province, Central Vietnam Region. Water 2022, 14, 1617.
Blaikie, P.; Cannon, T.; Davis, I.; Wisner, B. At Risk: Natural Hazards, People’s Vulnerability and Disasters; Routledge: London, UK, 2005.
Chen, A.; Giese, M.; Chen, D. Flood impact on Mainland Southeast Asia between 1985 and 2018—The role of tropical cyclones. J. Flood Risk Manag. 2020, 13, e12598.
IPCC (Intergovernmental Panel on Climate Change). Climate Change (p. 2014). Impacts: Adaptation, and Vulnerability. Summary for Policymakers; IPCC: Geneva, Switzerland, 2014.
Nkwunonwo, U.; Whitworth, M.; Baily, B. A review of the current status of flood modelling for urban flood risk management in the developing countries. Sci. Afr. 2020, 7, e00269.
GoV (Government of Vietnam). Order No. 07/2013/L-CTN on the Promulgation of the Law on Natural Disaster Prevention and Control; Government of Vietnam: Hanoi, Vietnam, 2013.
Ngo, H.; Radhakrishnan, M.; Ranasinghe, R.; Pathirana, A.; Zevenbergen, C. Instant Flood Risk Modelling (Inform) Tool for Co-Design of Flood Risk Management Strategies with Stakeholders in Can Tho City, Vietnam. Water 2021, 13, 3131.
Below, R.; Vos, F.; Guha-Sapir, D. Moving towards Harmonization of Disaster Data: A Study of Six Asian Databases. Available online: https://www.alnap.org/help-library/moving-towards-harmonization-of-disaster-data-a-study-of-six-asian-databases (accessed on 27 June 2022).
Huynh, L.T.M.; Stringer, L.C. Multi-scale assessment of social vulnerability to climate change: An empirical study in coastal Vietnam. Clim. Risk Manag. 2018, 20, 165–180.
Chinh, D.T.; Bubeck, P.; Dung, N.V.; Kreibich, H. The 2011 flood event in the Mekong Delta: Preparedness, response, damage and recovery of private households and small businesses. Disasters 2016, 40, 753–778.
Nguyen, M.T.; Sebesvari, Z.; Souvignet, M.; Bachofer, F.; Braun, A.; Garschagen, M.; Schinkel, U.; Yang, L.E.; Nguyen, L.H.K.; Hochschild, V.; et al. Understanding and assessing flood risk in Vietnam: Current status, persisting gaps, and future directions. J. Flood Risk Manag. 2021, 14, e12689.
Mosavi, A.; Ozturk, P.; Chau, K.-W. Flood Prediction Using Machine Learning Models: Literature Review. Water 2018, 10, 1536.
Rehman, S.; Sahana, M.; Hong, H.; Sajjad, H.; Bin Ahmed, B. A systematic review on approaches and methods used for flood vulnerability assessment: Framework for future research. Nat. Hazards 2019, 96, 975–998.
Ikirri, M.; Faik, F.; Echogdali, F.Z.; Antunes, I.M.H.R.; Abioui, M.; Abdelrahman, K.; Fnais, M.S.; Wanaim, A.; Id-Belqas, M.; Boutaleb, S.; et al. Flood Hazard Index Application in Arid Catchments: Case of the Taguenit Wadi Watershed, Lakhssas, Morocco. Land 2022, 11, 1178.
Nguyen, H.D. Hybrid models based on Deep Learning Neural Network and optimization algorithms for the spatial prediction of tropical forest fire susceptibility in Nghe An Province, Vietnam. Geocarto Int. 2022.
Kuyuk, H.S.; Susumu, O. Real-time classification of earthquake using Deep Learning. Procedia Comput. Sci. 2018, 140, 298–305.
Kamangir, H.; Collins, W.; Tissot, P.; King, S.A. A deep-learning model to predict thunderstorms within 400 km² south Texas domains. Meteorol. Appl. 2020, 27, e1905.
Elsafi, S.H. Artificial Neural Networks (Anns) for flood forecasting at Dongola station in the River Nile, Sudan. Alex. Eng. J. 2014, 53, 655–662.
Tansar, H.; Babur, M.; Karnchanapaiboon, S.L. Flood inundation modeling and hazard assessment in Lower Ping River Basin using MIKE FLOOD. Arab. J. Geosci. 2020, 13, 934.
Chau, K.-W. Use of Meta-Heuristic Techniques in Rainfall-Runoff Modelling. Water 2017, 9, 186.
Cao, Y.; Jia, H.; Xiong, J.; Cheng, W.; Li, K.; Pang, Q.; Yong, Z. Flash Flood Susceptibility Assessment Based on Geodetector, Certainty Factor, and Logistic Regression Analyses in Fujian Province, China. ISPRS Int. J. Geo-Inf. 2020, 9, 748.
Ab Razak, N.H.; Aris, A.Z.; Ramli, M.F.; Looi, L.J.; Juahir, H. Temporal flood incidence forecasting for Segamat River (Malaysia) using autoregressive integrated moving average modelling. J. Flood Risk Manag. 2016, 11, S794–S804.
Malik, S.; Pal, S.C.; Chowdhuri, I.; Chakrabortty, R.; Roy, P.; Das, B. Prediction of highly flood prone areas by GIS based heuristic and statistical model in a monsoon dominated region of Bengal Basin. Remote Sens. Appl. Soc. Environ. 2020, 19, 100343.
Feng, B.; Wang, J.; Zhang, Y.; Hall, B.; Zeng, C. Urban flood hazard mapping using a hydraulic–GIS combined model. Nat. Hazards 2020, 100, 1089–1104.
Souissi, D.; Zouhri, L.; Hammami, S.; Msaddek, M.H.; Zghibi, A.; Dlala, M. GIS-based MCDM—AHP modeling for flood susceptibility mapping of arid areas, southeastern Tunisia. Geocarto Int. 2019, 35, 991–1017.
Dang, V.H.; Tran, D.D.; Cham, D.D.; Hang, P.T.T.; Nguyen, H.T.; Van Truong, H.; Tran, P.H.; Duong, M.B.; Nguyen, N.T.; Van Le, K.; et al. Assessment of Rainfall Distributions and Characteristics in Coastal Provinces of the Vietnamese Mekong Delta under Climate Change and ENSO Processes. Water 2020, 12, 1555.
Pham, B.T.; Luu, C.; Van Phong, T.; Nguyen, H.D.; Van Le, H.; Tran, T.Q.; Ta, H.T.; Prakash, I. Flood risk assessment using hybrid artificial intelligence models integrated with multi-criteria decision analysis in Quang Nam Province, Vietnam. J. Hydrol. 2020, 592, 125815.
Yariyan, P.; Janizadeh, S.; Van Phong, T.; Nguyen, H.D.; Costache, R.; Van Le, H.; Pham, B.T.; Pradhan, B.; Tiefenbacher, J.P. Improvement of Best First Decision Trees Using Bagging and Dagging Ensembles for Flood Probability Mapping. Water Resour. Manag. 2020, 34, 3037–3053.
Rahman, M.; Chen, N.; Elbeltagi, A.; Islam, M.; Alam, M.; Pourghasemi, H.R.; Tao, W.; Zhang, J.; Shufeng, T.; Faiz, H.; et al. Application of stacking hybrid machine learning algorithms in delineating multi-type flooding in Bangladesh. J. Environ. Manag. 2021, 295, 113086.
Schoppa, L.; Disse, M.; Bachmair, S. Evaluating the performance of random forest for large-scale flood discharge simulation. J. Hydrol. 2020, 590, 125531.
Choubin, B.; Moradi, E.; Golshan, M.; Adamowski, J.; Sajedi-Hosseini, F.; Mosavi, A. An ensemble prediction of flood susceptibility using multivariate discriminant analysis, classification and regression trees, and support vector machines. Sci. Total Environ. 2018, 651, 2087–2096.
Darabi, H.; Choubin, B.; Rahmati, O.; Haghighi, A.T.; Pradhan, B.; Kløve, B. Urban flood risk mapping using the GARP and QUEST models: A comparative study of machine learning techniques. J. Hydrol. 2018, 569, 142–154. [Google Scholar] [CrossRef]
Dobbin, K.K.; Simon, R.M. Optimally splitting cases for training and testing high dimensional classifiers. BMC Med. Genom. 2011, 4, 31. [Google Scholar] [CrossRef] [Green Version]
Nachappa, T.G.; Piralilou, S.T.; Gholamnia, K.; Ghorbanzadeh, O.; Rahmati, O.; Blaschke, T. Flood susceptibility mapping with machine learning, multi-criteria decision analysis and ensemble using Dempster Shafer Theory. J. Hydrol. 2020, 590, 125275. [Google Scholar] [CrossRef]
Tabari, H. Climate change impact on flood and extreme precipitation increases with water availability. Sci. Rep. 2020, 10, 13768. [Google Scholar] [CrossRef] [PubMed]
Sofia, G.; Nikolopoulos, E.I. Floods and rivers: A circular causality perspective. Sci. Rep. 2020, 10, 5175. [Google Scholar] [CrossRef] [PubMed]
Ferdous, R.; Di Baldassarre, G.; Brandimarte, L.; Wesselink, A. The interplay between structural flood protection, population density, and flood mortality along the Jamuna River, Bangladesh. Reg. Environ. Chang. 2020, 20, 5. [Google Scholar] [CrossRef] [PubMed]
Bayazıt, Y.; Koç, C.; Bakış, R. Urbanization impacts on flash urban floods in Bodrum Province, Turkey. Hydrol. Sci. J. 2020, 66, 118–133. [Google Scholar] [CrossRef]
Hamzah, F.B.; Hamzah, F.M.; Razali, S.F.M.; Jaafar, O.; Jamil, N.A. Imputation methods for recovering streamflow observation: A methodological review. Cogent Environ. Sci. 2020, 6, 1745133. [Google Scholar] [CrossRef]
Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A System for Large-Scale Machine Learning. Available online: https://arxiv.org/abs/1605.08695 (accessed on 3 July 2022).
Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
Xu, Z.; Shen, D.; Nie, T.; Kou, Y. A hybrid sampling algorithm combining M-SMOTE and ENN based on Random forest for medical imbalanced data. J. Biomed. Inform. 2020, 107, 103465. [Google Scholar] [CrossRef]
Iyengar, N.S.; Sudarshan, P. A method of classifying regions from multivariate data. Econ. Polit. Wkly. 1982, 17, 2048–2052. [Google Scholar]
Olson, M.; Wyner, A.J.; Berk, R. Modern neural networks generalize on small datasets. In Proceedings of the 32nd Conference on Neural Information Processing Systems, Red Hook, NY, USA, 3–8 December 2018. [Google Scholar]
Ali, N.; Neagu, D.; Trundle, P. Evaluation of k-nearest neighbour classifier performance for heterogeneous data sets. SN Appl. Sci. 2019, 1, 1559. [Google Scholar] [CrossRef]
Javid, A.M.; Das, S.; Skoglund, M.; Chatterjee, S. A relu dense layer to improve the performance of Neural Networks. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021. [Google Scholar]
Nwankpa, C.; Ijomah, W.; Gachagan, A.; Marshall, S. Activation Functions: Comparison of Trends in Practice and Research for Deep Learning. Available online: https://arxiv.org/abs/1811.03378 (accessed on 3 July 2022).
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Available online: https://jmlr.org/papers/v15/srivastava14a.html (accessed on 3 July 2022).
Awad, M.; Khanna, R. Support Vector Machines for classification. In Efficient Learning Machines; Apress: Berkeley, CA, USA, 2015; pp. 39–66. [Google Scholar]
Gauhar, N.; Das, S.; Moury, K.S. Prediction of flood in Bangladesh using K-nearest neighbors algorithm. In Proceedings of the 2021 2nd International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), Dhaka, Bangladesh, 5–7 January 2021. [Google Scholar]
Han, D.; Chan, L.; Zhu, N. Flood forecasting using support vector machines. J. Hydroinform. 2007, 9, 267–276. [Google Scholar] [CrossRef]
Hicks, S.A.; Strümke, I.; Thambawita, V.; Hammou, M.; Riegler, M.A.; Halvorsen, P.; Parasa, S. On evaluation metrics for medical applications of Artificial Intelligence. Sci. Rep. 2022, 12, 5979. [Google Scholar] [CrossRef] [PubMed]
Avand, M.; Kuriqi, A.; Khazaei, M.; Ghorbanzadeh, O. DEM resolution effects on machine learning performance for flood probability mapping. J. Hydro-Environ. Res. 2021, 40, 1–16. [Google Scholar] [CrossRef]
Feng, S.; Zhou, H.; Dong, H. Using deep neural network with small dataset to predict material defects. Mater. Des. 2018, 162, 300–310. [Google Scholar] [CrossRef]
Prasad, P.; Loveson, V.J.; Das, B.; Kotha, M. Novel ensemble machine learning models in flood susceptibility mapping. Geocarto Int. 2022, 37, 4571–4593. [Google Scholar] [CrossRef]

Figure 1. The importance of each feature in the training process.

Figure 2. The accuracy curve of the ANN model.

Figure 3. The training loss curve of the ANN model.

Figure 4. Confusion matrices for each model on the flood prediction testing dataset, showing each model’s true positive (TP), true negative (TN), false positive (FP), and false negative (FN) results: (a) confusion matrix of ANN; (b) confusion matrix of KNN; (c) confusion matrix of SVM.

Table 1. The accuracy for each k used in the KNN model on the testing dataset.

k (number of neighbors)	Accuracy
3	60%
5	76%
7	79%
9	75%
11	74%
13	72%
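
As an illustration of how Table 1 can be read, the following minimal Python sketch selects the k with the highest testing accuracy. The dictionary name `knn_accuracy` and the helper `best_k` are hypothetical; the values are copied from Table 1, and the code is not the authors' implementation.

```python
# Testing accuracy (in percent) for each k, copied from Table 1.
knn_accuracy = {3: 60, 5: 76, 7: 79, 9: 75, 11: 74, 13: 72}


def best_k(accuracy_by_k):
    """Return the k with the highest accuracy (hypothetical helper)."""
    return max(accuracy_by_k, key=accuracy_by_k.get)


chosen_k = best_k(knn_accuracy)
print(chosen_k, knn_accuracy[chosen_k])
```

With these values the sketch selects k = 7 at 79%, which agrees with the KNN accuracy listed in Table 2.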

Table 2. The results of the ANN, SVM, and KNN models on our evaluation metrics (Precision, Recall, F1-Score, Accuracy).

Model	Precision	Recall	F1-Score	Accuracy
ANN	85%	100%	0.92	91%
SVM	72%	76%	0.74	74%
KNN	81%	76%	0.79	79%
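
The F1 score column can be cross-checked against the Precision and Recall columns using the standard definition F1 = 2PR / (P + R). The sketch below is only a consistency check on the rounded values in Table 2; the function name `f1_score` and the dictionary `table2` are hypothetical, and the underlying confusion-matrix counts are not reproduced here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall from Table 2, expressed as fractions.
table2 = {
    "ANN": (0.85, 1.00),
    "SVM": (0.72, 0.76),
    "KNN": (0.81, 0.76),
}

for model, (p, r) in table2.items():
    print(f"{model}: F1 = {f1_score(p, r):.3f}")
```

Because the inputs are already rounded to whole percentages, the recomputed values can differ from the reported column in the last digit (the KNN row is one such case).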

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

