Predicting Eastern Mediterranean Flash Floods Using Support Vector Machines with Precipitable Water Vapor, Pressure, and Lightning Data

Asaly, Saed; Gottlieb, Lee-Ad; Yair, Yoav; Price, Colin; Reuveni, Yuval

doi:10.3390/rs15112916


by Saed Asaly ¹, Lee-Ad Gottlieb ¹, Yoav Yair ², Colin Price ³ and Yuval Reuveni ⁴,⁵,*

¹ Department of Computer Science, Ariel University, Ariel 40700, Israel

² School of Sustainability, Reichman University, Herzliya 4610101, Israel

³ Porter School of the Environment and Earth Sciences, Department of Geophysics, Tel Aviv University, Tel Aviv 6997801, Israel

⁴ Department of Physics, Ariel University, Ariel 40700, Israel

⁵ Eastern R&D Center, Ariel 40700, Israel

* Author to whom correspondence should be addressed.

Remote Sens. 2023, 15(11), 2916; https://doi.org/10.3390/rs15112916

Submission received: 14 March 2023 / Revised: 3 May 2023 / Accepted: 29 May 2023 / Published: 2 June 2023

(This article belongs to the Special Issue Remote Sensing for Precipitation Measurements and Lightning Meteorology)


Abstract:

Flash floods in the Eastern Mediterranean (EM) region are considered among the most destructive natural hazards, which pose a significant challenge to model due to their high complexity. Machine learning (ML) methods have made a significant contribution to the advancement of flash flood prediction systems by providing cost-effective solutions with improved performance, enabling the modeling of the complex mathematical expressions underlying physical processes of flash floods. Thus, the development of ML methods for flash flood prediction holds the potential to mitigate risks, inform policy recommendations, minimize loss of human life, and reduce property damage caused by flash floods. Here, we present a novel approach for improving flash flood predictions in the EM region using Support Vector Machines (SVMs) with a combination of precipitable water vapor (PWV) data, derived from ground-based global navigation satellite system (GNSS) receivers, along with surface pressure measurements, and nearby lightning occurrence data to predict flash floods in an arid region of the EM. The SVM model was trained on historical data from 2004 to 2019 and was used to forecast the likelihood of flash floods in the region. The study found that integrating nearby lightning data with the other variables significantly improved the accuracy of flash flood prediction compared to using only PWV and surface pressure measurements. The results of the SVM model were validated using observed flash flood events, and the model was found to have a high predictive accuracy with an area under the receiver operating characteristic curve of 0.93 for the test set. The study provides valuable insights into the potential of utilizing a combination of meteorological and lightning data for improving flash flood forecasting in the Eastern Mediterranean region.

Keywords:

flash flood prediction; natural hazards; SVM; machine learning; lightning; PWV; pressure


1. Introduction

Flash floods are sudden and intense flooding events that are typically caused by heavy rain. They can occur in a short period of time, making them difficult to predict [1]. Flash floods can cause human casualties and extensive damage to infrastructure, property, and the natural environment [2]. They can also lead to serious injuries due to landslides [3] and collapsed infrastructure, and can disrupt essential services such as electricity, water, and transportation, leading to significant economic and social disruption. In addition, they can erode roads and paths, resulting in the formation of potholes, sinkholes, and other hazards [4].

The short timescale of flash floods, typically a matter of several hours, makes them challenging to predict. Furthermore, analysis of the output of hydrological models shows that, among the factors that control the generation of flash floods (such as soil saturation and surface cover), the most significant is the spatiotemporal distribution of rainfall [5,6,7,8]. The rainfall pattern in the arid and semi-arid areas of the Eastern Mediterranean region is highly variable and is mostly characterized by brief, high-intensity events [9,10,11]. In order to anticipate flood events, it is necessary to consider the location and timing of these heavy rainfall events, which can be done through remote sensing platforms such as weather radar [12,13,14].

Flash flood risk assessment, such as determining the likelihood of an accurate prediction, is critical for determining the impact of future events within a particular area of interest [15]. Effective flood risk management is crucial in mitigating the devastating impacts of floods on human health, the environment, and economic activities. The 2007/60 European directive on the assessment and management of flood risks provides a framework for flood risk management in the EU and recognizes the shared responsibility of various stakeholders in ensuring coordinated efforts to reduce the risk and mitigate the impacts of floods, which are among the most common and devastating natural hazards. To address these risks, the directive establishes a systematic process for identifying areas at risk of flooding, evaluating potential consequences, and developing strategies to reduce the overall risk. It emphasizes a risk-based approach to flood risk management, prioritizing measures that reduce the likelihood and potential consequences of flooding. A comprehensive and coordinated approach that involves various stakeholders and integrates land-use planning, structural measures (e.g., dams and levees), and non-structural measures (e.g., early warning systems and flood insurance) is vital in mitigating flood risks and helps communities better protect themselves from the devastating impacts of flooding.

As such, the importance of predicting flash flood events is reflected in numerous studies that have been conducted on this topic in recent years. For example, the FLASH project [16] used lightning data to improve flash flood predictions in the Mediterranean Basin, and the HYDRATE project reported by Borga et al. [1] was established in order to improve the scientific basis of flash flood forecasting in Europe by developing a coherent set of technologies and tools for effective early warning systems, as well as by enhancing the availability of flash flood data. The project focused on organizing existing flash flood data and improving the understanding of flash flood processes in ungauged basins. Other contributions to flash flood prediction have incorporated machine learning tools in order to improve prediction ability [17,18,19,20,21].

An alternative method to identify heavy rainfall events is to monitor the quantity of water vapor (WV) present in the atmosphere. This serves as an indicator of mass moisture transport, a necessary condition for such events. One way to achieve this is through the use of Global Navigation Satellite System (GNSS) meteorology, which can provide a nearly instantaneous estimate of the precipitable WV (PWV) above the location of a ground-based Global Positioning System (GPS) receiver on a continuous basis [22,23,24,25,26].

Recently, Ziskin and Reuveni [21] investigated the use of precipitable water vapor (PWV) derived from ground-based global navigation satellite system (GNSS) stations to assist in predicting flash floods in an arid region of the eastern Mediterranean (EM). In GNSS, navigation messages in the form of radio waves are transmitted from GPS satellites, which orbit the Earth at an altitude of about 20,200 km, and are received by ground-based GPS receivers. These messages contain information that enables users on the ground to determine their position with a high degree of accuracy, up to a centimeter or even millimeter level, using a technique called precise point positioning (PPP) [27,28]. In addition to determining the receiver position, this method can also be used to estimate the amount of water vapor present between the GPS satellite and the ground-based receiver, known as the PWV content. When radio waves transmitted by the GPS satellite reach the ground-based receiver, they are affected by the Earth’s atmosphere in two ways. The ionosphere is a dispersive medium, so it delays the radio waves by a measurable, frequency-dependent amount. To correct for this effect, GPS satellites transmit radio waves in at least two frequency bands. The other effect is caused by the troposphere, which refracts the radio waves and thereby delays their arrival at the receiver [29]. This delay, known as the zenith tropospheric delay (ZTD), is composed of two parts: the zenith hydrostatic delay (ZHD), which is mainly caused by atmospheric pressure, and the zenith wet delay (ZWD), which is caused by the interaction of the radio waves with water molecules. The ZWD can be calculated by subtracting the ZHD from the ZTD [30,31,32].

The approach of Ziskin and Reuveni [21] involved training three types of machine learning models (random forest (RF), multi-layer perceptron (MLP), and support vector machine (SVM)) with 24 h of PWV data in order to predict whether a flash flood will occur. The models were trained with 107 unique flash flood events and were tested using a nested cross-validation technique. The results showed good agreement across the various score metrics for the three ML models and indicated that the models can be improved by incorporating additional features such as surface pressure measurements and the day of year (DOY). In addition, a feature importance analysis revealed that PWV values from 2 to 6 h prior to a flash flood are the most important features. These results suggest that near real-time, data-driven approaches based on ground-based GNSS data can be used to augment current flash flood warning systems. However, when these models were tested with an imbalanced test set, simulating more realistic flash flood occurrence scenarios, precision dropped (i.e., the proportion of false alarms among the issued warnings rose) while the hit rate (recall) remained high. The study suggested that the proposed flash flood prediction approach could be used to improve real-time flash flood early warning systems, possibly through the use of a multi-class classification task with peak discharge as a threshold parameter. Readers interested in the processing parameters and methodology used to derive and validate PWV, as well as in an analysis of diurnal, interannual, and long-term trends, may refer to [30] or [31].

The Contribution of This Study

In the current study, the aim is to address the research gap left by the recent paper of Ziskin and Reuveni [21], namely the sharp drop in precision (i.e., the rise in false alarms) when considering imbalanced data that closely simulate a realistic flash flood scenario.

An additional feature that has been successfully explored for predicting flash floods is lightning activity data, which have proven to be a reliable precursor to heavy rainfall [33] and are thus known to be highly correlated with flash flood occurrence [34,35,36]. As such, nearby lightning data are integrated as a new dataset feature, and the best model is tested with a highly imbalanced dataset. This approach closely mimics a real-life flash flood scenario, where the number of false alarms can have serious consequences. The results demonstrate a significant improvement over previous studies, particularly in terms of precision. Specifically, it was found that all models tested exhibited a lower false alarm rate while maintaining a high hit rate.

The inclusion of this feature may enhance the ability of the learning algorithms to distinguish a typical flood event from a fair-weather day. The motivation for adding this new dataset feature is the previously reported finding that heavy rainfall, which can lead to flash flood events, is often accompanied by an increase in nearby lightning activity.

The paper is structured as follows: Section 2 reviews related work. Section 3 describes the lightning data used and its integration into the dataset, as well as the flood events utilized in this study. The ML methodology utilized for studying these datasets is then described in Section 4. Section 5 presents the results of the ML models’ performance. These results are discussed in Section 6, and concluding remarks are presented in Section 7.

2. Related Work

A recent study by Giannaros et al. [37] investigated the November 2019 catastrophic flash flood in Olympiada (North Greece) using the mesoscale Weather Research and Forecasting (WRF) model and the integrated multi-satellite retrievals for global precipitation measurement (GPM-IMERG) algorithm. The study showed that the WRF-based Hydrologic Engineering Center-Hydrologic Modelling System (HEC-HMS) could provide a strong indication of the forthcoming flash flood at least two days in advance, while the GPM-IMERG algorithm yielded the best performance in capturing the timing of the excessive rainfall. Another study by Varlas et al. [38] evaluated a hydrometeorological forecasting system that operates at the Institute of Marine Biological Resources and Inland Waters (IMBRIW) of the Hellenic Centre for Marine Research (HCMR). The system combines the Advanced Weather Research and Forecasting (WRF-ARW) model, the WRF-Hydro hydrological model, and the HEC-RAS hydraulic-hydrodynamic model to provide daily 120 h weather forecasts and hydrological forecasts for the Spercheios and Evrotas rivers in Greece. The study demonstrated that the system provided skillful precipitation and water level forecasts and timely flash flood forecasting products, which could benefit flood warning and emergency responses due to their efficiency and increased lead time.

With regard to the use of machine learning for flood prediction, Panahi et al. [39] investigated the potential of using two types of deep learning neural networks, convolutional neural networks (CNN) and recurrent neural networks (RNN), for predicting and mapping flash flood probability at a spatial scale. They utilized a geospatial database containing records of historical flood events and environmental characteristics of the Golestan Province in northern Iran to develop and validate the predictive models. A step-wise weight assessment ratio analysis was employed to identify the relationships between floods and various influencing factors. The CNN and RNN models were trained using the results of this analysis and were validated using the receiver operating characteristic (ROC) technique. The results show that CNN performed slightly better than RNN in predicting future floods, with an area under the curve (AUC) of 0.832 and a root mean squared error (RMSE) of 0.144, compared to an AUC of 0.814 and an RMSE of 0.181 for RNN.

Bui et al. [40] developed a new approach to flash flood susceptibility mapping based on a deep learning neural network (DLNN) algorithm, and tested their approach within a case study of a high-frequency tropical storm area in Vietnam. The DLNN model used a database of features such as elevation, slope, curvature, aspect, stream density, normalized difference vegetation index (NDVI), soil type, lithology, and rainfall to predict different levels of susceptibility to flash floods. Feature selection was performed using the information gain ratio. The results indicated that DLNN yields strong prediction accuracy, with a classification accuracy rate of 92.05%, a positive predictive value of 94.55%, and a negative predictive value of 89.55%. The DLNN model performed better than benchmarks based on a multilayer perceptron neural network (MLP) or on support vector machines (SVM), suggesting that it could be a useful tool for flash flood mitigation and land-use planning in the study area.

Band et al. [41] aimed to assess the susceptibility of the Kalvan watershed in Iran to flash floods, using five hybrid parallel and regularized approaches. The extremely randomized trees (ERT) model was found to be the most optimal, with an AUC value of 0.82. The ERT model indicated that 28.33% of the area was at very high to moderate risk of flash floods, with the remaining area at very low to low risk. Topographical and hydrological parameters such as altitude, slope, rainfall, and distance from the river were found to be the most important in assessing flash flood susceptibility. This study demonstrated the effectiveness of hybrid parallel and regularization approaches for estimating flash flood susceptibility in a semi-arid environment.

In regards to the correlation between lightning and floods, Koutroulis et al. [34] examined the relationship between lightning activity and high precipitation events leading to flash floods for the island of Crete. Their results showed that the maximal correlation between the lightning and rainfall data was obtained within a circular area of an average radius of 15 km and an average time lag of 15 min for flood events, and 25 min for non-flood events. In addition, lightning activity was also found to be four times higher during flood-triggering storms. Further analysis is needed to understand the differences between flood and non-flood producing storms.

Soula and Chauzy [35] and Price and Federmesser [36] both conducted studies on the correlation between lightning and rain intensity during thunderstorms and winter storms, respectively. Soula and Chauzy [35] found that the overall spatial correlation between rain and lightning occurrence was very consistent for all types of lightning during four days of thunderstorm activity in France. Price and Federmesser [36] found similar results while investigating winter storms over the central and eastern Mediterranean. Barnolas et al. [42] used a combination of rain gauges, radar, geographic information system (GIS), and lightning data to study a flash flood event that occurred in Catalonia during 12–14 September 2006. They found that the high lightning activity during the event made it an ideal case for studying the relation between lightning strikes and precipitation, thus concluding that the correlation between lightning and precipitation was stronger with increased lightning activity. Hence, these studies demonstrate the importance of harnessing lightning data for predicting and mitigating the risks of flash floods [2].

3. Datasets

In the current study, the main aim is to improve the performance of the ML models used by Ziskin and Reuveni [21] for predicting flash flood events. To achieve this, the exact dataset and methodology utilized by them were used. The dataset for estimating PWV used in Ziskin and Reuveni [21] was obtained from the SOI-APN GNSS ground receivers. The daily RINEX files were processed using NASA’s JPL GipsyX software [43], with PPP solutions, a minimum cutoff elevation angle of 15°, GMF for the tropospheric model [44], and 200 ocean loading for all stations. The ZWD was obtained and translated into PWV using the formula [23]: PWV = Π × ZWD. The dimensionless constant of proportionality, Π, was calculated by Ziv et al. [31] using IMS’s automated stations and radiosonde measurements [22]. The PWV validation using the Bet-Dagan radiosonde station is extensively explained in [30,31]. The mean diurnal and annual variations were removed during the PWV dataset preparation process.
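The conversion chain from tropospheric delay to PWV can be illustrated with a minimal sketch. The function names and all numeric values below are hypothetical and chosen only for illustration; in particular, the value assigned to Π is an assumed order of magnitude, not the value derived by Ziv et al. [31].

```python
def zenith_wet_delay(ztd_m: float, zhd_m: float) -> float:
    """Wet delay as the difference ZWD = ZTD - ZHD (all in metres)."""
    return ztd_m - zhd_m


def precipitable_water_vapor(zwd_m: float, pi_factor: float) -> float:
    """PWV = Pi * ZWD, returned in the same length unit as ZWD."""
    return pi_factor * zwd_m


# Illustrative inputs only (not taken from the paper).
ztd = 2.40          # zenith tropospheric delay in metres (assumed)
zhd = 2.28          # zenith hydrostatic delay in metres (assumed)
PI_FACTOR = 0.15    # dimensionless constant Pi (assumed order of magnitude)

zwd = zenith_wet_delay(ztd, zhd)
pwv_mm = precipitable_water_vapor(zwd, PI_FACTOR) * 1000.0  # metres -> millimetres
print(f"ZWD = {zwd:.3f} m, PWV = {pwv_mm:.1f} mm")
```

With these assumed inputs, a wet delay of 0.12 m maps to a PWV of about 18 mm.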

Supplementary lightning occurrence data were introduced into the ML models, in addition to the dataset used by Ziskin and Reuveni [21]. The lightning occurrence data were obtained from two sources: the World Wide Lightning Location Network (WWLLN) and the Israel Lightning Detection Network (ILDN). WWLLN determines the locations of lightning strikes by using the time of arrival from at least 5 sensors, with an average global detection efficiency of around 30% for strikes with peak current values exceeding 30 kA [45]. The ILDN system, on the other hand, consists of 11 sensors, including LPATS and IMPACT sensors. These are distributed throughout the entire state of Israel and have a strike detection efficiency greater than 90% within the Israel area [46]. The ILDN system accurately registers cloud-to-ground strikes of each polarity with a time accuracy of around 1 ms, where flashes with peak currents between 0 and 10 kA are automatically filtered out and treated as intra-cloud flashes. The lightning events captured in the vicinity of the SOI-APN GNSS stations in the southern part of Israel are illustrated in Figure 1.

To align the WWLLN and ILDN datasets, low-magnitude lightning events below 25 kA were excluded from the ILDN dataset. This allowed focusing on high-magnitude lightning events in both datasets. It is worth noting that the ILDN dataset does not include RMS information, precluding the employment of RMS considerations during pre-processing. The research period spanned from September 2004 to December 2010, as well as from July 2017 to July 2020, based on the availability of the lightning data.

4. Methodology

The ML methodology introduced in this study is based on the methodology presented by Ziskin and Reuveni [21], and is illustrated in Figure 2. The figure depicts the complete steps and processes, beginning with target and feature selection, through data pre-processing and model input, and finishing with the creation of the best model fit. The steps are explained in detail in the following sections.

4.1. Data Pre-Processing

One of the key steps in building a ML model is the selection and generation of features that can effectively capture the underlying patterns in the data [47]. In this section, we describe the various techniques and methods that we used to generate the features for the ML model, adding to the ones Ziskin and Reuveni [21] presented.

WWLLN: First, lightning events with a residual RMS greater than 30 ms, which exceeds the maximum allowed time for detecting the lightning event, were filtered out from the WWLLN dataset.
ILDN: For the ILDN dataset, it was necessary to remove low-magnitude lightning events for consistency with the high-magnitude events contained in the WWLLN dataset. To achieve this, all lightning events below a magnitude of 25 kA were filtered out, allowing us to focus on the large-magnitude events in both datasets. We note that the ILDN dataset lacks RMS information, so it was not possible to pre-process this dataset using RMS considerations.
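The two filtering rules can be sketched as follows. This is a minimal illustration rather than the processing code used in the study: the record layout and field names (`residual_rms`, `peak_current_ka`) are hypothetical, and applying the 25 kA threshold to the absolute peak current (so that both polarities are treated alike) is an assumption.

```python
from typing import Dict, List

# Thresholds quoted in the text; the field names below are hypothetical.
MAX_WWLLN_RESIDUAL = 30.0   # residual RMS threshold, in the units quoted in the text
MIN_ILDN_PEAK_KA = 25.0     # peak-current magnitude threshold in kA


def filter_wwlln(strokes: List[Dict]) -> List[Dict]:
    """Keep WWLLN strokes whose residual RMS does not exceed the threshold."""
    return [s for s in strokes if s["residual_rms"] <= MAX_WWLLN_RESIDUAL]


def filter_ildn(strokes: List[Dict]) -> List[Dict]:
    """Keep ILDN strokes whose peak-current magnitude is at least 25 kA.

    Taking the absolute value (both polarities) is an assumption.
    """
    return [s for s in strokes if abs(s["peak_current_ka"]) >= MIN_ILDN_PEAK_KA]


# Made-up records for illustration.
wwlln = [{"id": 1, "residual_rms": 12.0}, {"id": 2, "residual_rms": 45.0}]
ildn = [{"id": 3, "peak_current_ka": -40.0}, {"id": 4, "peak_current_ka": 8.0}]

kept_wwlln = filter_wwlln(wwlln)
kept_ildn = filter_ildn(ildn)
print([s["id"] for s in kept_wwlln], [s["id"] for s in kept_ildn])
```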

Furthermore, since the lightning activity area with an average radius of 15 km has the highest correlation with rainfall data (Koutroulis et al. [34]), we integrated all the lightning locations within the same radius originally utilized by Ziskin and Reuveni [21], who considered all the flash flood events within a 10 km radius of at least one of the GNSS stations listed in Table 1.

4.2. Feature Extraction

In this study, a method for extracting relevant features from the dataset was developed in order to analyze the correlation between flash flood events and lightning activity. Specifically, 24 h lightning vectors were created for each flood event by integrating the number of lightning strikes that occurred within close proximity to the nearest GNSS station at a temporal resolution of 1 h.

To construct the lightning occurrence vectors, each flood event was first co-located with its closest GNSS station. Then, the number of lightning strikes occurring within a 10 km radius around each GNSS station in each 1 h time window over a 24 h period was determined. The distance was chosen with reference to the finding that the circular area with the highest correlation between lightning and rainfall data had an average radius of 15 km [34].

The counts of lightning strikes were integrated for each 1 h time window and assembled into a 24 h vector representing the chosen flood event. This method of computing the lightning vector for each flood event allowed us to analyze the temporal evolution of lightning activity in relation to a specific flood event, investigating any potential correlations or patterns. A comparison between the mean lightning strikes within a time window of 24 h prior to all flash flood events analyzed in this study, versus the mean lightning strikes 24 h prior to all quiet days (non-flash events) is presented in Figure 3.
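The construction of a 24 h lightning vector can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the code used in the study: the stroke tuple layout, the use of the haversine great-circle distance, and the ordering of the vector (index 0 is the hour furthest from the event) are all hypothetical choices.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def lightning_vector(strokes, station, event_time, radius_km=10.0, hours=24):
    """Hourly stroke counts near `station` in the `hours` before `event_time`.

    strokes: iterable of (time, lat, lon); station: (lat, lon).
    """
    counts = [0] * hours
    start = event_time - timedelta(hours=hours)
    for t, lat, lon in strokes:
        if not (start <= t < event_time):
            continue  # outside the 24 h window
        if haversine_km(station[0], station[1], lat, lon) > radius_km:
            continue  # outside the search radius
        counts[int((t - start).total_seconds() // 3600)] += 1
    return counts


# Made-up station, event time, and strokes for illustration.
station = (31.0, 35.0)
event = datetime(2020, 1, 10, 12, 0)
strokes = [
    (datetime(2020, 1, 10, 11, 30), 31.01, 35.01),  # last hour, ~1.5 km away
    (datetime(2020, 1, 10, 11, 5), 31.01, 35.01),   # last hour, ~1.5 km away
    (datetime(2020, 1, 10, 6, 15), 31.05, 35.00),   # hour index 18, ~5.6 km away
    (datetime(2020, 1, 10, 11, 0), 31.50, 35.00),   # ~56 km away, rejected
    (datetime(2020, 1, 8, 9, 0), 31.00, 35.00),     # before the window, rejected
]
vec = lightning_vector(strokes, station, event)
print(vec)
```

Each flood or quiet-day sample then contributes one such 24-element vector alongside its other features.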

The feature extraction process in this study enabled us to effectively capture and analyze the relevant lightning data for each flood event in a consistent and systematic manner. This approach provided valuable insights into the coupling between flood events and lightning activity, informing the subsequent analysis and interpretation of the data.

After first filtering the dataset to include only flood and quiet (non-flood) day events for which lightning data were available, only flood and quiet days that occurred during winter were taken into consideration, as summer rain is very rare in the EM region. The DOY feature introduced by Ziskin and Reuveni [21] reflected this filtering process. Consequently, a total of 105 flash flood events and 1219 quiet days remained. To simulate a realistic flash flood scenario, we then used a randomized 80/20 train-test split, resulting in a balanced training set of 85 flood events and 85 quiet days, and a testing set of 20 flood events and 1134 quiet days. The testing set therefore contains roughly 57 quiet days per flood event (1134/20 ≈ 56.7). This split allowed us to evaluate the performance of the model on a separate, unseen dataset and thus to assess its robustness and generalizability.
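A split with these class counts can be reproduced with a short sketch. How the 85 quiet training days were selected is not stated in the text, so drawing them uniformly at random is an assumption, as are the function name and the integer sample identifiers.

```python
import random


def balanced_train_imbalanced_test(flood_ids, quiet_ids, n_train_per_class, seed=0):
    """Put n samples of each class in the training set; everything else is test data.

    Returns two lists of (sample_id, label) pairs with label 1 = flood, 0 = quiet.
    Random selection of the quiet training days is an assumption.
    """
    rng = random.Random(seed)
    floods, quiets = list(flood_ids), list(quiet_ids)
    rng.shuffle(floods)
    rng.shuffle(quiets)
    n = n_train_per_class
    train = [(i, 1) for i in floods[:n]] + [(i, 0) for i in quiets[:n]]
    test = [(i, 1) for i in floods[n:]] + [(i, 0) for i in quiets[n:]]
    return train, test


# Class sizes from the text: 105 flood events and 1219 quiet days.
train, test = balanced_train_imbalanced_test(range(105), range(105, 105 + 1219), 85)

n_test_flood = sum(label for _, label in test)
n_test_quiet = len(test) - n_test_flood
print(len(train), n_test_flood, n_test_quiet, round(n_test_quiet / n_test_flood, 1))
```

Training on balanced classes while testing on the remaining, heavily imbalanced samples is what makes the test-set precision a realistic indicator of the false alarm burden.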

4.3. Support Vector Machine (SVM)

In this study, the support vector machine (SVM) technique was chosen to classify the flash flood event dataset. The SVM algorithm was applied to the dataset from Ziskin and Reuveni [21], which includes precipitable water vapor, surface pressure, and DOY, augmented with the associated lightning activity, as explained above. The decision to employ the SVM technique was based on its demonstrated effectiveness in classification tasks, as previously shown by Ziskin and Reuveni [21].

SVM works by finding the hyperplane that separates the different classes with the maximum margin [48]. It remains effective when the data are not linearly separable [49], as in this setting the kernel trick may be used to embed the data in a higher-dimensional space admitting a linear separator [50]. In this study, the SVM approach was used to classify flood events based on their associated lightning activity vectors. To choose the optimal hyperparameters for the SVM model, Bayesian optimization was used rather than a grid search approach. Bayesian optimization is a global optimization method that uses a probabilistic model to guide the search for the best hyperparameters [51,52,53]. It has been shown to be more efficient and effective than grid search in many cases, particularly for complex, high-dimensional models such as SVM [54].
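To make the kernel trick concrete, the sketch below evaluates a radial basis function (RBF) kernel, which returns the inner product of two samples in an implicit higher-dimensional space without constructing that space. The kernel actually selected in this study is not stated in the text, so the RBF choice, the value of `gamma`, and the toy samples are assumptions for illustration only.

```python
import math


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2); equals 1 for identical inputs."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(samples, gamma):
    """Pairwise kernel values; this matrix is all an SVM solver needs from the data."""
    return [[rbf_kernel(a, b, gamma) for b in samples] for a in samples]


# Toy feature vectors and an assumed kernel width.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = gram_matrix(X, gamma=0.5)
for row in K:
    print([round(v, 4) for v in row])
```

In a Bayesian hyperparameter search, quantities such as `gamma` and the regularization strength would be the variables being tuned.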

4.4. K-Fold Cross Validation

As a key aspect of the evaluation of the model’s performance, we have incorporated a k-fold cross validation process. The k-fold cross validation involves dividing the entire training set into k equal subsets, using a randomized stratified sampling approach to ensure that each subset is representative of the overall dataset. In each iteration, one subset is used for testing and the other k − 1 subsets are used for training. This process is repeated k times, with each subset being used once for testing. The results from each iteration are then aggregated to produce a comprehensive evaluation of the model’s performance. This approach helps to detect overfitting, as the model is always tested on data that it did not see during training. The results from k-fold cross validation provide a useful understanding of how well the model generalizes to new data, providing a more robust evaluation of the model’s performance compared to training and testing with a single fixed dataset. In this study, 5-fold cross validation was used.
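The stratified k-fold procedure can be sketched in a few lines. This is a generic illustration rather than the implementation used in the study; the function name and the round-robin assignment of shuffled samples to folds are assumptions.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs with class ratios kept per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class out evenly across folds

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train_idx, test_idx


# Balanced training set from the text: 85 flood events (1) and 85 quiet days (0).
labels = [1] * 85 + [0] * 85
splits = list(stratified_kfold(labels, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

With 85 samples per class and k = 5, every fold holds 17 flood events and 17 quiet days, so each iteration trains on 136 samples and tests on 34.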

The decision to use here a standard k-fold cross validation approach instead of nested cross validation was made due to the limited amount of data available [55]. With limited data, the standard k-fold cross validation approach is a suitable choice as it provides good balance between the computational cost and the ability to obtain meaningful results, while still allowing for an evaluation of the model’s generalization performance [56,57].

Figure 4 shows the result of the cross validation process, where the groups refer to the nine different GNSS stations listed in Table 1.

4.5. Score Metrics

In this study, several score metrics, composed of different combinations of the true positive (TP), false negative (FN), true negative (TN), and false positive (FP) counts, were employed to assess the accuracy and robustness of the flood classification model. The score metrics used in this study include accuracy, precision, recall, F1 score, HSS score, TSS score, and the receiver operating characteristic (ROC) curve with its corresponding area under the curve (AUC), as suggested in previous studies [21,52,53,58].

Accuracy is the fraction of correct predictions made by the model, while precision is the proportion of true positive predictions among all positive predictions. Recall, also known as sensitivity, is the proportion of true positive predictions among all actual positive instances. The F1 score is the harmonic mean of precision and recall, and is often used as a single metric to balance these two measures.

The HSS and TSS scores are measures of the skill of a binary classification model in relation to a reference forecast. The Heidke skill score (HSS) measures the fractional improvement in correct predictions over the number expected by random chance, while the true skill statistic (TSS) is the difference between the hit rate and the false alarm rate, and thus accounts for both correctly predicted events and correctly predicted non-events.

The ROC curve is a graphical representation of the relationship between the true positive rate and the false positive rate of a binary classification model at different classification thresholds. The AUC is a measure of the overall performance of the model, with higher values indicating better performance.
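The AUC has an equivalent rank-based interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way; the function name and the toy labels and scores are hypothetical, and ties are counted as half a win by convention.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

Because it depends only on how samples are ranked, the AUC does not change with the choice of decision threshold.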

When working with imbalanced data, it is important to consider the impact of class imbalance on these score metrics. In such cases, it is often preferable to use metrics that are less sensitive to class imbalance, such as the HSS score and TSS score, in order to more accurately assess the performance of the model [59].

The following are the equations for the above metrics, where P = TP + FN and N = TN + FP denote the total numbers of actual positive and actual negative instances, respectively:

Precision = \frac{TP}{TP + FP} \quad (1)

Recall = \frac{TP}{TP + FN} \quad (2)

HSS = \frac{2 \cdot [(TP \times TN) - (FN \times FP)]}{(TP + FN) \cdot (FN + TN) + (TP + FP) \cdot (FP + TN)} \quad (3)

Accuracy = \frac{TP + TN}{P + N} \quad (4)

TSS = \frac{TP}{TP + FN} - \frac{FP}{FP + TN} \quad (5)

F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \quad (6)
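The six metrics can be evaluated directly from the four confusion-matrix counts, as in the sketch below. The function name and the example counts are hypothetical (they are not results from this study), and the HSS is computed in its standard Heidke form, whose denominator is (TP + FN)(FN + TN) + (TP + FP)(FP + TN).

```python
def skill_scores(tp, fn, tn, fp):
    """Return precision, recall, accuracy, F1, TSS, and HSS from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    tss = recall - false_alarm_rate
    hss_den = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    hss = 2 * (tp * tn - fn * fp) / hss_den if hss_den else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
        "tss": tss,
        "hss": hss,
    }


# Hypothetical counts for illustration only.
scores = skill_scores(tp=8, fn=2, tn=85, fp=5)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

For these made-up counts the accuracy is 0.93 while the precision is only about 0.62, which illustrates why accuracy alone can hide a substantial share of false alarms.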

5. Experimental Results

5.1. SVM Result

In this section, we present the results of the best SVM model obtained through the use of Bayesian optimization. Specifically, we evaluate the model’s results by examining the minimum classification error per iteration in the optimization process, as illustrated in Figure 5. The figure shows the progress of the optimization algorithm as it explores the hyperparameter space, with the y-axis representing the minimum classification error and the x-axis representing the number of iterations. It demonstrates the effectiveness of the Bayesian optimization algorithm in reducing the classification error over the course of the optimization process and in reaching a minimum. In addition, the visualization of the optimization process provides a clear understanding of the overall performance of the SVM model used in this study.

5.2. Skill Scores Results

In this section, the results of the skill score metrics evaluation for the best SVM model obtained through the use of Bayesian optimization are presented. The effectiveness of the model in accurately predicting the target variable is demonstrated by the results of this evaluation. Furthermore, a comparison was made between these results and those of Ziv and Reuveni [21] to provide insight into the relative performance of the SVM model compared to other approaches.

The comparison results are presented in Figure 6, demonstrating that the current model outperforms the results of Ziv and Reuveni [21] in terms of skill scores performance, indicating a higher accuracy in flash flood prediction for the realistic scenario.

The flood classification model’s experimental results are highly promising, with encouraging performance across multiple score metrics. The model achieved an accuracy of 0.9913 on the testing set, indicating correct predictions for the large majority of instances. Because of the class imbalance in the data, however, simply predicting the majority class all the time would also result in a relatively high accuracy, so this value should be read together with the other skill scores reported below.

To support the results obtained by the SVM model, it’s important to note that the evaluation was conducted on imbalanced data, which is a common problem in flood prediction. Imbalanced data refers to a situation where the number of positive examples (flood events) is much smaller than the number of negative examples (non-flood events). This can make it difficult for models to accurately predict the positive examples and can lead to a bias towards negative examples.
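To make the effect of class imbalance concrete, the sketch below scores a trivial classifier that always predicts "no flood" on a hypothetical sample with the 1:56 flood-to-quiet ratio of the testing set in this study. The counts (1 flood, 56 quiet days) are illustrative only.

```python
# Hypothetical sample with a 1:56 flood-to-quiet ratio.
n_flood, n_quiet = 1, 56

# A classifier that always predicts "no flood":
tp, fn = 0, n_flood  # every flood is missed
fp, tn = 0, n_quiet  # every quiet day is labelled correctly

accuracy = (tp + tn) / (n_flood + n_quiet)
recall = tp / (tp + fn)
tss = recall - fp / (fp + tn)
```

The accuracy is about 0.98 even though no flood is detected, while the recall and TSS are both zero, which is why accuracy needs to be reported alongside the other skill scores.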

The high skill scores achieved by the model of this study, despite the presence of imbalanced data, suggest that it is both robust and effective. Although imbalanced data are known to adversely affect model performance, the results of this study indicate that the model was able to overcome this issue and achieve high accuracy in predicting flash floods.

Furthermore, the high skill scores achieved in this study, particularly in precision and F1, show that the model has a low rate of false positives and false negatives. This is particularly important in flood prediction, since failing to predict a flood event, or incorrectly predicting a non-flood event as a flood, can have severe consequences.

In terms of the F1 score, which is the harmonic mean of precision and recall and is used to balance these two measures, the model achieved a value of 0.7917. This indicates that the model has a decent balance between the precision and recall metrics, with a relatively high recall value of 0.95 and a lower precision value of 0.6786. The high recall value suggests that the model is able to effectively detect a large proportion of the examined flash flood events, while the lower precision value indicates that there was a relatively large number of false positive predictions.

The HSS and TSS scores both measure the skill of a binary classification model in relation to a reference forecast. The model achieved an HSS score of 0.7875 and a TSS score of 0.9421, indicating strong performance in terms of both correctly predicted events and correctly predicted non-events. This suggests that the model was able to accurately classify both flood and non-flood events, and was not simply relying on the class imbalance to achieve high performance.
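The reported values can be cross-checked against the metric definitions. The short sketch below recomputes F1 from the reported precision and recall, and rearranges the TSS definition to obtain the implied false positive rate. The variable names are illustrative, and the inputs are the rounded values quoted in the text, so the results are approximate.

```python
# Rounded values reported in the text.
precision, recall, tss = 0.6786, 0.95, 0.9421

# F1 as the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# TSS = recall - false positive rate, hence:
false_positive_rate = recall - tss
```

The recomputed F1 rounds to 0.7917, in agreement with the reported value, and the implied false positive rate is roughly 0.008, i.e., fewer than 1% of the quiet days in the testing set are flagged as floods.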

The strong performance of the model across these score metrics demonstrates its effectiveness at classifying flood events based on their associated lightning activity within a given time window.

The high accuracy, TSS, and HSS scores indicate that the model succeeds in correctly identifying both flood and non-flood events, while the relatively high recall and lower precision values suggest that the model performs effectively in detecting flood events but produces a higher number of false positive predictions.

By using the same machine learning technique as the model presented in the work of Ziv and Reuveni [21], the current model was able to achieve improved performance owing to the addition of local lightning activity as an augmented feature. By doing so, we were able to provide the model with additional information regarding the key feature characteristics of each flash flood event, allowing it to make more accurate predictions.
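One way to picture the augmented feature is as an extra hourly sequence appended to the existing inputs. The sketch below bins lightning strike times into 1 h counts over the 24 h preceding a target hour and concatenates them with the PWV, pressure, and DOY values. The function names, the feature layout, and the sample values are assumptions made for illustration; the exact layout used in this study is not described in this section.

```python
def hourly_lightning_counts(strike_times_h, target_hour, window=24):
    """Count strikes in each one-hour bin of the `window` hours before `target_hour`.

    `strike_times_h` holds strike times in hours, on the same clock as `target_hour`.
    """
    start = target_hour - window
    counts = [0] * window
    for t in strike_times_h:
        if start <= t < target_hour:
            counts[int(t - start)] += 1
    return counts


def build_feature_vector(pwv_24h, pressure_24h, doy, lightning_24h):
    """Concatenate the hourly sequences and the day of year into one vector."""
    assert len(pwv_24h) == len(pressure_24h) == len(lightning_24h)
    return list(pwv_24h) + list(pressure_24h) + [doy] + list(lightning_24h)


# Hypothetical inputs: six strikes, two of which fall outside the window.
strikes = [1.2, 1.7, 5.0, 23.9, 24.0, 30.5]
lightning = hourly_lightning_counts(strikes, target_hour=24)
features = build_feature_vector([20.0] * 24, [1010.0] * 24, 15, lightning)
```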

In addition to the quantitative score metrics analysis, we also assess the current model performance using a confusion matrix and ROC curve representation to provide a visual analysis of the model’s performance. These are presented in Figure 7 and Figure 8, respectively. The confusion matrix indicates the number of correct and incorrect predictions made by the model for each class, allowing for a more detailed understanding of its performance. The ROC curve, on the other hand, illustrates the trade-off between the true positive rate and false positive rate at different classification thresholds, allowing for a more nuanced understanding of the model’s performance. Together, the confusion matrix and ROC curve representation provide a comprehensive view of the model’s performance and allow for a more thorough evaluation of its accuracy and robustness.
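For readers less familiar with the confusion matrix, the following minimal sketch shows how its four cells are counted from true and predicted labels. The function name and the label lists are hypothetical and unrelated to the counts shown in Figure 7.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels, with 1 = flood and 0 = quiet."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn


# Hypothetical true and predicted labels.
y_true = [1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
```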

In this study, a comparison was made to recent studies by Panahi et al. [39] and Bui et al. [40] in order to provide a more comprehensive understanding of the performance of the approach, see Figure 9. Notably, the comparison emphasizes the performance of the approach in the presence of imbalanced data, an aspect that has not been extensively investigated in either Panahi et al. [39] or Bui et al. [40]. By highlighting this research gap, the comparison underscores the novelty of the approach in addressing this critical issue and emphasizes the need for further research in this area. Despite this, the approach continues to demonstrate relatively good performance in comparison to available metrics, providing a promising foundation for future research efforts to utilize this methodology.

Incorporating nearby lightning activity around the examined hydrometric stations as a feature allowed the model to capture the correlation between lightning activity and flash flood occurrence, enhancing the SVM model results presented in the previous study carried out by Ziv and Reuveni [21]. This augmented feature added information that clearly contributed to the improved performance of the model, as indicated across the various examined score metrics. Altogether, these results demonstrate the advantage of including diverse relevant features in ML models, along with the potential for improved performance by leveraging additional data sources.

6. Discussion

Flash floods are a major natural disaster that can cause significant damage and loss of human lives. As such, the development of accurate and reliable methods for predicting flash flood events is of critical importance for risk management and disaster response efforts.

The data were filtered to include only flash flood events with available nearby lightning data, taking into account the DOY feature (i.e., only integrating lightning that occurred during winter time), resulting in a dataset of 105 flash flood events along with 1219 quiet (non-flood) days. We separated the resulting dataset into a training set (80% of the data) and a testing set (20% of the data), ensuring that the ratio of flood events to quiet days was approximately 1:1 for the training set (i.e., a balanced set), whereas the remaining testing set had a ratio of 1:56.
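The split described above can be sketched as follows, under one possible reading: a fraction of the flood events is held out for testing, the training set is balanced by drawing an equal number of quiet days, and all remaining quiet days go to the testing set. The function name, the random seed, the identifiers, and this reading of the 80%/20% split are assumptions. The sketch is not meant to reproduce the exact 1:56 testing ratio reported above, which depends on details not given in this section.

```python
import random


def split_balanced_train(flood_ids, quiet_ids, train_fraction=0.8, seed=0):
    """Split samples so that the training set holds equal numbers of both classes."""
    rng = random.Random(seed)
    floods, quiet = list(flood_ids), list(quiet_ids)
    rng.shuffle(floods)
    rng.shuffle(quiet)
    n_train = round(train_fraction * len(floods))
    train_flood, test_flood = floods[:n_train], floods[n_train:]
    # Balanced training set: as many quiet days as flood events.
    train_quiet, test_quiet = quiet[:n_train], quiet[n_train:]
    return (train_flood, train_quiet), (test_flood, test_quiet)


# Placeholder identifiers sized like the dataset described in the text.
floods = [f"flood_{i}" for i in range(105)]
quiet = [f"quiet_{i}" for i in range(1219)]
(train_f, train_q), (test_f, test_q) = split_balanced_train(floods, quiet)
```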

The pre-processed data were used to train an SVM model to classify flash flood events based on their adjusted PWV, surface pressure, and associated nearby lightning activity. The model achieved impressive performance across multiple score metrics calculated from the imbalanced testing set, including an accuracy of 0.9913, F1 score of 0.7917, HSS score of 0.7875, precision of 0.6786, recall of 0.95, and TSS score of 0.9421. These results demonstrate the effectiveness of the model in accurately predicting flash flood events, particularly in the presence of imbalanced data.

In this study, the focus was on the imbalanced testing set, which simulates the rarity of flash flood occurrence that is typical for the study area in the EM region. This scenario was estimated to represent a flash flood frequency of 1 in 57 days. Results were similar to those reported by Ziv and Reuveni [21] for most metrics, but a notable improvement in the precision and F1 metrics was observed, demonstrating the ability of this model to accurately classify both flash flood and non-flood events in a more realistic scenario.

The results of the confusion matrix and ROC curve representation are presented in addition to the quantitative score metrics to provide a visual understanding of the model’s performance. The confusion matrix shows the number of correct and incorrect predictions made by the model for each class, while the ROC curve illustrates the trade-off between the true positive and false positive rate at different classification thresholds. This provides a comprehensive evaluation of the model’s accuracy and robustness.

The comparison with recent studies presented in Figure 9 has provided a more comprehensive understanding of the performance of the current approach in comparison to other recent works. Notably, this comparison is significant for its emphasis on the performance of the current approach when faced with imbalanced data, which has not been extensively examined in either Panahi et al. [39] or Bui et al. [40]. By highlighting this research gap, the comparison underscores the novelty of the current approach in addressing this critical issue and the need for further research in this area. Despite this, the current approach demonstrates promising results and continues to perform relatively well in comparison to the available metrics, serving as a promising foundation for future research efforts aimed at utilizing the methodology.

In summary, the potential for accurately classifying flood events using machine learning and lightning activity data was demonstrated in this study. An improvement over the previous research presented by Ziv and Reuveni [21] has been achieved, and the value of using advanced machine learning techniques and diverse data sources to build more accurate and robust models has been highlighted. The use of additional features and data sources to further improve model performance, as well as the application of the model in operational settings to aid in flood prediction and risk management efforts, may be explored in further research.

7. Conclusions

The objective of this study was to explore the classification of flash flood events using an SVM model that incorporates GNSS-PWV and surface pressure measurements, augmented by nearby lightning activity data. The experimental results demonstrated that the model’s performance improved when nearby lightning activity was incorporated as an augmented feature, capturing the correlation between atmospheric electricity characteristics and flash flood occurrence. This improvement was observed in the precision and F1 metrics’ performance on an imbalanced testing set, contributing to the development of a more accurate and reliable flash flood classification system. The findings suggest that the integration of atmospheric electricity data can enhance the performance of existing flash flood prediction models and help mitigate the devastating effects of these natural disasters.

Author Contributions

All authors have made significant contributions to the manuscript. S.A. processed the lightning data along with the flood data, designed and implemented the SVM algorithms, wrote the main manuscript text and prepared all the figures and tables in the manuscript; L.-A.G., Y.Y. and C.P. revised the manuscript; Y.R. conceived and designed part of the algorithm development, analyzed the data and results and is the main author who developed and revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by Israel Science Foundation grant number: 1602/19.

Data Availability Statement

The data presented in this study are contained within the article in Section 3.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

EM	Eastern Mediterranean
ML	Machine Learning
PWV	Precipitable Water Vapor
GNSS	Global Navigation Satellite System
PPP	Precise Point Positioning
GPS	Global Positioning System
ZTD	Zenith Tropospheric Delay
ZHD	Zenith Hydrostatic Delay
DOY	Day Of Year
RF	Random Forest
MLP	Multi-Layer Perceptron
SVM	Support Vector Machine
WRF	Weather Research and Forecasting
CNN	Convolutional Neural Networks
RNN	Recurrent Neural Networks
AUC	Area Under the Curve
WWLLN	World Wide Lightning Location Network
ILDN	Israel Lightning Detection Network
ROC	Receiver Operating Characteristic

References

1. Borga, M.; Anagnostou, E.; Blöschl, G.; Creutin, J.D. Flash flood forecasting, warning and risk management: The HYDRATE project. Environ. Sci. Policy 2011, 14, 834–844.
2. Llasat, M.C.; Llasat-Botija, M.; Prat, M.; Porcu, F.; Price, C.; Mugnai, A.; Lagouvardos, K.; Kotroni, V.; Katsanos, D.; Michaelides, S.; et al. High-impact floods and flash floods in Mediterranean countries: The FLASH preliminary database. Adv. Geosci. 2010, 23, 47–55.
3. Rao, K.D.; Rao, V.V.; Dadhwal, V.; Diwakar, P. Kedarnath flash floods: A hydrological and hydraulic simulation study. Curr. Sci. 2014, 106, 598–603.
4. Arrighi, C.; Pregnolato, M.; Dawson, R.; Castelli, F. Preparedness against mobility disruption by floods. Sci. Total Environ. 2019, 654, 1010–1022.
5. Andréassian, V.; Oddos, A.; Michel, C.; Anctil, F.; Perrin, C.; Loumagne, C. Impact of spatial aggregation of inputs and parameters on the efficiency of rainfall-runoff models: A theoretical study using chimera watersheds. Water Resour. Res. 2004, 40.
6. Rozalis, S.; Morin, E.; Yair, Y.; Price, C. Flash flood prediction using an uncalibrated hydrological model and radar rainfall data in a Mediterranean watershed under changing hydrological conditions. J. Hydrol. 2010, 394, 245–255.
7. Zoccatelli, D.; Borga, M.; Zanon, F.; Antonescu, B.; Stancalie, G. Which rainfall spatial information for flash flood response modelling? A numerical investigation based on data from the Carpathian range, Romania. J. Hydrol. 2010, 394, 148–161.
8. Yakir, H.; Morin, E. Hydrologic response of a semi-arid watershed to spatial and temporal characteristics of convective rain cells. Hydrol. Earth Syst. Sci. 2011, 15, 393–404.
9. Goodrich, D.C.; Faurès, J.M.; Woolhiser, D.A.; Lane, L.J.; Sorooshian, S. Measurement and analysis of small-scale convective storm rainfall variability. J. Hydrol. 1995, 173, 283–308.
10. Syed, K.H.; Goodrich, D.C.; Myers, D.E.; Sorooshian, S. Spatial characteristics of thunderstorm rainfall fields and their relation to runoff. J. Hydrol. 2003, 271, 1–21.
11. Segond, M.L.; Wheater, H.S.; Onof, C. The significance of spatial rainfall representation for flood runoff estimation: A numerical evaluation based on the Lee catchment, UK. J. Hydrol. 2007, 347, 116–131.
12. Karklinsky, M.; Morin, E. Spatial characteristics of radar-derived convective rain cells over southern Israel. Meteorol. Z. 2006, 15, 513–520.
13. Morin, E.; Jacoby, Y.; Navon, S.; Bet-Halachmi, E. Towards flash-flood prediction in the dry Dead Sea region utilizing radar rainfall information. Adv. Water Resour. 2009, 32, 1066–1076.
14. Peleg, N.; Morin, E. Convective rain cells: Radar-derived spatiotemporal characteristics and synoptic patterns over the eastern Mediterranean. J. Geophys. Res. Atmos. 2012, 117, D15116.
15. Shehata, M.; Mizunaga, H. Flash flood risk assessment for Kyushu Island, Japan. Environ. Earth Sci. 2018, 77, 76.
16. Price, C.; Yair, Y.; Mugnai, A.; Lagouvardos, K.; Llasat, M.C.; Michaelides, S.; Dayan, U.; Dietrich, S.; Galanti, E.; Garrote, L.; et al. The FLASH Project: Using lightning data to better understand and predict flash floods. Environ. Sci. Policy 2011, 14, 898–911.
17. Qian, K.; Mohamed, A.; Claudel, C. Physics informed data driven model for flood prediction: Application of deep learning in prediction of urban flood development. arXiv 2019, arXiv:1908.10312.
18. Nguyen, D.T.; Chen, S.T. Real-time probabilistic flood forecasting using multiple machine learning methods. Water 2020, 12, 787.
19. Puttinaovarat, S.; Horkaew, P. Flood forecasting system based on integrated big and crowdsource data by using machine learning techniques. IEEE Access 2020, 8, 5885–5905.
20. Nevo, S.; Morin, E.; Gerzi Rosenthal, A.; Metzger, A.; Barshai, C.; Weitzner, D.; Voloshin, D.; Kratzert, F.; Elidan, G.; Dror, G.; et al. Flood forecasting with machine learning models in an operational framework. Hydrol. Earth Syst. Sci. 2022, 26, 4013–4032.
21. Ziv, S.Z.; Reuveni, Y. Flash Floods Prediction Using Precipitable Water Vapor Derived From GPS Tropospheric Path Delays Over the Eastern Mediterranean. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17.
22. Bevis, M.; Businger, S.; Herring, T.A.; Rocken, C.; Anthes, R.A.; Ware, R.H. GPS meteorology: Remote sensing of atmospheric water vapor using the global positioning system. J. Geophys. Res. Atmos. 1992, 97, 15787–15801.
23. Bevis, M.; Businger, S.; Chiswell, S.; Herring, T.A.; Anthes, R.A.; Rocken, C.; Ware, R.H. GPS meteorology: Mapping zenith wet delays onto precipitable water. J. Appl. Meteorol. (1988–2005) 1994, 33, 379–386.
24. Leontiev, A.; Reuveni, Y. Combining Meteosat-10 satellite image data with GPS tropospheric path delays to estimate regional integrated water vapor (IWV) distribution. Atmos. Meas. Tech. 2017, 10, 537–548.
25. Leontiev, A.; Reuveni, Y. Augmenting GPS IWV estimations using spatio-temporal cloud distribution extracted from satellite data. Sci. Rep. 2018, 8, 14785.
26. Leontiev, A.; Rostkier-Edelstein, D.; Reuveni, Y. On the potential of improving WRF model forecasts by assimilation of high-resolution GPS-derived water-vapor maps augmented with METEOSAT-11 data. Remote Sens. 2020, 13, 96.
27. Reuveni, Y.; Kedar, S.; Owen, S.E.; Moore, A.W.; Webb, F.H. Improving sub-daily strain estimates using GPS measurements. Geophys. Res. Lett. 2012, 39, L11311.
28. Reuveni, Y.; Kedar, S.; Moore, A.; Webb, F. Analyzing slip events along the Cascadia margin using an improved subdaily GPS analysis strategy. Geophys. J. Int. 2014, 198, 1269–1278.
29. Reuveni, Y.; Bock, Y.; Tong, X.; Moore, A.W. Calibrating interferometric synthetic aperture radar (InSAR) images with regional GPS network atmosphere models. Geophys. J. Int. 2015, 202, 2106–2119.
30. Ziskin Ziv, S.; Alpert, P.; Reuveni, Y. Long-term variability and trends of precipitable water vapour derived from GPS tropospheric path delays over the Eastern Mediterranean. Int. J. Climatol. 2021, 41, 6433–6454.
31. Ziv, S.Z.; Yair, Y.; Alpert, P.; Uzan, L.; Reuveni, Y. The diurnal variability of precipitable water vapor derived from GPS tropospheric path delays over the Eastern Mediterranean. Atmos. Res. 2021, 249, 105307.
32. Lynn, B.; Yair, Y.; Levi, Y.; Ziv, S.Z.; Reuveni, Y.; Khain, A. Impacts of non-local versus local moisture sources on a heavy (and deadly) rain event in Israel. Atmosphere 2021, 12, 855.
33. Harats, N.; Ziv, B.; Yair, Y.; Kotroni, V.; Dayan, U. Lightning and rain dynamic indices as predictors for flash floods events in the Mediterranean. Adv. Geosci. 2010, 23, 57–64.
34. Koutroulis, A.; Grillakis, M.; Tsanis, I.; Kotroni, V.; Lagouvardos, K. Lightning activity, rainfall and flash flooding–occasional or interrelated events? A case study in the island of Crete. Nat. Hazards Earth Syst. Sci. 2012, 12, 881–891.
35. Soula, S.; Chauzy, S. Some aspects of the correlation between lightning and rain activities in thunderstorms. Atmos. Res. 2001, 56, 355–373.
36. Price, C.; Federmesser, B. Lightning-rainfall relationships in Mediterranean winter thunderstorms. Geophys. Res. Lett. 2006, 33.
37. Giannaros, C.; Dafis, S.; Stefanidis, S.; Giannaros, T.M.; Koletsis, I.; Oikonomou, C. Hydrometeorological analysis of a flash flood event in an ungauged Mediterranean watershed under an operational forecasting and monitoring context. Meteorol. Appl. 2022, 29, e2079.
38. Varlas, G.; Papadopoulos, A.; Papaioannou, G.; Dimitriou, E. Evaluating the forecast skill of a hydrometeorological modelling system in Greece. Atmosphere 2021, 12, 902.
39. Panahi, M.; Jaafari, A.; Shirzadi, A.; Shahabi, H.; Rahmati, O.; Omidvar, E.; Lee, S.; Bui, D.T. Deep learning neural networks for spatially explicit prediction of flash flood probability. Geosci. Front. 2021, 12, 101076.
40. Bui, D.T.; Hoang, N.D.; Martínez-Álvarez, F.; Ngo, P.T.T.; Hoa, P.V.; Pham, T.D.; Samui, P.; Costache, R. A novel deep learning neural network approach for predicting flash flood susceptibility: A case study at a high frequency tropical storm area. Sci. Total Environ. 2020, 701, 134413.
41. Band, S.S.; Janizadeh, S.; Chandra Pal, S.; Saha, A.; Chakrabortty, R.; Melesse, A.M.; Mosavi, A. Flash flood susceptibility modeling using new approaches of hybrid and ensemble tree-based machine learning algorithms. Remote Sens. 2020, 12, 3568.
42. Barnolas, M.; Atencia, A.; Llasat, M.; Rigo, T. Characterization of a Mediterranean flash flood event using rain gauges, radar, GIS and lightning data. Adv. Geosci. 2008, 17, 35–41.
43. Bertiger, W.; Bar-Sever, Y.; Dorsey, A.; Haines, B.; Harvey, N.; Hemberger, D.; Heflin, M.; Lu, W.; Miller, M.; Moore, A.W.; et al. GipsyX/RTGx, a new tool set for space geodetic operations and research. Adv. Space Res. 2020, 66, 469–489.
44. Böhm, J.; Niell, A.; Tregoning, P.; Schuh, H. Global Mapping Function (GMF): A new empirical mapping function based on numerical weather model data. Geophys. Res. Lett. 2006, 33, L07304.
45. Rodger, C.; Brundell, J.; Holzworth, R.; Lay, E. Growing detection efficiency of the world wide lightning location network. AIP Conf. Proc. 2009, 1118, 15–20.
46. Shalev, S.; Saaroni, H.; Izsak, T.; Yair, Y.; Ziv, B. The spatio-temporal distribution of lightning over Israel and the neighboring area and its relation to regional synoptic systems. Nat. Hazards Earth Syst. Sci. 2011, 11, 2125–2135.
47. Khalid, S.; Khalil, T.; Nasreen, S. A survey of feature selection and feature extraction techniques in machine learning. In Proceedings of the 2014 Science and Information Conference, London, UK, 27–29 August 2014; pp. 372–378.
48. Noble, W.S. What is a support vector machine? Nat. Biotechnol. 2006, 24, 1565–1567.
49. Suykens, J.A. Nonlinear modelling and support vector machines. In Proceedings of the IMTC 2001. Proceedings of the 18th IEEE Instrumentation and Measurement Technology Conference. Rediscovering Measurement in the Age of Informatics (Cat. No. 01CH 37188), Budapest, Hungary, 21–23 May 2001; Volume 1, pp. 287–294.
50. Hofmann, M. Support vector machines-kernels and the kernel trick. Notes 2006, 26, 1–16.
51. Snoek, J.; Larochelle, H.; Adams, R.P. Practical bayesian optimization of machine learning algorithms. Adv. Neural Inf. Process. Syst. 2012, 25, 2951–2959.
52. Asaly, S.; Gottlieb, L.A.; Reuveni, Y. Using support vector machine (SVM) and ionospheric total electron content (TEC) data for solar flare predictions. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 1469–1481.
53. Asaly, S.; Gottlieb, L.A.; Inbar, N.; Reuveni, Y. Using support vector machine (SVM) with GPS ionospheric TEC estimations to potentially predict earthquake events. Remote Sens. 2022, 14, 2822.
54. Joy, T.T.; Rana, S.; Gupta, S.; Venkatesh, S. Batch Bayesian optimization using multi-scale search. Knowl.-Based Syst. 2020, 187, 104818.
55. Wainer, J.; Cawley, G. Nested cross-validation when selecting classifiers is overzealous for most practical applications. Expert Syst. Appl. 2021, 182, 115222.
56. Zhang, Y.; Yang, Y. Cross-validation for selecting a model selection procedure. J. Econom. 2015, 187, 95–112.
57. Rodriguez, J.D.; Perez, A.; Lozano, J.A. Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 569–575.
58. Landa, V.; Reuveni, Y. Low-dimensional convolutional neural network for solar flares GOES time-series classification. Astrophys. J. Suppl. Ser. 2022, 258, 12.
59. Ahmadzadeh, A.; Angryk, R.A. Measuring Class-Imbalance Sensitivity of Deterministic Performance Evaluation Metrics. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 51–55.

Figure 1. Local map illustration of all lightning occurrences in both the WWLLN and ILDN datasets (indicated by red dots) in proximity to each of the 9 SOI-APN GNSS stations used by Ziv and Reuveni [21], within a radius of 10 km around each GNSS station.

Figure 2. Main block diagram illustration. The input features are the PWV, surface pressure, day of year, and nearby lightning activity, with the target being the flash flood occurrence times. The pre-processing stage involves standardizing the lightning data by resampling them at a 1 h time resolution, and aligning the hydrometric station data, GNSS-PWV, DOY and surface pressure measurements following the pre-processing step described by Ziv and Reuveni [21]. The creation of 24 h sequences, with balanced classes, concludes the pre-processing phase. In the learning process, the SVM model is optimized using a cross-validation technique. The final output of each model is a prediction of whether or not a flash flood will occur in the 25th hour.

Figure 3. A comparison between the mean lightning strikes within a time window of 24 h prior to all flash flood events analyzed in this study (blue line), versus the mean lightning strikes 24 h prior to quiet days (non-flood events). As can be seen, the mean number of lightning strikes on quiet days plus one standard deviation does not reach the mean number of lightning strikes on flood days minus one standard deviation. This finding suggests that there is a strong correlation between the number of lightning strikes and the likelihood of flood occurrence.

Figure 4. Five-Fold Cross Validation Results. The diagram illustrates the performance of the cross validation process with 5 subsets of the data, obtained through a randomized stratified sampling approach, allowing each iteration to randomly pick testing sets while still taking into account all 9 stations (groups).

Figure 5. Evolution of the minimum classification error during the Bayesian optimization of the SVM model. The y-axis represents the minimum classification error, while the x-axis represents the number of iterations. The plot demonstrates the effectiveness of the Bayesian optimization algorithm in reducing the classification error over the course of the optimization process until it settles at the lowest error found.

Figure 6. Comparison of skill score metrics for flash flood event prediction between the current SVM model and the work of Ziv and Reuveni [21]. The results show an improvement in the accuracy of the current model in predicting flash flood events and non-floods, as indicated by the higher values in most skill scores.

Figure 7. The confusion matrix for the SVM model results extracted from the training set (left), and the test set (right).

Figure 8. ROC model curve obtained during the hyperparameters optimization process.

Figure 9. Comparison of skill score metrics for flash flood event prediction between the current SVM model and the works of Panahi et al. [39] and Bui et al. [40].

Table 1. Geographical coordinates and names of the SOI-APN GNSS stations used by Ziv and Reuveni [21], in accordance with the lightning occurrence locations shown in Figure 1.

GNSS Station Name	Latitude [°N]	Longitude [°E]
Nizana	30.88	34.2
Kibutz Lahav	31.38	34.87
Yerucham	30.99	34.93
Mitzpe Ramon	30.60	34.76
Metzoki dragot	31.59	35.39
Dead-Sea Manufactories	31.04	35.37
Sapir	30.61	35.18
Kibutz Neve Harif	30.04	35.04
Eilat	29.51	34.92

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

