Article

Assessing Machine Learning Models for Gap Filling Daily Rainfall Series in a Semiarid Region of Spain

by Juan Antonio Bellido-Jiménez *, Javier Estévez Gualda and Amanda Penélope García-Marín
Projects Engineering Area, Department of Rural Engineering, University of Córdoba, 14071 Córdoba, Spain
* Author to whom correspondence should be addressed.
Atmosphere 2021, 12(9), 1158; https://doi.org/10.3390/atmos12091158
Submission received: 21 August 2021 / Revised: 6 September 2021 / Accepted: 6 September 2021 / Published: 9 September 2021

Abstract

The presence of missing data in hydrometeorological datasets is a common problem, usually due to sensor malfunction, deficiencies in record storage and transmission, or issues in other recovery procedures. These missing values are the primary source of problems when analyzing and modeling the spatial and temporal variability of these datasets. Thus, accurate gap-filling techniques for rainfall time series are necessary to obtain complete datasets, which are crucial for studying the evolution of climate change. In this work, several machine learning models were assessed for gap-filling rainfall data, using different approaches and locations in the semiarid region of Andalusia (Southern Spain). Based on the results obtained, the use of data from neighbor stations located within a 50 km radius clearly outperformed the rest of the assessed approaches, with best values of RMSE (root mean squared error) of 1.246 mm/day, MBE (mean bias error) of −0.001 mm/day, and R2 of 0.898. Moreover, results for the inland area outperformed those for the coastal area in most locations, revealing the effect of distance to the sea on model efficiency (with an improvement of up to 63.89% in terms of RMSE). Finally, machine learning (ML) models (especially MLP (multilayer perceptron)) notably outperformed simple linear regression estimations at the coastal sites, whereas at inland locations the improvements were not as significant.

1. Introduction

The spatial and temporal analysis of meteorological parameters, such as rainfall, is crucial to numerous environmental, hydrological, and agroclimatic studies, as well as to optimization tasks, such as water resource management or irrigation scheduling [1,2,3,4]. However, one of the most common problems in the analysis of time series, such as rainfall datasets, is the presence of gaps of different widths, which makes this task harder to carry out. These gaps usually result from malfunctioning sensors or data loggers, lack of maintenance, meteorological events, or power outages. Sometimes, the problem cannot be solved immediately and causes delays because it requires the intervention of qualified personnel. Therefore, before starting any analysis, a common practice is to fill these gaps using different methodologies and to apply automatic algorithms that detect spurious signals in automated weather stations [5].
Due to its high spatio-temporal variability and the large number of interconnected variables involved, rainfall is one of the most challenging atmospheric variables to characterize, estimate, and forecast [6], especially on a daily basis, with higher volatility and chaotic patterns [7]. A variety of techniques have been developed on both a monthly and daily basis. One of the most frequent algorithms to estimate missing rainfall records is the inverse distance weighting method (IDWM), where the estimated values are calculated as a weighted average of the observations from neighbor stations (the weights are assigned using the inverse of the distance) [8,9]. Another simple method to apply is the gauge mean estimator, which uses an average value of observations from the nearby stations, which can be obtained by optimization, proximity metric, or correlation, among other techniques [10]. Ordinary kriging is a geostatistical interpolation method based on spatially dependent variance, computed from scalar measurements at different locations, where the weights are derived from the distance between the source and the target stations [11,12,13]. However, these three methods tend to overestimate the number of rainy days and underestimate their magnitudes, and even a negative correlation between close stations is found in several reports [13,14]. Xia et al. [15] evaluated six methodologies (simple arithmetic averaging, inverse distance interpolation, normal ratio method, single best estimator, multiple regression analysis (REG), and closest station method) for estimating missing data in two stations in Germany and concluded that REG consistently obtained the best performance. Teegavarapu and Chandramouli [8] highlighted that the use of the coefficient of correlation provided an improvement in estimating the missing data and recommended the coefficient of correlation weighting method, artificial neural network estimation method, and kriging estimation method for this purpose, due to their conceptually superior performance.
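As an illustration of the inverse distance weighting idea described above, the following minimal sketch estimates a missing value from neighbor gauges. The function name, the planar coordinates in km, the gauge values, and the `power` parameter (1 matches the plain inverse of the distance; 2 is another common choice) are assumptions made for this example and are not taken from the cited works.

```python
import math

def idw_estimate(target_xy, neighbors, power=1.0):
    """Estimate rainfall at target_xy from (x, y, rainfall) neighbor tuples."""
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, rain in neighbors:
        dist = math.hypot(x - target_xy[0], y - target_xy[1])
        if dist == 0.0:
            return rain  # a gauge located exactly at the target is used directly
        weight = 1.0 / dist ** power  # closer gauges receive larger weights
        weighted_sum += weight * rain
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical gauges: (x_km, y_km, rainfall_mm)
gauges = [(10.0, 0.0, 4.0), (0.0, 20.0, 8.0)]
estimate = idw_estimate((0.0, 0.0), gauges)
estimate_squared = idw_estimate((0.0, 0.0), gauges, power=2.0)
```

With these hypothetical gauges, the nearer station (10 km away) receives twice the weight of the farther one (20 km) when `power=1.0`, and four times the weight when `power=2.0`.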
Teegavarapu et al. [16] introduced the fixed functional set genetic algorithm method (FFSGAM), eliminating the use of rigid functional forms and weighting approaches for gap-filling. FFSGAM outperformed conventional IDWM. Adhikary et al. [12] developed genetic programming-based ordinary kriging (GPOK) as a new variant of the kriging method, using the genetic programming-derived variogram model and ordinary kriging. GPOK obtained the best results when compared to ANN-based ordinary kriging and traditional ordinary kriging. Different authors [17,18,19] have evaluated the k-nearest-neighbor algorithm, in conjunction with machine learning models, such as multilayer perceptron (MLP), support vector machine (SVM), and random forest (RF), with promising results. Bagirov et al. [20] evaluated clusterwise linear regression (CLR), using different combinations of maximum and minimum daily air temperature, evaporation, vapor pressure, and solar radiation to predict monthly rainfall in Victoria, Australia. Their results showed a higher performance of CLR compared with other methods, such as cluster regression-expectation maximization, multiple linear regression, support vector regression (SVR), and MLP. Kajewska-Szkudlarek [21] assessed the use of cluster analysis with SVR to improve daily rainfall prediction in urban areas.
Additionally, other researchers have studied the performance of processing algorithms, such as wavelets [22,23], variational mode decomposition (VMD) [24], or singular spectrum analysis (SSA) [25,26]. Estévez et al. [22] evaluated different combinations of wavelet analysis with thermo-pluviometric variables, using MLP in sixteen locations of Spain to forecast monthly rainfall. The results indicated the suitability of the models using thermo-pluviometric variables without requiring long-term datasets. Partal and Kisis [23] assessed a wavelet analysis, in conjunction with neuro-fuzzy models, to forecast daily rainfall in Turkey. The developed models were significantly superior to traditional machine learning approaches, with a coefficient of determination (R2) around 0.8–0.9. Li et al. [24] studied the performance of VMD, coupled with an extreme learning machine (ELM) model, to improve monthly rainfall forecasts in the northwest of China. This hybrid model clearly outperformed traditional algorithms, with a very low computational cost, due to the non-training requirement of ELM. Filho and Lima [25] evaluated singular spectrum analysis (SSA) for forecasting monthly rainfall in Brazil. Based on the results, it could be concluded that the SSA caterpillar algorithm can deal with the inherent non-stationary nature of rainfall records, extracting its long varying trends and periodic components. Sun et al. [26] assessed SSA in Korea with linear recurrent formulas (LRF) and MLP. MLP obtained the best performance when forecasting monthly rainfall.
Finally, due to the significant advances in computation, deep learning algorithms are gaining very high popularity. In this sense, Kim et al. [27] evaluated the convolutional neural network (CNN), in conjunction with long short-term memory (LSTM), named convLSTM, to nowcast rainfall 1 and 2 h in advance, using a two-year dataset. ConvLSTM was able to reduce RMSE by 23%, when compared to linear regression. Ha et al. [28] developed a deep belief network model to forecast rainfall one day ahead in Seoul, which performed better than MLP. Chen et al. [29] studied the performance of convLSTM with group normalization (GN) to improve the optimization process, employing a multi-sigmoid loss inspired by the critical success index (CSI), and compared it to the COTREC model. COTREC obtained better performance in terms of intensity in some areas, whereas convLSTM produced a generally more reliable forecast.
This study aims to create a daily rainfall estimation model using only precipitation data, with different approaches, in semiarid regions such as Andalusia, in order to fill possible gaps in precipitation datasets. Additionally, a new approach is tested in daily rainfall estimations, which uses future precipitation values for this purpose. Thus, in this work, several machine learning models (MLP, SVM, and RF) and approaches for estimating missing rainfall data were tested and compared to empirical algorithms, such as linear interpolation (LI), in 14 locations from two different regions of Andalusia (coastal and inland areas) in Southern Spain. The first approach (A) uses the rainfall data of neighbor stations on the same gap day and their distances to the target station. All these neighbor stations are located within a radius of up to 50 km, following the recommendations of Barrios et al. [9] on a monthly basis and Estévez et al. [30] on a daily basis. Secondly, a new approach is considered, using only rainfall data (past and future values) from the target station as the model’s inputs. Specifically, two different configurations were tested: (B) one day before and after the gap day and (C) two days before and after the gap day.
The rest of the work is organized as follows. Section 2 shows the information about the locations, the dataset, the theoretical background of the different machine learning (ML) models assessed, the preprocessing algorithms, and the evaluation metrics. Then, in Section 3, the results are reported and discussed. Finally, Section 4 describes the conclusions achieved in this work.

2. Materials and Methods

2.1. Source of Data

This study was carried out in Andalusia, Southern Spain, located in the southwest of Europe. Andalusia is a semiarid region with the following features: longitude ranges from 1° to 7° W, latitude from 37° to 39° N, elevation above mean sea level from 26 to 822 m, and a total area of 87,268 km2.
The datasets used belong to the Agroclimatic Information Network of Andalusia (RIAA) and can be downloaded at the following link: https://www.juntadeandalucia.es/agriculturaypesca/ifapa/riaweb/web (accessed on 30 July 2021). A total of 14 stations, divided into two areas (inland and coastal locations), were evaluated. The first area (inland) included Jaen, La Higuera de Arjona, Linares, Mancha Real, Marmolejo, Sabiote, and Torreblascopedro, and the second area (coastal) included Málaga, Antequera, Archidona, Cártama, Churriana, Pizarra, and Vélez. Figure 1 shows their geographical locations, and Table 1 shows their geo-climatic characteristics.

2.2. Methodology

An essential prerequisite to guarantee reliable results using raw meteorological data is the application of quality assurance procedures. The quality control guidelines reported by Estévez et al. [31] were followed, as well as the procedure to detect spurious precipitation signals from automated weather stations (AWS), also proposed by Estévez et al. [5].
Afterward, data preprocessing was required for every approach to obtain the corresponding input configuration for each strategy (see Table 2). Three different approaches were evaluated: approach (A), the use of rainfall data from neighbor stations, together with their distances to the target station, to estimate the precipitation values at the target site (all neighbor stations are located within a 50 km radius); approach (B), the use of rainfall values from one day before and one day after the gap day at the target station; and approach (C), the use of rainfall values from two days before and two days after the gap day at the target station.
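The input configuration of the two target-station approaches (one or two days before and after the gap day) can be sketched as follows. This is only an illustrative preprocessing routine: the function name, the use of `None` to mark missing records, and the sample series are assumptions for this example and do not reproduce the authors' code.

```python
def build_gap_features(series, window):
    """Return (inputs, targets) using `window` days before and after each target day."""
    inputs, targets = [], []
    for day in range(window, len(series) - window):
        before = series[day - window:day]
        after = series[day + 1:day + 1 + window]
        neighborhood = before + after
        # Skip days whose target or surrounding days contain a missing value (None)
        if series[day] is None or any(value is None for value in neighborhood):
            continue
        inputs.append(neighborhood)
        targets.append(series[day])
    return inputs, targets

# Hypothetical daily rainfall in mm; None marks a missing record
rain = [0.0, 1.2, 5.0, 0.0, 0.0, 3.4, None, 0.6]
inputs_one_day, targets_one_day = build_gap_features(rain, window=1)
inputs_two_days, targets_two_days = build_gap_features(rain, window=2)
```

Each row of `inputs_one_day` holds two values (the day before and the day after), whereas each row of `inputs_two_days` holds four; the target is always the rainfall of the day in between.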
Later, in order to tune the hyperparameters of the different models, train them, and evaluate their performance, the full dataset was split into training, validation, and test sets. First, the dataset was randomly split into a training set (70%, used to fit the weights and biases of the final model) and a test set (30%, never-seen data used to assess the performance). Before fitting the final model, all the hyperparameters of the models (such as the number of hidden layers and neurons in a multilayer perceptron) had to be determined. For this purpose, the training dataset was further divided at random into train_2 (80%) and validation (20%) subsets. It is worth noting that the seed used in the random algorithm is the same in all cases, so all assessed models (from different approaches) have the same train, test, and validation datasets. Then, the Bayesian optimization algorithm was applied: different hyperparameter sets were trained on train_2 and tested on the validation dataset until the fittest set was found. Afterward, the entire training dataset from the initial split was used to adjust the weights and biases. Finally, the performance was assessed using the test dataset, which was never seen during the previous steps. The whole methodology is shown as a flowchart in Figure 2.
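The nested split described above (70/30 for train/test, then 80/20 of the training part for train_2/validation, always with the same seed) might be sketched as below. The helper name, the seed value, and the number of records are hypothetical; the source only states that a single seed is shared by all models.

```python
import random

def split_indices(n_samples, holdout_fraction, seed):
    """Shuffle sample indices with a fixed seed and return (kept, held_out)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n_samples * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]

SEED = 42      # hypothetical value, shared by every split
n_days = 1000  # hypothetical number of daily records

# First split: 70% train, 30% test
train_idx, test_idx = split_indices(n_days, 0.30, SEED)

# Second split, inside the training part: 80% train_2, 20% validation
inner_positions, validation_positions = split_indices(len(train_idx), 0.20, SEED)
train_2_idx = [train_idx[i] for i in inner_positions]
validation_idx = [train_idx[i] for i in validation_positions]
```

Because the shuffle is driven by a fixed seed, calling `split_indices` again with the same arguments returns the same partition, which is what allows every model and approach to be compared on identical train, validation, and test records.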
Besides, after splitting the dataset into train and test, the data were standardized, which is highly recommended to improve the performance of machine learning models, especially neural network-based models [32]. This can be expressed as Equation (1):
$$x' = \frac{x - \bar{x}}{\sigma} \qquad (1)$$
where $x$ represents the input data, $\bar{x}$ and $\sigma$ correspond to the mean and standard deviation of the training dataset, respectively, and $x'$ is the standardized data.
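A small sketch of the standardization in Equation (1) is given below, in which the mean and standard deviation are computed on the training data only and then reused for the test data. The function names and the sample values are illustrative, and the use of the population standard deviation is an assumption, since the text does not specify which estimator was applied.

```python
import statistics

def fit_standardizer(train_values):
    """Return the mean and (population) standard deviation of the training values."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # assumption: population standard deviation
    return mean, std

def standardize(values, mean, std):
    """Subtract the training mean and divide by the training standard deviation."""
    return [(value - mean) / std for value in values]

train_rain = [0.0, 2.0, 4.0, 6.0, 8.0]  # hypothetical training inputs (mm)
test_rain = [4.0, 10.0]                 # hypothetical test inputs (mm)

mean, std = fit_standardizer(train_rain)
train_scaled = standardize(train_rain, mean, std)
test_scaled = standardize(test_rain, mean, std)  # reuses the training statistics
```

Reusing the training statistics for the test data keeps the test set unseen, in line with the split described in the methodology.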

2.3. Multilayer Perceptron (MLP)

Multilayer perceptron is one of the most widely used models in different sectors, especially in hydrology [22,33]. Its functionality is inspired by neurons in the biological nervous system, where many interconnected neurons work together to generate a response to different stimuli. It is structured in three types of layers: the input and output layers, which correspond to the input and output of the model, respectively, and the hidden layers, where the neurons are located. The activation function determines the output of a node, given a set of inputs. For example, the rectified linear unit (ReLU) returns zero for negative input values and a linear ramp for positive ones. The process through which the neurons learn (the adjustment of weights and biases) is carried out automatically, which is why these layers are called hidden. Adam, a very common algorithm for this purpose, uses a moving average of the gradients and scales the learning rate with the squared gradients.
The mathematical logic of a single neuron is represented in Figure 3, where w represents the weights and b is the bias factor.
For further information, the following works can be reviewed [34,35].
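As a minimal illustration of the single-neuron logic, the sketch below assumes the usual formulation, in which the output is the activation function applied to the weighted sum of the inputs plus the bias. The function names and the numeric values are hypothetical.

```python
def relu(z):
    """Rectified linear unit: zero for negative inputs, identity for positive ones."""
    return max(0.0, z)

def neuron_output(inputs, weights, bias, activation=relu):
    """Apply the activation to the weighted sum of the inputs plus the bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

x = [1.0, 2.0, 3.0]     # hypothetical inputs
w = [0.5, -0.25, 0.25]  # hypothetical weights
y_active = neuron_output(x, w, bias=0.5)   # weighted sum 0.75 + 0.5 = 1.25
y_silent = neuron_output(x, w, bias=-2.0)  # weighted sum 0.75 - 2.0 < 0, clipped by ReLU
```

Training algorithms such as Adam adjust `w` and the bias of every neuron; the forward computation itself stays as simple as shown here.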

2.4. Support Vector Machine (SVM)

Support vector machine (SVM) is a supervised machine learning model that analyzes data for classification and regression tasks (in the latter case, it is also known as support vector regression (SVR)). For classification tasks, its functionality is based on searching for the hyperplane that best separates the classes of the data points. For regression, it finds the hyperplane and the margins that best fit the data points. Thus, an easy way to understand SVM for regression is to regard it as similar to a linear regression in which a hyperplane (with margins that include the data) is sought, while having the flexibility to define how much error is considered acceptable. Figure 4 shows an example of SVM for classification (a) and regression (b).
The main feature of SVM models is the use of kernels (linear, sigmoid, or Gaussian, among others) to enable operation in a high-dimensional feature space, where the number of features is greater than the number of observations.
SVM models are often used in rainfall forecasting, with promising results [36,37,38]. For further details, the following works can be reviewed [36,37].
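The idea of defining how much error is considered acceptable is commonly formalized in SVR through the epsilon-insensitive loss, which ignores residuals smaller than a tolerance epsilon. The sketch below illustrates only that loss, not a full SVR solver; the function name, the sample values, and the value of `epsilon` are assumptions for this example.

```python
def epsilon_insensitive_loss(measured, predicted, epsilon):
    """Mean loss that ignores residuals whose absolute value is within epsilon."""
    losses = [max(0.0, abs(m - p) - epsilon) for m, p in zip(measured, predicted)]
    return sum(losses) / len(losses)

meas = [0.0, 2.0, 5.0, 10.0]  # hypothetical measured rainfall (mm/day)
pred = [0.5, 2.0, 3.0, 10.5]  # hypothetical predictions (mm/day)

# Only the third residual (2.0 mm/day) exceeds the 0.5 mm/day tolerance
loss = epsilon_insensitive_loss(meas, pred, epsilon=0.5)
```

Points whose residual lies inside the tolerance band contribute nothing to the loss, which corresponds to the margins drawn around the regression hyperplane.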

2.5. Random Forest (RF)

Random forest (RF) was first introduced by [39] as a supervised learning algorithm, where the term “forest” indicates that it is built as an ensemble of decision tree models. The general idea is that combining multiple models improves the overall result. Additionally, RF introduces extra randomness as the trees are grown. Instead of searching for the best feature among all of them when splitting a node, it searches for the best feature among a random subset of them. The maximum number of features considered can be defined in scikit-learn as auto, sqrt, log2, none, or the exact number of maximum features (where sqrt refers to the square root of the initial number of features, log2 refers to the base-2 logarithm of the number of features, and none uses all the features; auto is equivalent to sqrt for classifiers but, in the scikit-learn releases that accept it, to all the features for regressors). This results in a broader diversity among trees and, as a consequence, a better final performance.
Other researchers have already assessed RF in rainfall modeling, with promising results [40,41,42]. For further details, the following work can be reviewed [42].
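The random feature subset examined at each split can be illustrated with the short sketch below. It mimics, rather than calls, the scikit-learn options: the helper names are hypothetical, and the `auto` option is left out because its meaning has depended on the estimator type and on the scikit-learn version.

```python
import math
import random

def resolve_max_features(max_features, n_features):
    """Translate a max_features option into a number of candidate features per split."""
    if max_features is None:
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return min(max_features, n_features)
    raise ValueError(f"unsupported max_features: {max_features!r}")

def candidate_features(n_features, max_features, rng):
    """Draw the random subset of feature indices examined at one split."""
    k = resolve_max_features(max_features, n_features)
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)  # fixed seed so that the example is reproducible
subset = candidate_features(64, "sqrt", rng)  # 8 of the 64 hypothetical features
```

A different subset is drawn at every split, which is the source of the additional diversity among the trees of the forest.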

2.6. Bayesian Optimization

One of the critical aspects of machine learning models’ efficiency is hyperparameter selection. Depending on whether suitable values have been set, the performance can change dramatically, from outstanding to very poor results. A common practice in the scientific community is the trial-and-error technique [22], where different values are evaluated, varying from dozens to thousands of possibilities. However, this method is far from efficient: if the hyperparameter space is large, the search is very slow and wastes significant time on unpromising configurations. On the other hand, when the hyperparameter space is tiny, the search is quick, but an accurate hyperparameter configuration may be missed. To solve this problem, several algorithms have been assessed in different works. In [6], the authors studied the effectiveness of particle-swarm optimization (PSO) and genetic algorithm (GA) to predict the monthly rainfall with MLP in a subtropical monsoon climate in Guilin, China. Wang et al. [43] assessed an artificial bee colony (ABC) with MLP to forecast rainfall values in 17 stations in the Wujiang River Basin. Banadkooki et al. [35] evaluated the flow regime optimization algorithm (FRA) with MLP and SVM to forecast monthly rainfall values in Iran.
In this study, Bayesian optimization was used, due to its high popularity in new automated machine learning (AML) models [44,45,46,47] and its good performance in [34,48]. It was first introduced by Wang et al. [43] as an algorithm, based on the Bayes theorem, to search for the minimum or maximum of a function. Part of its popularity is due to its similarity to human behavior when tuning hyperparameters [49,50]. The prior results are taken into account to choose the next promising values to test, following a four-step procedure: (1) the hyperparameter space is defined, which limits the values that each hyperparameter can take; (2) the algorithm considers previous evaluations in order to choose the next set of values to be assessed (acquisition function), balancing two possibilities, exploitation (testing hyperparameter values that are expected to be optimal) and exploration (the opposite of exploitation, testing unexplored values to identify new best options); (3) the new hyperparameter configuration is assessed using an objective function; and (4) if the optimization process has not finished yet, it returns to the second step. In this work, this algorithm was implemented using Python and the scikit-optimize library, following the instructions of Bellido-Jiménez et al. [34]. All the final hyperparameter sets, used for each model, approach, and location, can be seen in Table 3.
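The four-step procedure can be made concrete with the toy sketch below. It is not the scikit-optimize implementation used in the study: it is a self-contained, standard-library illustration that assumes a Gaussian-process surrogate with a squared-exponential kernel and a lower-confidence-bound acquisition function. The objective `validation_error`, the search space of 1 to 30 neurons, the initial points, and all parameter values are hypothetical stand-ins for training a model and measuring its validation error.

```python
import math

def rbf(a, b, length_scale=2.0):
    """Squared-exponential kernel between two scalar hyperparameter values."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def gp_posterior(xs, ys, candidate, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at one candidate."""
    k_matrix = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
                for i, a in enumerate(xs)]
    k_star = [rbf(a, candidate) for a in xs]
    alpha = solve(k_matrix, ys)
    beta = solve(k_matrix, k_star)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    variance = rbf(candidate, candidate) - sum(k * b for k, b in zip(k_star, beta))
    return mean, math.sqrt(max(variance, 0.0))

def bayesian_minimize(objective, space, initial_points, n_iterations, kappa=2.0):
    """Minimize `objective` over the finite `space` (step 1: the space is given)."""
    xs = list(initial_points)
    ys = [objective(x) for x in xs]
    for _ in range(n_iterations):  # step 4: repeat until the budget is spent
        y_mean = sum(ys) / len(ys)
        y_std = math.sqrt(sum((y - y_mean) ** 2 for y in ys) / len(ys)) or 1.0
        scaled = [(y - y_mean) / y_std for y in ys]
        best_x, best_score = None, math.inf
        for candidate in space:  # step 2: acquisition over untested values
            if candidate in xs:
                continue
            mean, std = gp_posterior(xs, scaled, candidate)
            score = mean - kappa * std  # low mean: exploitation; high std: exploration
            if score < best_score:
                best_x, best_score = candidate, score
        if best_x is None:
            break
        xs.append(best_x)
        ys.append(objective(best_x))  # step 3: evaluate the objective function
    best_index = min(range(len(ys)), key=ys.__getitem__)
    return xs[best_index], ys[best_index], xs

def validation_error(n_neurons):
    """Hypothetical stand-in for training a model and returning its validation RMSE."""
    return (n_neurons - 12) ** 2 / 10 + 1.5

search_space = list(range(1, 31))
best_n, best_error, history = bayesian_minimize(
    validation_error, search_space, initial_points=[1, 15, 30], n_iterations=8)
```

The parameter `kappa` controls the balance between exploitation and exploration: larger values favor untested regions where the surrogate is uncertain, whereas smaller values concentrate the search around the best configurations found so far.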

2.7. Evaluation Metrics

To assess the efficiency of the developed models, the statistics root mean square error (RMSE), mean bias error (MBE), and coefficient of determination (R2) were used. All of them are mathematically expressed as Equations (2)–(4), respectively:
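$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(meas_i - pred_i\right)^2} \qquad (2)$$
$$\mathrm{MBE} = \frac{1}{n}\sum_{i=1}^{n}\left(meas_i - pred_i\right) \qquad (3)$$
$$R^2 = \frac{\left(\sum_{i=1}^{n}\left(meas_i - \mu_{meas}\right)\left(pred_i - \mu_{pred}\right)\right)^2}{\sum_{i=1}^{n}\left(meas_i - \mu_{meas}\right)^2 \, \sum_{i=1}^{n}\left(pred_i - \mu_{pred}\right)^2} \qquad (4)$$
where $n$ represents the number of prediction days, $meas_i$ corresponds to the measured value for a specific day, $pred_i$ is the predicted value, $i$ represents every single gap day, and $\mu$ corresponds to the mean.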
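The three statistics of Equations (2)–(4) can be computed with a few lines of Python, as sketched below. The sign convention for MBE (measured minus predicted, so that negative values indicate overestimation) and the definition of R2 as the squared correlation between measured and predicted values are assumed from the order of the terms and the symbols defined above; the sample values are hypothetical.

```python
import math

def rmse(meas, pred):
    """Root mean square error between measured and predicted values."""
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(meas, pred)) / len(meas))

def mbe(meas, pred):
    """Mean bias error, taken as measured minus predicted (assumed sign convention)."""
    return sum(m - p for m, p in zip(meas, pred)) / len(meas)

def r2(meas, pred):
    """Coefficient of determination as the squared correlation of meas and pred."""
    mu_meas = sum(meas) / len(meas)
    mu_pred = sum(pred) / len(pred)
    covariance = sum((m - mu_meas) * (p - mu_pred) for m, p in zip(meas, pred))
    spread_meas = sum((m - mu_meas) ** 2 for m in meas)
    spread_pred = sum((p - mu_pred) ** 2 for p in pred)
    return covariance ** 2 / (spread_meas * spread_pred)

meas = [0.0, 2.0, 4.0, 6.0]  # hypothetical measured rainfall (mm/day)
pred = [1.0, 2.0, 3.0, 8.0]  # hypothetical predicted rainfall (mm/day)

rmse_value = rmse(meas, pred)
mbe_value = mbe(meas, pred)
r2_value = r2(meas, pred)
```

In this example the predictions are, on average, 0.5 mm/day higher than the measurements, so the MBE is negative under the assumed sign convention.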

3. Results and Discussion

To support the reproducibility of this work, the best ML models were uploaded to an open-access repository on GitHub (https://github.com/Smarity/gap-filling-precipitation-atmosphere-special-issue.git, accessed on 30 July 2021).

3.1. Using Neighbor Stations

Table 3 shows the RMSE, MBE, and R2 performances for all locations in Area 1 (inland locations) using the first approach (A), i.e., information from other AWS located within 50 km. In Higuera de Arjona, MLP obtained the best RMSE and R2 values (RMSE = 1.363 mm/day and R2 = 0.894), very close to RF (RMSE = 1.384 mm/day and R2 = 0.889). In terms of MBE, LI outperformed the ML models (MBE = −0.008 mm/day), followed closely by MLP and RF (MBE = 0.016 mm/day and MBE = 0.026 mm/day, respectively). In Jaen, all ML models outperformed LI in RMSE and R2, where MLP obtained the best values (RMSE = 1.767 mm/day and R2 = 0.827), whereas RF beat the rest regarding MBE (MBE = 0.023 mm/day). In Linares, RF and LI obtained the best performance in terms of MBE (MBE = 0.001 mm/day and MBE = −0.001 mm/day, respectively). Moreover, MLP outperformed the others regarding RMSE and R2 (RMSE = 1.723 mm/day and R2 = 0.817), followed closely by RF (RMSE = 1.730 mm/day and R2 = 0.815). In Mancha Real, MLP outperformed the other models in all statistics (RMSE = 1.662 mm/day, MBE = −0.072 mm/day, and R2 = 0.831), whereas SVM was the worst (RMSE = 1.948 mm/day, MBE = −0.195 mm/day, and R2 = 0.780). In Marmolejo, the location with the highest mean annual rainfall (523.36 mm/year), RF obtained the best RMSE and R2 values (RMSE = 2.129 mm/day and R2 = 0.801), followed closely by SVM (RMSE = 2.154 mm/day and R2 = 0.795) and MLP (RMSE = 2.176 mm/day and R2 = 0.791). In Sabiote, the location with the highest altitude, MLP obtained the best performance in RMSE and R2 (RMSE = 2.049 mm/day and R2 = 0.752), but LI beat the ML models in MBE (MBE = −0.006 mm/day). Finally, in Torreblascopedro, SVM outperformed the rest for all statistics (RMSE = 1.246 mm/day, MBE = −0.005 mm/day, and R2 = 0.894), being the most accurate site in this first region. It is worth noting that MLP generally obtained the best results regarding RMSE and R2 in most locations, whereas RF and LI obtained the best values for MBE.
Additionally, even though ML outperformed LI in all locations, the average improvement was not very significant.
Table 4 and Table 5 show the RMSE, MBE, and R2 values for all locations and models in the coastal locations (Area 2). In Antequera, MLP beat the other models for all statistics (RMSE = 1.596 mm/day, MBE = 0.035 mm/day, and R2 = 0.875), sharing the same R2 performance with SVM (R2 = 0.875). All ML models clearly outperformed LI in all statistics (especially RMSE and R2), except for MBE using SVM. In Archidona, MLP also obtained the most accurate modeling in RMSE and R2 (RMSE = 1.811 mm/day and R2 = 0.844), followed closely by SVM (RMSE = 1.817 mm/day and R2 = 0.844). Regarding MBE, RF outperformed the rest (MBE = −0.019 mm/day). In Cártama, RF obtained the best MBE value (MBE = 0.002 mm/day), whereas SVM got the best RMSE and R2 performance (RMSE = 2.502 mm/day and R2 = 0.778). In Churriana, MLP clearly outperformed the rest in terms of RMSE and R2 (RMSE = 2.192 mm/day and R2 = 0.876), whereas RF beat MLP in MBE (MBE = 0.019 mm/day and MBE = −0.052 mm/day, respectively). In Málaga, RF obtained the best values for RMSE and MBE (RMSE = 2.433 mm/day and MBE = 0.012 mm/day), whereas MLP got the most accurate value for R2 (R2 = 0.830). In Pizarra, all models obtained very similar performance (even LI). RMSE ranged from 2.032 mm/day (by MLP and SVM) to 2.108 mm/day (by LI), MBE ranged from 0.039 mm/day (by RF) to −0.112 mm/day (by SVM), and R2 ranged from 0.842 (by LI) to 0.854 (by MLP). Finally, in Vélez, MLP outperformed the rest of the models in terms of RMSE and R2 (RMSE = 3.219 mm/day and R2 = 0.742), while RF obtained the best MBE performance (MBE = −0.020 mm/day), followed closely by MLP (MBE = −0.074 mm/day). Generally, the results obtained by ML clearly outperformed LI in most locations and statistics, except for MBE, in which LI obtained very accurate results.
Thus, the use of ML models to gap-fill daily rainfall data is highly recommended for coastal sites, where they performed significantly better than LI, revealing the effect of distance to the sea on rainfall modelling. In Figure 5, all these RMSE, MBE, and R2 values, from both areas and all models, are represented in a scatter plot. Comparing the performances of the ML models, it can be stated that MLP obtained the best results, or results very close to the best, in most locations. On the other hand, SVM performed accurately at coastal sites, whereas its behavior was not as good at inland locations. Finally, RF behaved opposite to SVM, performing accurately at inland locations and worse at coastal sites.

3.2. Using Data from the Target Station

Table 6 and Table 7 show the RMSE and MBE values for the inland and coastal locations, using two different approaches, one day after and before (approach B) and two days after and before (approach C), as inputs, respectively. Generally, all the results are much worse than in Table 2 and Table 3, for all cases. Mancha Real obtained an RMSE value above 4.0 mm/day for all models, whereas Churriana got the worst values (RMSE > 6.0 mm/day). In terms of MBE, La Higuera de Arjona obtained the best performance (MBE = −0.021 mm/day) using MLP and approach B, whereas Marmolejo was the worst (MBE = −1.459 mm/day), using MLP and this same approach. Finally, in terms of R2, the values obtained are low, from R2 = 0.004 (by SVM in Archidona) to R2 = 0.079 (by MLP in Málaga), highlighting the lack of autocorrelation between precipitation values from the previous and following days. Comparing the results between B and C, in terms of RMSE, on average, approach C (RMSE = 4.359 mm/day) obtained a slightly better performance than approach B (RMSE = 4.323 mm/day). However, in area 2, the use of approach B (RMSE = 4.986 mm/day) was significantly better than approach C (RMSE = 5.588 mm/day).
Finally, in Figure 5 and Figure 6, all the RMSE, MBE, and R2 values, from both areas and all models, are represented in a scatter plot.

3.3. Comparison of the Two Areas

In order to compare the results in the two different areas, Figure 7 shows the RMSE, MBE, and R2 performance values for these two kinds of locations (inland and coastal), using the best approach (data from neighbor stations). In terms of RMSE mean values, MLP outperformed RF and SVM. Besides, the models applied at the coastal locations underperformed, on average, in all cases and showed higher variability across sites than those applied at the inland ones. In terms of MBE mean values, RF and LI obtained values very close to 0, whereas SVM overestimated in most stations. Finally, in terms of R2, the results of the ML models were quite similar in both inland and coastal locations. However, the results of LI were significantly worse than those of ML at coastal sites, whereas SVM performed worse at inland sites than at coastal ones.
Additionally, Table 8 displays the maximum improvement, in terms of RMSE, R2, and MBE, comparing ML to LI (using the first approach). In inland sites, the RMSE improvement ranged from 0.031 mm/day in Torreblascopedro to 0.263 mm/day in Marmolejo, and the R2 improvement ranged from 0.004 (Torreblascopedro) to 0.048 (Marmolejo). On the other hand, the improvements in coastal sites ranged from RMSE = 0.076 mm/day and R2 = 0.012 (in Pizarra) to RMSE = 1.475 mm/day and R2 = 0.25 (in Archidona). Thus, in coastal locations, the difference between linear interpolation and ML models for gap-filling daily rainfall was significant. In contrast, in inland areas, the improvement was not substantial.
Thus, using empirical approaches (such as LI) to gap-fill daily rainfall data is not recommended, especially in coastal sites; the results are worse than ML, due to the effect of sea distance.

3.4. Seasonality Performance

In order to assess seasonal performance, the RMSE, MBE, and R2 of all the stations and approaches, for the different evaluated models (SVM, MLP, and RF), are represented in Figure 8. Regarding RMSE, summer, autumn, and spring obtained very similar average performances, whereas, in winter, the mean results were the worst. Moreover, summer obtained the narrowest interquartile range, whereas spring and winter had the widest ranges, with LI being the model with the widest range (the least consistent across stations) among all seasons and models. Regarding MBE, MLP, RF, and LI always obtained very similar average results, although LI had the widest interquartile range in all seasons. Besides, SVM always performed the worst in terms of MBE. In terms of R2, the highest mean values were obtained in winter, whereas the worst results were obtained in summer and spring. Regarding the mean, all models obtained similar values within the same season.
Additionally, in Figure 9 and Figure 10, the values predicted by the different ML models using the first approach are shown and compared to LI. In Figure 9, the predictions for Torreblascopedro are plotted (the site that obtained the best performance in terms of RMSE and R2). In winter, all predictions are close to the 1:1 line, which denotes the excellent performance of the models during this season. The predictions were also close to the 1:1 line in spring and autumn, although the points were more dispersed than in winter. Finally, summer obtained the worst results, with the points farthest from the 1:1 line, especially for high rainfall values.
Finally, Figure 10 plots the prediction rainfall values in Archidona. Spring obtained the best general predictions among all models, followed by autumn, summer, and winter, in this order. The highest differences between ML and LI were found in winter and autumn, where most LI predictions were farther from the 1:1 line.
Generally, summer obtained worse results than the rest of the seasons, due to the Mediterranean climate: during summers in Andalusia, precipitation is very occasional and usually responds to local events, such as local storms. Therefore, gap-filling rainfall data using neighbor stations with very different pluviometry makes the models fail on those specific dates. Comparing the results between Torreblascopedro and Archidona, the most significant differences can be seen in winter, where LI performed much worse than the ML approaches.

3.5. General Discussion

In terms of R2, the results obtained in this work outperformed those obtained by Kim and Ryu [51] (Pocatello, ID, USA) using IDWM, OK, and GME, in conjunction with cluster analysis, whose best R2 value was below 0.7 (R2 = 0.689 or R = 0.83). Besides, the models developed in this work greatly improved on the RMSE and R2 performance of Wuthiwongyothin [52] in Northern Thailand, using the K-means technique with inverse distance weighting (IDW) and correlation coefficient weighting (CCW), where the mean R2 values among all stations were below 0.6. Moreover, in terms of R2, the values obtained by Sehad et al. [53] in North Algeria using multispectral MSG SEVIRI imagery were slightly worse, on average, than those obtained in this work, with a mean R2 = 0.7241. However, in absolute terms, their model outperformed this work’s best result (R2 = 0.921 against R2 = 0.898 in Torreblascopedro using SVM). Thus, ML models with neighbor station data located within a 50 km radius are highly recommended to gap-fill rainfall values in coastal locations, due to their accurate performance (compared with other approaches) in the different areas assessed across the Andalusia region; the use of neighbor stations is preferred over the use of cluster analysis with stations located farther away. However, at inland sites, the improvement of ML over LI was not as significant as at coastal sites. Finally, in order to advance the state of the art of these approaches, future works could analyze the occurrence of false alarms and missed rainfall events using the models developed in this work.

4. Conclusions

Three different approaches were evaluated for gap-filling daily rainfall values: (A) the use of data from neighbor stations within 50 km, (B) the use of one day before and one day after from the target station, and (C) the use of two days before and two days after from the target station. Fourteen different locations were evaluated from two areas, corresponding to inland and coastal sites. Additionally, three different ML models were assessed for this purpose: MLP, SVM, and RF. Large daily datasets of around 21 years were used (from 2000 to 2021), of which 70% was used for training and a random 30% for testing. Besides, 20% of the training dataset was used to find the best hyperparameters. Finally, a seasonality analysis was carried out. Based on the results obtained, no ML model significantly outperformed the rest, although MLP obtained the best results, or very close to them, in most locations. On the other hand, SVM performed accurately at coastal sites, whereas its behavior was not as good at inland locations. RF behaved the opposite to SVM, performing accurately at inland locations and worse at coastal sites. Moreover, the first approach (the use of neighbor data) was notably better than the other approaches, with RMSE values below 2.0 mm/day and R2 values above 0.85 in most stations. There were no significant seasonal differences in performance, in terms of RMSE and MBE values, in winter, spring, and autumn, but the results obtained in summer were generally worse for all locations. Besides, the models for coastal locations performed slightly worse and showed larger performance differences between ML and LI in most sites and models, highlighting how rainfall prediction efficiency depends on the distance to the sea. In conclusion, the use of neighbor data with MLP is highly recommended as a rainfall gap-filling technique, rather than the use of past and future data from the target station.
Moreover, when this work’s results are compared with other papers’ approaches using cluster analysis over wider ranges, the use of closer stations (within a 50 km radius) obtained better results in terms of R2.
Finally, given the significant need for complete daily rainfall time series and the increasing interest in installing low-cost wireless sensors (IoT), the models developed and assessed in this work can help with gap-filling datasets in near-real-time, thanks to the decreasing price of low-cost automated weather stations using these new devices.
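The data partition summarized in these conclusions (70% for training, a random 30% for testing, and 20% of the training data for hyperparameter search) could be sketched as follows. This is only an illustrative sketch: it assumes that the validation subset is also drawn at random from the training portion, which the text does not state, and the function name, the seed, and the rounding of subset sizes are hypothetical choices.

```python
import random


def split_indices(n_days, test_fraction=0.30, val_fraction=0.20, seed=0):
    """Return (train, validation, test) index lists for n_days daily records."""
    rng = random.Random(seed)
    indices = list(range(n_days))
    rng.shuffle(indices)

    # A random 30% of the days is held out for testing.
    n_test = round(n_days * test_fraction)
    test_idx = indices[:n_test]
    remaining = indices[n_test:]

    # 20% of the remaining training days is set aside for hyperparameter search
    # (drawn at random here, which is an assumption).
    n_val = round(len(remaining) * val_fraction)
    val_idx = remaining[:n_val]
    train_idx = remaining[n_val:]
    return sorted(train_idx), sorted(val_idx), sorted(test_idx)


# 7615 is the number of daily records listed for Sabiote in Table 1.
train_idx, val_idx, test_idx = split_indices(7615)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed keeps the partition reproducible between runs, so that different models can be compared on the same testing days.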

Author Contributions

Conceptualization, A.P.G.-M., J.A.B.-J. and J.E.G.; methodology, J.A.B.-J. and J.E.G.; software, J.A.B.-J. and J.E.G.; validation, A.P.G.-M., J.A.B.-J. and J.E.G.; formal analysis, J.A.B.-J. and J.E.G.; investigation, J.A.B.-J. and J.E.G.; resources, A.P.G.-M., J.A.B.-J. and J.E.G.; data curation, J.A.B.-J. and J.E.G.; writing—original draft preparation, J.A.B.-J. and J.E.G.; writing—review and editing, A.P.G.-M., J.A.B.-J. and J.E.G.; visualization, J.A.B.-J. and J.E.G.; supervision, A.P.G.-M., J.A.B.-J. and J.E.G.; project administration, A.P.G.-M. and J.E.G.; funding acquisition, A.P.G.-M. and J.E.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Spanish Ministry of Science, Innovation and Universities [grant AGL2017-87658-R] and the University of Córdoba (PIF scholarship).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: https://www.juntadeandalucia.es/agriculturaypesca/ifapa/riaweb/web (accessed on 30 July 2021).

Acknowledgments

J.A. Bellido-Jiménez wishes to thank the University of Córdoba for providing a PIF scholarship funded by the research program, and the Spanish Ministry of Science, Innovation and Universities (grant number AGL2017-87658-R) for also funding this research.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Shen, R.; Huang, A.; Li, B.; Guo, J. Construction of a drought monitoring model using deep learning based on multi-source remote sensing data. Int. J. Appl. Earth Obs. Geoinf. 2019, 79, 48–57.
  2. Fernández, A.J.; Molero, F.; Becerril-Valle, M.; Coz, E.; Salvador, P.; Artíñano, B.; Pujadas, M. Application of remote sensing techniques to study aerosol water vapour uptake in a real atmosphere. Atmos. Res. 2018, 202, 112–127.
  3. Astel, A.; Mazerski, J.; Polkowska, Z.; Namieśnik, J. Application of PCA and time series analysis in studies of precipitation in Tricity (Poland). Adv. Environ. Res. 2004, 8, 337–349.
  4. Sayemuzzaman, M.; Jha, M.K. Seasonal and annual precipitation time series trend analysis in North Carolina, United States. Atmos. Res. 2014, 137, 183–194.
  5. Estévez, J.; Gavilán, P.; García-Marín, A.P.; Zardi, D. Detection of spurious precipitation signals from automatic weather stations in irrigated areas. Int. J. Climatol. 2015, 35, 1556–1568.
  6. Jiang, L.; Wu, J. Hybrid PSO and GA for Neural Network Evolutionary in Monthly Rainfall Forecasting; Springer: Berlin/Heidelberg, Germany, 2013; Volume 7802.
  7. Cramer, S.; Kampouridis, M.; Freitas, A.A.; Alexandridis, A.K. An extensive evaluation of seven machine learning methods for rainfall prediction in weather derivatives. Expert Syst. Appl. 2017, 85, 169–181.
  8. Teegavarapu, R.S.V.; Chandramouli, V. Improved weighting methods, deterministic and stochastic data-driven models for estimation of missing precipitation records. J. Hydrol. 2005, 312, 191–206.
  9. Barrios, A.; Trincado, G.; Garreaud, R. Alternative approaches for estimating missing climate data: Application to monthly precipitation records in south-central Chile. For. Ecosyst. 2018, 5, 1–10.
  10. McCuen, R.H. Hydrologic Analysis and Design, 3rd ed.; Pearson: New York, NY, USA, 2004; ISBN 978-0131424241.
  11. Bostan, P.A.; Heuvelink, G.B.M.; Akyurek, S.Z. Comparison of regression and kriging techniques for mapping the average annual precipitation of Turkey. Int. J. Appl. Earth Obs. Geoinf. 2012, 19, 115–126.
  12. Adhikary, S.K.; Muttil, N.; Yilmaz, A.G. Genetic Programming-Based Ordinary Kriging for Spatial Interpolation of Rainfall. J. Hydrol. Eng. 2016, 21, 04015062.
  13. Mair, A.; Fares, A. Comparison of Rainfall Interpolation Methods in a Mountainous Region of a Tropical Island. J. Hydrol. Eng. 2011, 16, 371–383.
  14. Simolo, C.; Brunetti, M.; Maugeri, M.; Nanni, T. Improving estimation of missing values in daily precipitation series by a probability density function-preserving approach. Int. J. Climatol. 2010, 30, 1564–1576.
  15. Xia, Y.; Fabian, P.; Stohl, A.; Winterhalter, M. Forest climatology: Estimation of missing values for Bavaria, Germany. Agric. For. Meteorol. 1999, 96, 131–144.
  16. Teegavarapu, R.S.V.; Tufail, M.; Ormsbee, L. Optimal functional forms for estimation of missing precipitation data. J. Hydrol. 2009, 374, 106–115.
  17. Teegavarapu, R.S.V. Estimation des données manquantes des précipitations en utilisant la proximité optimale d’imputation métrique base, la classification du plus proche voisin et méthodes d’interpolation à base de cluster. Hydrol. Sci. J. 2014, 59, 2009–2026.
  18. Huang, M.; Lin, R.; Huang, S.; Xing, T. A novel approach for precipitation forecast via improved K-nearest neighbor algorithm. Adv. Eng. Inform. 2017, 33, 89–95.
  19. Gorshenin, A.; Lebedeva, M.; Lukina, S.; Yakovleva, A. Application of Machine Learning Algorithms to Handle Missing Values in Precipitation Data. In Lecture Notes in Computer Science; (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin, Germany, 2019; Volume 11965, pp. 563–577.
  20. Bagirov, A.M.; Mahmood, A.; Barton, A. Prediction of monthly rainfall in Victoria, Australia: Clusterwise linear regression approach. Atmos. Res. 2017, 188, 20–29.
  21. Kajewska-Szkudlarek, J. Clustering approach to urban rainfall time series prediction with support vector regression model. Urban Water J. 2020, 17, 235–246.
  22. Estévez, J.; Bellido-Jiménez, J.A.; Liu, X.; García-Marín, A.P. Monthly Precipitation Forecasts Using Wavelet Neural Networks Models in a Semiarid Environment. Water 2020, 12, 1909.
  23. Partal, T.; Kişi, Ö. Wavelet and neuro-fuzzy conjunction model for precipitation forecasting. J. Hydrol. 2007, 342, 199–212.
  24. Li, G.; Ma, X.; Yang, H. A hybrid model for monthly precipitation time series forecasting based on variational mode decomposition with extreme learning machine. Information 2018, 9, 177.
  25. Filho, A.S.F.; Lima, G.A.R. Gap Filling of Precipitation Data by SSA—Singular Spectrum Analysis. J. Phys. Conf. Ser. 2016, 759, 012085.
  26. Sun, M.; Li, X.; Kim, G. Precipitation analysis and forecasting using singular spectrum analysis with artificial neural networks. Clust. Comput. 2019, 22, 12633–12640.
  27. Kim, S.; Hong, S.; Joh, M.; Song, S.K. DeepRain: ConvLSTM network for precipitation prediction using multichannel radar data. arXiv 2017, arXiv:1711.02316.
  28. Ha, J.-H.; Lee, Y.H.; Kim, Y.-H. Forecasting the Precipitation of the Next Day Using Deep Learning. J. Korean Inst. Intell. Syst. 2016, 26, 93–98.
  29. Chen, L.; Cao, Y.; Ma, L.; Zhang, J. A Deep Learning-Based Methodology for Precipitation Nowcasting with Radar. Earth Space Sci. 2020, 7, e2019EA000812.
  30. Estévez, J.; Gavilán, P.; García-Marín, A.P. Spatial regression test for ensuring temperature data quality in southern Spain. Theor. Appl. Climatol. 2018, 131, 309–318.
  31. Estévez Gualda, J.; Gavilán, P.; Giráldez, J.V. Guidelines on validation procedures for meteorological data from automatic weather stations. J. Hydrol. 2011, 402, 144–154.
  32. Shanker, M.S.; Hu, M.Y.; Hung, M.S. Effect of data standardization on neural network training. Omega 1996, 24, 385–397.
  33. Luna, A.M.; Lineros, M.L.; Gualda, J.E.; Giráldez Cervera, J.V.; Madueño Luna, J.M. Assessing the Best Gap-Filling Technique for River Stage Data Suitable for Low Capacity Processors and Real-Time Application Using IoT. Sensors 2020, 20, 6354.
  34. Bellido-Jiménez, J.A.; Estévez, J.; García-Marín, A.P. New machine learning approaches to improve reference evapotranspiration estimates using intra-daily temperature-based variables in a semi-arid region of Spain. Agric. Water Manag. 2020, 245, 106558.
  35. Banadkooki, F.B.; Ehteram, M.; Ahmed, A.N.; Fai, C.M.; Afan, H.A.; Ridwam, W.M.; Sefelnasr, A.; El-Shafie, A. Precipitation forecasting using multilayer neural Network and support vector machine optimization based on flow regime algorithm taking into Account uncertainties of soft computing models. Sustainability 2019, 11, 6681.
  36. Ortiz-García, E.G.; Salcedo-Sanz, S.; Casanova-Mateo, C. Accurate precipitation prediction with support vector classifiers: A study including novel predictive variables and observational data. Atmos. Res. 2014, 139, 128–136.
  37. Nayak, M.A.; Ghosh, S. Prediction of extreme rainfall event using weather pattern recognition and support vector machine classifier. Theor. Appl. Climatol. 2013, 114, 583–603.
  38. Aftab, S.; Ahmad, M.; Hameed, N.; Bashir, M.S.; Ali, I.; Nawaz, Z. Rainfall prediction in Lahore City using data mining techniques. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 254–260.
  39. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  40. Sukovich, E.M.; Ralph, F.M.; Barthold, F.E.; Reynolds, D.W.; Novak, D.R. Extreme quantitative precipitation forecast performance at the weather prediction center from 2001 to 2011. Weather Forecast. 2014, 29, 894–911.
  41. Das, S.; Chakraborty, R.; Maitra, A. A random forest algorithm for nowcasting of intense precipitation events. Adv. Space Res. 2017, 60, 1271–1282.
  42. Wolfensberger, D.; Gabella, M.; Boscacci, M.; Germann, U.; Berne, A. RainForest: A random forest algorithm for quantitative precipitation estimation over Switzerland. Atmos. Meas. Tech. 2021, 14, 3169–3193.
  43. Wang, Y.; Liu, J.; Li, R.; Suo, X.; Lu, E. Precipitation forecast of the Wujiang River Basin based on artificial bee colony algorithm and backpropagation neural network. Alex. Eng. J. 2020, 59, 1473–1483.
  44. Kotthoff, L.; Thornton, C.; Hoos, H.; Hutter, F.; Leyton-Brown, K. Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. J. Mach. Learn. Res. 2017, 18, 1–5.
  45. Jin, H.; Song, Q.; Hu, X. Auto-Keras: An Efficient Neural Architecture Search System. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 1946–1956.
  46. Feurer, M.; Klein, A.; Eggensperger, K.; Springenberg, J.T.; Blum, M.; Hutter, F. Auto-sklearn: Efficient and robust automated machine learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 2, pp. 2962–2970.
  47. Hutter, F.; Kotthoff, L.; Vanschoren, J. Automated Machine Learning; The Springer Series on Challenges in Machine Learning; Springer International Publishing: Cham, Switzerland, 2019; ISBN 978-3-030-05317-8.
  48. Bellido-Jiménez, J.A.; Estévez, J.; García-Marín, A.P. Assessing Neural Network Approaches for Solar Radiation Estimates Using Limited Climatic Data in the Mediterranean Sea. In Proceedings of the 3rd International Electronic Conference on Atmospheric Sciences (ECAS 2020), Online, 16–30 November 2020.
  49. Borji, A.; Itti, L. Bayesian optimization explains human active search. Adv. Neural Inf. Process. Syst. 2013, 26, 55–63.
  50. Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R.P.; de Freitas, N. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proc. IEEE 2016, 104, 148–175.
  51. Kim, J.; Ryu, J.H. A heuristic gap filling method for daily precipitation series. Water Resour. Manag. 2016, 30, 2275–2294.
  52. Wuthiwongyothin, S.; Kalkan, C.; Panyavaraporn, J. Evaluating Inverse Distance Weighting and Correlation Coefficient Weighting Infilling Methods on Daily Rainfall Time Series. SNRU J. Sci. Technol. 2021, 13, 71–79.
  53. Sehad, M.; Lazri, M.; Ameur, S. Novel SVM-based technique to improve rainfall estimation over the Mediterranean region (North of Algeria) using the multispectral MSG SEVIRI imagery. Adv. Space Res. 2017, 59, 1381–1394.
Figure 1. Spatial distribution of the fourteen automated weather stations used in this work.
Figure 2. Methodology flowchart.
Figure 3. One neuron control logic.
Figure 4. Support Vector Machine for classification (a) and regression (b).
Figure 5. RMSE (a), MBE (b), and R2 (c) values from testing dataset for all stations and models (MLP, SVM, RF, and LI), using only precipitation data from neighbor stations.
Figure 6. RMSE, MBE, and R2 values from testing dataset for all stations and models (MLP, SVM, and RF), using only precipitation data from the target station. (a) RMSE using approach B, (b) MBE using approach B, (c) R² using approach B, (d) RMSE using approach C, (e) MBE using approach C, (f) R² using approach C.
Figure 7. RMSE (a), MBE (b), and R2 (c) values of the different models (SVM, MLP, and RF) in the coastal and inland locations, using rainfall values from neighbor stations as inputs, where the minimum, the first quartile (Q1), the median, the third quartile (Q3), the maximum, and the outlier values are represented.
Figure 8. Seasonality performance of the different models (SVM, MLP, and RF) in all the stations and approaches.
Figure 9. Scatter plot for predicted values in Torreblascopedro, using MLP, SVM, RF, and LI, during the different seasons.
Figure 10. Scatter plot for predicted values in Archidona, using MLP, SVM, RF, and LI, during the different seasons.
Table 1. Geo-climatic characteristics of the AWS assessed in this work (lat.: latitude; long.: longitude; alt.: elevation above mean sea level).
| Station | Alt. [m] | Lat. [°N] | Long. [°W] | Mean Annual Rainfall [mm] | Time-Period (Number of Days) |
|---|---|---|---|---|---|
| Area 1: | | | | | |
| Jaen (JAE) | 299 | 37.89 | 3.77 | 446.54 | From April 2001 to June 2021 (7361) |
| La Higuera de Arjona (ARJ) | 257 | 37.95 | 4.00 | 477.68 | From January 2001 to June 2021 (7456) |
| Linares (LIN) | 432 | 38.07 | 3.65 | 466.70 | From August 2000 to June 2021 (7601) |
| Mancha Real (MAN) | 407 | 37.92 | 3.60 | 390.86 | From August 2000 to June 2021 (7602) |
| Marmolejo (MAR) | 208 | 38.06 | 4.13 | 523.36 | From September 2000 to June 2021 (7590) |
| Sabiote (SAB) | 791 | 38.08 | 3.24 | 446.98 | From August 2000 to June 2021 (7615) |
| Torreblascopedro (TOR) | 275 | 37.99 | 3.69 | 434.37 | From August 2000 to June 2021 (7615) |
| Area 2: | | | | | |
| Antequera (ANT) | 440 | 37.03 | 4.56 | 444.72 | From November 2000 to June 2021 (7512) |
| Archidona (ARC) | 516 | 37.08 | 4.43 | 457.83 | From December 2000 to June 2021 (7483) |
| Cártama (CAR) | 78 | 36.72 | 4.68 | 490.51 | From June 2001 to June 2021 (7300) |
| Churriana (CHU) | 17 | 36.67 | 4.50 | 510.32 | From February 2001 to June 2021 (7426) |
| Málaga (MAL) | 55 | 36.76 | 4.54 | 461.63 | From October 2000 to June 2021 (7546) |
| Pizarra (PIZ) | 71 | 36.77 | 4.72 | 463.47 | From January 2001 to June 2021 (7447) |
| Vélez (VEL) | 33 | 36.80 | 4.13 | 490.49 | From October 2000 to June 2021 (7546) |
Table 2. Inputs configurations of the different models and approaches assessed. DOY represents the day of year, P corresponds to precipitation, D corresponds to distance, and i represents an index to the dataset-specific day.
| Target Station | Inputs Approach A | Inputs Approach B | Inputs Approach C |
|---|---|---|---|
| Area 1: | | | |
| Jaen | DOY(i) + P_ARJ(i) + D_JAE-ARJ + P_LIN(i) + D_JAE-LIN + P_MAN(i) + D_JAE-MAN + P_MAR(i) + D_JAE-MAR + P_SAB(i) + D_JAE-SAB + P_TOR(i) + D_JAE-TOR | DOY(i) + P_JAE(i − 1) + P_JAE(i + 1) | DOY(i) + P_JAE(i − 1) + P_JAE(i − 2) + P_JAE(i + 1) + P_JAE(i + 2) |
| La Higuera de Arjona | DOY(i) + P_JAE(i) + D_ARJ-JAE + P_LIN(i) + D_ARJ-LIN + P_MAN(i) + D_ARJ-MAN + P_MAR(i) + D_ARJ-MAR + P_SAB(i) + D_ARJ-SAB + P_TOR(i) + D_ARJ-TOR | DOY(i) + P_ARJ(i − 1) + P_ARJ(i + 1) | DOY(i) + P_ARJ(i − 1) + P_ARJ(i − 2) + P_ARJ(i + 1) + P_ARJ(i + 2) |
| Linares | DOY(i) + P_JAE(i) + D_LIN-JAE + P_ARJ(i) + D_LIN-ARJ + P_MAN(i) + D_LIN-MAN + P_MAR(i) + D_LIN-MAR + P_SAB(i) + D_LIN-SAB + P_TOR(i) + D_LIN-TOR | DOY(i) + P_LIN(i − 1) + P_LIN(i + 1) | DOY(i) + P_LIN(i − 1) + P_LIN(i − 2) + P_LIN(i + 1) + P_LIN(i + 2) |
| Mancha Real | DOY(i) + P_JAE(i) + D_MAN-JAE + P_ARJ(i) + D_MAN-ARJ + P_LIN(i) + D_MAN-LIN + P_MAR(i) + D_MAN-MAR + P_SAB(i) + D_MAN-SAB + P_TOR(i) + D_MAN-TOR | DOY(i) + P_MAN(i − 1) + P_MAN(i + 1) | DOY(i) + P_MAN(i − 1) + P_MAN(i − 2) + P_MAN(i + 1) + P_MAN(i + 2) |
| Marmolejo | DOY(i) + P_JAE(i) + D_MAR-JAE + P_ARJ(i) + D_MAR-ARJ + P_LIN(i) + D_MAR-LIN + P_MAN(i) + D_MAR-MAN + P_SAB(i) + D_MAR-SAB + P_TOR(i) + D_MAR-TOR | DOY(i) + P_MAR(i − 1) + P_MAR(i + 1) | DOY(i) + P_MAR(i − 1) + P_MAR(i − 2) + P_MAR(i + 1) + P_MAR(i + 2) |
| Sabiote | DOY(i) + P_JAE(i) + D_SAB-JAE + P_ARJ(i) + D_SAB-ARJ + P_LIN(i) + D_SAB-LIN + P_MAN(i) + D_SAB-MAN + P_MAR(i) + D_SAB-MAR + P_TOR(i) + D_SAB-TOR | DOY(i) + P_SAB(i − 1) + P_SAB(i + 1) | DOY(i) + P_SAB(i − 1) + P_SAB(i − 2) + P_SAB(i + 1) + P_SAB(i + 2) |
| Torreblascopedro | DOY(i) + P_JAE(i) + D_TOR-JAE + P_ARJ(i) + D_TOR-ARJ + P_LIN(i) + D_TOR-LIN + P_MAN(i) + D_TOR-MAN + P_MAR(i) + D_TOR-MAR + P_SAB(i) + D_TOR-SAB | DOY(i) + P_TOR(i − 1) + P_TOR(i + 1) | DOY(i) + P_TOR(i − 1) + P_TOR(i − 2) + P_TOR(i + 1) + P_TOR(i + 2) |
| Area 2: | | | |
| Antequera | DOY(i) + P_ARC(i) + D_ANT-ARC + P_CAR(i) + D_ANT-CAR + P_CHU(i) + D_ANT-CHU + P_MAL(i) + D_ANT-MAL + P_PIZ(i) + D_ANT-PIZ + P_VEL(i) + D_ANT-VEL | DOY(i) + P_ANT(i − 1) + P_ANT(i + 1) | DOY(i) + P_ANT(i − 1) + P_ANT(i − 2) + P_ANT(i + 1) + P_ANT(i + 2) |
| Archidona | DOY(i) + P_ANT(i) + D_ARC-ANT + P_CAR(i) + D_ARC-CAR + P_CHU(i) + D_ARC-CHU + P_MAL(i) + D_ARC-MAL + P_PIZ(i) + D_ARC-PIZ + P_VEL(i) + D_ARC-VEL | DOY(i) + P_ARC(i − 1) + P_ARC(i + 1) | DOY(i) + P_ARC(i − 1) + P_ARC(i − 2) + P_ARC(i + 1) + P_ARC(i + 2) |
| Cártama | DOY(i) + P_ANT(i) + D_CAR-ANT + P_ARC(i) + D_CAR-ARC + P_CHU(i) + D_CAR-CHU + P_MAL(i) + D_CAR-MAL + P_PIZ(i) + D_CAR-PIZ + P_VEL(i) + D_CAR-VEL | DOY(i) + P_CAR(i − 1) + P_CAR(i + 1) | DOY(i) + P_CAR(i − 1) + P_CAR(i − 2) + P_CAR(i + 1) + P_CAR(i + 2) |
| Churriana | DOY(i) + P_ANT(i) + D_CHU-ANT + P_ARC(i) + D_CHU-ARC + P_CAR(i) + D_CHU-CAR + P_MAL(i) + D_CHU-MAL + P_PIZ(i) + D_CHU-PIZ + P_VEL(i) + D_CHU-VEL | DOY(i) + P_CHU(i − 1) + P_CHU(i + 1) | DOY(i) + P_CHU(i − 1) + P_CHU(i − 2) + P_CHU(i + 1) + P_CHU(i + 2) |
| Málaga | DOY(i) + P_ANT(i) + D_MAL-ANT + P_ARC(i) + D_MAL-ARC + P_CAR(i) + D_MAL-CAR + P_CHU(i) + D_MAL-CHU + P_PIZ(i) + D_MAL-PIZ + P_VEL(i) + D_MAL-VEL | DOY(i) + P_MAL(i − 1) + P_MAL(i + 1) | DOY(i) + P_MAL(i − 1) + P_MAL(i − 2) + P_MAL(i + 1) + P_MAL(i + 2) |
| Pizarra | DOY(i) + P_ANT(i) + D_PIZ-ANT + P_ARC(i) + D_PIZ-ARC + P_CAR(i) + D_PIZ-CAR + P_CHU(i) + D_PIZ-CHU + P_MAL(i) + D_PIZ-MAL + P_VEL(i) + D_PIZ-VEL | DOY(i) + P_PIZ(i − 1) + P_PIZ(i + 1) | DOY(i) + P_PIZ(i − 1) + P_PIZ(i − 2) + P_PIZ(i + 1) + P_PIZ(i + 2) |
| Vélez | DOY(i) + P_ANT(i) + D_VEL-ANT + P_ARC(i) + D_VEL-ARC + P_CAR(i) + D_VEL-CAR + P_CHU(i) + D_VEL-CHU + P_MAL(i) + D_VEL-MAL + P_PIZ(i) + D_VEL-PIZ | DOY(i) + P_VEL(i − 1) + P_VEL(i + 1) | DOY(i) + P_VEL(i − 1) + P_VEL(i − 2) + P_VEL(i + 1) + P_VEL(i + 2) |
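The input configurations in Table 2 can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the implementation used in this work: the function names, the sample rainfall values, and the distances are hypothetical, and the series is assumed to be complete and made of consecutive days, so that the neighboring days i − 1 and i + 1 are always available.

```python
def temporal_inputs(precip, doy, k):
    """Build (inputs, target) rows from the target station's own daily series.

    k=1 mirrors approach B and k=2 mirrors approach C in Table 2:
    inputs are DOY(i), P(i-1)..P(i-k) and P(i+1)..P(i+k); the target is P(i).
    """
    rows = []
    for i in range(k, len(precip) - k):
        before = [precip[i - j] for j in range(1, k + 1)]
        after = [precip[i + j] for j in range(1, k + 1)]
        rows.append(([doy[i]] + before + after, precip[i]))
    return rows


def neighbor_inputs(doy, neighbor_precip, distances_km, i):
    """Approach A inputs for day i: DOY(i), then P and D for each neighbor."""
    row = [doy[i]]
    for code in sorted(neighbor_precip):
        row.append(neighbor_precip[code][i])  # rainfall at the neighbor on day i
        row.append(distances_km[code])        # distance from target to neighbor
    return row


# Hypothetical daily rainfall (mm/day) and day-of-year values.
precip = [0.0, 1.2, 0.0, 7.5, 3.1, 0.0]
doy = [10, 11, 12, 13, 14, 15]

rows_b = temporal_inputs(precip, doy, k=1)
rows_c = temporal_inputs(precip, doy, k=2)

# Hypothetical neighbor series and distances (km); not values from this work.
neighbors = {"ARJ": [0.0, 0.9, 0.2, 6.8, 2.5, 0.0],
             "LIN": [0.0, 1.5, 0.0, 8.1, 3.9, 0.1]}
distances = {"ARJ": 21.0, "LIN": 23.5}
row_a = neighbor_inputs(doy, neighbors, distances, i=3)

print(rows_b[0])
print(rows_c[0])
print(row_a)
```

Approaches B and C lose the first and last k days of the series, since those days lack a full window, whereas approach A can produce inputs for any day on which the neighbor stations have data.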
Table 3. Hyperparameter set for each model, approach, and location, after carrying out Bayesian optimization, where activation represents the activation function, the optimizer represents the optimizer function, epochs represents the number of epochs, neurons represents the number of hidden layers and the number of neurons of each, kernel is the kernel function, c and epsilon represent internal hyperparameters of SVR, n_estimators is the number of trees in RF, and max_features is the number of features to consider when looking for the best split.
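| Location | Model | Hyperparameter | Approach A | Approach B | Approach C |
|---|---|---|---|---|---|
| La Higuera de Arjona | MLP | activation | ReLU | ReLU | ReLU |
| | | optimizer | ADAM | ADAM | ADAM |
| | | epochs | 100 | 87 | 53 |
| | | neurons | (20, 20) | (9, 15, 10) | (6, 15, 9) |
| | SVM | kernel | RBF | RBF | poly |
| | | c | 10.0 | 10.0 | 1.855 |
| | | epsilon | 0.01 | 0.01 | 0.01 |
| | RF | n_estimators | 100 | 100 | 91 |
| | | max_features | sqrt | auto | log2 |
| Jaen | MLP | activation | ReLU | ReLU | ReLU |
| | | optimizer | ADAM | ADAM | ADAM |
| | | epochs | 92 | 61 | 98 |
| | | neurons | (20, 20) | (2, 1, 12) | (1, 10, 8) |
| | SVM | kernel | linear | linear | RBF |
| | | c | 1.758 | 9.730 | 10.0 |
| | | epsilon | 0.739 | 0.01 | 0.01 |
| | RF | n_estimators | 94 | 95 | 100 |
| | | max_features | auto | log2 | log2 |
| Linares | MLP | activation | ReLU | ReLU | ReLU |
| | | optimizer | ADAM | ADAM | ADAM |
| | | epochs | 100 | 100 | 100 |
| | | neurons | (20, 20) | (1, 1, 1) | (1, 1, 1) |
| | SVM | kernel | linear | RBF | RBF |
| | | c | 10.0 | 4.023 | 10.0 |
| | | epsilon | 0.01 | 0.018 | 0.01 |
| | RF | n_estimators | 100 | 97 | 80 |
| | | max_features | auto | sqrt | log2 |
| Mancha Real | MLP | activation | ReLU | ReLU | ReLU |
| | | optimizer | ADAM | ADAM | ADAM |
| | | epochs | 100 | 99 | 100 |
| | | neurons | (20, 20) | (5, 14) | (1, 1, 1) |
| | SVM | kernel | RBF | RBF | RBF |
| | | c | 10.0 | 6.235 | 9.211 |
| | | epsilon | 0.01 | 0.010 | 0.01 |
| | RF | n_estimators | 75 | 41 | 46 |
| | | max_features | auto | sqrt | log2 |
| Marmolejo | MLP | activation | ReLU | ReLU | ReLU |
| | | optimizer | ADAM | ADAM | ADAM |
| | | epochs | 100 | 96 | 10 |
| | | neurons | (20, 6) | (5, 3, 11) | (1, 11) |
| | SVM | kernel | linear | RBF | RBF |
| | | c | 10.0 | 4.350 | 9.970 |
| | | epsilon | 0.01 | 0.01 | 0.01 |
| | RF | n_estimators | 100 | 31 | 100 |
| | | max_features | auto | auto | sqrt |
| Sabiote | MLP | activation | ReLU | ReLU | ReLU |
| | | optimizer | ADAM | ADAM | ADAM |
| | | epochs | 100 | 100 | 95 |
| | | neurons | (20, 20) | (1, 1, 1) | (2, 11, 9) |
| | SVM | kernel | linear | RBF | RBF |
| | | c | 10.0 | 10.0 | 10.0 |
| | | epsilon | 0.01 | 0.01 | 0.01 |
| | RF | n_estimators | 72 | 39 | 57 |
| | | max_features | log2 | log2 | log2 |
| Torreblascopedro | MLP | activation | ReLU | ReLU | ReLU |
| | | optimizer | ADAM | ADAM | ADAM |
| | | epochs | 100 | 72 | 73 |
| | | neurons | (20, 12) | (1, 4, 13) | (5, 1, 17) |
| | SVM | kernel | linear | RBF | poly |
| | | c | 3.795 | 3.108 | 6.205 |
| | | epsilon | 0.01 | 0.01 | 0.012 |
| | RF | n_estimators | 81 | 64 | 94 |
| | | max_features | log2 | sqrt | sqrt |
| Antequera | MLP | activation | ReLU | ReLU | ReLU |
| | | optimizer | ADAM | ADAM | ADAM |
| | | epochs | 200 | 174 | 61 |
| | | neurons | (13, 8) | (5, 2, 20) | (14, 11, 13) |
| | SVM | kernel | linear | RBF | RBF |
| | | c | 8.684 | 7.627 | 4.981 |
| | | epsilon | 0.225 | 0.01 | 0.014 |
| | RF | n_estimators | 55 | 94 | 41 |
| | | max_features | auto | auto | log2 |
| Archidona | MLP | activation | ReLU | ReLU | ReLU |
| | | optimizer | ADAM | ADAM | ADAM |
| | | epochs | 94 | 40 | 11 |
| | | neurons | (13, 5, 18) | (11, 12, 1) | (16, 11, 19) |
| | SVM | kernel | linear | poly | RBF |
| | | c | 7.246 | 4.531 | 4.104 |
| | | epsilon | 0.01 | 0.01 | 0.013 |
| | RF | n_estimators | 81 | 93 | 100 |
| | | max_features | auto | auto | sqrt |
| Cártama | MLP | activation | ReLU | ReLU | ReLU |
| | | optimizer | ADAM | ADAM | ADAM |
| | | epochs | 129 | 101 | 12 |
| | | neurons | (8, 13, 17) | (1, 1, 1) | (14, 17, 6) |
| | SVM | kernel | linear | RBF | poly |
| | | c | 6.273 | 7.830 | 3.862 |
| | | epsilon | 0.01 | 0.01 | 0.01 |
| | RF | n_estimators | 92 | 10 | 38 |
| | | max_features | auto | sqrt | log2 |
| Churriana | MLP | activation | ReLU | ReLU | ReLU |
| | | optimizer | ADAM | ADAM | ADAM |
| | | epochs | 180 | 200 | 70 |
| | | neurons | (20, 20, 18) | (1, 1, 1) | (5, 15, 7) |
| | SVM | kernel | linear | RBF | RBF |
| | | c | 10.0 | 10.0 | 5.963 |
| | | epsilon | 0.01 | 0.01 | 0.01 |
| | RF | n_estimators | 100 | 40 | 36 |
| | | max_features | log2 | log2 | sqrt |
| Málaga | MLP | activation | ReLU | ReLU | ReLU |
| | | optimizer | ADAM | ADAM | ADAM |
| | | epochs | 158 | 97 | 127 |
| | | neurons | (20, 20, 20) | (17, 11, 10) | (13, 4, 16) |
| | SVM | kernel | linear | RBF | RBF |
| | | c | 8.784 | 9.999 | 6.952 |
| | | epsilon | 0.01 | 0.01 | 0.011 |
| | RF | n_estimators | 69 | 14 | 10 |
| | | max_features | log2 | log2 | log2 |
| Pizarra | MLP | activation | ReLU | ReLU | ReLU |
| | | optimizer | ADAM | ADAM | ADAM |
| | | epochs | 192 | 94 | 171 |
| | | neurons | (13, 15, 8) | (3, 4, 6) | (14, 1, 6) |
| | SVM | kernel | linear | RBF | RBF |
| | | c | 7.642 | 10.0 | 4.031 |
| | | epsilon | 0.015 | 0.01 | 0.01 |
| | RF | n_estimators | 76 | 45 | 95 |
| | | max_features | auto | sqrt | sqrt |
| Vélez-Málaga | MLP | activation | ReLU | ReLU | ReLU |
| | | optimizer | ADAM | ADAM | ADAM |
| | | epochs | 200 | 180 | 139 |
| | | neurons | (20, 20, 20) | (15, 13, 10) | (8, 2, 10) |
| | SVM | kernel | RBF | RBF | RBF |
| | | c | 10.0 | 6.032 | 10.0 |
| | | epsilon | 0.01 | 0.01 | 0.01 |
| | RF | n_estimators | 72 | 62 | 78 |
| | | max_features | sqrt | log2 | sqrt |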
Table 4. RMSE, MBE, and R2 performance values from testing dataset for all locations and models in the first area (inland locations), using data from neighbor stations. The best values for each site are in bold.
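| Stations (Area 1) | Model | RMSE [mm/day] | MBE [mm/day] | R2 |
|---|---|---|---|---|
| La Higuera de Arjona | MLP | 1.363 | 0.016 | 0.894 |
| | SVM | 1.800 | −0.106 | 0.818 |
| | RF | 1.384 | 0.026 | 0.889 |
| | LI | 1.502 | −0.008 | 0.869 |
| Jaen | MLP | 1.767 | −0.097 | 0.827 |
| | SVM | 1.822 | −0.064 | 0.823 |
| | RF | 1.880 | 0.023 | 0.804 |
| | LI | 1.916 | 0.051 | 0.797 |
| Linares | MLP | 1.723 | 0.083 | 0.817 |
| | SVM | 1.808 | −0.106 | 0.798 |
| | RF | 1.730 | 0.001 | 0.815 |
| | LI | 1.896 | −0.001 | 0.784 |
| Mancha Real | MLP | 1.662 | −0.072 | 0.831 |
| | SVM | 1.948 | −0.195 | 0.780 |
| | RF | 1.730 | −0.078 | 0.816 |
| | LI | 1.852 | 0.110 | 0.790 |
| Marmolejo | MLP | 2.176 | −0.187 | 0.791 |
| | SVM | 2.154 | −0.169 | 0.795 |
| | RF | 2.129 | 0.041 | 0.801 |
| | LI | 2.392 | −0.249 | 0.753 |
| Sabiote | MLP | 2.049 | −0.101 | 0.752 |
| | SVM | 2.135 | −0.224 | 0.739 |
| | RF | 2.105 | −0.061 | 0.740 |
| | LI | 2.112 | −0.006 | 0.742 |
| Torreblascopedro | MLP | 1.270 | −0.035 | 0.894 |
| | SVM | 1.246 | −0.005 | 0.898 |
| | RF | 1.359 | 0.019 | 0.878 |
| | LI | 1.277 | 0.047 | 0.894 |
| Mean values | | 1.792 | −0.048 | 0.815 |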
Table 5. RMSE, MBE, and R2 performance values from testing dataset for all locations and models in the second area (coastal locations), using data from neighbor stations. The best values for each site are in bold.
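| Stations (Area 2) | Model | RMSE [mm/day] | MBE [mm/day] | R2 |
|---|---|---|---|---|
| Antequera | MLP | 1.595 | 0.035 | 0.875 |
| | SVM | 1.632 | −0.104 | 0.875 |
| | RF | 2.009 | 0.042 | 0.799 |
| | LI | 2.839 | 0.100 | 0.684 |
| Archidona | MLP | 1.811 | −0.043 | 0.844 |
| | SVM | 1.817 | −0.168 | 0.844 |
| | RF | 2.002 | −0.019 | 0.809 |
| | LI | 3.286 | −0.041 | 0.594 |
| Cártama | MLP | 2.640 | −0.075 | 0.756 |
| | SVM | 2.502 | −0.106 | 0.778 |
| | RF | 2.820 | 0.002 | 0.737 |
| | LI | 2.630 | 0.061 | 0.756 |
| Churriana | MLP | 2.192 | −0.052 | 0.876 |
| | SVM | 2.465 | −0.147 | 0.860 |
| | RF | 2.315 | 0.019 | 0.862 |
| | LI | 2.973 | −0.061 | 0.790 |
| Málaga | MLP | 2.485 | 0.099 | 0.830 |
| | SVM | 2.448 | −0.170 | 0.825 |
| | RF | 2.433 | 0.012 | 0.816 |
| | LI | 2.610 | 0.04 | 0.785 |
| Pizarra | MLP | 2.032 | 0.043 | 0.854 |
| | SVM | 2.083 | −0.112 | 0.853 |
| | RF | 2.032 | 0.039 | 0.854 |
| | LI | 2.108 | 0.079 | 0.842 |
| Vélez-Málaga | MLP | 3.219 | −0.074 | 0.742 |
| | SVM | 3.531 | −0.376 | 0.706 |
| | RF | 3.306 | −0.020 | 0.719 |
| | LI | 3.489 | −0.157 | 0.692 |
| Mean values | | 2.475 | −0.041 | 0.794 |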
Table 6. RMSE, MBE, and R2 performance values from testing dataset for all locations and models in the first area (inland locations), using data from the target station in two different approaches, with the use of the previous and following day and the use of the two previous and two following days. The best values from each station are in bold.
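| Stations (Area 1) | Model | One Day (B): RMSE [mm/day] | One Day (B): MBE [mm/day] | One Day (B): R2 | Two Days (C): RMSE [mm/day] | Two Days (C): MBE [mm/day] | Two Days (C): R2 |
|---|---|---|---|---|---|---|---|
| La Higuera de Arjona | MLP | 4.409 | −0.021 | 0.023 | 4.079 | 0.061 | 0.051 |
| | SVM | 4.601 | −1.218 | 0.008 | 4.348 | −1.225 | 0.027 |
| | RF | 4.524 | −0.880 | 0.020 | 4.224 | −0.932 | 0.033 |
| Jaén | MLP | 3.875 | −1.071 | 0.016 | 4.423 | −0.016 | 0.022 |
| | SVM | 3.857 | −1.039 | 0.018 | 4.613 | −1.189 | 0.007 |
| | RF | 3.785 | −0.771 | 0.019 | 4.583 | −1.103 | 0.011 |
| Linares | MLP | 4.797 | −1.378 | 0.015 | 4.455 | −1.260 | 0.010 |
| | SVM | 4.754 | −1.308 | 0.019 | 4.423 | −1.202 | 0.010 |
| | RF | 4.719 | −0.940 | 0.012 | 4.371 | −0.911 | 0.014 |
| Mancha Real | MLP | 3.246 | 0.128 | 0.047 | 3.288 | 0.305 | 0.012 |
| | SVM | 3.450 | −0.946 | 0.005 | 3.390 | −0.842 | 0.002 |
| | RF | 3.386 | −0.820 | 0.021 | 3.383 | −0.788 | 0.003 |
| Marmolejo | MLP | 5.530 | −1.459 | 0.012 | 5.396 | −1.374 | 0.014 |
| | SVM | 5.501 | −1.410 | 0.022 | 5.360 | −1.307 | 0.015 |
| | RF | 5.474 | −0.947 | 0.014 | 5.235 | −0.761 | 0.028 |
| Sabiote | MLP | 3.992 | −1.159 | 0.026 | 4.186 | −1.114 | 0.008 |
| | SVM | 3.937 | −1.091 | 0.030 | 4.155 | −1.041 | 0.006 |
| | RF | 3.893 | −0.797 | 0.016 | 4.119 | −0.910 | 0.010 |
| Torreblascopedro | MLP | 4.658 | −1.287 | 0.022 | 4.283 | −1.204 | 0.011 |
| | SVM | 4.626 | −1.236 | 0.027 | 4.263 | −1.167 | 0.010 |
| | RF | 4.539 | −0.900 | 0.021 | 4.202 | −0.802 | 0.015 |
| Mean values | | 4.359 | −0.978 | 0.034 | 4.322 | −0.894 | 0.037 |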
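Approaches B and C, as described in the caption of Table 6, estimate a day's rainfall from the target station's own record: the previous and following day (B), or the two previous and two following days (C). A minimal sketch of how such input windows might be assembled is given below. The function name `window_features`, the use of `None` to mark gaps, and the toy series are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
from typing import List, Optional, Tuple


def window_features(
    series: List[Optional[float]], k: int
) -> List[Tuple[List[float], float]]:
    """Build (inputs, target) pairs from a daily rainfall series.

    Inputs are the k days before and the k days after the target day
    (k=1 mimics approach B, k=2 mimics approach C). Any day whose
    window or target contains a gap (None) is skipped.
    """
    samples = []
    for t in range(k, len(series) - k):
        window = series[t - k:t] + series[t + 1:t + k + 1]
        target = series[t]
        if target is None or any(v is None for v in window):
            continue
        samples.append((list(window), target))
    return samples


# Hypothetical daily rainfall in mm/day; None marks a missing record.
rain = [0.0, 2.5, 10.0, 4.0, 0.0, None, 1.2]

approach_b = window_features(rain, k=1)  # one day before and after
approach_c = window_features(rain, k=2)  # two days before and after
```

Widening the window from one to two days shrinks the number of usable samples around each gap, since every extra neighboring day must itself be present.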
Table 7. RMSE, MBE, and R2 performance values on the testing dataset for all locations and models in the second area (coastal locations), using data from the target station under two approaches: the previous and following day (B), and the two previous and two following days (C). The best values for each station are in bold.
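| Stations (Area 2) | Model | One Day (B): RMSE [mm/day] | One Day (B): MBE [mm/day] | One Day (B): R2 | Two Days (C): RMSE [mm/day] | Two Days (C): MBE [mm/day] | Two Days (C): R2 |
|---|---|---|---|---|---|---|---|
| Antequera | MLP | 5.035 | −0.246 | 0.027 | 4.521 | −1.246 | 0.048 |
| | SVM | 5.243 | −1.296 | 0.021 | 4.480 | −1.197 | 0.045 |
| | RF | 5.229 | −1.221 | 0.005 | 4.467 | −1.163 | 0.017 |
| Archidona | MLP | 4.108 | −1.095 | 0.008 | 4.109 | −0.059 | 0.041 |
| | SVM | 4.089 | −1.012 | 0.004 | 4.328 | −1.180 | 0.027 |
| | RF | 4.083 | −0.480 | 0.023 | 4.252 | −0.695 | 0.029 |
| Cártama | MLP | 5.479 | −1.149 | 0.009 | 5.235 | 0.239 | 0.040 |
| | SVM | 5.550 | −1.144 | 0.027 | 5.431 | −1.132 | 0.021 |
| | RF | 5.631 | −0.896 | 0.021 | 5.374 | −1.054 | 0.024 |
| Churriana | MLP | 6.551 | −1.314 | 0.051 | 6.849 | −1.406 | 0.017 |
| | SVM | 6.449 | −1.263 | 0.045 | 6.817 | −1.367 | 0.009 |
| | RF | 6.448 | −1.148 | 0.022 | 6.781 | −1.263 | 0.012 |
| Málaga | MLP | 5.028 | 0.294 | 0.079 | 6.850 | −1.324 | 0.044 |
| | SVM | 5.279 | −1.028 | 0.023 | 6.765 | −1.273 | 0.056 |
| | RF | 5.104 | −0.884 | 0.079 | 6.693 | −1.182 | 0.041 |
| Pizarra | MLP | 5.152 | 0.253 | 0.031 | 5.871 | 0.014 | 0.050 |
| | SVM | 5.266 | −1.071 | 0.044 | 6.064 | −1.205 | 0.081 |
| | RF | 5.267 | −0.785 | 0.025 | 6.058 | −1.074 | 0.021 |
| Vélez-Málaga | MLP | 5.295 | 0.191 | 0.076 | 5.360 | 0.149 | 0.061 |
| | SVM | 5.535 | −1.198 | 0.047 | 5.544 | −1.144 | 0.040 |
| | RF | 5.465 | −1.046 | 0.052 | 5.489 | −1.054 | 0.056 |
| Mean values | | 5.299 | −0.835 | 0.019 | 5.587 | −0.934 | 0.015 |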
Table 8. Improvements of the best ML model at each site over simple arithmetic averaging, for RMSE and R2. A positive value means that ML outperformed LI.
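| Station | RMSE (mm/day) | R2 |
|---|---|---|
| La Higuera de Arjona | 0.139 | 0.025 |
| Jaén | 0.149 | 0.03 |
| Linares | 0.173 | 0.033 |
| Mancha Real | 0.19 | 0.041 |
| Marmolejo | 0.263 | 0.048 |
| Sabiote | 0.063 | 0.010 |
| Torreblascopedro | 0.031 | 0.004 |
| Antequera | 1.244 | 0.191 |
| Archidona | 1.475 | 0.25 |
| Cártama | 0.128 | 0.022 |
| Churriana | 0.781 | 0.086 |
| Málaga | 0.177 | 0.045 |
| Pizarra | 0.076 | 0.012 |
| Vélez-Málaga | 0.265 | 0.05 |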
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Bellido-Jiménez, J.A.; Gualda, J.E.; García-Marín, A.P. Assessing Machine Learning Models for Gap Filling Daily Rainfall Series in a Semiarid Region of Spain. Atmosphere 2021, 12, 1158. https://doi.org/10.3390/atmos12091158
