3. Theoretical Framework
Current technology allows the storage of large and multiple databases. The analysis of these data is often useful; however, it is impractical without the aid of computational tools. The knowledge discovery in databases (KDD) process uses computational tools to identify valid and potentially useful patterns in the data and to generate knowledge [20,21,22,23,24]. Typically, this process includes the following steps:
Data selection/Problem definition: the domain of available data is defined, as are the information and data that are relevant and the knowledge-discovery objectives.
Preprocessing: this aims to prepare the data for the algorithms of the next stage. This involves performing data cleaning, data integration, data reduction, and data transformation/normalization.
Data Mining: algorithms are applied to the data in order to extract patterns and knowledge. The choice of algorithm depends on the type of task to be performed.
Evaluation and representation of results: the models produced are interpreted, and evaluation metrics are used to estimate the quality of the results. Tools are used to visualize the data produced as output.
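The steps above can be sketched as a minimal pipeline. This is a hypothetical illustration: the toy records, attribute names, and the trivial per-road mean predictor are assumptions for demonstration, not the paper's dataset or models.

```python
from collections import defaultdict

# 1. Data selection: records of (road type, hour band) -> accident count.
raw = [
    {"road": "motorway", "hour": "17-20", "accidents": 5},
    {"road": "motorway", "hour": "08-11", "accidents": 2},
    {"road": "village",  "hour": "17-20", "accidents": 1},
    {"road": "village",  "hour": "08-11", "accidents": None},  # missing value
]

# 2. Preprocessing: drop incomplete records (data cleaning).
clean = [r for r in raw if r["accidents"] is not None]

# 3. Data mining: a trivial model predicting the mean count per road type.
sums, counts = defaultdict(float), defaultdict(int)
for r in clean:
    sums[r["road"]] += r["accidents"]
    counts[r["road"]] += 1
model = {road: sums[road] / counts[road] for road in sums}

# 4. Evaluation: mean absolute error of the model on the cleaned records.
errors = [abs(r["accidents"] - model[r["road"]]) for r in clean]
mae = sum(errors) / len(errors)
```

Each numbered comment corresponds to one KDD step; a real pipeline would replace the trivial model with the mining algorithms discussed below.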
We aim to solve a regression problem in which the target variable is the number of accidents that occur on each road in a range of time periods. The learning is supervised, since we already have annotated accident data with which to train the model. The input data are categorical and the target variable is numeric.
Supervised learning occurs when the data already have an associated output; as this is the case with our data, we implement only algorithms that fit this profile. For example, if the objective of a data-mining problem is to predict male or female gender from the image of a face, it is necessary to have a set of faces with the gender already correctly identified. It is important to distinguish regression problems, where the values we want to predict are numerical, from classification problems, where they are categorical [23,25,26,27,28].
Different techniques were analyzed in [26], and it was concluded that decision trees, naive Bayes, and support vector machines are the most frequently used techniques. Other frequently used supervised learning algorithms are k-nearest neighbors (kNN) [25,26,27,29] and the artificial neural network (ANN) [25,27,30]. Based on this information, these algorithms were implemented.
The most important attributes for road traffic accidents [5,31,32,33,34,35,36,37] were divided into three groups and are listed in Table 1.
For the selection of attributes, it is important to analyze the correlation between the different variables and the target variable. The Pearson correlation coefficient is often used to compute the linear correlation between continuous numeric variables. However, we must use a different metric to compute the correlation between categorical variables, as is the case with our dataset. Cramer's V is used to compute the correlation between nominal categorical variables with more than two (non-binary) values [38].
Cramer's V is defined as [39]:

V = \sqrt{\frac{\chi^2}{N\,(k-1)}}

where V is the value of Cramer's V, \chi^2 is the chi-squared value, N is the number of samples, and k is the number of categories of the variable with the smallest number of categories. The chi-squared value is defined as:

\chi^2 = \sum_{i}\sum_{j}\frac{(o_{ij}-e_{ij})^2}{e_{ij}}

where e_{ij} is the expected frequency value and o_{ij} is the observed frequency value of a combination of two values, one of variable i and the other of variable j. The expected frequency value can be computed as

e_{ij} = \frac{o_i\,o_j}{N}

and represents the expected frequency of a combination of two values (one of i, the other of j). In the previous formula, o_i is the marginal frequency of one of the values of variable i, o_j is the marginal frequency of one of the values of variable j, and N is the total number of samples.
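Using these definitions, Cramer's V can be computed from a contingency table of observed frequencies in a few lines. This is a pure-Python sketch for clarity; a real analysis would typically use a library such as SciPy to obtain the chi-squared statistic.

```python
from math import sqrt

def cramers_v(table):
    """Cramer's V for a contingency table of observed frequencies o_ij."""
    n = sum(sum(row) for row in table)            # total number of samples N
    row_tot = [sum(row) for row in table]         # marginal frequencies o_i
    col_tot = [sum(col) for col in zip(*table)]   # marginal frequencies o_j
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o_ij in enumerate(row):
            e_ij = row_tot[i] * col_tot[j] / n    # expected frequency e_ij
            chi2 += (o_ij - e_ij) ** 2 / e_ij
    k = min(len(table), len(table[0]))            # variable with fewest categories
    return sqrt(chi2 / (n * (k - 1)))
```

For example, `cramers_v([[10, 0], [0, 10]])` returns 1.0 for a perfectly associated pair of binary variables.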
The interpretation of the strength of the correlation between two nominal categorical variables as a function of Cramer's V is given in Table 2 [36].
To achieve a universal standard for deleting attributes with low correlation values, it is important that all the calculated correlations be comparable. The Kruskal-Wallis H statistic follows a chi-squared distribution, like the statistic used in Cramer's V, so the values obtained with the two measures can be compared. The expression for the Kruskal-Wallis test [29,40,41] is given by:

H = (N-1)\,\frac{\sum_{i=1}^{g} n_i\,(\bar{r}_i - \bar{r})^2}{\sum_{i=1}^{g}\sum_{j=1}^{n_i} (r_{ij} - \bar{r})^2}

where N is the total number of samples across all groups, g is the number of groups, n_i is the number of samples in group i, r_{ij} is the rank value of sample j that belongs to group i, \bar{r}_i is the mean rank of all observations j in group i, and \bar{r} = (N+1)/2 is the average of all ranks r_{ij}, i.e., the expected value for the average of all groups.
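The H statistic can be computed directly from these definitions. The sketch below is a minimal pure-Python implementation (tied values receive their average rank); in practice a statistics library would normally be used.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H over a list of sample groups."""
    # Pool all values and assign average ranks to tied runs.
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)                       # total number of samples N
    rank = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # average rank of positions i+1..j
        i = j
    r_bar = (n + 1) / 2                    # mean of all ranks
    num = 0.0                              # sum over groups of n_i (r_i - r)^2
    den = 0.0                              # sum of squared rank deviations
    for g in groups:
        ranks = [rank[v] for v in g]
        r_i = sum(ranks) / len(ranks)      # mean rank of group i
        num += len(g) * (r_i - r_bar) ** 2
        den += sum((r - r_bar) ** 2 for r in ranks)
    return (n - 1) * num / den
```

Two identical groups yield H = 0, while fully separated groups such as [1, 2, 3] and [4, 5, 6] yield a large H, reflecting a strong group effect.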
Relief-based feature selection (RBA) and sequential backward selection (SBS) were used for the selection of features [42,43,44,45]. Starting from the full set of features, SBS gradually removes features according to a performance measure, which estimates the extent to which each feature improves or worsens a mining method. At each iteration, the feature whose removal least degrades performance is eliminated from the feature set.
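A minimal sketch of sequential backward selection follows. The scoring function here is hypothetical (it rewards two features assumed informative and penalizes set size); in the paper's setting the score would be the performance of a mining method trained on the candidate subset.

```python
def sbs(features, score_fn, k):
    """Sequential backward selection: start from the full feature set and
    repeatedly drop the feature whose removal gives the best score,
    until only k features remain."""
    current = list(features)
    while len(current) > k:
        # Evaluate every subset obtained by dropping one feature.
        candidates = [[f for f in current if f != drop] for drop in current]
        current = max(candidates, key=score_fn)
    return current

# Hypothetical score: reward assumed-informative features, penalize set size.
USEFUL = {"weather", "hour"}
def score(fs):
    return sum(f in USEFUL for f in fs) - 0.1 * len(fs)

selected = sbs(["weather", "hour", "noise_a", "noise_b"], score, k=2)
```

With this toy score, the two noise features are removed first and the selection ends with the informative pair.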
To evaluate the different mining algorithms, we use the mean absolute error (MAE), an error measure that averages the absolute differences between the observations and the values obtained by the model. The mean squared error was not used, because the number of accidents has many outliers that would significantly bias that metric. The MAE is given by the following equation:

\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert

where y_i is the observed value, \hat{y}_i is the value predicted by the model, and n is the number of samples.
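The MAE is straightforward to compute; a one-function sketch:

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute difference between observed and predicted values."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)
```

For example, `mean_absolute_error([3, 0, 7], [2, 1, 7])` averages the errors 1, 1, and 0.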
As the purpose of this work is to present the risk of accidents rather than to predict the exact number of accidents, the predicted values and the actual values are grouped into three risk groups: low, medium, and high. After making this grouping, we can compute the classification accuracy:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

The classification accuracy measures the ratio of correct predictions to the total number of instances evaluated, where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives [29].
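The grouping-then-scoring step can be sketched as follows. The risk thresholds here are hypothetical, chosen only for illustration; the paper does not specify the boundaries of the low/medium/high groups.

```python
def risk_group(count, low_max=2, high_min=6):
    """Map an accident count to a risk group (thresholds are illustrative)."""
    if count <= low_max:
        return "low"
    if count >= high_min:
        return "high"
    return "medium"

def accuracy(y_true, y_pred):
    """Fraction of predictions whose risk group matches the actual group."""
    hits = sum(risk_group(t) == risk_group(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)
```

Note that a prediction counts as correct whenever it falls in the same risk group as the actual value, even if the exact counts differ.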
5. Conclusions
In this work, data-mining methods for the prediction of the risk of road accidents were analyzed. Data on accident reports were made available by the National Guard and related to accidents that occurred in the Setubal region from 2019 to 2021. We describe the process followed to develop accident-prediction methods. This process consists of three modules: (i) data selection and collection, (ii) pre-processing, and (iii) the use of mining algorithms.
Through a preliminary data analysis, it was concluded that the highest concentration of accidents occurs between 17 h and 20 h. It was also possible to conclude that rain is the meteorological factor with the highest probability of increasing the risk of an accident. A further conclusion is that Friday is the day of the week on which the most accidents occur. These conclusions are consistent with the literature [47].
Through an analysis of the correlation between the different variables, it was possible to conclude that location is the variable that most influences the frequency of accidents. Following on from this conclusion, the information characterizing the accidents was grouped according to the type of road where the accidents occurred. For this reason, it was necessary to create different models for each set. In addition to the location, the correlation between variables also highlighted other factors that influenced the frequency of accidents, such as the time of day, the meteorological conditions, and whether the accident occurred in a village or elsewhere. After dividing the data set into the three types of location (motorways, national roads or itineraries, and villages), it was possible, using the feature-selection algorithms, to understand which features most influence each type of accident location.
The data-mining problem was approached as a regression problem, since the target variable was the frequency of accidents in the defined time range. The mining algorithms tested were kNN, simple linear regression, Lasso and Ridge, the Decision Tree for regression, and the traditional neural network, both for the initial dataset and for the datasets divided by location in the following sets: motorways, national roads or itineraries, and villages. The best result was achieved through the neural network. However, for each set, different models were produced, with different architectures (number of nodes, training periods, etc.). The best result occurred for the motorway dataset. The motorway, despite being the location with the lowest number of accidents, is the one with the highest density of accidents per area when compared with villages; it also features the highest density of accidents per road, when compared with the concentration of accidents on national routes or roads. In addition, the motorway is the location where there are more injuries and deaths per accident. The motorway is also the location where it is possible for the National Guard to carry out more effective surveillance, since in the villages there are a large number of roads, and consequently there is a vast area where accidents can occur; however, the density of accidents on village roads is low.
This work is valuable because it was possible to obtain good results for the prediction of the risk of accidents on motorways using variables whose future values can be anticipated. For example, it is possible today to make a weather forecast for the next week; we can distinguish the different days of the week in the future; we know which days will be holidays, etc. By using input data that relate only to future events, we can obtain an accident-risk estimate for a day in the future and thus enable the police to improve their forward planning.
In future work, the first step would be to improve data collection to ensure that the geolocation of accidents was acquired, making it possible to opt for more complex approaches. Another important variable to obtain would be the level of human mobility; it would be possible to acquire this by using applications such as Google Maps or Waze, or simply by recording the speed at which Uber taxis or other companies’ vehicles travel.