# The Prediction of Road-Accident Risk through Data Mining: A Case Study from Setubal, Portugal

^{1}

^{2}

^{3}

^{4}

^{5}

^{*}

## Abstract

**:**

## 1. Introduction

## 2. Related Work

#### 2.1. The Classical Approach

#### 2.2. The Deep Learning Approach

#### 2.3. Other Relevant Works

## 3. Theoretical Framework

_{c}is the value of V of Cramer, X

^{2}is the value of chi-squared, N is the number of samples, and k is the number of categories of the variable with the smallest number of categories. The chi-square value is defined as:

_{ij}is the expected frequency value and o

_{ij}is the observed frequency value of a combination of two values, one of variable i, the other of variable j. The expected frequency value can be computed as

_{i}is the marginal frequency of one of the values of the variable i, o

_{j}is the marginal frequency of one of the values of j, and N is the total number of samples.

_{i}is the number of samples in group i, r

_{ij}is the rank value of sample j that belongs to group i, $\overline{{r}_{i}}=\frac{{{\displaystyle \sum}}_{j=1}^{{n}_{i}}{r}_{ij}}{{n}_{i}}$ is the mean value of the rank of all observations j in group i; and $\overline{r}=\frac{r}{2}\left(N+1\right)$ is the average value of the sum of all classifications r

_{ij}, i.e., the expected value for the average of all groups.

## 4. Results and Discussion

#### 4.1. Dataset

#### 4.2. Selection of Attributes

#### 4.3. Data Mining

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

- Hengst, M.D.; Mors, J.T. Community of Intelligence: The Secret Behind Intelligence-Led Policing. In Proceedings of the 2012 European Intelligence and Security Informatics Conference, Odense, Denmark, 22–24 August 2012; pp. 22–29. [Google Scholar]
- Castro, Y.; Kim, Y.J. Data mining on road safety: Factor assessment on vehicle accidents using classification models. Int. J. Crashworthiness
**2016**, 21, 104–111. [Google Scholar] [CrossRef] - Kashyap, J.; Chandra, A.; Singh, P. Mining Road Traffic Accident Data to Improve Safety on Road-related Factors for Classification and Prediction of Accident Severity. Int. Res. J. Eng. Technol.
**2016**, 10, 2395–2456. [Google Scholar] - Hussain, S.; Muhammad, L.J.; Ishaq, F.S.; Yakubu, A.; Mohammed, I.A. Performance evaluation of various data mining algorithms on road traffic accident dataset. Smart Innov. Syst. Technol.
**2019**, 106, 67–78. [Google Scholar] [CrossRef] - Kumeda, B.; Zhang, F.; Zhou, F.; Hussain, S.; Almasri, A.; Assefa, M. Classification of road traffic accident data using machine learning Algorithms. In Proceedings of the 2019 IEEE 11th International Conference on Communication Software and Networks (ICCSN), Chongqing, China, 12–15 June 2019; pp. 682–687. [Google Scholar] [CrossRef]
- Chen, Q.; Song, X.; Yamada, H.; Shibasaki, R. Learning deep representation from big and heterogeneous data for traffic accident inference. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI 2016), Phoenix, AZ, USA, 21 February 2016; pp. 338–344. [Google Scholar]
- Yuan, Z.; Zhou, X.; Yang, T. Hetero-ConvLSTM: A deep learning approach to traffic accident prediction on heterogeneous spatio-temporal data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, United Kingdom, 19 August 2018; Volume 18, pp. 984–992. [Google Scholar] [CrossRef]
- Krukowicz, T.; Firląg, K.; Chrobot, P. Spatiotemporal analysis of road crashes with animals in Poland. Sustainability
**2022**, 14, 1253. [Google Scholar] [CrossRef] - Billah, K.; Sharif, H.O.; Dessouky, S. How Gender Affects Motor Vehicle Crashes: A Case Study from San Antonio, Texas. Sustainability
**2022**, 14, 7023. [Google Scholar] [CrossRef] - Saveliev, A.; Lebedeva, V.; Lebedev, I.; Uzdiaev, M. An approach to the automatic construction of a road accident scheme using UAV and deep learning methods. Sensors
**2022**, 22, 4728. [Google Scholar] [CrossRef] - Tajnik, S.; Luin, B. Impact of Driver, Vehicle, and Environment on Rural Road Crash Rate. Sustainability
**2022**, 14, 15744. [Google Scholar] [CrossRef] - Bokaba, T.; Doorsamy, W.; Paul, B.S. Comparative study of machine learning classifiers for modelling road traffic accidents. Appl. Sci.
**2022**, 12, 828. [Google Scholar] [CrossRef] - Islam, M.K.; Gazder, U.; Akter, R.; Arifuzzaman, M. Involvement of Road Users from the Productive Age Group in Traffic Crashes in Saudi Arabia: An Investigative Study Using Statistical and Machine Learning Techniques. Appl. Sci.
**2022**, 12, 6368. [Google Scholar] [CrossRef] - Islam, M.K.; Reza, I.; Gazder, U.; Akter, R.; Arifuzzaman, M.; Rahman, M.M. Predicting Road Crash Severity Using Classifier Models and Crash Hotspots. Appl. Sci.
**2022**, 12, 11354. [Google Scholar] [CrossRef] - Mesquitela, J.; Elvas, L.B.; Ferreira, J.C.; Nunes, L. Data Analytics Process over Road Accidents Data—A Case Study of Lisbon City. ISPRS Int. J. Geo-Inf.
**2022**, 11, 143. [Google Scholar] [CrossRef] - Guido, G.; Shaffiee Haghshenas, S.; Shaffiee Haghshenas, S.; Vitale, A.; Astarita, V.; Park, Y.; Geem, Z.W. Evaluation of Contributing Factors Affecting Number of Vehicles Involved in Crashes Using Machine Learning Techniques in Rural Roads of Cosenza, Italy. Safety
**2022**, 8, 28. [Google Scholar] [CrossRef] - Kim, H.; Kim, J.-T.; Shin, S.; Lee, H.; Lim, J. Prediction of Run-Off Road Crash Severity in South Korea’s Highway through Tree Augmented Naïve Bayes Learning. Appl. Sci.
**2022**, 12, 1120. [Google Scholar] [CrossRef] - Rodionova, M.; Skhvediani, A.; Kudryavtseva, T. Prediction of crash severity as a way of road safety improvement: The case of Saint Petersburg, Russia. Sustainability
**2022**, 14, 9840. [Google Scholar] [CrossRef] - Infante, P.; Jacinto, G.; Afonso, A.; Rego, L.; Nogueira, V.; Quaresma, P.; Saias, J.; Santos, D.; Nogueira, P.; Silva, M. Comparison of statistical and machine-learning models on road traffic accident severity classification. Computers
**2022**, 11, 80. [Google Scholar] [CrossRef] - Goldschmidt, R.; Passos, E.; Bezerra, E. Data Mining, Conceitos Técnicas, Algoritmos, Orientações e Aplicações; Elsevier: Rio de Janeiro, Brasil, 2015. [Google Scholar]
- Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P. Knowledge Discovery and Data Mining: Towards a Unifying Framework. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 82–88. [Google Scholar]
- Hendrickx, T.; Cule, B.; Meysman, P.; Naulaerts, S.; Laukens, K.; Goethals, B. Mining association rules in graphs based on frequent cohesive itemsets. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Ho Chi Minh City, Vietnam, 19–22 May 2015; pp. 637–648. [Google Scholar]
- Agarwal, S. Data mining: Data mining concepts and techniques. In Proceedings of the 2013 International Conference on Machine Intelligence and Research Advancement, Katra, India, 21 December 2013; pp. 203–207. [Google Scholar]
- Zhang, S.; Zhang, C.; Yang, Q. Data preparation for data mining. Appl. Artif. Intell.
**2003**, 17, 375–381. [Google Scholar] [CrossRef] - Mueller, J.P.; Massaron, L. Deep Learning for Dummies; John Wiley & Sons: Hoboken, NJ, USA, 2019. [Google Scholar]
- Berry, M.W.; Mohamed, A.; Yap, B.W. Supervised and Unsupervised Learning for Data Science; Springer: Cham, Switzerland, 2020. [Google Scholar]
- Kuncheva, L.I. Combining Pattern Classifiers: Methods and Algorithms; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
- Sen, P.C.; Hajra, M.; Ghosh, M. Emerging Technology in modelling and graphics. Singap. Springer Singap.
**2020**, 937, 99. [Google Scholar] - Belanche, L.A.; González, F.F. Review and evaluation of feature selection algorithms in synthetic problems. arXiv
**2011**, arXiv:1101.2320. [Google Scholar] - Indrakumari, R.; Poongodi, T.; Singh, K. Introduction to Deep Learning. In Advanced Deep Learning for Engineers and Scientists; Springer: Cham, Switzerland, 2021; pp. 1–22. [Google Scholar]
- Eisenberg, D. The mixed effects of precipitation on traffic crashes. Accid. Anal. Prev.
**2004**, 36, 637–647. [Google Scholar] [CrossRef] - Hayat, R.B.; Debbarh, M.; Antoniou, C.; Hayat, R.B.; Debbarh, M.; Antoniou, C.; Yannis, G. Explaining the road accident risk: Weather effects. Accid. Anal. Prev.
**2013**, 1, 456–465. [Google Scholar] [CrossRef][Green Version] - Tamerius, J.D.; Zhou, X.; Mantilla, R.; Greenfield-Huitt, T. Precipitation effects on motor vehicle crashes vary by space, time, and environmental conditions. Weather. Clim. Soc.
**2016**, 8, 399–407. [Google Scholar] [CrossRef] - Febres, J.D.; Garca-Herrero, S.; Herrera, S.; Gutirrez, J.M.; Lpez-Garca, J.R.; Mariscal, M.A. Influence of seat-belt use on the severity of injury in traffic accidents. Eur. Transp. Res. Rev.
**2020**, 12, 1–12. [Google Scholar] [CrossRef][Green Version] - Musile, G.; Pigaiani, N.; Sorio, D.; Colombari, M.; Bortolotti, F.; Tagliaro, F. Alcohol-associated traffic injuries in Verona territory: A nine-year survey. Med. Sci. Law
**2021**, 61, 7–13. [Google Scholar] [CrossRef] [PubMed] - Song, Y.; Kou, S.; Wang, C. Modeling crash severity by considering risk indicators of driver and roadway: A Bayesian network approach. J. Saf. Res.
**2021**, 76, 64–72. [Google Scholar] [CrossRef] - Martn-delosReyes, L.M.; Martnez-Ruiz, V.; Rivera-Izquierdo, M.; Jimnez-Mejas, E.; Lardelli-Claret, P. Is driving without a valid license associated with an increased risk of causing a road crash? Accid. Anal. Prev.
**2021**, 149, 1–7. [Google Scholar] [CrossRef] - Zhang, Z.; McDonnell, K.T.; Zadok, E.; Mueller, K. Visual correlation analysis of numerical and categorical data on the correlation map. IEEE Trans. Vis. Comput. Graph.
**2015**, 21, 289–303. [Google Scholar] [CrossRef][Green Version] - Bhattacharya, A.; Dunson, D.B. Simplex factor models for multivariate unordered categorical data. J. Am. Stat. Assoc.
**2012**, 107, 362–377. [Google Scholar] [CrossRef][Green Version] - Leon, A.C. Descriptive and Inferential Statistics. Compr. Clin. Psychol.
**1998**, 3, 243–285. [Google Scholar] [CrossRef] - Sun, J. The Microbiome in Health and Disease Preface; Academic Press: Cambridge, MA, USA, 2020; Volume 171, pp. XV–XVI. [Google Scholar]
- Urbanowicz, R.J.; Meeker, M.; La Cava, W.; Olson, R.S.; Moore, J.H. Relief-based feature selection: Introduction and review. J. Biomed. Inform.
**2018**, 85, 189–203. [Google Scholar] [CrossRef] - Robnik-Šikonja, M.; Kononenko, I. Theoretical and empirical analysis of ReliefF and RReliefF. Mach. Learn.
**2003**, 53, 23–69. [Google Scholar] [CrossRef][Green Version] - Marcano-Cedeño, A.; Quintanilla-Domínguez, J.; Cortina-Januchs, M.; Andina, D. Feature selection using sequential forward selection and classification applying artificial metaplasticity neural network. In Proceedings of the IECON 2010—36th Annual Conference on IEEE Industrial Electronics Society, Glendale, AZ, USA, 7 November 2010; pp. 2845–2850. [Google Scholar]
- Molina, L.C.; Belanche, L.; Nebot, À. Feature selection algorithms: A survey and experimental evaluation. In Proceedings of the 2002 IEEE International Conference on Data Mining, Maebashi City, Japan, 9 December 2002; pp. 306–313. [Google Scholar]
- SeguroPorDias. O Congestionamento nas Estradas da Cidade do Porto (Congestion on the Roads of the City of Porto). Available online: https://seguropordias.pt/blog/tr%C3%A2nsito-porto-portugal (accessed on 29 December 2022).
- Ren, H.; Song, Y.; Wang, J.; Hu, Y.; Lei, J. A Deep Learning Approach to the Citywide Traffic Accident Risk Prediction. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 3346–3351. [Google Scholar] [CrossRef]

**Figure 1.**Box plots of the frequency of accidents that occurred at the different time intervals (according to the TIME field of Table 3).

**Figure 4.**Number of monthly accidents before COVID-19 (2019) and during the COVID-19 pandemic (2020 and 2021).

**Figure 5.**Correlations of Cramer V and Kruskal-Wallis test for the most relevant pairs of variables.

Weather Conditions: | Precipitation, Temperature; Wind Force |
---|---|

Human behavior: | Seat belt use, cell phone use, alcohol consumption calendar |

Road conditions: | Road networks, luminosity, road identification, traffic volume |

Values of Cramer V Coefficient, ${\varnothing}_{\mathrm{c}}$ | Interpretation |
---|---|

[0.25; 1.00] | Very Strong |

[0.15; 0.25] | Strong |

[0.10; 0.15] | Moderated |

[0.05; 0.10] | Weak |

[0; 0.05] | Very Weak |

Attribute | Type/Format of Data |
---|---|

Identification of accident | Serial number |

Date | dd/mm/yyyy |

Time | {Morning, morning work, morning rush hours, lunch break, afternoon work, afternoon rush hours, night} |

Type of local | {Motorway, itineraries or national roads, village roads} |

Localization | {Urban location, non-urban location} |

Type of accident | {Damage only, with injured} |

Day of the week | {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday} |

Holiday | Boolean |

Alcohol | Numerical with 2 decimal digits (g/L) |

Administrative offenses | Numerical with 2 decimal digits |

Weather conditions | {Good weather, fog, rain, strong wind, hail, smoke cloud} |

**Table 4.**Relevance of features for the creation of predictive models obtained with RBA and SBS for incidents that occur on motorways.

Motorways | Considered Relevant by Both Algorithms | Considered Irrelevant by Both Algorithms |
---|---|---|

RBA & SBS | Rain, morning work, afternoon rush hours, Friday, Saturday, August, February | Sunday |

Classes | Range of the number of accidents |
---|---|

Low Risk | <1.5 |

Medium Risk | ≥1.5 & <2.5 |

High Risk | ≥2.5 |

Algorithm | MAE (Distance) | Accuracy (%) |
---|---|---|

kNN | 0.74 | 56% |

Linear Regression | 0.63 | 57% |

Lasso Regression | 0.60 | 54% |

Ridge Regression | 0.61 | 52% |

Decision Tree | 0.69 | 56% |

Neural Network | 0.57 | 89% |

Algorithm | MAE (Distance) | Accuracy (%) |
---|---|---|

kNN | 0.30 | 81% |

Linear Regression | 0.27 | 86% |

Lasso Regression | 0.28 | 86% |

Ridge Regression | 0.28 | 80% |

Decision Tree | 0.31 | 76% |

Neural Network | 0.55 | 87% |

Algorithm | MAE (Distance) | Accuracy (%) |
---|---|---|

kNN | 0.93 | 48% |

Linear Regression | 0.85 | 50% |

Lasso Regression | 0.80 | 51% |

Ridge Regression | 0.79 | 50% |

Decision Tree | 0.91 | 55% |

Neural Network | 0.52 | 88% |

Algorithm (Regression with Neural Network) | MAE (Distance) | Accuracy (%) |
---|---|---|

General model | 0.49 | 88% |

Motorways (9.3% of total accidents) | 0.57 | 89% |

Itineraries or national roads (30% of total accidents) | 0.55 | 87% |

Village roads (60.7% of total accidents) | 0.52 | 88% |

Type of Road | Percentage of Accidents involving Injuries or Deaths | Nº of Injured/Dead per Accident |
---|---|---|

Motorway | 25.1% | 1.7 |

Village roads | 17.3% | 1.2 |

Itineraries or national roads | 29.8% | 1.42 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Dias, D.; Silva, J.S.; Bernardino, A.
The Prediction of Road-Accident Risk through Data Mining: A Case Study from Setubal, Portugal. *Informatics* **2023**, *10*, 17.
https://doi.org/10.3390/informatics10010017

**AMA Style**

Dias D, Silva JS, Bernardino A.
The Prediction of Road-Accident Risk through Data Mining: A Case Study from Setubal, Portugal. *Informatics*. 2023; 10(1):17.
https://doi.org/10.3390/informatics10010017

**Chicago/Turabian Style**

Dias, David, José Silvestre Silva, and Alexandre Bernardino.
2023. "The Prediction of Road-Accident Risk through Data Mining: A Case Study from Setubal, Portugal" *Informatics* 10, no. 1: 17.
https://doi.org/10.3390/informatics10010017