Comparing Resampling Algorithms and Classifiers for Modeling Traffic Risk Prediction
Abstract
1. Introduction
2. Materials and Methods
2.1. Data
2.1.1. Data Integrity and Accuracy
2.1.2. Feature Variables
2.1.3. Road Risk Level
2.2. Data Balancing
2.3. Classifier Selection
2.3.1. KNN Classifier
2.3.2. SVM Classifier
2.3.3. Ensemble Classifier
2.3.4. Bayesian Classifier
2.4. Performance Evaluation Measures
3. Results and Discussion
- Analysis of the correlations among 18 feature variables, retaining the necessary feature variables to construct a dataset.
- Comparison of the improvement effects of 11 different resampling algorithms on model training and identification of the best resampling algorithm.
- Comparison of the impact of seven classification algorithms on model performance and identification of the best classification algorithm.
- Use of the best algorithms and classifiers to analyze the contribution of various feature variables in model training and identification of the key risk factors that affect road risk.
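To make the resampling step concrete, the following is a minimal sketch of the interpolation idea behind SMOTE-style over-sampling: a synthetic minority sample is placed on the line segment between a minority sample and one of its nearest minority neighbours. It is not the implementation used in the study; the function name `smote_samples`, the parameter names, and the toy points are assumptions made for illustration.

```python
import math
import random


def smote_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority points.

    Hypothetical helper for illustration only, not the study's code.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class: the four corners of the unit square (made-up data)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote_samples(minority, n_new=6, k=2, seed=42)
print(synthetic)
```

Because every synthetic point is a convex combination of two existing minority points, it stays inside the region already covered by the minority class. Combined methods such as SMOTEENN additionally apply a cleaning step after over-sampling, which this sketch does not cover.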
3.1. Feature Analysis and Dimensionality Reduction
3.2. Imbalanced Versus Balanced Datasets
3.3. Classifier Performance
- Evaluating a model solely on accuracy, precision, sensitivity, and specificity does not identify a suitable model. For example, the model trained by the SVM algorithm on the original dataset has an accuracy of 81%; for L_1, its precision is 81%, its sensitivity is 100%, and its specificity is 0. For L_2 and L_3, however, the precision and sensitivity are both 0 while the specificity is 100%. Such a model assigns every road section to L_1 and therefore cannot meet the needs of road section risk classification.
- The F1 indicator combines precision and sensitivity into a single value and evaluates the classification effect for a single category. F1(1) shows that models trained on the original dataset and the under-sampled datasets are biased towards the majority class (L_1). F1(2) and F1(3) show that using an ensemble classifier (XGBoost or random forest) effectively improves the classification of the minority categories (L_2 and L_3).
- Not all data balancing algorithms improve the classification performance. Moreover, whether the data are balanced has little effect on the ranking of the classifiers by performance.
- All of the top 10 method combinations by the Score index use an ensemble classifier. Among them, the model trained with SMOTEENN and random forest has the highest score.
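The degenerate SVM case in the first point can be reproduced with a small worked example. The sketch below computes per-class precision, sensitivity, specificity, and F1 from a three-class confusion matrix. The counts are hypothetical, chosen only so that 81% of the samples belong to L_1 and the classifier predicts L_1 for everything; the function names are likewise assumptions.

```python
def accuracy(cm):
    """Overall accuracy; cm[i][j] counts true class i predicted as class j."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def per_class_metrics(cm, k):
    """One-vs-rest metrics for class k; None marks an undefined ratio."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(cm[i][k] for i in range(n)) - tp
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else None
    sensitivity = tp / (tp + fn) if tp + fn else None
    specificity = tn / (tn + fp) if tn + fp else None
    if precision is None or sensitivity is None or precision + sensitivity == 0:
        f1 = None
    else:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts: 810 L_1, 110 L_2, 80 L_3, all predicted as L_1
cm = [
    [810, 0, 0],
    [110, 0, 0],
    [80, 0, 0],
]
acc = accuracy(cm)
m_l1 = per_class_metrics(cm, 0)
m_l2 = per_class_metrics(cm, 1)
print(acc, m_l1, m_l2)
```

In this example the accuracy is 0.81 and the L_1 sensitivity is 1.0, yet the L_1 specificity is 0 and no L_2 or L_3 section is ever detected. The L_2 precision is undefined because no sample is predicted as L_2, which is one way to read the "/" entries in Appendix A.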
3.4. Feature Importance and Interpretation
- When RICU is greater than 30 and GAE is less than 0, the partial dependence value for traffic risk level L_3 is greater than 0.32. Conversely, when RICU is less than 5 and GAE is greater than 0.02, the partial dependence value for traffic risk level L_1 is greater than 0.65. This shows that the model established in this paper can effectively identify two types of typical high-risk road sections: downhill small-radius curves and downhill continuous curved road sections.
- When RICU is greater than 20 and PCI is less than 98, the partial dependence value for traffic risk level L_3 is greater than 0.32. However, when RICU is less than 5 and PCI is less than 98, the partial dependence value for traffic risk level L_1 is greater than 0.64. This shows that, in the model established in this paper, the pavement condition has a greater impact on driving safety on small-radius or continuous curved road sections and a lesser impact on straight sections.
- When GAE is less than 0 and PCI is less than 95, the partial dependence value for traffic risk level L_3 is greater than 0.30. However, when GAE is greater than 0.02 and PCI is greater than 0.4, the partial dependence value for traffic risk level L_1 is greater than 0.5. This is broadly consistent with the two rules above. It also shows that, in the model established in this paper, downhill sections place higher demands on pavement performance, and the steeper the uphill slope, the safer the road section.
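The partial dependence values quoted above can be read as averages of the model's predicted class probability when two features are fixed at chosen values and all other features keep their observed values. A minimal sketch of that computation follows. The model `toy_predict_proba`, its thresholds, and the sample rows are hypothetical stand-ins, not the random forest trained in the paper.

```python
def partial_dependence_2d(predict_proba, rows, feat_a, grid_a, feat_b, grid_b,
                          class_index):
    """Average predicted probability of one class over a 2-D feature grid."""
    surface = []
    for a in grid_a:
        line = []
        for b in grid_b:
            total = 0.0
            for row in rows:
                modified = dict(row)   # copy so the data stay untouched
                modified[feat_a] = a   # force both features to grid values
                modified[feat_b] = b
                total += predict_proba(modified)[class_index]
            line.append(total / len(rows))
        surface.append(line)
    return surface


def toy_predict_proba(row):
    """Hypothetical model returning probabilities for [L_1, L_2, L_3]."""
    p3 = 0.1
    if row["RICU"] > 30:
        p3 += 0.2
    if row["GAE"] < 0:
        p3 += 0.1
    if row["PCI"] < 95:
        p3 += 0.1
    p2 = 0.1
    return [1.0 - p2 - p3, p2, p3]


# Made-up observations; only PCI varies freely in the averaging step
rows = [
    {"RICU": 2.0, "GAE": 0.03, "PCI": 90.0},
    {"RICU": 40.0, "GAE": -0.02, "PCI": 99.0},
]
surface = partial_dependence_2d(
    toy_predict_proba, rows, "RICU", [2.0, 35.0], "GAE", [-0.01, 0.03],
    class_index=2,
)
print(surface)
```

Each cell of `surface` corresponds to one (RICU, GAE) pair. A statement such as "the partial dependence value for L_3 is greater than 0.32" refers to cells of this kind computed with the trained model.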
4. Conclusions
- Horizontal alignment, vertical alignment, pavement, and weather play a significant role in predicting the risk potential of expressways. This study provided 18 feature variables from 5 different perspectives, and 13 feature variables were retained after dimensionality reduction. The contribution of the feature variables to the model was ranked using three feature importance indicators: Gini importance, permutation importance, and SHAP importance. The results showed that 11 of these feature variables contribute strongly to the prediction of the risk potential.
- The ensemble classifier performed well in processing traffic accident data, and the addition of a resampling algorithm further enhanced the classifier’s performance. By comparing the prediction results of XGBoost and the random forest algorithm, it was demonstrated that the model performance index Score provided in this research can effectively assess the performance of the three-level traffic risk prediction model.
- The combination of SMOTEENN and the random forest algorithm produced the best model for predicting the highway traffic risk potential. The three most representative feature variables, RICU, GAE, and PCI, were analyzed using the partial dependence plot, and the results showed that the classification principle of the established traffic crash risk prediction model was consistent with the objective risk effect law.
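Of the three importance indicators named in the first conclusion, permutation importance is the simplest to illustrate: shuffle one feature's values across samples and measure how much the model's accuracy drops. The sketch below assumes a hypothetical rule-based classifier `toy_predict` and made-up data. It only shows the mechanism and does not reproduce the paper's ranking.

```python
import random


def permutation_importance(predict, rows, labels, feature, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)

    def accuracy(data):
        hits = sum(predict(r) == y for r, y in zip(data, labels))
        return hits / len(labels)

    baseline = accuracy(rows)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild the rows with only the chosen feature permuted
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(permuted))
    return sum(drops) / n_repeats


def toy_predict(row):
    """Hypothetical classifier: uses RICU only and ignores PCI."""
    return "L_3" if row["RICU"] > 20 else "L_1"


# Made-up samples and labels for illustration
rows = [
    {"RICU": 2.0, "PCI": 99.0},
    {"RICU": 3.0, "PCI": 91.0},
    {"RICU": 4.0, "PCI": 97.0},
    {"RICU": 25.0, "PCI": 93.0},
    {"RICU": 30.0, "PCI": 98.0},
    {"RICU": 35.0, "PCI": 90.0},
]
labels = ["L_1", "L_1", "L_1", "L_3", "L_3", "L_3"]

imp_ricu = permutation_importance(toy_predict, rows, labels, "RICU")
imp_pci = permutation_importance(toy_predict, rows, labels, "PCI")
print(imp_ricu, imp_pci)
```

A feature the model never uses, such as PCI in this toy classifier, gets an importance of exactly zero, while shuffling RICU degrades accuracy.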
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Dataset | Fine KNN Mean | Fine KNN S.D | Weighted KNN Mean | Weighted KNN S.D | Cubic SVM Mean | Cubic SVM S.D | Fine Gaussian SVM Mean | Fine Gaussian SVM S.D | XGBoost Mean | XGBoost S.D | Random Forest Mean | Random Forest S.D | AODE Mean | AODE S.D |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ACC | ||||||||||||||
Original | 0.70 | 0.02 | 0.76 | 0.01 | 0.81 | 0.00 | 0.81 | 0.00 | 0.78 | 0.01 | 0.77 | 0.01 | 0.50 | 0.04 |
PG | 0.41 | 0.02 | 0.34 | 0.01 | 0.27 | 0.03 | 0.33 | 0.00 | 0.15 | 0.02 | 0.19 | 0.01 | 0.35 | 0.05 |
RUS | 0.38 | 0.02 | 0.34 | 0.05 | 0.25 | 0.11 | 0.26 | 0.09 | 0.43 | 0.02 | 0.44 | 0.02 | 0.35 | 0.04 |
ENN | 0.76 | 0.01 | 0.80 | 0.00 | 0.81 | 0.00 | 0.81 | 0.00 | 0.80 | 0.01 | 0.80 | 0.01 | 0.44 | 0.07 |
ALLKNN | 0.72 | 0.01 | 0.80 | 0.00 | 0.81 | 0.00 | 0.81 | 0.00 | 0.78 | 0.00 | 0.79 | 0.01 | 0.42 | 0.05 |
ROS | 0.70 | 0.02 | 0.54 | 0.06 | 0.31 | 0.07 | 0.37 | 0.07 | 0.75 | 0.01 | 0.74 | 0.02 | 0.35 | 0.05 |
SMOTE | 0.63 | 0.03 | 0.52 | 0.07 | 0.32 | 0.08 | 0.31 | 0.08 | 0.75 | 0.01 | 0.75 | 0.02 | 0.35 | 0.05 |
ADASYN | 0.62 | 0.04 | 0.53 | 0.06 | 0.32 | 0.08 | 0.30 | 0.08 | 0.76 | 0.01 | 0.76 | 0.01 | 0.35 | 0.04 |
Borderline-SMOTE | 0.64 | 0.03 | 0.55 | 0.05 | 0.38 | 0.06 | 0.37 | 0.06 | 0.76 | 0.01 | 0.75 | 0.02 | 0.37 | 0.05 |
SMOTENC | 0.63 | 0.03 | 0.53 | 0.06 | 0.32 | 0.07 | 0.30 | 0.08 | 0.75 | 0.02 | 0.75 | 0.02 | 0.35 | 0.04 |
SMOTEENN | 0.38 | 0.03 | 0.30 | 0.04 | 0.19 | 0.04 | 0.16 | 0.04 | 0.54 | 0.04 | 0.57 | 0.03 | 0.22 | 0.04 |
SMOTE-Tomek Links | 0.62 | 0.04 | 0.51 | 0.06 | 0.32 | 0.07 | 0.30 | 0.07 | 0.75 | 0.01 | 0.75 | 0.02 | 0.35 | 0.04 |
PPV1 | ||||||||||||||
Original | 0.82 | 0.01 | 0.83 | 0.01 | 0.81 | 0.00 | 0.81 | 0.00 | 0.85 | 0.00 | 0.85 | 0.00 | 0.83 | 0.01 |
PG | 0.81 | 0.01 | 0.84 | 0.02 | 0.81 | 0.03 | 0.86 | 0.01 | 0.86 | 0.06 | 0.90 | 0.01 | 0.85 | 0.03 |
RUS | 0.82 | 0.02 | 0.80 | 0.02 | 0.65 | 0.07 | 0.64 | 0.15 | 0.89 | 0.01 | 0.88 | 0.01 | 0.85 | 0.01 |
ENN | 0.82 | 0.00 | 0.82 | 0.00 | 0.81 | 0.00 | 0.81 | 0.00 | 0.84 | 0.00 | 0.84 | 0.00 | 0.83 | 0.01 |
ALLKNN | 0.82 | 0.00 | 0.83 | 0.00 | 0.81 | 0.00 | 0.81 | 0.00 | 0.84 | 0.01 | 0.84 | 0.01 | 0.83 | 0.01 |
ROS | 0.82 | 0.01 | 0.82 | 0.03 | 0.75 | 0.05 | 0.76 | 0.06 | 0.86 | 0.01 | 0.86 | 0.01 | 0.85 | 0.01 |
SMOTE | 0.82 | 0.01 | 0.81 | 0.03 | 0.75 | 0.04 | 0.74 | 0.05 | 0.85 | 0.00 | 0.85 | 0.01 | 0.84 | 0.01 |
ADASYN | 0.83 | 0.01 | 0.82 | 0.02 | 0.74 | 0.05 | 0.72 | 0.06 | 0.85 | 0.00 | 0.85 | 0.01 | 0.85 | 0.01 |
Borderline-SMOTE | 0.82 | 0.01 | 0.82 | 0.02 | 0.80 | 0.02 | 0.79 | 0.01 | 0.85 | 0.01 | 0.85 | 0.00 | 0.83 | 0.01 |
SMOTENC | 0.83 | 0.01 | 0.81 | 0.02 | 0.74 | 0.05 | 0.72 | 0.06 | 0.85 | 0.01 | 0.86 | 0.01 | 0.85 | 0.01 |
SMOTEENN | 0.81 | 0.01 | 0.77 | 0.03 | 0.69 | 0.08 | 0.78 | 0.08 | 0.85 | 0.01 | 0.87 | 0.01 | 0.86 | 0.03 |
SMOTE-Tomek Links | 0.82 | 0.01 | 0.81 | 0.02 | 0.74 | 0.05 | 0.73 | 0.06 | 0.85 | 0.01 | 0.85 | 0.01 | 0.85 | 0.01 |
PPV2 | ||||||||||||||
Original | 0.14 | 0.03 | 0.11 | 0.04 | / | / | / | / | 0.26 | 0.09 | 0.26 | 0.05 | 0.30 | 0.17 |
PG | 0.10 | 0.01 | 0.10 | 0.02 | / | / | 0.45 | 0.20 | 0.12 | 0.01 | 0.11 | 0.01 | 0.22 | 0.14 |
RUS | 0.09 | 0.02 | 0.08 | 0.02 | / | / | 0.25 | 0.17 | 0.12 | 0.02 | 0.12 | 0.01 | 0.07 | 0.04 |
ENN | / | / | / | / | / | / | / | / | 0.05 | 0.04 | 0.07 | 0.06 | / | / |
ALLKNN | 0.19 | 0.05 | / | / | / | / | / | / | 0.13 | 0.06 | 0.24 | 0.08 | / | / |
ROS | 0.14 | 0.03 | 0.15 | 0.04 | 0.04 | 0.01 | 0.06 | 0.03 | 0.25 | 0.07 | 0.24 | 0.05 | 0.32 | 0.18 |
SMOTE | 0.12 | 0.02 | 0.12 | 0.03 | 0.04 | 0.02 | 0.04 | 0.02 | 0.23 | 0.08 | 0.27 | 0.07 | 0.17 | 0.12 |
ADASYN | 0.13 | 0.03 | 0.13 | 0.03 | 0.04 | 0.01 | 0.04 | 0.01 | 0.26 | 0.09 | 0.27 | 0.07 | 0.29 | 0.18 |
Borderline-SMOTE | 0.13 | 0.02 | 0.12 | 0.03 | 0.06 | 0.02 | 0.04 | 0.02 | 0.24 | 0.08 | 0.22 | 0.06 | 0.15 | 0.12 |
SMOTENC | 0.13 | 0.03 | 0.13 | 0.03 | 0.04 | 0.02 | 0.04 | 0.01 | 0.26 | 0.10 | 0.27 | 0.08 | 0.26 | 0.17 |
SMOTEENN | 0.09 | 0.01 | 0.09 | 0.01 | 0.08 | 0.02 | 0.09 | 0.02 | 0.14 | 0.02 | 0.18 | 0.02 | 0.19 | 0.10 |
SMOTE-Tomek Links | 0.14 | 0.02 | 0.13 | 0.02 | 0.04 | 0.02 | 0.04 | 0.01 | 0.22 | 0.05 | 0.22 | 0.07 | 0.15 | 0.09 |
PPV3 | ||||||||||||||
Original | 0.15 | 0.07 | 0.16 | 0.09 | / | / | / | / | 0.51 | 0.04 | 0.48 | 0.09 | 0.09 | 0.01 |
PG | 0.09 | 0.02 | 0.08 | 0.01 | 0.08 | 0.01 | 0.08 | 0.01 | 0.09 | 0.01 | 0.11 | 0.01 | 0.08 | 0.00 |
RUS | 0.07 | 0.02 | 0.07 | 0.01 | 0.07 | 0.01 | 0.07 | 0.01 | 0.16 | 0.01 | 0.17 | 0.01 | 0.08 | 0.00 |
ENN | 0.12 | 0.04 | 0.14 | 0.08 | / | / | / | / | 0.36 | 0.02 | 0.34 | 0.02 | 0.09 | 0.01 |
ALLKNN | 0.09 | 0.03 | 0.15 | 0.08 | / | / | / | / | 0.32 | 0.03 | 0.32 | 0.02 | 0.09 | 0.00 |
ROS | 0.15 | 0.07 | 0.15 | 0.05 | 0.11 | 0.01 | 0.11 | 0.01 | 0.41 | 0.05 | 0.46 | 0.09 | 0.08 | 0.00 |
SMOTE | 0.14 | 0.06 | 0.15 | 0.06 | 0.12 | 0.01 | 0.11 | 0.01 | 0.44 | 0.05 | 0.47 | 0.06 | 0.08 | 0.00 |
ADASYN | 0.16 | 0.06 | 0.15 | 0.06 | 0.12 | 0.01 | 0.11 | 0.00 | 0.43 | 0.04 | 0.48 | 0.08 | 0.08 | 0.00 |
Borderline-SMOTE | 0.14 | 0.06 | 0.16 | 0.05 | 0.12 | 0.01 | 0.10 | 0.01 | 0.48 | 0.08 | 0.45 | 0.06 | 0.08 | 0.01 |
SMOTENC | 0.16 | 0.05 | 0.14 | 0.06 | 0.12 | 0.01 | 0.09 | 0.01 | 0.45 | 0.04 | 0.45 | 0.06 | 0.08 | 0.00 |
SMOTEENN | 0.08 | 0.02 | 0.08 | 0.02 | 0.10 | 0.00 | 0.08 | 0.01 | 0.21 | 0.01 | 0.27 | 0.03 | 0.08 | 0.00 |
SMOTE-Tomek Links | 0.13 | 0.05 | 0.12 | 0.05 | 0.12 | 0.01 | 0.09 | 0.01 | 0.45 | 0.06 | 0.48 | 0.07 | 0.08 | 0.00 |
TPR1 | ||||||||||||||
Original | 0.83 | 0.02 | 0.92 | 0.02 | 1.00 | 0.00 | 1.00 | 0.00 | 0.91 | 0.02 | 0.89 | 0.01 | 0.56 | 0.06 |
PG | 0.44 | 0.02 | 0.35 | 0.02 | 0.25 | 0.05 | 0.34 | 0.00 | 0.07 | 0.01 | 0.12 | 0.01 | 0.36 | 0.06 |
RUS | 0.40 | 0.03 | 0.36 | 0.07 | 0.25 | 0.15 | 0.26 | 0.12 | 0.43 | 0.03 | 0.45 | 0.03 | 0.38 | 0.05 |
ENN | 0.92 | 0.01 | 0.98 | 0.01 | 1.00 | 0.00 | 1.00 | 0.00 | 0.95 | 0.01 | 0.95 | 0.01 | 0.49 | 0.09 |
ALLKNN | 0.86 | 0.02 | 0.97 | 0.01 | 1.00 | 0.00 | 1.00 | 0.00 | 0.93 | 0.01 | 0.93 | 0.01 | 0.46 | 0.06 |
ROS | 0.83 | 0.02 | 0.61 | 0.07 | 0.32 | 0.09 | 0.40 | 0.09 | 0.86 | 0.02 | 0.85 | 0.03 | 0.37 | 0.06 |
SMOTE | 0.73 | 0.04 | 0.59 | 0.08 | 0.33 | 0.10 | 0.32 | 0.10 | 0.88 | 0.02 | 0.87 | 0.02 | 0.37 | 0.06 |
ADASYN | 0.72 | 0.04 | 0.59 | 0.07 | 0.33 | 0.10 | 0.31 | 0.10 | 0.88 | 0.02 | 0.88 | 0.02 | 0.37 | 0.05 |
Borderline-SMOTE | 0.74 | 0.04 | 0.62 | 0.06 | 0.41 | 0.08 | 0.40 | 0.08 | 0.88 | 0.02 | 0.87 | 0.02 | 0.40 | 0.06 |
SMOTENC | 0.73 | 0.04 | 0.59 | 0.07 | 0.32 | 0.09 | 0.31 | 0.10 | 0.88 | 0.02 | 0.86 | 0.02 | 0.37 | 0.05 |
SMOTEENN | 0.41 | 0.04 | 0.29 | 0.04 | 0.14 | 0.06 | 0.11 | 0.05 | 0.58 | 0.06 | 0.60 | 0.05 | 0.18 | 0.06 |
SMOTE-Tomek Links | 0.72 | 0.04 | 0.57 | 0.07 | 0.32 | 0.09 | 0.31 | 0.09 | 0.87 | 0.02 | 0.86 | 0.02 | 0.37 | 0.05 |
TPR2 | ||||||||||||||
Original | 0.13 | 0.02 | 0.06 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.12 | 0.02 | 0.17 | 0.02 | 0.13 | 0.08 |
PG | 0.24 | 0.03 | 0.25 | 0.05 | 0.12 | 0.05 | 0.07 | 0.03 | 0.38 | 0.04 | 0.37 | 0.02 | 0.06 | 0.04 |
RUS | 0.24 | 0.05 | 0.21 | 0.08 | 0.17 | 0.07 | 0.13 | 0.05 | 0.33 | 0.03 | 0.34 | 0.03 | 0.01 | 0.01 |
ENN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.00 | 0.00 |
ALLKNN | 0.04 | 0.01 | 0.01 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.01 | 0.06 | 0.02 | 0.00 | 0.00 |
ROS | 0.13 | 0.02 | 0.27 | 0.06 | 0.17 | 0.06 | 0.13 | 0.06 | 0.20 | 0.02 | 0.23 | 0.03 | 0.02 | 0.01 |
SMOTE | 0.17 | 0.02 | 0.23 | 0.03 | 0.18 | 0.06 | 0.17 | 0.06 | 0.15 | 0.01 | 0.21 | 0.03 | 0.02 | 0.01 |
ADASYN | 0.19 | 0.03 | 0.25 | 0.04 | 0.18 | 0.06 | 0.16 | 0.06 | 0.16 | 0.01 | 0.20 | 0.03 | 0.02 | 0.01 |
Borderline-SMOTE | 0.18 | 0.02 | 0.22 | 0.03 | 0.16 | 0.05 | 0.14 | 0.05 | 0.15 | 0.02 | 0.17 | 0.01 | 0.01 | 0.01 |
SMOTENC | 0.19 | 0.03 | 0.26 | 0.04 | 0.18 | 0.06 | 0.17 | 0.06 | 0.14 | 0.02 | 0.22 | 0.02 | 0.02 | 0.01 |
SMOTEENN | 0.29 | 0.02 | 0.31 | 0.03 | 0.29 | 0.08 | 0.30 | 0.08 | 0.31 | 0.03 | 0.41 | 0.04 | 0.20 | 0.06 |
SMOTE-Tomek Links | 0.21 | 0.03 | 0.27 | 0.04 | 0.18 | 0.06 | 0.16 | 0.06 | 0.16 | 0.02 | 0.18 | 0.02 | 0.02 | 0.01 |
TPR3 | ||||||||||||||
Original | 0.15 | 0.09 | 0.14 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 0.37 | 0.05 | 0.34 | 0.06 | 0.53 | 0.08 |
PG | 0.36 | 0.09 | 0.45 | 0.09 | 0.76 | 0.06 | 0.73 | 0.06 | 0.78 | 0.05 | 0.75 | 0.05 | 0.80 | 0.04 |
RUS | 0.35 | 0.09 | 0.37 | 0.08 | 0.47 | 0.12 | 0.48 | 0.11 | 0.63 | 0.06 | 0.63 | 0.07 | 0.76 | 0.05 |
ENN | 0.17 | 0.08 | 0.14 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 0.38 | 0.05 | 0.37 | 0.05 | 0.68 | 0.09 |
ALLKNN | 0.18 | 0.08 | 0.14 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 0.35 | 0.06 | 0.37 | 0.05 | 0.75 | 0.05 |
ROS | 0.15 | 0.09 | 0.26 | 0.07 | 0.48 | 0.07 | 0.51 | 0.08 | 0.40 | 0.07 | 0.40 | 0.06 | 0.77 | 0.06 |
SMOTE | 0.20 | 0.09 | 0.25 | 0.09 | 0.49 | 0.07 | 0.48 | 0.08 | 0.36 | 0.07 | 0.38 | 0.05 | 0.75 | 0.06 |
ADASYN | 0.23 | 0.08 | 0.26 | 0.08 | 0.49 | 0.07 | 0.48 | 0.08 | 0.36 | 0.06 | 0.38 | 0.06 | 0.77 | 0.06 |
Borderline-SMOTE | 0.19 | 0.08 | 0.23 | 0.07 | 0.47 | 0.08 | 0.46 | 0.10 | 0.39 | 0.07 | 0.34 | 0.06 | 0.72 | 0.06 |
SMOTENC | 0.22 | 0.07 | 0.24 | 0.08 | 0.51 | 0.08 | 0.47 | 0.10 | 0.36 | 0.05 | 0.38 | 0.06 | 0.76 | 0.05 |
SMOTEENN | 0.27 | 0.08 | 0.34 | 0.09 | 0.57 | 0.09 | 0.54 | 0.11 | 0.47 | 0.04 | 0.53 | 0.06 | 0.75 | 0.06 |
SMOTE-Tomek Links | 0.19 | 0.08 | 0.23 | 0.08 | 0.51 | 0.08 | 0.48 | 0.10 | 0.39 | 0.06 | 0.40 | 0.06 | 0.76 | 0.05 |
TNR1 | ||||||||||||||
Original | 0.23 | 0.04 | 0.15 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 0.04 | 0.30 | 0.03 | 0.49 | 0.06 |
PG | 0.54 | 0.05 | 0.70 | 0.04 | 0.79 | 0.02 | 0.75 | 0.03 | 0.96 | 0.01 | 0.94 | 0.01 | 0.73 | 0.05 |
RUS | 0.62 | 0.05 | 0.64 | 0.03 | 0.72 | 0.11 | 0.76 | 0.06 | 0.76 | 0.02 | 0.73 | 0.04 | 0.70 | 0.04 |
ENN | 0.12 | 0.03 | 0.09 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.21 | 0.02 | 0.21 | 0.03 | 0.56 | 0.09 |
ALLKNN | 0.17 | 0.03 | 0.10 | 0.03 | 0.01 | 0.01 | 0.00 | 0.00 | 0.21 | 0.03 | 0.24 | 0.03 | 0.60 | 0.06 |
ROS | 0.23 | 0.04 | 0.45 | 0.05 | 0.64 | 0.05 | 0.60 | 0.05 | 0.38 | 0.05 | 0.39 | 0.05 | 0.72 | 0.04 |
SMOTE | 0.33 | 0.04 | 0.43 | 0.04 | 0.63 | 0.06 | 0.63 | 0.06 | 0.31 | 0.04 | 0.35 | 0.04 | 0.70 | 0.04 |
ADASYN | 0.35 | 0.05 | 0.45 | 0.05 | 0.62 | 0.07 | 0.63 | 0.06 | 0.31 | 0.03 | 0.34 | 0.04 | 0.72 | 0.04 |
Borderline-SMOTE | 0.31 | 0.03 | 0.41 | 0.03 | 0.57 | 0.06 | 0.57 | 0.05 | 0.31 | 0.04 | 0.32 | 0.03 | 0.65 | 0.04 |
SMOTENC | 0.34 | 0.04 | 0.44 | 0.04 | 0.62 | 0.06 | 0.61 | 0.06 | 0.31 | 0.04 | 0.37 | 0.04 | 0.71 | 0.04 |
SMOTEENN | 0.58 | 0.02 | 0.66 | 0.05 | 0.83 | 0.05 | 0.86 | 0.06 | 0.56 | 0.04 | 0.60 | 0.05 | 0.83 | 0.06 |
SMOTE-Tomek Links | 0.34 | 0.03 | 0.45 | 0.04 | 0.62 | 0.06 | 0.61 | 0.06 | 0.32 | 0.04 | 0.34 | 0.04 | 0.71 | 0.04 |
TNR2 | ||||||||||||||
Original | 0.88 | 0.01 | 0.94 | 0.01 | 1.00 | 0.00 | 1.00 | 0.00 | 0.92 | 0.02 | 0.92 | 0.01 | 0.90 | 0.04 |
PG | 0.68 | 0.01 | 0.67 | 0.02 | 0.83 | 0.07 | 0.91 | 0.07 | 0.59 | 0.03 | 0.54 | 0.03 | 0.97 | 0.02 |
RUS | 0.68 | 0.03 | 0.64 | 0.07 | 0.63 | 0.12 | 0.67 | 0.12 | 0.64 | 0.03 | 0.64 | 0.01 | 0.94 | 0.04 |
ENN | 0.99 | 0.00 | 0.99 | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.99 | 0.01 | 0.99 | 0.01 | 1.00 | 0.00 |
ALLKNN | 0.97 | 0.01 | 0.99 | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 | 0.98 | 0.01 | 0.97 | 0.01 | 0.99 | 0.00 |
ROS | 0.88 | 0.01 | 0.72 | 0.06 | 0.57 | 0.13 | 0.68 | 0.14 | 0.88 | 0.03 | 0.88 | 0.03 | 0.95 | 0.04 |
SMOTE | 0.80 | 0.03 | 0.70 | 0.06 | 0.57 | 0.12 | 0.57 | 0.12 | 0.90 | 0.02 | 0.89 | 0.02 | 0.93 | 0.04 |
ADASYN | 0.80 | 0.03 | 0.71 | 0.05 | 0.57 | 0.12 | 0.56 | 0.12 | 0.90 | 0.02 | 0.90 | 0.02 | 0.94 | 0.04 |
Borderline-SMOTE | 0.81 | 0.03 | 0.71 | 0.06 | 0.62 | 0.09 | 0.63 | 0.09 | 0.90 | 0.02 | 0.89 | 0.02 | 0.93 | 0.04 |
SMOTENC | 0.81 | 0.03 | 0.72 | 0.05 | 0.57 | 0.12 | 0.57 | 0.12 | 0.90 | 0.02 | 0.89 | 0.02 | 0.94 | 0.04 |
SMOTEENN | 0.60 | 0.03 | 0.53 | 0.04 | 0.50 | 0.11 | 0.50 | 0.11 | 0.69 | 0.06 | 0.70 | 0.05 | 0.75 | 0.06 |
SMOTE-Tomek Links | 0.80 | 0.03 | 0.70 | 0.05 | 0.58 | 0.12 | 0.58 | 0.12 | 0.90 | 0.02 | 0.89 | 0.02 | 0.94 | 0.04 |
TNR3 | ||||||||||||||
Original | 0.95 | 0.00 | 0.98 | 0.01 | 1.00 | 0.00 | 1.00 | 0.00 | 0.97 | 0.01 | 0.97 | 0.01 | 0.67 | 0.05 |
PG | 0.76 | 0.02 | 0.66 | 0.02 | 0.42 | 0.07 | 0.42 | 0.07 | 0.49 | 0.04 | 0.58 | 0.03 | 0.39 | 0.05 |
RUS | 0.71 | 0.01 | 0.71 | 0.02 | 0.61 | 0.10 | 0.57 | 0.09 | 0.78 | 0.02 | 0.80 | 0.02 | 0.43 | 0.06 |
ENN | 0.93 | 0.01 | 0.98 | 0.01 | 1.00 | 0.00 | 1.00 | 0.00 | 0.95 | 0.01 | 0.95 | 0.01 | 0.49 | 0.10 |
ALLKNN | 0.89 | 0.01 | 0.98 | 0.01 | 1.00 | 0.00 | 1.00 | 0.00 | 0.95 | 0.01 | 0.95 | 0.01 | 0.47 | 0.06 |
ROS | 0.95 | 0.00 | 0.88 | 0.02 | 0.74 | 0.05 | 0.71 | 0.05 | 0.96 | 0.01 | 0.96 | 0.01 | 0.42 | 0.06 |
SMOTE | 0.92 | 0.01 | 0.88 | 0.02 | 0.75 | 0.04 | 0.74 | 0.05 | 0.97 | 0.01 | 0.97 | 0.01 | 0.44 | 0.06 |
ADASYN | 0.91 | 0.01 | 0.87 | 0.02 | 0.75 | 0.04 | 0.74 | 0.04 | 0.97 | 0.01 | 0.97 | 0.01 | 0.42 | 0.06 |
Borderline-SMOTE | 0.93 | 0.01 | 0.91 | 0.01 | 0.78 | 0.04 | 0.76 | 0.04 | 0.97 | 0.01 | 0.97 | 0.01 | 0.47 | 0.06 |
SMOTENC | 0.92 | 0.01 | 0.88 | 0.02 | 0.75 | 0.04 | 0.73 | 0.04 | 0.97 | 0.01 | 0.97 | 0.01 | 0.43 | 0.06 |
SMOTEENN | 0.80 | 0.01 | 0.75 | 0.00 | 0.63 | 0.06 | 0.60 | 0.07 | 0.89 | 0.01 | 0.90 | 0.01 | 0.43 | 0.06 |
SMOTE-Tomek Links | 0.91 | 0.01 | 0.87 | 0.02 | 0.74 | 0.04 | 0.73 | 0.04 | 0.97 | 0.01 | 0.97 | 0.01 | 0.43 | 0.06 |
F1(1) | ||||||||||||||
Original | 0.83 | 0.01 | 0.87 | 0.01 | 0.90 | 0.00 | 0.90 | 0.00 | 0.88 | 0.01 | 0.87 | 0.01 | 0.66 | 0.04 |
PG | 0.57 | 0.01 | 0.49 | 0.02 | 0.37 | 0.06 | 0.49 | 0.00 | 0.13 | 0.03 | 0.22 | 0.01 | 0.50 | 0.06 |
RUS | 0.54 | 0.03 | 0.48 | 0.07 | 0.28 | 0.14 | / | / | 0.58 | 0.02 | 0.59 | 0.02 | 0.51 | 0.05 |
ENN | 0.87 | 0.00 | 0.89 | 0.00 | 0.90 | 0.00 | 0.90 | 0.00 | 0.89 | 0.00 | 0.89 | 0.00 | 0.59 | 0.07 |
ALLKNN | 0.84 | 0.01 | 0.89 | 0.00 | 0.90 | 0.00 | 0.90 | 0.00 | 0.88 | 0.00 | 0.88 | 0.01 | 0.58 | 0.05 |
ROS | 0.83 | 0.01 | 0.69 | 0.06 | 0.42 | 0.09 | 0.50 | 0.10 | 0.86 | 0.01 | 0.85 | 0.01 | 0.51 | 0.05 |
SMOTE | 0.77 | 0.03 | 0.67 | 0.06 | 0.43 | 0.10 | 0.42 | 0.10 | 0.86 | 0.01 | 0.86 | 0.01 | 0.51 | 0.05 |
ADASYN | 0.77 | 0.03 | 0.68 | 0.05 | 0.42 | 0.10 | 0.41 | 0.10 | 0.86 | 0.01 | 0.86 | 0.01 | 0.51 | 0.05 |
Borderline-SMOTE | 0.78 | 0.02 | 0.70 | 0.05 | 0.52 | 0.07 | 0.51 | 0.07 | 0.86 | 0.01 | 0.86 | 0.01 | 0.53 | 0.05 |
SMOTENC | 0.77 | 0.03 | 0.68 | 0.06 | 0.42 | 0.10 | 0.41 | 0.10 | 0.86 | 0.01 | 0.86 | 0.01 | 0.51 | 0.04 |
SMOTEENN | 0.54 | 0.04 | 0.41 | 0.05 | 0.22 | 0.08 | 0.18 | 0.08 | 0.68 | 0.04 | 0.71 | 0.03 | 0.27 | 0.08 |
SMOTE-Tomek Links | 0.77 | 0.03 | 0.66 | 0.06 | 0.42 | 0.09 | 0.41 | 0.09 | 0.86 | 0.01 | 0.86 | 0.01 | 0.51 | 0.04 |
F1(2) | ||||||||||||||
Original | 0.13 | 0.02 | / | / | / | / | / | / | 0.15 | 0.03 | 0.20 | 0.02 | / | / |
PG | 0.14 | 0.01 | 0.14 | 0.03 | / | / | / | / | 0.18 | 0.02 | 0.16 | 0.01 | / | / |
RUS | 0.13 | 0.03 | / | / | / | / | / | / | 0.17 | 0.02 | 0.18 | 0.01 | / | / |
ENN | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
ALLKNN | 0.06 | 0.01 | / | / | / | / | / | / | / | / | / | / | / | / |
ROS | 0.13 | 0.02 | 0.18 | 0.04 | / | / | / | / | 0.20 | 0.03 | 0.22 | 0.03 | / | / |
SMOTE | 0.14 | 0.02 | 0.15 | 0.02 | / | / | / | / | 0.17 | 0.02 | 0.21 | 0.02 | / | / |
ADASYN | 0.15 | 0.03 | 0.16 | 0.03 | / | / | / | / | 0.19 | 0.03 | 0.20 | 0.03 | / | / |
Borderline-SMOTE | 0.15 | 0.02 | 0.15 | 0.03 | 0.07 | 0.02 | / | / | 0.17 | 0.03 | 0.18 | 0.02 | / | / |
SMOTENC | 0.15 | 0.03 | 0.17 | 0.03 | / | / | / | / | 0.16 | 0.02 | 0.23 | 0.03 | / | / |
SMOTEENN | 0.14 | 0.01 | 0.14 | 0.01 | 0.12 | 0.03 | 0.13 | 0.03 | 0.18 | 0.01 | 0.24 | 0.02 | / | / |
SMOTE-Tomek Links | 0.16 | 0.02 | 0.17 | 0.02 | / | / | / | / | 0.17 | 0.02 | 0.19 | 0.03 | / | / |
F1(3) | ||||||||||||||
Original | / | / | / | / | / | / | / | / | 0.41 | 0.03 | 0.37 | 0.04 | 0.16 | 0.01 |
PG | 0.14 | 0.03 | 0.13 | 0.02 | 0.15 | 0.01 | 0.14 | 0.01 | 0.17 | 0.01 | 0.19 | 0.01 | 0.15 | 0.00 |
RUS | 0.12 | 0.03 | 0.12 | 0.02 | 0.12 | 0.01 | 0.12 | 0.02 | 0.25 | 0.02 | 0.26 | 0.02 | 0.15 | 0.00 |
ENN | / | / | / | / | / | / | / | / | 0.35 | 0.01 | 0.34 | 0.01 | 0.15 | 0.01 |
ALLKNN | / | / | / | / | / | / | / | / | 0.31 | 0.02 | 0.33 | 0.02 | 0.16 | 0.01 |
ROS | / | / | 0.18 | 0.06 | 0.18 | 0.01 | 0.17 | 0.01 | 0.39 | 0.05 | 0.41 | 0.05 | 0.15 | 0.01 |
SMOTE | 0.17 | 0.07 | 0.18 | 0.07 | 0.19 | 0.01 | 0.18 | 0.01 | 0.38 | 0.06 | 0.40 | 0.04 | 0.15 | 0.01 |
ADASYN | 0.19 | 0.06 | 0.18 | 0.07 | 0.18 | 0.01 | 0.17 | 0.01 | 0.38 | 0.04 | 0.41 | 0.05 | 0.15 | 0.00 |
Borderline-SMOTE | / | / | 0.18 | 0.06 | 0.19 | 0.01 | 0.16 | 0.03 | 0.41 | 0.06 | 0.37 | 0.03 | 0.15 | 0.01 |
SMOTENC | 0.18 | 0.06 | 0.17 | 0.07 | 0.19 | 0.01 | 0.15 | 0.03 | 0.38 | 0.03 | 0.40 | 0.05 | 0.15 | 0.00 |
SMOTEENN | 0.12 | 0.03 | 0.13 | 0.03 | 0.16 | 0.01 | 0.13 | 0.02 | 0.29 | 0.01 | 0.35 | 0.04 | 0.15 | 0.00 |
SMOTE-Tomek Links | / | / | 0.16 | 0.06 | 0.18 | 0.01 | 0.16 | 0.03 | 0.40 | 0.05 | 0.42 | 0.06 | 0.15 | 0.00 |
Score | ||||||||||||||
Original | 0.33 | 0.04 | 0.32 | 0.04 | 0.27 | 0.00 | 0.27 | 0.00 | 0.42 | 0.03 | 0.43 | 0.03 | 0.39 | 0.07 |
PG | 0.34 | 0.04 | 0.35 | 0.05 | 0.38 | 0.05 | 0.37 | 0.03 | 0.44 | 0.04 | 0.44 | 0.03 | 0.40 | 0.04 |
RUS | 0.32 | 0.05 | 0.31 | 0.07 | 0.30 | 0.10 | 0.29 | 0.08 | 0.46 | 0.04 | 0.47 | 0.04 | 0.37 | 0.03 |
ENN | 0.31 | 0.03 | 0.32 | 0.03 | 0.27 | 0.00 | 0.27 | 0.00 | 0.39 | 0.02 | 0.39 | 0.02 | 0.37 | 0.05 |
ALLKNN | 0.31 | 0.03 | 0.32 | 0.03 | 0.27 | 0.00 | 0.27 | 0.00 | 0.38 | 0.02 | 0.40 | 0.03 | 0.39 | 0.03 |
ROS | 0.33 | 0.04 | 0.36 | 0.06 | 0.32 | 0.07 | 0.34 | 0.07 | 0.45 | 0.04 | 0.46 | 0.04 | 0.38 | 0.04 |
SMOTE | 0.33 | 0.05 | 0.33 | 0.06 | 0.33 | 0.07 | 0.32 | 0.07 | 0.42 | 0.03 | 0.44 | 0.03 | 0.37 | 0.04 |
ADASYN | 0.35 | 0.05 | 0.35 | 0.06 | 0.33 | 0.07 | 0.31 | 0.07 | 0.43 | 0.03 | 0.44 | 0.04 | 0.38 | 0.04 |
Borderline-SMOTE | 0.33 | 0.04 | 0.33 | 0.05 | 0.34 | 0.06 | 0.32 | 0.07 | 0.43 | 0.04 | 0.42 | 0.03 | 0.37 | 0.04 |
SMOTENC | 0.35 | 0.05 | 0.34 | 0.06 | 0.33 | 0.07 | 0.31 | 0.08 | 0.42 | 0.03 | 0.45 | 0.03 | 0.38 | 0.04 |
SMOTEENN | 0.31 | 0.04 | 0.32 | 0.05 | 0.35 | 0.07 | 0.33 | 0.08 | 0.44 | 0.04 | 0.51 | 0.04 | 0.39 | 0.05 |
SMOTE-Tomek Links | 0.34 | 0.05 | 0.34 | 0.06 | 0.33 | 0.07 | 0.31 | 0.08 | 0.43 | 0.03 | 0.44 | 0.03 | 0.38 | 0.04 |
Appendix B
Dataset | Fine KNN | Weighted KNN | Cubic SVM | Fine Gaussian SVM | XGBoost | Random Forest | AODE | Rank |
---|---|---|---|---|---|---|---|---|
Original | 59 | 63 | 84 | 82 | 17 | 15 | 27 | 11 |
PG | 48 | 44 | 30 | 35 | 11 | 12 | 21 | 1 |
RUS | 64 | 76 | 77 | 78 | 3 | 2 | 37 | 10 |
ENN | 75 | 69 | 82 | 81 | 23 | 24 | 34 | 13 |
ALLKNN | 72 | 66 | 79 | 80 | 28 | 22 | 25 | 12 |
ROS | 59 | 39 | 67 | 49 | 6 | 4 | 29 | 3 |
SMOTE | 54 | 52 | 58 | 65 | 18 | 7 | 36 | 7 |
ADASYN | 42 | 41 | 61 | 70 | 16 | 9 | 31 | 4 |
Borderline-SMOTE | 51 | 55 | 50 | 62 | 14 | 19 | 38 | 8 |
SMOTENC | 42 | 45 | 56 | 74 | 20 | 5 | 32 | 5 |
SMOTEENN | 73 | 68 | 40 | 53 | 10 | 1 | 26 | 2 |
SMOTE-Tomek Links | 46 | 47 | 57 | 71 | 13 | 8 | 32 | 6 |
Rank | 5 | 4 | 6 | 7 | 2 | 1 | 3 | / |
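This section does not define how the dataset-level Rank column is derived. One plausible reading, assumed here rather than confirmed by the text, is that datasets are ordered by their mean Score across the seven classifiers. The sketch below applies that assumption to four rows of mean Score values copied from Appendix A; the variable names are illustrative.

```python
# Mean Score values from Appendix A, in the column order of the table:
# Fine KNN, Weighted KNN, Cubic SVM, Fine Gaussian SVM, XGBoost,
# Random Forest, AODE
classifiers = [
    "Fine KNN", "Weighted KNN", "Cubic SVM", "Fine Gaussian SVM",
    "XGBoost", "Random Forest", "AODE",
]
scores = {
    "Original": [0.33, 0.32, 0.27, 0.27, 0.42, 0.43, 0.39],
    "PG":       [0.34, 0.35, 0.38, 0.37, 0.44, 0.44, 0.40],
    "ROS":      [0.33, 0.36, 0.32, 0.34, 0.45, 0.46, 0.38],
    "SMOTEENN": [0.31, 0.32, 0.35, 0.33, 0.44, 0.51, 0.39],
}

# Assumption: a dataset's rank follows its mean Score over all classifiers
mean_score = {name: sum(vals) / len(vals) for name, vals in scores.items()}
dataset_order = sorted(mean_score, key=mean_score.get, reverse=True)

# Best single (dataset, classifier) combination among the listed rows
best_dataset, best_classifier = max(
    ((d, c) for d in scores for c in classifiers),
    key=lambda dc: scores[dc[0]][classifiers.index(dc[1])],
)

print(dataset_order)
print(best_dataset, best_classifier)
```

Under this assumption the four datasets are ordered PG, SMOTEENN, ROS, Original, which agrees with their relative order in the Rank column (1, 2, 3, and 11). The best single combination among these rows is SMOTEENN with random forest.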
- Kuang, L.; Yan, H.; Zhu, Y.; Tu, S.; Fan, X. Predicting duration of traffic accidents based on cost-sensitive Bayesian network and weighted K-nearest neighbor. J. Intell. Transp. Syst. 2019, 23, 161–174. [Google Scholar] [CrossRef]
- Altman, N.S. An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 1992, 46, 175–185. [Google Scholar]
- Bagasta, A.R.; Rustam, Z.; Pandelaki, J.; Nugroho, W.A. Comparison of cubic SVM with Gaussian SVM: Classification of infarction for detecting ischemic stroke. In Proceedings of the 9th Annual Basic Science International Conference (BaSIC)—Recent Advances in Basic Sciences Toward 4.0 Industrial Revolution, Brawijaya University, Malang, Indonesia, 20–21 May 2019. [Google Scholar]
- Zareapoor, M.; Shamsolmoali, P. Application of Credit Card Fraud Detection: Based on Bagging Ensemble Classifier. In Proceedings of the 1st International Conference on Intelligent Computing, Communication and Convergence (ICCC), Bhubaneshwar, India, 27–28 December 2014. [Google Scholar]
- Malik, S.; El Sayed, H.; Khan, M.A.; Khan, M.J. Road Accident Severity Prediction—A Comparative Analysis of Machine Learning Algorithms. In Proceedings of the IEEE Global Conference on Artificial Intelligence and Internet of Things (GCAIoT), Dubai, United Arab Emirates, 12–16 December 2021. [Google Scholar]
- Shi, X.; Wong, Y.D.; Li, M.Z.-F.; Palanisamy, C.; Chai, C. A feature learning approach based on XGBoost for driving assessment and risk prediction. Accid. Anal. Prev. 2019, 129, 170–179. [Google Scholar] [CrossRef]
- Shi, X.; Wong, Y.D.; Chai, C.; Li, M.Z.-F. An automated machine learning (AUTOML) method of risk prediction for decision-making of autonomous vehicles. IEEE Trans. Intell. Transp. Syst. 2021, 22, 7145–7154. [Google Scholar] [CrossRef]
- Parsa, A.B.; Movahedi, A.; Taghipour, H.; Derrible, S.; Mohammadian, A. Toward safer highways, application of XGBoost and SHAP for real-time accident detection and feature analysis. Accid. Anal. Prev. 2020, 136, 105405. [Google Scholar] [CrossRef]
- Leung, K.M. Naive Bayesian classifier. Polytech. Univ. Dep. Comput. Sci./Financ. Risk Eng. 2007, 2007, 123–156. [Google Scholar]
- Yang, K.; Wang, X.; Yu, R. A Bayesian dynamic updating approach for urban expressway real-time crash risk evaluation. Transp. Res. Part C-Emerg. Technol. 2018, 96, 192–207. [Google Scholar] [CrossRef]
- Flores, M.J.; Gámez, J.A.; Martínez, A.M. Supervised classification with Bayesian networks: A review on models and applications. Intell. Data Anal. Real-Life Appl. Theory Pract. 2012, 72–102. [Google Scholar] [CrossRef]
- Chen, T.; Wong, Y.D.; Shi, X.; Wang, X. Optimized structure learning of Bayesian network for investigating causation of vehicles? On-road crashes. Reliab. Eng. Syst. Saf. 2022, 224, 108527. [Google Scholar] [CrossRef]
- Mizianty, M.; Kurgan, L.; Ogiela, M. Comparative Analysis of the Impact of Discretization on the Classification with Naive Bayes and Semi-Naive Bayes Classifiers. In Proceedings of the 7th International Conference on Machine Learning and Applications, San Diego, CA, USA, 11–13 December 2008. [Google Scholar]
- Papadimitriou, E.; Filtness, A.; Theofilatos, A.; Ziakopoulos, A.; Quigley, C.; Yannis, G. Review and ranking of crash risk factors related to the road infrastructure. Accid. Anal. Prev. 2019, 125, 85–97. [Google Scholar] [CrossRef]
- Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
- Yin, L.; Ge, Y.; Xiao, K.; Wang, X.; Quan, X. Feature selection for high-dimensional imbalanced data. Neurocomputing 2013, 105, 3–11. [Google Scholar] [CrossRef]
- Schloegl, M. A multivariate analysis of environmental effects on road accident occurrence using a balanced bagging approach. Accid. Anal. Prev. 2020, 136, 105398. [Google Scholar] [CrossRef]
- Chen, T.; Wong, Y.D.; Shi, X.; Yang, Y. A data-driven feature learning approach based on Copula-Bayesian network and its application in comparative investigation on risky lane-changing and car-following maneuvers. Accid. Anal. Prev. 2021, 154, 106061. [Google Scholar] [CrossRef]
- Elvik, R. The more (sharp) curves, the lower the risk. Accid. Anal. Prev. 2019, 133, 105322. [Google Scholar] [CrossRef]

Number | Stake Number | Driving Direction | Weather |
---|---|---|---|
1 | K1~K2 | forward | sunny |
2 | K1~K2 | forward | cloudy |
… | … | … | … |
1679 | K209~K210 | reverse | overcast |
1680 | K209~K210 | reverse | rainy |

Factor | Code | Variable | Description |
---|---|---|---|
Horizontal alignment | HPNE | Number of horizontal change points | The number of change points of horizontal alignment |
Horizontal alignment | CULR | Curve length ratio | Ratio of curve length per kilometer (value between 0 and 1) |
Horizontal alignment | STLR | Straight length ratio | Ratio of straight length per kilometer (value between 0 and 1) |
Horizontal alignment | RIMX | Maximum curve radius index | Maximum value of the inverse of the radius. RIMX is 0 on straight sections |
Horizontal alignment | RICU | Cumulative radius index | Cumulative value of the inverse of the radius |
Vertical alignment | VPNE | Number of vertical change points | The number of change points of vertical alignment |
Vertical alignment | GAE | Average grade | Average grade per kilometer |
Vertical alignment | GMX | Maximum grade | Maximum grade per kilometer |
Vertical alignment | GDF | Grade difference | The difference in the maximum grade per kilometer |
Vertical alignment | DWLR | Downhill length ratio | The proportion of downhill in the section |
Pavement | PCI | Pavement surface condition index | The value is between 0 and 100; the lower the value, the more serious the road damage. |
Pavement | RQI | Riding quality index | The value is between 0 and 100; the lower the value, the lower the smoothness of the road surface. |
Pavement | RDI | Rutting depth index | The value is between 0 and 100; the lower the value, the deeper the road rut. |
Pavement | SRI | Skidding resistance index | The value is between 0 and 100; the lower the value, the lower the anti-skid performance of the pavement. |
Tunnels and expressway facilities | TNN | Tunnel | 0 = no tunnel in the section; 1 = tunnels in the section. |
Tunnels and expressway facilities | ITC | Interchange | 0 = no interchange in the section; 1 = interchanges in the section. |
Tunnels and expressway facilities | SVA | Service area | 0 = no service area in the section; 1 = service areas in the section. |
Weather | WET | Weather | 1 = sunny; 2 = cloudy; 3 = overcast; 4 = rainy. |
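
The horizontal alignment variables in the table above can be illustrated with a short Python sketch. This is not the authors' implementation: the `Element` input format (a length plus an optional radius) is hypothetical, and counting a change point as each boundary between consecutive elements is an assumption, since the table does not define it further.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Element:
    """One horizontal alignment element (hypothetical input format)."""
    length_m: float
    radius_m: Optional[float]  # None marks a straight element


def horizontal_features(elements: Sequence[Element]) -> dict:
    """Compute HPNE, CULR, STLR, RIMX and RICU for one road section."""
    total = sum(e.length_m for e in elements)
    if total <= 0:
        raise ValueError("section must have positive length")
    curve_len = sum(e.length_m for e in elements if e.radius_m is not None)
    inv_radii = [1.0 / e.radius_m for e in elements if e.radius_m is not None]
    return {
        # Assumption: one change point per boundary between elements.
        "HPNE": max(len(elements) - 1, 0),
        "CULR": curve_len / total,            # share of the section on curves
        "STLR": (total - curve_len) / total,  # share of the section on straights
        "RIMX": max(inv_radii, default=0.0),  # 0 when the section is straight
        "RICU": sum(inv_radii),               # cumulative inverse radius
    }


# A 1 km section: 400 m straight, then curves of radius 800 m and 500 m.
section = [Element(400.0, None), Element(350.0, 800.0), Element(250.0, 500.0)]
features = horizontal_features(section)
straight_only = horizontal_features([Element(1000.0, None)])
```

Under these assumptions CULR and STLR always sum to 1, and the sharpest curve (smallest radius) determines RIMX.
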

Method | Algorithm | Description |
---|---|---|
Under-sampling | Prototype generation (PG) | Generates new samples from the original samples to balance the classes: K-means clusters the majority class samples, and the cluster centroids replace them as the newly generated samples [31]. |
Under-sampling | Random under-sampling (RUS) | Randomly removes samples from the majority class until the samples of each class are balanced. |
Under-sampling | Edited nearest neighbor (ENN) | Applies the nearest-neighbors algorithm to edit the dataset, removing samples that do not agree enough with their neighborhood [32]. |
Under-sampling | All-KNN (ALLKNN) | Applies ENN several times, varying the number of nearest neighbors each time [32]. |
Over-sampling | Naive random over-sampling (ROS) | Randomly samples the minority class with replacement and adds the draws to the existing sample set, which increases the weight of the minority class samples. |
Over-sampling | Synthetic minority oversampling technique (SMOTE) | For each minority class sample, the k nearest minority class samples are identified. Each time, a sample point and one of its k neighbors are selected at random, and a new sample point is obtained by interpolating between them, increasing the minority class samples until the data are balanced [33]. |
Over-sampling | Borderline-SMOTE | An improved variant of SMOTE. The minority class sample points are divided into “noise points”, “dangerous points”, and “safe points”, and only the dangerous points are used when calculating the k nearest minority class samples [34]. |
Over-sampling | SMOTENC | An improved variant of SMOTE. Plain SMOTE cannot properly measure distances between, or interpolate, categorical variables. SMOTENC uses the value difference metric (VDM) algorithm to calculate the distance for categorical variables, which enables it to process them [33]. |
Over-sampling | ADASYN | Similar to SMOTE, it is based on k nearest neighbors and interpolation; the difference is that ADASYN also considers samples of other classes when calculating the k nearest neighbors [35]. |
Mixed sampling | SMOTEENN | SMOTE is used to generate new minority class samples, some of which may be noisy. ENN is then applied to remove the noisy samples and obtain cleaner data [36]. |
Mixed sampling | SMOTE-Tomek Links | Similar to SMOTEENN, but Tomek links are applied to remove the noisy samples and obtain cleaner data [37]. |
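
As a rough illustration of three of the algorithms described above (RUS, ROS, and SMOTE), the following standard-library Python sketch may help. It is a simplified teaching version, not the implementation used in the study: the function names, the toy data, and the choice of squared Euclidean distance are all assumptions.

```python
import random
from collections import defaultdict


def group_by_class(X, y):
    """Collect the feature rows of each class label."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return groups


def random_under_sample(X, y, rng):
    """RUS: randomly keep as many rows per class as the smallest class has."""
    groups = group_by_class(X, y)
    n = min(len(rows) for rows in groups.values())
    pairs = [(r, lab) for lab, rows in groups.items() for r in rng.sample(rows, n)]
    return [r for r, _ in pairs], [lab for _, lab in pairs]


def random_over_sample(X, y, rng):
    """ROS: draw with replacement until every class matches the largest class."""
    groups = group_by_class(X, y)
    n = max(len(rows) for rows in groups.values())
    pairs = []
    for lab, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(n - len(rows))]
        pairs.extend((r, lab) for r in rows + extra)
    return [r for r, _ in pairs], [lab for _, lab in pairs]


def smote(minority, n_new, k, rng):
    """SMOTE: interpolate between a minority row and one of its k nearest minority rows."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbor = rng.choice(sorted(others, key=lambda p: dist2(base, p))[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (nb - b) for b, nb in zip(base, neighbor)])
    return synthetic


# Toy data (made up for illustration): six majority rows, three minority rows.
rng = random.Random(0)
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.2],
     [1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
y = ["L_1"] * 6 + ["L_3"] * 3
X_rus, y_rus = random_under_sample(X, y, rng)
X_ros, y_ros = random_over_sample(X, y, rng)
minority = [row for row, lab in zip(X, y) if lab == "L_3"]
new_points = smote(minority, n_new=3, k=2, rng=rng)
```

Because each synthetic point is a convex combination of two existing minority rows, it always lies on the segment between them. ROS, by contrast, only duplicates existing rows.
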

Dataset | Balancing Algorithm | Total | L_1 | L_2 | L_3 |
---|---|---|---|---|---|
Original dataset | / | 1344 | 1093 | 168 | 83 |
Under-sampling | PG | 249 | 83 | 83 | 83 |
Under-sampling | RUS | 249 | 83 | 83 | 83 |
Under-sampling | ENN | 1124 | 1023 | 18 | 83 |
Under-sampling | ALLKNN | 1065 | 948 | 34 | 83 |
Over-sampling | ROS | 3279 | 1093 | 1093 | 1093 |
Over-sampling | SMOTE | 3279 | 1093 | 1093 | 1093 |
Over-sampling | ADASYN | 3278 | 1093 | 1099 | 1086 |
Over-sampling | Borderline-SMOTE | 3279 | 1093 | 1093 | 1093 |
Over-sampling | SMOTENC | 3279 | 1093 | 1093 | 1093 |
Mixed sampling | SMOTEENN | 2394 | 598 | 798 | 998 |
Mixed sampling | SMOTE-Tomek Links | 3219 | 1068 | 1066 | 1085 |
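
The class counts above can be checked for consistency, and the remaining imbalance quantified, with a few lines of Python. The counts are copied from the table. The imbalance ratio (largest class divided by smallest class) is a common convenience measure introduced here for illustration; it is not a metric reported in the paper.

```python
# (total, L_1, L_2, L_3) per balancing algorithm, copied from the table above.
counts = {
    "Original": (1344, 1093, 168, 83),
    "PG": (249, 83, 83, 83),
    "RUS": (249, 83, 83, 83),
    "ENN": (1124, 1023, 18, 83),
    "ALLKNN": (1065, 948, 34, 83),
    "ROS": (3279, 1093, 1093, 1093),
    "SMOTE": (3279, 1093, 1093, 1093),
    "ADASYN": (3278, 1093, 1099, 1086),
    "Borderline-SMOTE": (3279, 1093, 1093, 1093),
    "SMOTENC": (3279, 1093, 1093, 1093),
    "SMOTEENN": (2394, 598, 798, 998),
    "SMOTE-Tomek Links": (3219, 1068, 1066, 1085),
}


def imbalance_ratio(per_class):
    """Largest class size divided by smallest class size (1.0 means balanced)."""
    return max(per_class) / min(per_class)


# Every row's class counts should add up to its reported total.
mismatches = [name for name, (total, *per_class) in counts.items()
              if sum(per_class) != total]

ratios = {name: imbalance_ratio(row[1:]) for name, row in counts.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:18s} {ratio:6.2f}")
```

On these counts every row sums to its total. The ENN and ALLKNN rows end up with a larger ratio than the original dataset, because the L_2 class shrinks far more than L_1 does.
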

Rank | Resampling Algorithm | Classifier | Score (Mean) | Score (S.D.) |
---|---|---|---|---|
1 | SMOTEENN | Random forest | 0.51 | 0.04 |
2 | RUS | Random forest | 0.47 | 0.04 |
3 | RUS | XGBoost | 0.46 | 0.04 |
4 | ROS | Random forest | 0.46 | 0.04 |
5 | SMOTENC | Random forest | 0.45 | 0.03 |
6 | ROS | XGBoost | 0.45 | 0.04 |
7 | SMOTE | Random forest | 0.44 | 0.03 |
8 | ADASYN | Random forest | 0.44 | 0.04 |
9 | SMOTE-Tomek Links | Random forest | 0.44 | 0.03 |
10 | SMOTEENN | XGBoost | 0.44 | 0.04 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wang, B.; Zhang, C.; Wong, Y.D.; Hou, L.; Zhang, M.; Xiang, Y. Comparing Resampling Algorithms and Classifiers for Modeling Traffic Risk Prediction. Int. J. Environ. Res. Public Health 2022, 19, 13693. https://doi.org/10.3390/ijerph192013693