3.1. Dataset and Feature Generation
In this research, we focus on industries in the UK. The dataset used in this study comes from a survey designed by the Ipsos MORI Social Research Institute. In 2016, the Ipsos MORI Social Research Institute, in association with the UK government, conducted a telephone survey of 1008 UK businesses and 30 in-depth interviews to identify the cyber security issues facing UK industry and the actions needed to address them [3]. As the UK’s economy has grown stronger, the number of business operations in the UK has increased. To help make the UK one of the most suitable places to run a business, the Ipsos MORI Social Research Institute and the UK government conduct this kind of survey regularly. Guided by the task of exploring the causes that lead companies to conduct SETA training, we extracted and generated various features from the survey results. Initially, several attributes hypothesized to contribute to companies’ SETA implementation decisions were extracted for feature generation. After this preparation, we performed several operations, such as reassignment and merging, to generate the features used in the following analysis. The generation rules and the generated features are as follows:
“Update”: The “Update” describes how frequently a company updates its antivirus software, which is the answer to the survey question “how often does the company update the antivirus software?”. More frequent software updates indicate greater cyber security awareness. From this perspective, this feature reflects the awareness of corporate leadership in preventing cyber security threats. We converted this feature into the ordinal type.
“Sizec”: The “Sizec” is one of the features describing company size in the dataset. Companies of different sizes may have different considerations and attitudes toward cyber security issues and SETA implementation. The “Sizec” maps companies with varying numbers of employees onto different size scales: companies with 2–9 employees, 10–49 employees, 50–249 employees, and 250 or more employees are respectively regarded as micro, small, medium, and large companies. Based on its characteristics, the “Sizec” is also converted into the ordinal type.
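The “Sizec” mapping described above can be sketched as follows; this is a minimal illustration, and the function names and the exact encoding are our own assumptions rather than the authors’ implementation:

```python
# Ordered size labels, from smallest to largest, for the ordinal encoding.
SIZE_ORDER = ["micro", "small", "medium", "large"]

def employees_to_sizec(n_employees: int) -> str:
    """Map an employee count to the "Sizec" company-size scale from the survey."""
    if n_employees <= 9:
        return "micro"    # 2-9 employees
    if n_employees <= 49:
        return "small"    # 10-49 employees
    if n_employees <= 249:
        return "medium"   # 50-249 employees
    return "large"        # 250 or more employees

def sizec_ordinal(label: str) -> int:
    """Encode the size label as an ordinal integer (micro=0 ... large=3)."""
    return SIZE_ORDER.index(label)
```

The same pattern (an ordered label set plus an index lookup) applies to the other ordinal features such as “Update” and “Freq”.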
“Freq”: The “Freq” describes the frequency of attacks and breaches that the company encountered during the year before the survey. Companies encountered quite different cyber-attack situations in that period. For example, some companies encountered cyber-attacks almost daily or even several times a day, while others did not encounter any attack in a year, which might be a key element affecting a company’s decision on SETA implementation. We converted the “Freq” into the ordinal type.
“Priority”: The “Priority” describes the level of importance placed on cybersecurity by organizations’ directors or senior managers. As the managers consider more cyber security protections, more measures and strategies might be taken to prevent cyber breach loss. This feature is a direct expression of the company management’s awareness regarding cyber security, and it is converted into the ordinal type.
“Numbb”: The “Numbb” is one of the features describing the total number of attacks and breaches that the company encountered during the past year of the survey in the original dataset. Different companies also encountered a different number of cyber-attacks and breaches in a fixed period. Similar to the “Freq”, the “Numbb” also describes the companies’ exposure to cyber-attacks. Compared to the “Freq” describing the attack frequency, the “Numbb” describes the order of magnitudes of cyber-attacks. The “Numbb” is also converted into the ordinal type.
“Core”: The “Core” describes how dependent the products and services provided by an organization are on online services. With the rapid advancement of information technology, online services have increasingly become the core of many companies’ and organizations’ products and services. One of the goals of cyber-attacks for hackers is financial gain [
24], so the high dependence of products and services on online services might lead the managers to care more about cyber security and further conduct SETA training. Based on the dependency level on online services, we convert the “Core” into the ordinal type.
“Insure”: The “Insure” represents whether an organization or company has insurance against a cyber security breach or attack. In commercial activities, many business losses are directly related to cyber security incidents. When such incidents occur, cyber security insurance is an efficient way for organizations to reduce their capital loss, so organizations with cyber security insurance might neglect other cyber security protection channels and remedial measures, including SETA training. We converted the “Insure” into the nominal type according to the insurance status.
“Factor”: The “Factor” describes whether the occurrence of cyber security breaches or attacks is related to staff-related factors. Many attributes in the original dataset describe the specific causes contributing to cyber security breaches or attacks. Among these causes, both external invasion and internal negligence might play a role. Cyber security breaches or attacks caused by natural disasters are challenging to prevent. Besides, targeted external attacks on the organization, politically motivated attacks, and negligence of cyber compliance by the organization’s partners, such as suppliers, are all external factors that can lead to cyber security breaches or attacks. SETA programs are the most effective way to improve employee information security protection behavior [
6,
7]. Thus, carrying out SETA training programs is a highly effective way to prevent cyber security breaches or attacks caused by staff human error. In this case, we integrated the staff-related factors in the original dataset into one feature, “Factor”, to investigate whether cyber security breaches and attacks related to staff-related factors are a key element leading to SETA implementation. Specifically, the “Factor” covers eight staff-related factors: human error; unchanged or unsecure passwords; staff, ex-staff, or contractors deliberately abusing their accounts; staff, ex-staff, or contractors not adhering to policies or processes; absent or inadequate vetting of staff, ex-staff, or contractors; staff lacking awareness or knowledge; unsecure settings on browsers, software, computers, or accounts; and browsing untrusted or unsafe websites. Like the “Insure”, we converted the “Factor” into the nominal type. When at least one staff-related factor in the original dataset led to the occurrence of a cyber security breach or attack, the value of the “Factor” is “Yes”. When no staff-related factor led to such an occurrence, the value is “No”. When the organization was not attacked in the 12 months before the survey, the value is “Unable to distinguish”.
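The three-way merging rule for “Factor” can be sketched as below; the column names are hypothetical stand-ins for the eight staff-related attributes in the survey data, not the dataset’s actual variable names:

```python
# Hypothetical names standing in for the eight staff-related cause
# columns of the original dataset; values are survey yes/no answers.
STAFF_FACTORS = [
    "human_error", "unchanged_passwords", "deliberate_account_abuse",
    "policy_non_adherence", "inadequate_vetting", "lack_of_awareness",
    "unsecure_settings", "unsafe_browsing",
]

def derive_factor(record: dict) -> str:
    """Merge the staff-related cause columns into the single "Factor" value."""
    if not record.get("attacked_past_12_months", False):
        return "Unable to distinguish"   # organization was not attacked
    if any(record.get(col) == "yes" for col in STAFF_FACTORS):
        return "Yes"                     # at least one staff-related cause
    return "No"                          # attacked, but no staff-related cause
```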
“Cloud”: The “Cloud” describes whether the organization uses cloud or any other type of externally-hosted web services. Externally-hosted web services help organizations reduce the difficulty of development, operation, and maintenance, as well as much of the upfront IT infrastructure investment, enabling them to focus more on business operation and innovation. Compared to organizations using traditional servers, organizations using externally-hosted web services relatively lack control over some resources and operating materials, such as datasets, which constitutes a potential management risk. To protect the confidential resources stored in externally-hosted web services, organizations might conduct SETA implementation to reduce the occurrence and loss of threats such as social engineering. We converted the “Cloud” into the nominal type.
“Critical”: The “Critical” describes the significance of the externally-hosted web services to an organization when this organization uses externally-hosted web services. The organizations with higher importance on the externally-hosted web services might focus more on the protection of the resources stored in externally-hosted web services. The “Critical” is converted into the ordinal type according to the corresponding significance of externally-hosted web services.
“Train”: The “Train” is the class label. In this study, we treat the task of exploring the causes evoking companies to conduct SETA implementation as a binary classification task. Thus, we merged into one set the organizations in which anyone has participated in any type of SETA training program mentioned in the survey, covering attending seminars or conferences on cyber security, attending any externally provided training on cyber security, and receiving any internal training on cyber security, while leaving the organizations in which nobody participated in any SETA training program as the other set. Following this rule, we converted the “Train” into a binary class type.
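The merging rule for the binary class can be sketched as follows; the flag names for the three SETA channels are our own assumptions for illustration:

```python
# Hypothetical yes/no flags for the three SETA channels named in the survey.
SETA_CHANNELS = [
    "seminars_or_conferences",  # attending seminars or conferences on cyber security
    "external_training",        # attending externally provided training
    "internal_training",        # receiving internal training
]

def derive_train(record: dict) -> int:
    """Binary class: 1 if anyone in the organization took part in any
    SETA channel, 0 otherwise."""
    return int(any(record.get(c) == "yes" for c in SETA_CHANNELS))
```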
Based on the representations of generated features in the real world, we divide the features into four feature groups and evaluate their contribution to SETA implementation in the four corresponding dimensions. The four divided feature groups are respectively companies’ and organizations’ nature related group, corporate leadership’s awareness related group, internal factor related group, and external factor related group.
The feature values and corresponding feature groups of one example company are shown in
Table 1.
3.2. Experiment Design
Compared to the general qualitative methods used in SETA implementation studies, we use supervised learning to test, via feature importance, the hypothesis that the generated features all contribute to businesses’ SETA implementation. The framework of the whole experiment is shown in
Figure 1.
In the experiments, we use stratified K-fold cross-validation to reduce the model training bias that could be caused by an unlucky data split. Specifically, we use stratified 5-fold cross-validation to split the dataset into a training dataset and a validation dataset in the proportion 8:2, five times. The five split validation datasets can be merged to recover the whole dataset. In each split, we train the classification models on the training dataset and evaluate the trained models on the validation dataset. In the evaluation step, we average each model’s classification performances on four evaluation metrics over the five validation datasets.
To more accurately investigate the causes evoking companies to conduct SETA implementation, it is essential to detect companies with SETA implementation as accurately as possible from the features. Thus, based on the feature types, we train and evaluate eight representative supervised learning models that are generally effective for this kind of feature set, in order to select the best-performing model for detecting companies with SETA implementation. The eight models are Support Vector Machine (SVM), Naïve Bayes, Logistic Regression, Decision Tree, Random Forest (RF), Bagging, AdaBoost, and Light Gradient Boosting Machine (LightGBM). Besides, since the dataset is imbalanced, we use the cost-sensitive method to handle the class imbalance problem when training the selected classification models, except for the Naïve Bayes model. Due to the operating principle of the Naïve Bayes model, the cost-sensitive method is not suitable for it, as demonstrated in the following.
The eight selected models cover four base classification models, SVM, Naïve Bayes, Logistic Regression and Decision Tree, and four ensemble classification models, RF, Bagging, AdaBoost and LightGBM. When training the models, we adopt some methods to avoid overfitting and boost the model performances. For example, we choose to set the maximum depth for the trees generated in the tree-related models to avoid overfitting.
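Both points, cost-sensitive training and depth-limited trees, can be sketched in a scikit-learn style; the text does not specify the implementation, so the library and parameter values here are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the survey data with roughly 70/30 class imbalance.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = (rng.random(200) < 0.7).astype(int)

# class_weight="balanced" weights each class inversely to its frequency,
# a common cost-sensitive scheme for imbalanced data; max_depth caps tree
# growth to limit overfitting, as described for the tree-related models.
clf = RandomForestClassifier(
    n_estimators=100, max_depth=5, class_weight="balanced", random_state=0
)
clf.fit(X, y)
```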
Support Vector Machine is a base classification model that has overall decent performances in many classification tasks [
25]. It is a discriminative classifier that is generally used in binary classification. The idea of SVM in binary classification is to find the optimal hyperplane which can separate the m-dimensional data into two classes [
26]. In this case, based on the feature type, we employ the Radial Basis Function (RBF) kernel to train the SVM classifier.
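An RBF-kernel SVM of the kind described can be sketched as below, again on synthetic data and assuming a scikit-learn implementation with class weighting for the imbalance:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic data with a non-linear class boundary, where the RBF kernel helps.
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 8))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

# RBF kernel with balanced class weights for the imbalanced dataset.
svm = SVC(kernel="rbf", class_weight="balanced", random_state=0)
svm.fit(X, y)
```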
Naïve Bayes is a base classification model applying Bayes’ theorem. A Naïve Bayes classifier can be built and trained efficiently without any complicated parameter estimation, making it suitable for classification tasks with high-dimensional input features.
Logistic Regression is a base classification model, which is a standard method to build prediction models for a binary class outcome. In the binary classification task, the logistic regression uses the sigmoid function to convert the value range into 0–1 and find the decision boundary in the converted hyperplane to classify the input data.
Decision Tree is a base classification model to classify data by generating a treelike graph with a series of rules. It uses a treelike graph to guide the input feature to one of the class labels. A unique rule used to test the input data is embedded in an internal node of a decision tree. The possible results classified by the rules of the internal nodes will generate corresponding numbers of outgoing branches connected to this internal node.
Random Forest is an ensemble classification model. It constructs a number of decision trees and combines the results of all sub-decision trees to produce the final classification outcome. It is an application of the bagging method. What distinguishes random forest from the plain combination of the bagging method and decision tree classifiers is that random forest randomly selects a subset of the input features when generating each node of its sub-decision trees.
Bagging determines the ensemble’s class by majority vote. In this case, the bagging classifier builds a decision tree classifier on each bootstrap sample, and the final output is the majority vote of the sub-decision tree classifiers’ results.
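The distinction between plain bagging of decision trees and a random forest can be shown side by side; this is an illustrative scikit-learn sketch on synthetic data, not the authors’ configuration:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Synthetic data where class depends on the first feature.
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 6))
y = (X[:, 0] > 0).astype(int)

# Bagging builds decision trees (the default base estimator) on bootstrap
# samples and takes a majority vote over their predictions.
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)

# A random forest additionally draws a random feature subset (here sqrt of
# the feature count) at every split, decorrelating the trees.
forest = RandomForestClassifier(
    n_estimators=50, max_depth=5, max_features="sqrt", random_state=0
).fit(X, y)
```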
Compared with the Random Forest model and the plain combination of the bagging method and decision tree classifiers, the AdaBoost model assigns a weight to each sub-classifier, and the final output is generated by weighted voting. At the beginning of this method, every training record has the same weight. In each iteration, the record weights are updated so that subsequent classifiers focus on the records misclassified in previous iterations, and each classifier’s vote is weighted according to its accuracy.
Light Gradient Boosting Machine is an improved version of the Gradient Boosting Decision Tree. Gradient Boosting Decision Tree is an ensemble model based on Decision Tree, which is a widely used machine learning algorithm [
27]. In Gradient Boosting Decision Tree, a series of weak learners is constructed along the gradient, and they are then combined with corresponding weights. The weighted result is the decision made by GBDT. Compared to Gradient Boosting Decision Tree, Light Gradient Boosting Machine obtains almost the same performances with a much smaller amount of training time [
28].
The four metrics that we adopt to evaluate the performances of the classification models are respectively precision (P), recall (R), F-measure (F), and accuracy (A) [29]. Equations of these four metrics are shown as follows:
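Written out in their standard form, consistent with the term definitions that follow, the four metrics are:

```latex
P = \frac{tp}{tp + fp}, \qquad
R = \frac{tp}{tp + fn}, \qquad
F = \frac{2 \times P \times R}{P + R}, \qquad
A = \frac{tp + tn}{tp + fp + fn + tn}
```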
In these four equations, tp is true positive, which is the number of correctly detected companies that have held SETA implementation; fp is false positive, which is the number of companies without SETA implementation that are incorrectly detected as companies that have held SETA implementation; fn is false negative, which is the number of companies that have held SETA implementation and are incorrectly detected as companies without SETA implementation; tn is true negative, which is the number of correctly detected companies without SETA implementation. Values of these four metrics range from 0 to 100 percent.
3.3. Performance Comparison
The classification performances of the eight classification models in the four metrics are presented in
Table 2.
The standard deviations of the model performances among the five different split results of the stratified 5-fold cross-validation method in the four metrics are presented in
Table 3.
As shown in
Table 3, the highest standard deviation of the eight classification models across the four metrics is 0.0490, below 0.05, so these models have stable performances on the four metrics.
Higher outputs in the four metrics, precision, recall,
F-measure, and accuracy reflect better performances of the model in the classification task. As it is shown in
Table 2, the Random Forest classifier obtains the best performances in the recall,
F-measure, and accuracy, reaching 79.60%, 71.16%, and 75.64%, respectively, and outperforming all the other classifiers. The Naïve Bayes classifier obtains the best precision, reaching 67.29%, but performs much worse on the other three metrics, especially recall and
F-measure. In precision, the Random Forest classifier ranks second-best among all classifiers, reaching 64.42%. Although the SVM classifier matches the RF classifier’s top ranking in recall, its performances on the other three metrics are much worse than those of the RF classifier.
Overall, the RF model achieves more stable performances than the other models in detecting companies and organizations that have held SETA training. The RF classifier’s high recall means that it detects companies and organizations with SETA implementation well. By contrast, the precision values of the RF classifier and the other classifiers are all relatively low, indicating that some companies or organizations without SETA implementation have characteristics highly similar, in the tested factors, to those with SETA implementation.
To better understand the reasons leading to the SETA implementation for companies, we visualize the importance of all features for the RF model in the experiment, shown in
Figure 2.
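The importance values visualized in Figure 2 can be obtained from a trained random forest; the sketch below uses synthetic data, and the listed feature names simply mirror the generated features described in Section 3.1:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Feature names mirroring the generated survey features.
FEATURES = ["Update", "Sizec", "Freq", "Priority", "Numbb",
            "Core", "Insure", "Factor", "Cloud", "Critical"]

# Synthetic stand-in data with the same number of features.
rng = np.random.default_rng(4)
X = rng.integers(0, 4, size=(200, len(FEATURES)))
y = rng.integers(0, 2, size=200)

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0).fit(X, y)

# Impurity-based importances (normalized to sum to 1), sorted for plotting.
ranking = sorted(zip(FEATURES, rf.feature_importances_),
                 key=lambda kv: kv[1], reverse=True)
```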
As it is shown in
Figure 2, the features “Update” and “Sizec” play the most important roles for the RF model in detecting companies and organizations with SETA implementation. After these top two features, the features “Priority”, “Numbb”, “Freq”, and “Factor” also have relatively significant impacts on the detection task for the RF classifier. These six features are the more useful ones in the detection task, and among them, the two features in the awareness related feature group obtain the highest and third highest importance, respectively. Thus, the feature group with corporate leadership’s awareness related features has a much higher importance than the other three feature groups, reflecting that companies and organizations whose leadership has a solid awareness of cyber security protection show a great willingness to conduct SETA implementation. As for the companies’ and organizations’ nature related features, only “Sizec” has a significant impact on the detection, obtaining the second highest importance, which indicates that companies and organizations tend to conduct SETA implementation when their scale reaches a certain level. By contrast, the internal and external factor related features have relatively lower importance among the useful features for the classification task.
Compared to the six useful features, the other four features contribute relatively little to the classification. This suggests that the companies and organizations that participated in the survey generally do not weigh heavily the potential attacks involved in using externally-hosted web services, or the potentially higher attack rate brought by a high dependence of products and services on online services.