Article

Exploring the Utility of Anonymized EHR Datasets in Machine Learning Experiments in the Context of the MODELHealth Project

by
Stavros Pitoglou
1,2,*,
Arianna Filntisi
2,
Athanasios Anastasiou
1,
George K. Matsopoulos
1,* and
Dimitrios Koutsouris
1
1
School of Electrical and Computer Engineering, National Technical University of Athens, 157 80 Athens, Greece
2
Computer Solutions SA, 115 27 Athens, Greece
*
Authors to whom correspondence should be addressed.
Appl. Sci. 2022, 12(12), 5942; https://doi.org/10.3390/app12125942
Submission received: 18 April 2022 / Revised: 31 May 2022 / Accepted: 8 June 2022 / Published: 10 June 2022
(This article belongs to the Special Issue Advances in Biomedical Signal Processing in Health Care)

Abstract

:
The object of this paper was the application of machine learning to a clinical dataset that was anonymized using the Mondrian algorithm. (1) Background: The preservation of patient privacy is a necessity arising from the increasing digitization of health data; however, the effect of data anonymization on the performance of machine learning models remains to be explored. (2) Methods: The original EHR-derived dataset was subjected to anonymization by applying the Mondrian algorithm for various k values and quasi-identifier (QI) set attributes. The logistic regression, decision tree, k-nearest neighbors, Gaussian naïve Bayes and support vector machine models were applied to the different dataset versions. (3) Results: The classifiers demonstrated different degrees of resilience to the anonymization, with the decision tree and the KNN models showing remarkably stable performance, as opposed to the Gaussian naïve Bayes model. The choice of the QI set attributes and the generalized information loss value played a more important role than the size of the QI set or the k value. (4) Conclusions: Data anonymization can reduce the performance of certain machine learning models, although the appropriate selection of classifier and parameter values can mitigate this effect.

1. Introduction

The digitization of healthcare workflows and the increased usage of electronic health records have led to a dramatic increase in electronically available clinical data in terms of volume, complexity, diversity and timeliness. Clinical data carry information about real patients and hold the promise of supporting a wide range of unprecedented opportunities and use cases, such as clinical decision support, health insurance, disease surveillance, population health management, adverse event monitoring, and treatment optimization. Therefore, clinical data are suitable to be reused in machine learning applications, despite the fact that they can be characterized by noise, errors, incompleteness and high dimensionality [1].
Even though the digitization of clinical data in the healthcare sector carries many benefits, it also raises challenges such as the preservation of patient privacy. Health data are by default sensitive, and concerns over the compromise of sensitive information, security and privacy are increasing year by year. Security attacks can happen in the data-gathering phase, the network phase, as well as the storage phase [2]. Health information custodians (HICs) have faced increasing privacy breaches of different types, due to either the negligence of administrative staff or the employment of weak de-identification methods [3]. It has been shown that the number of health service providers reporting cases of data privacy breaches is increasing every year. In the years 2016–2017, approximately 90 percent of healthcare providers were faced with data breaches, while a successful breach was estimated to cost around USD 3.7 million to clean up [4,5,6]. Furthermore, it has been shown that seemingly anonymous, de-identified data that are publicly available can be combined and linked to certain individuals or groups through other identifying attributes [7].
Data security controls the access to medical data throughout the data lifecycle, protecting it from any unauthorized third-party access, and is ensured through technologies such as authentication, encryption, data masking and access control. Data privacy, on the other hand, is concerned with the protection of an individual’s healthcare information from unauthorized access. Data privacy regulates data access based on privacy policies and laws and is ensured through methods such as de-identification and anonymization [1,8]. The most prominent data privacy model is k-anonymity [9].
Machine learning is a category of algorithmic methods enabling machines to solve problems without specific computer programming, and can be divided into supervised, unsupervised and reinforcement methods according to the learning approach [8,10]. Despite the fact that the implementation of machine learning tools for health care data faces many challenges, the use of AI in health care is surrounded by excitement, with numerous platforms integrating clinical machine learning tools having been developed [11]. Machine learning models have been utilized in several health informatics fields, such as medical imaging, medical informatics and public health [12]. For example, deep learning models have been used to predict mortality, early readmission, long length of stay, as well as future diseases, from EHRs [13,14]. In addition, machine learning algorithms such as logistic regression, support vector machines, Gaussian naive Bayes, k-nearest neighbors and deep multilayer neural networks have been employed to predict the probability of early patient readmission from processed hospital information system (HIS) data [14].
The topic of privacy-aware machine learning, which lies at the intersection of machine learning and privacy preservation, has begun to be explored in the last few years with the advent of data privacy laws [15]. Examples of relevant research include the investigation of the impact of data anonymization on the performance of machine learning algorithms using the gradient boosting, random forest, logistic regression and linear SVC algorithms [16]. The use of interactive machine learning has also been proposed, eliciting human preferences for preserving some attribute values over others for anonymization [17]. The topic of differential privacy has been investigated by comparing two differential privacy algorithms and evaluating the results by applying three machine learning algorithms to anonymized and raw data [18]. In addition, privacy-preserving protocols for three classifiers have been proposed [19]. The effect of anonymization algorithms on the performance of machine learning classifiers has been explored in several studies [20,21,22]. A novel anonymization algorithm, information-based anonymization for classification given k (IACk), based on normalized mutual information, was introduced in [20]; its effect on the performance of machine learning models was tested using decision trees, naïve Bayes, logistic regression and SVM. Another anonymization algorithm (non-homogeneous generalization with sensitive value distribution, NSVDist), based on an information loss metric, was introduced in [21]. The authors compared NSVDist with the Mondrian [23], privacy-aware information sharing (PAIS) [24] and sequential anonymization (SeqA) [25] algorithms, and evaluated their effect on the performance of the naïve Bayes, SVM, W-J48 and W-JRip classifiers.
A comparison of the anonymization algorithms Mondrian, optimal lattice anonymization (OLA) [26], top-down greedy anonymization (TDG) [27] and the k-nearest neighbor clustering-based anonymization method [28], regarding their impact on the performance of the k-NN, SVM, XGBoost and random forest classifiers, was made in [22]. Since the purpose of some of these studies has been the introduction of novel anonymization methods (IACk [20], NSVDist [21]), it can be noted that the effect of anonymization methods on machine learning performance has become an important consideration for their evaluation and validation.
The object of this paper was the application of machine learning algorithms to a clinical dataset that had been subjected to anonymization using the Mondrian algorithm with various parameter values. The concept of this paper and the dataset used originated in the MODELHealth project, the main object of which has been the development of a software platform for the harmonization and anonymization of electronic health record data, with the goal of utilizing them as input to machine learning models [29,30].

2. Materials and Methods

2.1. Data

The initial dataset was provided by a public Greek hospital, and contained demographic and hospitalization information for 117,181 patients. The one-hot encoding corresponding to the attributes related to patient sex (SEX_F, SEX_M) and the care encounter outcome (OUTCOME_H, OUTCOME_N, OUTCOME_I, OUTCOME_D) was removed in order to facilitate the anonymization process. The attributes of the original dataset, the attributes after the one-hot decoding and their descriptions can be seen in Table 1.

2.2. Anonymization and Information Loss Estimation

2.2.1. K-Anonymity

K-anonymity is the primary anonymization method that has been proposed for the prevention of identity disclosure in data publishing, limiting the probability of linking an individual to their records to 1/k [9,31]. Of central importance to the k-anonymity concept is the quasi-identifier (QI) set, a set of seemingly innocuous dataset attributes whose linkage with external information can lead to the reidentification of individual records, as well as the equivalence class (EQ), a set of dataset records that are indistinguishable from each other with respect to the values of the QI set. A dataset satisfies the k-anonymity constraint if each record is identical to at least k − 1 records with respect to the QI set, which means that each equivalence class EQ consists of at least k records. K-anonymity can be enforced using suppression and generalization techniques. Suppression is achieved by replacing some of the original attribute values with a specific value indicating its non-disclosure. Generalization, or re-coding, is achieved by replacing the attribute values with less specific but consistent values.
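To make the k-anonymity constraint concrete, the following sketch checks whether a table satisfies k-anonymity for a given QI set by verifying that every equivalence class contains at least k records. The column names are hypothetical illustrations, not the actual MODELHealth schema:

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, qi_columns: list, k: int) -> bool:
    """Return True if every equivalence class (group of records sharing
    the same QI values) contains at least k records."""
    class_sizes = df.groupby(qi_columns).size()
    return bool((class_sizes >= k).all())

# Toy example: AGE has been generalized to ranges, SEX kept as-is.
table = pd.DataFrame({
    "AGE": ["20-29", "20-29", "20-29", "30-39", "30-39"],
    "SEX": ["F", "F", "F", "M", "M"],
    "DIAGNOSIS": ["A", "B", "A", "C", "B"],
})
print(is_k_anonymous(table, ["AGE", "SEX"], k=2))  # True: classes of size 3 and 2
print(is_k_anonymous(table, ["AGE", "SEX"], k=3))  # False: the (30-39, M) class has only 2 records
```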

2.2.2. L-Diversity, T-Closeness

Another concept related to data privacy is sensitive attributes, the attributes that patients are not willing to be associated with, such as diagnosis codes [9]. The privacy provided by k-anonymity could be considered insufficient in some cases, since it can potentially allow the disclosure of sensitive attributes that lack diversity through the use of background knowledge. Nevertheless, k-anonymity is the primary algorithm proposed for anonymization and is used as a baseline process [31,32,33,34]. L-diversity and t-closeness are the most prevalent anonymization concepts that extend k-anonymity by considering sensitive attributes. L-diversity focuses on the representation of sensitive attribute values in the anonymized dataset, requiring that each equivalence class contain at least l well-represented values for each sensitive attribute [1,9,34]. T-closeness focuses on limiting the distance between the probability distribution of the sensitive attribute values in an anonymized group and that of the sensitive attribute values in the entire dataset, requiring that the distance between the two distributions be no greater than a threshold t [9,33].
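In the simplest (distinct) form of l-diversity, the requirement reduces to counting distinct sensitive values per equivalence class; a minimal sketch with hypothetical column names:

```python
import pandas as pd

def satisfies_l_diversity(df: pd.DataFrame, qi_columns: list,
                          sensitive: str, l: int) -> bool:
    """Return True if every equivalence class contains at least l
    distinct values of the sensitive attribute (distinct l-diversity)."""
    distinct_per_class = df.groupby(qi_columns)[sensitive].nunique()
    return bool((distinct_per_class >= l).all())

table = pd.DataFrame({
    "AGE": ["20-29", "20-29", "20-29", "30-39", "30-39"],
    "SEX": ["F", "F", "F", "M", "M"],
    "DIAGNOSIS": ["A", "B", "A", "C", "C"],
})
# False: the (30-39, M) class holds a single diagnosis value.
print(satisfies_l_diversity(table, ["AGE", "SEX"], "DIAGNOSIS", l=2))
```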

2.2.3. Mondrian Algorithm

Mondrian is a greedy anonymization algorithm that implements k-anonymization through multidimensional generalization and is applicable to categorical as well as numeric data [23]. According to this method, a k-anonymization of a given dataset is achieved in two stages, the first one focusing on the recursive partitioning of the dataset into a number of multidimensional regions covering its domain space, a process similar to the kd-tree construction method. The second stage focuses on the mapping of generalized values to each dataset partition by applying re-coding functions using summary statistics from each region. The time complexity of the Mondrian algorithm is O(n log n), outperforming other optimal algorithms implementing k-anonymity, whose worst-case complexity is exponential [23].
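The recursive partitioning stage can be sketched as follows for numeric attributes only; this is a simplified illustration of the cut rule (split on the median of the widest QI dimension, only when both halves keep at least k records), not the full algorithm, which also handles categorical attributes and performs the second, re-coding stage:

```python
def mondrian_partition(records, qi_indices, k):
    """Recursively split records on the median of the QI dimension with
    the widest value range, stopping when no allowable (k-preserving)
    cut exists. Returns a list of partitions, each of size >= k."""
    def spread(dim):
        vals = [r[dim] for r in records]
        return max(vals) - min(vals)

    # Try QI dimensions from widest to narrowest range.
    for dim in sorted(qi_indices, key=spread, reverse=True):
        vals = sorted(r[dim] for r in records)
        median = vals[len(vals) // 2]
        left = [r for r in records if r[dim] < median]
        right = [r for r in records if r[dim] >= median]
        # A cut is allowable only if both halves still satisfy k.
        if len(left) >= k and len(right) >= k:
            return (mondrian_partition(left, qi_indices, k)
                    + mondrian_partition(right, qi_indices, k))
    return [records]  # no allowable cut: this is a final equivalence class

data = [(21,), (22,), (25,), (30,), (31,), (35,)]
parts = mondrian_partition(data, qi_indices=[0], k=2)
print([len(p) for p in parts])  # [3, 3]
```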

2.2.4. Information Loss

Anonymization provides the benefit of data privacy, but at the same time negatively affects data utility by causing information loss. Three metrics that can be used for the estimation of information loss due to anonymization are generalized information loss, discernibility and average equivalence class size [35].
Generalized information loss represents the penalty induced from the generalization of a specific attribute by taking into account the fraction of the domain values that have been generalized. The generalized information loss for an anonymized table T* was calculated according to Equation (1), where T is the original table, i = 1, …, n corresponds to an attribute, j = 1, …, |T| corresponds to a table record, Ui, Li are the upper and lower values of each arithmetic attribute i, Uij, Lij are the upper and lower values of arithmetic attribute i for the equivalence class the record j belongs in, Ni is the number of different values for each categorical attribute i, and Nij is the number of different values for categorical attribute i in the equivalence class the record j belongs in [35,36,37].
The discernibility metric captures the indistinguishability of a table record compared to the rest by assigning a penalty to each record, equal to the size of the equivalence class it belongs in. The discernibility metric for an anonymized table T* was calculated according to Equation (2), where |EQ| is the number of records in the equivalence class EQ [23,35,38].
The average equivalence class size estimates how well the equivalence class formulation approaches the optimal case, in which every equivalence class contains k records. It is calculated according to Equation (3), where |T| is the number of table records, |EQs| is the total number of equivalence classes created in the anonymized table T*, and k is the minimum equivalence class size allowed [23,35].
$$\mathrm{GIL}(T^{*}) = \frac{1}{|T| \cdot n} \sum_{i=1}^{n} \sum_{j=1}^{|T|} \begin{cases} \dfrac{U_{ij} - L_{ij}}{U_i - L_i}, & \text{if } i \text{ is arithmetic} \\[6pt] \dfrac{N_{ij} - 1}{N_i - 1}, & \text{if } i \text{ is categorical} \end{cases} \tag{1}$$

$$\mathrm{DM}(T^{*}) = \sum_{\forall EQ \ \text{s.t.}\ |EQ| \ge k} |EQ|^{2} \tag{2}$$

$$C_{AVG}(T^{*}) = \frac{|T|}{|EQs| \cdot k} \tag{3}$$
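Given only the sizes of the equivalence classes of an anonymized table, the DM and CAVG metrics above reduce to a few lines; this sketch uses hypothetical class sizes (GIL additionally needs the per-attribute value ranges, so it is omitted here):

```python
def discernibility(eq_sizes, k):
    """DM: each record is penalized with the size of the equivalence
    class it belongs to, i.e. sum of |EQ|^2 over classes with |EQ| >= k."""
    return sum(s * s for s in eq_sizes if s >= k)

def avg_class_size(eq_sizes, k):
    """CAVG: |T| / (number of equivalence classes * k)."""
    total_records = sum(eq_sizes)
    return total_records / (len(eq_sizes) * k)

sizes = [3, 3, 4]                    # hypothetical equivalence classes for k = 2
print(discernibility(sizes, k=2))    # 3^2 + 3^2 + 4^2 = 34
print(avg_class_size(sizes, k=2))    # 10 / (3 * 2) ≈ 1.667
```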
After one-hot decoding, the original dataset was subjected to anonymization by applying the Mondrian algorithm for various k values and quasi-identifier (QI) set attributes. The choice of algorithm is justified by its performance, as well as by the fact that the dataset does not contain sensitive attributes (Table 1). The information loss that occurred due to the anonymization process was measured with the generalized information loss (GIL), discernibility metric (DM) and average equivalence class size (CAVG) metrics for each parameter combination. The different dataset versions that were generated from the anonymization process, the corresponding anonymization parameter values and the obtained values of the information loss metrics can be seen in Table 2.

2.3. Machine Learning

Machine Learning methods have been utilized extensively in the clinical field for numerous tasks, such as the prediction of clinical outcomes, modeling disease risk, decision support systems and infection management, using a variety of data formats including electronic health records [30,39,40,41].

2.3.1. Logistic Regression

Logistic regression is a statistical method used to model the nonlinear relationship between a categorical dependent variable and the combined effects of the independent variables by applying a logistic function [42,43,44]. Logistic regression has been used widely in the clinical field as a predictive model [44,45], with several studies reporting that more recent artificial intelligence methods show no performance benefit over logistic regression for clinical predictions [45,46]. Logistic regression is known to perform well in large, low-dimensional datasets, and has reportedly been the most frequently used model design technique in clinical decision support systems [47,48,49].

2.3.2. Decision Trees

A decision tree is a supervised learning method that maps the input features related to an item to a predicted target value by modeling the input features of a given item, the feature values and the target classes as a flowchart-like tree structure. The classification of an input item is achieved by following a path along the decision tree from the root node to the leaves, where the tree nodes correspond to the feature names, the arcs correspond to the possible feature values, and the leaves are labeled with the different classes [42,50,51,52]. Common algorithms for decision trees are C4.5, C5.0, and Bayesian trees [53]. Decision trees are well suited for knowledge domains that can be defined by a relatively small set of rules [54].

2.3.3. K Nearest Neighbors

The k-nearest neighbors (KNN) method is a supervised learning algorithm that maps an input vector to a predicted target value by finding the set of K labeled vectors in the feature space that have the smallest distance from the unlabeled input vector. The classification of the input vector is then based on the predominant class in this neighborhood. The KNN algorithm performs better when the data form well-defined clusters, since its predictions are based on distances between data points [50,52,54,55]. In this paper, the KNN model was applied for K = 5 nearest neighbors.

2.3.4. Support Vector Machines

The support vector machine (SVM) is a supervised learning model characterized by the formation of a set of hyperplanes that separate the input vectors into a number of classes. The hyperplanes can then be used to determine the most probable class for unknown input data. SVMs are known to perform well with high-dimensional data [56,57].

2.3.5. Gaussian Naive Bayes

The Gaussian naive Bayes method is a machine learning method that is based on Bayes’ theorem and assumes conditional independence among the input features. Despite the fact that conditional independence rarely holds in real-world problems, the Gaussian naive Bayes classifier has been reported to demonstrate high performance [39,58,59].
In this paper, the logistic regression, decision tree, k-nearest neighbors, Gaussian naive Bayes and support vector machine methods were applied, using 10-fold cross-validation, to the different dataset versions that resulted from the anonymization of the original EHR-derived dataset. The parameter values of the applied models are listed in Table A1.
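The evaluation loop, five scikit-learn classifiers scored with 10-fold cross-validation, can be sketched as follows. The synthetic dataset is an illustrative stand-in for the anonymized EHR versions, and the parameter values mirror the scikit-learn defaults listed in Table A1:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for one anonymized dataset version.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "Decision Tree": DecisionTreeClassifier(),
    "KNN (K=5)": KNeighborsClassifier(n_neighbors=5),
    "Gaussian NB": GaussianNB(),
    "SVM": SVC(kernel="rbf", gamma="auto"),
}

scores = {}
for name, model in models.items():
    # 10-fold cross-validation, scored with ROC AUC as in the paper.
    scores[name] = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name}: {scores[name]:.3f}")
```

In the actual experiments this loop would be repeated once per (qi, k) dataset version, collecting the AUC and MCC scores reported in Table A2.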

3. Results

The results were evaluated using the area under the curve (AUC) and the Matthews correlation coefficient (MCC) metrics, both of which are used extensively in the fields of medical informatics and bioinformatics. The AUC is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate [60,61]. The Matthews correlation coefficient (MCC) is a statistical metric that produces a high score only if the prediction obtained good results in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally to both the size of positive elements and the size of negative elements in the dataset [62,63]. The results of the experiments are listed exhaustively in Table A2 and demonstrated in Figure 1 and Figure 2.
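Both metrics are available in scikit-learn; a minimal sketch on toy labels (the values below are illustrative, not experimental results):

```python
from sklearn.metrics import roc_auc_score, matthews_corrcoef

y_true  = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.8, 0.7, 0.3, 0.2]          # predicted probabilities
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]  # thresholded labels

auc = roc_auc_score(y_true, y_score)     # AUC needs scores, not labels
mcc = matthews_corrcoef(y_true, y_pred)  # MCC needs hard label predictions
print(f"AUC = {auc:.3f}, MCC = {mcc:.3f}")  # AUC = 0.889, MCC = 0.707
```

Note that AUC is computed from the ranking of the continuous scores, while MCC is computed from the thresholded confusion matrix, which is why the two metrics can disagree.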
In Figure 1, the horizontal axis contains all of the (qi, k) value combinations, where qi is the ID of the quasi-identifier set as presented in Table 2. In Figure 1a, the AUC and MCC scores for the test dataset are plotted against k for each of the qi values, while Figure 1b depicts AUC and MCC plotted against qi for each of the k values. The parameters qi = 0, k = 1 represent the case in which the machine learning models were applied to the non-anonymized dataset. It can be observed that most machine learning models achieved better AUC and MCC scores with the non-anonymized dataset, which means that the information loss occurring from the anonymization process in general reduced the predictive ability of the models. However, the five models were affected to various degrees, with Gaussian naive Bayes suffering the most prominent performance reduction, the decision tree and KNN classifiers showing the most stable performance across the different anonymization parameter values, and logistic regression presenting a performance improvement as a result of the anonymization process for qi = 3b, 4.
In Figure 2, the AUC score and the MCC score are plotted juxtaposed with the GIL value against the anonymization parameter values, separately for each tested model. Figure 2a depicts AUC, MCC and GIL plotted against k for each qi value separately, while Figure 2b depicts AUC, MCC and GIL plotted against qi for each k value. It can be observed that lower GIL values were associated with higher AUC, MCC values in the cases of the logistic regression and Gaussian naive Bayes models. On the other hand, the decision tree, the KNN and the SVM classifiers were less affected by the variation in the GIL values.
The demonstrated results indicate that for a specific qi value, the k value did not affect the performance of the tested machine learning models significantly. Contrarily, all five models demonstrated lower performance for qi = 3a, which corresponded to the QI attributes (AGE, SEX, OUTCOME). However, the performance of most models was noticeably better for qi = 2, 3b and 4, which corresponded to the QI attributes (AGE, SEX), (AGE, SEX, CURADM_DAYS) and (AGE, SEX, CURADM_DAYS, PREVADM_DAYS), respectively. This indicates that the OUTCOME input variable plays a more prominent role than the AGE, SEX, CURADM_DAYS and PREVADM_DAYS attributes regarding the prediction of readmission. It can also be deduced that, in terms of predictive power, the choice of the QI set attributes is more significant than the size of the QI set or the k value.
In order to quantify the relationship between the GIL metric and the performance metrics AUC, MCC, two linear regression models were built for each applied machine learning model and were trained on 70% of the data. The linear regression models plotted against the test data are depicted in Figure 3, and their equations are presented in Table A3.
The statistical significance of the presented results was evaluated by performing a series of statistical tests. The descriptive statistics of the AUC and MCC metrics of the experiments with different ML models are presented in Table 3 and Figure 4. Both a parametric one-way ANOVA (Welch’s test [64]) and a non-parametric test (Kruskal–Wallis [65]), the latter accounting for the uncertain normality of the result distributions, showed statistically significant differences among the different algorithms (p < 0.001). Details of the analysis, as well as pairwise comparisons, are presented in Table A4, Table A5, Table A6, Table A7 and Table A8 of Appendix A.
As expected, there was a statistically significant negative correlation between the information loss (GIL) induced by the anonymization process and the model performance (GIL–test MCC Pearson’s r = −0.182, 95% CI [−0.335, −0.020], p = 0.028). Figure 5 depicts the correlation scatterplot and the respective value densities.
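This type of correlation analysis can be reproduced with SciPy; the sketch below runs on synthetic (GIL, MCC) pairs with a weak negative relationship built in, not on the actual experimental measurements:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
gil = rng.uniform(0.0, 0.4, size=100)
# Simulate a weak negative relationship between information loss and test MCC.
mcc = 0.45 - 0.2 * gil + rng.normal(0.0, 0.05, size=100)

r, p = stats.pearsonr(gil, mcc)
print(f"Pearson r = {r:.3f}, p = {p:.4f}")
```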

4. Discussion

In this paper, we presented a series of experiments designed to explore the predictive potential of anonymized medical datasets subjected to various degrees of manipulation. The original dataset used consisted of 117,181 records derived from EHR data from a Greek hospital and nine attributes related to the patient and the hospitalization (Table 1). The data were collected in the context of the MODELHealth project, which focused on the development of a system for the extraction of electronic health record data from a hospital database, the transformation of the data through harmonization and anonymization, and their loading into a central database in order for them to be used as inputs to machine learning models [29,30].
The dataset was subjected to one-hot decoding and anonymization using the Mondrian algorithm. Five attributes of the dataset were used as candidates for anonymization, with four combinations of them being used to form the quasi-identifier (QI) set. In addition, seven values in the interval 2–30 were used for the k anonymization parameter, which was the minimum number of records with the same values in the QI set (Table 2). The non-anonymized dataset, as well as the 28 dataset versions resulting from the different anonymization parameter combinations, were fed as input into five machine learning models, namely the decision tree, the logistic regression, the k-nearest neighbors, the support vector machine and the Gaussian naïve Bayes classifiers, the parameters of which have been presented in Table A1.
The main emerging finding is that the loss of predictive power (that is, predicted class pattern consistency in a machine learning classification context) as a function of the information loss due to aggregation and suppression anonymization processes varies considerably, depending on the nature of the classification algorithm used. Indeed, in our experiments, there were classifiers that showed great resilience even under significant information loss, and others that directly or indirectly were critically affected, with their accuracy metrics plunging towards non-significance (Figure 1 and Figure 2). More specifically, the decision tree and the k-nearest neighbors classifiers demonstrated noticeably stable performance regardless of the anonymization parameters, as opposed to the Gaussian naïve Bayes and the support vector machine classifiers. Remarkably, the logistic regression model demonstrated improved performance for various (qi, k) combinations of the anonymized dataset in comparison with the non-anonymized dataset, indicating that anonymization could have an effect similar to the regularization of machine learning models, although further experiments should be performed in order to reinforce that statement.
The experimental results showcased the choice of QI set attributes as well as the GIL value resulting from the anonymization process as the factors that seem to play a more significant role regarding the prediction results, rather than the size of the QI set or the k value. Indeed, the inclusion of the OUTCOME attribute, representing the hospitalization outcome, in the anonymized attribute set resulted in performance deterioration for all tested machine learning models. Contrarily, the rest of the attributes involved in the anonymization process, which were the patient’s age and sex, the number of days of the current admission, and the cumulative number of days of previous hospital admissions, did not have a similarly strong effect on the performance of the classifiers.
Our study reached conclusions similar to previously published work regarding the effect of anonymization on the performance of machine learning models. Indeed, the resilience of tree-based models to the information loss induced by anonymization has been mentioned in [22]. A difference in our approach compared to other published work was its stronger focus on various quasi-identifier sets and their effect on the classification performance. It could be said that just as the selection of input features of a machine learning model plays an important role in its performance, the selection of the quasi-identifier set can be a significant factor for the performance of machine learning on anonymized datasets. In addition, in the context of our experiments, the features used as input for the machine learning models were a superset of the quasi-identifier set, an approach that differed from similar published work, where the whole set of input features was anonymized.
The presented study can be extended by the application of deep learning neural networks and experimentation with their architecture and hyperparameters in order to investigate their resilience to the anonymization process. In addition, the effects of different anonymization algorithms on the predictive ability of machine learning models can be explored.

5. Conclusions

The anonymization of clinical data can have a negative impact on the performance of some machine learning models. However, the selection of appropriate models and parameter values can compensate for this effect, providing the opportunity to benefit from the application of machine learning while protecting patient privacy.

Author Contributions

Conceptualization, S.P. and A.F.; methodology, S.P. and A.F.; software, S.P. and A.F.; validation, S.P., A.F. and A.A.; writing—original draft preparation, S.P. and A.F.; writing—review and editing, S.P., A.F., A.A., G.K.M. and D.K.; supervision, S.P., G.K.M. and D.K.; project administration, S.P.; funding acquisition, D.K., G.K.M. and S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was co-funded by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship, and Innovation, under the call RESEARCH-CREATE-INNOVATE (Project Code: T1EDK-04066).

Informed Consent Statement

Not Applicable. Data used for this research purposes qualify as “anonymous information” per the definition provided in Recital 26 of the General Data Protection Regulation (GDPR) of the European Union (‘…information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable’).

Data Availability Statement

Raw data were generated in the context of the MODELHealth project, which has been co-funded by the European Regional Development Fund of the European Union and Greek national funds. Derived data supporting the findings of this study are available from the corresponding author S.P., upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. The machine learning models and the corresponding parameter values applied in the context of this paper.
Machine Learning Model | Parameters
Logistic Regression | C = 1.0, class_weight = None, dual = False, fit_intercept = True, intercept_scaling = 1, l1_ratio = None, max_iter = 100, multi_class = ‘auto’, n_jobs = None, penalty = ‘l2’, random_state = None, solver = ‘liblinear’, tol = 0.0001, verbose = 0, warm_start = False
Decision Tree Classifier | ccp_alpha = 0.0, class_weight = None, criterion = ‘gini’, max_depth = None, max_features = None, max_leaf_nodes = None, min_impurity_decrease = 0.0, min_impurity_split = None, min_samples_leaf = 1, min_samples_split = 2, min_weight_fraction_leaf = 0.0, presort = ‘deprecated’, random_state = None, splitter = ‘best’
KNeighborsClassifier | algorithm = ‘auto’, leaf_size = 30, metric = ‘minkowski’, metric_params = None, n_jobs = None, n_neighbors = 5, p = 2, weights = ‘uniform’
GaussianNB | priors = None, var_smoothing = 1 × 10⁻⁹
SVC | C = 1.0, break_ties = False, cache_size = 200, class_weight = None, coef0 = 0.0, decision_function_shape = ‘ovr’, degree = 3, gamma = ‘auto’, kernel = ‘rbf’, max_iter = −1, probability = False, random_state = None, shrinking = True, tol = 0.001, verbose = False
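The parameter names in Table A1 match the scikit-learn estimator API, so the five classifiers can be instantiated as sketched below. This is an illustrative sketch, not the paper's code: the deprecated `presort` and removed `min_impurity_split` options are omitted so the snippet runs on current scikit-learn, and only the parameters with non-default values in Table A1 are spelled out.

```python
# Sketch: the five classifiers of Table A1, instantiated with scikit-learn.
# Deprecated options from the original runs (presort, min_impurity_split)
# are omitted; the remaining values follow Table A1.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = {
    "LogisticRegression": LogisticRegression(
        C=1.0, penalty="l2", solver="liblinear", tol=1e-4, max_iter=100),
    "DecisionTreeClassifier": DecisionTreeClassifier(
        criterion="gini", splitter="best",
        min_samples_split=2, min_samples_leaf=1),
    "KNeighborsClassifier": KNeighborsClassifier(
        n_neighbors=5, weights="uniform", metric="minkowski", p=2),
    "GaussianNB": GaussianNB(var_smoothing=1e-9),
    "SVC": SVC(C=1.0, kernel="rbf", gamma="auto", tol=1e-3),
}
```

Each estimator in the dictionary can then be fitted on the (anonymized or original) training set with the usual `fit`/`predict` interface.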
Table A2. The results of the application of five machine learning models to the dataset for various values of the parameters QI, k. The performance of the models was evaluated with the area under the curve (AUC) and Matthews correlation coefficient (MCC) metrics. The generalized information loss (GIL), discernibility metric (DM) and average class size (CAVG) captured the information loss occurring due to the anonymization process.
Classifier | QI | k | GIL | DM | CAVG | Train AUC | Validation AUC | Test AUC | Test MCC
Logistic Regression | 0 | 1 | 0 | 0 | 0 | 0.723 | 0.723 | 0.723 | 0.456
 | 2 | 2 | 0.394 | 346,873,092 | 492.353 | 0.736 | 0.729 | 0.735 | 0.477
 | 2 | 3 | 0.394 | 346,873,108 | 331.017 | 0.717 | 0.709 | 0.716 | 0.438
 | 2 | 5 | 0.394 | 346,873,168 | 202.034 | 0.725 | 0.716 | 0.724 | 0.454
 | 2 | 10 | 0.394 | 346,873,396 | 102.789 | 0.726 | 0.719 | 0.725 | 0.457
 | 2 | 15 | 0.394 | 346,875,014 | 70.378 | 0.73 | 0.723 | 0.729 | 0.465
 | 2 | 20 | 0.394 | 346,878,574 | 54.25 | 0.724 | 0.719 | 0.723 | 0.454
 | 2 | 30 | 0.394 | 346,884,568 | 37.558 | 0.719 | 0.717 | 0.719 | 0.445
 | 3a | 2 | 0.352 | 254,366,790 | 209.25 | 0.685 | 0.661 | 0.682 | 0.365
 | 3a | 3 | 0.352 | 254,366,928 | 144.133 | 0.701 | 0.685 | 0.7 | 0.401
 | 3a | 5 | 0.352 | 254,367,482 | 91.547 | 0.692 | 0.674 | 0.69 | 0.38
 | 3a | 10 | 0.353 | 254,371,334 | 50.509 | 0.689 | 0.668 | 0.687 | 0.374
 | 3a | 15 | 0.353 | 254,378,570 | 36 | 0.679 | 0.649 | 0.676 | 0.352
 | 3a | 20 | 0.354 | 254,353,268 | 28.721 | 0.69 | 0.672 | 0.688 | 0.377
 | 3a | 30 | 0.355 | 254,372,510 | 20.134 | 0.686 | 0.667 | 0.684 | 0.369
 | 3b | 2 | 0.214 | 75,443,862 | 21.275 | 0.747 | 0.732 | 0.746 | 0.493
 | 3b | 3 | 0.215 | 75,448,478 | 16.502 | 0.748 | 0.719 | 0.745 | 0.493
 | 3b | 5 | 0.216 | 75,461,502 | 11.83 | 0.753 | 0.729 | 0.75 | 0.502
 | 3b | 10 | 0.219 | 75,530,606 | 7.73 | 0.75 | 0.727 | 0.748 | 0.497
 | 3b | 15 | 0.22 | 75,632,804 | 6.005 | 0.752 | 0.721 | 0.749 | 0.499
 | 3b | 20 | 0.222 | 75,820,294 | 5.208 | 0.736 | 0.708 | 0.733 | 0.467
 | 3b | 30 | 0.225 | 76,159,906 | 4.099 | 0.745 | 0.718 | 0.742 | 0.487
 | 4 | 2 | 0.085 | 7,964,110 | 3.789 | 0.778 | 0.733 | 0.773 | 0.548
 | 4 | 3 | 0.093 | 8,016,196 | 3.434 | 0.779 | 0.726 | 0.774 | 0.549
 | 4 | 5 | 0.104 | 8,145,452 | 3.012 | 0.782 | 0.731 | 0.777 | 0.554
 | 4 | 10 | 0.116 | 8,563,030 | 2.525 | 0.778 | 0.725 | 0.773 | 0.546
 | 4 | 15 | 0.123 | 9,082,286 | 2.304 | 0.776 | 0.717 | 0.77 | 0.541
 | 4 | 20 | 0.126 | 9,621,138 | 2.138 | 0.767 | 0.713 | 0.762 | 0.525
 | 4 | 30 | 0.131 | 10,880,770 | 1.953 | 0.766 | 0.721 | 0.761 | 0.522
Decision Tree Classifier | 0 | 1 | 0 | 0 | 0 | 0.959 | 0.658 | 0.929 | 0.861
 | 2 | 2 | 0.394 | 346,873,092 | 492.353 | 0.941 | 0.67 | 0.914 | 0.83
 | 2 | 3 | 0.394 | 346,873,108 | 331.017 | 0.934 | 0.655 | 0.906 | 0.814
 | 2 | 5 | 0.394 | 346,873,168 | 202.034 | 0.938 | 0.657 | 0.91 | 0.821
 | 2 | 10 | 0.394 | 346,873,396 | 102.789 | 0.936 | 0.663 | 0.908 | 0.818
 | 2 | 15 | 0.394 | 346,875,014 | 70.378 | 0.94 | 0.657 | 0.911 | 0.824
 | 2 | 20 | 0.394 | 346,878,574 | 54.25 | 0.941 | 0.669 | 0.914 | 0.83
 | 2 | 30 | 0.394 | 346,884,568 | 37.558 | 0.94 | 0.66 | 0.912 | 0.825
 | 3a | 2 | 0.352 | 254,366,790 | 209.25 | 0.916 | 0.65 | 0.889 | 0.781
 | 3a | 3 | 0.352 | 254,366,928 | 144.133 | 0.921 | 0.654 | 0.894 | 0.792
 | 3a | 5 | 0.352 | 254,367,482 | 91.547 | 0.917 | 0.654 | 0.891 | 0.784
 | 3a | 10 | 0.353 | 254,371,334 | 50.509 | 0.918 | 0.649 | 0.892 | 0.785
 | 3a | 15 | 0.353 | 254,378,570 | 36 | 0.91 | 0.636 | 0.882 | 0.766
 | 3a | 20 | 0.354 | 254,353,268 | 28.721 | 0.911 | 0.64 | 0.884 | 0.774
 | 3a | 30 | 0.355 | 254,372,510 | 20.134 | 0.92 | 0.657 | 0.894 | 0.791
 | 3b | 2 | 0.214 | 75,443,862 | 21.275 | 0.933 | 0.662 | 0.906 | 0.813
 | 3b | 3 | 0.215 | 75,448,478 | 16.502 | 0.936 | 0.666 | 0.909 | 0.821
 | 3b | 5 | 0.216 | 75,461,502 | 11.83 | 0.937 | 0.667 | 0.91 | 0.821
 | 3b | 10 | 0.219 | 75,530,606 | 7.73 | 0.939 | 0.687 | 0.914 | 0.829
 | 3b | 15 | 0.22 | 75,632,804 | 6.005 | 0.939 | 0.674 | 0.912 | 0.826
 | 3b | 20 | 0.222 | 75,820,294 | 5.208 | 0.93 | 0.646 | 0.902 | 0.806
 | 3b | 30 | 0.225 | 76,159,906 | 4.099 | 0.933 | 0.675 | 0.907 | 0.817
 | 4 | 2 | 0.085 | 7,964,110 | 3.789 | 0.946 | 0.664 | 0.918 | 0.837
 | 4 | 3 | 0.093 | 8,016,196 | 3.434 | 0.941 | 0.679 | 0.915 | 0.835
 | 4 | 5 | 0.104 | 8,145,452 | 3.012 | 0.943 | 0.686 | 0.918 | 0.837
 | 4 | 10 | 0.116 | 8,563,030 | 2.525 | 0.94 | 0.671 | 0.913 | 0.827
 | 4 | 15 | 0.123 | 9,082,286 | 2.304 | 0.941 | 0.671 | 0.914 | 0.83
 | 4 | 20 | 0.126 | 9,621,138 | 2.138 | 0.933 | 0.67 | 0.907 | 0.819
 | 4 | 30 | 0.131 | 10,880,770 | 1.953 | 0.932 | 0.676 | 0.906 | 0.814
KNN | 0 | 1 | 0 | 0 | 0 | 0.793 | 0.712 | 0.785 | 0.57
 | 2 | 2 | 0.394 | 346,873,092 | 492.353 | 0.784 | 0.702 | 0.776 | 0.553
 | 2 | 3 | 0.394 | 346,873,108 | 331.017 | 0.772 | 0.686 | 0.763 | 0.527
 | 2 | 5 | 0.394 | 346,873,168 | 202.034 | 0.782 | 0.691 | 0.773 | 0.546
 | 2 | 10 | 0.394 | 346,873,396 | 102.789 | 0.778 | 0.702 | 0.771 | 0.542
 | 2 | 15 | 0.394 | 346,875,014 | 70.378 | 0.778 | 0.702 | 0.771 | 0.542
 | 2 | 20 | 0.394 | 346,878,574 | 54.25 | 0.772 | 0.683 | 0.763 | 0.527
 | 2 | 30 | 0.394 | 346,884,568 | 37.558 | 0.779 | 0.694 | 0.771 | 0.542
 | 3a | 2 | 0.352 | 254,366,790 | 209.25 | 0.764 | 0.671 | 0.755 | 0.51
 | 3a | 3 | 0.352 | 254,366,928 | 144.133 | 0.775 | 0.691 | 0.767 | 0.534
 | 3a | 5 | 0.352 | 254,367,482 | 91.547 | 0.763 | 0.673 | 0.754 | 0.508
 | 3a | 10 | 0.353 | 254,371,334 | 50.509 | 0.764 | 0.671 | 0.755 | 0.51
 | 3a | 15 | 0.353 | 254,378,570 | 36 | 0.758 | 0.664 | 0.749 | 0.497
 | 3a | 20 | 0.354 | 254,353,268 | 28.721 | 0.756 | 0.661 | 0.746 | 0.493
 | 3a | 30 | 0.355 | 254,372,510 | 20.134 | 0.761 | 0.668 | 0.752 | 0.505
 | 3b | 2 | 0.214 | 75,443,862 | 21.275 | 0.783 | 0.704 | 0.775 | 0.551
 | 3b | 3 | 0.215 | 75,448,478 | 16.502 | 0.788 | 0.712 | 0.78 | 0.563
 | 3b | 5 | 0.216 | 75,461,502 | 11.83 | 0.771 | 0.691 | 0.763 | 0.526
 | 3b | 10 | 0.219 | 75,530,606 | 7.73 | 0.789 | 0.708 | 0.781 | 0.562
 | 3b | 15 | 0.22 | 75,632,804 | 6.005 | 0.778 | 0.693 | 0.77 | 0.541
 | 3b | 20 | 0.222 | 75,820,294 | 5.208 | 0.771 | 0.68 | 0.762 | 0.525
 | 3b | 30 | 0.225 | 76,159,906 | 4.099 | 0.781 | 0.695 | 0.772 | 0.545
 | 4 | 2 | 0.085 | 7,964,110 | 3.789 | 0.782 | 0.693 | 0.773 | 0.547
 | 4 | 3 | 0.093 | 8,016,196 | 3.434 | 0.785 | 0.694 | 0.776 | 0.552
 | 4 | 5 | 0.104 | 8,145,452 | 3.012 | 0.788 | 0.71 | 0.78 | 0.561
 | 4 | 10 | 0.116 | 8,563,030 | 2.525 | 0.78 | 0.702 | 0.773 | 0.545
 | 4 | 15 | 0.123 | 9,082,286 | 2.304 | 0.792 | 0.7 | 0.783 | 0.566
 | 4 | 20 | 0.126 | 9,621,138 | 2.138 | 0.785 | 0.702 | 0.777 | 0.553
 | 4 | 30 | 0.131 | 10,880,770 | 1.953 | 0.776 | 0.696 | 0.768 | 0.537
Gaussian NB | 0 | 1 | 0 | 0 | 0 | 0.708 | 0.708 | 0.708 | 0.431
 | 2 | 2 | 0.394 | 346,873,092 | 492.353 | 0.561 | 0.55 | 0.56 | 0.184
 | 2 | 3 | 0.394 | 346,873,108 | 331.017 | 0.558 | 0.548 | 0.557 | 0.196
 | 2 | 5 | 0.394 | 346,873,168 | 202.034 | 0.554 | 0.547 | 0.553 | 0.192
 | 2 | 10 | 0.394 | 346,873,396 | 102.789 | 0.539 | 0.529 | 0.538 | 0.154
 | 2 | 15 | 0.394 | 346,875,014 | 70.378 | 0.56 | 0.543 | 0.559 | 0.179
 | 2 | 20 | 0.394 | 346,878,574 | 54.25 | 0.57 | 0.55 | 0.568 | 0.191
 | 2 | 30 | 0.394 | 346,884,568 | 37.558 | 0.56 | 0.549 | 0.559 | 0.193
 | 3a | 2 | 0.352 | 254,366,790 | 209.25 | 0.559 | 0.548 | 0.558 | 0.212
 | 3a | 3 | 0.352 | 254,366,928 | 144.133 | 0.554 | 0.543 | 0.553 | 0.192
 | 3a | 5 | 0.352 | 254,367,482 | 91.547 | 0.563 | 0.551 | 0.561 | 0.209
 | 3a | 10 | 0.353 | 254,371,334 | 50.509 | 0.554 | 0.538 | 0.553 | 0.182
 | 3a | 15 | 0.353 | 254,378,570 | 36 | 0.563 | 0.549 | 0.561 | 0.209
 | 3a | 20 | 0.354 | 254,353,268 | 28.721 | 0.561 | 0.551 | 0.56 | 0.205
 | 3a | 30 | 0.355 | 254,372,510 | 20.134 | 0.553 | 0.539 | 0.551 | 0.186
 | 3b | 2 | 0.214 | 75,443,862 | 21.275 | 0.568 | 0.544 | 0.566 | 0.218
 | 3b | 3 | 0.215 | 75,448,478 | 16.502 | 0.577 | 0.554 | 0.575 | 0.24
 | 3b | 5 | 0.216 | 75,461,502 | 11.83 | 0.56 | 0.534 | 0.557 | 0.206
 | 3b | 10 | 0.219 | 75,530,606 | 7.73 | 0.577 | 0.55 | 0.574 | 0.246
 | 3b | 15 | 0.22 | 75,632,804 | 6.005 | 0.586 | 0.55 | 0.582 | 0.237
 | 3b | 20 | 0.222 | 75,820,294 | 5.208 | 0.568 | 0.539 | 0.565 | 0.218
 | 3b | 30 | 0.225 | 76,159,906 | 4.099 | 0.568 | 0.544 | 0.566 | 0.22
 | 4 | 2 | 0.085 | 7,964,110 | 3.789 | 0.618 | 0.55 | 0.611 | 0.347
 | 4 | 3 | 0.093 | 8,016,196 | 3.434 | 0.626 | 0.544 | 0.618 | 0.363
 | 4 | 5 | 0.104 | 8,145,452 | 3.012 | 0.641 | 0.558 | 0.633 | 0.383
 | 4 | 10 | 0.116 | 8,563,030 | 2.525 | 0.627 | 0.555 | 0.62 | 0.355
 | 4 | 15 | 0.123 | 9,082,286 | 2.304 | 0.638 | 0.572 | 0.632 | 0.379
 | 4 | 20 | 0.126 | 9,621,138 | 2.138 | 0.652 | 0.581 | 0.645 | 0.388
 | 4 | 30 | 0.131 | 10,880,770 | 1.953 | 0.661 | 0.61 | 0.656 | 0.397
SVC | 0 | 1 | 0 | 0 | 0 | 0.711 | 0.711 | 0.711 | 0.437
 | 2 | 2 | 0.394 | 346,873,092 | 492.353 | 0.709 | 0.709 | 0.709 | 0.431
 | 2 | 3 | 0.394 | 346,873,108 | 331.017 | 0.686 | 0.686 | 0.686 | 0.387
 | 2 | 5 | 0.394 | 346,873,168 | 202.034 | 0.688 | 0.688 | 0.688 | 0.391
 | 2 | 10 | 0.394 | 346,873,396 | 102.789 | 0.695 | 0.695 | 0.695 | 0.404
 | 2 | 15 | 0.394 | 346,875,014 | 70.378 | 0.703 | 0.703 | 0.703 | 0.421
 | 2 | 20 | 0.394 | 346,878,574 | 54.25 | 0.701 | 0.7 | 0.701 | 0.416
 | 2 | 30 | 0.394 | 346,884,568 | 37.558 | 0.69 | 0.691 | 0.69 | 0.397
 | 3a | 2 | 0.352 | 254,366,790 | 209.25 | 0.571 | 0.571 | 0.571 | 0.221
 | 3a | 3 | 0.352 | 254,366,928 | 144.133 | 0.584 | 0.57 | 0.583 | 0.222
 | 3a | 5 | 0.352 | 254,367,482 | 91.547 | 0.588 | 0.578 | 0.587 | 0.22
 | 3a | 10 | 0.353 | 254,371,334 | 50.509 | 0.583 | 0.581 | 0.583 | 0.217
 | 3a | 15 | 0.353 | 254,378,570 | 36 | 0.568 | 0.567 | 0.568 | 0.21
 | 3a | 20 | 0.354 | 254,353,268 | 28.721 | 0.593 | 0.59 | 0.592 | 0.213
 | 3a | 30 | 0.355 | 254,372,510 | 20.134 | 0.579 | 0.578 | 0.579 | 0.176
 | 3b | 2 | 0.214 | 75,443,862 | 21.275 | 0.695 | 0.695 | 0.695 | 0.402
 | 3b | 3 | 0.215 | 75,448,478 | 16.502 | 0.7 | 0.7 | 0.7 | 0.416
 | 3b | 5 | 0.216 | 75,461,502 | 11.83 | 0.696 | 0.696 | 0.696 | 0.407
 | 3b | 10 | 0.219 | 75,530,606 | 7.73 | 0.694 | 0.694 | 0.694 | 0.405
 | 3b | 15 | 0.22 | 75,632,804 | 6.005 | 0.703 | 0.703 | 0.703 | 0.422
 | 3b | 20 | 0.222 | 75,820,294 | 5.208 | 0.681 | 0.681 | 0.681 | 0.379
 | 3b | 30 | 0.225 | 76,159,906 | 4.099 | 0.698 | 0.697 | 0.698 | 0.409
 | 4 | 2 | 0.085 | 7,964,110 | 3.789 | 0.72 | 0.721 | 0.72 | 0.444
 | 4 | 3 | 0.093 | 8,016,196 | 3.434 | 0.714 | 0.712 | 0.714 | 0.433
 | 4 | 5 | 0.104 | 8,145,452 | 3.012 | 0.707 | 0.704 | 0.707 | 0.423
 | 4 | 10 | 0.116 | 8,563,030 | 2.525 | 0.703 | 0.699 | 0.702 | 0.416
 | 4 | 15 | 0.123 | 9,082,286 | 2.304 | 0.696 | 0.696 | 0.696 | 0.407
 | 4 | 20 | 0.126 | 9,621,138 | 2.138 | 0.69 | 0.689 | 0.69 | 0.396
 | 4 | 30 | 0.131 | 10,880,770 | 1.953 | 0.696 | 0.695 | 0.695 | 0.405
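The two performance metrics of Table A2 correspond directly to scikit-learn functions; the sketch below computes them on a toy set of labels and scores (the data are illustrative, not the paper's).

```python
# Sketch: computing the Table A2 evaluation metrics with scikit-learn.
from sklearn.metrics import roc_auc_score, matthews_corrcoef

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # continuous scores, e.g. predicted probabilities
y_pred = [0, 1, 1, 1]            # hard labels after thresholding the scores

auc = roc_auc_score(y_true, y_score)    # area under the ROC curve
mcc = matthews_corrcoef(y_true, y_pred)  # Matthews correlation coefficient
```

Note that AUC is computed from the continuous scores, while MCC needs the thresholded class labels; this is why the two metrics can rank classifiers differently.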
Table A3. The parameters of the linear models fitted to model the relationship between AUC and GIL, as well as MCC and GIL, for each machine learning model used in the context of this paper.
Machine Learning Model | Linear Regression AUC vs. GIL | Linear Regression MCC vs. GIL
Logistic Regression | y_AUC = 0.77886 − 0.1914 × x_GIL | y_MCC = 0.54762 − 0.3393 × x_GIL
Decision Tree Classifier | y_AUC = 0.9182 − 0.03648 × x_GIL | y_MCC = 0.84385 − 0.1185 × x_GIL
KNeighborsClassifier | y_AUC = 0.78136 − 0.04867 × x_GIL | y_MCC = 0.56195 − 0.0921 × x_GIL
GaussianNB | y_AUC = 0.659 − 0.2924 × x_GIL | y_MCC = 0.41258 − 0.6357 × x_GIL
SVC | y_AUC = 0.72082 − 0.1865 × x_GIL | y_MCC = 0.44355 − 0.26539528 × x_GIL
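Lines of the form y = a + b·x_GIL, as in Table A3, can be fitted by ordinary least squares; a minimal sketch with `numpy.polyfit` follows. The data points are synthetic (generated exactly on the logistic regression line of Table A3) purely to show that the fit recovers the intercept and slope.

```python
# Sketch: least-squares fit of AUC against GIL, as for the lines of Table A3.
import numpy as np

gil = np.array([0.0, 0.085, 0.214, 0.352, 0.394])  # example GIL values
auc = 0.77886 - 0.1914 * gil  # synthetic points lying exactly on the LR line

# polyfit with degree 1 returns [slope, intercept]
slope, intercept = np.polyfit(gil, auc, 1)
```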
Table A4. The results of one-way ANOVA analysis (Welch’s test).
 | F | df1 | df2 | p
Test MCC | 908 | 4 | 65.9 | <0.001
Test AUC | 1163 | 4 | 65.6 | <0.001
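Welch's one-way ANOVA (Table A4) relaxes the equal-variance assumption of the classical F test; since SciPy ships only the classical version, the sketch below implements the textbook Welch formula directly. This is an illustrative implementation, not the paper's code, and the example groups are synthetic.

```python
# Sketch of Welch's one-way ANOVA (heteroscedastic F test).
import numpy as np
from scipy.stats import f as f_dist

def welch_anova(*groups):
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                             # precision weights n_j / s_j^2
    mw = np.sum(w * m) / np.sum(w)        # weighted grand mean
    num = np.sum(w * (m - mw) ** 2) / (k - 1)
    tmp = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    den = 1 + 2 * (k - 2) / (k ** 2 - 1) * tmp
    F = num / den
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * tmp)        # Welch's approximate denominator df
    p = f_dist.sf(F, df1, df2)
    return F, df1, df2, p

# Three synthetic groups with clearly different means
F, df1, df2, p = welch_anova([1, 2, 3, 4, 5], [11, 12, 13, 14, 15], [21, 22, 23, 24, 25])
```

Note that df2 is generally non-integer, which is why Table A4 reports fractional denominator degrees of freedom (65.9 and 65.6).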
Table A5. The results of the Games–Howell post hoc test on test MCC.
 | DecisionTreeClassifier | GaussianNB | KNN | LogisticRegression | SVC
Decision Tree Classifier (mean difference) | – | 0.563 *** | 0.277 *** | 0.3497 *** | 0.452 ***
Decision Tree Classifier (p-value) | – | <0.001 | <0.001 | <0.001 | <0.001
Gaussian NB (mean difference) |  | – | −0.285 *** | −0.2129 *** | −0.111 ***
Gaussian NB (p-value) |  | – | <0.001 | <0.001 | <0.001
KNN (mean difference) |  |  | – | 0.0723 *** | 0.174 ***
KNN (p-value) |  |  | – | <0.001 | <0.001
Logistic Regression (mean difference) |  |  |  | – | 0.102 ***
Logistic Regression (p-value) |  |  |  | – | <0.001
SVC (mean difference) |  |  |  |  | –
SVC (p-value) |  |  |  |  | –
Note. *** p < 0.001.
Table A6. The results of the Games–Howell post hoc test on train AUC.
 | DecisionTreeClassifier | GaussianNB | KNN | LogisticRegression | SVC
Decision Tree Classifier (mean difference) | – | 0.348 *** | 0.156 *** | 0.1995 *** | 0.2629 ***
Decision Tree Classifier (p-value) | – | <0.001 | <0.001 | <0.001 | <0.001
Gaussian NB (mean difference) |  | – | −0.191 *** | −0.1481 *** | −0.0848 ***
Gaussian NB (p-value) |  | – | <0.001 | <0.001 | <0.001
KNN (mean difference) |  |  | – | 0.0431 *** | 0.1065 ***
KNN (p-value) |  |  | – | <0.001 | <0.001
Logistic Regression (mean difference) |  |  |  | – | 0.0633 ***
Logistic Regression (p-value) |  |  |  | – | <0.001
SVC (mean difference) |  |  |  |  | –
SVC (p-value) |  |  |  |  | –
Note. *** p < 0.001.
Table A7. The results of the Kruskal–Wallis test (non-parametric one-way ANOVA).
 | χ² | df | p
Test MCC | 125 | 4 | <0.001
Test AUC | 128 | 4 | <0.001
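The Kruskal–Wallis test of Table A7 is available directly in SciPy; the sketch below applies it to three hypothetical per-run score samples (invented for illustration, not taken from the paper).

```python
# Sketch: Kruskal-Wallis rank test across classifier score samples,
# as used for Table A7. The sample values below are hypothetical.
from scipy.stats import kruskal

dt = [0.82, 0.81, 0.83, 0.82, 0.81]   # e.g. test MCC values per run
gnb = [0.21, 0.25, 0.19, 0.22, 0.24]
knn = [0.54, 0.53, 0.55, 0.54, 0.52]

h_stat, p_value = kruskal(dt, gnb, knn)
```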
Table A8. The results of the Dwass-Steel-Critchlow-Fligner pairwise comparisons for test MCC.
Comparison | W | p
DecisionTreeClassifier vs. GaussianNB | −9.25 | <0.001
DecisionTreeClassifier vs. KNeighborsClassifier | −9.33 | <0.001
DecisionTreeClassifier vs. LogisticRegression | −9.25 | <0.001
DecisionTreeClassifier vs. SVC | −9.25 | <0.001
GaussianNB vs. KNeighborsClassifier | 9.33 | <0.001
GaussianNB vs. LogisticRegression | 8.54 | <0.001
GaussianNB vs. SVC | 6.37 | <0.001
KNeighborsClassifier vs. LogisticRegression | −6.80 | <0.001
KNeighborsClassifier vs. SVC | −9.33 | <0.001
LogisticRegression vs. SVC | −5.97 | <0.001
Table A9. The results of the Dwass-Steel-Critchlow-Fligner pairwise comparisons for test AUC.
Comparison | W | p
DecisionTreeClassifier vs. GaussianNB | −9.25 | <0.001
DecisionTreeClassifier vs. KNeighborsClassifier | −9.33 | <0.001
DecisionTreeClassifier vs. LogisticRegression | −9.25 | <0.001
DecisionTreeClassifier vs. SVC | −9.25 | <0.001
GaussianNB vs. KNeighborsClassifier | 9.33 | <0.001
GaussianNB vs. LogisticRegression | 9.1 | <0.001
GaussianNB vs. SVC | 7.46 | <0.001
KNeighborsClassifier vs. LogisticRegression | −6.88 | <0.001
KNeighborsClassifier vs. SVC | −9.33 | <0.001
LogisticRegression vs. SVC | −6.32 | <0.001

References

  1. Abouelmehdi, K.; Beni-Hssane, A.; Khaloufi, H.; Saadi, M. Big Data Security and Privacy in Healthcare: A Review. In Procedia Computer Science; Elsevier: Amsterdam, The Netherlands, 2017; Volume 113, pp. 73–80. [Google Scholar]
  2. Priya, R.; Sivasankaran, S.; Ravisasthiri, P.; Sivachandiran, S. A Survey on Security Attacks in Electronic Healthcare Systems. In Proceedings of the 2017 IEEE International Conference on Communication and Signal Processing, ICCSP, Chennai, India, 6–8 April 2017; pp. 691–694. [Google Scholar]
  3. Khokhar, R.H.; Chen, R.; Fung, B.C.M.; Lui, S.M. Quantifying the Costs and Benefits of Privacy-Preserving Health Data Publishing. J. Biomed. Inform. 2014, 50, 107–121. [Google Scholar] [CrossRef] [PubMed]
  4. Pitoglou, S.; Giannouli, D.; Costarides, V.; Androutsou, T.; Anastasiou, A. Cybercrime and Private Health Data. In Encyclopedia of Criminal Activities and the Deep Web; IGI Global: Hershey, PA, USA, 2020; pp. 763–787. [Google Scholar]
  5. Kruse, C.S.; Frederick, B.; Jacobson, T.; Monticone, D.K. Cybersecurity in Healthcare: A Systematic Review of Modern Threats and Trends. Technol. Health Care 2017, 25, 1–10. [Google Scholar] [CrossRef] [PubMed]
  6. Ponemon Institute, LLC. Sixth Annual Benchmark Study on Privacy & Security of Healthcare Data. Available online: https://www.ponemon.org/blog/sixth-annual-benchmark-study-on-privacy-security-of-healthcare-data-1 (accessed on 8 May 2020).
  7. Samarati, P. Protecting Respondents’ Identities in Microdata Release. IEEE Trans. Knowl. Data Eng. 2001, 13, 1010–1027. [Google Scholar] [CrossRef]
  8. Hathaliya, J.J.; Tanwar, S. An Exhaustive Survey on Security and Privacy Issues in Healthcare 4.0. Comput. Commun. 2020, 153, 311–335. [Google Scholar] [CrossRef]
  9. Gkoulalas-Divanis, A.; Loukides, G.; Sun, J. Publishing Data from Electronic Health Records While Preserving Privacy: A Survey of Algorithms. J. Biomed. Inform. 2014, 50, 4–19. [Google Scholar] [CrossRef] [PubMed]
  10. Nusinovici, S.; Tham, Y.C.; Yan, M.Y.C.; Ting, D.S.W.; Li, J.; Sabanayagam, C.; Wong, T.Y.; Cheng, C.-Y. Logistic Regression Was as Good as Machine Learning for Predicting Major Chronic Diseases. J. Clin. Epidemiol. 2020, 122, 56–69. [Google Scholar] [CrossRef]
  11. Ngiam, K.Y.; Khor, I.W. Big Data and Machine Learning Algorithms for Health-Care Delivery. Lancet Oncol. 2019, 20, e262–e273. [Google Scholar] [CrossRef]
  12. Ravi, D.; Wong, C.; Deligianni, F.; Berthelot, M.; Andreu-Perez, J.; Lo, B.; Yang, G.Z. Deep Learning for Health Informatics. IEEE J. Biomed. Health Inform. 2017, 21, 4–21. [Google Scholar] [CrossRef]
  13. Rajkomar, A.; Oren, E.; Chen, K.; Dai, A.M.; Hajaj, N.; Hardt, M.; Liu, P.J.; Liu, X.; Marcus, J.; Sun, M.; et al. Scalable and Accurate Deep Learning with Electronic Health Records. npj Digit. Med. 2018, 1, 18. [Google Scholar] [CrossRef]
  14. Miotto, R.; Li, L.; Kidd, B.A.; Dudley, J.T. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records. Sci. Rep. 2016, 6, 26094. [Google Scholar] [CrossRef]
  15. Al-Rubaie, M.; Chang, J.M. Privacy-Preserving Machine Learning: Threats and Solutions. IEEE Secur. Priv. 2019, 17, 49–58. [Google Scholar] [CrossRef]
  16. Malle, B.; Kieseberg, P.; Weippl, E.; Holzinger, A. The Right to Be Forgotten: Towards Machine Learning on Perturbed Knowledge Bases. In Availability, Reliability, and Security in Information Systems, Proceedings of the CD-ARES 2016, Salzburg, Austria, 31 August–2 September 2016; Springer: Berlin/Heidelberg, Germany, 2016; Volume 9817, pp. 251–266. [Google Scholar]
  17. Malle, B.; Kieseberg, P.; Holzinger, A. Interactive Anonymization for Privacy Aware Machine Learning. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery ECML-PKDD, Skopje, North Macedonia, 18–20 September 2017; pp. 15–26. [Google Scholar]
  18. Jaidan, D.N.; Carrere, M.; Chemli, Z.; Poisvert, R. Data Anonymization for Privacy Aware Machine Learning. In Machine Learning, Optimization, and Data Science, Proceedings of the LOD 2019, Siena, Italy, 10–13 September 2019; Springer: Amsterdam, The Netherlands, 2019; Volume 11943 LNCS, pp. 725–737. [Google Scholar]
  19. Bost, R.; Ada Popa, R.; Tu, S.; Goldwasser, S. Machine Learning Classification over Encrypted Data. In Network and Distributed System Security Symposium; Internet Society: Reston, VA, USA, 2015; p. 4325. [Google Scholar]
  20. Li, J.; Liu, J.; Baig, M.; Wong, R.C.W. Information Based Data Anonymization for Classification Utility. Data Knowl. Eng. 2011, 70, 1030–1045. [Google Scholar] [CrossRef]
  21. Last, M.; Tassa, T.; Zhmudyak, A.; Shmueli, E. Improving Accuracy of Classification Models Induced from Anonymized Datasets. Inf. Sci. 2014, 256, 138–161. [Google Scholar] [CrossRef]
  22. Slijepčević, D.; Henzl, M.; Klausner, L.D.; Dam, T.; Kieseberg, P.; Zeppelzauer, M. K-Anonymity in Practice: How Generalisation and Suppression Affect Machine Learning Classifiers. Comput. Secur. 2021, 111, 102488. [Google Scholar] [CrossRef]
  23. LeFevre, K.; DeWitt, D.J.; Ramakrishnan, R. Mondrian Multidimensional K-Anonymity. In Proceedings of the International Conference on Data Engineering, Atlanta, GA, USA, 3–7 April 2006; Volume 2006, p. 25. [Google Scholar]
  24. Mohammed, N.; Fung, B.C.M.; Hung, P.C.K.; Lee, C.K. Anonymizing Healthcare Data: A Case Study on the Blood Transfusion Service. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009; ACM Press: New York, NY, USA, 2009; pp. 1285–1293. [Google Scholar]
  25. Goldberger, J.; Tassa, T. Efficient Anonymizations with Enhanced Utility. Trans. Data Priv. 2010, 3, 149–175. [Google Scholar]
  26. El Emam, K.; Dankar, F.K.; Issa, R.; Jonker, E.; Amyot, D.; Cogo, E.; Corriveau, J.P.; Walker, M.; Chowdhury, S.; Vaillancourt, R.; et al. A Globally Optimal K-Anonymity Method for the De-Identification of Health Data. J. Am. Med. Inform. Assoc. 2009, 16, 670–682. [Google Scholar] [CrossRef]
  27. Xu, J.; Wang, W.; Pei, J.; Wang, X.; Shi, B.; Fu, A.W.C. Utility-Based Anonymization Using Local Recoding. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; Association for Computing Machinery: New York, NY, USA, 2006; Volume 2006, pp. 785–790. [Google Scholar]
  28. Lin, J.L.; Wei, M.C. An Efficient Clustering Method for K-Anonymization. In Proceedings of the ACM International Conference Proceeding Series, Nantes, France, 29 March 2008; ACM Press: New York, NY, USA, 2008; Volume 331, pp. 46–50. [Google Scholar]
  29. Pitoglou, S.; Anastasiou, A.; Androutsou, T.; Giannouli, D.; Kostalas, E.; Matsopoulos, G.; Koutsouris, D. MODELHealth: Facilitating Machine Learning on Big Health Data Networks. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany, 23–27 July 2019; Institute of Electrical and Electronics Engineers: Piscataway, NJ, USA, 2019; pp. 2174–2177. [Google Scholar]
  30. Pitoglou, S. Machine Learning in Healthcare, Introduction and Real World Application Considerations. Int. J. Reliab. Qual. E-Healthcare 2018, 7, 27–36. [Google Scholar] [CrossRef]
  31. Samarati, P.; Sweeney, L. Protecting Privacy When Disclosing Information: K-Anonymity and Its Enforcement through Generalization and Suppression; Technical Report SRI-CSL-98-04; Computer Science Laboratory, SRI International: Menlo Park, CA, USA, 1998. [Google Scholar]
  32. Aggarwal, N.; Agrawal, R.K. First and Second Order Statistics Features for Classification of Magnetic Resonance Brain Images. J. Signal Inf. Process. 2012, 3, 146–153. [Google Scholar] [CrossRef]
  33. Li, N.; Li, T.; Venkatasubramanian, S. T-Closeness: Privacy beyond k-Anonymity and ℓ-Diversity. In Proceedings of the International Conference on Data Engineering, Istanbul, Turkey, 15–20 April 2007; pp. 106–115. [Google Scholar]
  34. Machanavajjhala, A.; Kifer, D.; Gehrke, J.; Venkitasubramaniam, M. ℓ-Diversity: Privacy beyond k-Anonymity. ACM Trans. Knowl. Discov. Data 2007, 1, 3. [Google Scholar] [CrossRef]
  35. Ayala-Rivera, V.; McDonagh, P.; Cerqueus, T.; Murphy, L. A Systematic Comparison and Evaluation of K-Anonymization Algorithms for Practitioners. Trans. Data Priv. 2014, 7, 337–370. [Google Scholar]
  36. Iyengar, V.S. Transforming Data to Satisfy Privacy Constraints. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada, 23–26 July 2002; ACM Press: New York, NY, USA, 2002; p. 279. [Google Scholar]
  37. Nergiz, M.E.; Clifton, C. Thoughts on K-Anonymization. In Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW 2006), Atlanta, GA, USA, 3–7 April 2006; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2006. [Google Scholar]
  38. Bayardo, R.J.; Agrawal, R. Data Privacy through Optimal K-Anonymization. In Proceedings of the 21st International Conference on Data Engineering (ICDE’05), Tokyo, Japan, 5–8 April 2005; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2005; pp. 217–228. [Google Scholar]
  39. Cirkovic, B.R.A.; Cvetkovic, A.M.; Ninkovic, S.M.; Filipovic, N.D. Prediction Models for Estimation of Survival Rate and Relapse for Breast Cancer Patients. In Proceedings of the 2015 IEEE 15th International Conference on Bioinformatics and Bioengineering, BIBE, Belgrade, Serbia, 2–4 November 2015; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2015. [Google Scholar]
  40. Lee, Y.; Ragguett, R.M.; Mansur, R.B.; Boutilier, J.J.; Rosenblat, J.D.; Trevizol, A.; Brietzke, E.; Lin, K.; Pan, Z.; Subramaniapillai, M.; et al. Applications of Machine Learning Algorithms to Predict Therapeutic Outcomes in Depression: A Meta-Analysis and Systematic Review. J. Affect. Disord. 2018, 241, 519–532. [Google Scholar] [CrossRef]
  41. Luz, C.F.; Vollmer, M.; Decruyenaere, J.; Nijsten, M.W.; Glasner, C.; Sinha, B. Machine Learning in Infection Management Using Routine Electronic Health Records: Tools, Techniques, and Reporting of Future Technologies. Clin. Microbiol. Infect. 2020, 26, 1291–1299. [Google Scholar] [CrossRef]
  42. Nisbet, R.; Miner, G.; Yale, K. Basic Algorithms for Data Mining: A Brief Overview. In Handbook of Statistical Analysis and Data Mining Applications; Elsevier: Amsterdam, The Netherlands, 2018; pp. 121–147. [Google Scholar]
  43. Hosmer, D.W.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression, 3rd ed.; Wiley: Hoboken, NJ, USA, 2013; ISBN 978-0-470-58247-3. [Google Scholar]
  44. Spitznagel, E.L. 6 Logistic Regression. Handb. Stat. 2007, 27, 187–209. [Google Scholar]
  45. Hassanipour, S.; Ghaem, H.; Arab-Zozani, M.; Seif, M.; Fararouei, M.; Abdzadeh, E.; Sabetian, G.; Paydar, S. Comparison of Artificial Neural Network and Logistic Regression Models for Prediction of Outcomes in Trauma Patients: A Systematic Review and Meta-Analysis. Injury 2019, 50, 244–250. [Google Scholar] [CrossRef]
  46. Christodoulou, E.; Ma, J.; Collins, G.S.; Steyerberg, E.W.; Verbakel, J.Y.; Van Calster, B. A Systematic Review Shows No Performance Benefit of Machine Learning over Logistic Regression for Clinical Prediction Models. J. Clin. Epidemiol. 2019, 110, 12–22. [Google Scholar] [CrossRef]
  47. Sun, X.; Douiri, A.; Gulliford, M. Applying Machine Learning Algorithms to Electronic Health Records to Predict Pneumonia after Respiratory Tract Infection. J. Clin. Epidemiol. 2022, 145, 154–163. [Google Scholar] [CrossRef]
  48. Austin, P.C.; Harrell, F.E.; Steyerberg, E.W. Predictive Performance of Machine and Statistical Learning Methods: Impact of Data-Generating Processes on External Validity in the “Large N, Small p” Setting. Stat. Methods Med. Res. 2021, 30, 1465–1483. [Google Scholar] [CrossRef]
  49. Fernandes, M.; Vieira, S.M.; Leite, F.; Palos, C.; Finkelstein, S.; Sousa, J.M.C. Clinical Decision Support Systems for Triage in the Emergency Department Using Intelligent Systems: A Review. Artif. Intell. Med. 2020, 102, 101762. [Google Scholar] [CrossRef]
  50. Talia, D.; Trunfio, P.; Marozzo, F. Introduction to Data Mining. In Data Analysis in the Cloud; Elsevier: Amsterdam, The Netherlands, 2016; pp. 1–25. [Google Scholar]
  51. Quinlan, J.R. Simplifying Decision Trees. Int. J. Man. Mach. Stud. 1987, 27, 221–234. [Google Scholar] [CrossRef]
  52. Nisbet, R.; Miner, G.; Yale, K. Chapter 9—Classification. In Handbook of Statistical Analysis and Data Mining Applications; Elsevier: Amsterdam, The Netherlands, 2018; Volume 9, pp. 169–186. [Google Scholar] [CrossRef]
  53. Richter, A.N.; Khoshgoftaar, T.M. A Review of Statistical and Machine Learning Methods for Modeling Cancer Risk Using Structured Clinical Data. Artif. Intell. Med. 2018, 90, 1–14. [Google Scholar] [CrossRef]
  54. Salcedo-Bernal, A.; Villamil-Giraldo, M.P.; Moreno-Barbosa, A.D. Clinical Data Analysis: An Opportunity to Compare Machine Learning Methods. In Procedia Computer Science; Elsevier: Amsterdam, The Netherlands, 2016; Volume 100, pp. 731–738. [Google Scholar]
  55. Altman, N.S. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. Am. Stat. 1992, 46, 175–185. [Google Scholar] [CrossRef]
  56. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  57. Pisner, D.A.; Schnyer, D.M. Support Vector Machine. In Machine Learning; Elsevier: Amsterdam, The Netherlands, 2020; pp. 101–121. [Google Scholar]
  58. Zhang, H. The Optimality of Naïve Bayes. In Proceedings of the FLAIRS2004 Conference, Miami Beach, FL, USA, 1 January 2004; Volume 1, p. 3. [Google Scholar]
  59. Hand, D.J.; Yu, K. Idiot’s Bayes: Not So Stupid after All? Int. Stat. Rev. Rev. Int. Stat. 2001, 69, 385. [Google Scholar] [CrossRef]
  60. Bradley, A.P. The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms. Pattern Recognit. 1997, 30, 1145–1159. [Google Scholar] [CrossRef]
  61. Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
  62. Chicco, D.; Jurman, G. The Advantages of the Matthews Correlation Coefficient (MCC) over F1 Score and Accuracy in Binary Classification Evaluation. BMC Genom. 2020, 21, 6. [Google Scholar] [CrossRef]
  63. Boughorbel, S.; Jarray, F.; El-Anbari, M. Optimal Classifier for Imbalanced Data Using Matthews Correlation Coefficient Metric. PLoS ONE 2017, 12, e0177678. [Google Scholar] [CrossRef]
  64. Welch, B.L. The generalization of ‘student’s’ problem when several different population variances are involved. Biometrika 1947, 34, 28–35. [Google Scholar] [CrossRef]
  65. Kruskal, W.H.; Wallis, W.A. Use of Ranks in One-Criterion Variance Analysis. J. Am. Stat. Assoc. 1952, 47, 583–621. [Google Scholar] [CrossRef]
Figure 1. The results of applying five machine learning models to the dataset for various anonymization parameter combinations (qi, k), where qi is the code of the quasi identifier set, and k is the minimum size of the equivalence class (Table 2). (a) The area under the curve (AUC) score results (first row) and the Matthews correlation coefficient (MCC) score results (second row) for the test set plotted against k for each qi value. (b) The AUC score results (first row) and the Matthews correlation coefficient (MCC) score results (second row) for the test set plotted against qi for each k value.
Figure 2. The prediction results of the tested machine learning models demonstrated with the AUC and MCC metrics, in juxtaposition with the GIL value. (a) The results of the logistic regression (LR), decision tree (DT), Gaussian naïve Bayes (GNB), k-nearest neighbors (KNN) and support vector machine (SVC) classifiers, plotted against k, for each qi value. (b) The results of the same models plotted against qi for each k value.
Figure 3. The linear regression model representing the relationship between the performance metrics area under curve (AUC), Matthews correlation coefficient (MCC) and the generalized information loss (GIL) metric for all tested machine learning models. (a) AUC as a function of GIL. (b) MCC as a function of GIL.
Figure 4. The AUC and MCC results of the five tested classifiers depicted through histograms (a,c) and violin plots (b,d).
Figure 5. The scatterplot depicting the correlation between GIL and MCC (bottom left) and the respective densities of the GIL (upper left) and MCC (bottom right) metric values.
Table 1. The attributes of the original dataset, the attributes after the one-hot decoding and their descriptions.
Original Dataset Attributes | Attributes after One-Hot Decoding | Attribute Type | Values | Attribute Description
AGE | AGE | Numerical | 0–114 | Patient age
SEX_F, SEX_M | SEX | Categorical | [Female, Male] | Patient sex
CURADM_DAYS | CURADM_DAYS | Numerical | 1–307 | Number of days during the current stay at the hospital
OUTCOME_H, OUTCOME_N, OUTCOME_I, OUTCOME_D | OUTCOME | Categorical | [Healing, No change, Improvement, Deterioration] | Hospitalization (care encounter) outcome
CURRICU_FLAG | CURRICU_FLAG | Categorical | [0, 1] | The patient had to be transferred to the ICU during the current hospitalization
PREVADM_NO | PREVADM_NO | Numerical | 0–170 | Number of previous admissions to the hospital
PREVADM_DAYS | PREVADM_DAYS | Numerical | 0–627 | Cumulative number of days of previous hospital admissions
PREVICU_DAYS | PREVICU_DAYS | Numerical | 0–315 | Cumulative days of ICU treatment during previous hospital admissions
READMISSION_30_DAYS | READMISSION_30_DAYS | Categorical | 0–1 | Readmission within 30 days or not
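The one-hot decoding described in Table 1 (e.g. collapsing SEX_F/SEX_M back into a single SEX attribute) can be sketched in pandas as follows; the column names match Table 1, while the decoding map is an assumption for illustration.

```python
# Sketch: collapsing one-hot encoded columns back to one categorical
# attribute, as in the SEX_F/SEX_M -> SEX decoding of Table 1.
import pandas as pd

df = pd.DataFrame({"SEX_F": [1, 0, 0], "SEX_M": [0, 1, 1]})
decode = {"SEX_F": "Female", "SEX_M": "Male"}  # assumed value mapping

# idxmax(axis=1) returns, per row, the column label holding the 1;
# map() then turns that label into the categorical value.
df["SEX"] = df[["SEX_F", "SEX_M"]].idxmax(axis=1).map(decode)
```

The same pattern applies to the four OUTCOME_* columns, with a four-entry decoding map.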
Table 2. The different dataset versions that were generated from the anonymization process, the corresponding anonymization parameter values and the obtained values of the information loss metrics.
Dataset VersionQI SetQI IDkGILDMCAVG
S0[]01000
S2.1[AGE, SEX]220.394346,873,092492.353
S1.2[AGE, SEX]230.394346,873,108331.017
S2.3[AGE, SEX]250.394346,873,168202.034
S2.4[AGE, SEX]2100.394346,873,396102.789
S2.5[AGE, SEX]2150.394346,875,01470.378
S2.6[AGE, SEX]2200.394346,878,57454.25
S2.7[AGE, SEX]2300.394346,884,56837.558
S3a.1[AGE, SEX, OUTCOME]3a20.352254,366,790209.25
S3a.2[AGE, SEX, OUTCOME]3a30.352254,366,928144.133
S3a.3[AGE, SEX, OUTCOME]3a50.352254,367,48291.547
S3a.4[AGE, SEX, OUTCOME]3a100.353254,371,33450.509
S3a.5[AGE, SEX, OUTCOME]3a150.353254,378,57036
S3a.6[AGE, SEX, OUTCOME]3a200.354254,353,26828.721
S3a.7[AGE, SEX, OUTCOME]3a300.355254,372,51020.134
S3b.1[AGE, SEX, CURADM_DAYS]3b20.21475,443,86221.275
S3b.2[AGE, SEX, CURADM_DAYS]3b30.21575,448,47816.502
S3b.3[AGE, SEX, CURADM_DAYS]3b50.21675,461,50211.83
S3b.4[AGE, SEX, CURADM_DAYS]3b100.21975,530,6067.73
S3b.5[AGE, SEX, CURADM_DAYS]3b150.21975,530,6067.73
S3b.6[AGE, SEX, CURADM_DAYS]3b200.22275,820,2945.208
S3b.7[AGE, SEX, CURADM_DAYS]3b300.22576,159,9064.099
S4.1[AGE, SEX, CURADM_DAYS, PREVADM_DAYS]420.0857,964,1103.789
S4.2[AGE, SEX, CURADM_DAYS, PREVADM_DAYS]430.0938,016,1963.434
S4.3[AGE, SEX, CURADM_DAYS, PREVADM_DAYS]450.1048,145,4523.012
S4.4[AGE, SEX, CURADM_DAYS, PREVADM_DAYS]4100.1168,563,0302.525
S4.5[AGE, SEX, CURADM_DAYS, PREVADM_DAYS]4150.1239,082,2862.304
S4.6[AGE, SEX, CURADM_DAYS, PREVADM_DAYS]4200.1269,621,1382.138
S4.7[AGE, SEX, CURADM_DAYS, PREVADM_DAYS]4300.13110,880,7701.953
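The DM and CAVG columns in Table 2 follow the standard definitions of the discernibility metric and the average equivalence class size metric for a k-anonymized table. A minimal sketch, assuming each record is reduced to its (generalized) quasi-identifier tuple and no records are suppressed; the function names and toy data are illustrative, not the paper's implementation:

```python
from collections import Counter

def discernibility_metric(qi_tuples):
    """DM: each record is 'charged' the size of its equivalence class,
    so DM = sum over classes E of |E|^2 (suppression penalty omitted)."""
    counts = Counter(qi_tuples)
    return sum(n * n for n in counts.values())

def avg_class_size_metric(qi_tuples, k):
    """CAVG: (total records / number of equivalence classes) / k."""
    counts = Counter(qi_tuples)
    total = sum(counts.values())
    return (total / len(counts)) / k

# Toy anonymized QI projection: (generalized AGE band, SEX)
rows = [("40-49", "M"), ("40-49", "M"), ("40-49", "F"), ("40-49", "F")]
print(discernibility_metric(rows))       # 2^2 + 2^2 = 8
print(avg_class_size_metric(rows, k=2))  # (4 / 2) / 2 = 1.0
```

Under these definitions, larger equivalence classes (higher k, coarser generalization) drive DM up, while CAVG close to 1 indicates classes barely larger than the minimum size k, matching the trends visible in the table.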
Table 3. The mean, median, standard deviation (SD) and 95% confidence interval of the AUC and MCC metric test set results of the experiments with the five tested machine learning models.

| Metric | Classifier | Mean | 95% CI Lower | 95% CI Upper | Median | SD |
|---|---|---|---|---|---|---|
| AUC | DecisionTreeClassifier | 0.906 | 0.902 | 0.910 | 0.909 | 0.0111 |
| AUC | GaussianNB | 0.583 | 0.568 | 0.597 | 0.565 | 0.0402 |
| AUC | KNeighborsClassifier | 0.768 | 0.765 | 0.772 | 0.771 | 0.0103 |
| AUC | LogisticRegression | 0.731 | 0.720 | 0.742 | 0.733 | 0.0311 |
| AUC | SVC | 0.670 | 0.651 | 0.689 | 0.695 | 0.0524 |
| MCC | DecisionTreeClassifier | 0.815 | 0.807 | 0.823 | 0.821 | 0.0217 |
| MCC | GaussianNB | 0.252 | 0.222 | 0.283 | 0.212 | 0.0838 |
| MCC | KNeighborsClassifier | 0.537 | 0.530 | 0.545 | 0.542 | 0.0208 |
| MCC | LogisticRegression | 0.465 | 0.443 | 0.488 | 0.467 | 0.0620 |
| MCC | SVC | 0.363 | 0.331 | 0.395 | 0.405 | 0.0886 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
