Article

Exploring the Utility of Anonymized EHR Datasets in Machine Learning Experiments in the Context of the MODELHealth Project

by
Stavros Pitoglou
1,2,*,
Arianna Filntisi
2,
Athanasios Anastasiou
1,
George K. Matsopoulos
1,* and
Dimitrios Koutsouris
1
1
School of Electrical and Computer Engineering, National Technical University of Athens, 157 80 Athens, Greece
2
Computer Solutions SA, 115 27 Athens, Greece
*
Authors to whom correspondence should be addressed.
Appl. Sci. 2022, 12(12), 5942; https://doi.org/10.3390/app12125942
Submission received: 18 April 2022 / Revised: 31 May 2022 / Accepted: 8 June 2022 / Published: 10 June 2022
(This article belongs to the Special Issue Advances in Biomedical Signal Processing in Health Care)

Abstract

:
The object of this paper was the application of machine learning to a clinical dataset that was anonymized using the Mondrian algorithm. (1) Background: The preservation of patient privacy is a necessity arising from the increasing digitization of health data; however, the effect of data anonymization on the performance of machine learning models remains to be explored. (2) Methods: The original EHR-derived dataset was subjected to anonymization by applying the Mondrian algorithm for various k values and quasi-identifier (QI) set attributes. The logistic regression, decision tree, k-nearest neighbors, Gaussian naïve Bayes and support vector machine models were applied to the different dataset versions. (3) Results: The classifiers demonstrated different degrees of resilience to the anonymization, with the decision tree and the KNN models showing remarkably stable performance, as opposed to the Gaussian naïve Bayes model. The choice of the QI set attributes and the generalized information loss value played a more important role than the size of the QI set or the k value. (4) Conclusions: Data anonymization can reduce the performance of certain machine learning models, although the appropriate selection of classifier and parameter values can mitigate this effect.

1. Introduction

The digitization of healthcare workflows and the increased usage of electronic health records have led to a dramatic increase in electronically available clinical data in terms of volume, complexity, diversity and timeliness. Clinical data carry information about real patients and hold the promise of supporting a wide range of unprecedented opportunities and use cases, such as clinical decision support, health insurance, disease surveillance, population health management, adverse event monitoring, and treatment optimization. Therefore, clinical data are suitable to be reused in machine learning applications, despite the fact that they can be characterized by noise, errors, incompleteness and high dimensionality [1].
Even though the digitization of clinical data in the healthcare sector carries many benefits, it also raises challenges such as the preservation of patient privacy. Health data are by default sensitive, and concerns over the compromise of sensitive information, security and privacy are increasing year by year. Security attacks can happen in the data-gathering phase, the network phase, as well as the storage phase [2]. Health information custodians (HICs) have faced increasing privacy breaches of different types, due to either the negligence of administrative staff or the employment of weak de-identification methods [3]. It has been shown that the number of health service providers reporting cases of data privacy breaches is increasing every year. In the years 2016–2017, approximately 90 percent of healthcare providers were faced with data breaches, while a successful breach was estimated to cost around USD 3.7 million to clean up [4,5,6]. Furthermore, it has been shown that seemingly anonymous, de-identified data that are publicly available can be combined and linked to certain individuals or groups through other identifying attributes [7].
Data security controls the access to medical data throughout the data lifecycle, protecting it from any unauthorized third-party access, and is ensured through technologies such as authentication, encryption, data masking and access control. Data privacy, on the other hand, is concerned with the protection of an individual’s healthcare information from unauthorized access. Data privacy regulates data access based on privacy policies and laws and is ensured through methods such as de-identification and anonymization [1,8]. The most prominent data privacy model is k-anonymity [9].
Machine learning is a category of algorithmic methods enabling machines to solve problems without specific computer programming, and can be divided into supervised, unsupervised and reinforcement methods according to the learning approach [8,10]. Despite the fact that the implementation of machine learning tools for health care data faces many challenges, the use of AI in health care is surrounded by excitement, with numerous platforms integrating clinical machine learning tools having been developed [11]. Machine learning models have been utilized in several health informatics fields, such as medical imaging, medical informatics and public health [12]. For example, deep learning models have been used to predict mortality, early readmission, long length of stay, as well as future diseases, from EHRs [13,14]. In addition, machine learning algorithms such as logistic regression, support vector machines, Gaussian naive Bayes, k-nearest neighbors and deep multilayer neural networks have been employed to predict the probability of early patient readmission from processed hospital information system (HIS) data [14].
The topic of privacy-aware machine learning, which lies at the intersection of machine learning and privacy preservation, has begun to be explored in the last few years with the advent of data privacy laws [15]. Examples of relevant research include the investigation of the impact of data anonymization on the performance of machine learning algorithms using the gradient boosting, random forest, logistic regression and linear SVC algorithms [16]. The use of interactive machine learning has also been proposed, eliciting human preferences for preserving some attribute values over others for anonymization [17]. The topic of differential privacy has been investigated by comparing two differential privacy algorithms and evaluating the results by applying three machine learning algorithms to anonymized and raw data [18]. In addition, privacy-preserving protocols for three classifiers have been proposed [19]. The effect of anonymization algorithms on the performance of machine learning classifiers has been explored in several studies [20,21,22]. A novel anonymization algorithm, information-based anonymization for classification given k (IACk), based on normalized mutual information, was introduced in [20]; its effect on the performance of machine learning models was tested using decision trees, naïve Bayes, logistic regression and SVM. Another anonymization algorithm (non-homogeneous generalization with sensitive value distribution, NSVDist), based on an information loss metric, was introduced in [21]. The authors compared NSVDist with the Mondrian [23], privacy-aware information sharing (PAIS) [24] and sequential anonymization (SeqA) [25] algorithms, and evaluated their effect on the performance of the naïve Bayes, SVM, W-J48 and W-JRip classifiers.
A comparison of the anonymization algorithms Mondrian, optimal lattice anonymization (OLA) [26], top-down greedy anonymization (TDG) [27] and the k-nearest neighbor clustering-based anonymization method [28], regarding their impact on the performance of the k-NN, SVM, XGBoost and random forest classifiers, was made in [22]. Since the purpose of some of these studies has been the introduction of novel anonymization methods (IACk [20], NSVDist [21]), it can be noted that the effect of anonymization methods on machine learning performance has become an important consideration for their evaluation and validation.
The object of this paper was the application of machine learning algorithms to a clinical dataset that had been subjected to anonymization using the Mondrian algorithm with various parameter values. The concept of this paper and the dataset used originated in the MODELHealth project, the main object of which has been the development of a software platform for the harmonization and anonymization of electronic health record data, with the goal of utilizing them as input to machine learning models [29,30].

2. Materials and Methods

2.1. Data

The initial dataset was provided by a public Greek hospital, and contained demographic and hospitalization information for 117,181 patients. The one-hot encoding corresponding to the attributes related to patient sex (SEX_F, SEX_M) and the care encounter outcome (OUTCOME_H, OUTCOME_N, OUTCOME_I, OUTCOME_D) was removed in order to facilitate the anonymization process. The attributes of the original dataset, the attributes after the one-hot decoding and their descriptions can be seen in Table 1.

2.2. Anonymization and Information Loss Estimation

2.2.1. K-Anonymity

K-anonymity is the primary anonymization method that has been proposed for the prevention of identity disclosure in data publishing, limiting the probability of linking an individual to their records to 1/k [9,31]. Of central importance to the k-anonymity concept is the quasi-identifier (QI) set, a set of seemingly innocuous dataset attributes whose linkage with external information can lead to the reidentification of individual records, as well as the equivalence class (EQ), a set of dataset records that are indistinguishable from each other with respect to the values of the QI set. A dataset satisfies the k-anonymity constraint if each record is identical to at least k − 1 records with respect to the QI set, which means that each equivalence class EQ consists of at least k records. K-anonymity can be enforced using suppression and generalization techniques. Suppression is achieved by replacing some of the original attribute values with a specific value indicating its non-disclosure. Generalization, or re-coding, is achieved by replacing the attribute values with less specific but consistent values.
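To make the k-anonymity constraint concrete, the following sketch checks whether a table satisfies k-anonymity for a given QI set by verifying that every equivalence class contains at least k records. The column names are hypothetical illustrations, not the actual MODELHealth schema:

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, qi_columns: list, k: int) -> bool:
    """Return True if every equivalence class (group of records sharing
    the same QI values) contains at least k records."""
    class_sizes = df.groupby(qi_columns).size()
    return bool((class_sizes >= k).all())

# Toy example: AGE has been generalized to ranges, SEX kept as-is.
table = pd.DataFrame({
    "AGE": ["20-29", "20-29", "20-29", "30-39", "30-39"],
    "SEX": ["F", "F", "F", "M", "M"],
    "DIAGNOSIS": ["A", "B", "A", "C", "B"],
})
print(is_k_anonymous(table, ["AGE", "SEX"], k=2))  # True: classes of size 3 and 2
print(is_k_anonymous(table, ["AGE", "SEX"], k=3))  # False: the (30-39, M) class has only 2 records
```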

2.2.2. L-Diversity, T-Closeness

Another concept related to data privacy is sensitive attributes, the attributes that patients are not willing to be associated with, such as diagnosis codes [9]. The privacy provided by k-anonymity could be considered insufficient in some cases, since it can potentially allow the disclosure of sensitive attributes that lack diversity through the use of background knowledge. Nevertheless, k-anonymity is the primary algorithm proposed for anonymization and is used as a baseline process [31,32,33,34]. L-diversity and t-closeness are the most prevalent anonymization concepts that extend k-anonymity by considering sensitive attributes. L-diversity focuses on the representation of sensitive attribute values in the anonymized dataset, requiring that each equivalence class contain at least l well-represented values for each sensitive attribute [1,9,34]. T-closeness focuses on limiting the distance between the probability distribution of the sensitive attribute values in an anonymized group and that of the sensitive attribute values in the entire dataset, requiring that the distance between the two distributions be no greater than a threshold t [9,33].
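In the simplest (distinct) form of l-diversity, the requirement reduces to counting distinct sensitive values per equivalence class; a minimal sketch with hypothetical column names:

```python
import pandas as pd

def satisfies_l_diversity(df: pd.DataFrame, qi_columns: list,
                          sensitive: str, l: int) -> bool:
    """Return True if every equivalence class contains at least l
    distinct values of the sensitive attribute (distinct l-diversity)."""
    distinct_per_class = df.groupby(qi_columns)[sensitive].nunique()
    return bool((distinct_per_class >= l).all())

table = pd.DataFrame({
    "AGE": ["20-29", "20-29", "20-29", "30-39", "30-39"],
    "SEX": ["F", "F", "F", "M", "M"],
    "DIAGNOSIS": ["A", "B", "A", "C", "C"],
})
# False: the (30-39, M) class holds a single diagnosis value.
print(satisfies_l_diversity(table, ["AGE", "SEX"], "DIAGNOSIS", l=2))
```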

2.2.3. Mondrian Algorithm

Mondrian is a greedy anonymization algorithm that implements k-anonymization through multidimensional generalization and is applicable to categorical as well as numeric data [23]. According to this method, a k-anonymization of a given dataset is achieved in two stages, the first one focusing on the recursive partitioning of the dataset into a number of multidimensional regions covering its domain space, a process similar to the kd-tree construction method. The second stage focuses on the mapping of generalized values to each dataset partition by applying re-coding functions using summary statistics from each region. The time complexity of the Mondrian algorithm is O(n log n), outperforming other optimal algorithms implementing k-anonymity, whose worst-case complexity is exponential [23].
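The recursive partitioning stage can be sketched as follows for numeric attributes only; this is a simplified illustration of the cut rule (split on the median of the widest QI dimension, only when both halves keep at least k records), not the full algorithm, which also handles categorical attributes and performs the second, re-coding stage:

```python
def mondrian_partition(records, qi_indices, k):
    """Recursively split records on the median of the QI dimension with
    the widest value range, stopping when no allowable (k-preserving)
    cut exists. Returns a list of partitions, each of size >= k."""
    def spread(dim):
        vals = [r[dim] for r in records]
        return max(vals) - min(vals)

    # Try QI dimensions from widest to narrowest range.
    for dim in sorted(qi_indices, key=spread, reverse=True):
        vals = sorted(r[dim] for r in records)
        median = vals[len(vals) // 2]
        left = [r for r in records if r[dim] < median]
        right = [r for r in records if r[dim] >= median]
        # A cut is allowable only if both halves still satisfy k.
        if len(left) >= k and len(right) >= k:
            return (mondrian_partition(left, qi_indices, k)
                    + mondrian_partition(right, qi_indices, k))
    return [records]  # no allowable cut: this is a final equivalence class

data = [(21,), (22,), (25,), (30,), (31,), (35,)]
parts = mondrian_partition(data, qi_indices=[0], k=2)
print([len(p) for p in parts])  # [3, 3]
```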

2.2.4. Information Loss

Anonymization provides the benefit of data privacy, but at the same time negatively affects data utility by causing information loss. Three metrics that can be used for the estimation of information loss due to anonymization are generalized information loss, discernibility and average equivalence class size [35].
Generalized information loss represents the penalty induced from the generalization of a specific attribute by taking into account the fraction of the domain values that have been generalized. The generalized information loss for an anonymized table T* was calculated according to Equation (1), where T is the original table, i = 1, …, n corresponds to an attribute, j = 1, …, |T| corresponds to a table record, Ui, Li are the upper and lower values of each arithmetic attribute i, Uij, Lij are the upper and lower values of arithmetic attribute i for the equivalence class the record j belongs in, Ni is the number of different values for each categorical attribute i, and Nij is the number of different values for categorical attribute i in the equivalence class the record j belongs in [35,36,37].
The discernibility metric captures the indistinguishability of a table record compared to the rest by assigning a penalty to each record, equal to the size of the equivalence class it belongs in. The discernibility metric for an anonymized table T* was calculated according to Equation (2), where |EQ| is the number of records in the equivalence class EQ [23,35,38].
The average equivalence class size estimates how well the equivalence class formulation approaches the optimal case, in which every equivalence class contains k records. It is calculated according to Equation (3), where |T| is the number of table records, |EQs| is the total number of equivalence classes created in the anonymized table T*, and k is the minimum equivalence class size allowed [23,35].
$$\mathrm{GIL}(T^{*}) = \frac{1}{|T| \cdot n} \sum_{i=1}^{n} \sum_{j=1}^{|T|} \begin{cases} \dfrac{U_{ij} - L_{ij}}{U_i - L_i}, & \text{if } i \text{ is arithmetic} \\[6pt] \dfrac{N_{ij} - 1}{N_i - 1}, & \text{if } i \text{ is categorical} \end{cases} \tag{1}$$

$$\mathrm{DM}(T^{*}) = \sum_{\forall EQ \ \text{s.t.}\ |EQ| \ge k} |EQ|^{2} \tag{2}$$

$$C_{AVG}(T^{*}) = \frac{|T|}{|EQs| \cdot k} \tag{3}$$
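Given only the sizes of the equivalence classes of an anonymized table, the DM and CAVG metrics above reduce to a few lines; this sketch uses hypothetical class sizes (GIL additionally needs the per-attribute value ranges, so it is omitted here):

```python
def discernibility(eq_sizes, k):
    """DM: each record is penalized with the size of the equivalence
    class it belongs to, i.e. sum of |EQ|^2 over classes with |EQ| >= k."""
    return sum(s * s for s in eq_sizes if s >= k)

def avg_class_size(eq_sizes, k):
    """CAVG: |T| / (number of equivalence classes * k)."""
    total_records = sum(eq_sizes)
    return total_records / (len(eq_sizes) * k)

sizes = [3, 3, 4]                    # hypothetical equivalence classes for k = 2
print(discernibility(sizes, k=2))    # 3^2 + 3^2 + 4^2 = 34
print(avg_class_size(sizes, k=2))    # 10 / (3 * 2) ≈ 1.667
```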
After one-hot decoding, the original dataset was subjected to anonymization by applying the Mondrian algorithm for various k values and quasi-identifier (QI) set attributes. The choice of algorithm is justified by its performance, as well as by the fact that the dataset does not contain sensitive attributes (Table 1). The information loss that occurred due to the anonymization process was measured with the generalized information loss (GIL), discernibility metric (DM) and average equivalence class size (CAVG) metrics for each parameter combination. The different dataset versions that were generated from the anonymization process, the corresponding anonymization parameter values and the obtained values of the information loss metrics can be seen in Table 2.

2.3. Machine Learning

Machine Learning methods have been utilized extensively in the clinical field for numerous tasks, such as the prediction of clinical outcomes, modeling disease risk, decision support systems and infection management, using a variety of data formats including electronic health records [30,39,40,41].

2.3.1. Logistic Regression

Logistic regression is a statistical method used to model the nonlinear relationship between a categorical dependent variable and the combined effects of the independent variables by applying a logistic function [42,43,44]. Logistic regression has been used widely in the clinical field as a predictive model [44,45], with several studies reporting that more recent artificial intelligence methods show no performance benefit over logistic regression for clinical predictions [45,46]. Logistic regression is known to perform well in large, low-dimensional datasets, and has reportedly been the most frequently used model design technique in clinical decision support systems [47,48,49].

2.3.2. Decision Trees

A decision tree is a supervised learning method that maps the input features related to an item to a predicted target value by modeling the input features of a given item, the feature values and the target classes as a flowchart-like tree structure. The classification of an input item is achieved by following a path along the decision tree from the root node to the leaves, where the tree nodes correspond to the feature names, the arcs correspond to the possible feature values, and the leaves are labeled with the different classes [42,50,51,52]. Common algorithms for decision trees are C4.5, C5.0, and Bayesian trees [53]. Decision trees are well suited for knowledge domains that can be defined by a relatively small set of rules [54].

2.3.3. K Nearest Neighbors

The k-nearest neighbors (KNN) method is a supervised learning algorithm that maps an input vector to a predicted target value by finding the set of K labeled vectors in the feature space that have the smallest distance from the unlabeled input vector. The classification of the input vector is then based on the predominant class in this neighborhood. The KNN algorithm performs better when the data form well-defined clusters, since its predictions are based on distances between data points [50,52,54,55]. In this paper, the KNN model was applied for K = 5 nearest neighbors.

2.3.4. Support Vector Machines

The support vector machine (SVM) is a supervised learning model characterized by the formation of a set of hyperplanes that separate the input vectors into a number of classes. The hyperplanes can then be used to determine the most probable class for unknown input data. SVMs are known to perform well with high-dimensional data [56,57].

2.3.5. Gaussian Naive Bayes

The Gaussian naive Bayes method is a machine learning method that is based on Bayes’ theorem and assumes conditional independence among the input features. Despite the fact that conditional independence rarely holds in real-world problems, the Gaussian naive Bayes classifier has been reported to demonstrate high performance [39,58,59].
In this paper, the logistic regression, decision tree, k-nearest neighbors, Gaussian naive Bayes and support vector machine methods were applied, using 10-fold cross-validation, to the different dataset versions that resulted from the anonymization of the original EHR-derived dataset. The parameter values of the applied models are listed in Table A1.
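The evaluation loop, five scikit-learn classifiers scored with 10-fold cross-validation, can be sketched as follows. The synthetic dataset is an illustrative stand-in for the anonymized EHR versions, and the parameter values mirror the scikit-learn defaults listed in Table A1:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for one anonymized dataset version.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "Decision Tree": DecisionTreeClassifier(),
    "KNN (K=5)": KNeighborsClassifier(n_neighbors=5),
    "Gaussian NB": GaussianNB(),
    "SVM": SVC(kernel="rbf", gamma="auto"),
}

scores = {}
for name, model in models.items():
    # 10-fold cross-validation, scored with ROC AUC as in the paper.
    scores[name] = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name}: {scores[name]:.3f}")
```

In the actual experiments this loop would be repeated once per (qi, k) dataset version, collecting the AUC and MCC scores reported in Table A2.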

3. Results

The results were evaluated using the area under the curve (AUC) and the Matthews correlation coefficient (MCC) metrics, both of which are used extensively in the fields of medical informatics and bioinformatics. The AUC is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate [60,61]. The Matthews correlation coefficient (MCC) is a statistical metric that produces a high score only if the prediction obtained good results in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally to both the size of positive elements and the size of negative elements in the dataset [62,63]. The results of the experiments are listed exhaustively in Table A2 and demonstrated in Figure 1 and Figure 2.
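Both metrics are available in scikit-learn; a minimal sketch on toy labels (the values below are illustrative, not experimental results):

```python
from sklearn.metrics import roc_auc_score, matthews_corrcoef

y_true  = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.8, 0.7, 0.3, 0.2]          # predicted probabilities
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]  # thresholded labels

auc = roc_auc_score(y_true, y_score)     # AUC needs scores, not labels
mcc = matthews_corrcoef(y_true, y_pred)  # MCC needs hard label predictions
print(f"AUC = {auc:.3f}, MCC = {mcc:.3f}")  # AUC = 0.889, MCC = 0.707
```

Note that AUC is computed from the ranking of the continuous scores, while MCC is computed from the thresholded confusion matrix, which is why the two metrics can disagree.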
In Figure 1, the horizontal axis contains all of the (qi, k) value combinations, where qi is the ID of the quasi-identifier set as presented in Table 2. In Figure 1a, the AUC and MCC scores for the test dataset are plotted against k for each of the qi values, while Figure 1b depicts AUC and MCC plotted against qi for each of the k values. The parameters qi = 0, k = 1 represent the case in which the machine learning models were applied to the non-anonymized dataset. It can be observed that most machine learning models achieved better AUC and MCC scores with the non-anonymized dataset, which means that the information loss occurring from the anonymization process in general reduced the predictive ability of the models. However, the five models were affected to various degrees, with Gaussian naive Bayes suffering the most prominent performance reduction, the decision tree and KNN classifiers showing the most stable performance across the different anonymization parameter values, and logistic regression presenting a performance improvement as a result of the anonymization process for qi = 3b, 4.
In Figure 2, the AUC score and the MCC score are plotted juxtaposed with the GIL value against the anonymization parameter values, separately for each tested model. Figure 2a depicts AUC, MCC and GIL plotted against k for each qi value separately, while Figure 2b depicts AUC, MCC and GIL plotted against qi for each k value. It can be observed that lower GIL values were associated with higher AUC, MCC values in the cases of the logistic regression and Gaussian naive Bayes models. On the other hand, the decision tree, the KNN and the SVM classifiers were less affected by the variation in the GIL values.
The demonstrated results indicate that for a specific qi value, the k value did not affect the performance of the tested machine learning models significantly. Contrarily, all five models demonstrated lower performance for qi = 3a, which corresponded to the QI attributes (AGE, SEX, OUTCOME). However, the performance of most models was noticeably better for qi = 2, 3b and 4, which corresponded to the QI attributes (AGE, SEX), (AGE, SEX, CURADM_DAYS) and (AGE, SEX, CURADM_DAYS, PREVADM_DAYS), respectively. This indicates that the OUTCOME input variable plays a more prominent role than the AGE, SEX, CURADM_DAYS and PREVADM_DAYS attributes regarding the prediction of readmission. It can also be deduced that, in terms of predictive power, the choice of the QI set attributes is more significant than the size of the QI set or the k value.
In order to quantify the relationship between the GIL metric and the performance metrics AUC, MCC, two linear regression models were built for each applied machine learning model and were trained on 70% of the data. The linear regression models plotted against the test data are depicted in Figure 3, and their equations are presented in Table A3.
The statistical significance of the presented results was evaluated by performing a series of statistical tests. The descriptive statistics of the AUC and MCC metrics of the experiments with different ML models are presented in Table 3 and Figure 4. Both a parametric one-way ANOVA (Welch’s test [64]) and a non-parametric test (Kruskal–Wallis [65]), the latter accounting for the uncertain normality of the result distributions, showed statistically significant differences among the different algorithms (p < 0.001). Details of the analysis, as well as pairwise comparisons, are presented in Table A4, Table A5, Table A6, Table A7 and Table A8 of Appendix A.
As expected, there was a statistically significant negative correlation between the information loss (GIL) induced by the anonymization process and the model performance (GIL–test MCC Pearson’s r = −0.182, 95% CI [−0.335, −0.020], p = 0.028). Figure 5 depicts the correlation scatterplot and the respective value densities.
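This type of correlation analysis can be reproduced with SciPy; the sketch below runs on synthetic (GIL, MCC) pairs with a weak negative relationship built in, not on the actual experimental measurements:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
gil = rng.uniform(0.0, 0.4, size=100)
# Simulate a weak negative relationship between information loss and test MCC.
mcc = 0.45 - 0.2 * gil + rng.normal(0.0, 0.05, size=100)

r, p = stats.pearsonr(gil, mcc)
print(f"Pearson r = {r:.3f}, p = {p:.4f}")
```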

4. Discussion

In this paper, we presented a series of experiments designed to explore the predictive potential of anonymized medical datasets subjected to various degrees of manipulation. The original dataset used consisted of 117,181 records derived from EHR data from a Greek hospital and nine attributes related to the patient and the hospitalization (Table 1). The data were collected in the context of the MODELHealth project, which focused on the development of a system for the extraction of electronic health record data from a hospital database, the transformation of the data through harmonization and anonymization, and their loading into a central database in order for them to be used as inputs to machine learning models [29,30].
The dataset was subjected to one-hot decoding and anonymization using the Mondrian algorithm. Five attributes of the dataset were used as candidates for anonymization, with four combinations of them being used to form the quasi-identifier (QI) set. In addition, seven values in the interval 2–30 were used for the k anonymization parameter, which was the minimum number of records with the same values in the QI set (Table 2). The non-anonymized dataset, as well as the 28 dataset versions resulting from the different anonymization parameter combinations, were fed as input into five machine learning models, namely the decision tree, the logistic regression, the k-nearest neighbors, the support vector machine and the Gaussian naïve Bayes classifiers, the parameters of which have been presented in Table A1.
The main emerging finding is that the loss of predictive power (that is, predicted class pattern consistency in a machine learning classification context) as a function of the information loss due to aggregation and suppression anonymization processes varies considerably, depending on the nature of the classification algorithm used. Indeed, in our experiments, there were classifiers that showed great resilience even under significant information loss, and others that directly or indirectly were critically affected, with their accuracy metrics plunging towards non-significance (Figure 1 and Figure 2). More specifically, the decision tree and the k-nearest neighbors classifiers demonstrated noticeably stable performance regardless of the anonymization parameters, as opposed to the Gaussian naïve Bayes and the support vector machine classifiers. Remarkably, the logistic regression model demonstrated improved performance for various (qi, k) combinations of the anonymized dataset in comparison with the non-anonymized dataset, indicating that anonymization could have an effect similar to the regularization of machine learning models, although further experiments should be performed in order to reinforce that statement.
The experimental results showcased the choice of QI set attributes as well as the GIL value resulting from the anonymization process as the factors that seem to play a more significant role regarding the prediction results, rather than the size of the QI set or the k value. Indeed, the inclusion of the OUTCOME attribute, representing the hospitalization outcome, in the anonymized attribute set resulted in performance deterioration for all tested machine learning models. Contrarily, the rest of the attributes involved in the anonymization process, which were the patient’s age and sex, the number of days of the current admission, and the cumulative number of days of previous hospital admissions, did not have a similarly strong effect on the performance of the classifiers.
Our study reached conclusions similar to previously published work regarding the effect of anonymization on the performance of machine learning models. Indeed, the resilience of tree-based models to the information loss induced by anonymization has been mentioned in [22]. A difference in our approach compared to other published work was its stronger focus on various quasi-identifier sets and their effect on the classification performance. It could be said that just as the selection of input features of a machine learning model plays an important role in its performance, the selection of the quasi-identifier set can be a significant factor for the performance of machine learning on anonymized datasets. In addition, in the context of our experiments, the features used as input for the machine learning models were a superset of the quasi-identifier set, an approach that differed from similar published work, where the whole set of input features was anonymized.
The presented study can be extended by the application of deep learning neural networks and experimentation with their architecture and hyperparameters in order to investigate their resilience to the anonymization process. In addition, the effects of different anonymization algorithms on the predictive ability of machine learning models can be explored.

5. Conclusions

The anonymization of clinical data can have a negative impact on the performance of some machine learning models. However, the selection of appropriate models and parameter values can compensate for this effect, providing the opportunity to benefit from the application of machine learning while protecting patient privacy.

Author Contributions

Conceptualization, S.P. and A.F.; methodology, S.P. and A.F.; software, S.P. and A.F.; validation, S.P., A.F. and A.A.; writing—original draft preparation, S.P. and A.F.; writing—review and editing, S.P., A.F., A.A., G.K.M. and D.K.; supervision, S.P., G.K.M. and D.K.; project administration, S.P.; funding acquisition, D.K., G.K.M. and S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was co-funded by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship, and Innovation, under the call RESEARCH-CREATE-INNOVATE (Project Code: T1EDK-04066).

Informed Consent Statement

Not Applicable. Data used for this research purposes qualify as “anonymous information” per the definition provided in Recital 26 of the General Data Protection Regulation (GDPR) of the European Union (‘…information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable’).

Data Availability Statement

Raw data were generated in the context of the MODELHealth project, which has been co-funded by the European Regional Development Fund of the European Union and Greek national funds. Derived data supporting the findings of this study are available from the corresponding author S.P., upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. The machine learning models and the corresponding parameter values applied in the context of this paper.
Machine Learning Model | Parameters
Logistic Regression | C = 1.0, class_weight = None, dual = False, fit_intercept = True, intercept_scaling = 1, l1_ratio = None, max_iter = 100, multi_class = ‘auto’, n_jobs = None, penalty = ‘l2’, random_state = None, solver = ‘liblinear’, tol = 0.0001, verbose = 0, warm_start = False
Decision Tree Classifier | ccp_alpha = 0.0, class_weight = None, criterion = ‘gini’, max_depth = None, max_features = None, max_leaf_nodes = None, min_impurity_decrease = 0.0, min_impurity_split = None, min_samples_leaf = 1, min_samples_split = 2, min_weight_fraction_leaf = 0.0, presort = ‘deprecated’, random_state = None, splitter = ‘best’
KNeighborsClassifier | algorithm = ‘auto’, leaf_size = 30, metric = ‘minkowski’, metric_params = None, n_jobs = None, n_neighbors = 5, p = 2, weights = ‘uniform’
GaussianNB | priors = None, var_smoothing = 1 × 10⁻⁹
SVC | C = 1.0, break_ties = False, cache_size = 200, class_weight = None, coef0 = 0.0, decision_function_shape = ‘ovr’, degree = 3, gamma = ‘auto’, kernel = ‘rbf’, max_iter = −1, probability = False, random_state = None, shrinking = True, tol = 0.001, verbose = False
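The parameter names in Table A1 match the scikit-learn estimator API, so the five classifiers can be instantiated as sketched below. This is an illustrative sketch, not the paper's code: the deprecated `presort` and removed `min_impurity_split` options are omitted so the snippet runs on current scikit-learn, and only the parameters with non-default values in Table A1 are spelled out.

```python
# Sketch: the five classifiers of Table A1, instantiated with scikit-learn.
# Deprecated options from the original runs (presort, min_impurity_split)
# are omitted; the remaining values follow Table A1.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = {
    "LogisticRegression": LogisticRegression(
        C=1.0, penalty="l2", solver="liblinear", tol=1e-4, max_iter=100),
    "DecisionTreeClassifier": DecisionTreeClassifier(
        criterion="gini", splitter="best",
        min_samples_split=2, min_samples_leaf=1),
    "KNeighborsClassifier": KNeighborsClassifier(
        n_neighbors=5, weights="uniform", metric="minkowski", p=2),
    "GaussianNB": GaussianNB(var_smoothing=1e-9),
    "SVC": SVC(C=1.0, kernel="rbf", gamma="auto", tol=1e-3),
}
```

Each estimator in the dictionary can then be fitted on the (anonymized or original) training set with the usual `fit`/`predict` interface.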
Table A2. The results of the application of five machine learning models to the dataset for various values of the parameters QI, k. The performance of the models was evaluated with the area under the curve (AUC) and Matthews correlation coefficient (MCC) metrics. The generalized information loss (GIL), discernibility metric (DM) and average class size (CAVG) captured the information loss occurring due to the anonymization process.
Classifier | QI | k | GIL | DM | CAVG | Train AUC | Validation AUC | Test AUC | Test MCC
Logistic Regression | 0 | 1 | 0 | 0 | 0 | 0.723 | 0.723 | 0.723 | 0.456
 | 2 | 2 | 0.394 | 346,873,092 | 492.353 | 0.736 | 0.729 | 0.735 | 0.477
 | 2 | 3 | 0.394 | 346,873,108 | 331.017 | 0.717 | 0.709 | 0.716 | 0.438
 | 2 | 5 | 0.394 | 346,873,168 | 202.034 | 0.725 | 0.716 | 0.724 | 0.454
 | 2 | 10 | 0.394 | 346,873,396 | 102.789 | 0.726 | 0.719 | 0.725 | 0.457
 | 2 | 15 | 0.394 | 346,875,014 | 70.378 | 0.73 | 0.723 | 0.729 | 0.465
 | 2 | 20 | 0.394 | 346,878,574 | 54.25 | 0.724 | 0.719 | 0.723 | 0.454
 | 2 | 30 | 0.394 | 346,884,568 | 37.558 | 0.719 | 0.717 | 0.719 | 0.445
 | 3a | 2 | 0.352 | 254,366,790 | 209.25 | 0.685 | 0.661 | 0.682 | 0.365
 | 3a | 3 | 0.352 | 254,366,928 | 144.133 | 0.701 | 0.685 | 0.7 | 0.401
 | 3a | 5 | 0.352 | 254,367,482 | 91.547 | 0.692 | 0.674 | 0.69 | 0.38
 | 3a | 10 | 0.353 | 254,371,334 | 50.509 | 0.689 | 0.668 | 0.687 | 0.374
 | 3a | 15 | 0.353 | 254,378,570 | 36 | 0.679 | 0.649 | 0.676 | 0.352
 | 3a | 20 | 0.354 | 254,353,268 | 28.721 | 0.69 | 0.672 | 0.688 | 0.377
 | 3a | 30 | 0.355 | 254,372,510 | 20.134 | 0.686 | 0.667 | 0.684 | 0.369
 | 3b | 2 | 0.214 | 75,443,862 | 21.275 | 0.747 | 0.732 | 0.746 | 0.493
 | 3b | 3 | 0.215 | 75,448,478 | 16.502 | 0.748 | 0.719 | 0.745 | 0.493
 | 3b | 5 | 0.216 | 75,461,502 | 11.83 | 0.753 | 0.729 | 0.75 | 0.502
 | 3b | 10 | 0.219 | 75,530,606 | 7.73 | 0.75 | 0.727 | 0.748 | 0.497
 | 3b | 15 | 0.22 | 75,632,804 | 6.005 | 0.752 | 0.721 | 0.749 | 0.499
 | 3b | 20 | 0.222 | 75,820,294 | 5.208 | 0.736 | 0.708 | 0.733 | 0.467
 | 3b | 30 | 0.225 | 76,159,906 | 4.099 | 0.745 | 0.718 | 0.742 | 0.487
 | 4 | 2 | 0.085 | 7,964,110 | 3.789 | 0.778 | 0.733 | 0.773 | 0.548
 | 4 | 3 | 0.093 | 8,016,196 | 3.434 | 0.779 | 0.726 | 0.774 | 0.549
 | 4 | 5 | 0.104 | 8,145,452 | 3.012 | 0.782 | 0.731 | 0.777 | 0.554
 | 4 | 10 | 0.116 | 8,563,030 | 2.525 | 0.778 | 0.725 | 0.773 | 0.546
 | 4 | 15 | 0.123 | 9,082,286 | 2.304 | 0.776 | 0.717 | 0.77 | 0.541
 | 4 | 20 | 0.126 | 9,621,138 | 2.138 | 0.767 | 0.713 | 0.762 | 0.525
 | 4 | 30 | 0.131 | 10,880,770 | 1.953 | 0.766 | 0.721 | 0.761 | 0.522
Decision Tree Classifier | 0 | 1 | 0 | 0 | 0 | 0.959 | 0.658 | 0.929 | 0.861
 | 2 | 2 | 0.394 | 346,873,092 | 492.353 | 0.941 | 0.67 | 0.914 | 0.83
 | 2 | 3 | 0.394 | 346,873,108 | 331.017 | 0.934 | 0.655 | 0.906 | 0.814
 | 2 | 5 | 0.394 | 346,873,168 | 202.034 | 0.938 | 0.657 | 0.91 | 0.821
 | 2 | 10 | 0.394 | 346,873,396 | 102.789 | 0.936 | 0.663 | 0.908 | 0.818
 | 2 | 15 | 0.394 | 346,875,014 | 70.378 | 0.94 | 0.657 | 0.911 | 0.824
 | 2 | 20 | 0.394 | 346,878,574 | 54.25 | 0.941 | 0.669 | 0.914 | 0.83
 | 2 | 30 | 0.394 | 346,884,568 | 37.558 | 0.94 | 0.66 | 0.912 | 0.825
 | 3a | 2 | 0.352 | 254,366,790 | 209.25 | 0.916 | 0.65 | 0.889 | 0.781
 | 3a | 3 | 0.352 | 254,366,928 | 144.133 | 0.921 | 0.654 | 0.894 | 0.792
 | 3a | 5 | 0.352 | 254,367,482 | 91.547 | 0.917 | 0.654 | 0.891 | 0.784
 | 3a | 10 | 0.353 | 254,371,334 | 50.509 | 0.918 | 0.649 | 0.892 | 0.785
 | 3a | 15 | 0.353 | 254,378,570 | 36 | 0.91 | 0.636 | 0.882 | 0.766
 | 3a | 20 | 0.354 | 254,353,268 | 28.721 | 0.911 | 0.64 | 0.884 | 0.774
 | 3a | 30 | 0.355 | 254,372,510 | 20.134 | 0.92 | 0.657 | 0.894 | 0.791
 | 3b | 2 | 0.214 | 75,443,862 | 21.275 | 0.933 | 0.662 | 0.906 | 0.813
 | 3b | 3 | 0.215 | 75,448,478 | 16.502 | 0.936 | 0.666 | 0.909 | 0.821
 | 3b | 5 | 0.216 | 75,461,502 | 11.83 | 0.937 | 0.667 | 0.91 | 0.821
 | 3b | 10 | 0.219 | 75,530,606 | 7.73 | 0.939 | 0.687 | 0.914 | 0.829
 | 3b | 15 | 0.22 | 75,632,804 | 6.005 | 0.939 | 0.674 | 0.912 | 0.826
 | 3b | 20 | 0.222 | 75,820,294 | 5.208 | 0.93 | 0.646 | 0.902 | 0.806
 | 3b | 30 | 0.225 | 76,159,906 | 4.099 | 0.933 | 0.675 | 0.907 | 0.817
 | 4 | 2 | 0.085 | 7,964,110 | 3.789 | 0.946 | 0.664 | 0.918 | 0.837
 | 4 | 3 | 0.093 | 8,016,196 | 3.434 | 0.941 | 0.679 | 0.915 | 0.835
 | 4 | 5 | 0.104 | 8,145,452 | 3.012 | 0.943 | 0.686 | 0.918 | 0.837
 | 4 | 10 | 0.116 | 8,563,030 | 2.525 | 0.94 | 0.671 | 0.913 | 0.827
 | 4 | 15 | 0.123 | 9,082,286 | 2.304 | 0.941 | 0.671 | 0.914 | 0.83
 | 4 | 20 | 0.126 | 9,621,138 | 2.138 | 0.933 | 0.67 | 0.907 | 0.819
 | 4 | 30 | 0.131 | 10,880,770 | 1.953 | 0.932 | 0.676 | 0.906 | 0.814
KNN | 0 | 1 | 0 | 0 | 0 | 0.793 | 0.712 | 0.785 | 0.57
 | 2 | 2 | 0.394 | 346,873,092 | 492.353 | 0.784 | 0.702 | 0.776 | 0.553
 | 2 | 3 | 0.394 | 346,873,108 | 331.017 | 0.772 | 0.686 | 0.763 | 0.527
 | 2 | 5 | 0.394 | 346,873,168 | 202.034 | 0.782 | 0.691 | 0.773 | 0.546
 | 2 | 10 | 0.394 | 346,873,396 | 102.789 | 0.778 | 0.702 | 0.771 | 0.542
 | 2 | 15 | 0.394 | 346,875,014 | 70.378 | 0.778 | 0.702 | 0.771 | 0.542
 | 2 | 20 | 0.394 | 346,878,574 | 54.25 | 0.772 | 0.683 | 0.763 | 0.527
 | 2 | 30 | 0.394 | 346,884,568 | 37.558 | 0.779 | 0.694 | 0.771 | 0.542
 | 3a | 2 | 0.352 | 254,366,790 | 209.25 | 0.764 | 0.671 | 0.755 | 0.51
 | 3a | 3 | 0.352 | 254,366,928 | 144.133 | 0.775 | 0.691 | 0.767 | 0.534
 | 3a | 5 | 0.352 | 254,367,482 | 91.547 | 0.763 | 0.673 | 0.754 | 0.508
 | 3a | 10 | 0.353 | 254,371,334 | 50.509 | 0.764 | 0.671 | 0.755 | 0.51
 | 3a | 15 | 0.353 | 254,378,570 | 36 | 0.758 | 0.664 | 0.749 | 0.497
 | 3a | 20 | 0.354 | 254,353,268 | 28.721 | 0.756 | 0.661 | 0.746 | 0.493
 | 3a | 30 | 0.355 | 254,372,510 | 20.134 | 0.761 | 0.668 | 0.752 | 0.505
 | 3b | 2 | 0.214 | 75,443,862 | 21.275 | 0.783 | 0.704 | 0.775 | 0.551
 | 3b | 3 | 0.215 | 75,448,478 | 16.502 | 0.788 | 0.712 | 0.78 | 0.563
 | 3b | 5 | 0.216 | 75,461,502 | 11.83 | 0.771 | 0.691 | 0.763 | 0.526
 | 3b | 10 | 0.219 | 75,530,606 | 7.73 | 0.789 | 0.708 | 0.781 | 0.562
 | 3b | 15 | 0.22 | 75,632,804 | 6.005 | 0.778 | 0.693 | 0.77 | 0.541
 | 3b | 20 | 0.222 | 75,820,294 | 5.208 | 0.771 | 0.68 | 0.762 | 0.525
 | 3b | 30 | 0.225 | 76,159,906 | 4.099 | 0.781 | 0.695 | 0.772 | 0.545
 | 4 | 2 | 0.085 | 7,964,110 | 3.789 | 0.782 | 0.693 | 0.773 | 0.547
 | 4 | 3 | 0.093 | 8,016,196 | 3.434 | 0.785 | 0.694 | 0.776 | 0.552
 | 4 | 5 | 0.104 | 8,145,452 | 3.012 | 0.788 | 0.71 | 0.78 | 0.561
 | 4 | 10 | 0.116 | 8,563,030 | 2.525 | 0.78 | 0.702 | 0.773 | 0.545
 | 4 | 15 | 0.123 | 9,082,286 | 2.304 | 0.792 | 0.7 | 0.783 | 0.566
 | 4 | 20 | 0.126 | 9,621,138 | 2.138 | 0.785 | 0.702 | 0.777 | 0.553
 | 4 | 30 | 0.131 | 10,880,770 | 1.953 | 0.776 | 0.696 | 0.768 | 0.537
Gaussian NB | 0 | 1 | 0 | 0 | 0 | 0.708 | 0.708 | 0.708 | 0.431
 | 2 | 2 | 0.394 | 346,873,092 | 492.353 | 0.561 | 0.55 | 0.56 | 0.184
 | 2 | 3 | 0.394 | 346,873,108 | 331.017 | 0.558 | 0.548 | 0.557 | 0.196
 | 2 | 5 | 0.394 | 346,873,168 | 202.034 | 0.554 | 0.547 | 0.553 | 0.192
 | 2 | 10 | 0.394 | 346,873,396 | 102.789 | 0.539 | 0.529 | 0.538 | 0.154
 | 2 | 15 | 0.394 | 346,875,014 | 70.378 | 0.56 | 0.543 | 0.559 | 0.179
 | 2 | 20 | 0.394 | 346,878,574 | 54.25 | 0.57 | 0.55 | 0.568 | 0.191
 | 2 | 30 | 0.394 | 346,884,568 | 37.558 | 0.56 | 0.549 | 0.559 | 0.193
 | 3a | 2 | 0.352 | 254,366,790 | 209.25 | 0.559 | 0.548 | 0.558 | 0.212
 | 3a | 3 | 0.352 | 254,366,928 | 144.133 | 0.554 | 0.543 | 0.553 | 0.192
 | 3a | 5 | 0.352 | 254,367,482 | 91.547 | 0.563 | 0.551 | 0.561 | 0.209
 | 3a | 10 | 0.353 | 254,371,334 | 50.509 | 0.554 | 0.538 | 0.553 | 0.182
 | 3a | 15 | 0.353 | 254,378,570 | 36 | 0.563 | 0.549 | 0.561 | 0.209
 | 3a | 20 | 0.354 | 254,353,268 | 28.721 | 0.561 | 0.551 | 0.56 | 0.205
 | 3a | 30 | 0.355 | 254,372,510 | 20.134 | 0.553 | 0.539 | 0.551 | 0.186
 | 3b | 2 | 0.214 | 75,443,862 | 21.275 | 0.568 | 0.544 | 0.566 | 0.218
 | 3b | 3 | 0.215 | 75,448,478 | 16.502 | 0.577 | 0.554 | 0.575 | 0.24
 | 3b | 5 | 0.216 | 75,461,502 | 11.83 | 0.56 | 0.534 | 0.557 | 0.206
 | 3b | 10 | 0.219 | 75,530,606 | 7.73 | 0.577 | 0.55 | 0.574 | 0.246
 | 3b | 15 | 0.22 | 75,632,804 | 6.005 | 0.586 | 0.55 | 0.582 | 0.237
 | 3b | 20 | 0.222 | 75,820,294 | 5.208 | 0.568 | 0.539 | 0.565 | 0.218
 | 3b | 30 | 0.225 | 76,159,906 | 4.099 | 0.568 | 0.544 | 0.566 | 0.22
 | 4 | 2 | 0.085 | 7,964,110 | 3.789 | 0.618 | 0.55 | 0.611 | 0.347
 | 4 | 3 | 0.093 | 8,016,196 | 3.434 | 0.626 | 0.544 | 0.618 | 0.363
 | 4 | 5 | 0.104 | 8,145,452 | 3.012 | 0.641 | 0.558 | 0.633 | 0.383
 | 4 | 10 | 0.116 | 8,563,030 | 2.525 | 0.627 | 0.555 | 0.62 | 0.355
 | 4 | 15 | 0.123 | 9,082,286 | 2.304 | 0.638 | 0.572 | 0.632 | 0.379
 | 4 | 20 | 0.126 | 9,621,138 | 2.138 | 0.652 | 0.581 | 0.645 | 0.388
 | 4 | 30 | 0.131 | 10,880,770 | 1.953 | 0.661 | 0.61 | 0.656 | 0.397
SVC | 0 | 1 | 0 | 0 | 0 | 0.711 | 0.711 | 0.711 | 0.437
 | 2 | 2 | 0.394 | 346,873,092 | 492.353 | 0.709 | 0.709 | 0.709 | 0.431
 | 2 | 3 | 0.394 | 346,873,108 | 331.017 | 0.686 | 0.686 | 0.686 | 0.387
 | 2 | 5 | 0.394 | 346,873,168 | 202.034 | 0.688 | 0.688 | 0.688 | 0.391
 | 2 | 10 | 0.394 | 346,873,396 | 102.789 | 0.695 | 0.695 | 0.695 | 0.404
 | 2 | 15 | 0.394 | 346,875,014 | 70.378 | 0.703 | 0.703 | 0.703 | 0.421
 | 2 | 20 | 0.394 | 346,878,574 | 54.25 | 0.701 | 0.7 | 0.701 | 0.416
 | 2 | 30 | 0.394 | 346,884,568 | 37.558 | 0.69 | 0.691 | 0.69 | 0.397
 | 3a | 2 | 0.352 | 254,366,790 | 209.25 | 0.571 | 0.571 | 0.571 | 0.221
 | 3a | 3 | 0.352 | 254,366,928 | 144.133 | 0.584 | 0.57 | 0.583 | 0.222
 | 3a | 5 | 0.352 | 254,367,482 | 91.547 | 0.588 | 0.578 | 0.587 | 0.22
 | 3a | 10 | 0.353 | 254,371,334 | 50.509 | 0.583 | 0.581 | 0.583 | 0.217
 | 3a | 15 | 0.353 | 254,378,570 | 36 | 0.568 | 0.567 | 0.568 | 0.21
 | 3a | 20 | 0.354 | 254,353,268 | 28.721 | 0.593 | 0.59 | 0.592 | 0.213
 | 3a | 30 | 0.355 | 254,372,510 | 20.134 | 0.579 | 0.578 | 0.579 | 0.176
 | 3b | 2 | 0.214 | 75,443,862 | 21.275 | 0.695 | 0.695 | 0.695 | 0.402
 | 3b | 3 | 0.215 | 75,448,478 | 16.502 | 0.7 | 0.7 | 0.7 | 0.416
 | 3b | 5 | 0.216 | 75,461,502 | 11.83 | 0.696 | 0.696 | 0.696 | 0.407
 | 3b | 10 | 0.219 | 75,530,606 | 7.73 | 0.694 | 0.694 | 0.694 | 0.405
 | 3b | 15 | 0.22 | 75,632,804 | 6.005 | 0.703 | 0.703 | 0.703 | 0.422
 | 3b | 20 | 0.222 | 75,820,294 | 5.208 | 0.681 | 0.681 | 0.681 | 0.379
 | 3b | 30 | 0.225 | 76,159,906 | 4.099 | 0.698 | 0.697 | 0.698 | 0.409
 | 4 | 2 | 0.085 | 7,964,110 | 3.789 | 0.72 | 0.721 | 0.72 | 0.444
 | 4 | 3 | 0.093 | 8,016,196 | 3.434 | 0.714 | 0.712 | 0.714 | 0.433
 | 4 | 5 | 0.104 | 8,145,452 | 3.012 | 0.707 | 0.704 | 0.707 | 0.423
 | 4 | 10 | 0.116 | 8,563,030 | 2.525 | 0.703 | 0.699 | 0.702 | 0.416
 | 4 | 15 | 0.123 | 9,082,286 | 2.304 | 0.696 | 0.696 | 0.696 | 0.407
 | 4 | 20 | 0.126 | 9,621,138 | 2.138 | 0.69 | 0.689 | 0.69 | 0.396
 | 4 | 30 | 0.131 | 10,880,770 | 1.953 | 0.696 | 0.695 | 0.695 | 0.405
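The two performance metrics of Table A2 correspond directly to scikit-learn functions; the sketch below computes them on a toy set of labels and scores (the data are illustrative, not the paper's).

```python
# Sketch: computing the Table A2 evaluation metrics with scikit-learn.
from sklearn.metrics import roc_auc_score, matthews_corrcoef

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # continuous scores, e.g. predicted probabilities
y_pred = [0, 1, 1, 1]            # hard labels after thresholding the scores

auc = roc_auc_score(y_true, y_score)    # area under the ROC curve
mcc = matthews_corrcoef(y_true, y_pred)  # Matthews correlation coefficient
```

Note that AUC is computed from the continuous scores, while MCC needs the thresholded class labels; this is why the two metrics can rank classifiers differently.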
Table A3. The parameters of the linear models fitted to model the relationship between AUC and GIL, as well as MCC and GIL, for each machine learning model used in the context of this paper.
Machine Learning Model | Linear Regression AUC vs. GIL | Linear Regression MCC vs. GIL
Logistic Regression | y_AUC = 0.77886 − 0.1914 × x_GIL | y_MCC = 0.54762 − 0.3393 × x_GIL
Decision Tree Classifier | y_AUC = 0.9182 − 0.03648 × x_GIL | y_MCC = 0.84385 − 0.1185 × x_GIL
KNeighborsClassifier | y_AUC = 0.78136 − 0.04867 × x_GIL | y_MCC = 0.56195 − 0.0921 × x_GIL
GaussianNB | y_AUC = 0.659 − 0.2924 × x_GIL | y_MCC = 0.41258 − 0.6357 × x_GIL
SVC | y_AUC = 0.72082 − 0.1865 × x_GIL | y_MCC = 0.44355 − 0.26539528 × x_GIL
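Lines of the form y = a + b·x_GIL, as in Table A3, can be fitted by ordinary least squares; a minimal sketch with `numpy.polyfit` follows. The data points are synthetic (generated exactly on the logistic regression line of Table A3) purely to show that the fit recovers the intercept and slope.

```python
# Sketch: least-squares fit of AUC against GIL, as for the lines of Table A3.
import numpy as np

gil = np.array([0.0, 0.085, 0.214, 0.352, 0.394])  # example GIL values
auc = 0.77886 - 0.1914 * gil  # synthetic points lying exactly on the LR line

# polyfit with degree 1 returns [slope, intercept]
slope, intercept = np.polyfit(gil, auc, 1)
```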
Table A4. The results of one-way ANOVA analysis (Welch’s test).
 | F | df1 | df2 | p
Test MCC | 908 | 4 | 65.9 | <0.001
Test AUC | 1163 | 4 | 65.6 | <0.001
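Welch's one-way ANOVA (Table A4) relaxes the equal-variance assumption of the classical F test; since SciPy ships only the classical version, the sketch below implements the textbook Welch formula directly. This is an illustrative implementation, not the paper's code, and the example groups are synthetic.

```python
# Sketch of Welch's one-way ANOVA (heteroscedastic F test).
import numpy as np
from scipy.stats import f as f_dist

def welch_anova(*groups):
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                             # precision weights n_j / s_j^2
    mw = np.sum(w * m) / np.sum(w)        # weighted grand mean
    num = np.sum(w * (m - mw) ** 2) / (k - 1)
    tmp = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    den = 1 + 2 * (k - 2) / (k ** 2 - 1) * tmp
    F = num / den
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * tmp)        # Welch's approximate denominator df
    p = f_dist.sf(F, df1, df2)
    return F, df1, df2, p

# Three synthetic groups with clearly different means
F, df1, df2, p = welch_anova([1, 2, 3, 4, 5], [11, 12, 13, 14, 15], [21, 22, 23, 24, 25])
```

Note that df2 is generally non-integer, which is why Table A4 reports fractional denominator degrees of freedom (65.9 and 65.6).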
Table A5. The results of the Games–Howell post hoc test on test MCC.
 | DecisionTreeClassifier | GaussianNB | KNN | LogisticRegression | SVC
Decision Tree Classifier (mean difference) | – | 0.563 *** | 0.277 *** | 0.3497 *** | 0.452 ***
Decision Tree Classifier (p-value) | – | <0.001 | <0.001 | <0.001 | <0.001
Gaussian NB (mean difference) |  | – | −0.285 *** | −0.2129 *** | −0.111 ***
Gaussian NB (p-value) |  | – | <0.001 | <0.001 | <0.001
KNN (mean difference) |  |  | – | 0.0723 *** | 0.174 ***
KNN (p-value) |  |  | – | <0.001 | <0.001
Logistic Regression (mean difference) |  |  |  | – | 0.102 ***
Logistic Regression (p-value) |  |  |  | – | <0.001
SVC (mean difference) |  |  |  |  | –
SVC (p-value) |  |  |  |  | –
Note. *** p < 0.001.
Table A6. The results of the Games–Howell post hoc test on train AUC.
 | DecisionTreeClassifier | GaussianNB | KNN | LogisticRegression | SVC
Decision Tree Classifier (mean difference) | – | 0.348 *** | 0.156 *** | 0.1995 *** | 0.2629 ***
Decision Tree Classifier (p-value) | – | <0.001 | <0.001 | <0.001 | <0.001
Gaussian NB (mean difference) |  | – | −0.191 *** | −0.1481 *** | −0.0848 ***
Gaussian NB (p-value) |  | – | <0.001 | <0.001 | <0.001
KNN (mean difference) |  |  | – | 0.0431 *** | 0.1065 ***
KNN (p-value) |  |  | – | <0.001 | <0.001
Logistic Regression (mean difference) |  |  |  | – | 0.0633 ***
Logistic Regression (p-value) |  |  |  | – | <0.001
SVC (mean difference) |  |  |  |  | –
SVC (p-value) |  |  |  |  | –
Note. *** p < 0.001.
Table A7. The results of the Kruskal–Wallis test (non-parametric one-way ANOVA).
 | χ² | df | p
Test MCC | 125 | 4 | <0.001
Test AUC | 128 | 4 | <0.001
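The Kruskal–Wallis test of Table A7 is available directly in SciPy; the sketch below applies it to three hypothetical per-run score samples (invented for illustration, not taken from the paper).

```python
# Sketch: Kruskal-Wallis rank test across classifier score samples,
# as used for Table A7. The sample values below are hypothetical.
from scipy.stats import kruskal

dt = [0.82, 0.81, 0.83, 0.82, 0.81]   # e.g. test MCC values per run
gnb = [0.21, 0.25, 0.19, 0.22, 0.24]
knn = [0.54, 0.53, 0.55, 0.54, 0.52]

h_stat, p_value = kruskal(dt, gnb, knn)
```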
Table A8. The results of the Dwass-Steel-Critchlow-Fligner pairwise comparisons for test MCC.
Comparison | W | p
DecisionTreeClassifier vs. GaussianNB | −9.25 | <0.001
DecisionTreeClassifier vs. KNeighborsClassifier | −9.33 | <0.001
DecisionTreeClassifier vs. LogisticRegression | −9.25 | <0.001
DecisionTreeClassifier vs. SVC | −9.25 | <0.001
GaussianNB vs. KNeighborsClassifier | 9.33 | <0.001
GaussianNB vs. LogisticRegression | 8.54 | <0.001
GaussianNB vs. SVC | 6.37 | <0.001
KNeighborsClassifier vs. LogisticRegression | −6.80 | <0.001
KNeighborsClassifier vs. SVC | −9.33 | <0.001
LogisticRegression vs. SVC | −5.97 | <0.001
Table A9. The results of the Dwass-Steel-Critchlow-Fligner pairwise comparisons for test AUC.
Comparison | W | p
DecisionTreeClassifier vs. GaussianNB | −9.25 | <0.001
DecisionTreeClassifier vs. KNeighborsClassifier | −9.33 | <0.001
DecisionTreeClassifier vs. LogisticRegression | −9.25 | <0.001
DecisionTreeClassifier vs. SVC | −9.25 | <0.001
GaussianNB vs. KNeighborsClassifier | 9.33 | <0.001
GaussianNB vs. LogisticRegression | 9.1 | <0.001
GaussianNB vs. SVC | 7.46 | <0.001
KNeighborsClassifier vs. LogisticRegression | −6.88 | <0.001
KNeighborsClassifier vs. SVC | −9.33 | <0.001
LogisticRegression vs. SVC | −6.32 | <0.001

References

  1. Abouelmehdi, K.; Beni-Hssane, A.; Khaloufi, H.; Saadi, M. Big Data Security and Privacy in Healthcare: A Review. In Procedia Computer Science; Elsevier: Amsterdam, The Netherlands, 2017; Volume 113, pp. 73–80. [Google Scholar]
  2. Priya, R.; Sivasankaran, S.; Ravisasthiri, P.; Sivachandiran, S. A Survey on Security Attacks in Electronic Healthcare Systems. In Proceedings of the 2017 IEEE International Conference on Communication and Signal Processing, ICCSP, Chennai, India, 6–8 April 2017; pp. 691–694. [Google Scholar]
  3. Khokhar, R.H.; Chen, R.; Fung, B.C.M.; Lui, S.M. Quantifying the Costs and Benefits of Privacy-Preserving Health Data Publishing. J. Biomed. Inform. 2014, 50, 107–121. [Google Scholar] [CrossRef] [PubMed]
  4. Pitoglou, S.; Giannouli, D.; Costarides, V.; Androutsou, T.; Anastasiou, A. Cybercrime and Private Health Data. In Encyclopedia of Criminal Activities and the Deep Web; IGI Global: Hershey, PA, USA, 2020; pp. 763–787. [Google Scholar]
  5. Kruse, C.S.; Frederick, B.; Jacobson, T.; Monticone, D.K. Cybersecurity in Healthcare: A Systematic Review of Modern Threats and Trends. Technol. Health Care 2017, 25, 1–10. [Google Scholar] [CrossRef] [PubMed]
  6. Ponemon Institute, LLC. Sixth Annual Benchmark Study on Privacy & Security of Healthcare Data. Available online: https://www.ponemon.org/blog/sixth-annual-benchmark-study-on-privacy-security-of-healthcare-data-1 (accessed on 8 May 2020).
  7. Samarati, P. Protecting Respondents’ Identities in Microdata Release. IEEE Trans. Knowl. Data Eng. 2001, 13, 1010–1027. [Google Scholar] [CrossRef]
  8. Hathaliya, J.J.; Tanwar, S. An Exhaustive Survey on Security and Privacy Issues in Healthcare 4.0. Comput. Commun. 2020, 153, 311–335. [Google Scholar] [CrossRef]
  9. Gkoulalas-Divanis, A.; Loukides, G.; Sun, J. Publishing Data from Electronic Health Records While Preserving Privacy: A Survey of Algorithms. J. Biomed. Inform. 2014, 50, 4–19. [Google Scholar] [CrossRef] [PubMed]
  10. Nusinovici, S.; Tham, Y.C.; Yan, M.Y.C.; Ting, D.S.W.; Li, J.; Sabanayagam, C.; Wong, T.Y.; Cheng, C.-Y. Logistic Regression Was as Good as Machine Learning for Predicting Major Chronic Diseases. J. Clin. Epidemiol. 2020, 122, 56–69. [Google Scholar] [CrossRef]
  11. Ngiam, K.Y.; Khor, I.W. Big Data and Machine Learning Algorithms for Health-Care Delivery. Lancet Oncol. 2019, 20, e262–e273. [Google Scholar] [CrossRef]
  12. Ravi, D.; Wong, C.; Deligianni, F.; Berthelot, M.; Andreu-Perez, J.; Lo, B.; Yang, G.Z. Deep Learning for Health Informatics. IEEE J. Biomed. Health Inform. 2017, 21, 4–21. [Google Scholar] [CrossRef]
  13. Rajkomar, A.; Oren, E.; Chen, K.; Dai, A.M.; Hajaj, N.; Hardt, M.; Liu, P.J.; Liu, X.; Marcus, J.; Sun, M.; et al. Scalable and Accurate Deep Learning with Electronic Health Records. npj Digit. Med. 2018, 1, 18. [Google Scholar] [CrossRef]
  14. Miotto, R.; Li, L.; Kidd, B.A.; Dudley, J.T. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records. Sci. Rep. 2016, 6, 26094. [Google Scholar] [CrossRef]
  15. Al-Rubaie, M.; Chang, J.M. Privacy-Preserving Machine Learning: Threats and Solutions. IEEE Secur. Priv. 2019, 17, 49–58. [Google Scholar] [CrossRef]
  16. Malle, B.; Kieseberg, P.; Weippl, E.; Holzinger, A. The Right to Be Forgotten: Towards Machine Learning on Perturbed Knowledge Bases. In Availability, Reliability, and Security in Information Systems, Proceedings of the CD-ARES 2016, Salzburg, Austria, 31 August–2 September 2016; Springer: Berlin/Heidelberg, Germany, 2016; Volume 9817, pp. 251–266. [Google Scholar]
  17. Malle, B.; Kieseberg, P.; Holzinger, A. Interactive Anonymization for Privacy Aware Machine Learning. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery ECML-PKDD, Skopje, North Macedonia, 18–20 September 2017; pp. 15–26. [Google Scholar]
  18. Jaidan, D.N.; Carrere, M.; Chemli, Z.; Poisvert, R. Data Anonymization for Privacy Aware Machine Learning. In Machine Learning, Optimization, and Data Science, Proceedings of the LOD 2019, Siena, Italy, 10–13 September 2019; Springer: Amsterdam, The Netherlands, 2019; Volume 11943 LNCS, pp. 725–737. [Google Scholar]
  19. Bost, R.; Ada Popa, R.; Tu, S.; Goldwasser, S. Machine Learning Classification over Encrypted Data. In Network and Distributed System Security Symposium; Internet Society: Reston, VA, USA, 2015; p. 4325. [Google Scholar]
  20. Li, J.; Liu, J.; Baig, M.; Wong, R.C.W. Information Based Data Anonymization for Classification Utility. Data Knowl. Eng. 2011, 70, 1030–1045. [Google Scholar] [CrossRef]
  21. Last, M.; Tassa, T.; Zhmudyak, A.; Shmueli, E. Improving Accuracy of Classification Models Induced from Anonymized Datasets. Inf. Sci. 2014, 256, 138–161. [Google Scholar] [CrossRef]
  22. Slijepčević, D.; Henzl, M.; Klausner, L.D.; Dam, T.; Kieseberg, P.; Zeppelzauer, M. K-Anonymity in Practice: How Generalisation and Suppression Affect Machine Learning Classifiers. Comput. Secur. 2021, 111, 102488. [Google Scholar] [CrossRef]
  23. LeFevre, K.; DeWitt, D.J.; Ramakrishnan, R. Mondrian Multidimensional K-Anonymity. In Proceedings of the International Conference on Data Engineering, Atlanta, GA, USA, 3–7 April 2006; Volume 2006, p. 25. [Google Scholar]
  24. Mohammed, N.; Fung, B.C.M.; Hung, P.C.K.; Lee, C.K. Anonymizing Healthcare Data: A Case Study on the Blood Transfusion Service. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009; ACM Press: New York, NY, USA, 2009; pp. 1285–1293. [Google Scholar]
  25. Goldberger, J.; Tassa, T. Efficient Anonymizations with Enhanced Utility. Trans. Data Priv. 2010, 3, 149–175. [Google Scholar]
  26. El Emam, K.; Dankar, F.K.; Issa, R.; Jonker, E.; Amyot, D.; Cogo, E.; Corriveau, J.P.; Walker, M.; Chowdhury, S.; Vaillancourt, R.; et al. A Globally Optimal K-Anonymity Method for the De-Identification of Health Data. J. Am. Med. Inform. Assoc. 2009, 16, 670–682. [Google Scholar] [CrossRef]
  27. Xu, J.; Wang, W.; Pei, J.; Wang, X.; Shi, B.; Fu, A.W.C. Utility-Based Anonymization Using Local Recoding. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; Association for Computing Machinery: New York, NY, USA, 2006; Volume 2006, pp. 785–790. [Google Scholar]
  28. Lin, J.L.; Wei, M.C. An Efficient Clustering Method for K-Anonymization. In Proceedings of the ACM International Conference Proceeding Series, Nantes, France, 29 March 2008; ACM Press: New York, NY, USA, 2008; Volume 331, pp. 46–50. [Google Scholar]
  29. Pitoglou, S.; Anastasiou, A.; Androutsou, T.; Giannouli, D.; Kostalas, E.; Matsopoulos, G.; Koutsouris, D. MODELHealth: Facilitating Machine Learning on Big Health Data Networks. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany, 23–27 July 2019; Institute of Electrical and Electronics Engineers: Piscataway, NJ, USA, 2019; pp. 2174–2177. [Google Scholar]
  30. Pitoglou, S. Machine Learning in Healthcare, Introduction and Real World Application Considerations. Int. J. Reliab. Qual. E-Healthcare 2018, 7, 27–36. [Google Scholar] [CrossRef]
  31. Samarati, P.; Sweeney, L. Protecting Privacy When Disclosing Information: K-Anonymity and Its Enforcement through Generalization and Suppression; Technical Report SRI-CSL-98-04; Computer Science Laboratory, SRI International: Menlo Park, CA, USA, 1998. [Google Scholar]
  32. Aggarwal, N.; Agrawal, R.K. First and Second Order Statistics Features for Classification of Magnetic Resonance Brain Images. J. Signal Inf. Process. 2012, 3, 146–153. [Google Scholar] [CrossRef]
  33. Li, N.; Li, T.; Venkatasubramanian, S. T-Closeness: Privacy beyond k-Anonymity and ℓ-Diversity. In Proceedings of the International Conference on Data Engineering, Istanbul, Turkey, 15–20 April 2007; pp. 106–115. [Google Scholar]
  34. Machanavajjhala, A.; Kifer, D.; Gehrke, J.; Venkitasubramaniam, M. ℓ-Diversity: Privacy beyond k-Anonymity. ACM Trans. Knowl. Discov. Data 2007, 1, 3. [Google Scholar] [CrossRef]
  35. Ayala-Rivera, V.; McDonagh, P.; Cerqueus, T.; Murphy, L. A Systematic Comparison and Evaluation of K-Anonymization Algorithms for Practitioners. Trans. Data Priv. 2014, 7, 337–370. [Google Scholar]
  36. Iyengar, V.S. Transforming Data to Satisfy Privacy Constraints. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada, 23–26 July 2002; ACM Press: New York, NY, USA, 2002; p. 279. [Google Scholar]
  37. Nergiz, M.E.; Clifton, C. Thoughts on K-Anonymization. In Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW 2006), Atlanta, GA, USA, 3–7 April 2006; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2006. [Google Scholar]
  38. Bayardo, R.J.; Agrawal, R. Data Privacy through Optimal K-Anonymization. In Proceedings of the 21st International Conference on Data Engineering (ICDE’05), Tokyo, Japan, 5–8 April 2005; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2005; pp. 217–228. [Google Scholar]
  39. Cirkovic, B.R.A.; Cvetkovic, A.M.; Ninkovic, S.M.; Filipovic, N.D. Prediction Models for Estimation of Survival Rate and Relapse for Breast Cancer Patients. In Proceedings of the 2015 IEEE 15th International Conference on Bioinformatics and Bioengineering, BIBE, Belgrade, Serbia, 2–4 November 2015; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2015. [Google Scholar]
  40. Lee, Y.; Ragguett, R.M.; Mansur, R.B.; Boutilier, J.J.; Rosenblat, J.D.; Trevizol, A.; Brietzke, E.; Lin, K.; Pan, Z.; Subramaniapillai, M.; et al. Applications of Machine Learning Algorithms to Predict Therapeutic Outcomes in Depression: A Meta-Analysis and Systematic Review. J. Affect. Disord. 2018, 241, 519–532. [Google Scholar] [CrossRef]
  41. Luz, C.F.; Vollmer, M.; Decruyenaere, J.; Nijsten, M.W.; Glasner, C.; Sinha, B. Machine Learning in Infection Management Using Routine Electronic Health Records: Tools, Techniques, and Reporting of Future Technologies. Clin. Microbiol. Infect. 2020, 26, 1291–1299. [Google Scholar] [CrossRef]
  42. Nisbet, R.; Miner, G.; Yale, K. Basic Algorithms for Data Mining: A Brief Overview. In Handbook of Statistical Analysis and Data Mining Applications; Elsevier: Amsterdam, The Netherlands, 2018; pp. 121–147. [Google Scholar]
  43. Hosmer, D.W.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression, 3rd ed.; Wiley: Hoboken, NJ, USA, 2013; ISBN 978-0-470-58247-3. [Google Scholar]
  44. Spitznagel, E.L. 6 Logistic Regression. Handb. Stat. 2007, 27, 187–209. [Google Scholar]
  45. Hassanipour, S.; Ghaem, H.; Arab-Zozani, M.; Seif, M.; Fararouei, M.; Abdzadeh, E.; Sabetian, G.; Paydar, S. Comparison of Artificial Neural Network and Logistic Regression Models for Prediction of Outcomes in Trauma Patients: A Systematic Review and Meta-Analysis. Injury 2019, 50, 244–250. [Google Scholar] [CrossRef]
  46. Christodoulou, E.; Ma, J.; Collins, G.S.; Steyerberg, E.W.; Verbakel, J.Y.; Van Calster, B. A Systematic Review Shows No Performance Benefit of Machine Learning over Logistic Regression for Clinical Prediction Models. J. Clin. Epidemiol. 2019, 110, 12–22. [Google Scholar] [CrossRef]
  47. Sun, X.; Douiri, A.; Gulliford, M. Applying Machine Learning Algorithms to Electronic Health Records to Predict Pneumonia after Respiratory Tract Infection. J. Clin. Epidemiol. 2022, 145, 154–163. [Google Scholar] [CrossRef]
  48. Austin, P.C.; Harrell, F.E.; Steyerberg, E.W. Predictive Performance of Machine and Statistical Learning Methods: Impact of Data-Generating Processes on External Validity in the “Large N, Small p” Setting. Stat. Methods Med. Res. 2021, 30, 1465–1483. [Google Scholar] [CrossRef]
  49. Fernandes, M.; Vieira, S.M.; Leite, F.; Palos, C.; Finkelstein, S.; Sousa, J.M.C. Clinical Decision Support Systems for Triage in the Emergency Department Using Intelligent Systems: A Review. Artif. Intell. Med. 2020, 102, 101762. [Google Scholar] [CrossRef]
  50. Talia, D.; Trunfio, P.; Marozzo, F. Introduction to Data Mining. In Data Analysis in the Cloud; Elsevier: Amsterdam, The Netherlands, 2016; pp. 1–25. [Google Scholar]
  51. Quinlan, J.R. Simplifying Decision Trees. Int. J. Man. Mach. Stud. 1987, 27, 221–234. [Google Scholar] [CrossRef]
  52. Nisbet, R.; Miner, G.; Yale, K. Chapter 9—Classification. In Handbook of Statistical Analysis and Data Mining Applications; Elsevier: Amsterdam, The Netherlands, 2018; Volume 9, pp. 169–186. [Google Scholar] [CrossRef]
  53. Richter, A.N.; Khoshgoftaar, T.M. A Review of Statistical and Machine Learning Methods for Modeling Cancer Risk Using Structured Clinical Data. Artif. Intell. Med. 2018, 90, 1–14. [Google Scholar] [CrossRef]
  54. Salcedo-Bernal, A.; Villamil-Giraldo, M.P.; Moreno-Barbosa, A.D. Clinical Data Analysis: An Opportunity to Compare Machine Learning Methods. In Procedia Computer Science; Elsevier: Amsterdam, The Netherlands, 2016; Volume 100, pp. 731–738. [Google Scholar]
  55. Altman, N.S. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. Am. Stat. 1992, 46, 175–185. [Google Scholar] [CrossRef]
  56. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  57. Pisner, D.A.; Schnyer, D.M. Support Vector Machine. In Machine Learning; Elsevier: Amsterdam, The Netherlands, 2020; pp. 101–121. [Google Scholar]
  58. Zhang, H. The Optimality of Naïve Bayes. In Proceedings of the FLAIRS2004 Conference, Miami Beach, FL, USA, 1 January 2004; Volume 1, p. 3. [Google Scholar]
  59. Hand, D.J.; Yu, K. Idiot’s Bayes: Not So Stupid after All? Int. Stat. Rev. Rev. Int. Stat. 2001, 69, 385. [Google Scholar] [CrossRef]
  60. Bradley, A.P. The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms. Pattern Recognit. 1997, 30, 1145–1159. [Google Scholar] [CrossRef]
  61. Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
  62. Chicco, D.; Jurman, G. The Advantages of the Matthews Correlation Coefficient (MCC) over F1 Score and Accuracy in Binary Classification Evaluation. BMC Genom. 2020, 21, 6. [Google Scholar] [CrossRef]
  63. Boughorbel, S.; Jarray, F.; El-Anbari, M. Optimal Classifier for Imbalanced Data Using Matthews Correlation Coefficient Metric. PLoS ONE 2017, 12, e0177678. [Google Scholar] [CrossRef]
  64. Welch, B.L. The generalization of ‘student’s’ problem when several different population variances are involved. Biometrika 1947, 34, 28–35. [Google Scholar] [CrossRef]
  65. Kruskal, W.H.; Wallis, W.A. Use of Ranks in One-Criterion Variance Analysis. J. Am. Stat. Assoc. 1952, 47, 583–621. [Google Scholar] [CrossRef]
Figure 1. The results of applying five machine learning models to the dataset for various anonymization parameter combinations (qi, k), where qi is the code of the quasi identifier set, and k is the minimum size of the equivalence class (Table 2). (a) The area under the curve (AUC) score results (first row) and the Matthews correlation coefficient (MCC) score results (second row) for the test set plotted against k for each qi value. (b) The AUC score results (first row) and the Matthews correlation coefficient (MCC) score results (second row) for the test set plotted against qi for each k value.
Figure 2. The prediction results of the tested machine learning models demonstrated with the AUC and MCC metrics, in juxtaposition with the GIL value. (a) The results of the logistic regression (LR), decision tree (DT), Gaussian naïve Bayes (GNB), k-nearest neighbors (KNN) and support vector machine (SVC) classifiers, plotted against k, for each qi value. (b) The results of the same models plotted against qi for each k value.
Figure 3. The linear regression model representing the relationship between the performance metrics area under curve (AUC), Matthews correlation coefficient (MCC) and the generalized information loss (GIL) metric for all tested machine learning models. (a) AUC as a function of GIL. (b) MCC as a function of GIL.
Figure 4. The AUC and MCC results of the five tested classifiers depicted through histograms (a,c) and violin plots (b,d).
Figure 5. The scatterplot depicting the correlation between GIL and MCC (bottom left) and the respective densities of the GIL (upper left) and MCC (bottom right) metric values.
Table 1. The attributes of the original dataset, the attributes after the one-hot decoding and their descriptions.
Original Dataset Attributes | Attributes after One-Hot Decoding | Attribute Type | Values | Attribute Description
AGE | AGE | Numerical | 0–114 | Patient age
SEX_F, SEX_M | SEX | Categorical | [Female, Male] | Patient sex
CURADM_DAYS | CURADM_DAYS | Numerical | 1–307 | Number of days during the current stay at the hospital
OUTCOME_H, OUTCOME_N, OUTCOME_I, OUTCOME_D | OUTCOME | Categorical | [Healing, No change, Improvement, Deterioration] | Hospitalization (care encounter) outcome
CURRICU_FLAG | CURRICU_FLAG | Categorical | [0, 1] | The patient had to be transferred to the ICU during the current hospitalization
PREVADM_NO | PREVADM_NO | Numerical | 0–170 | Number of previous admissions to the hospital
PREVADM_DAYS | PREVADM_DAYS | Numerical | 0–627 | Cumulative number of days of previous hospital admissions
PREVICU_DAYS | PREVICU_DAYS | Numerical | 0–315 | Cumulative days of ICU treatment during previous hospital admissions
READMISSION_30_DAYS | READMISSION_30_DAYS | Categorical | 0–1 | Readmission within 30 days or not
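The one-hot decoding described in Table 1 (e.g. collapsing SEX_F/SEX_M back into a single SEX attribute) can be sketched in pandas as follows; the column names match Table 1, while the decoding map is an assumption for illustration.

```python
# Sketch: collapsing one-hot encoded columns back to one categorical
# attribute, as in the SEX_F/SEX_M -> SEX decoding of Table 1.
import pandas as pd

df = pd.DataFrame({"SEX_F": [1, 0, 0], "SEX_M": [0, 1, 1]})
decode = {"SEX_F": "Female", "SEX_M": "Male"}  # assumed value mapping

# idxmax(axis=1) returns, per row, the column label holding the 1;
# map() then turns that label into the categorical value.
df["SEX"] = df[["SEX_F", "SEX_M"]].idxmax(axis=1).map(decode)
```

The same pattern applies to the four OUTCOME_* columns, with a four-entry decoding map.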
Table 2. The different dataset versions that were generated from the anonymization process, the corresponding anonymization parameter values and the obtained values of the information loss metrics.
Dataset VersionQI SetQI IDkGILDMCAVG
S0[]01000
S2.1[AGE, SEX]220.394346,873,092492.353
S1.2[AGE, SEX]230.394346,873,108331.017
S2.3[AGE, SEX]250.394346,873,168202.034
S2.4[AGE, SEX]2100.394346,873,396102.789
S2.5[AGE, SEX]2150.394346,875,01470.378
S2.6[AGE, SEX]2200.394346,878,57454.25
S2.7[AGE, SEX]2300.394346,884,56837.558
S3a.1[AGE, SEX, OUTCOME]3a20.352254,366,790209.25
S3a.2[AGE, SEX, OUTCOME]3a30.352254,366,928144.133
S3a.3[AGE, SEX, OUTCOME]3a50.352254,367,48291.547
S3a.4[AGE, SEX, OUTCOME]3a100.353254,371,33450.509
S3a.5[AGE, SEX, OUTCOME]3a150.353254,378,57036
S3a.6[AGE, SEX, OUTCOME]3a200.354254,353,26828.721
S3a.7[AGE, SEX, OUTCOME]3a300.355254,372,51020.134
S3b.1[AGE, SEX, CURADM_DAYS]3b20.21475,443,86221.275
S3b.2[AGE, SEX, CURADM_DAYS]3b30.21575,448,47816.502
S3b.3[AGE, SEX, CURADM_DAYS]3b50.21675,461,50211.83
S3b.4[AGE, SEX, CURADM_DAYS]3b100.21975,530,6067.73
S3b.5[AGE, SEX, CURADM_DAYS]3b150.21975,530,6067.73
S3b.6[AGE, SEX, CURADM_DAYS]3b200.22275,820,2945.208
S3b.7[AGE, SEX, CURADM_DAYS]3b300.22576,159,9064.099
S4.1[AGE, SEX, CURADM_DAYS, PREVADM_DAYS]420.0857,964,1103.789
S4.2[AGE, SEX, CURADM_DAYS, PREVADM_DAYS]430.0938,016,1963.434
S4.3[AGE, SEX, CURADM_DAYS, PREVADM_DAYS]450.1048,145,4523.012
S4.4[AGE, SEX, CURADM_DAYS, PREVADM_DAYS]4100.1168,563,0302.525
S4.5[AGE, SEX, CURADM_DAYS, PREVADM_DAYS]4150.1239,082,2862.304
S4.6[AGE, SEX, CURADM_DAYS, PREVADM_DAYS]4200.1269,621,1382.138
S4.7[AGE, SEX, CURADM_DAYS, PREVADM_DAYS]4300.13110,880,7701.953
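The DM and CAVG columns in Table 2 follow the standard definitions of the discernibility metric and the average equivalence class size metric for a k-anonymized table. A minimal sketch, assuming each record is reduced to its (generalized) quasi-identifier tuple and no records are suppressed; the function names and toy data are illustrative, not the paper's implementation:

```python
from collections import Counter

def discernibility_metric(qi_tuples):
    """DM: each record is 'charged' the size of its equivalence class,
    so DM = sum over classes E of |E|^2 (suppression penalty omitted)."""
    counts = Counter(qi_tuples)
    return sum(n * n for n in counts.values())

def avg_class_size_metric(qi_tuples, k):
    """CAVG: (total records / number of equivalence classes) / k."""
    counts = Counter(qi_tuples)
    total = sum(counts.values())
    return (total / len(counts)) / k

# Toy anonymized QI projection: (generalized AGE band, SEX)
rows = [("40-49", "M"), ("40-49", "M"), ("40-49", "F"), ("40-49", "F")]
print(discernibility_metric(rows))       # 2^2 + 2^2 = 8
print(avg_class_size_metric(rows, k=2))  # (4 / 2) / 2 = 1.0
```

Under these definitions, larger equivalence classes (higher k, coarser generalization) drive DM up, while CAVG close to 1 indicates classes barely larger than the minimum size k, matching the trends visible in the table.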
Table 3. The mean, median, standard deviation (SD) and 95% confidence interval of the AUC and MCC metric test set results of the experiments with the five tested machine learning models.

| Metric | Classifier | Mean | 95% CI Lower | 95% CI Upper | Median | SD |
|---|---|---|---|---|---|---|
| AUC | DecisionTreeClassifier | 0.906 | 0.902 | 0.910 | 0.909 | 0.0111 |
| AUC | GaussianNB | 0.583 | 0.568 | 0.597 | 0.565 | 0.0402 |
| AUC | KNeighborsClassifier | 0.768 | 0.765 | 0.772 | 0.771 | 0.0103 |
| AUC | LogisticRegression | 0.731 | 0.720 | 0.742 | 0.733 | 0.0311 |
| AUC | SVC | 0.670 | 0.651 | 0.689 | 0.695 | 0.0524 |
| MCC | DecisionTreeClassifier | 0.815 | 0.807 | 0.823 | 0.821 | 0.0217 |
| MCC | GaussianNB | 0.252 | 0.222 | 0.283 | 0.212 | 0.0838 |
| MCC | KNeighborsClassifier | 0.537 | 0.530 | 0.545 | 0.542 | 0.0208 |
| MCC | LogisticRegression | 0.465 | 0.443 | 0.488 | 0.467 | 0.0620 |
| MCC | SVC | 0.363 | 0.331 | 0.395 | 0.405 | 0.0886 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
