An Asymmetric Ensemble Method for Determining the Importance of Individual Factors of a Univariate Problem

Mišić, Jelena; Kemiveš, Aleksandar; Ranđelović, Milan; Ranđelović, Dragan

doi:10.3390/sym15112050


by Jelena Mišić ^1,2, Aleksandar Kemiveš ^3,4, Milan Ranđelović ^5 and Dragan Ranđelović ^2,*

^1 Faculty of Electronic Engineering, University of Niš, Aleksandra Medvedeva 14, 18000 Niš, Serbia

^2 Faculty of Diplomacy and Security, University Union-Nikola Tesla Belgrade, Travnička 2, 11000 Belgrade, Serbia

^3 Department for Postgraduate Studies, Singidunum University, Danijelova 32, 11000 Belgrade, Serbia

^4 PUC Infostan Technologies, City of Belgrade, Danijelova 33, 11000 Belgrade, Serbia

^5 Science Technology Park Niš, Aleksandra Medvedeva 2a, 18000 Niš, Serbia

^* Author to whom correspondence should be addressed.

Symmetry 2023, 15(11), 2050; https://doi.org/10.3390/sym15112050

Submission received: 25 September 2023 / Revised: 28 October 2023 / Accepted: 30 October 2023 / Published: 11 November 2023

(This article belongs to the Special Issue Symmetry in Optimization Theory, Algorithm and Applications)


Abstract:

This study proposes an innovative model that determines the importance of selected factors of a univariate problem. The proposed model has been developed based on the example of determining the impact of non-medical factors on the quality of inpatient treatment, but it is generally applicable to any process of binary classification. In addition, an ensemble stacking model that involves the asymmetric use of two different well-known algorithms is proposed to determine the importance of individual factors. This model is constructed so that the standard logistic regression is first applied as mandatory. Further, the classification algorithms are implemented if the defined conditions are met. Finally, feature selection algorithms, which belong to the optimization group of algorithms, are applied as a combinatorial algorithm. The proposed model is verified through a case study conducted using real data obtained from health institutions in the region connected to the city of Nis, Republic of Serbia. The obtained results show that the proposed model can achieve better results than each of the methods included in it and surpasses several state-of-the-art ensemble algorithms in the field of machine learning. The proposed solution has been implemented in the form of a modern mobile application.

Keywords:

binary classification algorithm; logistic regression; feature selection; ensemble method; factors for successful inpatient treatment

1. Introduction

The World Health Organization (WHO) has adopted a program, Health21, as a general health policy framework for the WHO European region in the 21st century [1,2]. To this end, since 2012, the Republic of Serbia has adopted this type of plan [3] as a strategic and operational document of the National Health Insurance Fund. The aforementioned program is the ground plan for the implementation of healthcare networking in the Republic of Serbia, and it has been determined based on the following factors (listed in the Law on Health Care of the Republic of Serbia [4]): plan development, health, population, number and age structure of the population, the number of existing institutions, capacity and distribution of health institutions, range of urbanization, and development and transport connectivity of individual areas for equal access to healthcare. This plan should be implemented in all levels of healthcare, starting from general practitioners (GPs) or GP surgeries through to specialized clinical centers. In addition to this ground plan, the Government of the Republic of Serbia adopted an action plan for the prevention, treatment, and control of cardiovascular diseases on the national level in the Republic of Serbia, valid until 2020 [5]. This was considered necessary since the share of cardiovascular diseases is dominant in Serbia [5], as well as in the EU [2]. Therefore, the crucial task is to select the most important factors that have an impact on the successful inpatient treatment of cardiovascular patients at each level of expertise, including those that belong to the group of so-called non-medical factors [6,7,8]. 
Various authors deal with the successful organization of treatment in different ways, beginning from considering the influence of factors of different natures in different types of diseases [9,10,11], sociological determinants [12], and the impact of political, economic, environmental, and other external influences [13], both non-medical and medical factors [14].

The considered topic, the influence of observed factors on hospital treatment, belongs to the group of so-called binary classification problems, which have two classes of outcomes: successful and unsuccessful. According to the related literature [15,16], this type of problem can be solved using the predictive methods of classical logistic regression, as well as the classification methodology based on supervised machine learning (ML). Although statistics is the foundation of ML, not all ML methods have been derived from statistics. The main difference between them is their purpose. Namely, ML algorithms are designed to make predictions that are as accurate as possible, whereas statistical models are designed for inference about the relationships between variables. Neither approach is preferable in general, as the selection of a particular method depends on the desired outcomes. For the concrete binary classification problem discussed in this manuscript, it remains an open question which of the two methodologies, regression or classification, is better to apply in a particular case [17]. Classical statistical methods rely on manual hypothesis testing, in which the user specifies the variables, functional form, and type of interaction; these choices may influence the resulting models and can make the methods impractical, and even invalid, when operating with large numbers of variables. Moreover, these methods involve various assumptions, such as assumptions on linearity and probability distribution. In contrast, data mining and ML are independent of the number of observed factors, data size, and probability distribution. Also, using ML, feature selection algorithms can be employed for optimization in classification tasks, which makes it possible to classify the outcome for new samples and, thus, make an accurate prediction.

In recent years, in both the machine and statistical learning fields, good results have been achieved by using ensemble methods that can leverage the good characteristics of both types of the aforementioned methods while mitigating their shortcomings and limitations [18,19]. Among the ensemble methods, the best results have been achieved through the application of stacking methods, which aggregate different algorithms to obtain better predictive models than could be obtained from any of the algorithms individually and are superior to most known methods [20,21]. For these reasons, and considering that the application of ML-based algorithms has become imperative in the 21st century, especially in the field of healthcare [22], this study proposes an asymmetric procedure that uses a stacking ensemble method to develop an appropriate optimization model for solving the considered problem. The proposed ensemble model applies logistic regression as an obligatory first step. If the conditions defined as necessary concerning its fit are fulfilled, the model then applies different methods from the classification group of supervised ML and reduces the dimensionality of the problem by using one of the feature selection algorithms as the combiner algorithm. The application of such methods in all areas of human activity, and especially in healthcare, is gaining importance due to numerous advantages, including accurate forecasting, a constant model, precise results, and error reduction.

The main motivation for this study lies in the fact that the most widely used logistic regression methods for solving the prediction and binary classification problem face a quality problem in the case of poor fit of the model result and actual data, as well as in the fact that previously proposed solutions to this problem require using other methodologies to achieve a good prediction result. Among these methodologies, the most common are feature selection and ensemble algorithms. To this end, the proposed model is a stacking ensemble that uses logistic regression and feature selection algorithms.

It is known that each prediction depends not only on the selected methodology but also on the selected dataset and types of input variables (categorical, numerical, or both), the prevalence of some classes, and the software used [23]. For this reason, the authors considered a generic stacking ensemble ML procedure to be expedient because it is applicable to different types of predictors and class prevalence and, thus, could be a possible solution to various problems. Using this approach, the authors design a model that uses both of the mentioned methodologies to exploit their advantages and mitigate their limitations. In the evaluation of the proposed method using a concrete case study, 10-fold cross-validation is used, as well as software from known manufacturers.

The authors put forward a basic logical hypothesis that, for each process that depends on several factors, there must be a difference between the relative impacts of factors on the outcome variable. This indisputable fact must be considered when determining the relative importance of individual factors for the successful treatment of patients. For the problem in this study, it is expected that, among the various non-medical factors for successful inpatient treatment that can often be found in the literature [12,13], the most important should be the level of expertise of an institution and the number of days of the applied treatment, which are included in the conducted case study. In addition, the fundamental hypothesis of this research is that several algorithms of different types can be aggregated to construct an ensemble procedure that performs better than each of the included algorithms individually and better than well-known ensembles (e.g., random forest, AdaBoost, and XGBoost), which are state-of-the-art techniques in the ML field.

The main contributions of this research can be summarized as follows:

○: An innovative generic optimization procedure with very good values of classification quality measures that can be used to solve both classic prediction problems and in discriminative classification, which essentially determine the importance of individual factors in a multivariate problem in the general case, is proposed.
○: The proposed algorithm belongs to the class of generic algorithms, which practically allows its application to a wide range of problems. In general, generic modeling could represent the development of the concept of a model library.
○: A modern multi-agent application for solving a specific problem is developed by assessing the influence of certain factors on the success of hospital treatment. The developed application is available to the public for use and further development. Also, this application can be used to solve other similar problems in the field of healthcare but also in other fields of human activity.

The rest of this paper is organized as follows. Section 1 presents the introductory considerations. Section 2 gives the background review, including the state-of-the-art methodologies used to solve the mentioned problem, namely logistic regression and classification and feature selection. Section 3 describes the materials and data used for training and testing in the case study and introduces the proposed ensemble algorithm. Section 4 presents the case study and discusses the obtained results. Section 5 presents technical solutions and the practical implementation of the proposed method. Finally, Section 6 concludes this study.

2. Background Review

This section provides a literature review of the state-of-the-art methods in the field of determining the effects of selected factors on inpatient treatment success. The authors provide a review of recent studies on the problem of binary classification, which is the research subject of this study. In the literature [24], descriptive statistics, regression, data mining and machine learning, and ensemble models of the newest multi-objective strategies can be used for solving this problem. A summary of the review is presented in Table 1 after a short description.

This study uses state-of-the-art methods in the field of application of ML classification algorithms to solve the considered problem. Moreover, two common subgroups of ML-based methods are combined to develop an ensemble model, namely classification algorithms and feature selection algorithms.

The main aim of this study is to propose an innovative generic ensemble ML methodology for determining the importance of selected non-medical factors for inpatient treatment. Since the inclusion of other factors, regardless of their number and nature, does not change the validity of the proposed procedure, the authors list all non-medical factors that could be found in the literature and study their different combinations for different types of patient treatment, focusing on the state-of-the-art methods in the field of binary classification.

2.1. Literature Review of Different Methodologies That Deal with Patient Treatment

In the related literature, different applications of regression analysis have been reported, such as the application of linear and logistic regression to the determination of factors that influence treatments of diseases and conditions to improve patient care and clinical practice [25] and the analysis of nine probable risk factors for coronary heart disease using a multiple logistic model [26]. The application of various data mining techniques has also been presented in the literature, for instance, the analysis of different factors that can affect costs, revenues, and operational efficiency of patient care [27], determination of the factor that enables the assessment of the effectiveness of treatment [28], and finding the factors that reduce the cost of providing healthcare [29]. The ML-based methods can also be found in uses like determining the factors that affect the success of treatment in various areas of healthcare, such as cancer, epileptic seizures, diabetic retinopathy, gastrointestinal disease, and brain strokes [30], and usage of the increasing amount of health data provided by the Internet of Things about the factors that can improve patient outcomes [31]. It is also possible to find references that deal with the estimation of the successfulness of the treatment of heart diseases [32,33] as well as with other types of diseases [9,10,11], and the impact of social, political, and economic factors [12,13,14,34]. These methodologies could be found in health information exchange-based risk surveillance systems of patients, for instance, in the case of the state Maine [35], and in quality control of the application of complex mixtures of treatment, including herbal medicines, as presented in [36]. Using ML in the estimation of the successfulness of inpatient treatment was the research topic in [37], as well as in one very extensive collection of articles in a Special Issue of the journal Algorithms [38]. 
Particularly interesting is a review in [39] that analyzed the application of ensemble methods in classification and applications of recent regression-developed methods. To the best of the authors’ knowledge, there have been no studies on determining the influence on inpatient treatment at the medical institution level and the length of patients’ treatment with several other non-medical parameters, including the education level of a patient, location, place of residence of a patient, and a patient’s gender and age, affecting the treatment quality of patients with cardiovascular disease and defining the treatment outcome. There are several taxonomies of measures [40] used to assess the treatment quality of healthcare institutions in an organizational sense. The most commonly used model is the Donabedian model [41]. In practice, it is important to know that the patient treatment quality has several different factors, and the taxonomy of medical, social, and economic factors could be found in the related literature. One critical, systematic review of the existing literature on the application of classification modeling methods related to the general medical application was conducted by Khan et al. in [42]. The application of classification modeling methods related to the prediction of the length of hospital stay was considered by Zikos et al. [43], as well as in Zikos’s doctoral thesis [44]. Samaneh Sheikh-Nia solved the same problem using standard and ensemble-based classification techniques. As mentioned in Section 2.1, many studies on the application of classification methods in determining the influence of different types of factors, from medical and social to economic, including both classification and prediction, could be found in the literature [45]. 
These methods were used in the diagnosis and prediction of the development of various diseases, such as breast cancer [46], HIV [47], and COVID-19 [48], but the authors have not found any application similar to the proposed generic methodology.

At the end of this literature review, we must refer to the corresponding literature about new soft computing strategies. One research tendency is solving multi-objective optimization problems (MOAs) using multi-objective evolutionary algorithms (MOEAs); another, connected with deep learning, is the use of convolutional neural networks (CNNs) [49]. Both are able to optimize and simplify the problem of binary classification, which is the research subject of this paper.

For instance, in [49], an intelligent system based on the composition of the two CNNs for the automatic extraction and identification of brain tumors from 2D CE MRI images was designed. Reference [50] proposed an improved two-archive many-objective ABC algorithm, while reference [51] considered an innovative game utility function to balance convergence and diversity and thus promote the genetic selection of parents for inheritance so that the population can rapidly approach the true Pareto front in one MOEA algorithm. A discrete Jaya MOEA algorithm to address the flexible job shop scheduling problem considering the minimization of makespan, total workload of machines, and workload of critical machines as performance measures for solving MOAs is also given in [51]. The MOEA algorithms have been applied to the prediction of treatment of cancer to minimize the objectives of cancerous cell density and the approved drug amount to optimize the medical remedy of a tumor; this type of solution was proposed in [53]. In [54], a multi-objective model based on the genetic algorithm (GA) was applied to evaluate site suitability for new clinics in some urban areas of Tehran. Reference [55] studied the admission process of patients on anti-COVID-19 treatment, considering two main criteria: the admission time and the readiness of the hospital accepting the patients. In [56], the authors considered a multi-objective integrated planning and scheduling model for operating rooms under uncertainty. In [57], a review of the applications of the genetic algorithm in the fields of disease screening, diagnosis, treatment planning, pharmacovigilance, prognosis, and healthcare management was provided.

In paper [58], we can find that the meta-heuristic methods, such as MOEAs, have usually been used as a search strategy in feature selection wrapper methods since they allow minimizing the cardinality of the attribute subset and simultaneously maximizing the predictive capacity of the model for regression and classification purposes. In solving high-dimensional problems, performing the wrapper-type feature selection commonly requires excessive time for computation and has a high computational cost. To address these limitations, a multi-surrogate methodology has been used to assist MOEAs for the feature selection purpose.

Because the ensemble algorithm proposed in this paper is based on filter feature selection and addresses a simpler univariate problem in binary classification, the authors decided to adopt a stacking ensemble methodology for solving the considered problem. The use of meta-heuristic strategies, including MOEAs, is left for future work. In addition, since CNN models have been designed for image data and could be the most efficient and flexible models for image classification problems in deep learning, they could also be considered in future work for application to such types of problems.

The methods used in the proposed solution have been selected on the basis that they have been reported as the best-performing algorithms in the field of binary classification. In future work, the authors could also consider using other methods and implementing them in the proposed strategy.

2.2. State of the Art

In addition to the known and widely applied conventional statistic regression methodology in prediction modeling, ML is the current trend. ML relies on statistical analysis and artificial intelligence to learn concepts, including models and rules, based on the induction of logical rules that can be understood by humans. This learning process involves dividing a dataset used for learning into a learning set and a test set, where the test set is used to verify the validity of the learned knowledge. Predictive accuracy is the primary measure of the correctness of the learned knowledge, representing the percentage of success in classifying new rules using the learned rules. The goal of prediction is to create a model that can draw conclusions about a unique aspect of a dependent variable based on a combination of independent variables. The selection of variables from the available dataset affects the precision and accuracy of the generated prediction models. Therefore, in the data preprocessing phase, various techniques are used to select relevant variables and assess their importance for the predictor’s output, as well as filter feature selection methods, which are employed in the proposed prediction model to reduce the number of input variables and, thus, reduce the cost and improve the prediction characteristics of the model. In classification problems, sensitivity quantifies the avoidance of false negatives, while specificity does the same for false positives. The compromise between these measures, which is otherwise difficult to achieve, is shown by the so-called receiver operating characteristic curve.

In this case study, logistic regression is selected from the conventional statistic group of methods and used as a basic method. The basic measures of goodness of fit of the proposed model with the considered data are generated using the Hosmer and Lemeshow test. In the case that this test returns unsatisfactory results, the relevant literature [59,60,61] suggests using the classification test that exists in supervised ML-based classification as a so-called function method. This method can be implemented in classification discrimination using classification and feature selection algorithms and, after that, evaluated with the most important classification measures, such as area under the curve (AUC) and accuracy.
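The goodness-of-fit check mentioned above can be illustrated with a minimal sketch of the Hosmer and Lemeshow statistic in its usual textbook form: cases are sorted by predicted probability, cut into groups, and the observed and expected counts of both outcomes are compared per group. The function name `hosmer_lemeshow_statistic`, the grouping rule, and the example data are assumptions introduced here for illustration; they are not taken from the case study or from the SPSS implementation.

```python
def hosmer_lemeshow_statistic(outcomes, probabilities, groups=10):
    """Hosmer-Lemeshow chi-square statistic for binary outcomes coded 0/1.

    Cases are sorted by predicted probability and cut into `groups`
    roughly equal bins; observed and expected counts are compared per bin.
    """
    # Sort on the probability only, so ties keep their original order.
    ranked = sorted(zip(probabilities, outcomes), key=lambda pair: pair[0])
    n = len(ranked)
    statistic = 0.0
    for g in range(groups):
        chunk = ranked[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        observed_pos = sum(y for _, y in chunk)
        expected_pos = sum(p for p, _ in chunk)
        expected_neg = size - expected_pos
        # Skip a term when its expected count is zero (degenerate bin).
        if expected_pos > 0:
            statistic += (observed_pos - expected_pos) ** 2 / expected_pos
        if expected_neg > 0:
            statistic += ((size - observed_pos) - expected_neg) ** 2 / expected_neg
    return statistic


# Hypothetical predictions: two risk groups of ten cases each.
probabilities = [0.2] * 10 + [0.8] * 10
well_fitted = [1, 1] + [0] * 8 + [1] * 8 + [0, 0]   # 2 and 8 observed events
poorly_fitted = [1] * 10 + [1] * 8 + [0, 0]         # 10 and 8 observed events

good = hosmer_lemeshow_statistic(well_fitted, probabilities, groups=2)
bad = hosmer_lemeshow_statistic(poorly_fitted, probabilities, groups=2)
```

A value near zero indicates that predicted and observed frequencies agree, while a large value signals poor fit. In practice, the statistic is usually computed with ten groups and compared with a chi-square distribution with (groups − 2) degrees of freedom; two groups are used here only to keep the arithmetic small.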

2.2.1. ML-Based Classification Method

Classification is a widely studied topic in ML-based systems, and it has been used to help domain experts identify knowledge from large datasets. Classification algorithms are predictive methods that use supervised ML. These methods group labeled instances into at least two classes (attributes) of objects and predict the value of a required categorical type of class (attribute) based on the values of the other predictive attributes. The classification algorithm analyzes the attribute values and discovers relationships between them to achieve accurate prediction results. Common classification algorithms include regression-based methods (e.g., linear regression, isotonic regression, and logistic regression), decision trees (e.g., J48, ID3, random forest, and C4.5), Bayesian classifiers (e.g., naive Bayes, Bayesian logistic regression, and Bayesian network), artificial neural networks (single-layer perceptron, multi-layer perceptron, and support vector machine), and classifiers based on association rules (e.g., PART, JRip, and M5Rules) [62]. The main goal of ML related to data is to select an appropriate classification algorithm for a specific application. In this study, a classifier that classifies results into two classes, positive and negative, is used. The possible prediction results are presented in the form of a confusion matrix, presented in Table 2.

Table 2 shows that the total sum of positive and negative cases is equal to the number of members in the set being classified, denoted by N, which can be calculated as N = TP + FN + FP + TN, where TP, FN, FP, and TN denote the numbers of true positives, false negatives, false positives, and true negatives, respectively. Common quality evaluation metrics for a two-class classifier, including accuracy, precision, recall, and F1 measure, are used in this study, and they are, respectively, calculated using Equations (1)–(4).

\mathrm{Accuracy} = \frac{TP + TN}{N}

(1)

\mathrm{Precision} = \frac{TP}{TP + FP}

(2)

\mathrm{Recall} = \frac{TP}{TP + FN}

(3)

\mathrm{F1\ measure} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(4)
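As a minimal illustration, Equations (1)–(4) can be computed directly from the four confusion-matrix counts. The function name `classification_metrics` and the example counts are assumptions introduced here for illustration and are not results from the case study.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 measure from confusion-matrix counts."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for 100 classified cases.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
```

With these counts, the accuracy is 85/100, the precision is 40/45, and the recall is 40/50, so the F1 measure falls between the precision and the recall.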

Also, the receiver operating characteristic (ROC) curve, which has been commonly used to evaluate the performance of classifiers in predicting outcomes, is used in this study. In the ROC diagram, the false positive rate is presented on the x-axis, and the true positive rate is given on the y-axis. It should be noted that certain points on the ROC curve have specific meanings [63,64]; for instance, a point (0, 1) represents a perfect prediction, a point (1, 1) indicates that a classifier labels everything as positive, and a point (1, 0) shows that a classifier labels everything incorrectly. The area under the ROC curve (AUC) is a measure of the diagnostic accuracy of a classifier model, and generally, the AUC values greater than 70% indicate a good classification performance.
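The construction of the ROC curve and the AUC described above can be sketched as follows: the decision threshold is swept from the highest score downward, each threshold yields one (false positive rate, true positive rate) point, and the area is accumulated with the trapezoidal rule. The function names `roc_points` and `auc` and the example scores are assumptions made for illustration only.

```python
def roc_points(labels, scores):
    """ROC points (FPR, TPR) obtained by lowering the decision threshold.

    Assumes binary labels coded 0/1 with at least one case of each class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Cases with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical classifier scores for three positive and three negative cases.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(labels, scores))
```

In this example, eight of the nine positive-negative pairs are ranked correctly, so the area equals 8/9, which is above the 70% level mentioned above.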

According to the previous work [65], classifiers such as naive Bayes or neural networks output a probability or score that represents the degree to which an instance belongs to a certain class, so a full ROC curve can be obtained by varying the decision threshold, whereas a discrete classifier outputs only a class label and therefore generates a single point in the ROC space. In practice, classification is an ML and data mining task that involves separating instances in a dataset into predetermined classes based on the input variable values [66].

To realize this task, the classification procedure involves several steps: selecting the classifiers that implement the classification algorithm, selecting a class attribute (output variable), splitting the dataset into training and test sets, training the classifier on the training set where the class attribute values are known, and testing the classifier on the test set where the class attribute values are hidden from the classifier. In the testing phase, the classifier assigns each test sample to one of the predetermined classes of the class attribute. If the classifier makes a high percentage of errors on the test dataset, it can be concluded that an inefficient and unstable model has been created. In such a case, it is necessary to improve the trained model by modifying the applied classification process. Previous research has shown that the most commonly used classifiers include Bayes networks, decision trees, neural networks, and others [67].
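The split, train, and test steps above can be sketched with k-fold cross-validation, which repeats the procedure so that every sample is used for testing exactly once. This is a minimal sketch under stated assumptions: the nearest-mean classifier is a hypothetical stand-in for any of the classifiers discussed in this section, and the function names and one-dimensional data are invented for illustration.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_nearest_mean(xs, ys):
    """Stand-in classifier: predict the class whose training mean is closest."""
    means = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(values) / len(values)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def cross_validate(xs, ys, k=10, seed=0):
    """Average accuracy over k folds; each fold is held out once for testing."""
    correct = 0
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_x = [xs[i] for i in range(len(xs)) if i not in held_out]
        train_y = [ys[i] for i in range(len(xs)) if i not in held_out]
        predict = train_nearest_mean(train_x, train_y)
        correct += sum(predict(xs[i]) == ys[i] for i in test_idx)
    return correct / len(xs)


# Hypothetical, well-separated one-dimensional data for two classes.
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3, 0.7, 1.05, 0.95, 1.15,
      5.0, 5.2, 4.8, 5.1, 4.9, 5.3, 4.7, 5.05, 4.95, 5.15]
ys = [0] * 10 + [1] * 10
accuracy = cross_validate(xs, ys, k=10)
```

Because the two classes do not overlap, every held-out sample is classified correctly here; on real data, a low cross-validated accuracy would indicate the kind of unreliable model described above.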

This study proposes a classification model that combines the aforementioned classification algorithms, so a brief description of each of them is given in the following.

Naive Bayes

The naive Bayes classifier, unlike Bayes networks [68], produces a prediction model based on strong independence assumptions between the attributes and provides a straightforward and easy-to-understand approach for displaying, using, and inducing probabilistic knowledge [69]. The main benefits of a naive Bayes model include its simplicity, efficiency, ease of interpretation, and suitability for small datasets.
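To make the independence assumption concrete, the following minimal sketch implements a naive Bayes classifier for categorical attributes: the class score is the log prior plus the sum of per-attribute log likelihoods, with Laplace smoothing. The function name `train_naive_bayes`, the smoothing parameter `alpha`, and the attribute values are hypothetical and are not the variables or data of the case study.

```python
from collections import Counter, defaultdict
import math


def train_naive_bayes(rows, labels, alpha=1.0):
    """Fit a categorical naive Bayes model and return a predict function."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    attribute_values = defaultdict(set)   # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            attribute_values[j].add(value)

    def predict(row):
        best_label, best_score = None, -math.inf
        for label, count in class_counts.items():
            # Log prior plus independent per-attribute log likelihoods.
            score = math.log(count / len(labels))
            for j, value in enumerate(row):
                numerator = value_counts[(j, label)][value] + alpha
                denominator = count + alpha * len(attribute_values[j])
                score += math.log(numerator / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return predict


# Hypothetical records: (institution level, length of stay) -> outcome.
rows = [("tertiary", "short"), ("tertiary", "short"), ("tertiary", "long"),
        ("primary", "long"), ("primary", "long"), ("primary", "short")]
labels = ["success", "success", "success", "failure", "failure", "failure"]

predict = train_naive_bayes(rows, labels)
first = predict(("tertiary", "short"))
second = predict(("primary", "long"))
```

Each attribute contributes its own factor to the class score, which is what keeps the model simple to train and to interpret on small datasets.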

Decision trees

Decision trees [70] divide data into nodes and leaves until the entire dataset is analyzed. The ID3 [71] and C4.5 [72] algorithms are the most commonly used decision tree algorithms. The advantages of decision tree classifiers include their simplicity, the ability to work with numerical and categorical variables, fast classification of new samples, and flexibility.

LogitBoost

The LogitBoost [73] algorithm has been widely applied in practice because it represents an ensemble boosting algorithm and can accurately measure values important for classification functions. It is based on the principle that finding multiple simple rules can be more efficient than finding a single complex and precise rule. This algorithm represents a general method for improving the accuracy of ML-based algorithms.

Logistic regression

Calibration is the process of adjusting the result of a classification algorithm’s posterior probabilities to match the true prior probability distribution of the target classes. Many authors suggest calibrating ML or statistical models to predict the probability that the outcome is one for every given data row [74,75]. Calibration is used to transform classifier scores into class membership probabilities in the classification process. Univariate calibration methods, such as logistic regression, transform classifier scores into class membership probabilities in the two-class case. Logistic regression [76] is a statistical technique that analyzes a dataset where one or more independent variables determine an outcome measured with a dichotomous variable that only contains data coded as one or zero. It requires neither a linear relationship between the dependent and independent variables nor independent variables to be normally distributed. It is based on the theoretical assumptions given in Equations (5)–(9).

Logistic regression methodology aims to identify the most suitable model that can describe the relationship between a dichotomous characteristic of interest (dependent variable or outcome variable) and a set of independent variables (predictor or explanatory variables). The logistic regression algorithm generates coefficients (with their standard errors and significance levels) that can be used to define a formula for predicting the logit transformation of the probability of the presence of the characteristic of interest, which is expressed as follows:

\operatorname{logit}(p) = b_{0} + b_{1} X_{1} + b_{2} X_{2} + \dots + b_{k} X_{k}

(5)

where p is the probability of the presence of the characteristic of interest; b_{0}, b_{1}, b_{2}, …, b_{k} are the coefficients of the regression equation; and X_{1}, X_{2}, …, X_{k} denote the independent variables.

The logit transformation is defined as the logged odds as follows:

\mathrm{odds} = \frac{p}{1 - p} = \frac{\text{probability of the characteristic's presence}}{\text{probability of the characteristic's absence}}

(6)

\operatorname{logit}(p) = \ln\left(\frac{p}{1 - p}\right)

(7)

Taking the exponential of both sides of Equations (5) and (7) yields the following:

\mathrm{odds} = \frac{p}{1 - p} = e^{b_{0}} \cdot e^{b_{1} X_{1}} \cdot e^{b_{2} X_{2}} \cdots e^{b_{k} X_{k}}

(8)

When a variable $X_{i}$ increases by one unit, whereas all other parameters remain constant, the odds will increase by a factor of $e^{b_{i}}$, which is calculated by the following:

\frac{e^{b_{i} (1 + X_{i})}}{e^{b_{i} X_{i}}} = e^{b_{i} (1 + X_{i}) - b_{i} X_{i}} = e^{b_{i} + b_{i} X_{i} - b_{i} X_{i}} = e^{b_{i}}

(9)

This factor $e^{b_{i}}$ represents the odds ratio (OR) for an independent variable $X_{i}$, and it defines a relative amount by which the odds of the outcome increase (OR greater than one) or decrease (OR less than one) when the value of the independent variable is increased by one unit.
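The odds-ratio interpretation can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-predictor model whose coefficients are chosen only for illustration (they are not values from this study). It inverts Equation (5) to obtain a probability and compares the odds before and after a one-unit increase in $X_{1}$.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1), as in Equation (7)."""
    return math.log(p / (1.0 - p))

def predicted_probability(intercept, coefs, x):
    """Invert Equation (5): p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds of the outcome, as in Equation (6)."""
    return p / (1.0 - p)

# Hypothetical coefficients, for illustration only.
b0, b = -1.0, [0.8, -0.3]
x_before = [0, 1]
x_after = [1, 1]  # X_1 increased by one unit, X_2 unchanged

p_before = predicted_probability(b0, b, x_before)
p_after = predicted_probability(b0, b, x_after)

# The ratio of the two odds equals exp(b_1), up to floating-point error.
odds_ratio = odds(p_after) / odds(p_before)
```

Comparing `odds_ratio` with `math.exp(b[0])` reproduces the result of Equation (9) for this toy model.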

Statistical programs, such as IBM SPSS v19 [77], offer various methods for performing logistic regression.

The authors have used the Enter method for their proposed model as the default method in the SPSS package.

2.2.2. ML-Based Feature Selection Techniques

Many classification methods are highly sensitive to data dimensionality and the ratio of instances to features. However, even less sensitive methods can benefit from data dimensionality reduction. Attribute ranking evaluates each attribute independently of others but does not consider dependencies between attributes. In contrast, subset selection searches for a set of attributes that together provide the best result. Feature selection methods can be realized using three groups of methods [78]:

Filtering methods, of which the most known are Infogain and Gainratio;
Wrapping methods, of which the most representative ones are BestFirst and LinearForwardSelection;
Embedding methods that include different types of decision tree algorithms, such as J48 and PART.

In the proposed model, a filter–ranker evaluation approach is adopted.

Filter–ranker methods

To reduce the number of attributes in the model and determine an optimal subset of attributes that provides the best possible predictive performance, this study adopts a filter–ranker evaluation approach. This approach ranks the attributes by importance, which helps to identify the most relevant attributes for a particular model, so that a smaller set of attributes with strong predictive characteristics can be selected. The Weka software [79] offers various algorithms and techniques for reducing the volume of information, including the filter–ranker evaluation approach used here. In ML, a large number of attributes can make it challenging to apply techniques such as regression or classification to the collected data. Therefore, feature selection, as a data modeling technique, is used in this study to address irrelevant and redundant attributes. It involves evaluating the attributes using various measures, such as ChiSquare, Relief, and GainRatio, to rank them in terms of relevance. The measures [80] used in the proposed model through appropriate classifiers are briefly described in the following.

GainRatio

Entropy is a measure of the disorder or uncertainty in a system, and it has often been used in information theory as a measure of the amount of information contained in a message or a dataset. In the context of decision trees and attribute selection, entropy is used as a measure of the impurity of a set of examples. The goal is to select the attribute that leads to the greatest reduction in entropy, which in turn leads to a more homogeneous subset of examples. The entropy of Y is calculated as follows:

H(Y) = - \sum_{y \in Y} p(y) \cdot \log_{2}(p(y))

(10)

Because entropy is used as a measure of impurity in a training set S, it is possible to create a measure that reflects the amount of additional information about an attribute provided by the class, which indicates the extent to which the entropy of the attribute decreases [81].

InfoGain is a measure that evaluates the worth of an attribute by calculating the amount of information obtained about the class when the attribute is known. It is defined as the difference between the entropy of the class before and after splitting on the attribute, which can be expressed by the following:

\mathrm{InfoGain}(\mathrm{Class}, \mathrm{Attribute}) = H(\mathrm{Class}) - H(\mathrm{Class} \mid \mathrm{Attribute})

(11)

where symbol H denotes the information entropy, which is calculated using Equation (10).

The GainRatio [82] represents a modified version of InfoGain, which is a non-symmetric measure designed to address the bias of InfoGain. The calculation formula of GainRatio [83] is obtained using Equations (10) and (11) as follows:

\mathrm{GainRatio} = \frac{\mathrm{InfoGain}}{H(\mathrm{Class})}

(12)

Equation (12) shows that when an attribute “Attribute” needs to be predicted, the InfoGain is normalized by dividing it by the entropy of “Class”, and vice versa. This normalization ensures that the GainRatio values always fall within the range [0, 1]. If the GainRatio is equal to one, the knowledge of “Class” completely predicts “Attribute”, and if the GainRatio is equal to zero, there is no relationship between “Attribute” and “Class”.
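The three measures can be computed directly from two columns of categorical values. The sketch below follows Equations (10)–(12) as written in this section, including the normalization by the entropy of the class; the toy columns are hypothetical. Some implementations, such as Weka's GainRatioAttributeEval, normalize by the entropy of the attribute instead, so their values may differ from this sketch.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits, Equation (10)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(target, given):
    """H(target | given): entropy of target inside each group of `given`, weighted by group size."""
    n = len(target)
    groups = {}
    for t, g in zip(target, given):
        groups.setdefault(g, []).append(t)
    return sum(len(ts) / n * entropy(ts) for ts in groups.values())

def info_gain(cls, attr):
    """Equation (11): reduction in class entropy once the attribute is known."""
    return entropy(cls) - conditional_entropy(cls, attr)

def gain_ratio(cls, attr):
    """Equation (12) as written in the text: InfoGain divided by H(Class)."""
    h = entropy(cls)
    return info_gain(cls, attr) / h if h > 0 else 0.0

# Hypothetical toy columns: one attribute copies the class, the other is unrelated to it.
cls = [1, 1, 0, 0]
informative = [1, 1, 0, 0]
uninformative = [1, 0, 1, 0]

ratio_informative = gain_ratio(cls, informative)
ratio_uninformative = gain_ratio(cls, uninformative)
```

The two toy attributes illustrate the limiting cases described above: an attribute that determines the class and an attribute that carries no information about it.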

ChiSquaredAttributeEval

ChiSquaredAttributeEval is a measure based on the chi-square test used to test the independence of two events for given data of two variables. For the observed value O and the expected value E, the chi-square measure [84] shows the deviation between these two values, and it is defined by the following:

\chi_{c}^{2} = \sum_{i} \frac{(O_{i} - E_{i})^{2}}{E_{i}}

(13)

where c is the number of degrees of freedom, $O_{i}$ is the observed value, and $E_{i}$ is the expected value. The number of degrees of freedom is the total number of observations minus the number of independent constraints imposed on the observations.
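As an illustration of Equation (13), the following sketch computes the chi-square statistic for a contingency table of an attribute against the class. The expected counts are derived from the row and column totals under the independence assumption, and the degrees of freedom use the usual (rows - 1)(columns - 1) rule for such tables. The counts are hypothetical.

```python
def chi_square(observed):
    """Chi-square statistic of Equation (13) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if the row and column variables were independent.
            e = row_totals[i] * col_totals[j] / total
            stat += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical counts: rows are attribute values, columns are the two outcome classes.
table = [[30, 10],
         [20, 40]]
stat, dof = chi_square(table)
```

A larger statistic indicates a stronger departure from independence, which is why it can serve as a ranking score for attributes.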

Relief

Relief is a measure used for attribute estimation [85,86,87], which estimates the attribute value by repeatedly sampling the instances and considering the value of the obtained attributes from the nearest instances of the same or different class. This measure assigns a weighted score to each attribute based on its ability to discriminate between classes and then selects the attributes whose weights exceed a user-defined threshold as matching attributes.
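The weighting idea can be sketched as follows. This is a minimal version of the basic Relief scheme for a two-class problem with 0/1-coded attributes, not the implementation used by Weka; the function name, the toy data, and the sample count are assumptions made for illustration.

```python
import random

def relief(X, y, n_samples, seed=0):
    """Basic Relief weights for 0/1-coded attributes and two classes.

    For each sampled instance, the nearest instance of the same class (hit)
    and of the other class (miss) are found. An attribute gains weight when
    it differs on the miss and loses weight when it differs on the hit.
    Each class needs at least two instances so that a hit always exists.
    """
    rng = random.Random(seed)
    n_attrs = len(X[0])
    weights = [0.0] * n_attrs

    def distance(a, b):
        return sum(abs(ai - bi) for ai, bi in zip(a, b))

    for _ in range(n_samples):
        i = rng.randrange(len(X))
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for a in range(n_attrs):
            diff_miss = abs(X[i][a] - X[miss][a])
            diff_hit = abs(X[i][a] - X[hit][a])
            weights[a] += (diff_miss - diff_hit) / n_samples
    return weights

# Hypothetical toy data: attribute 0 equals the class, attribute 1 is noise.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
weights = relief(X, y, n_samples=20)
```

Attributes would then be ranked by weight, and those above a user-defined threshold kept.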

3. Materials and Methods

As mentioned in Section 1, owing to the fast and significant development of advanced computer-based solutions for predicting the impact of different factors on inpatient treatment quality, and in particular on the mortality of cardiovascular patients caused by inadequate healthcare, this topic has been a hot research area in the information field since the beginning of the 21st century. The application of ML, and especially ensemble methods, to prediction, including the technical implementation of the obtained solutions in the form of mobile software tools, is one of the current trends in the field of data prediction. However, as mentioned in Section 2, there are still few studies on ensemble methods that combine ML-based methods, especially in the field of healthcare, and that deal with non-medical factors. Therefore, additional research on aggregated methods is needed, which is the main motivation of this study.

This study introduces an efficient ensemble stacking ML procedure for the prediction of the impact of selected non-medical factors on inpatient treatment quality. The proposed model is trained and tested through a case study that uses the data obtained from the Institute of Public Health in Nis, which were acquired in the region connected with the city of Nis, including the Toplica area, Republic of Serbia. The collected raw data were first classified into two classes, those with a positive outcome of a patient’s treatment and those with a negative outcome of a patient’s treatment. This was also performed with the normalized data.

For the SPSS v19 and Weka v3.6 data analysis carried out in the case study, the authors used a PC with an Intel i7-9700KF processor, 32 GB of RAM, and a 64-bit Windows 11 Pro operating system.

For the development of the proposed application described in Section 5, the authors used the development environment PyCharm community edition pc-223.8836.43 for Python 3.9 with libraries jupyterlab, python-weka-wrapper3, and python-javabridge.

3.1. Materials

The material used in this paper is the dataset for training and testing, which was generated in the case study that the authors conducted to solve the given problem and to check the main hypothesis of this paper.

Data Acquired during the period 2006 to 2009 by the Institute of Public Health in Nis

To evaluate the individual impacts of social and health factors on patient treatment quality, this study considers several parameters: education level (a high level forms one group, and all other levels of education form a second group), medical institution level (clinical centers have a higher level, and all other medical institutions have a lower level), place of residence (the high level is an urban environment, i.e., the city of Nis), gender (the high level is female), age of patients (older than 50 is denoted as a high level), and length of patient treatment (treatment longer than 15 days is denoted as a high level). These parameters affect the treatment quality of patients with cardiovascular disease and define the treatment outcome, which can be positive or negative.

The case study was conducted using data acquired during the period from 2006 to 2009 by the Institute of Public Health in Nis and dispensary medical institutions, including the Clinical Center of Nis, Institute for Prevention and Rehabilitation of Niska Banja, Military Hospital Nis, and Special Hospital of Soko Banja, as well as districts (Ozren and Toplica) that included Medical Center of Prokuplje and the Health Center of Kursumlija.

Data analysis was performed using an innovative ensemble ML-based generic procedure that combines two techniques, namely conventional logistic regression analysis and classification with commonly used classification algorithms, together with feature selection algorithms.

The selected feature selection algorithms are based on the filter group ranked model and select a ranked sub-set of attributes according to the prediction accuracy estimation given by the selected classifier.

All of these data are shown in Table 3.

In Table 3, the meaning of the listed factors is as follows:

Education has the value of one for a high education level of patients;
HospitalType has a value of one for treatment at the Clinical Center—Nis and a value of zero for all other hospitals;
Gender has a value of one for female patients and a value of zero for male patients;
Age is the patients’ age in years, where older than 50 is denoted as a high level;
DaysofTreatment is the number of days of a patient’s hospital stay, where longer than 15 days is denoted as a high level;
UrbanHousing has the value 1 for patients living in the city;
Outcome has a value of “true” for a positive outcome of a patient’s treatment, but a value of “false” for a negative outcome of a patient’s treatment.
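The coding of the factors listed above can be sketched as a small mapping function. The raw field names and string values (`education`, `hospital`, `residence`, and so on) are hypothetical, since the layout of the raw records is not given here; only the thresholds (age over 50, stay over 15 days) and the meaning of the value one come from the list above.

```python
def encode_patient(record):
    """Map one raw patient record (hypothetical field names) to the binary factors of Table 3."""
    return {
        "Education": int(record["education"] == "high"),
        "HospitalType": int(record["hospital"] == "Clinical Center Nis"),
        "Gender": int(record["gender"] == "female"),
        "Age": int(record["age_years"] > 50),               # older than 50 is the high level
        "DaysofTreatment": int(record["days_in_hospital"] > 15),  # longer than 15 days is the high level
        "UrbanHousing": int(record["residence"] == "city"),
        "Outcome": bool(record["positive_outcome"]),
    }

# Hypothetical record, for illustration only.
example = {
    "education": "secondary",
    "hospital": "Clinical Center Nis",
    "gender": "female",
    "age_years": 63,
    "days_in_hospital": 9,
    "residence": "city",
    "positive_outcome": True,
}
encoded = encode_patient(example)
```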

3.2. Methods

As mentioned in the introduction of Section 3, the application of ensemble ML methods to prediction problems in different fields, including their technical implementation as software applications, is a current trend, although logistic regression is still the dominant methodology for data prediction and classification and, consequently, for the problem considered in this paper. When logistic regression is used for prediction and classification, a poor fit between the model and the data can often occur. This is usually assessed using the Hosmer–Lemeshow test: if its significance value is less than 0.05, the quality of the prediction is in question. An important question in that case is whether the prediction quality can be improved with the help of other methodologies. According to the literature [59,61,87,88], the following methodologies are useful for solving such a problem and improving the accuracy of a regression model:

Handling null/missing values;
Data visualization;
Feature selection and scaling;
Use of ensemble and boosting algorithms;
Hyper-parameter tuning.

Also, the authors Hosmer and others in articles [59,60], Harrell in [87], and Steyerberg and others in [88] remarked that the Hosmer–Lemeshow test is obsolete because it requires arbitrary binning of predicted probabilities, does not detect a lack of calibration and does not fully penalize the extreme overfitting of the model. They claimed that better methods are available, such as the methods proposed in [59]. More importantly, this kind of assessment just addresses overall model calibration, i.e., agreement between predicted and observed parameters, and does not address lack of fit because of improper transforming of a predictor. For that matter, the previously mentioned AUC measure could be used to compare two models with the purpose of finding one that is more flexible than the others being tested. Practically, in this way, the stated problem is translated into the problem of predictive discrimination, which is binary classification, for which the AUC measure for ROC in the proposed ensemble algorithm could be much more appropriate.

The selection of the algorithm for the stacking model of ML is, generally speaking, conditioned by the following factors:

The type of problem we are solving;
The characteristics of the set of attributes (features);
The volume of data available.

In our case study, the prediction, i.e., a binary classification problem, has been applied on the dataset that has 26,581 instances, and the majority of included factors are categorical variables.

Because of the above-mentioned facts and notes mentioned in the introduction about the predomination of regression and classification methods with all their advantages and disadvantages [15,16] in solving binary classification problems, the authors of this paper chose stacking ensemble as the ML methodology for solving the considered problem given in the presented case study and for this task. We also decided to use logistic regression and classification from one side and the best one from the best-known groups of naive Bayes, decision tree, and logit boost (boosting) from the other side, as well as adding one from the filter group of feature selection algorithms—gain ratio, chi-square, and relief classifier—as the combiner algorithm, i.e., a final estimator that enables dimension reduction in such a way as to decrease noise and increase accuracy in solving the stated problem. The proposed algorithm belongs to the generic algorithms’ family [89,90,91,92], which practically allow the reuse of a wide range of different problems with relatively minor reorganization, and, in general, the generic modeling could represent a development of the concept of a model library.

3.2.1. Ensemble Prediction Methods

Ensemble methods used in ML [93] are based on the idea that a combination of algorithms of different types can achieve better results than each of the included algorithms individually. The simplest form of such a method is an ensemble of an odd number of independent models whose results are compared, with the final decision made by simple majority. More complex ensembles aggregate the models in other ways, and some of them train the models on different subsets of the considered dataset, including stochastically selected ones. Different taxonomies distinguish several types of ensemble methods, of which the most familiar are:

Bootstrap aggregating (bagging);
Boosting;
Stacking.

In summary, the following can be found in [93]:

There are three main types of ensemble learning methods: bagging, boosting, and stacking. Ensemble learning combines multiple ML models into a single model, with the aim of increasing the performance of the model. Bagging aims to decrease variance, boosting aims to decrease bias, and stacking aims to improve prediction accuracy.
The prediction of an ensemble method usually requires more computation than the prediction of a single model. Using an ensemble methodology can thus be seen as a way to compensate for poor learning algorithms by performing a lot of extra computation; the alternative is to perform additional learning in one non-ensemble system. For the same increase in computation, storage, or communication resources, an ensemble system that uses two or more methods may improve overall accuracy more efficiently than a single method would. It has to be underlined that many problems do not have real-time constraints, as is true of the case study examined in this paper.

Stacking

Stacking involves training a model that makes predictions using a combination of several ML algorithms. All of the included algorithms are first trained using the available data as base estimators. A combiner algorithm is then trained to make the final prediction, taking the predictions of the base estimators as additional inputs or using cross-validated predictions from the base estimators to prevent overfitting [94]. The logistic regression model is often used as the combiner algorithm in practice. Stacking typically yields better performance than any single one of the trained algorithms [95]. It can be used successfully on both supervised [96] (which is the case in this article) and unsupervised [20] learning tasks.
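The mechanics of stacking with cross-validated base predictions can be sketched in plain Python. The learners below are deliberately trivial lookup tables that stand in for real algorithms; they are not the logistic regression, classification, or feature selection algorithms of the proposed model, and all function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def fit_lookup(key):
    """Build a fit function for a toy learner: majority class per value of key(x)."""
    def fit(X, y):
        votes = defaultdict(Counter)
        for x, label in zip(X, y):
            votes[key(x)][label] += 1
        default = Counter(y).most_common(1)[0][0]
        table = {k: c.most_common(1)[0][0] for k, c in votes.items()}
        return lambda x: table.get(key(x), default)
    return fit

def out_of_fold_predictions(fit, X, y, k):
    """Predict each instance with a model trained on the other folds only."""
    n = len(X)
    fold_size = n // k  # assumes n is divisible by k, for brevity
    preds = [None] * n
    for fold in range(k):
        test_idx = range(fold * fold_size, (fold + 1) * fold_size)
        model = fit([X[i] for i in range(n) if i not in test_idx],
                    [y[i] for i in range(n) if i not in test_idx])
        for i in test_idx:
            preds[i] = model(X[i])
    return preds

def stack(base_fits, fit_combiner, X, y, k):
    """Train the combiner on cross-validated base predictions, then refit the bases on all data."""
    columns = [out_of_fold_predictions(fit, X, y, k) for fit in base_fits]
    meta_X = list(zip(*columns))  # one tuple of base predictions per instance
    combiner = fit_combiner(meta_X, y)
    base_models = [fit(X, y) for fit in base_fits]
    return lambda x: combiner(tuple(m(x) for m in base_models))

# Hypothetical toy data: the class equals feature 0; feature 1 is only partly informative.
X = [[0, 0], [0, 0], [0, 1], [1, 1], [1, 1], [1, 0]] * 2
y = [x[0] for x in X]
model = stack(
    base_fits=[fit_lookup(lambda x: x[0]), fit_lookup(lambda x: x[1])],
    fit_combiner=fit_lookup(lambda meta: meta),
    X=X, y=y, k=2,
)
predictions = [model(x) for x in X]
```

Training the combiner on out-of-fold predictions, rather than on predictions for data the base learners have already seen, is what limits overfitting in this scheme.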

3.2.2. Ensemble Prediction Method of Selected Factor Effect on Inpatient Treatment Quality

In Section 3.2, the authors explain and discuss the impact of a poor fit of a model to its data and the impact of possible data prevalence on the choice of regression or classification as the dominant method in solving binary classification problems. Therefore, the authors decided to use an ML stacking method that incorporates both of them into the proposed model. The proposed model, which is described in this Section 3.2.2, uses stacking (sometimes called stacked generalization), in which a model is trained to combine the predictions of several other learning algorithms using a combiner algorithm. The proposed stacking ensemble method includes two types of ML algorithms in asymmetric form: an obligatory logistic regression algorithm at the beginning and a classification algorithm that is used if and when the stated conditions are fulfilled. Finally, as the combiner algorithm, the classification process uses several feature selection algorithms for ranked classification, which enables an optimization of the whole procedure through dimensionality reduction of the considered problem.

The authors began this procedure with successive applications of the logistic regression and classification on the starting set of data to determine their suitability for the application and regression. They are controlled in the regression using the overall percentage in classification table (OPCT) and Hosmer and Lemeshow test of goodness of fit of the model with the data using its indicator of significance (HLSig) with set conditions ((OPCT) > 0.5 and (HLSig > 0.05)). The set condition for classification is AUC (AUC > 0.6), which means the minimum satisfaction of classification performances and evaluates the accuracy of the basic prediction formula with a defined number of significant factors. After that, the authors proposed an enhancement of the regression model by including one stacking ensemble ML in the procedure, whereas the second member of the ensemble, the best of some three classification algorithms suitable for the considered problem, is included, and at the end, the combiner algorithm is included as the one of three selected filter algorithms of feature selection that gives the best AUC value. This proposal is in agreement with the previously mentioned new conclusions in [80], which we discuss in the introduction of Section 3.2 (methods where the AUC is the measure that is preferred over accuracy as it is a much better indicator of model performance). At the end of this procedure, the authors included the logistic regression and classification for fine calibration according to the value of classification accuracy measure AUC, as well as the parameters OPCT and HLSig for regression, determining a potentially smaller number of significant factors than were present initially and classification with a better value of the most important AUC measure. These factors are those that should be included in the prediction formula. 
In this way, the authors have constructed one optimized generic procedure, which is given with the algorithm presented in Figure 1, described in Algorithm 1, as well as with the procedure shown in Figure 2.

Algorithm 1: Determining the importance of predictors for successful inpatient treatment.

* Input data for each instance with n factors and preprocess the data.
NEXT
** Perform regression and determine $l \leq n$ non-colinear input factors;
Determine classification algorithms with the highest value of AUC from three different types;
Check regression goodness $OPCT > 0.5$ and $HLSig > 0.05$
IF NO No valid prediction GOTO END
ELSE
Check classification goodness $AUC \geq 0.6$
IF NO Prediction with l factors GOTO END
ELSE
NEXT
*** Apply feature selection procedure using 3 different types of filter algorithms;
Create table with the values of AUC calculated with the classification algorithm from step 2;
For each filter algorithm given in the rows, calculate the AUC for 1 to l factors given in the columns;
Determine the maximum value of AUC, notated AUC1, and the corresponding $k \leq l$ factors.
**** Check classification goodness $AUC1 \geq AUC$
IF NO Prediction with l factors GOTO END
ELSE
Check regression goodness $OPCT > 0.5$ and $HLSig > 0.05$
IF NO Prediction with l factors GOTO END
ELSE
Prediction with k factors
END

* Step 1. Input data in the form of a table with n independent non-medical factors and one dependent variable, which represents the outcome of cardiovascular patients and is true in the case of successful treatment and false otherwise. Clean and normalize the data.

** Step 2. Perform logistic regression. The Enter method is used to create a model with n predictors and the dependent variable, which is the treatment outcome. In this method, all predictors are included in the model unless there is a problem of collinearity, in which case some predictors may be excluded, leaving $l \leq n$ factors. The classification table is used to calculate the OPCT, which should be greater than 0.5, and the Hosmer and Lemeshow test is used to assess the goodness of fit of the model, with the condition that the HLSig indicator should be greater than 0.05, indicating a good calibration of the model to the given data. If these two conditions are not both satisfied, the procedure foresees preprocessing to try to fix this deficiency with some of the following procedures:

Identifying and handling the missing values.
Encoding the categorical data.
Splitting the dataset.

If the preprocessing is unsuccessful, the procedure ends without a valid prediction. Otherwise, in this step of the algorithm, we also perform classification with three selected algorithms of different types, choose the best one, and evaluate whether its AUC measure is greater than or equal to 0.6. If the condition for AUC is not fulfilled, the output of the model is determined with l factors, because it was already established in the previous IF block that, on this path of the proposed algorithm, both regression measures meet the set conditions, i.e., the HLSig indicator is greater than 0.05 and OPCT is greater than 0.5.

Evaluating these conditions together, as explained and shown in Figure 1, is crucial for assessing the performance of the proposed ensemble model, because the subsequent steps depend on whether the required thresholds are crossed.

The algorithm proceeds to the next step with a value of AUC greater than or equal to 0.6, which means that it is possible to make a good prediction for the given dataset.

*** In step 3 of this algorithm, we used three selected algorithms of different types from the group of feature selection filter methods with the basic aim to use only the necessary k <= l factors in this classification ensemble algorithm to achieve its optimal features. This is done so that, with the best of the three classification algorithms from step 2, the value for each of the three selected filter algorithms is determined by excluding one factor at a time, starting from the lowest in rank. Using the one selected algorithm of feature selection that gives the best AUC1 value, we determine these selected k factors.
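The search described in step 3 can be sketched as a loop over filter rankings and subset sizes. In this sketch, `rankings` and `auc_of` are hypothetical stand-ins: in practice the rankings would come from the three filter algorithms and the AUC from the classifier chosen in step 2. The factor names and the synthetic scoring function are placeholders and do not represent results of this study.

```python
def select_factors(rankings, auc_of):
    """Return (best AUC, filter name, factor subset) over all filters and subset sizes.

    rankings: dict mapping a filter name to its list of factors, best first.
    auc_of:   callable giving the AUC of the chosen classifier for a factor subset.
    """
    best = None
    for filter_name, ranked in rankings.items():
        # Drop one factor at a time, starting from the lowest-ranked one.
        for size in range(len(ranked), 0, -1):
            subset = ranked[:size]
            auc = auc_of(subset)
            if best is None or auc > best[0]:
                best = (auc, filter_name, subset)
    return best

# Placeholder rankings and a synthetic AUC function, for illustration only.
useful = {"f1", "f2", "f3"}
rankings = {
    "GainRatio": ["f1", "f2", "f3", "f4", "f5", "f6"],
    "ChiSquare": ["f1", "f4", "f2", "f3", "f5", "f6"],
    "Relief": ["f4", "f1", "f2", "f3", "f5", "f6"],
}

def auc_of(subset):
    chosen = set(subset)
    return 0.6 + 0.05 * len(chosen & useful) - 0.01 * len(chosen - useful)

auc1, best_filter, k_factors = select_factors(rankings, auc_of)
```

The returned AUC plays the role of AUC1, and the length of the returned subset is k.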

**** In Step 4 of this algorithm, which represents a definitive decision block, we repeat classification with the selected best algorithm from the step 2 of this algorithm and evaluate whether the new AUC1 measure is equal to or greater than the AUC. In the case that the condition is not met, the prediction formula includes all l determined significant factors calculated through logistic regression in step 2. In the opposite case, if repeated logistic regression fulfilled the already known conditions OPCT > 0.5 and HLSig > 0.05, the output is a prediction formula with k factors.
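The decision logic of steps 2 and 4 can be summarized in one function. This is a sketch under the thresholds stated in the step descriptions (OPCT > 0.5, HLSig > 0.05, AUC of at least 0.6, and AUC1 of at least AUC); the function and parameter names are hypothetical, and the measures themselves are assumed to be computed elsewhere.

```python
def choose_prediction(opct, hl_sig, auc, auc1, opct_k, hl_sig_k, l, k):
    """Return the number of factors in the final prediction formula, or None if no valid prediction.

    opct, hl_sig:     regression measures for the model with l factors.
    auc:              AUC of the best classifier with l factors.
    auc1:             best AUC after feature selection with k factors.
    opct_k, hl_sig_k: regression measures for the repeated regression with k factors.
    """
    # Step 2: the regression must classify and calibrate acceptably.
    if not (opct > 0.5 and hl_sig > 0.05):
        return None
    # Step 2: the best classifier must reach the minimum AUC.
    if auc < 0.6:
        return l
    # Step 4: the reduced factor set must not lower the AUC.
    if auc1 < auc:
        return l
    # Step 4: the repeated regression on k factors must still pass both checks.
    if opct_k > 0.5 and hl_sig_k > 0.05:
        return k
    return l

# Illustrative call; the AUC values and the k-factor measures are placeholders.
n_factors = choose_prediction(opct=0.853, hl_sig=0.088, auc=0.70, auc1=0.72,
                              opct_k=0.85, hl_sig_k=0.10, l=6, k=4)
```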

4. Results and Discussion

To assess the impact of selected non-medical factors on successful inpatient care of patients with cardiovascular disease, the authors considered the following indicators: education and place of housing of patients, level of the medical institution (clinical centers of higher levels and other medical institutions of lower levels), gender and age of patients, and number of days of treatment. The case study is based on the data acquired during the period from 2006 to 2009 from the Institute of Public Health in Nis, Republic of Serbia, and dispensary medical institutions in health jurisdictions connected with the city of Nis, such as the Clinical Center of Nis, Institute for Prevention and Rehabilitation of Niska Banja, Military Hospital Nis, and Special Hospital of Soko Banja, and from the Toplica district, such as the Medical Center of Prokuplje and the Health Center of Kursumlija. The data were divided into a training set, covering 2006 to 2007 with 11,833 instances, and a testing set, covering 2006 to 2009 with 26,581 instances.

Data analysis was performed using two methodologies organized in one ensemble ML model, as already described in Section 3.2.2. To clarify the steps of the proposed procedure and provide a better understanding of the obtained results, this section is divided into five subsections, Section 4.1, Section 4.2, Section 4.3, Section 4.4 and Section 4.5. It provides a concise description of the experimental results and their interpretation, as well as a discussion of these results.

4.1. Input Data for Considered Case Study

The input data, in the form of an Excel table (xlsx and csv formats) with n factors, as already described in Section 3.1, Materials, were cleaned and normalized.

4.2. Using Logistic Regression Analysis and Classification Algorithms

4.2.1. Using Logistic Regression

Table 4 shows odds ratio (OR) values and their 95% confidence interval (CI) for assessing the impact of the examined factors on the positive outcome of treatment of cardiovascular diseases in inpatient healthcare institutions in the Nis and Toplica regions during the period from 2006 to 2007 and the results of logistic regression analysis.

Notations in Table 4 and in other tables of this Section produced in SPSS v19 are translated as stated below in the process of normalization:

UrbanHousing (1) = urban place of housing of patients;
Education (1) = high level of education of patients;
HospitalType (1) = treatment at the Clinical Center—Nis;
Gender (1) = female gender;
Age = patient age (years > 50, old patients (1));
DaysOfTreatment = length of hospital stay (days > 15(1));
Outcome (1/0) = positive/negative outcome of treatment.

Also, the meaning of the abbreviations in the columns are:

B—Denotes the unstandardized regression weight.

S.E.—Measures how much the unstandardized regression weight can vary by. It is similar to a standard deviation of a mean.

Wald—Denotes the test statistic for the individual predictor variable. Just as multiple linear regression uses a t test, logistic regression uses a χ² test, and this statistic determines the Sig. value.

df—This is the number of degrees of freedom for the model. There is one degree of freedom for each predictor in the model.

Sig.—Determines significant variables. p value below 0.050 is significant.

Exp(B) or OR—Denotes the odds ratio, which measures likelihood: for every one-unit increase in a predictor, the odds of a participant having a “1” in the dependent variable change by a factor of Exp(B). For example, a value of 4.31 means that the odds increase by a factor of 4.31.

95% CI OR—This is the 95% CI for the odds ratio, which means that we are 95% certain that the true value of the odds ratio lies between these limits. If the CI does not contain 1, the Sig. value will be less than 0.050.
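The relationship between the columns B, S.E., Wald, Exp(B), and the 95% CI can be illustrated with the standard Wald formulas. In the sketch below, B is taken as the natural logarithm of an odds ratio of 7.612, and the standard error of 0.5957 is an assumption back-computed from a confidence interval of 2.368 to 24.464 rather than a value read from Table 4; the function name is hypothetical.

```python
import math

def wald_summary(b, se, z=1.96):
    """Wald chi-square, odds ratio, and 95% CI for one logistic regression coefficient."""
    wald = (b / se) ** 2                 # test statistic with one degree of freedom
    odds_ratio = math.exp(b)             # Exp(B)
    ci = (math.exp(b - z * se), math.exp(b + z * se))
    return wald, odds_ratio, ci

b = math.log(7.612)  # coefficient corresponding to an odds ratio of 7.612
se = 0.5957          # assumed standard error, back-computed from the CI limits
wald, odds_ratio, (ci_low, ci_high) = wald_summary(b, se)
```

Because the interval is symmetric around B on the log scale, it is asymmetric around the odds ratio itself, which is why the upper limit lies much further from Exp(B) than the lower one.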

Multivariate logistic regression analysis was used to examine the correlation between a positive therapeutic outcome as a dependent variable and the age of patients, gender, the number of hospitalization days, level of education of patients, place of housing of patients, and type of dispensary health institutions as independent variables. Calculated OR values and the limits of their 95% CI show the ratio of the probability that there will be recovery or improvement in health status and the probability that the health condition is likely to remain or get worse. Patient age, number of days of hospitalization, gender of patients, education level, place of housing, and types of healthcare institutions were used as categorical variables.

Logistic regression analysis confirmed that decreasing probability of occurrence of a positive treatment outcome was associated with the education level of patients (OR = 0.406 95% CI: 0.731 to 1.135; p = 0.406), as well as housing place (OR = 0.297, 95% CI: 0.267 to 0.331; p < 0.001), and the probability of the occurrence of a positive outcome was increased for female gender (OR = 1.107, 95% CI: 0.996 to 1.231, p = 0.060), patient age (OR = 1.335, 95% CI: 1.162 to 1.535, p < 0.001), increased length of hospitalization (OR = 2.277, 95% CI: 1.845 to 2.811, p < 0.001), and treatment at the Clinical Center—Nis (OR = 7.612, 95% CI: 2.368 to 24.464, p = 0.001). These conclusions only confirm a logical and experience-based expectation. Further discussion is necessary for a comprehensive analysis of the results obtained using logistic regression, and information about the results of necessary tests for this task are provided in the tables given below.

The obtained results confirm that all l = n = 6 factors are valid and significant for prediction.

Table 5 provides a summary of the proposed model’s good performance.

In Table 5, the Hosmer and Lemeshow test shows that the value of HLSig = 0.088 is greater than the required 0.05. The classification table, which gives the model predictions of the dependent categorical variable for each case, shows OPCT = 85.3%; both values are acceptable as required by the procedure. The positive predictive value, i.e., the percentage of cases classified by the model as successfully treated that were actually observed as such, is 85.3%. The negative predictive value, i.e., the percentage of cases classified by the model as lacking the characteristic that were actually observed in that group, is 0%.

The accuracy of classification by random selection is (1735/11833)² + (10098/11833)² = 0.7497, i.e., 74.97%, so the binary logistic regression model, with 85.3%, has a higher classification accuracy than a random selection model. The table of variables given as part of Table 3 provides information about the importance of each predictor in the Wald column and about the predictors that can be included in the prediction equation. It cannot be concluded that all predictors influence the dependent variable: the level of education and the gender of the patients stand in contrast to all the others, which evidently have an effect.
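The chance-level accuracy used in this comparison is the sum of the squared class proportions, which can be checked with a few lines of Python. The function name is ours and purely illustrative; the class counts and the 85.3% accuracy are the figures quoted above.

```python
def proportional_chance_accuracy(class_counts):
    """Accuracy expected when cases are assigned at random in proportion
    to the class sizes: the sum of the squared class proportions."""
    total = sum(class_counts)
    return sum((n / total) ** 2 for n in class_counts)

counts = [1735, 10098]     # class sizes quoted in the text (11833 cases in total)
chance = proportional_chance_accuracy(counts)
model_accuracy = 0.853     # OPCT reported for the logistic regression model

print(f"chance accuracy = {chance:.4f}, model accuracy = {model_accuracy:.3f}")
print("model beats random selection:", model_accuracy > chance)
```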

The values given in column B of this table suggest the direction of the relationships between the independent variables and the dependent variable.

4.2.2. Using Classification Algorithms

According to the proposed algorithm, and given that logistic regression confirmed in this step the validity of the influence of all l = n = 6 considered predictors on the dependent variable (outcome), we evaluate the quality of the classification algorithms using the AUC measure. We do this with the default configuration of three classification algorithms of different types: J48 decision tree, NaiveBayes, and LogitBoost. The obtained results show that the LogitBoost classification algorithm can be used, as it has the highest value of the AUC measure, as shown in Table 6 and graphically with a bar chart in Figure 3.
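The selection rule described here (compute the AUC of each candidate classifier and keep the one with the highest value) can be illustrated with a small, self-contained sketch. The study itself used Weka; the code below is only an illustration in plain Python, and the true labels and predicted scores are hypothetical toy values, not results from the case study.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive case is scored higher than a randomly chosen negative
    case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical observed outcomes and predicted probabilities of three classifiers.
y_true = [1, 1, 1, 0, 1, 0, 1, 0]
predictions = {
    "J48":        [0.9, 0.4, 0.8, 0.6, 0.7, 0.5, 0.3, 0.2],
    "NaiveBayes": [0.8, 0.6, 0.7, 0.5, 0.4, 0.6, 0.9, 0.3],
    "LogitBoost": [0.9, 0.7, 0.8, 0.4, 0.6, 0.5, 0.7, 0.2],
}

auc_by_model = {name: auc(y_true, scores) for name, scores in predictions.items()}
best = max(auc_by_model, key=auc_by_model.get)  # classifier with the highest AUC
print(auc_by_model, "best:", best)
```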

The authors used the above-mentioned Weka software and simple ten-fold cross-validation, which means that Weka invokes the learning algorithm eleven times: once for each fold of the cross-validation and then once more on the entire dataset at the end.
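The count of eleven invocations can be made concrete with a short sketch. It is not Weka's implementation: the folds here are contiguous and unshuffled for simplicity (an assumption of this example), and the `train` function is a stand-in that only records the size of the training set it receives.

```python
def k_fold_indices(n_samples, k=10):
    """Split the indices 0..n_samples-1 into k contiguous (train, test) folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds

calls = []  # training-set size of every invocation of the learner

def train(indices):
    """Stand-in for the learning algorithm; a real learner would fit a model here."""
    calls.append(len(indices))

n = 11833  # number of cases in the case study
for train_idx, test_idx in k_fold_indices(n, k=10):
    train(train_idx)       # one invocation per fold, evaluated on test_idx
train(list(range(n)))      # final invocation on the entire dataset

print(len(calls), "invocations; the last one used", calls[-1], "cases")
```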

4.2.3. Check Fulfillment of Set Conditions

Since the set conditions OPCT > 0.5 and HLSig > 0.05 are fulfilled, we continue. Otherwise, the algorithm would lead to an output without a possible valid prediction. If the condition AUC > 0.6 is not fulfilled, we continue to an output with l = 6 prediction factors, determined in step 2, because the condition HLSig > 0.05 has already been established as fulfilled. Otherwise, the algorithm continues with the next step, step 3, which leads to an output with k factors, where k ≤ l, i.e., k ≤ 6.
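The branching described in this step can be summarized as a small function. This is a sketch of the stated conditions only; the function name and the returned messages are ours, and the AUC value passed in the example is assumed to be the 0.671 reported for LogitBoost later in the text.

```python
def next_step(opct, hl_sig, auc, n_factors):
    """Decide how the procedure continues after step 2.

    opct   - overall percentage of correct classification, as a fraction
    hl_sig - significance of the Hosmer and Lemeshow test
    auc    - AUC of the best classification algorithm
    """
    if not (opct > 0.5 and hl_sig > 0.05):
        return "stop: no valid prediction"
    if not auc > 0.6:
        return f"output: prediction formula with l = {n_factors} factors"
    return "continue: feature selection (step 3)"

# OPCT and HLSig as reported in the text; the AUC value is an assumption (see above).
print(next_step(opct=0.853, hl_sig=0.088, auc=0.671, n_factors=6))
```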

4.3. Using Feature Selection

In this step of the procedure, the authors selected relevant attributes using the so-called feature selection technique, which reduces the dimensionality of the original space to a space of lower dimensionality in which the importance of individual factors and the correlation between attribute values can be determined more easily. We propose a filter–ranker evaluation approach for detecting factors and use three randomly selected algorithms of different types instead of one: GainRatio (GR), ChiSquaredAttributeEval (CHI), and Relief (REL). The results in Table 7 show different rankings, but it is easy to conclude that the factors ‘Education’ and ‘Gender’ have the least significance, as was also obtained using the regression algorithm.

Next, using the classification algorithm determined in step 2 (LogitBoost), the AUC value is calculated for each filter–ranker algorithm while eliminating one factor at a time, starting from the last-ranked one, in order to determine which ranking and which number of factors give the highest AUC. Using this procedure, it was determined that the maximal value of AUC1 = AUC = 0.671 for LogitBoost was achieved with the ChiSquaredAttributeEval algorithm using the first five ranked attributes, as shown in Table 8.
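The elimination loop for a single ranking can be sketched as follows. The ordering of the factors and the AUC values are hypothetical placeholders (only the 0.671 for five factors is a figure reported in the text), and `evaluate` is a stand-in for training LogitBoost on the given factors and measuring its AUC. In the described procedure, the same loop would be repeated for each of the three filter–ranker algorithms.

```python
def best_prefix(ranking, evaluate):
    """Drop factors one at a time from the end of a ranking and keep the
    prefix with the highest score (ties favour the larger subset)."""
    best_score, best_factors = float("-inf"), []
    for k in range(len(ranking), 0, -1):
        score = evaluate(ranking[:k])
        if score > best_score:
            best_score, best_factors = score, ranking[:k]
    return best_factors, best_score

# Hypothetical ranking, from most to least important factor.
ranking = ["DaysofTreatment", "UrbanHousing", "HospitalType", "Age", "Gender", "Education"]

# Hypothetical AUC per number of retained factors; only 0.671 comes from the text.
auc_by_size = {6: 0.660, 5: 0.671, 4: 0.655, 3: 0.640, 2: 0.620, 1: 0.600}

def evaluate(factors):
    """Stand-in for training the classifier on `factors` and measuring its AUC."""
    return auc_by_size[len(factors)]

selected, score = best_prefix(ranking, evaluate)
print(len(selected), "factors, AUC =", score, "->", selected)
```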

A graphical presentation of this procedure is given in Figure 4. From Table 8 and the diagram in Figure 4, it is clear that the optimization using the proposed procedure has determined the number of factors as k = 5, with the Education factor excluded.

4.4. Decision Block

In this last step, using the determined k = 5 significant factors (DaysofTreatment, UrbanHousing, HospitalType, Age, and Gender), we first checked the validity of classification (Table 9) and the fulfillment of the condition AUC1 ≥ AUC. If this condition is not fulfilled, the output is the prediction formula with l = 6 factors determined in step 2. Since the condition was fulfilled, as given in Table 10, we continued and checked the validity of the logistic regression (Table 10). If both conditions, OPCT > 0.5 and HLSig > 0.05, are fulfilled and the obtained results of the omnibus tests of model coefficients, the Hosmer and Lemeshow test, and the classification table for logistic regression are valid, the output of the proposed procedure, with fine calibration of classification discrimination using the regression formula, is the formula with the k = 5 mentioned factors; otherwise, the output is the formula with l = 6 factors determined in step 2.
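The decision block reduces to two checks, sketched below. The function is illustrative only: AUC1 = AUC = 0.671 is the value reported in the text, whereas the OPCT and HLSig values for the reduced model are placeholders, since the reported ones are given in Table 10.

```python
def decision_block(auc1, auc, opct, hl_sig, k_factors, l_factors):
    """Return the factors used in the final prediction formula."""
    if auc1 < auc:                      # classification did not hold up after reduction
        return l_factors
    if opct > 0.5 and hl_sig > 0.05:    # logistic regression on the reduced set is valid
        return k_factors
    return l_factors

all_factors = ["DaysofTreatment", "UrbanHousing", "HospitalType", "Age", "Gender", "Education"]
reduced = all_factors[:5]

# opct and hl_sig are placeholder values for the reduced model (assumption).
chosen = decision_block(auc1=0.671, auc=0.671, opct=0.85, hl_sig=0.09,
                        k_factors=reduced, l_factors=all_factors)
print(len(chosen), "factors in the prediction formula:", chosen)
```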

At the end of the application of the proposed model, the prediction formula contains five factors, as given in Table 11.

4.5. Discussion

As presented in this section, in the case study the authors used filter feature selection algorithms of different types (GainRatio, ChiSquaredAttributeEval, and Relief) for dimension reduction of the problem and as the combiner algorithm in the ensemble, together with logistic regression and LogitBoost, the best of the selected classification algorithms of different types (LogitBoost, J48 decision tree, and NaiveBayes), for the evaluation of the proposed stacking ensemble ML model.

The obtained results showed that the proposed algorithm provides optimization of the asymmetric procedure for determining the importance of certain selected non-medical factors for the success of hospital treatment through dimensionality reduction with fine calibration using logistic regression and the classification algorithm. The results also showed that the described procedure leads to a unique prediction formula with good classification characteristics, which qualifies the proposed ensemble method as improved compared to each of the included aggregated methods individually and as having better characteristics than the Ada Boost, Bagging, and Random Forest ensemble methods, which are the state of the art in the considered field of procedures (see Table 12).

Thus, it can be concluded that the proposed generic procedure enables dimensionality reduction and data compression and hence reduces storage space. It also helps remove redundant features, if there are any, and in this way reduces noise in the dataset, which reduces the computational time for classification. All of this increases the AUC and accuracy without decreasing other commonly used measures in binary classification. This is clearly shown using the example of data from the case study considered in this paper in the last two columns of Table 12, in which the proposed method is compared with other currently used methods. Moreover, we have also presented this graphically with a bar chart in Figure 5.

Also, it is important to note that the obtained results were evaluated using the 10-fold cross-validation method, which is the state of the art for this obligatory process [97].

The initial basic hypothesis introduced in this paper in its introductory part was that it is possible to test the success of one plan of organization of health institutions by testing successful treatment depending on the level of professional expertise of healthcare institutions rather than other factors like days of hospitalization, education, place of housing, and age and gender of the patients. The results of the applied analyses show that the success of the treatment of cardiovascular patients predominantly depends on the place of housing and is consequently connected to the type of the hospital, while it depends, to some extent, on the days of hospitalization and age and gender of patients and does not depend on the level of education of patients. However, the influence of these five factors also depends on the non-urban place of residence of patients, which has a negative sign, reflecting decreasing success of inpatient treatment. All other factors have a positive sign, which means that hospitals with higher levels of expertise, more days of hospitalization, and younger, female patients increase the success of treatment.

It is very important to note that the authors did not notice any new limitation on the use of the proposed method besides the disadvantage generally valid for all ensemble methods: a longer execution time. However, the required execution time, which is evidently longer than if only one of the algorithms aggregated in the proposed ensemble were used, is not a limitation for working in real time for the considered problem.

Considering the above discussion of the obtained results, the authors aim to expand their research with the inclusion of more classification and filter-ranking algorithms as part of the proposed ensemble model.

This would lead to additional improvements in model characteristics. The authors will also study other modern methods for assessing the quality of fit between models and data and for eliminating any prevalence present in the data, such as penalty and early stopping methods [98]. Another very interesting research direction could be the analysis of the influence of several separate groups of non-medical factors on the success of patient treatment, for example, environmental, economic, genetic, and demographic factors. The conclusion on different methodologies dealing with patient treatment, already mentioned in Section 2.1, suggests that future work will include the use of state-of-the-art metaheuristic strategies and MOEAs from this group, as well as the evaluation of other possible choices and combinations of schemes in the strategy proposed in this manuscript. On the other hand, techniques and methodologies for solving the considered healthcare problems have already been studied by other researchers [99,100]; thus, in the future, the authors of this paper will deal with many different problems in healthcare and other fields of human life, such as traffic and education, which have also been considered in the literature [101,102].

5. Technical Solution—Code Implementation and Real-Life Software Platform Usage

Model deployment involves taking a model and integrating it into a software application that can be used in real-world scenarios. The purpose of model deployment is to provide a user-friendly interface to interact with the model, allowing users to input new data and obtain predictions based on the model’s output. Here, we list the six steps involved in deploying an ML model:

Export the model: Export the trained ML model into a file format that can be used by other software applications. This could be a serialized object or an ML library-specific format.
Set up a server: Create a server to host the model and handle incoming requests from users. This server could be a cloud-based service like Amazon Web Services (AWS) or Microsoft Azure, or it could be set up on a local machine using software like Flask or Django.
Create an API: Create an application programming interface (API) that will handle requests from clients and return responses with predictions from the model. This API can be created using a web framework like Flask or Django, and it will typically use HTTP requests to send and receive data.
Create a client application: Create a client application that can be used to interface with the API. This client application can be a web application or a mobile application, and it will typically use HTTP requests to send data to the API and receive predictions from the model.
Test and deploy: Test the deployed model using sample data to ensure that it is working as expected. Once testing is complete, deploy the model in a production environment where it can be accessed by users.
Monitor and update: Monitor the deployed model to ensure that it is performing as expected and update it as needed with new data or changes to the model itself.

Overall, model deployment involves creating a server that can host the trained ML model, setting up an API to handle requests from clients, and creating a client application that can interface with the API. The flow of the data in the implemented solution is shown in Figure 6.

As can be seen, the flow of data in this solution starts with the input data, which are collected and sent to the Flask API via the client app.

The Flask API receives the input data, passes them to the ML model, and returns the predictions to the client app. The client app then stores the input data and predictions in a database for future analysis and reference. This process can be repeated for new input data, allowing the ML model to continually improve its predictions over time. A block diagram of the proposed solution is shown in Figure 7.
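The request and response flow described above can be illustrated with a minimal, framework-independent sketch. In the described solution this logic sits behind a Flask route; here it is written as a plain function so that it runs with the standard library only. All names (`MODEL`, `predict`, `handle_request`), the JSON field names, and the coefficient values are hypothetical; the fitted coefficients are those of the prediction formula in Table 11 and are not reproduced here.

```python
import json
import math

# Hypothetical coefficients, for illustration only (not the values of Table 11).
MODEL = {
    "intercept": -1.0,
    "coefficients": {"DaysofTreatment": 0.8, "UrbanHousing": -1.2,
                     "HospitalType": 2.0, "Age": 0.3, "Gender": 0.1},
}

def predict(features, model=MODEL):
    """Probability of a positive outcome from a logistic prediction formula."""
    z = model["intercept"] + sum(
        coef * float(features[name]) for name, coef in model["coefficients"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))

def handle_request(body):
    """Take a JSON request body and return (HTTP status code, JSON response body)."""
    try:
        features = json.loads(body)
        probability = predict(features)
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, a missing factor, or a non-numeric value.
        return 400, json.dumps({"error": str(exc)})
    return 200, json.dumps({"probability": probability,
                            "prediction": int(probability >= 0.5)})

request_body = json.dumps({"DaysofTreatment": 1, "UrbanHousing": 0,
                           "HospitalType": 1, "Age": 1, "Gender": 1})
status, response = handle_request(request_body)
print(status, response)
```

A client application would send `request_body` over HTTP and store the returned prediction together with the input data, as described above.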

As shown in Figure 7, the proposed solution consists of four main components:

Electronic Health Record (EHR): This is the source of data for the ML model. It could be a database or other storage mechanism that contains information about patients and their treatments.
ML Model: This is the core of the solution, which determines the importance of non-medical factors affecting successful inpatient treatment. The model could be developed using various ML algorithms and techniques, depending on the specifics of the problem.
Flask API Server: This component serves as the interface between the ML model and the client application. It provides a RESTful API that receives input data, performs model prediction, and returns output data.
Web-based Client Application: This component provides a user interface for interacting with the ML model. It could be a web application that allows users to input data, view model predictions, and take actions based on the predictions.

The input data could be pre-processed and feature-engineered (if needed) before being sent to the Flask API server for model prediction. The output of the model prediction could be displayed or used to take further action, depending on the requirements of the application. Here are the steps involved in deploying the proposed model as a technical solution in order to try and test it:

Export the model: We exported the trained ML model into a file format that other software applications can use.
Set up a server: We developed a server app to host the model and handle incoming requests from users.
Create an API: We implemented an application programming interface (API) that handles requests from clients and returns responses with predictions from the model. The API is built with the Flask web framework and uses HTTP requests to send and receive data.
Create a client application: We created a client application that can be used to interface with the API. This client application is a web application, and it uses HTTP requests to send data to the API and receive predictions from the model.
Test and deploy: The deployed model has been tested using sample data to ensure that it is working as expected. After we finished testing, the model was deployed in a production environment where real-life users could access it.
Monitor and update: The deployed model is monitored to ensure that it is performing as expected and updated as needed with new data or changes to the model itself, which has also been supported in this technical solution.
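The first of the steps above, exporting the model into a file that the server application can load, can be sketched as a simple round trip. JSON is used here as an assumed file format, and the model content is a hypothetical set of coefficients; the format and contents actually used are those in the Supplementary Materials.

```python
import json
import os
import tempfile

def export_model(model, path):
    """Write the model parameters to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(model, fh, indent=2)

def load_model(path):
    """Read the model parameters back, e.g., when the server starts."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

# Hypothetical model parameters, for illustration only.
model = {
    "intercept": -1.0,
    "coefficients": {"DaysofTreatment": 0.8, "UrbanHousing": -1.2,
                     "HospitalType": 2.0, "Age": 0.3, "Gender": 0.1},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")  # hypothetical file name
    export_model(model, path)
    restored = load_model(path)

print("round trip preserved the model:", restored == model)
```

Updating the deployed model then amounts to exporting a new file and reloading it on the server, without changes to the client application.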

Overall, the model deployment has involved creating a server that can host the trained ML model, setting up an API to handle requests from clients, and creating a client application that can interface with the API. The implemented solution is accessible to users and is used to make predictions based on new data. The source code of the implemented solution is given as Supplementary Materials for this work. The solution is robust, easily expandable, and adaptable to any context. In other words, one can take our code and straightforwardly adapt it to new models and new client scenarios, as well as use it as a real-world client–server software platform.

6. Conclusions

The proposed ensemble model represents an asymmetric optimization procedure based on the stacking model of ensemble learning that consists of logistic regression and classification techniques. The combiner algorithm uses feature selection, which enables dimension reduction in solving the binary classification problem of estimating the importance of non-medical factors for successful inpatient treatment.

The obtained results show that the proposed algorithm leads to a unique prediction formula with good classification characteristics, which qualifies the proposed ensemble method as better compared to each of the combined methods when used individually. In addition, the proposed algorithm surpasses the state-of-the-art ensemble algorithms in the ML field, including the Random Forest, Bagging, and Ada Boost algorithms, as shown in Table 13, where the AUC results are presented.

The main contributions and conclusions of this study are as follows:

○: From a scientific point of view, the authors propose an efficient generic optimization procedure with very good values of classification quality measures that can be used to solve both classic prediction problems and discriminative classification, which essentially determine the importance of individual factors in a multivariate problem in the general case.
○: The proposed algorithm belongs to the family of generic algorithms, which allows its application to a wide range of different problems, and, in general, generic modeling could represent the development of the concept of a model library.
○: From a professional point of view, the authors have developed and made available to the public for use and further development a modern multi-agent application for solving the specific problem of assessing the influence of certain factors on the success of hospital treatment, but it is also usable as such for solving other, similar problems in healthcare and in other fields of human activity.
○: Thereby, using the proposed procedure, the authors have also positively answered both sets of hypotheses, basic and fundamental.
○: There is a difference between factors in their impact on the outcome depending on a particular process. The conducted analysis has shown that from the analyzed factors, the most important individual factors in successful treatment are hospital type and the number of days of treatment.
○: It is possible to aggregate other types of algorithms to construct an ensemble procedure that has better characteristics than each of the included algorithms individually and also better characteristics than the existing ensemble methods.

In future work, the inclusion and selection of other classification and feature selection algorithms, as well as a larger number of them, could be considered. The most recently developed measures for assessing the quality of fit of a model to its data and elimination of possible existing prevalence from data could also be considered. In addition, it could be interesting to consider the influence of several separate groups of non-medical factors on the success of patient treatment, including environmental, economic, genetic, and demographic factors. Finally, future work could employ state-of-the-art metaheuristic strategies and MOEAs and evaluate the other possible choices and combinations of schemes in the proposed strategy.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/sym15112050/s1.

Author Contributions

J.M.: Methodology, Software, Writing—review and editing; A.K.: Project administration, Resources, Validation, Formal analysis; M.R.: Conceptualization, Formal analysis, Investigation, Writing—review and editing; D.R.: Methodology, Software, Validation, Writing—original draft, Supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

https://it.fdb.edu.rs/wp-content/uploads/2023/06/Current-scientific-work.zip (accessed on 27 October 2023).

Acknowledgments

The authors would like to thank the Science and Technology Park in Nis, Republic of Serbia, for their help so that this paper could be written and published.

Conflicts of Interest

The authors declare no conflicts of interest.

References

World Health Assembly Resolution WHA51.7. 1998 Health for all Policy for the Twenty-First Century Geneva: World Health Organization. Available online: http://legacy.library.ucsf.edu/documentStore/g/w/o/gwo93a99/Sgwo93a99.pdf (accessed on 12 August 2023).
Health21: The Health for all Policy Framework for the WHO European Region 1999 (European Health for All series; no. 6.) Copenhagen: World Health Organization Regional Office for Europe. Available online: http://www.euro.who.int/_data/assets/pdf_file/0010/98398/wa540ga199heeng.pdf (accessed on 12 August 2023).
Plan Zdravstvene Zastite iz Obaveznog Zdravstvenog Osiguranja u Republici Srbiji za 2012. Available online: https://www.rfzo.rs/download/plan%20zz/planZZ-2012.pdf (accessed on 12 August 2023).
Zakon o Zdravstvenoj Zastiti Republike Srbije. Available online: http://www.zdravlje.gov.rs/tmpmzadmin/downloads/zakoni1/zakon_zdravstvena_zastita.pdf (accessed on 12 August 2023).
Uredba o Nacionalnom Programu Prevencije, Lecenja i Kontrole Kardiovaskularnih Bolesti u Republici Srbiji do 2020. Available online: https://www.pravno-informacionisistem.rs/SlGlasnikPortal/eli/rep/sgrs/vlada/uredba/2010/11/5 (accessed on 12 August 2023).
Meijden, V.D.; Tange, M.J.; Troost, H.J.; Hasman, J.A. Determinants of success of inpatient clinical information systems: A literature review. J. Am. Med. Inform. Assoc. 2003, 10, 235–243.
Non-Medical Determinants of Health. Available online: https://meteor.aihw.gov.au/content/392618 (accessed on 20 September 2023).
Social Determinants of Health (SDOH) and PLACES Data. Available online: https://www.cdc.gov/about/sdoh/index.html (accessed on 12 August 2023).
Valaitis, R.; Meagher-Stewart, D.; Martin-Misener, R.; Wong, S.T.; MacDonald, M.; O’Mara, L.; The Strengthening Primary Health Care through Primary Care and Public Health Collaboration Team. Organizational factors influencing successful primary care and public health collaboration. BMC Health Serv Res. 2018, 18, 420.
Mosadeghrad, A.M. Factors influencing healthcare service quality. Int J Health Policy Manag. 2014, 3, 77–89.
Truglio-Londrigan, M.; Slyer, J.T.; Singleton, J.K.; Worral, P. A qualitative systematic review of internal and external influences on shared decision making in all health care settings. JBI Database Syst. Rev. Implement. Rep. 2014, 12, 121–194.
Marmot, M.G.; Ruth, B. Action on health disparities in the United States: Commission on Social Determinants of Health. J. Am. Med. Assoc. 2009, 301, 1169–1171.
The Impact of Political, EConomic, Socio-CUltural, Environmental and Other External Influences. Available online: https://www.healthknowledge.org.uk/public-health-textbook/organisation-management/5b-understanding-ofs/assessing-impact-external-influences (accessed on 20 August 2023).
Lewis Hunter, A.E.; Spatz, E.S.; Rosenthal, M.S. Factors influencing hospital admission of non-critically ill patients presenting to the emergency department: A cross-sectional study. J. Gen. Intern. Med. 2016, 31, 37–44.
Rokach, L. Ensemble-based classifiers. Artif. Intell. Rev. 2010, 33, 1–39.
Advantages and Disadvantages of Logistic Regression. Available online: https://www.geeksforgeeks.org/advantages-and-disadvantages-of-logistic-regression/ (accessed on 20 August 2023).
Opitz, D.; Maclin, R. Popular ensemble methods: An empirical study. J. Artif. Intell. Res. 1999, 11, 169–198.
Nguyen, D.K.; Lan, C.H.; Chan, C.L. Deep ensemble learning approaches in healthcare to enhance the prediction and diagnosing performance: The workflows, deployments, and surveys on the statistical, image-based, and sequential datasets. Int. J. Environ. Res. Public Health 2021, 18, 10811.
Alekhya, B.; Sasikumar, R. An ensemble approach for healthcare application and diagnosis using natural language processing. Cogn. Neurodyn. 2022, 16, 1203–1220.
Breiman, L. Stacked regression. Mach. Learn. 1996, 24, 49–64.
Smyth, P.; Wolpert, D.H. Linearly combining density estimators via stacking. Mach. Learn. J. 1999, 36, 59–83.
Faltin, F.W.; Kenett, R.S.; Ruggeri, F. Statistical Methods in Healthcare; Wiley: Hoboken, NJ, USA, 2012; ISBN 978-0-470-67015-6.
El-Sappagh, S.H.; El-Masri, S.; Riad, A.M.; Elmogy, M. Data mining and knowledge discovery: Applications, techniques, challenges and process models in healthcare. Int. J. Eng. Res. Appl. 2013, 3, 900–906.
Bahel, V.; Pillai, S.; Malhotra, M. A Comparative Study on Various Binary Classification Algorithms and their Improved Variant for Optimal Performance. In Proceedings of the 2020 IEEE Region 10 Symposium (TENSYMP), Dhaka, Bangladesh, 5–7 June 2020; pp. 495–498.
Bzovsky, S.; Phillips, M.R.; Guymer, R.H.; Wykoff, C.C.; Thabane, L.; Bhandari, M. The clinician’s guide to interpreting a regression analysis. Eye 2022, 36, 1715–1717.
Wilhelmsen, L.; Wedel, H.; Tibblin, G. Multivariate analysis of risk factors for coronary heart disease. Circulation 2015, 1973, 950–958.
Silver, M.; Sakata, T.; Su, H.C.; Herman, C.; Dolins, S.B.; OShea, M.J. Case study: How to apply data mining techniques in a healthcare data warehouse. J. Healthc. Inf. Manag. 2001, 15, 155–164.
Koh, H.C.; Tan, G. Data mining applications in healthcare. J. Healthc. Inf. Manag. 2005, 19, 64–72.
Milley, A. Healthcare and data mining. Health Manag. Technol. 2000, 21, 44–47.
Saini, A.; Meitei, A.J.; Singh, J. Machine learning in healthcare: A review. In Proceedings of the International Conference on Innovative Computing & Communication (ICICC), University of Delhi, Delhi, India, 20–21 February 2021; Available online: https://ssrn.com/abstract=3834096 (accessed on 20 August 2023).
Toh, C.; Brody, J. Applications of in healthcare In Smart Manufacturing—When Artificial Intelligence Meets the Internet of Things; Intechopen: London, UK, 2021. [Google Scholar] [CrossRef]
Yan, L. The Effect of Risk Factors on Coronary Heart Disease: An Age-Relevant Multivariate Meta Analysis. Ph.D. Thesis, Florida State University, Tallahassee, FL, USA, August 2010. Available online: http://diginole.lib.fsu.edu/etd/1428 (accessed on 12 August 2023).
Shouman, M.; Turner, T.; Stocker, R. Using data mining techniques in heart disease diagnosis and treatment. In Proceedings of the Conference on Electronics, Communications and Computers, Alexandria, Egypt, 6–9 March 2012; pp. 173–177. [Google Scholar] [CrossRef]
Tang, J.W.; Caniza, M.A.; Dinn, M.; Dwyer, D.E.; Heraud, J.M.; Jennings, L.C.; Zaidi, S.K. An exploration of the political, social, economic and cultural factors affecting how different global regions initially reacted to the COVID-19 pandemic. Interface Focus 2022, 12, 20210079. [Google Scholar] [CrossRef]
Rezaei, P.; Hachesu, P.R.; Ahmadi, M.; Alizadeh, S.; Sadoughi, F. Use of data mining techniques to determine and predict length of stay of cardiac patients. Healthc. Inform. Res. 2013, 19, 121–129. [Google Scholar] [CrossRef]
Chen, H.; Poon, J.; Poon, S.K.; Cui, L.; Fan, K.; Sze, D.M.Y. Ensemble learning for prediction of the bioactivity capacity of herbal medicines from chromatographic fingerprints. BMC Bioinform. 2015, 16 (Suppl. 12), S4. [Google Scholar] [CrossRef] [PubMed]
Rahmani, A.M.; Yousefpoor, E.; Yousefpoor, M.S.; Mehmood, Z.; Haider, A.; Hosseinzadeh, M.; Ali Naqvi, R. Machine learning in medicine: Review, applications, and challenges. Mathematics 2021, 9, 2970. [Google Scholar] [CrossRef]
Panagiotis, P.; Livieris, I.E. Special issue on ensemble learning and applications. Algorithms 2020, 13, 140. [Google Scholar] [CrossRef]
Ren, Y.; Zhang, L.; Suganthan, P.N. Ensemble classification and regression-recent developments, applications and future directions. IEEE Comput. Intell. Mag. 2016, 11, 41–53. [Google Scholar] [CrossRef]
Jazieh, A.R. Quality measures: Types, selection, and application in health care quality improvement projects. Glob. J. Qual. Saf. Healthc. 2020, 3, 144–146. [Google Scholar] [CrossRef] [PubMed]
Donabedian, A. Evaluating the quality of medical care. Milbank Q. 2005, 83, 691–729. [Google Scholar] [CrossRef] [PubMed]
Khan, H.; Srivastav, A.; Mishra, A.K. Use of classification algorithms in health care. In Big Data Analytics and Intelligence: A Perspective for Health Care; Tanwar, P., Jain, V., Liu, C.M., Goyal, V., Eds.; Emerald Publishing Limited: Bingley, UK, 2020; pp. 31–54. [Google Scholar] [CrossRef]
Zikos, D.; Zikos, D.; Tsiakas, K.; Qudah, F.; Athitsos, V.; Makedon, F. Evaluation of classification methods for the prediction of hospital length of stay using medicare claims data. In Proceedings of the 7th International Conference on PErvasive Technologies Related to Assistive Environments (PETRA), Rhodes, Greece, 29–31 May 2013. [Google Scholar] [CrossRef]
Mantas, J.; Zikos, D.; Diomidous, M. Exploring the potential of an electronic documentation system to reduce length of stay. In Proceedings of the 14th World Congress on Medical and Health Informatics, MEDINFO 2013, Copenhagen, Denmark, 20–23 August 2013. [Google Scholar] [CrossRef]
Fontalvo-Herrera, T.; Delahoz-Dominguez, E.; Fontalvo, O. Methodology of classification, forecast and prediction of healthcare providers accredited in high quality in Colombia. Int. J. Product. Qual. Manag. 2021, 33, 1–20. Available online: https://repositorio.utb.edu.co/bitstream/handle/20.500.12585/10351/2021_IJPQM-27920_PPV%20%282%29_oz%20De%20la%20Hoz%20Domingu.pdf?sequence=1&isAllowed=y (accessed on 20 September 2023). [CrossRef]
Mahesh, V.; Mudlappa, M. An ensemble classification based approach for breast cancer prediction. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1065, 012049. [Google Scholar] [CrossRef]
Brandt, P.; Moodley, D.; Pillay, A.W.; Seebregts, C.J.; de Oliveira, T. An investigation of classification algorithms for predicting HIV drug resistance without genotype resistance testing. In Foundations of Health Information Engineering and Systems; Lecture Notes in Computer Science; Springer: Berlin, Germany, 2014; Volume 8315, pp. 236–253. [Google Scholar] [CrossRef]
Rodrigues, D.S.; Nastri, A.C.S.; Magri, M.M.; Oliveira, M.S.D.; Sabino, E.C.; Figueiredo, P.H.; Ferreira, J.E. Predicting the outcome for COVID-19 patients by applying time series classification to electronic health records. BMC Med. Inform. Decis. Mak. 2022, 22, 187. [Google Scholar] [CrossRef]
Sahoo, A.K.; Parida, P.; Muralibabu, K.; Dash, S. Efficient simultaneous segmentation and classification of brain tumors from MRI scans using deep learning. Biocybern. Biomed. Eng. 2023, 43, 616–633. [Google Scholar] [CrossRef]
Ahmad, R.; Akhtar, N.; Choubey, N.S. Applications of Artificial Bee Colony Algorithms and its variants in Health care. Biochem. Ind. J. 2017, 11, 110. Available online: https://www.tsijournals.com/articles/applications-of-artificial-bee-colony-algorithms-and-its-variants-in-health-care.pdf (accessed on 20 September 2023).
Zhang, Z.; Wang, H.; Zhang, W.; Cui, Z. Cooperative-competitive two-stage game mechanism assisted many-objective evolutionary algorithm. Inf. Sci. 2023, 647, 119559. [Google Scholar] [CrossRef]
Caldeira, R.H.; Gnanavelbabu, A. Pareto based discrete Jaya algorithm for multi-objective flexible job shop scheduling problem. Expert Syst. Appl. 2021, 170, 114567. [Google Scholar] [CrossRef]
Heydarpoor, F.; Karbassi, S.M.; Bidabadi, N.; Ebadi, M.J. Solving multi-objective functions for cancer treatment by using Metaheuristic Algorithms. Int. J. Comb. Optim. Probl. Inform. 2020, 11, 61–75. [Google Scholar]
Beheshtifar, S.; Alimohammadi, A. Multi-objective evolutionary algorithm for modeling of site suitability for health-care facilities. Health Sci. J. 2013, 7, 209. [Google Scholar]
AbdelAziz, A.M.; Alarabi, L.; Basalamah, S.; Hendawi, A. Multi-Objective Optimization Method for Hospital Admission Problem-A Case Study on Covid-19 Patients. Algorithms 2021, 14, 38. [Google Scholar] [CrossRef]
Ansarifar, J.; Tavakkoli-Moghaddam, R.; Akhavizadegan, F.; Hassanzadeh Amin, S. Multi-objective integrated planning and scheduling model for operating rooms under uncertainty. Proc. IMechE Part H J. Eng. Med. 2018, 232, 930–948. [Google Scholar] [CrossRef]
Ghaheri, A.; Shoar, S.; Naderan, M.; Hoseini, S.S. The Applications of Genetic Algorithms in Medicine. Oman Med. J. 2015, 30, 406–416. [Google Scholar] [CrossRef]
Espinosa, R.; Jiménez, F.; Palma, J. Multi-surrogate assisted multi-objective evolutionary algorithms for feature selection in regression and classification problems with time series data. Inf. Sci. 2023, 622, 1064–1091. [Google Scholar] [CrossRef]
Hosmer, D.W.; Hosmer, T.; Le Cessie, S.; Lemeshow, S. A comparison of goodness of fit tests for the logistic regression model. Stat. Med. 1997, 16, 965–980. [Google Scholar] [CrossRef]
Hosmer, D.W.; Lemeshow, S. A goodness of fit test for the multiple logistic regression model. Commun. Stat. 1980, 9, 1043–1069. [Google Scholar] [CrossRef]
How to Improve the Accuracy of a Regression Model. Available online: https://towardsdatascience.com/how-to-improve-the-accuracy-of-a-regression-model-3517accf8604 (accessed on 20 August 2023).
Fawcett, T. ROC Graphs: Notes and Practical Considerations for Data Mining Researchers; Technical Report; HP Laboratories: Palo Alto, CA, USA, 2003; Available online: https://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf (accessed on 1 September 2023).
Vuk, M.; Curk, T. ROC curve, lift chart and calibration plot. Metod. Zv. 2006, 3, 89–108. [Google Scholar] [CrossRef]
Dimić, G.; Prokin, D.; Kuk, K.; Micalović, M. Primena decision trees i naive Bayes klasifikatora na skup podataka izdvojen iz Moodle kursa. In Proceedings of the Conference INFOTEH, Jahorina, Bosnia and Herzegovina, 21–23 March 2012; Volume 11, pp. 877–882. [Google Scholar]
Witten, I.H.; Frank, E. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed.; Morgan Kaufmann: San Francisco, CA, USA, 2005. [Google Scholar]
Benoît, G. Data Mining. Annu. Rev. Inf. Sci. Technol. 2002, 36, 265–310. [Google Scholar] [CrossRef]
Romero, C.; Ventura, S.; Espejo, P.G.; Hervás, C. Data mining algorithms to classify students. In Proceedings of the 1st IC on Educational Data Mining (EDM08), Montreal, QC, Canada, 20–21 June 2008; pp. 20–21. [Google Scholar]
Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann: San Francisco, CA, USA, 1988. [Google Scholar]
Zhang, H. The optimality of naive Bayes. In Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference, Miami Beach, FL, USA, 17–19 May 2004; pp. 562–567. [Google Scholar]
Rokach, L.; Maimon, O. Decision trees. In The Data Mining and Knowledge Discovery Handbook; Springer: Berlin/Heidelberg, Germany, 2005; pp. 165–192. [Google Scholar] [CrossRef]
Xiaohu, W.; Lele, W.; Nianfeng, L. An application of decision tree based on ID3. Phys. Procedia 2012, 25, 1017–1021. [Google Scholar] [CrossRef]
Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann: San Francisco, CA, USA, 1993. [Google Scholar]
Friedman, J.; Hastie, T.; Tibshirani, R. Additive logistic regression: A statistical view of boosting. Ann. Stat. 2000, 28, 337–407. [Google Scholar] [CrossRef]
Bella, A. Calibration of machine learning models. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques; IGI Global: Hershey, PA, USA, 2009; pp. 128–146. [Google Scholar] [CrossRef]
Park, H.A. An introduction to logistic regression: From basic concepts to interpretation with particular attention to nursing domain. J. Korean Acad. Nurs. 2013, 43, 154–164. [Google Scholar] [CrossRef] [PubMed]
Rajendra, P.; Latifi, S. Prediction of diabetes using logistic regression and ensemble techniques. Comput. Methods Programs Biomed. Update 2021, 1, 100032. [Google Scholar] [CrossRef]
IBM SPSS Statistics. Available online: https://www.ibm.com/products/spss-statistics (accessed on 15 August 2023).
Zadrozny, B.; Elkan, C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML 2001, Williamstown, MA, USA, 28 June–1 July 2001; Morgan Kaufmann Publishers: San Francisco, CA, USA, 2001; pp. 609–616. [Google Scholar]
Weka (University of Waikato: New Zealand). Available online: http://www.cs.waikato.ac.nz/ml/weka (accessed on 20 August 2023).
Liu, H.; Motoda, H. Feature Selection for Knowledge Discovery and Data Mining; Kluwer Academic Publishers: New York, NY, USA, 1998. [Google Scholar]
Hall, M.A.; Smith, L.A. Practical feature subset selection for machine learning. In Proceedings of the Computer Science ’98—21st Australasian Computer Science Conference ACSC’98, Perth, Australia, 4–6 February 1998; pp. 181–191. [Google Scholar]
Moriwal, R.; Prakash, V. An efficient info-gain algorithm for finding frequent sequential traversal patterns from web logs based on dynamic weight constraint. In Proceedings of the International Information Technology Conference CUBE ’12, Pune, India, 3–5 September 2012; pp. 718–723, ISBN 978-1-4503-1185-4. [Google Scholar]
Pravena, R.; Valamathi, M.; Sivakumari, S. Gain ratio based feature selection method for privacy preservation. ICTACT J. Soft Comput. 2011, 1, 201–205. [Google Scholar] [CrossRef]
Turhan, N.S. Karl Pearson’s chi-square tests. Educ. Res. Rev. 2020, 15, 575–580. [Google Scholar] [CrossRef]
Robnik-Šikonja, M.; Kononenko, I. Theoretical and empirical analysis of ReliefF and RReliefF. Mach. Learn. 2003, 53, 23–69. [Google Scholar] [CrossRef]
Xie, Y.; Li, D.; Zhang, D.; Shuang, H. An Improved Multi-Label Relief Feature Selection Algorithm for Unbalanced Datasets. In Advances in Intelligent Systems and Computing; Springer: Cham, Switzerland, 2018; pp. 141–151. [Google Scholar] [CrossRef]
Harrell, F. Hosmer-Lemeshow vs. AIC for Logistic Regression. Available online: https://stats.stackexchange.com/q/18772 (accessed on 20 August 2023).
Steyerberg, E.W.; Vickers, A.J.; Cook, N.R.; Gerds, T.; Gonen, M.; Obuchowski, N.; Pencina, M.J.; Kattan, M.W. Assessing the performance of prediction models: A framework for traditional and novel measures. Epidemiology 2010, 21, 128–138. [Google Scholar] [CrossRef]
Arshed, N.; Pancholi, J. Porter’s generic strategies. In Enterprise and Its Business Environment; Arshed, N., McFarlane, J., Eds.; Goodfellow Publishers Ltd.: Oxford, UK, 2016; ISBN 978-1-910158-78-4. [Google Scholar]
Vahdati, H.; Nejad, S.H.M.; Shahsia, N. Generic competitive strategies toward achieving sustainable and dynamic competitive advantage. Rev. Espac. 2018, 39, 25. Available online: https://www.revistaespacios.com/a18v39n13/18391325.html (accessed on 20 September 2023).
Chikhachev, S.A. Generic models. Algebra Log. 1975, 14, 214–218. [Google Scholar] [CrossRef]
Shelah, S. A note on model complete models and generic models. Proc. Am. Math. Soc. 1972, 34, 509–514. [Google Scholar] [CrossRef]
Mienye, D.; Sun, Y. A Survey of Ensemble Learning: Concepts, Algorithms, Applications, and Prospects. IEEE Access 2022, 10, 99129–99149. [Google Scholar] [CrossRef]
Bennett, K.P.; Mangasarian, O.L. Robust linear programming discrimination of two linearly inseparable sets. Optim. Methods Softw. 1992, 1, 23–34. [Google Scholar] [CrossRef]
Scikit Learn. Available online: https://scikit-learn.org/stable/modules/ensemble.html#stacking (accessed on 20 September 2023).
Wolpert, D. Stacked generalization. Neural Netw. 1992, 5, 241–259. [Google Scholar] [CrossRef]
Arlot, S.; Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 2010, 4, 40–79. [Google Scholar] [CrossRef]
Jabbar, H.K.; Khan, R.Z. Methods to avoid over-fitting and under-fitting in supervised machine learning (Comparative study). Comput. Sci. Commun. Instrum. Devices 2015, 12, 978–981. [Google Scholar] [CrossRef]
Aleksić, A.; Nedeljković, S.; Jovanović, M.; Ranđelović, M.; Vuković, M.; Stojanović, V.; Radovanović, R.; Ranđelović, M.; Ranđelović, D. Prediction of important factors for bleeding in liver cirrhosis disease using ensemble data mining approach. Mathematics 2020, 8, 1887. [Google Scholar] [CrossRef]
Ranđelović, D.; Ranđelović, M.; Čabarkapa, M. Using machine learning in the prediction of the influence of atmospheric parameters on health. Mathematics 2022, 10, 3043. [Google Scholar] [CrossRef]
Aleksić, A.; Ranđelović, M.; Ranđelović, D. Using machine learning in predicting the impact of meteorological parameters on traffic incidents. Mathematics 2023, 11, 479. [Google Scholar] [CrossRef]
Ranđelović, M.; Aleksić, A.; Radovanović, R.; Stojanović, V.; Čabarkapa, M.; Ranđelović, D. One aggregated approach in multidisciplinary based modeling to predict further students’ education. Mathematics 2022, 10, 2381. [Google Scholar] [CrossRef]

Figure 1. The flowchart of Algorithm 1.

Figure 2. Block diagram of the procedure described by Algorithm 1.

Figure 3. Chart diagram representation of the main result in Table 5.

Figure 4. Graph for determining the highest AUC value achieved with the lowest number of factors.

Figure 5. Chart diagram representation of the main result in Table 11.

Figure 6. The data flow in the implemented solution.

Figure 7. Block diagram of the proposed solution.

Table 1. Tabular overview of used literature (* means that the mentioned factor type/methodology is used in the reference).

Reference Citations	Organizational Factors	Socio-Economic Factors	Descriptive Statistics	Logistic Regression	ML and Data Mining	Ensemble Methods	Other Strategies
[9]	*		*
[10]	*		*
[11]	*		*
[12]		*	*
[13]		*	*
[14]		*	*
[24]		*		*
[25]		*		*
[26]		*			*
[27]		*			*
[28]		*			*
[29]		*			*
[30]		*			*
[31]		*		*
[32]		*			*
[33]		*
[34]			*
[35]		*			*
[36]		*				*
[37]		*			*	*
[38]		*				*
[39]		*		*		*
[40]	*		*	*	*
[41]	*		*
[42]		*			*
[43]		*			*
[44]	*		*
[45]	*				*
[46]		*				*
[47]		*			*
[48]	*			*
[49]		*			*
[50]	*						*
[51]	*						*
[52]	*						*
[53]		*					*
[54]	*						*
[55]		*					*
[56]	*						*
[57]		*					*
[58]		*		*	*		*

Table 2. The confusion matrix for two-class classifier.

		Label—Predicted
		Positive	Negative
Label—Actual	Positive	TP	FN
Label—Actual	Negative	FP	TN
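
For reference, the indicators reported in the later tables follow from the four cells of Table 2 through the standard definitions below. This is a textbook summary for the positive class, not a formula set quoted from the article.

```latex
\begin{aligned}
\text{Precision} &= \frac{TP}{TP + FP}, &
\text{Recall} &= \frac{TP}{TP + FN},\\[4pt]
F_1 &= \frac{2\,\text{Precision}\cdot\text{Recall}}{\text{Precision} + \text{Recall}}, &
\text{Accuracy} &= \frac{TP + TN}{TP + FN + FP + TN}.
\end{aligned}
```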

Table 3. Non-medical factors used in the case study.

Variable’s Serial Number	Non-Medical Factor	Data Type
1	Education	Boolean
2	HospitalType	Boolean
3	Gender	Boolean
4	Age	Boolean
5	DaysOfTreatment	Boolean
6	UrbanHousing	Boolean
7	Outcome	Boolean

Table 4. OR values and their 95% CI for assessing the impact of the examined factors.

Variables in the Equation
		B	S.E.	Wald	df	Sig.	Exp(B)	95% CI for EXP(B)
								Lower	Upper
Step 1 ^a	HospitalType	2.030	0.596	11.610	1	0.001	7.612	2.368	24.464
	UrbanHousing	−1.214	0.055	492.897	1	0.000	0.297	0.267	0.331
	Education	−0.093	0.112	0.690	1	0.406	0.911	0.731	1.135
	Gender	0.102	0.054	3.540	1	0.060	1.107	0.996	1.231
	Age	0.289	0.071	16.524	1	0.000	1.335	1.162	1.535
	DaysOfTreatment	0.823	0.107	58.707	1	0.000	2.277	1.845	2.811
	Constant	1.796	0.077	548.193	1	0.000	6.027

^a Variable(s) entered in step 1: HospitalType, UrbanHousing, Education, Gender, Age, DaysOfTreatment.
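
The columns of Table 4 are linked by the usual logistic regression relations: the Wald statistic is (B / S.E.)², the odds ratio is exp(B), and the 95% confidence interval is exp(B ± 1.96 · S.E.). The following is a minimal sketch of that arithmetic, assuming the normal-approximation interval with z = 1.96. The function name `odds_ratio_with_ci` is hypothetical and is not part of the authors' implementation.

```python
import math


def odds_ratio_with_ci(b, se, z=1.96):
    """Return (Wald statistic, odds ratio, CI lower, CI upper) for one coefficient.

    Assumes the normal-approximation interval exp(b +/- z * se).
    """
    wald = (b / se) ** 2
    odds_ratio = math.exp(b)
    lower = math.exp(b - z * se)
    upper = math.exp(b + z * se)
    return wald, odds_ratio, lower, upper


# B and S.E. for HospitalType as reported in Table 4.
wald, odds_ratio, lower, upper = odds_ratio_with_ci(2.030, 0.596)
print(f"Wald={wald:.3f}  OR={odds_ratio:.3f}  95% CI=({lower:.3f}, {upper:.3f})")
```

The printed values should agree with the HospitalType row only up to small differences, because B and S.E. are themselves rounded to three decimals in the table.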

Table 5. Enter method—beginning regression analysis using all 6 factors.

Hosmer and Lemeshow Test
Step 1	Chi-square	df	Sig.
Step 1	11.001	6	0.088
Classification Table ^a
Step 1	Observed	Predicted Outcome 0	Predicted Outcome 1	Percentage Correct
	Outcome 0	0	1735	0.0
	Outcome 1	0	10,098	100.0
	Overall percentage			85.3

^a The cut-off value is 0.500.

Table 6. Five standard performance indicators obtained by the classification algorithms using all six factors.

Classifier	Configuration	Precision	Recall	F1 Measure	Accuracy (%)	AUC
Naive Bayes	Default	0.728	0.853	0.786	85.3376	0.669
Logit Boost	Default	0.728	0.853	0.786	85.3376	0.670
J48 Decision Tree	Default	0.728	0.853	0.786	85.3376	0.499
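
The Precision, Recall, F1 and Accuracy columns of Table 6 can be related to the classification table in Table 5 (1735 cases of class 0 and 10,098 cases of class 1, all predicted as class 1). The sketch below assumes that the reported indicators are support-weighted averages over both classes and that an undefined per-class precision or F1 is counted as 0. Both assumptions are an inference from the numbers, not something stated in the article, and `weighted_metrics` is a hypothetical helper.

```python
def weighted_metrics(tp, fn, fp, tn):
    """Support-weighted precision, recall, F1 and accuracy (%) for two classes.

    Assumption: a ratio with a zero denominator is counted as 0.
    """
    total = tp + fn + fp + tn

    def safe_div(num, den):
        return num / den if den else 0.0

    # One (support, precision, recall) triple per class.
    per_class = [
        (tp + fn, safe_div(tp, tp + fp), safe_div(tp, tp + fn)),  # class 1
        (tn + fp, safe_div(tn, tn + fn), safe_div(tn, tn + fp)),  # class 0
    ]
    precision = recall = f1 = 0.0
    for support, p, r in per_class:
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * safe_div(2 * p * r, p + r)
    accuracy = 100.0 * (tp + tn) / total
    return precision, recall, f1, accuracy


# Counts taken from the classification table in Table 5.
precision, recall, f1, accuracy = weighted_metrics(tp=10098, fn=0, fp=1735, tn=0)
print(f"{precision:.3f} {recall:.3f} {f1:.3f} {accuracy:.4f}")
```

Under these assumptions the four values match the corresponding columns of Table 6 after rounding. That is consistent with, though not proof of, the weighted-average reading. AUC is not derivable from a single confusion matrix and is therefore not reproduced here.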

Table 7. Factors ranking by the feature selection measures—6 factors.

SerialNum. Tag	Attribute Name	GR-Ranking GainRatio	CHI-Ranking ChiSquared	REL-Ranking ReliefF
1	HospitalType	2	4	1
2	UrbanHousing	1	1	5
3	Education	6	6	3
4	Gender	5	5	6
5	Age	4	3	4
6	DaysOfTreatment	3	2	2

Table 8. AUC of Logit Boost for different numbers of factors retained under each feature ranking.

Algorithm/Number of Factors	6	5	4	3	2	1
GR	0.671	0.670	0.661	0.651	0.634	0.631
CHI	0.671	0.671	0.661	0.658	0.649	0.631
REL	0.671	0.663	0.560	0.530	0.538	0.501
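
Figure 4 describes the selection step as finding the highest AUC for the lowest number of factors. The sketch below applies that rule to the values of Table 8. The tie-breaking order (highest AUC first, then fewest factors) is an assumption about how the rule is applied, and `smallest_best_subset` is a hypothetical name.

```python
# AUC values copied from Table 8, keyed by the number of retained factors.
auc_by_ranking = {
    "GR":  {6: 0.671, 5: 0.670, 4: 0.661, 3: 0.651, 2: 0.634, 1: 0.631},
    "CHI": {6: 0.671, 5: 0.671, 4: 0.661, 3: 0.658, 2: 0.649, 1: 0.631},
    "REL": {6: 0.671, 5: 0.663, 4: 0.560, 3: 0.530, 2: 0.538, 1: 0.501},
}


def smallest_best_subset(table):
    """Return (ranking, number of factors, AUC) with the highest AUC.

    Ties on AUC are broken in favour of fewer factors (assumed rule).
    """
    candidates = [
        (auc, -n_factors, ranking)
        for ranking, row in table.items()
        for n_factors, auc in row.items()
    ]
    best_auc, neg_n, ranking = max(candidates)
    return ranking, -neg_n, best_auc


print(smallest_best_subset(auc_by_ranking))
```

With the Table 8 values, the rule selects the five-factor subset under the CHI ranking. In Table 7 that ranking places Education last, which matches the five-factor model in Table 11, where Education is the omitted variable.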

Table 9. Performance indicators obtained using the proposed algorithm: Logit Boost using 5 factors.

Number of Factors	AUC Value
6 factors	0.670
5 factors	0.671

Table 10. Hosmer–Lemeshow test and classification table—regression analysis using 5 factors.

Hosmer and Lemeshow Test
Step 1	Chi-square	df	Sig.
Step 1	9.606	4	0.058
Classification Table ^a
Step 1	Observed	Predicted Outcome 0	Predicted Outcome 1	Percentage Correct
	Outcome 0	0	1735	0.0
	Outcome 1	0	10,098	100.0
	Overall percentage			85.3

^a The cut-off value is 0.500.

Table 11. Enter method of logistic regression analysis using 5 factors.

Variables in the Equation
		B	S.E.	Wald	df	Sig.	Exp (B)	95% CI for EXP(B)
								Lower	Upper
Step1 ^a	Hospital Type	2.039	0.596	11.727	1	0.001	7.687	2.392	24.698
	Urban Housing	−1.209	0.054	494.128	1	0.000	0.298	0.268	0.332
	Gender	0.108	0.054	4.012	1	0.045	1.114	1.002	1.237
	Age	0.294	0.071	17.189	1	0.000	1.342	1.168	1.542
	Days of Treatment	0.824	0.107	58.885	1	0.000	2.280	1.847	2.814
	Constant	1.782	0.075	570.063	1	0.000	5.940

^a Variable(s) entered in step 1: Hospital Type, Urban Housing, Gender, Age, Days of Treatment.

Table 12. Comparison of the performance indicators of the proposed model and well-known ensemble procedures.

	Precision	Recall	F1 Measure	Accuracy (%)	AUC
Proposed model	0.728	0.853	0.786	85.3376	0.671
Ada Boost	0.728	0.853	0.786	85.3376	0.670
Bagging	0.728	0.853	0.786	85.3376	0.640
Random Forest	0.728	0.853	0.787	85.303	0.670

Table 13. AUC comparison of the proposed methodology with state-of-the-art algorithms.

Methodology	AUC
Proposed model	0.671
Ada Boost	0.670
Bagging	0.640
Random Forest	0.670

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Mišić, J.; Kemiveš, A.; Ranđelović, M.; Ranđelović, D. An Asymmetric Ensemble Method for Determining the Importance of Individual Factors of a Univariate Problem. Symmetry 2023, 15, 2050. https://doi.org/10.3390/sym15112050

