Article

Rough-Set-Theory-Based Classification with Optimized k-Means Discretization

by Teguh Handjojo Dwiputranto *, Noor Akhmad Setiawan and Teguh Bharata Adji

Department of Electrical and Information Engineering, Universitas Gadjah Mada, Yogyakarta 55281, Indonesia

* Author to whom correspondence should be addressed.
Technologies 2022, 10(2), 51; https://doi.org/10.3390/technologies10020051
Submission received: 10 March 2022 / Revised: 5 April 2022 / Accepted: 6 April 2022 / Published: 8 April 2022

Abstract

The discretization of continuous attributes in a dataset is an essential step before the Rough-Set-Theory (RST)-based classification process is applied. There are many methods for discretization, but not many of them have linked the RST instruments from the beginning of the discretization process. The objective of this research is to propose a method to improve the accuracy and reliability of the RST-based classifier model by involving RST instruments at the beginning of the discretization process. In the proposed method, a k-means-based discretization method optimized with a genetic algorithm (GA) was introduced. Four datasets taken from UCI were selected to test the performance of the proposed method. The evaluation of the proposed discretization technique for RST-based classification is performed by comparing it to other discretization methods, i.e., equal-frequency and entropy-based. The performance comparison among these methods is measured by the number of bins and rules generated and by its accuracy, precision, and recall. A Friedman test continued with post hoc analysis is also applied to measure the significance of the difference in performance. The experimental results indicate that, in general, the performance of the proposed discretization method is significantly better than the other compared methods.

1. Introduction

Classification is one of the tasks most commonly performed in machine learning (ML). In general, the purpose of classification is to assign an object to one of a set of predefined categories. There are various algorithms for classification, such as Decision Tree, Artificial Neural Network, Random Forest, Fuzzy Logic, and many more, including Rough Set Theory (RST). To obtain the best result, selecting the proper algorithm is crucial, considering not only the accuracy but also the costs of training, testing, and implementation. Another important factor is whether the classification model needs to be built as a white-box or black-box model. If a white-box model is expected, a method such as Decision Tree, Fuzzy Logic, or RST can be applied because these methods produce transparent decision rules.
Datasets to be processed for classification often contain attributes with continuous values. Such attributes cannot be processed directly by a classifier that requires discrete data, such as RST. To be able to process the dataset, a discretization step must therefore be carried out first.
Currently, there are many state-of-the-art methods for discretization, as reported in Refs. [1,2]. Based on these reports, there are two main groups of discretization methods, i.e., supervised and unsupervised. These surveys also show that the popular unsupervised discretization methods are equal-width and equal-frequency binning. The disadvantage of unsupervised methods is that there is no guarantee that the resulting discretization is optimal, since no feedback measures its quality during the process. To generate optimal discretized values, a supervised method should be applied. One of the popular methods for supervised discretization is entropy-based [1]. However, the question remains whether the entropy-based method is suitable for RST-based classifiers.
This paper aims to improve the classification performance using the RST method on various datasets with continuous values obtained from UCI. The contribution of this study is to propose data pre-processing methods related to discretization before carrying out the classification process. The proposed method starts with applying k-means to discretize continuous value attributes, then optimizes them by using a genetic algorithm (GA) that involves one of the RST instruments, called the dependency coefficient, to maintain the quality of the dataset as the original after the implementation of the discrete process.
By involving one of the RST elements in the discretization process, it is expected that the discretization results will be suitable for the RST-based classifier. Thus, the novelty of the proposed method compared to other discretization processes is that the method is based on approximation quality with the expectation that it will give better results to be used by the RST-based classifier because the approximation is controlled by one of the RST elements from the beginning.
This paper is organized as follows: Section 2 explains the theoretical basis of the RST, which begins with the concept of approximation in the framework of rough sets, and then continues with an explanation of the basic notions and characteristics of the RST. Section 3 presents the need for discretization and its various techniques, especially those related to the proposed method. Section 4 describes the basic concepts of the proposed method and the algorithm in pseudo-code form. Section 5 presents the experimental framework, the datasets used, and other popular discretization methods. Section 6 describes the analysis of the experimental results, and this paper is concluded in Section 7.

2. Basic Notions

Before the method proposed in this article is described in detail, a basic picture of RST, first proposed by Zdzislaw Pawlak in 1982, will be given. RST is intended to classify and analyze imprecise, uncertain, or incomplete information and knowledge [3,4]. The underlying concept of RST is the approximation of a set by a pair of lower and upper sets. The lower approximation is determined by the objects that certainly belong to the desired subset, while the upper approximation is determined by the objects that possibly belong to it. Any subset defined or bounded by an upper and lower approximation is called a Rough Set [3]. Since it was proposed, RST has been used as a valuable tool for solving various problems, such as imprecise or uncertain knowledge representation, knowledge analysis, quality measurement of the information available in data patterns, data dependency and uncertainty analysis, and information reduction [5].
This RST approach also contributes to the artificial intelligence (AI) foundation, especially in machine learning, knowledge discovery, decision analysis, expert systems, inductive reasoning, and pattern recognition [3].
The rough sets approach has many advantages. Some of the most prominent advantages of applying RST are [6]:
  • Efficient in finding hidden patterns in the dataset;
  • Able to identify difficult data relationships;
  • Able to reduce the amount of data to a minimum (data reduction);
  • Able to evaluate the level of significance of the data;
  • Able to produce a set of rules for transparent classification.
The following sub-sections will explain the basic and important philosophies associated with RST to be discussed based on Refs. [3,6,7,8,9].

2.1. Equivalence Relations

Let $U$ be a non-empty set, and let $p$, $q$, and $r$ be elements of $U$. If $R$ denotes a relation such that $pRq$ expresses a relation between $p$ and $q$, then $R$ is said to be an equivalence relation when it satisfies the following three properties:
  • Reflexive: $pRp$ for all $p$ in $U$;
  • Symmetric: if $pRq$, then $qRp$;
  • Transitive: if $pRq$ and $qRr$, then $pRr$.
If $x \in U$, then $[x]_R = \{y \in U : yRx\}$ is the equivalence class of $x$ with respect to $R$.

2.2. Information System and Relationship Indiscernibility

Let $T = (U, A, Q, \rho)$ be an Information System ($IS$), where $U$ is a non-empty set of objects called the universe, $A$ is a set of attributes, $Q$ is the union of the attribute domains in $A$, and $\rho : U \times A \rightarrow Q$ is a total description function. For classification, the set of attributes $A$ is divided into condition attributes, denoted by $CON$, and a decision attribute, denoted by $DEC$. When the attributes of the information table have been divided into condition and decision attributes, the table is called a decision table. The elements of $U$ can be called objects, cases, instances, or observations [10]. The attributes can be called features, variables, or characteristic conditions. If an attribute $a$ is given, then $a : U \rightarrow V_a$ for $a \in A$, where $V_a$ is called the set of values of $a$.
If $a \in A$ and $P \subseteq A$, then an indiscernibility relation $IND(P)$ can be defined as $IND(P) = \{(x, y) \in U \times U : a(x) = a(y) \text{ for all } a \in P\}$; in other words, two objects are said to be indiscernible when they cannot be distinguished by the attributes in $P$. The equivalence class of $x$ under the indiscernibility relation $IND(P)$ is denoted by $[x]_P$.
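As an illustration of how the indiscernibility partition can be computed in practice, a minimal Python sketch is given below (the decision table and attribute names are hypothetical, not taken from the experiments in this paper):

from collections import defaultdict

def ind_partition(universe, P):
    # Group object indices into the equivalence classes of IND(P): objects sharing
    # the same values on all attributes in P fall into one class.
    classes = defaultdict(list)
    for idx, obj in enumerate(universe):
        classes[tuple(obj[a] for a in P)].append(idx)
    return list(classes.values())

# Hypothetical decision table with condition attributes "temp", "wind" and decision "play".
U = [
    {"temp": "high", "wind": "weak",   "play": "no"},
    {"temp": "high", "wind": "strong", "play": "no"},
    {"temp": "low",  "wind": "weak",   "play": "yes"},
    {"temp": "high", "wind": "weak",   "play": "yes"},
]
print(ind_partition(U, ["temp", "wind"]))   # [[0, 3], [1], [2]]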

2.3. Lower Approximation Subset

Let $B \subseteq C$, where $C$ is the set of condition attributes, and let $X \subseteq U$; then, the B-lower approximation of $X$ is the set of all elements of $U$ that can be classified with certainty as elements of $X$, as shown in Equation (1):
$B_*(X) = \{ x \in U : [x]_B \subseteq X \}$ (1)

2.4. Upper Approximation Subset

The B-upper approximation of $X$ is the set of all elements of $U$ that may possibly be classified as elements of $X$, as shown in Equation (2):
$B^*(X) = \{ x \in U : [x]_B \cap X \neq \emptyset \}$ (2)

2.5. Boundary Region Subset

This subset contains the objects for which it cannot be determined with certainty whether or not they belong to $X$, as defined in Equation (3):
$BN_B(X) = B^*(X) - B_*(X)$ (3)
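A minimal sketch of Equations (1)–(3), using the same kind of small, hypothetical decision table as above, might look as follows; it computes the lower approximation, the upper approximation, and the boundary region of a target set $X$:

from collections import defaultdict

def ind_partition(universe, B):
    classes = defaultdict(list)
    for idx, obj in enumerate(universe):
        classes[tuple(obj[a] for a in B)].append(idx)
    return list(classes.values())

def approximations(universe, B, X):
    X = set(X)
    lower, upper = set(), set()
    for eq_class in ind_partition(universe, B):
        eq = set(eq_class)
        if eq <= X:        # Equation (1): the whole class lies inside X
            lower |= eq
        if eq & X:         # Equation (2): the class intersects X
            upper |= eq
    return lower, upper, upper - lower   # Equation (3): boundary region

U = [
    {"temp": "high", "wind": "weak", "play": "no"},
    {"temp": "high", "wind": "weak", "play": "yes"},
    {"temp": "low",  "wind": "weak", "play": "yes"},
]
X = {i for i, o in enumerate(U) if o["play"] == "yes"}    # target concept "play = yes"
print(approximations(U, ["temp", "wind"], X))             # ({2}, {0, 1, 2}, {0, 1})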

2.6. Rough Set

A set described by its lower and upper approximations is called a rough set. For a rough set, $B_*(X) \neq B^*(X)$, i.e., the boundary region is non-empty. Figure 1 illustrates each set defined by Equations (1)–(3).

2.7. Crisp Set

If $B_*(X) = B^*(X)$, then the set is called a crisp set.

2.8. Positive Region Subset

This is the set of objects of the universe $U$ that can be classified with certainty into the classes of $U/D$ using the set of attributes $C$, as shown in Equation (4):
$POS_C(D) = \bigcup_{X \in U/D} C_*(X)$, (4)
where $U/D$ is the partition of $U$ based on the values of the decision attribute $D$ and $C_*(X)$ denotes the lower approximation of the set $X$ with respect to $C$. The positive region of a subset $X$ belonging to the partition $U/D$ is simply the lower approximation of $X$. The positive region of the decision attribute with respect to a subset $C$ approximately represents the quality of $C$. The union of the positive and boundary regions yields the upper approximation [7].

2.9. Dependency Coefficient

Let $T = (U, A, C, D)$ be a decision table. The dependency coefficient between the condition attributes $C$ and the decision attribute $D$ can be formulated as in Equation (5):
$\gamma(C, D) = \dfrac{|POS_C(D)|}{|U|}$ (5)
The value of the dependency coefficient lies in the range from 0 to 1 and represents the portion of objects that can be correctly classified out of the total. If $\gamma = 1$, then $D$ depends completely on $C$; if $0 < \gamma < 1$, then $D$ depends partially on $C$; and if $\gamma = 0$, then $D$ has no dependency on $C$. A decision table depends on the condition attribute set when all values of the decision attribute $D$ can be uniquely determined by the condition attribute values.
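To make the role of the dependency coefficient concrete, a minimal sketch of Equations (4) and (5) is given below (illustrative only; the attribute names and helper functions are assumptions, not the authors' implementation):

from collections import defaultdict

def ind_partition(universe, attrs):
    classes = defaultdict(list)
    for idx, obj in enumerate(universe):
        classes[tuple(obj[a] for a in attrs)].append(idx)
    return [set(c) for c in classes.values()]

def gamma(universe, C, D):
    decision_classes = ind_partition(universe, D)      # the partition U/D
    pos = set()
    for eq in ind_partition(universe, C):              # classes of IND(C)
        if any(eq <= X for X in decision_classes):     # lower-approximation test
            pos |= eq                                  # Equation (4): positive region
    return len(pos) / len(universe)                    # Equation (5): gamma(C, D)

U = [
    {"temp": "high", "wind": "weak",   "play": "no"},
    {"temp": "high", "wind": "weak",   "play": "yes"},  # conflicts with object 0
    {"temp": "low",  "wind": "weak",   "play": "yes"},
    {"temp": "low",  "wind": "strong", "play": "no"},
]
print(gamma(U, ["temp", "wind"], ["play"]))             # 0.5: objects 0 and 1 are inconsistent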

2.10. Reduction of Attributes

As explained in Section 2.2, it is possible that two or more objects are indiscernible because they do not differ in their attribute values. In this case, the information table can be reduced so that only one element of each equivalence class is required to represent the whole class. To perform this reduction, some additional notions are needed.
Let $T = (U, A)$ be an information system, let $P \subseteq A$, and let $a \in P$. The attribute $a$ is said to be dispensable in $P$ if $IND_T(P) = IND_T(P - \{a\})$; otherwise, $a$ is indispensable in $P$. A set $P$ is called independent if all of its attributes are indispensable.
Any subset $P'$ of $P$ is called a reduct of $P$ if $P'$ is independent and $IND_T(P') = IND_T(P)$.
Therefore, a reduct is a minimal set of attributes that preserves the classification results obtained when using all attributes. In other words, the attributes not in a reduct are redundant and have no effect on the classification.
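The dispensability test can be sketched as below (a simple greedy search over the small, hypothetical table used in the earlier sketches; the reduct computation in this work actually uses the discernibility matrix described in the next sub-section):

from collections import defaultdict

def ind_partition(universe, attrs):
    classes = defaultdict(list)
    for idx, obj in enumerate(universe):
        classes[tuple(obj[a] for a in attrs)].append(idx)
    return sorted(sorted(c) for c in classes.values())

def is_dispensable(universe, P, a):
    # a is dispensable in P when IND(P) = IND(P - {a})
    return ind_partition(universe, P) == ind_partition(universe, [x for x in P if x != a])

def greedy_reduct(universe, P):
    # Drop dispensable attributes one by one; the result preserves IND(P).
    current = list(P)
    for a in list(P):
        if len(current) > 1 and is_dispensable(universe, current, a):
            current.remove(a)
    return current

U = [
    {"temp": "high", "wind": "weak",   "humid": "high"},
    {"temp": "low",  "wind": "weak",   "humid": "high"},
    {"temp": "low",  "wind": "strong", "humid": "low"},
]
print(greedy_reduct(U, ["temp", "wind", "humid"]))   # ['temp', 'humid']: "wind" is dispensable here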

2.11. Discernibility Matrix and Function

Reducts have several properties, one of which is the relation shown in Equation (6). Let $P$ be a subset of $A$. The core of $P$ is the set of all indispensable attributes of $P$ [10]:
$Core(P) = \bigcap Red(P)$, (6)
where $Red(P)$ is the set of all reducts of $P$.
In order to calculate reducts and the core easily, the discernibility matrix can be used [10], which is defined as follows.
Let $T = (U, A)$ be an information system with $n$ objects. The discernibility matrix of $T$ is a symmetric $n \times n$ matrix with entries $c_{ij}$, as given in Equation (7):
$c_{ij} = \{a \in A : a(x_i) \neq a(x_j)\}$ for $i, j = 1, \ldots, n$ (7)
A discernibility function $f_T$ for an information system $T$ is a Boolean function of $m$ Boolean variables $a_1^*, \ldots, a_m^*$ (corresponding to the attributes $a_1, \ldots, a_m$), defined as in Equation (8):
$f_T(a_1^*, \ldots, a_m^*) = \bigwedge \{\bigvee c_{ij}^* : 1 \le j < i \le n,\ c_{ij} \neq \emptyset\}$, (8)
where $c_{ij}^* = \{a^* : a \in c_{ij}\}$.
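As an illustration of Equation (7) and the discernibility function, the following minimal sketch builds the matrix for a small hypothetical table and prints $f_T$ as a conjunction of disjunctions; simplifying that Boolean expression would yield the reducts:

from itertools import combinations

def discernibility_matrix(universe, attrs):
    # c_ij = set of attributes on which objects i and j differ (Equation (7)).
    matrix = {}
    for i, j in combinations(range(len(universe)), 2):
        matrix[(i, j)] = {a for a in attrs if universe[i][a] != universe[j][a]}
    return matrix

U = [
    {"temp": "high", "wind": "weak",   "humid": "high"},
    {"temp": "low",  "wind": "weak",   "humid": "high"},
    {"temp": "low",  "wind": "strong", "humid": "low"},
]
m = discernibility_matrix(U, ["temp", "wind", "humid"])
# Discernibility function f_T: AND over the non-empty entries of the OR of their attributes.
clauses = [" v ".join(sorted(c)) for c in m.values() if c]
print(" ^ ".join(f"({cl})" for cl in clauses))
# (temp) ^ (humid v temp v wind) ^ (humid v wind)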

3. Discretization

Discretization is one of the data preprocessing activities performed in the preparation stage, alongside data normalization, data cleaning, data integration, and so on. Often, data preprocessing needs to be performed to improve the efficiency of subsequent processes [11]. It is also needed to meet the requirements of the method or algorithm to be executed. The rough-set-theory-based method is one of the methods that requires data in discrete form. Therefore, if the dataset to be processed is continuous, then a discretization process is required.
There are several well-known discretization techniques that can be categorized based on how the discretization process is carried out. When it is carried out by referring to the labels that have been provided in the dataset, then it is called supervised discretization, while, if the label is not available, then it is categorized as unsupervised discretization [11].
Discretization by binning is one of the discretization techniques based on a specified number of bins. If the dataset has a label, then the number of bins for discretization can be determined for as many as the number of classes on the label, while, for a dataset with no label, an unsupervised technique, such as clustering, should be applied.
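For reference, unsupervised equal-frequency binning (one of the baseline methods compared in Section 5) can be sketched in a few lines, assuming pandas is available; the values and bin count below are illustrative:

import pandas as pd

values = pd.Series([5.1, 4.9, 6.3, 5.8, 7.0, 6.7, 5.0, 6.1])
# qcut places roughly the same number of values in each bin (equal frequency).
bins = pd.qcut(values, q=3, labels=False, duplicates="drop")
print(bins.tolist())   # bin index (0, 1, or 2) assigned to each value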

3.1. k-Means

Cluster analysis or clustering is one of the most popular methods for discretization. This technique can be used to discretize a numeric attribute, A , by dividing the values of A into several clusters [11]. This experiment applies the k-means method to discretize the numeric attributes of the dataset.
k-means is a centroid-based method. Assume $A$ is one of the numeric attributes of a dataset $D$. Partitioning can be performed on the attribute $A$ into $k$ clusters, $C_1, C_2, \ldots, C_k$, where $C_i \subset A$ and $C_i \cap C_j = \emptyset$ for $1 \le i, j \le k$, $i \neq j$. In k-means, the centroid $c_i$ of a cluster $C_i$ is the center point defined as the mean of the points assigned to the cluster. The difference between a point $p_n$ and its centroid $c_i$ is measured using a distance function $dist(p_n, c_i)$. The most popular choice is the Euclidean distance, as shown in Equation (9):
$dist(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ (9)
Because k-means is an unsupervised technique, the value of $k$ is not known in advance; it is usually found through iterative trial and error. To automate this trial-and-error process, an optimization technique should be applied. Many optimization techniques are available, but this experiment employs a genetic algorithm (GA) to find the optimum value of $k$.
In this experiment, $k$ is optimal if its value is as small as possible without losing the quality of the information in the dataset, as measured by the $\gamma(C, D)$ function shown in Equation (5).
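A minimal sketch of k-means-based discretization of a single numeric attribute, assuming scikit-learn is available (the parameter values are illustrative; in the proposed method the value of k is chosen by the GA):

import numpy as np
from sklearn.cluster import KMeans

def kmeans_discretize(values, k, seed=0):
    # Cluster the values of one numeric attribute into k bins and return the bin label per object.
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(x)
    return km.labels_

attr_values = [5.1, 4.9, 6.3, 5.8, 7.0, 6.7, 5.0, 6.1]
print(kmeans_discretize(attr_values, k=3))   # array of bin labels in {0, 1, 2}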

3.2. Genetic Algorithm

Genetic algorithm (GA) is an algorithm inspired by biological phenomena, namely the process of genetic evolution from the creation of a population that consists of some individuals who later experience genetic evolution. There are three genetic processes that occur, i.e., selection, crossover, and mutation, to obtain new individuals who are expected to be stronger or fitter during the next cycle selection process [12]. Figure 2 shows GA’s operational processes. Figure 3 illustrates the crossover process.

4. Proposed Method

The concept of the proposed discretization method in this experiment is the integration of RST, k-means, and GA. RST is used to measure the dependency coefficient, which defines the approximation quality, so that the dataset transformed by the discretization process does not lose information quality relative to the original dataset. To measure the approximation quality, the RST dependency coefficient $\gamma(C, D)$, as shown in Equation (5), is applied.
Further, k-means is applied to cluster the continuous-valued attributes. The result is a number of bins, or clusters, for each attribute, which are then transformed into discrete values. The GA is used to minimize the number of bins of every attribute while meeting the constraint that the value of $\gamma(C, D)$ is equal to 1 or to any targeted value. Minimizing the number of bins is expected to yield the smallest number of RST rules, which makes the classification process more efficient. The following algorithm of the proposed method is developed to find the most optimal discretization scenario of an Information System.
As shown on the pseudo-code, the algorithm of the proposed method begins with reading the training dataset to construct a table called T = U , A , V , f , where U is a set of objects, A is a set of the attributes, V is a set of values of the attributes, and f is a function of the relationship between the object and the attributes. This table is then transformed into a decision table, called D T = U , C , D , v , f , where C is the condition attribute set and D is the decision attribute set that satisfies C D = A .
After the dataset is loaded, the process continues with the setting of the GA process, starting from the number of chromosomes, which is associated with the number of attributes, and followed by the number of genes for each chromosome, which is associated with the number of centroids or bins of the respective attribute. After the setting of the GA parameters is completed, it continues by executing the GA processes based on Figure 2 and Figure 3. The objective function of the GA is to minimize the number of bins for each attribute with a certain value of $\gamma(C, D)$ as the constraint.
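To make the objective function concrete, a sketch of a possible fitness function is shown below. It assumes the gamma() and kmeans_discretize() helpers from the earlier sketches and is illustrative only, not the authors' code; the candidate solution is a list of bin counts, one per condition attribute, decoded from the chromosome:

PENALTY = 10**9   # the "very big value" assigned when the constraint is violated

def fitness(bin_counts, data, condition_attrs, decision_attr, constraint=1.0):
    # Discretize every condition attribute with its own candidate number of bins.
    discretized = [
        kmeans_discretize([row[attr] for row in data], k)
        for attr, k in zip(condition_attrs, bin_counts)
    ]
    # Rebuild a discrete decision table and measure gamma(C, D) on it (Equation (5)).
    table = [
        {**{a: int(discretized[j][i]) for j, a in enumerate(condition_attrs)},
         decision_attr: row[decision_attr]}
        for i, row in enumerate(data)
    ]
    if gamma(table, condition_attrs, [decision_attr]) >= constraint:
        return sum(bin_counts)   # minimize the total number of bins
    return PENALTY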
The end of the GA iteration contains the process to convert the chromosome values into the attribute bin values. When the maximum iteration is achieved, then the bin values of each attribute are considered optimum and then are used to discretize the condition of attribute values.

5. Experimental Setup

In this section, the test results of the proposed algorithm are compared with two popular discretization algorithms, namely equal-frequency, which is processed using unsupervised learning, and entropy-based, which uses supervised learning.
Four datasets downloaded from the UCI data repository with details of the properties owned by each dataset shown in Table 1 are selected. Those datasets are:
1. iris;
2. ecoli;
3. wine;
4. banknote.
The proposed algorithm was tested on four datasets and compared with two discretization methods, namely equal-frequency and entropy-based. Figure 4 shows the flow of the research.
In the initial step, a k-fold mechanism with k = 5 is applied to each dataset so that a ratio of 80:20 is obtained, where 80% of the data are used for the training and 20% for testing. The k-fold approach is applied to ensure that every record in the dataset becomes either a training or test dataset. With the application of k-fold, it is expected that the results of testing the algorithm can be more reliable. Each fold of each dataset is then discretized using three tested methods, namely: equal-frequency (EQFREQ), entropy-based (ENTROPY), and the proposed method, which is based on genetic algorithm and rough set theory (GARST).
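A minimal sketch of this 5-fold protocol, assuming scikit-learn and a placeholder feature matrix (the actual datasets are loaded from UCI):

import numpy as np
from sklearn.model_selection import KFold

X = np.random.rand(150, 4)   # placeholder for a UCI dataset (e.g., the iris features)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Each fold yields an 80:20 split; every record appears in a test set exactly once.
    print(f"Fold-{fold}: {len(train_idx)} training rows, {len(test_idx)} test rows")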
Discretization with the EQFREQ and ENTROPY methods was conducted using the Rosetta software ver. 1.4.41. Meanwhile, the proposed method was implemented in Python 3.8 based on Algorithm 1.
Algorithm 1. Pseudo-code of the proposed method.
Input:  A dataset in the form of table T = (U, A, V, f)
Output: Optimum number of bins for each condition attribute, in the form of the discretized table DiscT = (U, C, D, Vc_disc, f)
Create decision table DT = (U, C, D, V, f) ← convert_to_DT(T), where C ∪ D = A;
Introduce integer variable maxK ← 10 or any integer value;
Introduce scalar and vector variables genBit, numChrom, popSize, maxGeneration, constraintGA,
Chromosome, Individu, Fitness, Parents, Offsprings, NewPop for the GA processes;
genBit ← integer_to_binary(maxK); numChrom ← cardinality(C);
popSize ← 30 or any integer value;
maxGeneration ← 50 or any integer value;
for indv ← 1 to popSize do
    for chr ← 1 to numChrom do
        Chromosome[chr] ← binary_random(genBit);
    end
    Individu[indv] ← [Chromosome[1], …, Chromosome[numChrom]];
end
constraintGA ← 0.8 or any real value between 0.0 and 1.0;
Introduce vector variables Bins, Discr_V, γ_CD for the RST processes;
for generation ← 1 to maxGeneration do
    for indv ← 1 to popSize do
        for chr ← 1 to numChrom do
            Bins[chr] ← KMeans(C[chr], binary_to_integer(Individu[indv][chr]));
        end
        for c ← 1 to cardinality(C) do
            Discr_V[c] ← discretize(V[c], Bins[c]);
        end
        γ_CD[indv] ← calc_γ_CD(Discr_V, V_D) by referring to Equation (5);
        if γ_CD[indv] ≥ constraintGA then
            Fitness[indv] ← sum_cardinality(Bins[1], …, Bins[numChrom]);
        else
            Fitness[indv] ← very_big_value;
        end
    end
    Parents ← select_the_most_fit(Individu[1], …, Individu[popSize]); an Individu
        with a smaller Fitness value has a greater chance of being selected as a parent;
    Offsprings ← crossover(Parents);
    NewPop ← mutate(Offsprings);
    Run transform(NewPop) to create a new list of Individu in the form
        [Individu[1], …, Individu[popSize]];
end
return DiscT = (U, C, D, Vc_disc, f)
After the 5-fold datasets have been discretized, each fold is reduced and rule generation is then performed using the Rosetta software. The reduct process is carried out using the RST method based on the discernibility matrix, and rule generation applies Boolean algebra to the constructed discernibility matrix, as described in Section 2. This process is repeated five times for each dataset due to the application of 5-fold cross-validation.
The final step of this experiment is to compare the performance of the three methods. The measuring instruments used in the experiment and their explanations are listed in Table 2.
To determine whether there is a difference in performance between the three tested methods, the Friedman statistical test was applied in this experiment. The Friedman test is a statistical tool used to determine whether there is a statistically significant difference in the average values of three or more groups [13]. If the p-value of the Friedman test is less than 0.05, then there is a significant difference. A post hoc test was used as a continuation of the Friedman test to determine which group differs significantly from the others.
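For illustration, the Friedman test can be run with SciPy as sketched below; the three lists hold only the iris accuracies from Table 5, whereas the test reported in Section 6 uses all 20 dataset-fold accuracy values per method. A post hoc comparison can then be performed, e.g., with pairwise tests and a multiple-comparison correction:

from scipy.stats import friedmanchisquare

# Accuracy per fold for the iris dataset (see Table 5); one list per discretization method.
entropy_acc = [96.67, 93.33, 96.67, 93.33, 93.33]
eqfreq_acc  = [53.33, 93.33, 100.00, 83.33, 83.33]
garst_acc   = [100.00, 96.67, 90.00, 93.33, 96.67]

stat, p_value = friedmanchisquare(entropy_acc, eqfreq_acc, garst_acc)
print(f"Friedman chi-square = {stat:.3f}, p-value = {p_value:.6f}")
# p < 0.05 would indicate that at least one method's accuracy distribution differs significantly.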

6. Results and Discussion

After the entire process is completed, the last step is to review the performance of each discretization method. Table 3 shows the performance comparison of the discretization methods of the equal-frequency (EQFREQ), entropy-based (ENTROPY), and genetic algorithm and rough set theory (GARST) proposed in this paper.
Compared to the performance of the EQFREQ and ENTROPY discretization methods, it is confirmed that the proposed method (GARST) has a better performance, showing the smallest number of the generated bins and rules across three datasets, namely iris, wine, and banknote. The ENTROPY method indicates a better performance for the ecoli dataset, demonstrated by the smallest number of bins; however, the GARST method is still superior because it succeeded in generating the smallest number of rules in all the datasets, including ecoli.
Table 4 shows the test results that are presented in statistical measures, namely average and standard deviation. From this table, it can be seen that the GARST method has the highest average accuracy, precision, and recall, and has competitive values for standard deviation.
Figure 5 describes the distribution of the accuracy values for each test. From this figure, it can be seen that the GARST method produces consistent accuracy values, although it is not always superior. Thus, it can be concluded that the GARST method is generally proven to have a superior performance in terms of accuracy and reliability, as measured by precision and recall, compared to the other two methods.
According to non-parametric statistical testing, namely the Friedman test, as shown in Table 5, the p-value obtained is smaller than 0.05, so it can be concluded that there is a significant difference between the three methods. Meanwhile, from the post hoc test results, as shown in Table 6, the p-values of ENTROPY vs. GARST and EQFREQ vs. GARST are all less than 0.05, so it can be concluded that the GARST method is a method that has a significant difference compared to the other two methods.

7. Conclusions

A method to improve the accuracy and reliability of the RST-based classifier model has been proposed by involving the RST instruments at the beginning of the discretization process. This method uses a k-means-based discretization method optimized with a genetic algorithm (GA). As a result, the method was proven not to sacrifice the degree of information quality from the dataset and the performance was quite competitive compared to the popular state-of-the-art methods, namely equal-frequency and entropy-based. Moreover, the proposed discretization method based on k-means optimized by GA and using one of the rough set theory instruments has proven to be effective for use in the RST classifier.
The proposed discretization method was tested on four datasets with different profiles under a 5-fold scenario, and the results were verified using the Friedman and post hoc tests; therefore, it can be concluded that the proposed method should be effective for discretizing a variety of datasets, especially in RST-based classification cases. The disadvantage of the proposed method is its unstable runtime during the discretization process, especially in the optimization of the number of bins, which is due to the heuristic nature of the GA.

Author Contributions

Writing—review & editing, T.H.D., N.A.S. and T.B.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All the datasets used in this study were taken from the UCI public data repository.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Garcia, S.; Luengo, J.; Sáez, J.A.; Lopez, V.; Herrera, F. A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Trans. Knowl. Data Eng. 2013, 25, 29–37.
  2. Dash, R.; Paramguru, R.L.; Dash, R. Comparative Analysis of Supervised and Unsupervised Discretization Techniques. Int. J. Adv. Sci. Technol. 2011, 2, 29–37.
  3. Pawlak, Z. Rough Sets. Int. J. Comput. Inf. Sci. 1982, 11, 341–356.
  4. Pawlak, Z. Rough Sets, Rough Relations and Rough Functions. Fundam. Inform. 1996, 27, 103–108.
  5. Pawlak, Z. Information Systems Theoretical Foundations. Inf. Syst. 1981, 6, 205–218.
  6. Kalaivani, R.; Suresh, M.V.; Srinivasan, A. Study of Rough Sets Theory and Its Application Over Various Fields. J. Appl. Sci. Eng. 2017, 3, 447–455.
  7. Pawlak, Z.; Skowron, A. Rudiments of Rough Sets. Inf. Sci. 2007, 177, 3–27.
  8. Czerniak, J. Evolutionary Approach to Data Discretization for Rough Sets Theory. Fundam. Inform. 2009, 92, 43–61.
  9. Pawlak, Z.; Skowron, A. Rough Membership Functions: A Tool for Reasoning with Uncertainty. Algebraic Methods Log. Comput. Sci. 1993, 28, 135–150.
  10. Suraj, Z. An Introduction to Rough Set Theory and Its Applications: A Tutorial. In Proceedings of the ICENCO'2004, Cairo, Egypt, 27–30 December 2004.
  11. Han, J.; Kamber, M.; Pei, J. Data Mining, 3rd ed.; Morgan Kaufmann Publishers: Burlington, MA, USA; Elsevier Inc.: Amsterdam, The Netherlands, 2012.
  12. Vincent, P.; Cunha Sergio, G.; Jang, J.; Kang, I.M.; Park, J.; Kim, H.; Lee, M.; Bae, J.H. Application of Genetic Algorithm for More Efficient Multi-Layer Thickness Optimization in Solar Cells. Energies 2020, 13, 1726.
  13. Riffenburgh, R.H.; Gillen, D.L. Statistics in Medicine, 4th ed.; Academic Press: Cambridge, MA, USA; Elsevier Inc.: Amsterdam, The Netherlands, 2020.
Figure 1. The illustration of rough set. The universe (U) is the union of all blocks. If the set (X) is represented by the red shape, then the lower approximation is the union of all green blocks, the upper approximation is the union of all green and yellow blocks, and the boundary region is the union of yellow blocks, while the union of all white blocks is called outside approximation.
Figure 2. Genetic algorithm process flow.
Figure 3. Illustration of some crossover-type processes.
Figure 4. Flow of the research.
Figure 5. Plots showing the distribution of accuracy values.
Table 1. Descriptions of the tested datasets in this research.

Properties                | iris | ecoli | wine | banknote
# of examples             | 150  | 271   | 178  | 1370
# of classes              | 3    | 8     | 3    | 2
# of condition attributes | 4    | 7     | 13   | 4
Table 2. Metrics to measure the performance.

Measurement Unit | Objective | Remarks
# of bins  | An integer value that indicates the number of bins resulting from discretization. | The smaller this value, the better the performance of the discretization method, because the dataset resulting from the discretization becomes simpler.
# of rules | An integer value that indicates the number of rules generated by RST after the reduct process. | The smaller this number, the better the performance of the discretization method, because a smaller number of rules is easier to understand and more transparent.
Accuracy   | Provides a measure of how many samples were correctly predicted by a classifier compared to the total number of samples. | This metric is applied to measure the overall performance.
Precision  | Provides a measurement of how many samples are correctly predicted for a particular class; i.e., the ratio of TP of a given class to the number of samples predicted as this class (the total of TP and FP). | This metric is applied to measure the class-by-class performance of a method.
Recall     | Provides a measurement of how many samples are correctly predicted in a given class. | This metric also measures the class-by-class performance of a model.
Table 3. Number of bins and rules generated by each method.

Method  | Fold    | iris bins | iris rules | ecoli bins | ecoli rules | wine bins | wine rules | banknote bins | banknote rules
ENTROPY | Fold-1  | 20     | 49      | 42     | 104     | 143    | 3952      | 501     | 2941
ENTROPY | Fold-2  | 20     | 60      | 36     | 107     | 152    | 4874      | 296     | 341
ENTROPY | Fold-3  | 21     | 83      | 32     | 137     | 155    | 6892      | 454     | 1364
ENTROPY | Fold-4  | 21     | 87      | 45     | 116     | 164    | 8366      | 564     | 2760
ENTROPY | Fold-5  | 21     | 102     | 43     | 118     | 167    | 9394      | 498     | 1423
ENTROPY | Average | 20.6   | 76.2    | 39.6   | 116.4   | 156.2  | 6695.6    | 462.6   | 1765.8
ENTROPY | Max     | 21     | 102     | 45     | 137     | 167    | 9394      | 564     | 2941
ENTROPY | Min     | 20     | 49      | 32     | 104     | 143    | 3952      | 296     | 341
ENTROPY | StdDev  | 0.4899 | 19.1353 | 4.8415 | 11.5689 | 8.6116 | 2047.2293 | 90.3761 | 967.3199
EQFREQ  | Fold-1  | 20     | 186     | 27     | 401     | 65     | 50473     | 20      | 158
EQFREQ  | Fold-2  | 20     | 192     | 27     | 218     | 65     | 48770     | 20      | 149
EQFREQ  | Fold-3  | 20     | 220     | 27     | 214     | 65     | 49921     | 20      | 154
EQFREQ  | Fold-4  | 20     | 135     | 27     | 215     | 65     | 49929     | 20      | 157
EQFREQ  | Fold-5  | 20     | 186     | 27     | 211     | 65     | 51401     | 20      | 151
EQFREQ  | Average | 20     | 183.8   | 27     | 251.8   | 65     | 50098.8   | 20      | 153.8
EQFREQ  | Max     | 20     | 220     | 27     | 401     | 65     | 51401     | 20      | 158
EQFREQ  | Min     | 20     | 135     | 27     | 211     | 65     | 48770     | 20      | 149
EQFREQ  | StdDev  | 0.0000 | 27.4547 | 0.0000 | 74.6335 | 0.0000 | 855.7926  | 0.0000  | 3.4293
GARST   | Fold-1  | 17     | 39      | 31     | 164     | 46     | 2527      | 17      | 55
GARST   | Fold-2  | 11     | 54      | 29     | 277     | 23     | 2995      | 13      | 36
GARST   | Fold-3  | 13     | 21      | 28     | 154     | 43     | 5212      | 13      | 73
GARST   | Fold-4  | 16     | 25      | 29     | 136     | 45     | 10319     | 15      | 88
GARST   | Fold-5  | 11     | 53      | 30     | 134     | 46     | 5734      | 13      | 65
GARST   | Average | 13.6   | 38.4    | 29.4   | 173     | 40.6   | 5357.4    | 14.2    | 63.4
GARST   | Max     | 17     | 54      | 31     | 277     | 46     | 10319     | 17      | 88
GARST   | Min     | 11     | 21      | 28     | 134     | 23     | 2527      | 13      | 36
GARST   | StdDev  | 2.4980 | 13.7055 | 1.0198 | 53.1940 | 8.8679 | 2770.2903 | 1.6000  | 17.4425
Table 4. The accuracy, precision, and recall of each method. For each dataset, the values are Acc (%), Avg Prec, and Avg Recall.

Method  | Fold       | iris              | ecoli             | wine              | banknote
ENTROPY | Fold-1     | 96.67, 0.97, 0.95 | 29.85, 0.18, 0.21 | 38.89, 0.42, 0.41 | 74.40, 0.74, 0.74
ENTROPY | Fold-2     | 93.33, 0.93, 0.93 | 34.33, 0.27, 0.33 | 50.00, 0.50, 0.51 | 99.54, 0.81, 0.81
ENTROPY | Fold-3     | 96.67, 0.97, 0.97 | 26.87, 0.42, 0.40 | 41.67, 0.38, 0.40 | 99.89, 0.84, 0.84
ENTROPY | Fold-4     | 93.33, 0.94, 0.94 | 20.90, 0.24, 0.19 | 50.00, 0.44, 0.47 | 99.88, 0.65, 0.65
ENTROPY | Fold-5     | 93.33, 0.95, 0.95 | 25.37, 0.23, 0.20 | 49.65, 0.51, 0.49 | 99.96, 0.83, 0.82
ENTROPY | Global Avg | 94.67, 0.95, 0.95 | 27.46, 0.27, 0.27 | 46.04, 0.45, 0.46 | 94.73, 0.77, 0.77
ENTROPY | Max        | 96.67, 0.97, 0.97 | 34.33, 0.42, 0.40 | 50.00, 0.51, 0.51 | 99.96, 0.84, 0.84
ENTROPY | Min        | 93.33, 0.93, 0.93 | 20.90, 0.18, 0.19 | 38.89, 0.38, 0.40 | 74.40, 0.65, 0.65
ENTROPY | StdDev     | 1.63, 0.02, 0.01  | 4.49, 0.08, 0.08  | 4.79, 0.05, 0.04  | 10.17, 0.07, 0.07
EQFREQ  | Fold-1     | 53.33, 0.62, 0.54 | 50.75, 0.40, 0.34 | 52.78, 0.54, 0.53 | 91.20, 0.74, 0.74
EQFREQ  | Fold-2     | 93.33, 0.93, 0.94 | 35.82, 0.28, 0.19 | 52.78, 0.59, 0.46 | 97.58, 0.81, 0.81
EQFREQ  | Fold-3     | 100.00, 1.00, 1.00 | 29.85, 0.35, 0.19 | 41.67, 0.27, 0.42 | 99.94, 0.90, 0.90
EQFREQ  | Fold-4     | 83.33, 0.84, 0.85 | 25.37, 0.31, 0.13 | 50.00, 0.66, 0.53 | 99.98, 0.93, 0.93
EQFREQ  | Fold-5     | 83.33, 0.81, 0.83 | 32.84, 0.37, 0.23 | 47.57, 0.65, 0.53 | 99.97, 0.87, 0.87
EQFREQ  | Global Avg | 82.67, 0.84, 0.83 | 34.93, 0.34, 0.22 | 48.96, 0.54, 0.49 | 97.73, 0.85, 0.85
EQFREQ  | Max        | 100.00, 1.00, 1.00 | 50.75, 0.40, 0.34 | 52.78, 0.66, 0.53 | 99.98, 0.93, 0.93
EQFREQ  | Min        | 53.33, 0.62, 0.54 | 25.37, 0.28, 0.13 | 41.67, 0.27, 0.42 | 91.20, 0.74, 0.74
EQFREQ  | StdDev     | 15.97, 0.13, 0.16 | 8.63, 0.04, 0.07  | 4.13, 0.14, 0.05  | 3.39, 0.07, 0.07
GARST   | Fold-1     | 100.00, 1.00, 1.00 | 52.24, 0.46, 0.35 | 83.33, 0.84, 0.84 | 96.80, 0.97, 0.97
GARST   | Fold-2     | 96.67, 0.96, 0.97 | 49.25, 0.23, 0.25 | 88.89, 0.91, 0.88 | 99.86, 0.94, 0.94
GARST   | Fold-3     | 90.00, 0.90, 0.90 | 43.28, 0.40, 0.24 | 69.44, 0.69, 0.66 | 99.96, 0.93, 0.93
GARST   | Fold-4     | 93.33, 0.94, 0.94 | 55.22, 0.40, 0.32 | 66.67, 0.77, 0.69 | 99.99, 0.97, 0.98
GARST   | Fold-5     | 96.67, 0.97, 0.97 | 56.72, 0.42, 0.38 | 68.06, 0.76, 0.72 | 99.99, 0.94, 0.94
GARST   | Global Avg | 95.33, 0.95, 0.96 | 51.34, 0.38, 0.31 | 75.28, 0.79, 0.76 | 99.32, 0.95, 0.95
GARST   | Max        | 100.00, 1.00, 1.00 | 56.72, 0.46, 0.38 | 88.89, 0.91, 0.88 | 99.99, 0.97, 0.98
GARST   | Min        | 90.00, 0.90, 0.90 | 43.28, 0.23, 0.24 | 66.67, 0.69, 0.66 | 96.80, 0.93, 0.93
GARST   | StdDev     | 3.40, 0.03, 0.03  | 4.78, 0.08, 0.05  | 9.06, 0.07, 0.09  | 1.26, 0.02, 0.02
Table 5. The results of the Friedman test for the accuracy.

Dataset  | Fold   | ENTROPY | EQFREQ | GARST
iris     | Fold-1 | 96.67   | 53.33  | 100.00
iris     | Fold-2 | 93.33   | 93.33  | 96.67
iris     | Fold-3 | 96.67   | 100.00 | 90.00
iris     | Fold-4 | 93.33   | 83.33  | 93.33
iris     | Fold-5 | 93.33   | 83.33  | 96.67
ecoli    | Fold-1 | 29.85   | 50.75  | 52.24
ecoli    | Fold-2 | 34.33   | 35.82  | 49.25
ecoli    | Fold-3 | 26.87   | 29.85  | 43.28
ecoli    | Fold-4 | 20.90   | 25.37  | 55.22
ecoli    | Fold-5 | 25.37   | 32.84  | 56.72
wine     | Fold-1 | 38.89   | 52.78  | 83.33
wine     | Fold-2 | 50.00   | 52.78  | 88.89
wine     | Fold-3 | 41.67   | 41.67  | 69.44
wine     | Fold-4 | 50.00   | 50.00  | 66.67
wine     | Fold-5 | 49.65   | 47.57  | 68.06
banknote | Fold-1 | 74.40   | 91.20  | 96.80
banknote | Fold-2 | 99.54   | 97.58  | 99.86
banknote | Fold-3 | 99.89   | 99.94  | 99.96
banknote | Fold-4 | 99.88   | 99.98  | 99.99
banknote | Fold-5 | 99.96   | 99.97  | 99.99

Friedman test result: p-value = 0.000003224
Table 6. The results of the post hoc test (pairwise p-values).

Method  | ENTROPY | EQFREQ | GARST
ENTROPY | 1.000   | 0.556  | 0.001
EQFREQ  | 0.556   | 1.000  | 0.001
GARST   | 0.001   | 0.001  | 1.000
