Article

A New Rough Set Classifier for Numerical Data Based on Reflexive and Antisymmetric Relations

1
United Graduate School of Agricultural Science, Tokyo University of Agriculture and Technology, 3-21-1 Chuo, Ami 300-0393, Japan
2
Geological Survey of Japan, National Institute of Advanced Industrial Science and Technology, Tsukuba Central 7, Higashi 1-1-1, Tsukuba 305-8567, Japan
3
College of Agriculture, Ibaraki University, 3-21-1 Chuo, Ami 300-0393, Japan
*
Authors to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2022, 4(4), 1065-1087; https://doi.org/10.3390/make4040054
Submission received: 2 October 2022 / Revised: 12 November 2022 / Accepted: 16 November 2022 / Published: 18 November 2022
(This article belongs to the Section Data)

Abstract

The grade-added rough set (GRS) approach is an extension of the rough set theory proposed by Pawlak to deal with numerical data. However, the GRS has problems with overtraining, unclassified results, and unnatural results. In this study, we propose a new approach called the directional neighborhood rough set (DNRS) approach to solve the problems of the GRS. The information granules in the DNRS are based on reflexive and antisymmetric relations. Following these relations, new lower and upper approximations are defined. Based on these definitions, we developed a classifier with a three-step algorithm consisting of DN-lower approximation classification, DN-upper approximation classification, and exceptional processing. Three experiments were conducted using the University of California Irvine (UCI) machine learning datasets to demonstrate the effect of each step in the DNRS model, to show that it overcomes the problems of the GRS, and to compare its accuracy with existing classifiers. The results showed that when the number of dimensions is reduced and both the lower and upper approximation algorithms are used, the DNRS model is more effective than when the number of dimensions is large. Additionally, it was shown that the DNRS solves the problems of the GRS and that the DNRS model is as accurate as existing classifiers.

1. Introduction

Pawlak’s classical rough set theory, proposed in 1982, is a mathematical method that can deal with vagueness and uncertainty [1,2]. The theory represents an arbitrary set by a pair of approximation sets, a lower approximation set, and an upper approximation set, with an equivalence class serving as the minimal unit of knowledge, in which the elements of the entire sets are partitioned based on equivalence relations. Methods for extracting useful knowledge hidden in information tables using rough sets have been proposed [3,4,5]. Rough set theory has been widely used in machine learning, pattern recognition, data mining, knowledge discovery, bioinformatics, medicine, signal processing, image processing, robotics, social science, web engineering, and so on [6].
There are cases where the equivalence relation in the classical rough set is too strict to use for application [7]. Therefore, various extensions of the rough set theory have been proposed. The equivalence relations are relaxed to some nonequivalence relations, e.g., characteristic relation, tolerance relation and nonsymmetric similarity relation, and suitable approximations have been suggested based on each relation [7]. Additionally, there are studies related to the structural type generalizations of rough sets. Ciucci et al. [8] generalized the notion of partitioning the classical rough set theory into four types of theory/framework by giving up the pairwise disjoint property of information granules and/or the covering of the universe.
Of the various extensions mentioned above, this paper focuses on the extensions of rough set theory for the treatment of numerical data. Classical rough set theory can deal with nominal data in terms of both their condition and decision attributes. However, in many real-world problems, the condition attributes are numerical data and the decision attributes are categorical data. When the classical rough set theory is applied directly to numerical data, very few data have equivalence relations, and the classification results are almost entirely unclassified [9]. Several studies have been conducted to address this problem of treating numerical data using rough set theory. These studies can be grouped into three main categories, including discretization, feature/rule reduction, and classification.
Discretization is a method for separating numerical data into coarser interval data. Each discretized interval is a nominal value of the attribute [10], allowing numerical data to be treated as a classical rough set. Discretization schemes use rough set measures and properties, such as lower and upper approximations, class separability, etc. [11]. In recent years, a discretization method that combines k-means, genetic algorithms, and rough set theory has been proposed [12], and another method uses rough set theory as a discretization method for incomplete big data [13]. However, discretization causes information loss in two ways. One is that the data become coarser, and the other is that order relationships are lost [14].
Several rough set theories have been proposed in feature extraction and rule reduction to deal with selecting useful features when the dimensionality is high and the attributes are numerical data or a mixture of numerical and categorical data. One is fuzzy rough sets (FRSs) [15]. FRSs are a theory that combines fuzzy [16] and rough set theories [1]. The theory uses a fuzzy equivalence relation that allows the similarity of two arbitrary elements to be expressed in terms of both categorical and numeric attributes. Many studies on feature selection using FRSs have been published [17]. Recently, a noise-aware fuzzy rough set feature selection [18], local attribute reduction using fuzzy rough sets [19], a greedy algorithm for attribute reduction with fuzzy rough self-information measures [20], and so on, have been proposed. The second theory is neighborhood rough sets [21]. The granules of neighborhood rough sets are built based on distance functions and covering rather than partitions [22]. Many methods for feature extraction using neighborhood rough sets have also been developed. The latest research involves attribute reduction using weighted neighborhood probabilistic rough sets (WNPRSs) [23], feature selection using neighborhood self-information that can consider uncertainty measures in not only the lower approximation but also the upper approximation of the feature selection [24], attribute reductions based on pseudolabel neighborhood rough sets [25], and so on. Additionally, the fuzzy neighborhood multigranulation rough sets (FNMRS)-based feature selection approach, which is a combination of the fuzzy rough set and neighborhood rough sets, is available for heterogeneous datasets containing numerical and nominal values [26,27]. The third theory is tolerant rough sets [28]. Tolerant rough sets relaxed equivalence relations to similarity relations that satisfy only reflexive and symmetric relations. A feature selection method based on this theory has also been proposed [29].
Not many classifiers have been developed for numerical data, but there are a few. First, an FRS classifier has been proposed [30]. The generalization of the FRS classifier (GFRSC) developed in that study overcomes the FRS’s misclassification and sensitivity to perturbation by introducing a threshold in the membership function of the FRS. The GFRSC has a two-stage method consisting of attribute value reduction and rule induction from the reduced decision table. A classifier based on neighborhood rough sets is also available [31]. This method also consists of two steps, namely feature selection based on a neighborhood model and a neighborhood classifier (NEC). Kumar and Inbarani also developed the neighborhood rough set-based classification (NRSC) algorithm for medical diagnosis [32]. This method generates neighborhood rough set-based (NRS) lower approximation and boundary region rules based on neighborhood relations to determine the final decision rule. In addition, a classifier [10] that combines neighborhood rough and decision-theoretic rough sets is available. The neighborhood-based decision-theoretic rough set model (NDTRS) can include noisy values. In the study, two attribute reduction methods were proposed, and the NDTRS classifier was developed. A classifier using a tolerant rough set is also available. The study uses a similarity measure described by a distance function determined by a similarity threshold value optimized using a genetic algorithm (GA) [33]. In the first stage, the proposed classifier is classified using the lower approximation, and the unclassified data at the first stage are classified using the membership function obtained through the upper approximation. Grade-added rough sets (GRS) are also available. Mori and Takanashi proposed this theory for rule extraction from numerical data [34]. Later, Ishii et al. applied GRS to the problem of satellite image classification [9]. The accuracy of the GRS classifier is the same as that of the SVM or MLC, and its robustness is higher than that of the SVM and MLC.
However, the GRS has some problems with classification: the first is overtraining, the second is unclassified results, and the third is unnatural results. Since the GRS deals with numerical data, training data that have identical values for all attributes rarely exist and are therefore rarely deleted as contradictory data, resulting in overtraining, where 100% of the training data are learned. In the GRS, decision rule extraction is performed using greater-than-or-equal and less-than-or-equal relationships, but in regions of the feature space where different classes are mixed, decision rules can only be obtained when the relationship with the training data is both greater-than-or-equal and less-than-or-equal, that is, only when the value equals that of the training data. If a value deviates from the training data even slightly, the data become unclassified. In addition, since the GRS only specifies decision rule extraction and does not explicitly define the information granule, the lower approximation, or the upper approximation, unnatural results may be obtained.
To solve these problems, we attempt to develop a new approach that can deal with the problems of the GRS within the framework of rough set theory. To achieve this objective, we introduce a new information granule that satisfies only the reflexive and antisymmetric relations and define the DN-lower and DN-upper approximations based on it. Among the classification studies presented above, there is no classifier based on rough sets that handles numerical data using only these two relations. Based on these definitions, we develop a classifier with a three-step algorithm, i.e., DN-lower approximation classification, DN-upper approximation classification, and exceptional processing. Finally, three experiments are conducted using the University of California Irvine (UCI) machine learning datasets to demonstrate the effect of each step in the DNRS model, to show that it overcomes the problems of the GRS, and to show that it gives results as accurate as those of existing classifiers.

2. Directional Neighborhood Rough Set Approach

A new approach, the directional neighborhood rough set (DNRS) approach, is defined in this section.

2.1. Decision Table [35]

Let $DT = (U, A, V, \rho)$ be a decision table, where $U$ is a nonempty finite set of $m$ objects $\{ x_1, x_2, \ldots, x_m \}$; $A = \{ a_1, a_2, \ldots, a_n \}$ is a nonempty finite set of $n$ attributes; and $V = \bigcup_{a \in A} V_a$, where $V_a$ is the domain of attribute $a \in A$. $A = C \cup D$ consists of a condition attribute set $C$ and a decision attribute set $D$. $\rho: U \times A \to V$ is an information function that allocates the attribute value $\rho(x, a) \in V$ to an object $x \in U$ and an attribute $a \in A$.

2.2. Grade and Difference in Grade

The type of the condition attribute value is numerical, and that of the decision attribute value is nominal in the decision table for this approach. Condition attribute values are defined in the same way as membership values are in fuzzy sets (see Mori and Takanashi [34]). Let the attribute value $\rho(x_i, a)$ of the attribute $a \in C$ of object $x_i$ represent the degree of belonging to that attribute, and define the grade $g(x_i, a)$ as follows:
$$g(x_i, a) = \frac{\rho(x_i, a) - \min_{x \in U} \rho(x, a)}{\max_{x \in U} \rho(x, a) - \min_{x \in U} \rho(x, a)}$$
where $g(x_i, a) \in [0, 1]$. We use the attribute values converted to grades throughout the study.
The difference in grade is defined as follows:
$$diff(x, y, a) = g(x, a) - g(y, a)$$
Additionally, the $n$-dimensional vectors of the grades and of the differences in grade can be represented as follows:
$$\mathbf{g}(x) = \left( g(x, a_1), g(x, a_2), \ldots, g(x, a_n) \right)$$
$$\mathbf{diff}(x, y) = \left( diff(x, y, a_1), diff(x, y, a_2), \ldots, diff(x, y, a_n) \right)$$
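As a concrete illustration, the following minimal Python sketch (not the authors' implementation) computes the grades of Equation (1) by column-wise min-max scaling and the grade differences of Equation (2) for a small array of raw condition-attribute values.

```python
import numpy as np

def to_grades(X):
    """Convert raw condition-attribute values to grades g(x, a) in [0, 1] (Equation (1))."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)   # guard against constant attributes
    return (X - mins) / span

def grade_diff(gx, gy):
    """diff(x, y, a) = g(x, a) - g(y, a), element-wise over the attributes (Equation (2))."""
    return gx - gy

G = to_grades([[5.1, 3.5], [4.9, 3.0], [7.0, 3.2]])
print(G)
print(grade_diff(G[0], G[1]))
```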

2.3. Intersection of Half-Space

In preparation for defining a new information granule in Section 2.5, the intersection of half-space is defined in this section.
The $i$-th hyperplane in the real $n$-space $\mathbb{R}^n$ is defined by the following equation, using the grade vectors $\mathbf{g}(x), \mathbf{g}(y) \in \mathbb{R}^n$ of $x, y \in U$:
$$H_i\left( \varepsilon_i, \mathbf{g}(x) \right) = \left\{ \mathbf{g}(y) \in \mathbb{R}^n \mid \varepsilon_i \cdot \mathbf{g}(y) = \varepsilon_i \cdot \mathbf{g}(x) \right\}$$
Here, the elements of the vector $\varepsilon_i$ in the real $n$-space are assumed to satisfy the following condition:
$$\varepsilon_{ij} = \begin{cases} 1 & (i = j) \\ 0 & (i \neq j) \end{cases}$$
where $i$ and $j$ are subscripts indicating the dimension.
The real $n$-space can be divided into $2^n$ regions using the $n$ hyperplanes defined in Equation (5). To define such regions, the upper half-space and lower half-space created by each hyperplane are defined as follows:
$$H_i^+\left( \varepsilon_i, \mathbf{g}(x) \right) = \left\{ \mathbf{g}(y) \in \mathbb{R}^n \mid \varepsilon_i \cdot \mathbf{g}(y) \geq \varepsilon_i \cdot \mathbf{g}(x) \right\} = \left\{ \mathbf{g}(y) \in \mathbb{R}^n \mid \varepsilon_i \cdot \mathbf{diff}(y, x) \geq 0 \right\}$$
$$H_i^-\left( \varepsilon_i, \mathbf{g}(x) \right) = \left\{ \mathbf{g}(y) \in \mathbb{R}^n \mid \varepsilon_i \cdot \mathbf{g}(y) \leq \varepsilon_i \cdot \mathbf{g}(x) \right\} = \left\{ \mathbf{g}(y) \in \mathbb{R}^n \mid \varepsilon_i \cdot \mathbf{diff}(y, x) \leq 0 \right\}$$
When the number of attributes of $B \subseteq C$ is $n$, the intersections of these half-spaces create $2^n$ regions centered on the object $x \in U$, which are analogous to orthants in $n$ dimensions (and to quadrants when $n = 2$). Hereafter, each such region is called an orthant. One of the $2^n$ orthants created by the $n$ hyperplanes for object $x$ in Equations (6) and (7) is represented by the following equation:
$$Q_B^l(x) = \left\{ y \in U \;\middle|\; \mathbf{g}(y) \in \bigcap_{i \in B} H_i^*\left( \varepsilon_i, \mathbf{g}(x) \right) \right\}$$
Here, $l$ is a superscript representing the number of the orthant, and the sign of the half-space $H$ is determined using $l$ and the attribute number $i$ as follows:
$$H_i^* = \begin{cases} H_i^+ & \text{if the } i\text{-th digit of } l \text{ in binary is } 0 \\ H_i^- & \text{if the } i\text{-th digit of } l \text{ in binary is } 1 \end{cases}$$
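The binary encoding of the orthant index $l$ in Equations (8) and (9) can be sketched as follows in Python; the code below is an illustrative reading of the definition (ties, i.e., zero grade differences, place an object in more than one orthant), not the authors' implementation.

```python
import numpy as np

def in_orthant(gy, gx, l):
    """True if grade vector gy lies in orthant l centered on gx (Equations (8) and (9))."""
    d = gy - gx                          # diff(y, x) per attribute
    for i, di in enumerate(d):
        bit = (l >> i) & 1               # i-th binary digit of l
        if bit == 0 and di < 0:          # H_i^+ requires diff(y, x, a_i) >= 0
            return False
        if bit == 1 and di > 0:          # H_i^- requires diff(y, x, a_i) <= 0
            return False
    return True

gx, gy = np.array([0.4, 0.6]), np.array([0.7, 0.5])
print([l for l in range(2 ** gx.size) if in_orthant(gy, gx, l)])   # -> [2]
```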
The object y necessarily belongs to at least one orthant centered on x , so the following total order relation holds between x and y .
(i) Reflexive: $\forall x \in X$, $xRx$
(ii) Antisymmetric: $xRy$ and $yRx$ $\Rightarrow$ $x = y$
(iii) Transitive: $xRy$ and $yRz$ $\Rightarrow$ $xRz$
(iv) Completeness: $\forall x, y \in X$, $xRy$ or $yRx$.

2.4. Neighborhood

The distance function is defined here to define a new information granule in Section 2.5. Distance functions are used in neighborhood rough sets [21].
Given the objects $x, y \in U$ and the attribute set $B \subseteq C$, where $B = \{ a_1, a_2, \ldots, a_n \}$, the distance function of the objects $x, y$ in the feature space $B$ is defined as follows:
$$\Delta_B(x, y) = \left( \sum_{a \in B} \left| diff(x, y, a) \right|^P \right)^{1/P}$$
where the number of attributes of $B$ is $n$. Equation (9) represents the distance between $x$ and $y$ in the $n$-dimensional feature space, which is the Manhattan distance when $P = 1$, the Euclidean distance when $P = 2$, and the Chebyshev distance when $P = \infty$ [21]. The Chebyshev distance is used in this study.
A neighborhood $N_B^\delta(x)$ of an object $x \in U$ in the attribute set $B \subseteq C$ is defined as follows:
$$N_B^\delta(x) = \left\{ y \in U \mid \Delta_B(x, y) \leq \delta \right\}$$
where δ is a neighborhood parameter.
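A minimal sketch of the neighborhood of Equation (10) with the Chebyshev distance ($P = \infty$) used in this paper is given below; the grade vectors and the parameter values are illustrative.

```python
import numpy as np

def chebyshev(gx, gy):
    """Chebyshev distance between two grade vectors (Equation (9) with P = infinity)."""
    return float(np.max(np.abs(gx - gy)))

def neighborhood(G, idx, delta):
    """Indices of the objects within distance delta of object G[idx] (Equation (10))."""
    return [j for j in range(len(G)) if chebyshev(G[idx], G[j]) <= delta]

G = np.array([[0.10, 0.20], [0.15, 0.25], [0.90, 0.80]])
print(neighborhood(G, 0, 0.1))   # -> [0, 1]
```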
The objects x and y are then in a neighborhood relation, which satisfies the following:
(i) Reflexive: $\forall x \in X$, $xRx$
(ii) Symmetric: $xRy$ $\Rightarrow$ $yRx$
The neighborhood relation has the effect of ignoring the influence of the training data located far away in the feature space. This makes classification possible even when a class is surrounded by other classes in the feature space. On the other hand, the GRS cannot conduct classification in such cases.

2.5. Directional Neighborhood

An information granule using Equations (8) and (10) is defined as follows:
$$R_B^{l\delta}(x) = \left\{ y \in U \mid y \in Q_B^l(x) \cap N_B^\delta(x) \right\}$$
This expression represents the set of objects that are contained in the orthant $Q_B^l(x)$ centered on a given object $x$ and are within the distance $\delta$ from $x$. With the added constraint of the distance $\delta$, Equation (11) does not generally satisfy the total order relation that was satisfied by Equation (8). The superscript $l$, representing which of the $2^n$ orthants is meant, can also be seen as representing a direction, for instance, the upper right or lower left as viewed from object $x$, as shown in Figure 1. Hence, we call the information granule expressed in Equation (11) the directional neighborhood. The relations satisfied by the elements $x$ and $y$ in Equation (11) are as follows:
(i) Reflexive: $\forall x \in X$, $xRx$
(ii) Antisymmetric: $xRy$ and $yRx$ $\Rightarrow$ $x = y$
The relations that satisfy the above condition are used in the context of mereology [36,37] and feature extraction [38], but not in the field of classification.
In addition, the directional neighborhoods overlap with each other, and their union over all objects and orthants is the universe $U$:
$$\bigcup_{i \leq m} \bigcup_{l \leq 2^n} R_B^{l\delta}(x_i) = U$$
where $m$ represents the number of objects included in $U$. Thus, the directional neighborhood is one of the covering rough sets.
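Combining the two previous sketches gives an illustrative implementation of the directional neighborhood of Equation (11); it is a reading of the definition under the same assumptions as above, not the authors' code.

```python
import numpy as np

def in_orthant(gy, gx, l):
    d = gy - gx
    return all((d[i] >= 0) if ((l >> i) & 1) == 0 else (d[i] <= 0)
               for i in range(len(d)))

def directional_neighborhood(G, idx, l, delta):
    """Indices j with g(x_j) in Q_B^l(x_idx) and Chebyshev distance <= delta (Equation (11))."""
    gx = G[idx]
    return [j for j in range(len(G))
            if in_orthant(G[j], gx, l) and np.max(np.abs(gx - G[j])) <= delta]

G = np.array([[0.40, 0.60], [0.50, 0.70], [0.45, 0.55], [0.90, 0.90]])
print(directional_neighborhood(G, 0, 0, 0.2))   # "upper-right" orthant of x_0 -> [0, 1]
```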

2.6. DN-Lower and DN-Upper Approximations

The DN-lower and DN-upper approximations of an arbitrary set $X \subseteq U$ are defined as follows:
$$\underline{R}_B(X) = \left\{ x \;\middle|\; \exists l, \; R_B^{l\delta}(x) \subseteq X \right\}$$
$$\overline{R}_B(X) = \left\{ x \;\middle|\; \forall l, \; R_B^{l\delta}(x) \cap X \neq \emptyset \right\}$$
Equation (13) states that an object $x$ belongs to the DN-lower approximation if at least one of its directional neighborhoods is included in $X$, and Equation (14) states that $x$ belongs to the DN-upper approximation if all of its directional neighborhoods intersect $X$.
The DN-boundary region of X is as follows:
$$BN_B(X) = \overline{R}_B(X) - \underline{R}_B(X)$$
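Under the reading given above (DN-lower: at least one directional neighborhood is contained in X; DN-upper: every directional neighborhood intersects X), Equations (13)-(15) can be sketched as follows; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def in_orthant(gy, gx, l):
    d = gy - gx
    return all((d[i] >= 0) if ((l >> i) & 1) == 0 else (d[i] <= 0)
               for i in range(len(d)))

def dn_neighborhoods(G, idx, delta):
    """All 2^n directional neighborhoods of object idx, as sets of object indices."""
    gx, n = G[idx], G.shape[1]
    return [{j for j in range(len(G))
             if in_orthant(G[j], gx, l) and np.max(np.abs(gx - G[j])) <= delta}
            for l in range(2 ** n)]

def dn_lower_member(G, idx, X, delta):
    """Equation (13): some directional neighborhood of idx is included in X."""
    return any(nb <= X for nb in dn_neighborhoods(G, idx, delta))

def dn_upper_member(G, idx, X, delta):
    """Equation (14): every directional neighborhood of idx intersects X."""
    return all(nb & X for nb in dn_neighborhoods(G, idx, delta))

G = np.array([[0.1, 0.1], [0.2, 0.2], [0.8, 0.8]])
X = {0, 1}                                   # indices of the target-class objects
print(dn_lower_member(G, 0, X, 0.3), dn_upper_member(G, 0, X, 0.3))   # -> True True
```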

2.7. Decision Rule Extraction

Shan and Ziarko proposed a decision matrix as a method of efficiently extracting decision rules using identifiability [5]. The difference in grade defined in Equation (2) shows the degree of discernibility. Therefore, the component of the difference in grade between the objects belonging to different classes is defined in the decision matrix. The components of the decision matrix for decision class D k are defined as follows:
$$M_{ij}^{kl} = \left\{ \left| diff(y_j, x_i, a) \right| \;\middle|\; y_j \in Q_B^l(x_i) \right\}, \quad i \in K_k^+, \; j \in K_k^-, \; l \leq 2^n$$
where $x_i$ is an object belonging to the decision class $D_k$ and $y_j$ is an object not belonging to the decision class $D_k$. This decision matrix is an extension of Shan and Ziarko's decision matrix, obtained for each of the $2^n$ orthants, where $n$ is the number of condition attributes and $|\cdot|$ indicates an absolute value. Defining the elements of the sets $K_k^+$ and $K_k^-$ as $K_k^+ = \{ i \mid x_i \in \underline{R}_B(D_k) \}$ and $K_k^- = \{ i \mid y_i \in U - \underline{R}_B(D_k) \}$ results in the decision rule of the DN-lower approximation, whereas defining them as $K_k^+ = \{ i \mid x_i \in D_k \}$ and $K_k^- = \{ i \mid y_i \in \underline{R}_B(U - D_k) \}$ results in the decision rule of the DN-upper approximation. An example of a decision matrix is shown in Table 1.
The logical operations for a decision rule extraction are conducted through fuzzy operations similar to those of Mori and Takanashi [34]. The decision rule extraction for each orthant l is defined using the following formula:
$$RULE^l(D_k) = \bigvee_{i \in K_k^+} \bigwedge_{j \in K_k^-} M_{ij}^{kl} = \max_{i \in K_k^+} \min_{j \in K_k^-} \max M_{ij}^{kl}$$
The absolute value of the largest difference in grades among the elements of the set $M_{ij}^{kl}$ is chosen by the innermost disjunction. After calculating Equation (17) for all orthants, the orthant with the largest difference in grade is selected from among them as follows:
$$RULE(D_k) = \bigvee_{l \leq 2^n} RULE^l(D_k) = \max_{l \leq 2^n} RULE^l(D_k)$$
This is an extension of Shan and Ziarko's decision rule extraction to the $2^n$ orthants. Here, $n$ is the number of attributes.
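The max-min-max aggregation of Equations (16)-(18) can be sketched as follows for one decision class, assuming pre-computed grade vectors `G`, index sets `pos` ($K_k^+$) and `neg` ($K_k^-$), and the orthant test from the earlier sketch; how empty sets $M_{ij}^{kl}$ are treated (skipped here) is an assumption of this illustration.

```python
import numpy as np

def in_orthant(gy, gx, l):
    d = gy - gx
    return all((d[i] >= 0) if ((l >> i) & 1) == 0 else (d[i] <= 0)
               for i in range(len(d)))

def rule_for_class(G, pos, neg):
    """RULE(D_k): disjunction over orthants of max_i min_j max_a |diff| (Equations (16)-(18))."""
    n, best = G.shape[1], 0.0
    for l in range(2 ** n):                              # Equation (18): best orthant
        vals = []
        for i in pos:                                    # Equation (17): max over i, min over j
            per_j = [float(np.max(np.abs(G[j] - G[i])))  # innermost disjunction over attributes
                     for j in neg if in_orthant(G[j], G[i], l)]
            if per_j:                                    # skip empty M_ij^kl (assumption)
                vals.append(min(per_j))
        if vals:
            best = max(best, max(vals))
    return best

G = np.array([[0.2, 0.3], [0.8, 0.7], [0.6, 0.1]])
print(rule_for_class(G, pos=[0], neg=[1, 2]))   # -> ~0.6
```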

2.8. Classification

The decision rule extraction in the previous section only described the decision rule for the target x . When solving a classification problem, the inference must be performed on data for which the decision attribute value is not given. Therefore, we used the decision rule extraction defined in the previous section to infer decision attributes for data that were not in the decision table.
Let $z$ be an arbitrary object that has the same condition attribute set $B \subseteq C$ as the objects $x_i \in U$ but whose decision attribute is unknown. $LM_{ij}^{kl}$, the degree of discernibility of $z$ with respect to some $y_j$, is represented by the degree of discernibility with respect to an object $x_i$ satisfying $z \in R_B^{l\delta}(x_i)$:
$$LM_{ij}^{kl} = \bigvee_{a \in B} \left\{ \left| diff(y_j, x_i, a) \right| \;\middle|\; z \in R_B^{l\delta}(x_i) \right\} = \max_{a \in B} \left\{ \left| diff(y_j, x_i, a) \right| \;\middle|\; z \in R_B^{l\delta}(x_i) \right\}$$
If Equation (19) is not an empty set, this means that object $z$ does not have the same attribute value as the nontarget object $y_j$ for at least one attribute. Furthermore, the following fuzzy operations are applied to Equation (19) to obtain the decision rule for object $z$:
$$RULE_z^l(D_k) = \bigvee_{i \in K_k^+} \bigwedge_{j \in K_k^-} LM_{ij}^{kl} = \max_{i \in K_k^+} \min_{j \in K_k^-} \max_{a \in B} M_{ij}^{kl}$$
$$RULE_z(D_k) = \bigvee_{l \leq 2^n} RULE_z^l(D_k) = \max_{l \leq 2^n} RULE_z^l(D_k)$$
When there are $k$ classes, these equations provide $k$ decision rules for object $z$. Classification into the most appropriate class among them is defined as classification into the class that takes the maximum value among the decision rules of each class, that is:
$$Class(z) = \underset{k \in D}{\operatorname{argmax}} \; RULE_z(D_k)$$
where $Class(\cdot)$ indicates the class assigned to object $z$.
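Extending the previous sketch, the following Python fragment illustrates Equations (19)-(22) for an unseen object z: the orthants of the training objects whose directional neighborhoods contain z provide the discernibility values, and z is assigned to the class with the largest rule value. The fixed δ and the small data are illustrative choices; the δ-selection via the hyperparameter t described in Section 3 is omitted here.

```python
import numpy as np

def in_orthant(gy, gx, l):
    d = gy - gx
    return all((d[i] >= 0) if ((l >> i) & 1) == 0 else (d[i] <= 0)
               for i in range(len(d)))

def rule_for_z(gz, G, pos, neg, delta):
    """RULE_z(D_k) of Equations (19)-(21), under the same assumptions as the previous sketch."""
    n, best = G.shape[1], 0.0
    for l in range(2 ** n):
        vals = []
        for i in pos:
            # Equation (19) requires z to lie in the directional neighborhood R_B^{l,delta}(x_i)
            if not (in_orthant(gz, G[i], l) and np.max(np.abs(gz - G[i])) <= delta):
                continue
            per_j = [float(np.max(np.abs(G[j] - G[i])))
                     for j in neg if in_orthant(G[j], G[i], l)]
            if per_j:
                vals.append(min(per_j))
        if vals:
            best = max(best, max(vals))
    return best

def classify(gz, G, class_sets, delta):
    """Equation (22): argmax over classes of RULE_z(D_k)."""
    scores = {k: rule_for_z(gz, G, pos, neg, delta) for k, (pos, neg) in class_sets.items()}
    return max(scores, key=scores.get), scores

G = np.array([[0.20, 0.30], [0.25, 0.35], [0.80, 0.70], [0.75, 0.65]])
class_sets = {0: ([0, 1], [2, 3]), 1: ([2, 3], [0, 1])}
print(classify(np.array([0.35, 0.40]), G, class_sets, delta=0.3))   # -> class 0, score ~0.55
```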

2.9. Comparison between DNRS and GRS

Based on the article by Ishii et al. [9], the decision rule extraction for the GRS is described briefly.
The element $G_{ij}^k$ of the decision matrix for decision class $D_k$ contains the differences in grade between the objects $x_i$ and $y_j$:
$$G_{ij}^k = \left\{ diff(y_j, x_i, a) \right\}, \quad i \in S_k^+, \; j \in S_k^-$$
where the sets $S_k^+$ and $S_k^-$ are defined as $S_k^+ = \{ i \mid x_i \in D_k \}$ and $S_k^- = \{ i \mid y_i \in U - D_k \}$. The formula for the decision rule for an arbitrary object $z$, using the decision matrix obtained in Equation (23), is expressed as follows:
$$FG_{ij}^k = \bigvee_{a \in B} \left\{ \left| diff(y_j, x_i, a) \right| \;\middle|\; f = 1 \right\}$$
where $f$ in Equation (24) is given by:
$$f = \begin{cases} 1, & \text{if } g(z, a) \geq g(x_i, a) > g(y_j, a) \\ -1, & \text{if } g(z, a) \leq g(x_i, a) < g(y_j, a) \\ 0, & \text{otherwise} \end{cases}$$
$FG_{ij}^k$, the degree of discernibility of any $z$ with respect to some $y_j$, is represented by the degree of discernibility with respect to the object $x_i$ satisfying $f = 1$. Furthermore, the final decision rule of $z$ for class $D_k$ is obtained by the following equation:
$$RULE_z(D_k) = \bigvee_{i \in S_k^+} \bigwedge_{j \in S_k^-} FG_{ij}^k$$
Then, the GRS is described in the framework of the DNRS for comparison between the DNRS and the GRS. The information granule in GRS can be considered a special case of the information granule in DNRS, although this is not explicitly stated by Mori and Takanashi [34] and Ishii et al. [9]. The information granule in GRS is the case when δ in Equation (11) is the largest value that can be taken in the feature space. Equation (11) is then the same as Equation (8). Hereafter, the information granule of GRS is expressed using Equation (8).
The decision matrix $M_{ij}^{kl}$ is defined using the information granules of the GRS expressed in the DNRS framework as follows:
$$M_{ij}^{kl} = \left\{ \left| diff(y_j, x_i, a) \right| \;\middle|\; y_j \in Q_B^l(x_i) \right\}, \quad i \in K_k^+, \; j \in K_k^-, \; l \leq 2^n$$
where the sets $K_k^+$ and $K_k^-$ are defined as $K_k^+ = \{ i \mid x_i \in \underline{R}_B(D_k) \}$ and $K_k^- = \{ i \mid y_i \in U - \underline{R}_B(D_k) \}$, and $l$ is one of the $2^n$ orthants centered on object $x_i$. The formula for the decision rule for an arbitrary object $z$, using the decision matrix obtained in Equation (26), is expressed as follows:
$$LM_{ij}^{kl} = \bigvee_{a \in B} \left\{ \left| diff(y_j, x_i, a) \right| \;\middle|\; z \in Q_B^l(x_i) \right\}$$
Furthermore, the final decision rule of z for class D k is obtained using the following equation:
$$RULE_z(D_k) = \bigvee_{i \in K_k^+} \bigwedge_{l \leq 2^n} \bigwedge_{j \in K_k^-} LM_{ij}^{kl}$$
On the other hand, the DNRS decision rule, Equations (20) and (21), can be integrated into the following equation:
$$RULE_z(D_k) = \bigvee_{l \leq 2^n} \bigvee_{i \in K_k^+} \bigwedge_{j \in K_k^-} LM_{ij}^{kl} = \bigvee_{i \in K_k^+} \bigvee_{l \leq 2^n} \bigwedge_{j \in K_k^-} LM_{ij}^{kl}$$
Comparing Equation (28) for GRS and Equation (29) for DNRS, we can see that the difference is whether the operations on the orthants are logical conjunction or logical disjunction.
In summary, there are two points of difference between the DNRS and GRS: (i) whether the information granule has a distance constraint or not, and (ii) whether logical operations related to the orthant in the equation of decision rule extraction use logical disjunction or logical conjunction.
These differences allow the DNRS to avoid the unnatural classification that occurs in the GRS. Figure 2 illustrates an example of the difference between the DNRS and GRS decision rules for the extraction of the red class in a two-dimensional feature space. Figure 2a shows the distribution of the target class object x 1 (red cell) and the other class objects y 1 , y 2 , and y 3 (blue cells). The results of the decision rule extraction for the target class are shown in Figure 2b,c. The values in the cells were obtained from Equation (28) in Figure 2b and from Equation (29) in Figure 2c, respectively. The cells with no value indicate that the rule has not been obtained. Comparing Figure 2b,c, the result of GRS shows an unnatural stripe-like distribution, whereas the result of DNRS shows no such stripe-like distribution.

3. Directional Neighborhood Rough Set Model

A classification algorithm was developed based on the directional neighborhood rough set (DNRS) approach defined in Section 2. The new model based on the DNRS approach, the DNRS model, is constructed in this section. The flow of the DNRS model is shown in Figure 3. In Figure 3, input data represent the data that only have the condition attributes to be classified, and output data represent the decision attribute (decision class) inferred from the information on the condition attributes for the input data by the classifier.
In Figure 3, the training dataset is first loaded, and it is determined whether the training data belong to the DN-lower or DN-upper approximation set. For this, we use Equations (13) and (14). The training data determined to belong to the DN-lower approximation set using Equation (13) are defined as the DN-lower approximation training data, and the training data determined to belong to the DN-upper approximation set using Equation (14) are defined as the DN-upper approximation training data. Here, the size δ of the directional neighborhood used in Equations (11)–(14) and (19) is given as follows:
$$\delta(x, l, t) = \min \left\{ \delta \;\middle|\; \mathrm{Card}\left( R_B^{l\delta}(x) \right) \geq t \right\}$$
where $x$ is a training object, $l$ is the orthant index, $B \subseteq C$ is a subset of the condition attribute set, and $t$ is the required number of training data in the directional neighborhood. In this algorithm, $\delta$ is not given directly as a hyperparameter. Instead, the intervening variable $t$ is used as a hyperparameter to determine $\delta$. Equation (30) shows that $\delta$ is chosen as the smallest value for which the cardinality of the set $R_B^{l\delta}(x)$ is greater than or equal to $t$. With such a $\delta$, a difference in grade can be assigned over a small region of the feature space where the training data are dense, and over a wide region where the training data are sparse.
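A minimal sketch of Equation (30) is given below: for a training object and an orthant, δ is taken as the t-th smallest Chebyshev distance among the objects lying in that orthant (the object itself is counted, since it always belongs to its own directional neighborhoods); returning None when the orthant holds fewer than t objects is an assumption of this illustration.

```python
import numpy as np

def in_orthant(gy, gx, l):
    d = gy - gx
    return all((d[i] >= 0) if ((l >> i) & 1) == 0 else (d[i] <= 0)
               for i in range(len(d)))

def delta_for(G, idx, l, t):
    """Smallest delta whose directional neighborhood R_B^{l,delta}(x_idx) holds >= t objects."""
    gx = G[idx]
    dists = sorted(float(np.max(np.abs(G[j] - gx)))
                   for j in range(len(G)) if in_orthant(G[j], gx, l))
    if len(dists) < t:          # fewer than t objects ever fall in this orthant
        return None
    return dists[t - 1]

G = np.array([[0.10, 0.10], [0.20, 0.15], [0.40, 0.30], [0.05, 0.90]])
print(delta_for(G, 0, 0, t=2))   # -> ~0.1 (distance to the 2nd-closest object in orthant 0)
```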
Classification is conducted in three steps.
In step 1, the DN-lower approximation classification algorithm is implemented using the DN-lower approximation training data for the target class and the DN-upper approximation training data for the other classes. When the target class is $D_k$, $x_i$ is the training data of the target class, and $y_j$ is the training data of the other classes, the row component representing the target class in the decision matrix is $K_k^+ = \{ i \mid x_i \in \underline{R}_B(D_k) \}$, and the column component representing the other classes is $K_k^- = \{ i \mid y_i \in U - \underline{R}_B(D_k) \}$. This rule extraction uses the definitions given in Section 2.8 and performs the DN-lower approximation classification, indicating a certain classification. However, unclassified data that are not classified into any class may remain.
In step 2, the DN-upper approximation classification algorithm is performed for the data that are unclassified in step 1. Here, the DN-upper approximation training data are used for the target class, and the DN-lower approximation training data are used for the other classes. When the target class is $D_k$, $x_i$ is the training data for the target class, and $y_j$ is the training data for the other classes, the row component representing the target class in the decision matrix is $K_k^+ = \{ i \mid x_i \in D_k \}$, and the column component representing the other classes is $K_k^- = \{ i \mid y_i \in \underline{R}_B(U - D_k) \}$. As in step 1, rule extraction uses the definitions given in Section 2.8 and performs the DN-upper approximation classification, indicating classification possibilities. This step is effective for classifying data located in regions where different classes are mixed in the feature space, and most data will be classified in this step. However, there may be data that are not included in the directional neighborhood of any training data; such data cannot be classified into any class.
In step 3, the unclassified data from step 2 are classified into a class. This process was excluded from DNRS approach in Section 2 but was introduced as an exceptional process as part of the algorithm. Here, it refers to the nearest-neighbor training data of the unclassified data. However, only the DN-lower approximation training data, which are more certain, are used because the DN-upper approximation training data are less certain and may be noisy. For the distance to determine the nearest neighbor, we used the Chebyshev distance, which is consistent with the approach behind the DNRS.
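Step 3 can be sketched as a simple Chebyshev nearest-neighbor fallback over the DN-lower approximation training data; `lower_idx`, the index list of those training data, is a placeholder name used only in this illustration.

```python
import numpy as np

def step3_nearest_lower(gz, G, labels, lower_idx):
    """Assign the label of the Chebyshev-nearest DN-lower approximation training object."""
    dists = [np.max(np.abs(G[i] - gz)) for i in lower_idx]
    return labels[lower_idx[int(np.argmin(dists))]]

G = np.array([[0.1, 0.2], [0.8, 0.9], [0.5, 0.5]])
labels = np.array([0, 1, 1])
print(step3_nearest_lower(np.array([0.2, 0.25]), G, labels, lower_idx=[0, 1]))   # -> 0
```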
In this way, unclassified results and overlearning can be reduced by performing classification separately with the DN-lower and DN-upper approximations. In the GRS, by contrast, the classification of the region corresponding to the DN-boundary is not divided into lower and upper approximation steps as in the DNRS: points that coincide with training data are classified into the class of that training data, while other points remain unclassified. The DNRS model is intended to solve these GRS problems.

4. Experiments

Three experiments were performed using the proposed model. One was an experiment to determine the effect of each step of the DNRS model. The second was a set of comparative experiments with the GRS classifier to demonstrate improvements brought about by the DNRS model. The third was a set of comparative experiments with existing classifiers to evaluate the performance of the DNRS model.

4.1. Dataset

The UCI Machine Learning Repository [39] was used in these experiments. The details of the dataset are shown in Table 2. The numbers that correspond to each attribute in column 5 are used in the Results and Discussion section.

4.2. Methods

4.2.1. Experiment Demonstrating the Characteristics of the DNRS Model

The DNRS was developed based on information granules that satisfy the reflexive, antisymmetric, and nontransitivity constraints. Therefore, we performed experiments to characterize the new classifier. Classifications were performed under the following conditions to achieve this objective. Classification results were output for each step of the algorithm to clarify the effect of each step. In addition, we performed classifications in all combinations from one attribute to the largest number of attributes for each of the UCI datasets to clarify the characteristics related to the number of condition attributes (Table 2).

4.2.2. Experiments to Demonstrate the Improvements by DNRS Model

The DNRS model is an improved model designed to overcome the problems of the GRS classifier. Therefore, we conducted two experiments comparing the GRS and DNRS. To demonstrate that the DNRS overcomes the problems of unclassified and unnatural classification, we visualized the classification results in a two-dimensional feature space using attributes 1 and 2 of the Iris dataset, one of the UCI datasets. This dataset was selected because its data are reasonably well scattered in the feature space and the data regions of the classes overlap. All 150 samples from the Iris dataset were used as training data. The second experiment was conducted to confirm the solution to the problem of overtraining. The Raisin dataset was selected for this experiment because its classification accuracy is not very high, which makes overtraining easier to observe, and because it has enough samples per class to allow for 10-fold cross-validation. In each fold, the number of validation samples was set to 90, and a learning curve was created by drawing training subsets from the remaining 810 samples by stratified random sampling in increments of 90, from 90 to 810 samples.

4.2.3. Experiments to Assess the Performance of the DNRS Model

Experiments were performed to assess the performance of the DNRS model compared with existing classifiers. SVM [43,44] and Random Forests (RF) [45] were employed as existing classifiers because they have performed well in numerous studies and are frequently used [46,47].
DNRS, SVM, and RF are classifiers with hyperparameters. Parameter tuning was performed for each classifier in this study within the ranges shown in Table 3. In this experiment, the classification was also performed for each UCI dataset in all combinations, ranging from one attribute to the maximum number of attributes. In order to clarify the characteristics of the three classifiers, we measured the computation time and visualized the classification results in a two-dimensional feature space.
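For the comparison classifiers, a tuning set-up along the following lines could be used with scikit-learn; the parameter grids shown here are illustrative placeholders and are not the ranges listed in Table 3.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Grid search over placeholder parameter ranges with 10-fold cross-validation
rf_search = GridSearchCV(RandomForestClassifier(random_state=0),
                         {"n_estimators": [100, 300, 500]}, cv=10)
svm_search = GridSearchCV(SVC(kernel="rbf"),
                          {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}, cv=10)
# rf_search.fit(X_train, y_train); svm_search.fit(X_train, y_train)
```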

4.2.4. Accuracy Assessment

Ten-fold cross-validation was used to assess the statistical accuracy in three experiments. For each class in each dataset, stratified random sampling was used.
The following indicators were used to conduct the accuracy assessment.
$$Accuracy = \frac{\sum_{i=1}^{k} C_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} C_{ij}}$$
where C i j represents the elements of the confusion matrix and k is the number of classes. The mean accuracy for 10 datasets for each condition was calculated for the results obtained. In addition, Dunnett’s test was performed at the 5% significance level to statistically evaluate the performance of the DNRS with RF and the DNRS with SVM.
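The evaluation protocol can be sketched as follows: stratified 10-fold cross-validation with the overall accuracy of Equation (31) computed from the confusion matrix of each fold. scikit-learn is used here only for the splitting and the confusion matrix; `clf` stands for any classifier with fit/predict methods and is not part of the paper's description.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

def overall_accuracy(C):
    """Equation (31): trace of the confusion matrix divided by its total."""
    C = np.asarray(C)
    return np.trace(C) / C.sum()

def cross_validate(clf, X, y, n_splits=10, seed=0):
    """Mean and standard deviation of the fold accuracies under stratified k-fold CV."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accs = []
    for tr, va in skf.split(X, y):
        clf.fit(X[tr], y[tr])
        C = confusion_matrix(y[va], clf.predict(X[va]))
        accs.append(overall_accuracy(C))
    return float(np.mean(accs)), float(np.std(accs))
```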

5. Results and Discussion

5.1. Experiment Demonstrating the Characteristics of the DNRS Model

Figure 4 shows the classification results for the UCI dataset in the three steps of the DNRS model. The vertical axis of the graph represents the mean accuracy of cross-validation, and the horizontal axis represents the number of attributes. The bar heights represent the most accurate value for all attribute combinations. Note that the highest accuracy in steps 1 and 2 does not always result in the same attribute combination; however, the attribute combinations at the highest accuracy in step 2 and step 3 are the same. Thus, Table 4, Table 5, Table 6, Table 7 and Table 8 shows the attribute combinations and hyperparameter values when the maximum accuracy was achieved in step 1, step 2, and step 3, respectively. The combination of the attribute numbers in these tables corresponds to the number in the fifth column of Table 2. When the accuracy is at its maximum, there may be more than one combination of attributes. In these cases, all combinations are described. When step 2 has the highest accuracy, the rate of the DN-lower approximation training data (DN-lower approximation training data/all training data) is calculated, and the mean rate of the DN-lower approximation training data of the 10 datasets is also calculated and shown in the last column of Table 4, Table 5, Table 6, Table 7 and Table 8. Note that the number of all training data and DN-upper approximation training data are equivalent. When the rate shown in the last column of Table 4, Table 5, Table 6, Table 7 and Table 8 is 100%, all the input data are classified by the DN-lower approximation training data using step 1, and no data are classified by the DN-upper approximation training data using step 2.
In Figure 4, the attribute numbers with the highest accuracy were 3 and 4 for Banknote, 2 for Iris, 3 for Raisin, 2 for Rice, and 3 and 4 for Wireless. Table 4, Table 5, Table 6, Table 7 and Table 8 show that, except for the Banknote dataset, there is no 100% DN-lower approximation training data, indicating that training data exist in the boundary region (DN-upper approximation but not DN-lower approximation). Therefore, except for the case of the Banknote and the four Wireless attributes, while focusing on the number of attributes with the highest accuracy, we can see that the accuracy of step 2 is higher than that of step 1, thus, demonstrating the effect of the DN-upper approximation. In the case of four Wireless attributes, the accuracy was the same for steps 1 and 2. From step 2 to step 3, the accuracy improves slightly in two Banknote attributes, one and two Iris attributes, four Rice attributes, and six Wireless attributes. This indicates that there are still a few unclassifications after performing the DN-upper approximation process in step 2, and that the step 3 process is still valid for them.
The above results show that the highest accuracy for most of the datasets is achieved when the number of attributes is small and both DN-lower and DN-upper approximations are used. In contrast, when the number of attributes is large and only DN-lower approximation classifications are completed, the accuracies are not sufficiently high. This phenomenon can be explained using the concept of the indiscernibility relation in rough set theory. Rough set theory is based on the rule that an object whose condition attribute value differs from that of an object in another class in at least one condition attribute is discernible, as shown in Equation (19). An object with a condition attribute that happens to have a value different from other classes due to noise is judged discernible from those classes, even if it is in fact indiscernible. This is the reason why the accuracy decreases as the number of attributes increases. Conversely, if the number of attributes is small and the attributes are properly selected, the misclassification of objects due to noise decreases, and such objects become unclassified in the DN-lower approximation classification algorithm. These unclassified data lie in the DN-boundary region. In this model, they can be classified by the DN-upper approximation classification algorithm.
In these experiments, only the number of attributes was varied, and no dimensionality compression was conducted. However, the results obtained showed that the accuracy could be further improved by considering an appropriate method of dimension compression for this classifier to obtain more useful information with a smaller number of dimensions. This will be discussed in the future.

5.2. Experiments to Demonstrate the Improvements by the DNRS Model

Figure 5 shows the comparison of the class distributions of the GRS and DNRS in two-dimensional feature space using attributes 1 and 2 of the Iris dataset. For DNRS, the results of each of the three steps of the DNRS model are shown to demonstrate that most of the unclassified and unnatural classifications that occur in the GRS are resolved by the lower and upper approximations, i.e., steps 1 and 2.
In the GRS, unclassified pixels (black pixels in Figure 5d) appear in the region of the boundary between the three classes, and a streak-like, unnaturally classified region can be seen. The DNRS classification results show that almost all unclassified pixels are eliminated in step 2, and no unclassified pixels remain after step 3. In addition, the unnatural streak-like classified region is not seen in the DNRS model. This improvement is due to the differences between the GRS and the DNRS described in Section 2.9.
Figure 6 shows the learning curves for classification using attributes 4, 6, and 7 of the Raisin dataset. This combination of attributes was chosen because it was the condition under which the DNRS was most accurate in the experiment using the Raisin dataset in Section 5.1. The DNRS hyperparameter was t = 11 . The horizontal axis in Figure 6 shows the number of samples in the training dataset, and the vertical axis shows the mean accuracy of the ten datasets. Table 9 and Table 10 show the total confusion matrices of 10-fold cross-validation when trained by the GRS with 90 and 810 samples of the training data in Figure 6, respectively, and evaluated for accuracy with the validation data. Table 11 and Table 12 show the total confusion matrices of 10-fold cross-validation when trained with the DNRS with 90 and 810 samples of training data in Figure 6, respectively, and evaluate the accuracy of the validation data.
In the GRS, the accuracy of the training data is 100% in all samples, whereas the accuracy decreases when evaluated with the validation data. In general, the accuracy tends to improve as the amount of information increases due to an increase in the number of training data, but the results of the GRS validation data show a different result from the general hypothesis. That is, the accuracy evaluated by the validation data decreased with an increase in the number of training data. Table 9 and Table 10 show that this decrease in accuracy is due to an increase in unclassified data associated with the increase in the number of training data. This unclassified increase is likely caused by an increase in the number of regions with mixed classes in the feature space as a result of the increase in the number of training data. On the other hand, the DNRS never learns 100% of the training data at any number of samples, and the accuracy does not drop significantly when evaluating the validation data. Table 11 and Table 12 also show that unclassified results do not occur as frequently as in the GRS as a result of the increase in training data. These results show that overtraining in the GRS is overcome in the DNRS and that it also copes well with the mixing of multiple classes with increasing numbers of training data.

5.3. Experiments to Assess the Performance of the DNRS Model

Table 13 shows the classification results for the UCI dataset for the condition with the highest accuracy for all attribute combinations. The accuracy and standard deviation are shown in the top row, and the number of attributes with the highest accuracy is shown in brackets in the bottom row. The most accurate condition is defined as follows: (i) the accuracy in Equation (31) is the highest, (ii) the standard deviation is smaller, and (iii) the number of attributes is the smaller. Dunnett’s test was performed at a 5% significance level to statistically examine whether there was a difference in accuracy between the DNRS and other classifiers for these results, and the results are shown in the right two columns of Table 13. In this experiment, the D-value was 2.333, and if the absolute value of the statistic is greater than this D-value, there is a significant difference.
Table 13 shows that the DNRS had the highest accuracy for Iris, while the SVM had the highest accuracy for Banknote, Rice, and Wireless. The three classifiers have the same mean accuracy for Raisin. The DNRS had the highest mean accuracy across all the datasets. However, when Dunnett’s test was performed at the 5% significance level, the only significant difference was found for Banknote’s DNRS accuracy vs. RF accuracy. The classification performance of the DNRS is statistically comparable to that of RF and SVM because no significant differences were observed in the other datasets.
In addition, focusing on the number of attributes at which the highest accuracy was achieved (Table 13), there is a trend for the DNRS to reach its highest accuracy with fewer attributes than RF and SVM, although this does not hold for every dataset. This is a property that emerges from the concept of the rough set indiscernibility relation, as discussed in Section 5.1.
The measured classification times by each classifier for the UCI dataset are shown in Table 14. The time measurements listed are the times required to classify using the largest number of attributes for each UCI dataset. For the maximum number of attributes, the hyperparameters are fixed at the highest accuracy for the maximum number of attributes. The means and standard deviations of the 10 repeated calculations for the first dataset of the 10-fold cross-validation are shown in Table 14. The hardware used was a Dual Intel(R) Xeon™ E5-2696 v4 CPU @ 2.20 GHz and 64 GB RAM. The OS was Ubuntu 20.04. The compiler was oneAPI 5.0.17 for the classification of the DNRS. For RF and SVM, a Python library called scikit-learn was used.
Overall, the DNRS tends to be the slowest, and the SVM tends to be the fastest. The reason the DNRS is slower is that, in addition to not summarizing information, which is a common property of rough sets [6], the DNRS has 2 n information granules for an object, as shown in Equation (11). The time difference between the DNRS and other classifiers is particularly large for the Rice and Wireless datasets, which have a relatively large number of attributes and samples.
To visually confirm the classifier characteristics, an extensive experiment was performed in which three classifiers were used to classify a two-dimensional feature space using the UCI datasets. Here, we show the classification results for a feature space composed of attributes 1 and 5 of the Wireless dataset as an example. This is because they have the combination of attributes for which all three classifiers commonly have the highest accuracy in the two-dimensional feature space. The mean classification accuracies for the 10-fold cross-validations were 0.971, 0.965, and 0.970 for the DNRS, RF, and SVM, respectively. The classification results of the feature space of one of these 10 datasets chosen are shown in Figure 7. The distribution of the training data is also included for reference. The distribution of the training data shows that the classes partially overlap with each other. The results of the three classifiers on the feature spaces where such classes overlap are shown in Figure 7.
In RF, there are some unnatural points, such as having one class inside another class. SVM shows the property of drawing smooth boundaries using the radial basis function (RBF) kernel. The DNRS has classification results similar to those of SVMs. The advantage of DNRS is that it allows similar classification with a single hyperparameter, while SVM uses two hyperparameters to achieve such classification results. When focusing on the orange and blue, red and blue, and green and blue borders in the near region where the training data are present, all classifiers are similar. The abovementioned visualization of the classification results in the feature space enabled us to confirm the characteristics of how the boundaries of the three classifiers are drawn.

6. Conclusions

In this study, we proposed a new approach called the directional neighborhood rough set (DNRS) to solve the problem of grade-added rough sets (GRS). The information granules in the DNRS are based on reflexive and antisymmetric relations. The DN-upper and DN-lower approximations and the rule of extraction were defined. The rule of extraction was extended to classification.
Furthermore, a new classification model based on the DNRS approach was developed. The proposed model makes the neighborhood size variable by defining it through the number of training data points it must contain, and a three-step classification algorithm was constructed. These steps are the DN-lower approximation classification algorithm, the DN-upper approximation classification algorithm, and processing to classify exceptional data.
To characterize and assess the performance of the proposed classifier, three experiments were conducted using the UCI datasets. In the first experiment, classification results were output for each step of the classification algorithm for all feature combinations. The results showed that classifications were completed using only the DN-lower approximation when the number of attributes was high, whereas the effect of the DN-upper approximation appeared when the number of attributes was low. Furthermore, the classification accuracies obtained with a few attributes using both the DN-lower and DN-upper approximations were mostly higher than those obtained with many attributes using only the DN-lower approximation. This indicates that restricting the number of attributes plays an important role in correct inference due to the nature of rough set theory and that the DN-upper approximation is useful for the mixed-class regions that arise when the number of attributes is reduced. The second set of experiments demonstrated that the DNRS can overcome the problems of the GRS: one experiment addressed the problem of unclassified results and unnatural classification, and the other addressed the problem of overtraining. From the results of these experiments, it was found that the DNRS resolves the problems of the GRS. In the third experiment, we compared the DNRS with RF and SVM. The results show that the classification accuracy of the DNRS is statistically as high as those of RF and SVM. Visualization experiments using a two-dimensional feature space showed that the DNRS can draw class boundaries similar to those of the SVM, with fewer hyperparameters than the SVM and without the unnatural points produced by RF.
There are some challenges involved with the proposed method, despite its high performance and novel characteristics. First, an appropriate dimensionality compression method must be identified to apply the proposed classifier to datasets with high dimensionality. Another issue is that the classification time is longer than for other classifiers. To solve this problem, improving the efficiency of data summarization and computation is required.
For future work, we expect the DNRS to be effective in applications with ambiguous classes, such as land cover classification, a remote sensing technique where the condition attribute is numerical data and the decision attribute is nominal data.

Author Contributions

Conceptualization, Y.I. and T.K.; methodology, Y.I. and T.K.; software, Y.I.; validation, Y.I. and T.K.; formal analysis, Y.I.; investigation, Y.I.; data curation, Y.I.; writing—original draft preparation, Y.I.; writing—review and editing, Y.I., K.I. and T.K.; visualization, Y.I.; supervision, K.I.; project administration, T.K.; funding acquisition, Y.I. and T.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JSPS KAKENHI, grant numbers JP21J15348 and JP19K06307.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Pawlak, Z. Rough sets. Int. J. Comput. Inf. Sci. 1982, 11, 341–356. [Google Scholar] [CrossRef]
  2. Pawlak, Z. Rough Sets: Theoretical Aspects of Reasoning About Data, 1st ed.; Kluwer Academic Publishers: Dordrecht, The Netherland, 1991; pp. 1–231. [Google Scholar]
  3. Grzymala-Busse, J.W. Knowledge acquisition under uncertainty—A rough set approach. J. Intell. Robot. Syst. 1988, 1, 3–16. [Google Scholar] [CrossRef]
  4. Tsumoto, S. Automated extraction of medical expert system rules from clinical databases based on rough set theory. Inf. Sci. 1998, 112, 67–84. [Google Scholar] [CrossRef]
  5. Shan, N.; Ziarko, W. An incremental learning algorithm for constructing decision rules. In Rough Sets, Fuzzy Sets and Knowledge Discovery; Springer: London, UK, 1994; pp. 326–334. [Google Scholar]
  6. Pawlak, Z.; Skowron, A. Rudiments of rough sets. Inf. Sci. 2007, 177, 3–27. [Google Scholar] [CrossRef]
  7. Guan, L.; Wang, G. Generalized Approximations Defined by Non-Equivalence Relations. Inf. Sci. 2012, 193, 163–179. [Google Scholar] [CrossRef]
  8. Ciucci, D.; Mihálydeák, T.; Csajbók, Z.E. On exactness, definability and vagueness in partial approximation spaces. Tech. Sci. Univ. Warm. Maz. Olsztyn 2015, 18, 203–212. [Google Scholar]
  9. Ishii, Y.; Bagan, H.; Iwao, K.; Kinoshita, T. A new land cover classification method using grade-added rough sets. IEEE Geosci. Remote Sens. Lett. 2021, 18, 8–12. [Google Scholar] [CrossRef]
  10. Li, W.; Huang, Z.; Jia, X.; Cai, X. Neighborhood based decision-theoretic rough set models. Int. J. Approx. Reason. 2016, 69, 1–17. [Google Scholar] [CrossRef]
  11. García, S.; Luengo, J.; Sáez, J.A.; López, V.; Herrera, F. A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Trans. Knowl. Data Eng. 2013, 25, 734–750. [Google Scholar] [CrossRef]
12. Dwiputranto, T.H.; Setiawan, N.A.; Adji, T.B. Rough-Set-Theory-Based Classification with Optimized k-Means Discretization. Technologies 2022, 10, 51.
13. Li, X.; Shen, Y. Discretization Algorithm for Incomplete Economic Information in Rough Set Based on Big Data. Symmetry 2020, 12, 1245.
14. Hu, Q.; Xie, Z.; Yu, D. Hybrid attribute reduction based on a novel fuzzy-rough model and information granulation. Pattern Recognit. 2007, 40, 3509–3521.
15. Dubois, D.; Prade, H. Rough fuzzy sets and fuzzy rough sets. Int. J. Gen. Syst. 1990, 17, 191–209.
16. Zadeh, L.A. Fuzzy Sets. In Fuzzy Sets, Fuzzy Logic, and Fuzzy Systems; World Scientific: Singapore, 1965; pp. 394–432. ISBN 9780784413616.
17. Ji, W.; Pang, Y.; Jia, X.; Wang, Z.; Hou, F.; Song, B.; Liu, M.; Wang, R. Fuzzy rough sets and fuzzy rough neural networks for feature selection: A review. Wiley Data Min. Knowl. Discov. 2021, 11, 1–15.
18. Yang, X.; Chen, H.; Li, T.; Luo, C. A Noise-Aware Fuzzy Rough Set Approach for Feature Selection. Knowl. Based Syst. 2022, 250, 109092.
19. Che, X.; Chen, D.; Mi, J. Label Correlation in Multi-Label Classification Using Local Attribute Reductions with Fuzzy Rough Sets. Fuzzy Sets Syst. 2022, 426, 121–144.
20. Wang, C.; Huang, Y.; Ding, W.; Cao, Z. Attribute Reduction with Fuzzy Rough Self-Information Measures. Inf. Sci. 2021, 549, 68–86.
21. Hu, Q.; Yu, D.; Liu, J.; Wu, C. Neighborhood rough set based heterogeneous feature subset selection. Inf. Sci. 2008, 178, 3577–3594.
22. Yao, Y.; Yao, B. Covering based rough set approximations. Inf. Sci. 2012, 200, 91–107.
23. Xie, J.; Hu, B.Q.; Jiang, H. A novel method to attribute reduction based on weighted neighborhood probabilistic rough sets. Int. J. Approx. Reason. 2022, 144, 1–17.
24. Wang, C.; Huang, Y.; Shao, M.; Hu, Q.; Chen, D. Feature Selection Based on Neighborhood Self-Information. IEEE Trans. Cybern. 2020, 50, 4031–4042.
25. Yang, X.; Liang, S.; Yu, H.; Gao, S.; Qian, Y. Pseudo-Label Neighborhood Rough Set: Measures and Attribute Reductions. Int. J. Approx. Reason. 2019, 105, 112–129.
26. Sun, L.; Wang, L.; Ding, W.; Qian, Y.; Xu, J. Feature Selection Using Fuzzy Neighborhood Entropy-Based Uncertainty Measures for Fuzzy Neighborhood Multigranulation Rough Sets. IEEE Trans. Fuzzy Syst. 2021, 29, 19–33.
27. Xu, J.; Shen, K.; Sun, L. Multi-Label Feature Selection Based on Fuzzy Neighborhood Rough Sets. Complex Intell. Syst. 2022, 8, 2105–2129.
28. Skowron, A.; Stepaniuk, J. Tolerance approximation spaces. Fundam. Inform. 1996, 27, 245–253.
29. Parthaláin, N.M.; Shen, Q. Exploring the boundary region of tolerance rough sets for feature selection. Pattern Recognit. 2009, 42, 655–667.
30. Zhao, S.; Tsang, E.C.C.; Chen, D.; Wang, X. Building a rule-based classifier—A fuzzy-rough set approach. IEEE Trans. Knowl. Data Eng. 2010, 22, 624–638.
31. Hu, Q.; Yu, D.; Xie, Z. Neighborhood classifiers. Expert Syst. Appl. 2008, 34, 866–876.
32. Kumar, S.U.; Inbarani, H.H. A novel neighborhood rough set based classification approach for medical diagnosis. Procedia Comput. Sci. 2015, 47, 351–359.
33. Kim, D. Data Classification based on tolerant rough set. Pattern Recognit. 2001, 34, 1613–1624.
34. Mori, N.; Takanashi, R. Knowledge acquisition from the data consisting of categories added with degrees of conformity. Kansei Eng. Int. 2000, 1, 19–24.
35. Pawlak, Z. Information systems theoretical foundations. Inf. Syst. 1981, 6, 205–218.
36. Mani, A.; Radeleczki, S. Algebraic approach to directed rough sets. arXiv 2020.
37. Mani, A. Comparative approaches to granularity in general rough sets. In Rough Sets; Bello, R., Miao, D., Falcon, R., Nakata, M., Rosete, A., Ciucci, D., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 500–517. ISBN 978-3-030-52705-1.
38. Yu, B.; Cai, M.; Li, Q. A λ-rough set model and its applications with TOPSIS method to decision making. Knowl. Based Syst. 2019, 165, 420–431.
39. Dua, D.; Graff, C. UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ml (accessed on 25 September 2022).
40. Ilkay, C.; Murat, K.; Sakir, T. Classification of raisin grains using machine vision and artificial intelligence methods. Gazi Muhendis. Bilim. Derg. 2020, 6, 200–209.
41. Ilkay, C.; Murat, K. Classification of rice varieties using artificial intelligence methods. Int. J. Intell. Syst. Appl. Eng. 2019, 7, 188–194.
42. Rohra, J.G.; Perumal, B.; Narayanan, S.J.; Thakur, P.; Bhatt, R.B. User localization in an indoor environment using fuzzy hybrid of particle swarm optimization & gravitational search algorithm with neural networks. Adv. Intell. Syst. Comput. 2019, 741, 217–225.
43. Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, PA, USA, 27–29 July 1992; pp. 144–152.
44. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
45. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
46. Boateng, E.Y.; Otoo, J.; Abaye, D.A. Basic tenets of classification algorithms k-nearest-neighbor, support vector machine, random forest and neural network: A review. J. Data Anal. Inf. Process. 2020, 8, 341–357.
47. Sheykhmousa, M.; Mahdianpari, M.; Ghanbari, H.; Mohammadimanesh, F.; Ghamisi, P.; Homayouni, S. Support vector machine versus random forest for remote sensing image classification: A meta-analysis and systematic review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 6308–6325.
Figure 1. Example of the directional neighborhoods for an object x when B = {Attribute 1, Attribute 2} and δ = 4. The gray lines represent hyperplanes in each dimension centered on the object x. The vertical and horizontal axes in this figure show the feature space with the grade multiplied by 100, and a part of the feature space is expanded.
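As a reading aid for Figure 1, the following is a minimal sketch of how such orthant-wise (directional) neighborhoods could be enumerated in code. It assumes that the neighborhood of an object x in a given orthant collects the objects whose attribute-wise differences from x all carry that orthant's signs and stay within δ; this sign-and-threshold convention, and every name in the snippet, is an illustrative assumption rather than the paper's exact definition.

```python
import itertools

import numpy as np


def directional_neighborhoods(x, X, delta):
    """Group the objects in X into orthant-wise neighborhoods of x.

    x     : 1-D array, the reference object restricted to the attributes in B.
    X     : 2-D array, the remaining objects restricted to the same attributes.
    delta : neighborhood radius along each attribute.

    Returns a dict mapping a sign pattern such as (1, -1) to the row indices of
    the objects lying in that orthant of x within delta along every attribute.
    """
    diff = X - x                                                  # attribute-wise differences
    neighborhoods = {}
    for signs in itertools.product((1, -1), repeat=x.shape[0]):   # 2^|B| orthants
        s = np.asarray(signs)
        inside = np.all((diff * s >= 0) & (np.abs(diff) <= delta), axis=1)
        neighborhoods[signs] = np.flatnonzero(inside)
    return neighborhoods


# Toy usage with two attributes, as in Figure 1, and delta = 4 on the grade-times-100 scale.
rng = np.random.default_rng(0)
data = rng.integers(0, 20, size=(30, 2)).astype(float)
print(directional_neighborhoods(data[0], data[1:], delta=4.0))
```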
Figure 2. Comparison of the spatial distribution of the final decision rule between the GRS and the DNRS in a two-dimensional feature space. The vertical and horizontal axes in this figure show the feature space with the grade multiplied by 100, and a part of the feature space is expanded. The grade differences shown in the cells are also multiplied by 100.
Figure 3. Flow of the DNRS model.
Figure 4. Comparison of DN-lower approximation classification (step 1), DN-upper approximation classification (step 2), and exceptional processing (step 3) at each number of attributes for each UCI dataset.
Figure 5. Visualized classification results of the DNRS and GRS in a two-dimensional feature space using one of the 10-fold cross-validation datasets of attributes 1 and 2 in the Iris dataset. Black indicates unclassified.
Figure 6. Learning curves for classification using the Raisin 10-fold cross-validation dataset with attributes 4, 6, and 7.
Figure 7. Visualized classification results in a two-dimensional feature space using one of the 10-fold cross-validation datasets of attributes 1 and 5 in the Wireless dataset.
Table 1. Decision matrix for class k = 1, orthant l = 1, p objects in target class, and q objects in other classes.

      | y_1         | y_2         | … | y_q
x_1   | M_{11}^{11} | M_{12}^{11} | … | M_{1q}^{11}
x_2   | M_{21}^{11} | M_{22}^{11} | … | M_{2q}^{11}
x_p   | M_{p1}^{11} | M_{p2}^{11} | … | M_{pq}^{11}
Table 2. UCI datasets used in the experiments.

Dataset Name | Number of Instances | Number of Condition Attributes | Number of Classes | Correspondence between the Attributes and the No. Used in This Article
Banknote | 1372 | 4 | 2 | 1: Variance of Wavelet Transformed image; 2: Skewness of Wavelet Transformed image; 3: Kurtosis of Wavelet Transformed image; 4: Entropy of image
Iris | 150 | 4 | 3 | 1: Sepal length; 2: Sepal width; 3: Petal length; 4: Petal width
Raisin [40] | 900 | 7 | 2 | 1: Area; 2: Perimeter; 3: Major Axis Length; 4: Minor Axis Length; 5: Eccentricity; 6: Convex Area; 7: Extent
Rice [41] | 3810 | 7 | 2 | 1: Area; 2: Perimeter; 3: Major Axis Length; 4: Minor Axis Length; 5: Eccentricity; 6: Convex Area; 7: Extent
Wireless [42] | 2000 | 7 | 4 | 1: WS1; 2: WS2; 3: WS3; 4: WS4; 5: WS5; 6: WS6; 7: WS7
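Each dataset in Table 2 is evaluated with 10-fold cross-validation. The snippet below is a minimal sketch of such a split; the stratification, shuffling, and random seed are assumptions of this sketch, and the stand-in data merely mimics the shape of the Wireless dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def ten_fold_splits(X, y, seed=0):
    """Yield (train_idx, test_idx) index pairs for 10-fold cross-validation."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    yield from skf.split(X, y)


# Stand-in data shaped like the Wireless dataset in Table 2
# (2000 instances, 7 condition attributes, 4 classes).
X = np.random.default_rng(0).normal(size=(2000, 7))
y = np.tile(np.arange(4), 500)
for train_idx, test_idx in ten_fold_splits(X, y):
    print(len(train_idx), len(test_idx))   # 1800 training / 200 test objects per fold
```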
Table 3. Range of hyperparameters.
Classifier | Hyperparameters
DNRS | Delta(t); t: 1–20
RF | Max_depth = None; N_estimators = 50, 100, 300, 500; Max_features = sqrt, log2; Criterion = Gini
SVM | C = 0.01, 0.1, 1, 10, 100, 1000; Gamma = 0.001, 0.01, 0.1, 1, 10; Kernel = rbf
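Read as scikit-learn parameter grids, the RF and SVM ranges in Table 3 look as follows. The number of cross-validation folds and the accuracy scoring in this sketch are assumptions, not settings reported in this article.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hyperparameter ranges taken from the RF and SVM rows of Table 3.
rf_grid = {
    "max_depth": [None],
    "n_estimators": [50, 100, 300, 500],
    "max_features": ["sqrt", "log2"],
    "criterion": ["gini"],
}
svm_grid = {
    "C": [0.01, 0.1, 1, 10, 100, 1000],
    "gamma": [0.001, 0.01, 0.1, 1, 10],
    "kernel": ["rbf"],
}


def tune(estimator, grid, X, y):
    # 10-fold cross-validation and accuracy scoring are assumptions of this sketch.
    search = GridSearchCV(estimator, grid, cv=10, scoring="accuracy", n_jobs=-1)
    search.fit(X, y)
    return search.best_params_, search.best_score_


if __name__ == "__main__":
    # Iris ships with scikit-learn; the other datasets in Table 2 come from the UCI repository.
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    print(tune(RandomForestClassifier(random_state=0), rf_grid, X, y))
    print(tune(SVC(), svm_grid, X, y))
```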
Table 4. Optimal hyperparameters of the DNRS model for Banknote.
Number of Attributes | Combination of Attributes (Step 1: DN-Lower Approximation Classification) | t (Step 1) | Combination of Attributes (Steps 2, 3: DN-Lower and DN-Upper Approximation Classification) | t (Steps 2, 3) | Mean Rate of DN-Lower Approximation Training Data
1 | 1 | 2 | 1 | 17 | 46.2%
2 | 1, 2 | 3 | 1, 2 | 4 | 96.6%
3 | 1, 2, 3 | 8 | 1, 2, 3 | 8 | 100.0%
4 | 1, 2, 3, 4 | 8 | 1, 2, 3, 4 | 8 | 100.0%
Table 5. Optimal hyperparameters of the DNRS model for Iris.
Number of Attributes | Combination of Attributes (Step 1: DN-Lower Approximation Classification) | t (Step 1) | Combination of Attributes (Steps 2, 3: DN-Lower and DN-Upper Approximation Classification) | t (Steps 2, 3) | Mean Rate of DN-Lower Approximation Training Data
1 | 3 | 6 | 4 | 17 | 75.9%
2 | 3, 4 | 9 | 3, 4 | 14 | 96.8%
3 | 2, 3, 4 | 11 | 2, 3, 4 | 11 | 99.9%
4 | 1, 2, 3, 4 | 9 | 1, 2, 3, 4 | 9 | 100.0%
Table 6. Optimal hyperparameters of the DNRS model for Raisin.

Number of Attributes | Combination of Attributes (Step 1: DN-Lower Approximation Classification) | t (Step 1) | Combination of Attributes (Steps 2, 3: DN-Lower and DN-Upper Approximation Classification) | t (Steps 2, 3) | Mean Rate of DN-Lower Approximation Training Data
1 | 2 | 2 | 7 | 18 | 42.0%
2 | 2, 5 | 3 | 6, 7 | 8 | 72.6%
3 | 2, 3, 6 | 3 | 4, 6, 7 | 11 | 83.6%
4 | 1, 3, 6, 7 | 5 | 1, 3, 6, 7 | 5 | 100.0%
4 | 2, 4, 6, 7 | 3 | 2, 4, 6, 7 | 3 | 100.0%
5 | 1, 4, 5, 6, 7 | 9 | 1, 3, 5, 6, 7 | 9 | 100.0%
5 | | | 1, 4, 5, 6, 7 | 9 | 100.0%
6 | 1, 2, 3, 5, 6, 7 | 3 | 1, 2, 3, 5, 6, 7 | 3 | 100.0%
7 | 1, 2, 3, 4, 5, 6, 7 | 6 | 1, 2, 3, 4, 5, 6, 7 | 6 | 100.0%
Table 7. Optimal hyperparameters of the DNRS model for Rice.
Number of Attributes | Combination of Attributes (Step 1: DN-Lower Approximation Classification) | t (Step 1) | Combination of Attributes (Steps 2, 3: DN-Lower and DN-Upper Approximation Classification) | t (Steps 2, 3) | Mean Rate of DN-Lower Approximation Training Data
1 | 3 | 2 | 3 | 15 | 69.4%
2 | 1, 5 | 3 | 4, 6 | 19 | 77.5%
2 | 3, 5 | 3 | | |
3 | 1, 2, 3 | 3 | 1, 5, 6 | 5 | 97.5%
3 | 1, 2, 5 | 3 | 2, 3, 7 | 20 | 86.9%
3 | 2, 4, 6 | 4 | 3, 5, 7 | 15 | 88.2%
3 | | | 4, 6, 7 | 11 | 93.6%
4 | 1, 2, 3, 5 | 3 | 1, 2, 3, 4 | 8 | 100.0%
5 | 1, 2, 3, 5, 6 | 3 | 1, 2, 3, 5, 6 | 3 | 100.0%
5 | 1, 3, 5, 6, 7 | 6 | 1, 3, 5, 6, 7 | 6 | 100.0%
6 | 1, 3, 4, 5, 6, 7 | 8 | 1, 3, 4, 5, 6, 7 | 8 | 100.0%
7 | 1, 2, 3, 4, 5, 6, 7 | 5 | 1, 2, 3, 4, 5, 6, 7 | 5 | 100.0%
Table 8. Optimal hyperparameters of the DNRS model for Wireless.
Number of Attributes | Combination of Attributes (Step 1: DN-Lower Approximation Classification) | t (Step 1) | Combination of Attributes (Steps 2, 3: DN-Lower and DN-Upper Approximation Classification) | t (Steps 2, 3) | Mean Rate of DN-Lower Approximation Training Data
1 | 5 | 2 | 1 | 9 | 21.5%
2 | 1, 5 | 7 | 1, 5 | 20 | 89.5%
3 | 1, 4, 5 | 5 | 1, 4, 5 | 17 | 97.6%
4 | 1, 4, 5, 7 | 12 | 1, 4, 5, 7 | 12 | 99.6%
5 | 1, 3, 4, 5, 6 | 8 | 1, 3, 4, 5, 6 | 8 | 100.0%
5 | | | 1, 4, 5, 6, 7 | 4 | 100.0%
6 | 1, 3, 4, 5, 6, 7 | 5 | 1, 3, 4, 5, 6, 7 | 1 | 100.0%
7 | 1, 2, 3, 4, 5, 6, 7 | 19 | 1, 2, 3, 4, 5, 6, 7 | 1 | 100.0%
Table 9. Total confusion matrix of 10-fold cross-validation of classification results with the GRS with 90 samples of training data using attributes 4, 6, and 7 of the Raisin dataset.
Predicted Class \ True Class | Class 1 | Class 2
Class 1 | 323 | 61
Class 2 | 45 | 359
Unclassified | 82 | 30
Table 10. Total confusion matrix of 10-fold cross-validation of classification results with the GRS with 810 samples of training data using attributes 4, 6, and 7 of the Raisin dataset.
Predicted Class \ True Class | Class 1 | Class 2
Class 1 | 206 | 17
Class 2 | 25 | 328
Unclassified | 219 | 105
Table 11. Total confusion matrix of 10-fold cross-validation of classification results with the DNRS with 90 samples of training data using attributes 4, 6, and 7 of the Raisin dataset.
Predicted Class \ True Class | Class 1 | Class 2
Class 1 | 319 | 83
Class 2 | 59 | 367
Unclassified | 0 | 0
Table 12. Total confusion matrix of 10-fold cross-validation of classification results with the DNRS with 810 samples of training data using attributes 4, 6, and 7 of the Raisin dataset.
Predicted Class \ True Class | Class 1 | Class 2
Class 1 | 402 | 64
Class 2 | 48 | 386
Unclassified | 0 | 0
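For convenience, the overall accuracy and unclassified rate implied by the confusion matrices in Tables 9–12 can be recomputed as below. Treating unclassified objects as errors is an assumption of this sketch, not a statement about how the reported accuracies were computed.

```python
import numpy as np


def summarize(matrix):
    """Rows: predicted class 1, predicted class 2, unclassified;
    columns: true class 1, true class 2 (the layout of Tables 9-12)."""
    m = np.asarray(matrix)
    accuracy = (m[0, 0] + m[1, 1]) / m.sum()       # unclassified objects count as errors here
    unclassified_rate = m[2].sum() / m.sum()
    return accuracy, unclassified_rate


# Table 9 (GRS, 90 training samples): a noticeable unclassified fraction.
print(summarize([[323, 61], [45, 359], [82, 30]]))
# Table 11 (DNRS, 90 training samples): no unclassified objects.
print(summarize([[319, 83], [59, 367], [0, 0]]))
```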
Table 13. Comparison of classification accuracy and Dunnett’s test.
Dataset | DNRS | RF | SVM | Dunnett's Test (D-Value = 2.333): DNRS vs. RF | DNRS vs. SVM
Banknote | 0.999 ± 0.002 (4) | 0.994 ± 0.005 (4) | 1.000 ± 0.000 (3) | * 3.196 | −0.454
Iris | 0.980 ± 0.032 (2) | 0.973 ± 0.034 (2) | 0.973 ± 0.034 (2) | 0.442 | 0.442
Raisin | 0.876 ± 0.036 (3) | 0.876 ± 0.032 (4) | 0.876 ± 0.028 (4) | 0.000 | 0.000
Rice | 0.932 ± 0.012 (2) | 0.927 ± 0.011 (1) | 0.933 ± 0.014 (5) | 0.948 | −0.190
Wireless | 0.982 ± 0.004 (4) | 0.983 ± 0.009 (6) | 0.985 ± 0.008 (6) | −0.441 | −1.029
Mean | 0.954 | 0.950 | 0.953 | |
* Significant difference at the 5% significance level. The number in parentheses after each accuracy is the number of attributes used.
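The last two columns of Table 13 report Dunnett's test statistics with DNRS as the control. The per-fold accuracies behind them are not listed in this article, so the snippet below only sketches how such a comparison could be run with SciPy's implementation of Dunnett's test (SciPy 1.11 or later). The fold-level arrays are placeholders, not the paper's data, and SciPy's sign convention need not match the table's.

```python
import numpy as np
from scipy.stats import dunnett   # requires SciPy >= 1.11

rng = np.random.default_rng(0)

# Placeholder 10-fold accuracies (NOT the paper's data), drawn around the
# Iris means and standard deviations in Table 13, purely for illustration.
dnrs = rng.normal(0.980, 0.032, size=10)   # control classifier
rf = rng.normal(0.973, 0.034, size=10)
svm = rng.normal(0.973, 0.034, size=10)

# Compare each classifier against DNRS, mirroring the DNRS vs. RF / DNRS vs. SVM columns.
result = dunnett(rf, svm, control=dnrs)
print(result.statistic)   # one statistic per comparison group
print(result.pvalue)
```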
Table 14. Classification time for UCI datasets (in seconds).
Classification time for the largest number of attributes:

Dataset | DNRS | RF | SVM
Banknote | 0.073 ± 0.007 | 0.052 ± 0.013 | 0.037 ± 0.010
Iris | 0.057 ± 0.009 | 0.026 ± 0.010 | 0.010 ± 0.007
Raisin | 0.135 ± 0.039 | 0.188 ± 0.027 | 0.026 ± 0.009
Rice | 0.468 ± 0.034 | 0.156 ± 0.023 | 0.112 ± 0.013
Wireless | 0.325 ± 0.028 | 0.182 ± 0.030 | 0.035 ± 0.011