Article

Multi-Label Feature Selection Combining Three Types of Conditional Relevance

1
College of Computer Science and Technology, Jilin University, Changchun 130012, China
2
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
*
Author to whom correspondence should be addressed.
Entropy 2021, 23(12), 1617; https://doi.org/10.3390/e23121617
Submission received: 27 October 2021 / Revised: 19 November 2021 / Accepted: 25 November 2021 / Published: 1 December 2021
(This article belongs to the Topic Machine and Deep Learning)

Abstract:
With the rapid growth of the Internet, the curse of dimensionality caused by massive multi-label data has attracted extensive attention. Feature selection plays an indispensable role in dimensionality reduction, and many researchers have studied it from an information-theoretic perspective. Here, to evaluate feature relevance, a novel feature relevance term (FR) is designed that employs three incremental information terms to comprehensively consider three key aspects: candidate features, selected features, and label correlations. Thoroughly examining these three aspects makes it easier to capture the optimal features. Moreover, we employ the label-related feature redundancy as a label-related feature redundancy term (LR) to reduce unnecessary redundancy. A multi-label feature selection method that integrates FR with LR is therefore proposed, named Feature Selection Combining Three Types of Conditional Relevance (TCRFS). Numerous experiments indicate that TCRFS outperforms 6 other state-of-the-art multi-label approaches on 13 multi-label benchmark data sets from 4 domains.

1. Introduction

In recent years, multi-label learning [1,2,3,4] has become increasingly popular in applications such as text categorization [5], image annotation [6], protein function prediction [7], etc. Additionally, feature selection is of great significance for solving industrial application problems. Some researchers monitor the wind speed in the wake region to detect wind farm faults based on feature selection [8]. In signal processing applications, feature selection is effective for chatter vibration diagnosis in CNC machines [9]. Feature selection has also been adopted to classify cutting stabilities based on the selected features [10]. The most crucial task in diverse multi-label applications is to classify each sample and its corresponding labels accurately. Multi-label learning, like traditional classification approaches, is vulnerable to the curse of dimensionality. The number of features in multi-label text data frequently reaches the tens of thousands, which means that there are many redundant or irrelevant features [11,12]. This easily leads to the "curse of dimensionality", which dramatically increases model complexity and computation time [13]. Feature selection is the process of selecting a subset of distinguishing features from the original data set according to specific evaluation criteria. Redundant or irrelevant features can be eliminated to improve model accuracy and reduce feature dimensionality, feature space, and running time [14,15]. Simultaneously, the selected features are more conducive to model understanding and data analysis.
In traditional machine learning problems, feature selection approaches include wrapper, embedded, and filter approaches [16,17,18,19]. Among them, wrapper feature selection approaches use the classifier performance to weigh the pros and cons of a feature subset, which entails high computational complexity and a large memory footprint [20,21]. In embedded approaches, the processes of feature selection and learner training are combined [22,23]: feature selection is conducted automatically during the training procedure, as the two are completed in the same optimization procedure. Filter feature selection approaches weigh the pros and cons of feature subsets using specific evaluation criteria [24,25]. They are independent of the classifier, and the calculation is fast and straightforward. As a result, filter feature selection approaches are generally used for feature selection.
The three above-mentioned feature selection approaches also exist in multi-label feature selection, with filter feature selection being the most popular. Information theory is a standard mathematical tool for filter feature selection [26]. Based on information theory, this paper mainly focuses on three key aspects that affect feature relevance: candidate features, selected features, and label correlations. The method proposed in this paper examines the amount of information shared between the selected feature subset and the total label set to evaluate feature relevance, denoted ΔI for the time being. Once a candidate feature is selected into the current selected feature subset, the subset is updated and ΔI is altered accordingly. Moreover, the original label correlations in the total label set also affect ΔI as new candidate features are added to the current selected feature subset. Hence, three incremental information terms that combine candidate features, selected features, and label correlations are used to design a novel feature relevance term. Furthermore, we employ the label-related feature redundancy as the feature redundancy term to reduce unnecessary redundancy. Table 1 lists the three abbreviations we mentioned and their corresponding meanings. We explain them in detail in Section 4.
The major contributions of this paper are as follows:
  • Analyze and discuss the indispensability of the three key aspects (candidate features, selected features and label correlations) for feature relevance evaluation;
  • Three incremental information terms taking three key aspects into account are used to express three types of conditional relevance. Then, FR combining the three incremental information terms is designed;
  • A designed multi-label feature selection method that integrates FR with LR is proposed, namely TCRFS;
  • TCRFS is compared to 6 state-of-the-art multi-label feature selection methods on 13 benchmark multi-label data sets using 4 evaluation criteria, and its efficacy is certified in numerous experiments.
The rest of this paper is structured as follows. Section 2 introduces the preliminary theoretical knowledge of this paper: information theory and the four evaluation criteria used in our experiments. Related works are reviewed in Section 3. Section 4 combines three types of conditional relevance to design FR and proposes TCRFS, which integrates FR with LR. The efficacy of TCRFS is proven by comparing it with 6 multi-label methods on 13 benchmark data sets applying 4 evaluation criteria in Section 5. Section 6 concludes our work in this paper.

2. Preliminaries

2.1. Information Theory for Multi-Label Feature Selection

Information theory is a popular and effective means of tackling the problem of multi-label feature selection [27,28,29]. It is used to measure the correlation between random variables [30], and its fundamentals are covered in this subsection.
Assume that the selected feature subset S = {f_1, f_2, …, f_n} and the label set L = {l_1, l_2, …, l_m}. To convey feature relevance, we typically employ I(S; L), the mutual information between the selected feature subset and the total label set. Mutual information is a measure in information theory; it can be seen as the amount of information contained in one random variable about another random variable. Assume two discrete random variables X = {x_1, x_2, …, x_n} and Y = {y_1, y_2, …, y_m}; then the mutual information between X and Y can be represented as I(X; Y). Its expansion formula is as follows:
I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)  (1)
where H ( X ) denotes the information entropy of X, and H ( X | Y ) denotes the conditional entropy of X given Y. Information entropy is a concept used to measure the amount of information in information theory. H ( X ) is defined as:
H(X) = - Σ_{i=1}^{n} p(x_i) log p(x_i)  (2)
where p(x_i) represents the probability distribution of x_i, and the base of the logarithm is 2. The conditional entropy H(X | Y) is defined as the expectation over Y of the entropy of the conditional probability distribution of X given Y:
H(X | Y) = - Σ_{i=1}^{n} Σ_{j=1}^{m} p(x_i, y_j) log p(x_i | y_j)  (3)
where p(x_i, y_j) and p(x_i | y_j) represent the joint probability distribution of (x_i, y_j) and the conditional probability distribution of x_i given y_j, respectively. H(X | Y) can also be represented as follows:
H(X | Y) = H(X, Y) - H(Y)  (4)
where H ( X , Y ) is another measure in information theory, namely, the joint entropy. Its definition is as follows:
H(X, Y) = - Σ_{i=1}^{n} Σ_{j=1}^{m} p(x_i, y_j) log p(x_i, y_j)  (5)
According to Equation (4), combining the relationships between the three different measures of the amount of information, the mutual information I(X; Y) can alternatively be written as follows:
I(X; Y) = H(X) + H(Y) - H(X, Y)  (6)
It is common in multi-label feature selection to have more than two random variables; assume another discrete random variable Z = {z_1, z_2, …, z_q}. The conditional mutual information I(X; Y | Z) expresses the expected value of the mutual information of two discrete random variables X and Y given the value of a third discrete variable Z. It is represented as follows:
I(X; Y | Z) = I(X, Z; Y) - I(Y; Z) = I(X; Y, Z) - I(X; Z) = I(X; Y) - I(X; Y; Z)  (7)
where I ( X , Z ; Y ) is the joint mutual information and I ( X ; Y ; Z ) is the interaction information. Their expansion formulas are as follows:
I(X, Z; Y) = I(X; Y | Z) + I(Y; Z) = I(Y; Z | X) + I(X; Y)  (8)
I(X; Y; Z) = I(X; Y) + I(X; Z) - I(X; Y, Z) = I(X; Y) - I(X; Y | Z)  (9)
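These measures can be checked numerically. The following Python sketch (our own helper names; empirical plug-in estimates with log base 2) implements the quantities above and verifies the identities on toy samples:

```python
from collections import Counter
from math import log2

def entropy(xs):
    """H(X) = -sum p(x) log2 p(x), estimated from a sample (Equation (2))."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def joint_entropy(xs, ys):
    """H(X, Y) from paired samples (Equation (5))."""
    return entropy(list(zip(xs, ys)))

def cond_entropy(xs, ys):
    """H(X | Y) = H(X, Y) - H(Y) (Equation (4))."""
    return joint_entropy(xs, ys) - entropy(ys)

def mutual_info(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) (Equation (6))."""
    return entropy(xs) + entropy(ys) - joint_entropy(xs, ys)

def cond_mutual_info(xs, ys, zs):
    """I(X; Y | Z) = I(X; Y, Z) - I(X; Z) (Equation (7))."""
    return mutual_info(xs, list(zip(ys, zs))) - mutual_info(xs, zs)

def interaction_info(xs, ys, zs):
    """I(X; Y; Z) = I(X; Y) - I(X; Y | Z) (Equation (9))."""
    return mutual_info(xs, ys) - cond_mutual_info(xs, ys, zs)

# Toy samples (our own) to check the identities numerically.
X = [0, 0, 1, 1, 0, 1, 1, 0]
Y = [0, 1, 1, 1, 0, 1, 0, 0]
Z = [1, 1, 0, 1, 0, 0, 1, 0]
# Equation (1): I(X;Y) = H(X) - H(X|Y)
assert abs(mutual_info(X, Y) - (entropy(X) - cond_entropy(X, Y))) < 1e-12
# Equation (8): I(X,Z;Y) = I(X;Y|Z) + I(Y;Z)
assert abs(mutual_info(list(zip(X, Z)), Y)
           - (cond_mutual_info(X, Y, Z) + mutual_info(Y, Z))) < 1e-12
```

These plug-in estimators are what a filter method computes in practice; the identities hold exactly for any empirical joint distribution.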

2.2. Evaluation Criteria for Multi-Label Feature Selection

In our experiments, we employ four distinct evaluation criteria to confirm the efficacy of TCRFS. The four criteria fall into two categories: label-based evaluation criteria and example-based evaluation criteria [31]. The label-based evaluation criteria include Macro-F_1 and Micro-F_1 [32]; the higher the value of these two indicators, the better the classification effect. Macro-F_1 first calculates the F_1-score of each of the q categories and then averages them as follows:
Macro-F_1 = (1/q) Σ_{i=1}^{q} [ 2TP_i / (2TP_i + FP_i + FN_i) ]  (10)
where TP_i, FP_i, and FN_i represent the true positives, false positives, and false negatives in the i-th category, respectively. Micro-F_1 calculates the confusion matrix of each category, sums the confusion matrices to obtain a multi-category confusion matrix, and then calculates the F_1-score as follows:
Micro-F_1 = Σ_{i=1}^{q} 2TP_i / Σ_{i=1}^{q} (2TP_i + FP_i + FN_i)  (11)
The example-based evaluation criteria include the Hamming Loss (HL) and Zero One Loss (ZOL) [33]. The lower the value of these two indicators, the better the classification effect. HL is a metric for the number of times a label is misclassified; that is, a label belonging to a sample is not predicted, or a label not belonging to the sample is predicted to belong to it. Suppose that D = {(x_i, Y_i) | 1 ≤ i ≤ m} is a labeled test set, where Y_i ⊆ Y is the set of class labels corresponding to x_i and Y is the label space with q categories. The definition of HL is as follows:
HL = (1/m) Σ_{i=1}^{m} (1/q) | Y_i ⊕ Y_i' |  (12)
where ⊕ denotes the XOR operation and Y_i' denotes the predicted label set corresponding to x_i. The other example-based criterion, ZOL, is defined as follows:
ZOL = (1/m) Σ_{i=1}^{m} δ(argmax_{y∈Y} h(x_i, y))  (13)
If the predicted label subset and the true label subset do not match exactly, the sample's ZOL score is 1 (i.e., δ = 1); if there is no error, the score is 0 (i.e., δ = 0).
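As an illustration, the four criteria can be computed directly from binary label matrices. The sketch below is our own minimal implementation (the names `Y_true` and `Y_pred` are assumptions, and ZOL is reduced to exact-match counting over the predicted label sets):

```python
def _counts(Y_true, Y_pred, j):
    """TP, FP, FN for the j-th label over all samples (0/1 entries)."""
    tp = sum(t[j] and p[j] for t, p in zip(Y_true, Y_pred))
    fp = sum((not t[j]) and p[j] for t, p in zip(Y_true, Y_pred))
    fn = sum(t[j] and (not p[j]) for t, p in zip(Y_true, Y_pred))
    return tp, fp, fn

def macro_f1(Y_true, Y_pred):
    """Average the per-label F1-scores (Equation (10))."""
    q = len(Y_true[0])
    f1s = []
    for j in range(q):
        tp, fp, fn = _counts(Y_true, Y_pred, j)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / q

def micro_f1(Y_true, Y_pred):
    """Pool the per-label confusion counts, then compute F1 (Equation (11))."""
    TP = FP = FN = 0
    for j in range(len(Y_true[0])):
        tp, fp, fn = _counts(Y_true, Y_pred, j)
        TP, FP, FN = TP + tp, FP + fp, FN + fn
    return 2 * TP / (2 * TP + FP + FN)

def hamming_loss(Y_true, Y_pred):
    """Fraction of label slots where prediction and truth disagree (Equation (12))."""
    m, q = len(Y_true), len(Y_true[0])
    return sum(t[j] != p[j]
               for t, p in zip(Y_true, Y_pred) for j in range(q)) / (m * q)

def zero_one_loss(Y_true, Y_pred):
    """1 per sample whose predicted label set is not an exact match (Equation (13))."""
    return sum(t != p for t, p in zip(Y_true, Y_pred)) / len(Y_true)

# Tiny example (2 samples, 3 labels): one label slot is wrong.
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 0, 0], [0, 1, 0]]
```

On this example, hamming_loss gives 1/6 and micro_f1 gives 0.8, matching the hand computation (TP = 2, FP = 0, FN = 1).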

3. Related Work

There have been many multi-label learning algorithms proposed so far. These algorithms can be divided into problem transformation and algorithm adaptation [34,35]. Problem transformation is the conversion of multi-label learning into traditional single-label learning, such as Binary Relevance (BR) [36], Pruned Problem Transformation (PPT) [37], and Label Powerset (LP) [38]. BR treats the prediction of each label as an independent single classification issue and trains an individual classifier for each label with all of the training data [33]. However, it ignores the relationships between the labels, so it is possible to end up with imbalanced data. PPT removes the labels with a low frequency by considering the label sets with a predetermined minimum number of occurrences. However, this irreversible conversion will result in the loss of class information [39].
In contrast to problem transformation, algorithm adaptation directly enhances existing single-label learning algorithms to adapt them to multi-label data processing, which mitigates the issues caused by problem transformation. Cai et al. [40] propose Robust and Pragmatic Multi-class Feature Selection (RALM-FS) based on an augmented Lagrangian method, where there is just one ℓ_{2,1}-norm loss term in RALM-FS, with an explicit ℓ_{2,0}-norm equality constraint. Lee and Kim [41] propose the D2F method that makes use of interaction information based on mutual information. It is capable of measuring multiple variable dependencies by default, and its definition is as follows:
J(f_k) = Σ_{l_i∈L} I(f_k; l_i) - Σ_{f_j∈S} Σ_{l_i∈L} I(f_k; f_j; l_i)  (14)
where Σ_{l_i∈L} I(f_k; l_i) and Σ_{f_j∈S} Σ_{l_i∈L} I(f_k; f_j; l_i) are regarded as the feature relevance term and the feature redundancy term, respectively. The feature relevance term of D2F considers only the candidate features, ignoring selected features and label correlations. Lee and Kim [42] propose the Pairwise Multi-label Utility (PMU), which is derived from I(S; L) as follows:
J(f_k) = Σ_{l_i∈L} I(f_k; l_i) - Σ_{f_j∈S} Σ_{l_i∈L} I(f_k; f_j; l_i) - Σ_{l_i∈L} Σ_{l_j∈L} I(f_k; l_i; l_j)  (15)
where Σ_{l_i∈L} I(f_k; l_i) measures the feature relevance and Σ_{f_j∈S} Σ_{l_i∈L} I(f_k; f_j; l_i) + Σ_{l_i∈L} Σ_{l_j∈L} I(f_k; l_i; l_j) measures the feature redundancy. Afterward, Lee and Kim [43] propose multi-label feature selection based on a scalable criterion for large label sets (SCLS). SCLS uses a scalable relevance evaluation approach to assess conditional relevance more accurately:
J(f_k) = Σ_{l_i∈L} I(f_k; l_i) - Σ_{f_j∈S} [I(f_k; f_j) / H(f_k)] Σ_{l_i∈L} I(f_k; l_i) = (1 - Σ_{f_j∈S} I(f_k; f_j) / H(f_k)) Σ_{l_i∈L} I(f_k; l_i)  (16)
In fact, the scalable relevance in SCLS considers both candidate features and selected features but ignores label correlations. Liu et al. [44] propose feature selection for multi-label learning with streaming label (FSSL) in which label-specific features are learned for each newly received label, and then label-specific features are fused for all currently received labels. Lin et al. [45] apply a multi-label feature selection method based on fuzzy mutual information (MUCO) to the redundancy and correlation analysis strategies. The next feature that enters S can be selected by the following:
J(f_k) = FMI(f_k; L) - (1/|S|) Σ_{f_j∈S} FMI(f_k; f_j)  (17)
where FMI(f_k; L) denotes the fuzzy mutual information.
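All of the criteria above plug into the same greedy forward search: at each step, the candidate feature with the largest score J(f_k) joins the selected subset S. The following sketch (helper names and data layout are our own) illustrates this with the D2F score of Equation (14), using plug-in estimates of mutual information on discrete columns:

```python
from collections import Counter
from math import log2

def H(*cols):
    """Joint entropy of one or more aligned discrete columns."""
    n = len(cols[0])
    return -sum(c / n * log2(c / n) for c in Counter(zip(*cols)).values())

def I(x, y):
    """Mutual information I(X; Y)."""
    return H(x) + H(y) - H(x, y)

def interaction(x, y, z):
    """Interaction information I(X; Y; Z) = I(X;Y) + I(X;Z) - I(X;Y,Z)."""
    yz = list(zip(y, z))
    return I(x, y) + I(x, z) - (H(x) + H(yz) - H(x, yz))

def d2f_score(fk, S, L):
    """Equation (14): relevance to each label minus pairwise interaction redundancy."""
    relevance = sum(I(fk, l) for l in L)
    redundancy = sum(interaction(fk, fj, l) for fj in S for l in L)
    return relevance - redundancy

def greedy_select(features, labels, K, score=d2f_score):
    """features: {name: column}; labels: list of label columns; pick K features."""
    S, remaining = [], dict(features)
    while len(S) < K and remaining:
        best = max(remaining, key=lambda f: score(
            remaining[f], [features[s] for s in S], labels))
        S.append(best)
        del remaining[best]
    return S
```

Swapping `score` for another criterion (PMU, SCLS, or the TCRFS score of Section 4) changes the method without changing the search.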
When we try to add a new candidate feature f k to the current selected feature subset S, the feature f k , the selected features f j in S, and label correlations in the total label set will all impact feature relevance. To this end, FR is devised by merging the three types of conditional relevance. Therefore, a designed multi-label feature selection method TCRFS that integrates FR with LR is proposed.

4. TCRFS: Feature Selection Combining Three Types of Conditional Relevance

Past multi-label feature selection methods do not take all three key aspects influencing feature relevance into account; that is, the key aspects that influence feature relevance are not comprehensively examined. Here, we utilize three incremental information terms to depict three types of conditional relevance that consider candidate features, selected features, and label correlations comprehensively. The reasons for our consideration are as follows.

4.1. The Three Key Aspects of Feature Relevance We Consider

4.1.1. Candidate Features

We evaluate each candidate feature according to specific criteria. When a candidate feature f k attempts to enter the current selected feature subset S as a new selected feature to generate a new selected feature subset, it will affect the amount of information provided by the current selected feature subset to the label set. The influence of candidate features is represented by a Venn diagram, as shown in Figure 1.
In Figure 1, we assume that f_k1 and f_k2 are two candidate features, f_j is a selected feature in S, and l_i is a label in the total label set L. f_k1 is irrelevant to f_j, and f_k2 is redundant with f_j. The amount of information provided by f_j to l_i is the mutual information I(f_j; l_i), that is, the area {2, 3}. If f_k1 is selected, then the amount of information provided by f_j to l_i will be I(f_j; l_i | f_k1), which still corresponds to the area {2, 3}. If f_k2 is selected, then the amount of information provided by f_j to l_i will be I(f_j; l_i | f_k2), which corresponds to the area {2}. Since the area {2} is less than the area {2, 3}, I(f_j; l_i | f_k2) < I(f_j; l_i | f_k1). Therefore, the higher the label-related redundancy between the candidate feature and the selected features in the current selected feature subset, the more the amount of information between f_j and l_i is reduced. In other words, the label-related redundancy between the candidate feature and the selected features should be kept to a minimum. From this point of view, f_k1 takes precedence over f_k2.
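The Figure 1 argument can be checked numerically on constructed toy columns (the data below are our own assumption): conditioning on a candidate fully redundant with f_j removes all of the information f_j provides about l_i, while conditioning on an independent candidate leaves it intact.

```python
from collections import Counter
from math import log2

def H(*cols):
    """Joint entropy of aligned discrete columns."""
    n = len(cols[0])
    return -sum(c / n * log2(c / n) for c in Counter(zip(*cols)).values())

def cmi(x, y, z):
    """I(X; Y | Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return H(x, z) + H(y, z) - H(x, y, z) - H(z)

f_j  = [0, 0, 0, 0, 1, 1, 1, 1]   # selected feature
l_i  = f_j                        # label fully determined by f_j
f_k2 = f_j                        # candidate redundant with f_j
f_k1 = [0, 0, 1, 1, 0, 0, 1, 1]   # candidate independent of f_j

assert cmi(f_j, l_i, f_k2) == 0.0   # area {2} collapses entirely
assert cmi(f_j, l_i, f_k1) == 1.0   # area {2, 3} is preserved
```

With these columns, I(f_j; l_i | f_k2) = 0 < I(f_j; l_i | f_k1) = 1, matching the inequality derived from the Venn diagram.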

4.1.2. Selected Features

The influence of selected features is represented by a Venn diagram as shown in Figure 2.
As shown in Figure 2, f_k1 and f_k2 are both redundant with f_j. Without considering selected features, the information that f_k1 and f_k2 share with the label l_i is I(f_k1; l_i) and I(f_k2; l_i), respectively. The area {1, 2} denotes I(f_k1; l_i), and the area {5, 6} denotes I(f_k2; l_i). We assume that the area {1, 2} is less than the area {5, 6}, the area {2} is less than the area {5}, but the area {1} is larger than the area {6}. With the selected features taken into account, the information shared by f_k1 and l_i is I(f_k1; l_i | f_j) (i.e., the area {1}), and the information shared by f_k2 and l_i is I(f_k2; l_i | f_j) (i.e., the area {6}): I(f_k1; l_i) < I(f_k2; l_i), but I(f_k1; l_i | f_j) > I(f_k2; l_i | f_j). There are two causes for this situation: the first is that the amount of information provided to l_i by f_k2 itself is insufficient, and the second is that the label-related redundancy between f_k2 and f_j is excessive. Now, in the hypothesis, replace the condition that the area {1} is larger than the area {6} with the condition that the area {1} is less than the area {6}, and we obtain the following result: I(f_k1; l_i) < I(f_k2; l_i) and I(f_k1; l_i | f_j) < I(f_k2; l_i | f_j). Therefore, considering the influence of the selected features on feature relevance is necessary.

4.1.3. Label Correlations

It has no influence on the amount of information between candidate features and each label if the labels are independent. The influence of label correlations is represented by a Venn diagram as shown in Figure 3.
In Figure 3, l_i and l_j are two redundant labels; that is, there exists a correlation between l_i and l_j. Without the consideration of label correlations, the amount of information provided to l_i by f_k1 is I(f_k1; l_i) (the area {1, 2}) and the amount of information provided to l_i by f_k2 is I(f_k2; l_i) (the area {4, 5}). Then, taking label correlations into consideration, the amount of information provided to l_i by f_k1 is I(f_k1; l_i | l_j) (the area {1}) and the amount of information provided to l_i by f_k2 is I(f_k2; l_i | l_j) (the area {4}). Now, consider the first hypothesis: the area {1, 2} is larger than the area {4, 5}, the area {2} is larger than the area {5}, but the area {1} is less than the area {4}. Hence, I(f_k1; l_i) > I(f_k2; l_i) but I(f_k1; l_i | l_j) < I(f_k2; l_i | l_j). The second hypothesis modifies the last condition of the first hypothesis: the area {1} is larger than the area {4}. Hence, I(f_k1; l_i) > I(f_k2; l_i) and I(f_k1; l_i | l_j) > I(f_k2; l_i | l_j). We call the area {2} and the area {5} the feature-related label redundancy. Therefore, both the original amount of information between candidate features and labels and the feature-related label redundancy can affect the selection of features. Merely using the accumulation of mutual information as the feature relevance would cause the feature-related label redundancy to be redundantly recalculated.
According to the three key aspects of feature relevance described above, they are indispensable. As a result, we devise FR as the feature relevance term of TCRFS.

4.2. Evaluation Function of TCRFS

4.2.1. Definitions of FR and LR

Regarding the feature relevance evaluation, we distinguish the importance of features based on the closeness of the relationship between features and labels. According to Section 4.1, candidate features, selected features, and label correlations are three key aspects in evaluating feature relevance. In order to perform better in multi-label classification, we utilize three types of conditional relevance (I(f_k; l_i | f_j), I(f_j; l_i | f_k), and I(f_k; l_i | l_j)) to represent the feature relevance term in the proposed method. By using three incremental information terms to summarize the three key aspects of feature relevance, FR is devised. The three incremental information terms represent the three respective types of conditional relevance.
Definition 1. 
(FR). Suppose that F = {f_1, f_2, …, f_m} and L = {l_1, l_2, …, l_n} are the total feature set and the total label set, respectively. Let S be the selected feature set excluding candidate features, that is, f_k ∈ F \ S. FR is depicted as follows:
FR(f_k) = Σ_{l_i∈L} Σ_{f_j∈S} I(f_k; l_i | f_j) + Σ_{l_i∈L} Σ_{f_j∈S} I(f_j; l_i | f_k) + Σ_{l_i∈L} Σ_{l_j∈L, j≠i} I(f_k; l_i | l_j)  (18)
where Σ_{l_i∈L} Σ_{f_j∈S} I(f_k; l_i | f_j) denotes the conditional relevance taking candidate features into account, Σ_{l_i∈L} Σ_{f_j∈S} I(f_j; l_i | f_k) denotes the conditional relevance taking selected features into account, and Σ_{l_i∈L} Σ_{l_j∈L, j≠i} I(f_k; l_i | l_j) denotes the conditional relevance taking label correlations into account while evaluating feature relevance. The comprehensive evaluation of the above-mentioned three key aspects of feature relevance is more conducive to capturing the optimal features. Furthermore, FR can be expanded as follows:
FR(f_k) = Σ_{l_i∈L} Σ_{f_j∈S} I(f_k; l_i | f_j) + Σ_{l_i∈L} Σ_{f_j∈S} I(f_j; l_i | f_k) + Σ_{l_i∈L} Σ_{l_j∈L, j≠i} I(f_k; l_i | l_j)
= Σ_{l_i∈L} Σ_{f_j∈S} [I(f_k; l_i | f_j) + I(f_j; l_i | f_k)] + Σ_{l_i∈L} Σ_{l_j∈L, j≠i} I(f_k; l_i | l_j)
= Σ_{l_i∈L} Σ_{f_j∈S} [I(f_k, f_j; l_i) - I(f_j; l_i) + I(f_j, f_k; l_i) - I(f_k; l_i)] + Σ_{l_i∈L} Σ_{l_j∈L, j≠i} [I(f_k; l_i, l_j) - I(l_i; l_j)]
∝ Σ_{l_i∈L} Σ_{f_j∈S} [2I(f_k, f_j; l_i) - I(f_k; l_i)] + Σ_{l_i∈L} Σ_{l_j∈L, j≠i} I(f_k; l_i, l_j)  (19)
where I(f_j; l_i) and I(l_i; l_j) are considered to be constants with respect to the candidate feature f_k in feature selection.
Definition 2. 
(LR). In the initial analysis of the three key aspects of feature relevance, it is mentioned that the label-related feature redundancy is repeatedly calculated in previous methods, which impacts capturing the optimal features. Here, LR is devised as follows:
LR(f_k) = Σ_{l_i∈L} Σ_{f_j∈S} [I(f_k; f_j) - I(f_k; f_j | l_i)]  (20)
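A direct plug-in implementation of Definitions 1 and 2 can be sketched as follows (helper names and probability estimates are our own; features and labels are aligned discrete columns):

```python
from collections import Counter
from math import log2

def H(*cols):
    """Joint entropy of aligned discrete columns."""
    n = len(cols[0])
    return -sum(c / n * log2(c / n) for c in Counter(zip(*cols)).values())

def mi(x, y):
    """Mutual information I(X; Y)."""
    return H(x) + H(y) - H(x, y)

def cmi(x, y, z):
    """Conditional mutual information I(X; Y | Z)."""
    return H(x, z) + H(y, z) - H(x, y, z) - H(z)

def FR(fk, S, L):
    """Equation (18): candidate-, selected-, and label-conditioned relevance."""
    t1 = sum(cmi(fk, li, fj) for li in L for fj in S)
    t2 = sum(cmi(fj, li, fk) for li in L for fj in S)
    t3 = sum(cmi(fk, li, lj) for i, li in enumerate(L)
                             for j, lj in enumerate(L) if i != j)
    return t1 + t2 + t3

def LR(fk, S, L):
    """Equation (20): I(f_k; f_j) minus its label-conditioned counterpart."""
    return sum(mi(fk, fj) - cmi(fk, fj, li) for li in L for fj in S)
```

For instance, with a single label and one selected feature the label-correlation term vanishes, and with an empty S both the two selected-feature sums and LR are zero, as the definitions require.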
As indicated in Table 2, we have compiled a list of feature relevance terms and feature redundancy terms for TCRFS and the contrasted methods based on information theory.

4.2.2. Proposed Method

We designed FR and LR to analyze and discuss feature relevance and feature redundancy, respectively, in Section 4.2.1. Subsequently, TCRFS, a multi-label feature selection method that integrates FR with LR, is proposed. The definition of TCRFS is as follows:
J(f_k) = (1/(|L||S|)) Σ_{l_i∈L} Σ_{f_j∈S} I(f_k; l_i | f_j) + (1/(|L||S|)) Σ_{l_i∈L} Σ_{f_j∈S} I(f_j; l_i | f_k) + (1/(|L|(|L|-1))) Σ_{l_i∈L} Σ_{l_j∈L, j≠i} I(f_k; l_i | l_j) - (1/(|L||S|)) Σ_{l_i∈L} Σ_{f_j∈S} [I(f_k; f_j) - I(f_k; f_j | l_i)]  (21)
where |L| and |S| represent the size of the total label set and the size of the selected feature subset, respectively. The feature relevance term and the feature redundancy term are balanced using the two balance parameters 1/(|L||S|) and 1/(|L|(|L|-1)). According to Formula (19), Formula (21) can be rewritten as follows:
J(f_k) = (1/(|L||S|)) Σ_{l_i∈L} Σ_{f_j∈S} I(f_k; l_i | f_j) + (1/(|L||S|)) Σ_{l_i∈L} Σ_{f_j∈S} I(f_j; l_i | f_k) + (1/(|L|(|L|-1))) Σ_{l_i∈L} Σ_{l_j∈L, j≠i} I(f_k; l_i | l_j) - (1/(|L||S|)) Σ_{l_i∈L} Σ_{f_j∈S} [I(f_k; f_j) - I(f_k; f_j | l_i)]
= (1/(|L||S|)) Σ_{l_i∈L} Σ_{f_j∈S} [I(f_k; l_i | f_j) + I(f_j; l_i | f_k) - I(f_k; f_j) + I(f_k; f_j | l_i)] + (1/(|L|(|L|-1))) Σ_{l_i∈L} Σ_{l_j∈L, j≠i} I(f_k; l_i | l_j)
∝ (1/(|L||S|)) Σ_{l_i∈L} Σ_{f_j∈S} [2I(f_k, f_j; l_i) - I(f_k; l_i) - I(f_k; f_j; l_i)] + (1/(|L|(|L|-1))) Σ_{l_i∈L} Σ_{l_j∈L, j≠i} I(f_k; l_i, l_j)
= (1/(|L||S|)) Σ_{l_i∈L} Σ_{f_j∈S} [2I(f_k, f_j; l_i) - I(f_k; l_i | f_j) - 2I(f_k; f_j; l_i)] + (1/(|L|(|L|-1))) Σ_{l_i∈L} Σ_{l_j∈L, j≠i} I(f_k; l_i, l_j)
= (1/(|L||S|)) Σ_{l_i∈L} Σ_{f_j∈S} [2I(f_k, f_j; l_i) - I(f_k, f_j; l_i) + I(f_j; l_i) - 2I(f_k; f_j; l_i)] + (1/(|L|(|L|-1))) Σ_{l_i∈L} Σ_{l_j∈L, j≠i} I(f_k; l_i, l_j)
∝ (1/(|L||S|)) Σ_{l_i∈L} Σ_{f_j∈S} [I(f_k, f_j; l_i) - 2I(f_k; f_j; l_i)] + (1/(|L|(|L|-1))) Σ_{l_i∈L} Σ_{l_j∈L, j≠i} I(f_k; l_i, l_j)
= (1/(|L||S|)) Σ_{l_i∈L} Σ_{f_j∈S} I(f_k, f_j; l_i) + (1/(|L|(|L|-1))) Σ_{l_i∈L} Σ_{l_j∈L, j≠i} I(f_k; l_i, l_j) - (2/(|L||S|)) Σ_{l_i∈L} Σ_{f_j∈S} I(f_k; f_j; l_i)  (22)
where (1/(|L||S|)) Σ_{l_i∈L} Σ_{f_j∈S} I(f_k, f_j; l_i) + (1/(|L|(|L|-1))) Σ_{l_i∈L} Σ_{l_j∈L, j≠i} I(f_k; l_i, l_j) is regarded as the new feature relevance term and (2/(|L||S|)) Σ_{l_i∈L} Σ_{f_j∈S} I(f_k; f_j; l_i) is regarded as the new feature redundancy term. The pseudo-code of TCRFS (Algorithm 1) is as follows:
Algorithm 1. TCRFS.
Input: A training sample D with a full feature set F = {f_1, f_2, …, f_n} and the label set L = {l_1, l_2, …, l_m}; a user-specified threshold K.
Output: The selected feature subset S.
1:  S ← ∅;
2:  k ← 0;
3:  for i = 1 to n do
4:      Calculate the feature relevance I(f_i; l_i | l_j);
5:  end for
6:  while k < K do
7:      if k == 0 then
8:          Select the first feature f_j with the largest I(f_i; l_i | l_j);
9:          k = k + 1;
10:         S = S ∪ {f_j};
11:         F = F \ {f_j};
12:     end if
13:     for each candidate feature f_i ∈ F do
14:         Calculate J(f_i) according to Formula (21);
15:     end for
16:     Select the feature f_j with the largest J(f_i);
17:     S = S ∪ {f_j};
18:     F = F \ {f_j};
19:     k = k + 1;
20: end while
First, in lines 1 and 2, the selected feature subset S and the number of selected features k are initialized, and the incremental information I(f_i; l_i | l_j) is calculated for every feature (lines 3-5). This relevance is used to capture the first feature (lines 6-12). Then, until the procedure is complete, the following features are calculated and captured (lines 13-20).
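Algorithm 1 can be sketched in Python as follows (a hedged illustration with our own helper names and plug-in probability estimates, not the authors' released code); the first feature is chosen by the label-conditioned relevance alone, as in lines 3-12, and subsequent features maximize the J(f_k) of Formula (21):

```python
from collections import Counter
from math import log2

def H(*cols):
    """Joint entropy of aligned discrete columns."""
    n = len(cols[0])
    return -sum(c / n * log2(c / n) for c in Counter(zip(*cols)).values())

def mi(x, y):
    return H(x) + H(y) - H(x, y)

def cmi(x, y, z):
    return H(x, z) + H(y, z) - H(x, y, z) - H(z)

def tcrfs_score(fk, S, L):
    """J(f_k) of Formula (21): FR terms balanced by 1/(|L||S|) and 1/(|L|(|L|-1)), minus LR."""
    a = 1.0 / (len(L) * max(len(S), 1))
    b = 1.0 / (len(L) * max(len(L) - 1, 1))
    t1 = sum(cmi(fk, li, fj) for li in L for fj in S)
    t2 = sum(cmi(fj, li, fk) for li in L for fj in S)
    t3 = sum(cmi(fk, li, lj) for i, li in enumerate(L)
                             for j, lj in enumerate(L) if i != j)
    lr = sum(mi(fk, fj) - cmi(fk, fj, li) for li in L for fj in S)
    return a * (t1 + t2) + b * t3 - a * lr

def tcrfs(features, labels, K):
    """features: {name: column}; labels: list of label columns; select K features."""
    S, remaining = [], dict(features)
    # Lines 3-12: first feature = largest label-conditioned relevance.
    first = max(remaining, key=lambda f: sum(
        cmi(remaining[f], li, lj) for i, li in enumerate(labels)
                                  for j, lj in enumerate(labels) if i != j))
    S.append(first)
    del remaining[first]
    # Lines 13-20: greedily add the feature maximizing J(f_k).
    while len(S) < K and remaining:
        best = max(remaining, key=lambda f: tcrfs_score(
            remaining[f], [features[s] for s in S], labels))
        S.append(best)
        del remaining[best]
    return S
```

On a toy task with two correlated labels driven by one feature and one noise feature, the sketch captures the informative feature first.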

4.3. Time Complexity

Time complexity is also one of the criteria for evaluating the pros and cons of methods. The time complexity of each contrasted method and of TCRFS is computed here. Assume that there are n instances, p features, and q labels. The computational complexity of mutual information and conditional mutual information is O(n), since all instances have to be visited to estimate the probabilities. Each iteration of RALM-FS requires O(p^3). Assume that k denotes the number of selected features. The time complexity of TCRFS is O(npq^2 + knpq), as three incremental information terms and one label-related feature redundancy term are calculated. Similarly, D2F, PMU, and SCLS have time complexities of O(npq + knpq), O(npq + knpq + npq^2), and O(nma + knm), respectively. FSSL has a time complexity of O(knpq). The time complexity of MUCO is O(n^2 + p(p - k)), since it constructs a fuzzy matrix and performs an incremental search.

5. Experimental Evaluation

To demonstrate the efficacy of TCRFS, we compare it to 6 advanced multi-label feature selection approaches (RALM-FS [40], D2F [41], PMU [42], SCLS [43], FSSL [44], and MUCO [45]) on 13 benchmark data sets in this section. We have conducted numerous experiments based on four different criteria using three classifiers: Support Vector Machine (SVM), 3-Nearest Neighbor (3NN), and Multi-Label k-Nearest Neighbor (ML-kNN) [46,47]. The 13 multi-label benchmark data sets utilized in the experiments are described first. Following that, the findings of the experiments are discussed and examined. The four evaluation metrics that we employ in the experiments were presented in Section 2.2. The approximate experimental framework is depicted in Figure 4.

5.1. Multi-Label Data Sets

A total of 13 multi-label benchmark data sets from 4 different domains have been depicted in Table 3, which are collected on the Mulan repository [48]. Among them, the Birds data set classifies the birds in Audio [49], the Emotions data set is gathered for Music [38], the Genbase and Yeast data sets are primarily concerned with the Biology category [34], and the remaining 9 data sets are categorized as Text. The 13 data sets we chose have an abundant number of instances, which are split into two parts: training set and test set [48]. Ueda and Saito [50] attempted to classify real Web pages linked from the “yahoo.com” domain, which is composed of 14 top-level categories, each of which is split into many second-level subcategories. They tested 11 of the 14 independent text classification problems by focusing on the second-level categories. For each problem, the training set includes 2000 documents and the test set includes 3000 documents, such as the Arts and Health data sets, and so on [51]. The number of labels and the number of features both vary substantially. Previous research demonstrates that maintaining 10% of the features results in no loss, while retaining 1% of the features results in a slight loss dependent on document frequency [3]. For example, the Arts and Social data sets have more than 20,000 features and 50,000 features, respectively, and they retain about 2% of the features with the highest document frequency. The continuous features of 13 data sets are discretized into equal intervals with 3 bins as indicated in the literature [38,52].

5.2. The Theoretical Justification of TCRFS on an Artificial Data Set

To further justify the indispensability of the three key aspects (candidate features, selected features, and label correlations) for feature relevance evaluation, we employ an artificial data set to compare the classification performance of five information-theoretic methods (D2F, PMU, SCLS, MUCO, and TCRFS) that use distinct feature relevance terms. With respect to the feature relevance terms, D2F and PMU employ the amount of information between candidate features and labels; SCLS employs a scalable relevance evaluation, which takes feature redundancy into account within feature relevance; MUCO employs fuzzy mutual information; and TCRFS comprehensively considers the three types of conditional relevance we mentioned to design FR. Table 4 and Table 5 display the training set and the test set, respectively.
Table 6 shows the experimental results and the feature ranking produced by each approach on the artificial data set. As shown in Table 6, the first feature selected by TCRFS is f5, and, unlike in D2F and PMU, f2 is regarded as the least essential feature. In TCRFS, the features f0, f8, and f4 are ranked higher than they are by SCLS, whereas MUCO selects f4 as its first feature. TCRFS achieves the best classification performance overall. Therefore, TCRFS, which considers the three key aspects (candidate features, selected features, and label correlations), is justified.
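The conditional relevance quantities that FR aggregates, such as I(f_k; l_i | f_j), can be estimated from discrete data with plug-in (maximum-likelihood) entropy estimates. The sketch below is our own illustration of that estimate, not the paper's implementation, using the identity I(X; Y | Z) = H(X, Z) + H(Y, Z) − H(X, Y, Z) − H(Z).

```python
import numpy as np
from collections import Counter

def conditional_mi(x, y, z):
    """Estimate I(x; y | z) in bits from three discrete sample vectors.

    Plug-in estimate: empirical joint entropies substituted into
    I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z).
    """
    def entropy(*cols):
        # Empirical entropy of the joint distribution of the given columns.
        counts = Counter(zip(*cols))
        n = len(cols[0])
        p = np.array(list(counts.values())) / n
        return float(-np.sum(p * np.log2(p)))

    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)
```

As a sanity check, I(x; y | z) is zero when z determines x, and reduces to H(x) when y = x and z is constant.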

5.3. Analysis and Discussion of the Experimental Findings

The experiments are run on a 3.70 GHz Intel Core i9-10900K processor with 32 GB of main memory and are evaluated with four different evaluation criteria using three classifiers. The proposed method is implemented in Python [53]. Hamming Loss and Zero One Loss are measured with the ML-kNN (k = 10) classifier, while the Macro-F1 and Micro-F1 measures are computed with the SVM and 3NN classifiers. The number of selected features on 12 of the data sets is set to {1%, 2%, ..., 20%} of the total number of features in steps of 1%, whereas on the Medical data set it is set to {1%, 2%, ..., 17%}. Table 7, Table 8, Table 9, Table 10, Table 11 and Table 12 present the classification performance of the 6 contrasted approaches and TCRFS on the 13 data sets, expressed as average classification results with standard deviations. The row "Average" reports the mean classification result of each method over all data sets. The best-performing classification results in Table 7, Table 8, Table 9, Table 10, Table 11 and Table 12 are bolded.
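The four evaluation criteria can be computed directly from binary label matrices. The following plain-NumPy sketch of their standard definitions is our own illustration (the experiments themselves use existing classifier/metric implementations):

```python
import numpy as np

def multilabel_scores(Y_true, Y_pred):
    """Macro-F1, Micro-F1, Hamming Loss, and Zero One Loss
    for 0/1 label matrices of shape (n_instances, n_labels)."""
    Y_true = np.asarray(Y_true)
    Y_pred = np.asarray(Y_pred)
    tp = np.sum((Y_true == 1) & (Y_pred == 1), axis=0).astype(float)
    fp = np.sum((Y_true == 0) & (Y_pred == 1), axis=0).astype(float)
    fn = np.sum((Y_true == 1) & (Y_pred == 0), axis=0).astype(float)
    # Macro-F1: per-label F1 averaged over labels (empty labels score 0).
    denom = 2 * tp + fp + fn
    macro = float(np.mean(np.divide(2 * tp, denom,
                                    out=np.zeros_like(tp), where=denom > 0)))
    # Micro-F1: F1 over globally pooled counts.
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    # Hamming Loss: fraction of misclassified instance-label pairs (lower is better).
    hamming = float(np.mean(Y_true != Y_pred))
    # Zero One Loss: fraction of instances whose full label set is wrong (lower is better).
    zol = float(np.mean(np.any(Y_true != Y_pred, axis=1)))
    return macro, micro, hamming, zol
```

Higher Macro-F1/Micro-F1 and lower HL/ZOL indicate better classification, which is how the tables below are read.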
As Table 7 and Table 8 show, TCRFS delivers the best classification performance on the SVM classifier for the Macro-F1 and Micro-F1 measures; for these two measures, higher values indicate better classification performance. In Table 9, except on the Yeast data set, TCRFS beats the 6 other contrasted approaches on 12 data sets using the 3NN classifier for Macro-F1. In Table 10, TCRFS surpasses the other 6 contrasted approaches on 11 data sets using the 3NN classifier for Micro-F1. By the definitions of the HL and ZOL measures, lower values indicate better classification performance. In Table 11 and Table 12, TCRFS exhibits the best performance on 11 data sets with the ML-kNN classifier for the HL and ZOL criteria. In some cases, comprehensively considering the three key aspects when assessing feature relevance does not achieve the best classification effect. D2F takes first place on the Yeast data set for Macro-F1 on the 3NN classifier, and PMU and RALM-FS obtain the best classification performance on the Yeast and Education data sets, respectively, for Micro-F1 on the 3NN classifier. In terms of HL (Table 11), RALM-FS and SCLS surpass the other approaches on the Birds and Emotions data sets, respectively. In terms of ZOL (Table 12), FSSL and D2F surpass the other approaches on the Birds and Emotions data sets, respectively. Although D2F, PMU, RALM-FS, SCLS, and FSSL achieve the best performance on individual data sets, TCRFS still attains the best overall classification performance. The average values of each method for the different evaluation criteria are illustrated in Figure 5, where the abscissa and the differently colored bars represent the feature selection methods and the ordinate represents the average value.
The bar graphs in Figure 5a,b show that the Macro-F1 and Micro-F1 results achieved on the SVM and 3NN classifiers reach similar classification performance. The average Macro-F1 results of TCRFS are roughly 0.2 or above and its average Micro-F1 results are roughly 0.4 or above, clearly greater than the averages of the other approaches. The average result of TCRFS is below 0.074 in Figure 5c and below 0.74 in Figure 5d, clearly lower than the averages of the other approaches. Intuitively, TCRFS presents the best average values for all four evaluation criteria. To further observe the classification performance of the seven methods on individual data sets, we draw Figure 6, Figure 7, Figure 8 and Figure 9.
Figure 6, Figure 7, Figure 8 and Figure 9 indicate that TCRFS delivers superior classification performance on the Arts, Recreation, Entertain, and Health data sets for the four evaluation criteria. As shown in Figure 6, the classification performance of our method is significantly better than that of the other six contrasted methods. On the Recreation data set (Figure 7), classification performance is not constantly improved by increasing the number of selected features; for example, TCRFS obtains its most significant results for the ZOL measure when the number of selected features is set to 8% or 11% of the total number of features. On the Entertain data set (Figure 8), TCRFS is clearly in the lead for Macro-F1 when the percentage of selected features is larger than 1%, and it also has significant advantages among the seven methods in terms of HL and ZOL; the proposed method obtains the best classification performance for every metric when the percentage of selected features is set to 6%. In Figure 9, our method outperforms the other six contrasted methods on the Health data set under all four metrics. Although in most cases the performance of feature selection methods improves as the number of selected features increases, once the number of features grows beyond a certain point the improvement in classification performance tends to flatten. When the percentage of features increases to about 16% on the Arts data set (Figure 6a–d) and to about 19% on the Entertain data set (Figure 8a–d), the classification performance has already reached a relatively high level. That is, an optimal feature subset selects a smaller number of features while achieving better classification performance.
Some methods appear to match the classification performance of TCRFS in Figure 8d and Figure 9e, but TCRFS remains superior on average and overall. Consequently, it is critical to consider the three types of conditional relevance for multi-label feature selection.
We build the final feature subset by starting from an empty subset and adding one feature after each evaluation round of the proposed method: according to the TCRFS evaluation function, every candidate feature is scored, and the candidates are ranked. Because TCRFS uses three incremental information terms as the feature relevance criterion, the incremental information of the remaining candidate features changes after each selection, so the candidates must be rescored in every round. Consequently, TCRFS achieves better classification performance at the cost of more computation time.
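The selection procedure described above is a standard greedy forward search. A minimal sketch follows, where `score_fn` is a placeholder for the TCRFS criterion (FR minus LR); any incremental criterion with the signature `score_fn(candidate, selected)` fits this loop.

```python
def greedy_forward_selection(features, score_fn, k):
    """Greedy forward search: start from an empty subset and repeatedly
    add the highest-scoring candidate.

    `score_fn(candidate, selected)` stands in for the TCRFS evaluation
    function; this sketch only illustrates the search procedure.
    """
    selected = []
    candidates = list(features)
    for _ in range(k):
        # Incremental terms depend on the current subset, so every
        # remaining candidate is rescored after each selection -- the
        # extra time cost mentioned in the text.
        best = max(candidates, key=lambda f: score_fn(f, selected))
        selected.append(best)
        candidates.remove(best)
    return selected
```

The rescoring inside the loop is what distinguishes this search from scoring all features once and taking the top k.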

6. Conclusions

In this paper, TCRFS, which combines FR and LR, is proposed to capture an optimal selected feature subset. FR fuses three incremental information terms that take the three key aspects into consideration to convey three types of conditional relevance. TCRFS is compared with 1 embedded approach (RALM-FS) and 5 information-theoretical-based approaches (D2F, PMU, SCLS, FSSL, and MUCO) on 13 multi-label benchmark data sets to demonstrate its efficacy. The classification performance of the seven multi-label feature selection methods is evaluated with four multi-label metrics (Macro-F1, Micro-F1, Hamming Loss, and Zero One Loss) on three classifiers (SVM, 3NN, and ML-kNN). The classification results verify that TCRFS outperforms the other six contrasted approaches. Therefore, candidate features, selected features, and label correlations are critical for feature relevance evaluation, and they can aid in the selection of a more suitable feature subset. Our current research is based on a fixed label set for multi-label feature selection; in future research, we intend to explore multi-label feature selection integrating information theory with the streaming label problem.

Author Contributions

Conceptualization, L.G.; methodology, L.G.; software, P.Z. and L.G.; validation, Y.W. and Y.L.; formal analysis, L.G.; investigation, L.G.; resources, Y.W.; data curation, L.H.; writing—original draft preparation, L.G.; writing—review and editing, L.G.; visualization, L.G. and Y.W.; supervision, Y.W.; project administration, L.H.; funding acquisition, L.H. All authors have read and approved the final manuscript.

Funding

This work was supported in part by the National Key Research and Development Plan of China under Grant 2017YFA0604500, in part by the Key Scientific and Technological Research and Development Plan of Jilin Province of China under Grant 20180201103GX, and in part by the Project of Jilin Province Development and Reform Commission under Grant 2019FGWTZC001.

Data Availability Statement

The multi-label data sets used in the experiment are from Mulan Library http://mulan.sourceforge.net/datasets-mlc.html, accessed on 24 November 2021.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhou, Z.H.; Zhang, M.L. Multi-label Learning. 2017. Available online: https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/EncyMLDM2017.pdf (accessed on 26 November 2021).
  2. Kashef, S.; Nezamabadi-pour, H. A label-specific multi-label feature selection algorithm based on the Pareto dominance concept. Pattern Recognit. 2019, 88, 654–667.
  3. Zhang, M.L.; Wu, L. Lift: Multi-label learning with label-specific features. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 107–120.
  4. Zhang, M.L.; Li, Y.K.; Liu, X.Y.; Geng, X. Binary relevance for multi-label learning: An overview. Front. Comput. Sci. 2018, 12, 191–202.
  5. Al-Salemi, B.; Ayob, M.; Noah, S.A.M. Feature ranking for enhancing boosting-based multi-label text categorization. Expert Syst. Appl. 2018, 113, 531–543.
  6. Yu, Y.; Pedrycz, W.; Miao, D. Neighborhood rough sets based multi-label classification for automatic image annotation. Int. J. Approx. Reason. 2013, 54, 1373–1387.
  7. Yu, G.; Rangwala, H.; Domeniconi, C.; Zhang, G.; Yu, Z. Protein function prediction with incomplete annotations. IEEE/ACM Trans. Comput. Biol. Bioinform. 2013, 11, 579–591.
  8. Tran, M.Q.; Li, Y.C.; Lan, C.Y.; Liu, M.K. Wind Farm Fault Detection by Monitoring Wind Speed in the Wake Region. Energies 2020, 13, 6559.
  9. Tran, M.Q.; Elsisi, M.; Liu, M.K. Effective feature selection with fuzzy entropy and similarity classifier for chatter vibration diagnosis. Measurement 2021, 184, 109962.
  10. Tran, M.Q.; Liu, M.K.; Elsisi, M. Effective multi-sensor data fusion for chatter detection in milling process. ISA Trans. 2021. Available online: https://www.sciencedirect.com/science/article/abs/pii/S0019057821003724 (accessed on 26 November 2021).
  11. Gao, W.; Hu, L.; Zhang, P.; Wang, F. Feature selection by integrating two groups of feature evaluation criteria. Expert Syst. Appl. 2018, 110, 11–19.
  12. Huang, J.; Li, G.; Huang, Q.; Wu, X. Learning label specific features for multi-label classification. In Proceedings of the 2015 IEEE International Conference on Data Mining, Atlantic City, NJ, USA, 14–17 November 2015; pp. 181–190.
  13. Zhang, P.; Gao, W.; Liu, G. Feature selection considering weighted relevancy. Appl. Intell. 2018, 48, 4615–4625.
  14. Gao, W.; Hu, L.; Zhang, P. Class-specific mutual information variation for feature selection. Pattern Recognit. 2018, 79, 328–339.
  15. Zhang, P.; Gao, W. Feature selection considering Uncertainty Change Ratio of the class label. Appl. Soft Comput. 2020, 95, 106537.
  16. Liu, H.; Sun, J.; Liu, L.; Zhang, H. Feature selection with dynamic mutual information. Pattern Recognit. 2009, 42, 1330–1339.
  17. Vergara, J.R.; Estévez, P.A. A review of feature selection methods based on mutual information. Neural Comput. Appl. 2014, 24, 175–186.
  18. Hancer, E.; Xue, B.; Zhang, M. Differential evolution for filter feature selection based on information theory and feature ranking. Knowl. Based Syst. 2018, 140, 103–119.
  19. Brezočnik, L.; Fister, I.; Podgorelec, V. Swarm intelligence algorithms for feature selection: A review. Appl. Sci. 2018, 8, 1521.
  20. Zhu, P.; Xu, Q.; Hu, Q.; Zhang, C.; Zhao, H. Multi-label feature selection with missing labels. Pattern Recognit. 2018, 74, 488–502.
  21. Kohavi, R.; John, G.H. Wrappers for feature subset selection. Artif. Intell. 1997, 97, 273–324.
  22. Paniri, M.; Dowlatshahi, M.B.; Nezamabadi-pour, H. MLACO: A multi-label feature selection algorithm based on ant colony optimization. Knowl. Based Syst. 2020, 192, 105285.
  23. Blum, A.L.; Langley, P. Selection of relevant features and examples in machine learning. Artif. Intell. 1997, 97, 245–271.
  24. Cherrington, M.; Thabtah, F.; Lu, J.; Xu, Q. Feature selection: Filter methods performance challenges. In Proceedings of the 2019 International Conference on Computer and Information Sciences (ICCIS), Sakaka, Saudi Arabia, 3–4 April 2019; pp. 1–4.
  25. Li, F.; Miao, D.; Pedrycz, W. Granular multi-label feature selection based on mutual information. Pattern Recognit. 2017, 67, 410–423.
  26. Zhang, Z.; Li, S.; Li, Z.; Chen, H. Multi-label feature selection algorithm based on information entropy. Comput. Sci. 2013, 50, 1177.
  27. Wang, J.; Wei, J.M.; Yang, Z.; Wang, S.Q. Feature selection by maximizing independent classification information. IEEE Trans. Knowl. Data Eng. 2017, 29, 828–841.
  28. Lin, Y.; Hu, Q.; Liu, J.; Duan, J. Multi-label feature selection based on max-dependency and min-redundancy. Neurocomputing 2015, 168, 92–103.
  29. Ramírez-Gallego, S.; Mouriño-Talín, H.; Martínez-Rego, D.; Bolón-Canedo, V.; Benítez, J.M.; Alonso-Betanzos, A.; Herrera, F. An information theory-based feature selection framework for big data under Apache Spark. IEEE Trans. Syst. Man Cybern. Syst. 2017, 48, 1441–1453.
  30. Song, X.F.; Zhang, Y.; Guo, Y.N.; Sun, X.Y.; Wang, Y.L. Variable-size cooperative coevolutionary particle swarm optimization for feature selection on high-dimensional data. IEEE Trans. Evol. Comput. 2020, 24, 882–895.
  31. Zhang, M.L.; Zhou, Z.H. A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 2013, 26, 1819–1837.
  32. Hu, L.; Li, Y.; Gao, W.; Zhang, P.; Hu, J. Multi-label feature selection with shared common mode. Pattern Recognit. 2020, 104, 107344.
  33. Zhang, P.; Gao, W.; Hu, J.; Li, Y. Multi-Label Feature Selection Based on High-Order Label Correlation Assumption. Entropy 2020, 22, 797.
  34. Zhang, P.; Gao, W. Feature relevance term variation for multi-label feature selection. Appl. Intell. 2021, 51, 5095–5110.
  35. Xu, S.; Yang, X.; Yu, H.; Yu, D.J.; Yang, J.; Tsang, E.C. Multi-label learning with label-specific feature reduction. Knowl. Based Syst. 2016, 104, 52–61.
  36. Boutell, M.R.; Luo, J.; Shen, X.; Brown, C.M. Learning multi-label scene classification. Pattern Recognit. 2004, 37, 1757–1771.
  37. Read, J. A pruned problem transformation method for multi-label classification. In New Zealand Computer Science Research Student Conference (NZCSRS 2008); Citeseer: Princeton, NJ, USA, 2008; Volume 143150, p. 41.
  38. Trohidis, K.; Tsoumakas, G.; Kalliris, G.; Vlahavas, I.P. Multi-label classification of music into emotions. In Proceedings of the ISMIR, Philadelphia, PA, USA, 14–18 September 2008; Volume 8, pp. 325–330.
  39. Lee, J.; Kim, D.W. Memetic feature selection algorithm for multi-label classification. Inf. Sci. 2015, 293, 80–96.
  40. Cai, X.; Nie, F.; Huang, H. Exact top-k feature selection via ℓ2,0-norm constraint. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China, 3–9 August 2013.
  41. Lee, J.; Kim, D.W. Mutual information-based multi-label feature selection using interaction information. Expert Syst. Appl. 2015, 42, 2013–2025.
  42. Lee, J.; Kim, D.W. Feature selection for multi-label classification using multivariate mutual information. Pattern Recognit. Lett. 2013, 34, 349–357.
  43. Lee, J.; Kim, D.W. SCLS: Multi-label feature selection based on scalable criterion for large label set. Pattern Recognit. 2017, 66, 342–352.
  44. Liu, J.; Li, Y.; Weng, W.; Zhang, J.; Chen, B.; Wu, S. Feature selection for multi-label learning with streaming label. Neurocomputing 2020, 387, 268–278.
  45. Lin, Y.; Hu, Q.; Liu, J.; Li, J.; Wu, X. Streaming feature selection for multilabel learning based on fuzzy mutual information. IEEE Trans. Fuzzy Syst. 2017, 25, 1491–1507.
  46. Kong, D.; Fujimaki, R.; Liu, J.; Nie, F.; Ding, C. Exclusive Feature Learning on Arbitrary Structures via ℓ1,2-norm. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 1655–1663.
  47. Zhang, M.L.; Zhou, Z.H. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognit. 2007, 40, 2038–2048.
  48. Tsoumakas, G.; Spyromitros-Xioufis, E.; Vilcek, J.; Vlahavas, I. Mulan: A java library for multi-label learning. J. Mach. Learn. Res. 2011, 12, 2411–2414.
  49. Zhang, P.; Gao, W.; Hu, J.; Li, Y. Multi-label feature selection based on the division of label topics. Inf. Sci. 2021, 553, 129–153.
  50. Ueda, N.; Saito, K. Parametric mixture models for multi-labeled text. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2003; pp. 737–744.
  51. Zhang, Y.; Zhou, Z.H. Multilabel dimensionality reduction via dependence maximization. ACM Trans. Knowl. Discov. Data 2010, 4, 1–21.
  52. Doquire, G.; Verleysen, M. Feature selection for multi-label classification problems. In International Work-Conference on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 2011; pp. 9–16.
  53. Szymański, P.; Kajdanowicz, T. A scikit-based Python environment for performing multi-label classification. arXiv 2017, arXiv:1702.01460.
Figure 1. The relationship between features and labels in the Venn diagram.
Figure 2. The relationship between features and labels in the Venn diagram.
Figure 3. The relationship between features and labels in the Venn diagram.
Figure 4. The experimental framework.
Figure 5. The average values of each method for (a) Macro- F 1 , (b) Micro- F 1 , (c) HL, (d) ZOL.
Figure 6. The classification performance of seven methods on Arts data set for (a) Macro- F 1 using SVM, (b) Macro- F 1 using 3NN, (c) Micro- F 1 using SVM, (d) Micro- F 1 using 3NN, (e) HL using ML-kNN, (f) ZOL using ML-kNN.
Figure 7. The classification performance of seven methods on Recreation data set for (a) Macro- F 1 using SVM, (b) Macro- F 1 using 3NN, (c) Micro- F 1 using SVM, (d) Micro- F 1 using 3NN, (e) HL using ML-kNN, (f) ZOL using ML-kNN.
Figure 8. The classification performance of seven methods on Entertain data set for (a) Macro- F 1 using SVM, (b) Macro- F 1 using 3NN, (c) Micro- F 1 using SVM, (d) Micro- F 1 using 3NN, (e) HL using ML-kNN, (f) ZOL using ML-kNN.
Figure 9. The classification performance of seven methods on Health data set for (a) Macro- F 1 using SVM, (b) Macro- F 1 using 3NN, (c) Micro- F 1 using SVM, (d) Micro- F 1 using 3NN, (e) HL using ML-kNN, (f) ZOL using ML-kNN.
Table 1. Abbreviations meaning statistics.
Abbreviation | Corresponding Meaning
FR | A novel feature relevance term
LR | A label-related feature redundancy term
TCRFS | Feature Selection combining three types of Conditional Relevance
Table 2. Feature relevance terms and feature redundancy terms of multi-label feature selection methods.
Methods | Feature Relevance Terms | Feature Redundancy Terms
D2F | $\sum_{l_i \in L} I(f_k; l_i)$ | $\sum_{f_j \in S} \sum_{l_i \in L} I(f_k; f_j; l_i)$
PMU | $\sum_{l_i \in L} I(f_k; l_i)$ | $\sum_{f_j \in S} \sum_{l_i \in L} I(f_k; f_j; l_i) + \sum_{l_i \in L} \sum_{l_j \in L} I(f_k; l_i; l_j)$
SCLS | $\left( 1 - \sum_{f_j \in S} \frac{I(f_k; f_j)}{H(f_k)} \right) \sum_{l_i \in L} I(f_k; l_i)$ | None
MUCO | $FMI(f_k; L)$ | $\frac{1}{|S|} \sum_{f_j \in S} FMI(f_k; f_j)$
TCRFS | $\frac{1}{|L||S|} \sum_{l_i \in L} \sum_{f_j \in S} \left[ I(f_k; l_i \mid f_j) + I(f_j; l_i \mid f_k) \right] + \frac{1}{|L||L-1|} \sum_{l_i \in L} \sum_{i \neq j,\, l_j \in L} I(f_k; l_i \mid l_j)$ | $\frac{1}{|L||L-1|} \sum_{l_i \in L} \sum_{f_j \in S} \left[ I(f_k; f_j) - I(f_k; f_j \mid l_i) \right]$
Table 3. The depiction of data sets in our experiments.
No. | Data Set | Domain | #Labels | #Features | #Training | #Test | #Instances
1 | Birds | Audio | 19 | 260 | 322 | 323 | 645
2 | Emotions | Music | 6 | 72 | 391 | 202 | 593
3 | Genbase | Biology | 27 | 1185 | 463 | 199 | 662
4 | Yeast | Biology | 14 | 103 | 1500 | 917 | 2417
5 | Medical | Text | 45 | 1449 | 333 | 645 | 978
6 | Entertain | Text | 21 | 640 | 2000 | 3000 | 5000
7 | Recreation | Text | 22 | 606 | 2000 | 3000 | 5000
8 | Arts | Text | 26 | 462 | 2000 | 3000 | 5000
9 | Health | Text | 32 | 612 | 2000 | 3000 | 5000
10 | Education | Text | 33 | 550 | 2000 | 3000 | 5000
11 | Reference | Text | 33 | 793 | 2000 | 3000 | 5000
12 | Social | Text | 39 | 1047 | 2000 | 3000 | 5000
13 | Science | Text | 40 | 743 | 2000 | 3000 | 5000
Table 4. Training set.
f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 | y0 y1 y2 y3
1 1 0 0 0 0 1 0 1 0 | 1 0 0 1
0 0 0 1 1 0 1 0 0 0 | 1 1 1 1
0 1 0 1 0 0 0 0 1 0 | 0 0 1 1
0 1 0 0 1 0 0 1 0 1 | 1 0 0 0
1 1 1 0 0 1 0 1 1 0 | 0 0 0 0
1 0 0 0 0 0 1 0 1 0 | 1 1 0 0
1 0 0 0 1 0 1 0 1 0 | 0 1 0 1
0 0 1 0 1 0 0 1 0 1 | 0 0 0 0
0 1 0 1 0 1 0 0 0 0 | 0 1 1 0
0 1 1 0 0 0 0 0 1 0 | 1 0 0 1
1 1 0 0 0 0 1 1 1 0 | 1 1 0 1
1 1 0 1 1 0 0 1 0 0 | 1 0 0 0
0 1 1 1 0 0 0 0 0 0 | 0 1 1 0
0 1 1 0 1 0 0 1 0 1 | 1 0 0 0
1 1 0 0 0 1 0 1 1 0 | 0 1 1 0
1 0 1 0 0 0 0 0 1 0 | 1 1 0 1
0 0 0 0 1 0 1 0 1 0 | 0 0 1 0
0 0 1 0 1 0 0 1 0 1 | 0 0 0 1
0 0 0 1 0 1 0 0 0 0 | 0 1 0 0
0 1 1 0 1 0 0 1 1 1 | 1 1 0 0
Table 5. Test set.
f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 | y0 y1 y2 y3
1 1 0 0 0 0 1 0 1 1 | 0 1 1 0
0 0 1 1 1 0 1 0 1 1 | 0 0 1 0
1 0 1 1 0 1 0 1 0 0 | 0 1 0 1
1 1 0 0 0 1 0 0 0 0 | 0 0 1 0
1 0 0 1 1 0 0 1 0 1 | 0 1 1 0
1 0 1 0 1 1 0 0 1 1 | 0 1 0 1
1 1 0 0 0 0 1 0 1 0 | 0 1 1 1
0 0 1 0 1 0 1 1 1 1 | 1 0 1 0
1 0 1 1 1 1 0 0 0 0 | 0 1 0 0
0 1 0 0 0 1 0 0 0 0 | 1 1 0 1
Table 6. Experimental results on the artificial data set.
Methods | Feature Ranking | Macro-F1 (SVM) | Micro-F1 (SVM) | Macro-F1 (ML-kNN) | Micro-F1 (ML-kNN) | HL ↓ (ML-kNN) | ZOL ↓ (ML-kNN)
TCRFS | {f5, f0, f7, f8, f3, f4, f1, f6, f9, f2} | 0.332 | 0.457 | 0.375 | 0.435 | 0.5000 | 0.97
D2F | {f5, f0, f7, f8, f3, f4, f1, f6, f2, f9} | 0.331 | 0.455 | 0.374 | 0.431 | 0.5150 | 0.97
PMU | {f5, f0, f7, f8, f3, f4, f1, f6, f2, f9} | 0.331 | 0.455 | 0.374 | 0.431 | 0.5150 | 0.97
SCLS | {f5, f9, f3, f7, f0, f6, f1, f2, f8, f4} | 0.32 | 0.409 | 0.373 | 0.427 | 0.5025 | 0.98
MUCO | {f4, f6, f7, f8, f1, f2, f3, f0, f5, f9} | 0.331 | 0.397 | 0.334 | 0.385 | 0.5450 | 0.98
Table 7. Classification performance of each method regarding Macro- F 1 on SVM classifier (mean ± std).
Data Set | RALM-FS | D2F | PMU | SCLS | FSSL | MUCO | TCRFS
Birds | 0.058 ± 0.024 | 0.077 ± 0.04 | 0.075 ± 0.036 | 0.039 ± 0.026 | 0.049 ± 0.027 | 0.1 ± 0.051 | 0.116 ± 0.058
Emotions | 0.147 ± 0.101 | 0.315 ± 0.061 | 0.239 ± 0.095 | 0.336 ± 0.055 | 0.35 ± 0.085 | 0.366 ± 0.127 | 0.381 ± 0.089
Genbase | 0.738 ± 0.153 | 0.706 ± 0.107 | 0.628 ± 0.093 | 0.241 ± 0.022 | 0.762 ± 0.133 | 0.758 ± 0.14 | 0.765 ± 0.129
Yeast | 0.229 ± 0.036 | 0.258 ± 0.034 | 0.262 ± 0.031 | 0.207 ± 0.014 | 0.213 ± 0.037 | 0.227 ± 0.044 | 0.276 ± 0.036
Medical | 0.129 ± 0.063 | 0.191 ± 0.055 | 0.188 ± 0.057 | 0.079 ± 0.013 | 0.227 ± 0.086 | 0.254 ± 0.074 | 0.311 ± 0.075
Entertain | 0.059 ± 0.022 | 0.081 ± 0.006 | 0.051 ± 0.004 | 0.067 ± 0.006 | 0.075 ± 0.028 | 0.058 ± 0.013 | 0.119 ± 0.023
Recreation | 0.024 ± 0.008 | 0.077 ± 0.009 | 0.026 ± 0.002 | 0.044 ± 0.004 | 0.042 ± 0.024 | 0.041 ± 0.018 | 0.105 ± 0.019
Arts | 0.024 ± 0.014 | 0.031 ± 0.005 | 0.014 ± 0.007 | 0.027 ± 0.005 | 0.025 ± 0.014 | 0.026 ± 0.014 | 0.072 ± 0.024
Health | 0.062 ± 0.021 | 0.089 ± 0.008 | 0.078 ± 0.008 | 0.089 ± 0.01 | 0.087 ± 0.022 | 0.077 ± 0.021 | 0.141 ± 0.028
Education | 0.024 ± 0.009 | 0.046 ± 0.009 | 0.027 ± 0.008 | 0.038 ± 0.006 | 0.041 ± 0.015 | 0.041 ± 0.019 | 0.065 ± 0.013
Reference | 0.023 ± 0.01 | 0.039 ± 0.004 | 0.026 ± 0.006 | 0.024 ± 0.004 | 0.03 ± 0.011 | 0.04 ± 0.017 | 0.065 ± 0.013
Social | 0.046 ± 0.018 | 0.07 ± 0.01 | 0.052 ± 0.012 | 0.052 ± 0.006 | 0.055 ± 0.02 | 0.059 ± 0.019 | 0.101 ± 0.028
Science | 0.008 ± 0.006 | 0.021 ± 0.003 | 0.009 ± 0.005 | 0.016 ± 0.004 | 0.023 ± 0.013 | 0.024 ± 0.013 | 0.049 ± 0.017
Average | 0.121 | 0.154 | 0.129 | 0.097 | 0.152 | 0.159 | 0.197
Table 8. Classification performance of each method regarding Micro- F 1 on SVM classifier (mean ± std).
Data Set | RALM-FS | D2F | PMU | SCLS | FSSL | MUCO | TCRFS
Birds | 0.096 ± 0.046 | 0.135 ± 0.075 | 0.129 ± 0.055 | 0.06 ± 0.04 | 0.084 ± 0.049 | 0.197 ± 0.078 | 0.207 ± 0.086
Emotions | 0.178 ± 0.113 | 0.372 ± 0.038 | 0.295 ± 0.099 | 0.422 ± 0.038 | 0.434 ± 0.06 | 0.425 ± 0.118 | 0.45 ± 0.07
Genbase | 0.958 ± 0.136 | 0.968 ± 0.066 | 0.946 ± 0.066 | 0.541 ± 0.014 | 0.969 ± 0.108 | 0.977 ± 0.071 | 0.979 ± 0.067
Yeast | 0.552 ± 0.027 | 0.565 ± 0.023 | 0.571 ± 0.021 | 0.532 ± 0.008 | 0.54 ± 0.026 | 0.549 ± 0.031 | 0.584 ± 0.027
Medical | 0.363 ± 0.147 | 0.629 ± 0.07 | 0.625 ± 0.075 | 0.37 ± 0.009 | 0.661 ± 0.168 | 0.711 ± 0.087 | 0.753 ± 0.058
Entertain | 0.108 ± 0.043 | 0.163 ± 0.015 | 0.096 ± 0.013 | 0.149 ± 0.016 | 0.192 ± 0.062 | 0.127 ± 0.041 | 0.251 ± 0.054
Recreation | 0.043 ± 0.018 | 0.138 ± 0.016 | 0.038 ± 0.003 | 0.07 ± 0.007 | 0.065 ± 0.038 | 0.077 ± 0.034 | 0.198 ± 0.035
Arts | 0.059 ± 0.033 | 0.075 ± 0.013 | 0.033 ± 0.016 | 0.072 ± 0.015 | 0.062 ± 0.033 | 0.056 ± 0.031 | 0.16 ± 0.051
Health | 0.401 ± 0.018 | 0.418 ± 0.012 | 0.391 ± 0.029 | 0.406 ± 0.004 | 0.426 ± 0.02 | 0.396 ± 0.061 | 0.479 ± 0.026
Education | 0.073 ± 0.024 | 0.117 ± 0.017 | 0.077 ± 0.014 | 0.138 ± 0.023 | 0.142 ± 0.056 | 0.132 ± 0.06 | 0.203 ± 0.045
Reference | 0.153 ± 0.077 | 0.305 ± 0.039 | 0.265 ± 0.05 | 0.259 ± 0.039 | 0.286 ± 0.062 | 0.314 ± 0.093 | 0.344 ± 0.058
Social | 0.252 ± 0.107 | 0.396 ± 0.072 | 0.31 ± 0.07 | 0.384 ± 0.049 | 0.357 ± 0.105 | 0.356 ± 0.082 | 0.426 ± 0.073
Science | 0.029 ± 0.015 | 0.053 ± 0.01 | 0.024 ± 0.016 | 0.058 ± 0.014 | 0.071 ± 0.034 | 0.074 ± 0.037 | 0.122 ± 0.032
Average | 0.251 | 0.333 | 0.292 | 0.266 | 0.33 | 0.338 | 0.397
Table 9. Classification performance of each method regarding Macro- F 1 on 3NN classifier (mean ± std).
Data Set | RALM-FS | D2F | PMU | SCLS | FSSL | MUCO | TCRFS
Birds | 0.093 ± 0.036 | 0.15 ± 0.066 | 0.122 ± 0.036 | 0.078 ± 0.028 | 0.075 ± 0.037 | 0.131 ± 0.038 | 0.17 ± 0.048
Emotions | 0.312 ± 0.074 | 0.434 ± 0.033 | 0.413 ± 0.046 | 0.426 ± 0.042 | 0.442 ± 0.124 | 0.434 ± 0.101 | 0.468 ± 0.068
Genbase | 0.689 ± 0.132 | 0.65 ± 0.086 | 0.604 ± 0.089 | 0.224 ± 0.018 | 0.702 ± 0.12 | 0.7 ± 0.123 | 0.71 ± 0.103
Yeast | 0.3 ± 0.027 | 0.348 ± 0.038 | 0.34 ± 0.03 | 0.301 ± 0.026 | 0.309 ± 0.041 | 0.314 ± 0.033 | 0.334 ± 0.039
Medical | 0.069 ± 0.029 | 0.121 ± 0.019 | 0.114 ± 0.018 | 0.063 ± 0.006 | 0.149 ± 0.04 | 0.155 ± 0.03 | 0.184 ± 0.025
Entertain | 0.079 ± 0.031 | 0.108 ± 0.011 | 0.083 ± 0.014 | 0.095 ± 0.013 | 0.094 ± 0.028 | 0.089 ± 0.014 | 0.128 ± 0.019
Recreation | 0.06 ± 0.014 | 0.082 ± 0.011 | 0.053 ± 0.01 | 0.066 ± 0.011 | 0.057 ± 0.026 | 0.057 ± 0.021 | 0.114 ± 0.019
Arts | 0.036 ± 0.018 | 0.064 ± 0.01 | 0.058 ± 0.014 | 0.072 ± 0.016 | 0.061 ± 0.026 | 0.064 ± 0.019 | 0.092 ± 0.02
Health | 0.064 ± 0.027 | 0.087 ± 0.011 | 0.093 ± 0.008 | 0.087 ± 0.011 | 0.087 ± 0.024 | 0.08 ± 0.018 | 0.122 ± 0.022
Education | 0.047 ± 0.011 | 0.065 ± 0.009 | 0.057 ± 0.009 | 0.059 ± 0.01 | 0.063 ± 0.015 | 0.06 ± 0.019 | 0.074 ± 0.012
Reference | 0.032 ± 0.01 | 0.044 ± 0.004 | 0.034 ± 0.007 | 0.036 ± 0.005 | 0.041 ± 0.01 | 0.046 ± 0.015 | 0.07 ± 0.011
Social | 0.052 ± 0.013 | 0.064 ± 0.006 | 0.054 ± 0.006 | 0.051 ± 0.004 | 0.064 ± 0.024 | 0.058 ± 0.016 | 0.091 ± 0.011
Science | 0.024 ± 0.008 | 0.04 ± 0.005 | 0.028 ± 0.008 | 0.03 ± 0.004 | 0.039 ± 0.019 | 0.036 ± 0.011 | 0.057 ± 0.012
Average | 0.143 | 0.174 | 0.158 | 0.122 | 0.168 | 0.171 | 0.201
Table 10. Classification performance of each method regarding Micro- F 1 on 3NN classifier (mean ± std).
Data Set | RALM-FS | D2F | PMU | SCLS | FSSL | MUCO | TCRFS
Birds | 0.171 ± 0.066 | 0.231 ± 0.072 | 0.203 ± 0.05 | 0.144 ± 0.043 | 0.159 ± 0.054 | 0.227 ± 0.057 | 0.273 ± 0.061
Emotions | 0.353 ± 0.051 | 0.469 ± 0.02 | 0.445 ± 0.022 | 0.46 ± 0.028 | 0.478 ± 0.114 | 0.471 ± 0.079 | 0.503 ± 0.05
Genbase | 0.956 ± 0.134 | 0.95 ± 0.061 | 0.919 ± 0.064 | 0.518 ± 0.012 | 0.959 ± 0.126 | 0.974 ± 0.074 | 0.977 ± 0.065
Yeast | 0.529 ± 0.019 | 0.549 ± 0.041 | 0.553 ± 0.014 | 0.518 ± 0.035 | 0.526 ± 0.049 | 0.523 ± 0.041 | 0.552 ± 0.041
Medical | 0.294 ± 0.108 | 0.53 ± 0.038 | 0.522 ± 0.037 | 0.353 ± 0.013 | 0.558 ± 0.121 | 0.591 ± 0.053 | 0.638 ± 0.032
Entertain | 0.187 ± 0.085 | 0.241 ± 0.032 | 0.22 ± 0.053 | 0.217 ± 0.031 | 0.229 ± 0.037 | 0.234 ± 0.048 | 0.249 ± 0.032
Recreation | 0.102 ± 0.014 | 0.159 ± 0.024 | 0.094 ± 0.02 | 0.115 ± 0.017 | 0.111 ± 0.045 | 0.112 ± 0.041 | 0.224 ± 0.033
Arts | 0.095 ± 0.045 | 0.15 ± 0.031 | 0.137 ± 0.028 | 0.172 ± 0.028 | 0.126 ± 0.044 | 0.155 ± 0.029 | 0.237 ± 0.028
Health | 0.2 ± 0.097 | 0.367 ± 0.05 | 0.361 ± 0.038 | 0.366 ± 0.064 | 0.33 ± 0.092 | 0.339 ± 0.038 | 0.38 ± 0.063
Education | 0.254 ± 0.026 | 0.19 ± 0.032 | 0.18 ± 0.04 | 0.19 ± 0.033 | 0.238 ± 0.032 | 0.191 ± 0.054 | 0.22 ± 0.036
Reference | 0.164 ± 0.073 | 0.364 ± 0.048 | 0.35 ± 0.043 | 0.294 ± 0.048 | 0.334 ± 0.049 | 0.319 ± 0.085 | 0.42 ± 0.046
Social | 0.302 ± 0.04 | 0.39 ± 0.051 | 0.363 ± 0.051 | 0.368 ± 0.04 | 0.354 ± 0.069 | 0.349 ± 0.056 | 0.432 ± 0.045
Science | 0.08 ± 0.037 | 0.123 ± 0.019 | 0.099 ± 0.018 | 0.147 ± 0.034 | 0.112 ± 0.041 | 0.136 ± 0.037 | 0.153 ± 0.031
Average | 0.284 | 0.363 | 0.342 | 0.297 | 0.347 | 0.355 | 0.404
Table 11. Classification performance of each method regarding HL on ML-kNN classifier (mean ± std).

| Data Set | RALM-FS | D2F | PMU | SCLS | FSSL | MUCO | TCRFS |
|---|---|---|---|---|---|---|---|
| Birds | 0.05081 ± 0.00106 | 0.05269 ± 0.00164 | 0.05227 ± 0.0017 | 0.0544 ± 0.00188 | 0.0526 ± 0.00143 | 0.05138 ± 0.00133 | 0.05147 ± 0.00103 |
| Emotions | 0.33752 ± 0.01318 | 0.29408 ± 0.01324 | 0.31854 ± 0.00914 | 0.27947 ± 0.00716 | 0.2922 ± 0.01356 | 0.28878 ± 0.02079 | 0.28012 ± 0.01018 |
| Genbase | 0.00377 ± 0.0068 | 0.00315 ± 0.00391 | 0.00469 ± 0.00405 | 0.03093 ± 0.00042 | 0.00301 ± 0.00585 | 0.00296 ± 0.00433 | 0.00269 ± 0.00396 |
| Yeast | 0.23706 ± 0.00434 | 0.22784 ± 0.00287 | 0.22793 ± 0.00356 | 0.2332 ± 0.00431 | 0.23182 ± 0.00293 | 0.23341 ± 0.00377 | 0.22565 ± 0.00404 |
| Medical | 0.02702 ± 0.0007 | 0.01955 ± 0.00105 | 0.01972 ± 0.00107 | 0.02332 ± 0.00018 | 0.01842 ± 0.00237 | 0.01852 ± 0.00108 | 0.01774 ± 0.0009 |
| Entertain | 0.06652 ± 0.00057 | 0.06568 ± 0.00133 | 0.06708 ± 0.00112 | 0.06587 ± 0.00144 | 0.06415 ± 0.00103 | 0.06631 ± 0.00085 | 0.06315 ± 0.00145 |
| Recreation | 0.06513 ± 0.00038 | 0.06239 ± 0.00077 | 0.06484 ± 0.00068 | 0.06444 ± 0.0006 | 0.06513 ± 0.00069 | 0.06419 ± 0.0007 | 0.06144 ± 0.00111 |
| Arts | 0.06285 ± 0.00023 | 0.0635 ± 0.00122 | 0.06441 ± 0.00104 | 0.06339 ± 0.00074 | 0.06389 ± 0.00057 | 0.06412 ± 0.00075 | 0.06135 ± 0.00063 |
| Health | 0.04969 ± 0.00132 | 0.04831 ± 0.00051 | 0.04934 ± 0.00059 | 0.04848 ± 0.00114 | 0.04764 ± 0.00101 | 0.04898 ± 0.00068 | 0.04545 ± 0.00111 |
| Education | 0.04414 ± 0.00034 | 0.04427 ± 0.00073 | 0.04453 ± 0.00082 | 0.04408 ± 0.00101 | 0.04403 ± 0.0006 | 0.0444 ± 0.00054 | 0.04303 ± 0.00069 |
| Reference | 0.03503 ± 0.00035 | 0.03223 ± 0.00117 | 0.03357 ± 0.00095 | 0.0329 ± 0.00021 | 0.03262 ± 0.00068 | 0.03332 ± 0.00061 | 0.03133 ± 0.00075 |
| Social | 0.03061 ± 0.00122 | 0.03032 ± 0.00046 | 0.03091 ± 0.00031 | 0.02866 ± 0.0007 | 0.02906 ± 0.00092 | 0.02967 ± 0.00055 | 0.02766 ± 0.00077 |
| Science | 0.03615 ± 0.00028 | 0.03579 ± 0.0004 | 0.03626 ± 0.00036 | 0.03583 ± 0.00041 | 0.03567 ± 0.00027 | 0.0361 ± 0.00058 | 0.03543 ± 0.00042 |
| Average | 0.08048 | 0.07537 | 0.07801 | 0.07731 | 0.0754 | 0.07555 | 0.07281 |
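HL in Table 11 is read here as the standard Hamming Loss: the fraction of instance–label pairs that are misclassified, so lower is better. A hedged sketch of that definition on a toy label matrix:

```python
import numpy as np

def hamming_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Fraction of instance-label pairs predicted incorrectly (lower is better)."""
    return float(np.mean(y_true != y_pred))

# Toy example: 2 instances, 4 labels; two of the eight pairs are wrong
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 1, 1]])
print(hamming_loss(y_true, y_pred))  # 2 wrong pairs out of 8 -> 0.25
```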
Table 12. Classification performance of each method regarding ZOL on ML-kNN classifier (mean ± std).

| Data Set | RALM-FS | D2F | PMU | SCLS | FSSL | MUCO | TCRFS |
|---|---|---|---|---|---|---|---|
| Birds | 0.53239 ± 0.00619 | 0.53352 ± 0.01484 | 0.55013 ± 0.02117 | 0.53543 ± 0.00551 | 0.52745 ± 0.00789 | 0.53007 ± 0.00864 | 0.54019 ± 0.01396 |
| Emotions | 0.92468 ± 0.03724 | 0.82815 ± 0.02803 | 0.88331 ± 0.05054 | 0.85502 ± 0.03048 | 0.85431 ± 0.03856 | 0.83982 ± 0.03592 | 0.83522 ± 0.02541 |
| Genbase | 0.07909 ± 0.15179 | 0.06976 ± 0.07896 | 0.09236 ± 0.07004 | 0.56379 ± 0.01154 | 0.06285 ± 0.12667 | 0.06058 ± 0.07839 | 0.05795 ± 0.0815 |
| Yeast | 0.94729 ± 0.02727 | 0.88602 ± 0.02723 | 0.89168 ± 0.02807 | 0.91671 ± 0.01147 | 0.9233 ± 0.03139 | 0.91613 ± 0.03483 | 0.88586 ± 0.01848 |
| Medical | 0.86604 ± 0.07297 | 0.65611 ± 0.03702 | 0.66257 ± 0.04058 | 0.82617 ± 0.00642 | 0.62048 ± 0.0981 | 0.61537 ± 0.0484 | 0.58932 ± 0.0373 |
| Entertain | 0.94447 ± 0.01955 | 0.90565 ± 0.01002 | 0.94136 ± 0.00863 | 0.90345 ± 0.01303 | 0.88309 ± 0.03407 | 0.91441 ± 0.02957 | 0.85752 ± 0.02652 |
| Recreation | 0.97955 ± 0.01057 | 0.92066 ± 0.00898 | 0.97122 ± 0.00609 | 0.95327 ± 0.00543 | 0.95681 ± 0.02178 | 0.9493 ± 0.02212 | 0.87796 ± 0.01967 |
| Arts | 0.96399 ± 0.0181 | 0.9548 ± 0.01101 | 0.97061 ± 0.0167 | 0.9529 ± 0.01086 | 0.96364 ± 0.02175 | 0.96234 ± 0.02165 | 0.92196 ± 0.02549 |
| Health | 0.7561 ± 0.0662 | 0.77159 ± 0.05271 | 0.77152 ± 0.04486 | 0.73661 ± 0.0437 | 0.74891 ± 0.05006 | 0.7876 ± 0.05694 | 0.70867 ± 0.04394 |
| Education | 0.95281 ± 0.0162 | 0.94833 ± 0.00936 | 0.95489 ± 0.01428 | 0.9339 ± 0.01388 | 0.94176 ± 0.02666 | 0.93868 ± 0.02975 | 0.90171 ± 0.02493 |
| Reference | 0.90776 ± 0.05755 | 0.80313 ± 0.03802 | 0.81068 ± 0.05208 | 0.8284 ± 0.0372 | 0.80829 ± 0.04754 | 0.80433 ± 0.0658 | 0.7591 ± 0.06182 |
| Social | 0.84735 ± 0.07255 | 0.73236 ± 0.08727 | 0.77499 ± 0.06847 | 0.74463 ± 0.04251 | 0.75138 ± 0.08065 | 0.76243 ± 0.052 | 0.72314 ± 0.05028 |
| Science | 0.98663 ± 0.00642 | 0.9725 ± 0.00583 | 0.98477 ± 0.00815 | 0.95488 ± 0.01192 | 0.95139 ± 0.01995 | 0.96111 ± 0.02084 | 0.94441 ± 0.0112 |
| Average | 0.82217 | 0.76789 | 0.78924 | 0.82347 | 0.76874 | 0.77247 | 0.73869 |
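ZOL is read here as the zero-one (subset) loss: an instance counts as an error unless its entire predicted label set matches the ground truth exactly, which is why the values in Table 12 run far higher than the per-pair Hamming losses in Table 11. A minimal sketch under that assumption:

```python
import numpy as np

def zero_one_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Fraction of instances whose predicted label set is not exactly correct."""
    return float(np.mean(np.any(y_true != y_pred, axis=1)))

# Toy example: 3 instances, 2 labels; only the first instance is fully correct
y_true = np.array([[1, 0], [0, 1], [1, 1]])
y_pred = np.array([[1, 0], [1, 1], [0, 1]])
print(round(zero_one_loss(y_true, y_pred), 3))  # 2 of 3 instances wrong -> 0.667
```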
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Gao, L.; Wang, Y.; Li, Y.; Zhang, P.; Hu, L. Multi-Label Feature Selection Combining Three Types of Conditional Relevance. Entropy 2021, 23, 1617. https://doi.org/10.3390/e23121617