Article

A Novel Robust Metric Distance Optimization-Driven Manifold Learning Framework for Semi-Supervised Pattern Classification

School of Mathematics and Information Sciences, North Minzu University, Yinchuan 750021, China
* Author to whom correspondence should be addressed.
Axioms 2023, 12(8), 737; https://doi.org/10.3390/axioms12080737
Submission received: 24 June 2023 / Revised: 23 July 2023 / Accepted: 24 July 2023 / Published: 27 July 2023
(This article belongs to the Special Issue Mathematics of Neural Networks: Models, Algorithms and Applications)

Abstract

In this work, we address the problem of improving the classification performance of machine learning models, especially in the presence of noise and outliers. To this end, we first design a generalized adaptive robust loss function, $V_\theta(x)$. Intuitively, $V_\theta(x)$ improves the robustness of the model by selecting different robust loss functions for different learning tasks during the learning process via the adaptive parameter $\theta$. Compared with other robust loss functions, $V_\theta(x)$ has several desirable properties, such as symmetry, boundedness, robustness, nonconvexity, and adaptivity, making it suitable for a wide range of machine learning applications. Second, a new robust semi-supervised learning framework for pattern classification is proposed. In this framework, the proposed loss $V_\theta(x)$ and the capped $L_{2,p}$-norm robust distance metric are introduced to improve the robustness and generalization performance of the model, especially when the outliers are far from the normal data distribution. Based on this framework, the Welsch manifold robust twin bounded support vector machine (WMRTBSVM) and its least-squares version are developed. Finally, two effective iterative optimization algorithms are designed, their convergence is proved, and their complexity is analyzed. Experimental results on several datasets with different noise settings and different evaluation criteria show that our methods achieve better classification performance and robustness. On the Cancer dataset, the classification accuracies of our proposed methods are $94.17\%$ and $95.62\%$, respectively, when there is no noise, and $91.76\%$ and $90.59\%$, respectively, under $50\%$ Gaussian noise, demonstrating that our methods have satisfactory classification performance and robustness.

1. Introduction

Data collection and reasonable processing are becoming increasingly crucial as modern computer technology advances. As an excellent machine learning tool, the support vector machine (SVM) [1,2,3] has been widely used in bioinformatics, computer vision, data mining, robotics, and other fields in recent years. The main idea behind SVM classification, which is based on statistical learning theory and optimization theory, is to construct a pair of parallel hyperplanes that maximize the minimum distance between two classes of samples. SVMs implement the structural risk minimization (SRM) principle in addition to empirical risk minimization. Although SVM can achieve good classification performance, it needs to solve a large-scale quadratic programming problem (QPP), and training is time-consuming, which seriously hinders the application of SVM to large-scale classification tasks [4]. Furthermore, when dealing with complicated data, such as the “XOR” problem, the simple SVM model runs into various issues that limit its development and practical use.
To overcome the difficulty of solving a large QPP in SVM, Jayadeva et al. [5] proposed the twin support vector machine (TSVM) for pattern classification, building on the generalized eigenvalue proximal support vector machine (GEPSVM). Since TSVM solves two smaller QPPs instead of a single large one, it can theoretically be trained about four times faster than a standard SVM. The main goal of TSVM is to find two nonparallel hyperplanes, each of which is as close as possible to the samples of its own class while being as far as possible from the samples of the other class. Further, because TSVM considers only empirical risk minimization and ignores the principle of structural risk minimization, Shao et al. [6] proposed the twin bounded support vector machine (TBSVM) by introducing two regularization terms. Compared with TSVM, a significant advantage of TBSVM is that it implements structural risk minimization, which embodies the essence of statistical learning theory, and this improves on the classification performance of TSVM. In recent years, several TSVM-based variants have been proposed for pattern classification tasks, such as the least squares twin support vector machine (LSTSVM) [4], recursive projection twin support vector machine (RPTSVM) [7], pinball twin support vector machine (Pin-TSVM) [8], sparse pinball twin support vector machine (SPTWSVM) [9], least squares recursive projection twin support vector machine (LSRPTSVM) [10], and fuzzy twin support vector machine (FBTSVM) [11], which have greatly promoted the development of TSVM.
It is well known that distance metrics play a crucial role in many machine learning algorithms [12]. Although the above algorithms show good performance in pattern classification, most of them adopt the $L_2$-norm distance metric, whose squaring operation exaggerates the impact of outliers on model performance. To alleviate this effect on robustness, the $L_1$-norm distance metric, which has a bounded derivative, has received extensive attention in many fields of machine learning in recent years [13,14,15,16,17,18]. For example, Zhu et al. [13] proposed the 1-norm SVM (1-SVM) based on the SVM learning framework. Mangasarian [14] proposed an exact $L_1$-norm support vector machine based on unconstrained convex differentiable minimization. Gao [15] developed a new 1-norm least squares TSVM (NELSTSVM). Ye et al. [16] proposed an $L_1$-norm distance minimization-based robust TSVM. Yan et al. [17] proposed the 1-norm projection TSVM (1-PTSVM). As mentioned earlier, the $L_1$-norm is a better alternative to the squared $L_2$-norm in terms of robustness. However, when the outliers are large, existing classification methods based on the $L_1$-norm distance often cannot achieve satisfactory classification results.
Recently, more and more researchers have paid attention to the capped $L_1$-norm and achieved some excellent research results [19,20,21,22,23,24]. Research shows that the capped $L_1$-norm is a better approximation of the $L_0$-norm and is more robust than the $L_1$-norm. Some excellent algorithms based on the capped $L_1$-norm have been proposed for robust classification tasks. For example, Wang et al. [25] proposed a new robust TSVM (CTSVM) by applying the capped $L_1$-norm. CTSVM retains the advantages of TSVM and improves the robustness of classification. Experimental results on multiple datasets show that the CTSVM algorithm is robust and effective in the presence of outliers. However, capped $L_1$-norm metrics are neither convex nor smooth, which makes them difficult to optimize. There are two general strategies for solving nonconvex optimization problems. The first strategy is to design efficient algorithms, such as the bump process algorithm and the abnormal path algorithm. The second strategy is to smooth the metric function to reduce the complexity of the algorithm. To overcome the shortcomings of the capped $L_1$-norm, many scholars have proposed the capped $L_{2,p}$-norm for robust learning [26,27]. Zhang et al. [28] proposed a new large-scale semi-supervised classification algorithm based on ridge regression and the capped $L_{2,p}$-norm loss function. Notably, the capped $L_1$-norm and capped $L_2$-norm are special cases of the capped $L_{2,p}$-norm, obtained by setting $p = 1$ and $p = 2$, respectively. These algorithms show that the capped distance metric is robust against outliers. However, there are few extensions and applications of the capped $L_{2,p}$-norm to the twin support vector machine.
In practice, although data collection is easy, obtaining labeled data is difficult [29]. To address this issue, researchers have proposed semi-supervised learning (SSL) [29], which uses a small amount of labeled data together with a large amount of unlabeled data to build more reliable classifiers. Graph-based SSL algorithms are a significant branch of SSL. Their learning strategy is to first form edges by connecting labeled and unlabeled data points and then build from these edges a graph that represents the similarity between samples. Manifold regularization (MR)-based SSL [30] is one of the graph-based SSL methods; it preserves the manifold structure to improve the discriminative property of the data [31]. Its learning strategy is to mine the geometric distribution information of the data and represent it in the form of regularization terms. Reference [31] first introduced MR to SSL by proposing the Laplacian support vector machine (Lap-SVM) and Laplacian regularized least squares (Lap-RLS). Qi et al. [32] developed a Laplacian TSVM (LapTSVM) based on the pair of nonparallel hyperplanes of TSVM. Although the classifier's generalization performance is improved, the method's parameter tuning depends on the dataset, and its high computational complexity may prevent it from handling large-scale problems effectively. Xie et al. [33] proposed a novel Laplacian $L_p$-norm least squares twin support vector machine (Lap-$L_p$LSTSVM). Experimental results on both synthetic and real-world datasets show that Lap-$L_p$LSTSVM outperforms other state-of-the-art methods and can also deal with noisy datasets [34,35].
To summarize, prior research on improving the classification performance of TBSVM while considering both robustness and discriminability is limited. In response, we introduce the WMRTBSVM and WMLSRTBSVM models. Specifically, we replace the squared $L_2$-norm distance term in TBSVM with the capped $L_{2,p}$-norm, and we replace the hinge loss term in TBSVM with the Welsch loss with $\theta$-power. This improves the model's classification performance and robustness. Furthermore, we incorporate a manifold structure into the model to further enhance its classification performance and discriminability. The main contributions of this paper are summarized as follows:
(1) A generalized adaptive robust loss function called $V_\theta(x)$ is designed. Intuitively, $V_\theta(x)$ improves the robustness of the model by selecting different robust loss functions for different learning tasks during the learning process via the adaptive parameter $\theta$. Compared with other robust loss functions, $V_\theta(x)$ has several desirable properties, such as symmetry, boundedness, robustness, nonconvexity, and adaptivity.
(2) A novel robust manifold learning framework for semi-supervised pattern classification is proposed. In this framework, the proposed robust loss function $V_\theta(x)$ and the capped $L_{2,p}$-norm robust distance metric are introduced to improve the robustness and generalization performance of the model, especially when the outliers are far from the normal data distribution.
(3) Two effective iterative algorithms based on half-quadratic (HQ) optimization are designed to solve the proposed models, and their convergence is demonstrated.
(4) Experimental results on artificial and benchmark datasets with different noise settings and different evaluation criteria show that our methods have better classification performance and robustness.
The rest of this paper is organized as follows. In Section 2, we introduce the formulas involved in TBSVM and manifold regularization, since our model builds on these two approaches. In Section 3, we present a novel robust manifold learning framework for semi-supervised pattern classification. Finally, experiments and conclusions are discussed in Section 4 and Section 5, respectively.

2. Related Works

This section reviews related work, namely TBSVM and manifold regularization. We consider the binary classification problem in the $n$-dimensional real vector space $\mathbb{R}^n$. All vectors are column vectors. We are given a training dataset $T = \{(x_1, y_1), \dots, (x_m, y_m)\}$, where $x_i \in \mathbb{R}^n$ is the input and $y_i \in \{1, -1\}$ is the corresponding output for $i = 1, \dots, m$. $T$ is composed of $m_1$ positive and $m_2$ negative samples, where $m = m_1 + m_2$. The samples of class $i$ form the data matrix $X_i \in \mathbb{R}^{m_i \times n}$, where each row represents a sample. $A \in \mathbb{R}^{m_1 \times n}$ collects all positive samples (i.e., $y_i = 1$), and $B \in \mathbb{R}^{m_2 \times n}$ collects all negative samples (i.e., $y_i = -1$).

2.1. TBSVM

In this subsection, we provide a brief review of the twin bounded support vector machine (TBSVM). The optimization objective of TBSVM is to ensure that each hyperplane is as close as possible to the samples in the corresponding class and as far away as possible from the samples in the other class. For the linear case, TBSVM defines two nonparallel hyperplanes:
$$
f_1(x) = \omega_1^T x + b_1 = 0 \quad \text{and} \quad f_2(x) = \omega_2^T x + b_2 = 0. \tag{1}
$$
To improve the classification ability of TSVM and realize the principle of structural risk minimization, TBSVM is obtained from TSVM by introducing an $L_2$-regularization term:
$$
\begin{aligned}
\min_{\omega_1, b_1, \xi_1}\ & \frac{1}{2}\|A\omega_1 + e_1 b_1\|_2^2 + c_1 e_2^T \xi_1 + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) \\
\text{s.t.}\ & -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2, \quad \xi_1 \ge 0,
\end{aligned}
\tag{2}
$$
and
$$
\begin{aligned}
\min_{\omega_2, b_2, \xi_2}\ & \frac{1}{2}\|B\omega_2 + e_2 b_2\|_2^2 + c_2 e_1^T \xi_2 + \frac{c_4}{2}\left(\|\omega_2\|_2^2 + b_2^2\right) \\
\text{s.t.}\ & (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1, \quad \xi_2 \ge 0.
\end{aligned}
\tag{3}
$$
To avoid singularity problems when inverting matrices, the positive terms $\lambda_1 I$ and $\lambda_2 I$ are introduced, where $\lambda_1$ and $\lambda_2$ are small positive constants, and $0$ and $I$ denote the zero vector and the identity matrix of appropriate dimensions, respectively. Based on duality theory, we obtain the dual problems of (2) and (3):
$$
\begin{aligned}
\min_{\alpha}\ & \frac{1}{2}\alpha^T G\left(H^T H + c_3 I\right)^{-1} G^T \alpha - e_2^T \alpha \\
\text{s.t.}\ & 0 \le \alpha \le c_1 e_2,
\end{aligned}
\tag{4}
$$
and
$$
\begin{aligned}
\min_{\beta}\ & \frac{1}{2}\beta^T H\left(G^T G + c_4 I\right)^{-1} H^T \beta - e_1^T \beta \\
\text{s.t.}\ & 0 \le \beta \le c_2 e_1,
\end{aligned}
\tag{5}
$$
where $c_1, c_2, c_3, c_4 > 0$ are regularization parameters, $e_1 \in \mathbb{R}^{m_1}$ and $e_2 \in \mathbb{R}^{m_2}$ are vectors of ones, and $\xi_1$ and $\xi_2$ are slack vectors. The superscript $T$ denotes transposition, and the matrices are $G = [B \ e_2]$ and $H = [A \ e_1]$. The vectors $\alpha \in \mathbb{R}^{m_2}$ and $\beta \in \mathbb{R}^{m_1}$ are the Lagrange multipliers. By solving (4) and (5), the two nonparallel hyperplanes are obtained:
$$
\begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} = -\left(H^T H + c_3 I\right)^{-1} G^T \alpha \quad \text{and} \quad \begin{bmatrix} \omega_2 \\ b_2 \end{bmatrix} = \left(G^T G + c_4 I\right)^{-1} H^T \beta.
$$
A new data point $x \in \mathbb{R}^n$ is then assigned to the positive or negative class, depending on which of the two hyperplanes in (1) it lies closest to, i.e.,
$$
f(x) = \arg\min_{k = 1, 2} \frac{|\omega_k^T x + b_k|}{\|\omega_k\|},
$$
where $|\cdot|$ is the absolute value, $\|\cdot\|_p$ denotes the $L_p$-norm for $p > 0$, and $\|\cdot\|_2$ is written as $\|\cdot\|$ for brevity.
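The decision rule above can be sketched in a few lines. This is only an illustration: the function name `assign_class`, the example hyperplanes, and the tie-breaking choice (ties go to the positive class) are assumptions that do not come from the paper.

```python
import math

def assign_class(x, planes):
    """Return +1 or -1 depending on which hyperplane (w, b) is closest to x.

    planes is a pair ((w1, b1), (w2, b2)); the first plane is taken to belong
    to the positive class and the second to the negative class.
    """
    distances = []
    for w, b in planes:
        norm_w = math.sqrt(sum(wi * wi for wi in w))
        value = sum(wi * xi for wi, xi in zip(w, x)) + b
        distances.append(abs(value) / norm_w)
    # Assumption: a tie is resolved in favor of the positive class.
    return 1 if distances[0] <= distances[1] else -1

# Hypothetical hyperplanes in R^2: x1 = 0 (positive class) and x1 = 4 (negative class).
planes = (([1.0, 0.0], 0.0), ([1.0, 0.0], -4.0))
label_near_first = assign_class([0.5, 2.0], planes)
label_near_second = assign_class([3.5, -1.0], planes)
```

The first test point is at distance 0.5 from the first plane and 3.5 from the second, so it receives the positive label; the second point is the mirror case.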

2.2. Manifold Regularization

In this subsection, we briefly review graph-based semi-supervised learning (SSL). Manifold regularization (MR) is one of the graph-based SSL methods, whose learning strategy is to mine the geometric distribution information of the data and represent it in the form of regularization terms. In [30], the authors point out that data distributions on manifolds are often complex and may exhibit nonlinear structures, and traditional methods may not be able to effectively capture their intrinsic structures and characteristics. Based on this, the authors propose a regularization method based on the Laplacian graph. On the basis of ensuring smoothness, the method maintains the Euclidean distance relationship of the original data sample as far as possible, enabling it to better reflect the distribution of data in the manifold space.
Consider a binary semi-supervised classification problem in the $n$-dimensional real space $\mathbb{R}^n$. The training data are $T = \{(x_1, y_1), \dots, (x_l, y_l), x_{l+1}, \dots, x_{l+u}\}$, consisting of $l$ labeled and $u$ unlabeled samples. The set $X_l = \{x_i\}_{i=1}^{l} \in \mathbb{R}^{l \times n}$ contains the labeled data with corresponding labels $Y_l = \{y_i\}_{i=1}^{l} \in \{1, -1\}$, and the set $X_u = \{x_i\}_{i=l+1}^{l+u} \in \mathbb{R}^{u \times n}$ contains the unlabeled data with labels $Y_u = 0$, where $X = X_l \cup X_u$ denotes the whole dataset. We model $X$ as a graph $G$ with adjacency matrix $W$, whose entry
$$
w_{ij} := \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right), & x_i \in N_k(x_j) \ \text{or} \ x_j \in N_k(x_i), \\ 0, & \text{otherwise}, \end{cases}
$$
denotes the similarity between examples $x_i$ and $x_j$, where $N_k(x_j)$ is the set of $k$ nearest neighbors of $x_j$. Based on the adjacency matrix $W$, the Laplacian matrix $L$ of the graph $G$ is computed as $L = D - W$, where $D = \mathrm{diag}\left(\sum_{j=1}^{l+u} W_{1j}, \sum_{j=1}^{l+u} W_{2j}, \dots, \sum_{j=1}^{l+u} W_{(l+u)j}\right)$.
In a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, the manifold regularization problem can be written as follows:
$$
f^* = \arg\min_{f \in \mathcal{H}} R_{emp}(f) + \gamma_H \|f\|_H^2 + \gamma_M \|f\|_M^2,
$$
where $R_{emp}(f)$ denotes the empirical risk on the labeled data, measured by a loss function, and $\gamma_H$ and $\gamma_M$ are non-negative regularization parameters. $\|f\|_H^2$ is the regularization term that prevents overfitting, and $\|f\|_M^2$ is the smoothness term, which can be expressed as:
$$
\|f\|_M^2 = \frac{1}{(l+u)^2} \sum_{i,j=1}^{l+u} w_{ij}\left(f(x_i) - f(x_j)\right)^2 = f^T L f. \tag{6}
$$
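As a rough illustration of the graph construction and the smoothness term, the sketch below builds a symmetric $k$-nearest-neighbor adjacency matrix with Gaussian weights, forms $L = D - W$, and checks numerically that the pairwise sum $\sum_{i,j} w_{ij}(f_i - f_j)^2$ equals $2 f^T L f$ for a symmetric $W$. Constant factors such as this 2 or the $1/(l+u)^2$ normalization only rescale the regularization weight. The helper names, the toy points, and the choices $k = 2$ and $\sigma = 1$ are assumptions made for the example.

```python
import math

def knn_gaussian_adjacency(points, k, sigma):
    """Symmetric k-NN adjacency: w_ij > 0 if j is in N_k(i) or i is in N_k(j)."""
    n = len(points)

    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    neighbors = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sqdist(points[i], points[j]))
        neighbors.append(set(order[:k]))
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and (j in neighbors[i] or i in neighbors[j]):
                W[i][j] = math.exp(-sqdist(points[i], points[j]) / (2.0 * sigma ** 2))
    return W

def laplacian(W):
    """Graph Laplacian L = D - W, with D the diagonal matrix of row sums of W."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def quadratic_form(L, f):
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))

def pairwise_smoothness(W, f):
    n = len(f)
    return sum(W[i][j] * (f[i] - f[j]) ** 2 for i in range(n) for j in range(n))

# Toy data: two small clusters and a function f that is nearly constant on each.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [3.0, 3.0], [3.1, 2.9]]
f = [1.0, 0.9, 1.1, -1.0, -0.8]
W = knn_gaussian_adjacency(points, k=2, sigma=1.0)
L = laplacian(W)
lhs = pairwise_smoothness(W, f)
rhs = 2.0 * quadratic_form(L, f)
```

A function that varies little between strongly connected points gives a small value of `lhs`, which is what the smoothness term rewards.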

3. Main Contributions

In this section, we begin by outlining the key motivation behind our proposed model. We then present the model formulation and describe its components in detail. Finally, we provide a convergence analysis of the proposed model in Section 3.3.

3.1. Generalized Adaptive Robust Loss Function

To improve the robustness, classification performance, and generalization ability of the TBSVM framework, we propose a new robust loss function called the generalized adaptive robust loss function $V_\theta(x)$. The $V_\theta(x)$ loss function is symmetric, bounded, and non-negative. For any $x \in \mathbb{R}^n$, $V_\theta(x)$ is defined as follows:
$$
V_\theta(x) = \frac{c^2}{2}\left[1 - \exp\left(-\frac{x^2}{2c^2}\right)\right]^{\theta}, \tag{7}
$$
where $\theta > 0$ is the power parameter, and $c$ is a trade-off parameter that controls the penalty on outliers.
Remark 1. When $\theta = 1$, the $V_\theta(x)$-loss degenerates into the Welsch loss [36]. That is, the Welsch loss is a special case of the $V_\theta(x)$-loss.
Property 1. $V_\theta(x)$ is bounded, non-negative, symmetric, nonsmooth, and nonconvex. In particular, its value never exceeds the constant $c^2/2$, so arbitrarily large residuals contribute only a bounded loss, which makes the loss function more robust.
In Figure 1, we compare the robustness of different loss functions, namely the $L_2$-loss, $L_1$-loss, Welsch loss, and $V_\theta(x)$-loss ($c = 1$), against outliers. As shown in the figure, the Welsch loss with $\theta$-power (red curves) is the most robust, highlighting its effectiveness in suppressing the impact of noisy outliers on the model performance. In Figure 2, we plot the loss curve of the Welsch loss with $\theta$-power under different values of the parameter $\theta$. We observe that as $\theta$ decreases (from 4 to 2, 1, and $0.5$), the function becomes narrower while remaining symmetric and bounded, further demonstrating its suitability for handling noise and outliers.
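The behavior described above can be reproduced with a direct implementation of the loss for scalar residuals. The function names and the sample residuals below are illustrative assumptions; only the formula itself comes from the definition of $V_\theta(x)$.

```python
import math

def v_theta_loss(x, theta=1.0, c=1.0):
    """V_theta(x) = (c^2 / 2) * [1 - exp(-x^2 / (2 c^2))]^theta for a scalar x."""
    if theta <= 0 or c <= 0:
        raise ValueError("theta and c must be positive")
    return 0.5 * c * c * (1.0 - math.exp(-x * x / (2.0 * c * c))) ** theta

def welsch_loss(x, c=1.0):
    """Welsch loss, i.e. the special case theta = 1."""
    return 0.5 * c * c * (1.0 - math.exp(-x * x / (2.0 * c * c)))

# Evaluate the loss on a few residuals, including one very large outlier.
residuals = [0.0, 0.5, 1.0, 3.0, 100.0]
table = {theta: [v_theta_loss(r, theta) for r in residuals]
         for theta in (0.5, 1.0, 2.0, 4.0)}
```

With $c = 1$, every entry of `table` stays at or below $c^2/2 = 0.5$, even for the residual 100, whereas a squared loss would grow without bound. For a fixed small residual such as 0.5, the loss is larger for smaller $\theta$, which matches the narrowing of the curve.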

3.2. Our Method

In this subsection, we present our model and provide an explanation of it. For the binary classification task, we aim to find a pair of optimal classification hyperplanes to separate the positive and negative samples. Specifically, we consider a pair of constrained optimization problems:
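$$
\begin{aligned}
\min_{\omega_1, b_1, \xi_1}\ & \sum_{i=1}^{m_1} \min\left(\|\omega_1^T x_i + b_1\|_2^p, \varepsilon_1\right) + c_1 \sum_{i=1}^{m_2}\left[1 - \exp\left(-\frac{\xi_{1,i}^2}{2c^2}\right)\right]^{\theta} + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) + c_5 f_1^T L f_1 \\
\text{s.t.}\ & -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2, \quad \xi_1 \ge 0,
\end{aligned}
\tag{8}
$$
and
$$
\begin{aligned}
\min_{\omega_2, b_2, \xi_2}\ & \sum_{i=1}^{m_2} \min\left(\|\omega_2^T x_i + b_2\|_2^p, \varepsilon_3\right) + c_2 \sum_{i=1}^{m_1}\left[1 - \exp\left(-\frac{\xi_{2,i}^2}{2c^2}\right)\right]^{\theta} + \frac{c_4}{2}\left(\|\omega_2\|_2^2 + b_2^2\right) + c_6 f_2^T L f_2 \\
\text{s.t.}\ & (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1, \quad \xi_2 \ge 0,
\end{aligned}
\tag{9}
$$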
where $c_1, c_2, c_3, c_4, c_5$, and $c_6$ are positive regularization parameters, and $c$ is an adjustment parameter that controls the degree of penalty for outliers. As stated in (6),
$$
\|f_1\|_M^2 = \frac{1}{(l+u)^2} \sum_{i,j=1}^{l+u} W_{ij}\left(f_1(x_i) - f_1(x_j)\right)^2 = f_1^T L f_1
$$
and
$$
\|f_2\|_M^2 = \frac{1}{(l+u)^2} \sum_{i,j=1}^{l+u} W_{ij}\left(f_2(x_i) - f_2(x_j)\right)^2 = f_2^T L f_2,
$$
where $L = D - W$ is the graph Laplacian matrix and $D$ is the diagonal matrix associated with $W$, with diagonal elements $D_{ii} = \sum_{j=1}^{l+u} W_{ij}$. The vector $f_1 = [f_1(x_1), \dots, f_1(x_{l+u})]^T$ equals $M\omega_1 + e b_1$, and $f_2 = [f_2(x_1), \dots, f_2(x_{l+u})]^T$ equals $M\omega_2 + e b_2$, where $M \in \mathbb{R}^{(l+u) \times n}$ contains all training data, both labeled and unlabeled, and $e$ is a vector of ones of appropriate dimension. Thus, the primal problems (8) and (9) can be written as:
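$$
\begin{aligned}
\min_{\omega_1, b_1, \xi_1}\ & \sum_{i=1}^{m_1} \min\left(\|\omega_1^T x_i + b_1\|_2^p, \varepsilon_1\right) + c_1 \sum_{i=1}^{m_2}\left[1 - \exp\left(-\frac{\xi_{1,i}^2}{2c^2}\right)\right]^{\theta} + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) + c_5\left(\omega_1^T M^T + e^T b_1\right) L \left(M\omega_1 + e b_1\right) \\
\text{s.t.}\ & -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2, \quad \xi_1 \ge 0,
\end{aligned}
\tag{10}
$$
and
$$
\begin{aligned}
\min_{\omega_2, b_2, \xi_2}\ & \sum_{i=1}^{m_2} \min\left(\|\omega_2^T x_i + b_2\|_2^p, \varepsilon_3\right) + c_2 \sum_{i=1}^{m_1}\left[1 - \exp\left(-\frac{\xi_{2,i}^2}{2c^2}\right)\right]^{\theta} + \frac{c_4}{2}\left(\|\omega_2\|_2^2 + b_2^2\right) + c_6\left(\omega_2^T M^T + e^T b_2\right) L \left(M\omega_2 + e b_2\right) \\
\text{s.t.}\ & (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1, \quad \xi_2 \ge 0.
\end{aligned}
\tag{11}
$$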
Since the two problems are quite similar, we solve one of them and obtain the solution of the other in the same manner. For illustration, we split the objective of (10) into two parts:
$$
\begin{aligned}
P(\omega_1, b_1) &= \sum_{i=1}^{m_1} \min\left(\|\omega_1^T x_i + b_1\|_2^p, \varepsilon_1\right) + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) + c_5\left(\omega_1^T M^T + e^T b_1\right) L \left(M\omega_1 + e b_1\right), \\
R(\omega_1, b_1) &= c_1 \sum_{i=1}^{m_2}\left[1 - \exp\left(-\frac{\xi_{1,i}^2}{2c^2}\right)\right]^{\theta}.
\end{aligned}
\tag{12}
$$
Then, we can rewrite Formula (10) as:
$$
\max_{\omega_1, b_1, \xi_1} M(\omega_1, b_1, \xi_1) = \bar{R}(\omega_1, b_1) - P(\omega_1, b_1), \tag{13}
$$
where $\bar{R}(\omega_1, b_1) = c_1 \sum_{i=1}^{m_2}\left[\exp\left(-\frac{\xi_{1,i}^2}{2c^2}\right)\right]^{\theta}$. We define a convex function
$$
g(v) = -v\log(-v) + v, \quad v < 0. \tag{14}
$$
From the theory of conjugate functions, we obtain:
$$
\exp\left(-\frac{\xi_1^2}{2c^2}\right)^{\theta} = \sup_{v < 0}\left[v\frac{\xi_1^2}{2c^2} - g(v)\right]^{\theta}, \quad v = -\exp\left(-\frac{\xi_1^2}{2c^2}\right)^{\theta}. \tag{15}
$$
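The conjugate representation can be checked numerically in its basic form, i.e. for $\theta = 1$: with $s = \xi^2 / (2c^2)$, the supremum of $v s - g(v)$ over $v < 0$ should equal $\exp(-s)$ and be attained at $v = -\exp(-s)$. The sketch below does this with a plain grid search; the grid resolution and the test values of $\xi$ and $c$ are arbitrary assumptions.

```python
import math

def g(v):
    """Convex function g(v) = -v*log(-v) + v, defined for v < 0."""
    return -v * math.log(-v) + v

def sup_over_grid(s, steps=20000):
    """Approximate sup_{v<0} [v*s - g(v)] on a uniform grid of v in (-1, 0).

    For s >= 0 the maximizer -exp(-s) lies in [-1, 0), so this interval suffices.
    """
    best = -math.inf
    for k in range(1, steps):
        v = -k / steps
        best = max(best, v * s - g(v))
    return best

xi, c = 1.3, 1.0
s = xi * xi / (2.0 * c * c)
closed_form = math.exp(-s)              # left-hand side of the identity (theta = 1)
v_star = -math.exp(-s)                  # claimed maximizer
value_at_v_star = v_star * s - g(v_star)
grid_value = sup_over_grid(s)
```

Because the maximizer has a closed form, the half-quadratic scheme never needs such a search in practice; the auxiliary variable is updated directly from the current slack values.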
Then, we obtain:
$$
\max_{\omega_1, b_1, \xi_1} M(\omega_1, b_1, \xi_1) = c_1 \sum_{i=1}^{m_2}\left(\left[v_i\frac{\xi_{1,i}^2}{2c^2} - g(v_i)\right]\right)^{\theta} - P(\omega_1, b_1). \tag{16}
$$
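Thus, (10) and (11) can be rewritten as:
$$
\begin{aligned}
\min_{\omega_1, b_1, \xi_1}\ & \sum_{i=1}^{m_1} \min\left(\|\omega_1^T x_i + b_1\|_2^p, \varepsilon_1\right) + \frac{c_1}{2c^2}\xi_1^T \Omega_1 \xi_1 + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) + c_5\left(\omega_1^T M^T + e^T b_1\right) L \left(M\omega_1 + e b_1\right) \\
\text{s.t.}\ & -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2, \quad \xi_1 \ge 0,
\end{aligned}
\tag{17}
$$
and
$$
\begin{aligned}
\min_{\omega_2, b_2, \xi_2}\ & \sum_{i=1}^{m_2} \min\left(\|\omega_2^T x_i + b_2\|_2^p, \varepsilon_3\right) + \frac{c_2}{2c^2}\xi_2^T \Omega_2 \xi_2 + \frac{c_4}{2}\left(\|\omega_2\|_2^2 + b_2^2\right) + c_6\left(\omega_2^T M^T + e^T b_2\right) L \left(M\omega_2 + e b_2\right) \\
\text{s.t.}\ & (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1, \quad \xi_2 \ge 0,
\end{aligned}
\tag{18}
$$
where $\Omega_j = \mathrm{diag}(v_{j,i}^{s}, 0)$, $j = 1, 2$. To handle the nonsmooth capped terms in the objective function, we introduce concave duality, as stated in Lemma 1 [37,38].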
Lemma 1. Let $g(\theta): \mathbb{R}^n \to \mathbb{R}$ be a continuous nonconvex function, and suppose $h(\theta): \mathbb{R}^n \to \Xi$ is a map with range $\Xi$. Assume that there exists a concave function $\bar{g}(u)$ defined on $\Xi$ such that $g(\theta) = \bar{g}(h(\theta))$ holds.
Then the nonconvex function $g(\theta)$ can be expressed as:
$$
g(\theta) = \inf_{v \in \mathbb{R}^n}\left[v^T h(\theta) - g^*(v)\right]. \tag{19}
$$
According to concave duality, $g^*(v)$ is the concave dual of $\bar{g}(u)$, given as:
$$
g^*(v) = \inf_{u}\left[v^T u - \bar{g}(u)\right]. \tag{20}
$$
In addition, the infimum on the right-hand side of (19) is attained at:
$$
v^* = \left.\frac{\partial \bar{g}(u)}{\partial u}\right|_{u = h(\theta)}. \tag{21}
$$
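Based on Lemma 1, for any $\theta > 0$ we define the nonconvex function $\bar{g}(\theta): \mathbb{R} \to \mathbb{R}$ as
$$
\bar{g}(\theta) = \min\left(\theta^{\frac{p}{2}}, \varepsilon\right). \tag{22}
$$
Setting $h(\mu) = \mu^2$, we obtain
$$
\min\left(\|\omega^T x_i + b\|_2^p, \varepsilon\right) = \bar{g}(h(\mu)), \quad \mu = \|\omega^T x_i + b\|_2. \tag{23}
$$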
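Based on (23), problems (17) and (18) can be rewritten as:
$$
\begin{aligned}
\min_{\omega_1, b_1, \xi_1}\ & \sum_{i=1}^{m_1} \bar{g}\left(\|\omega_1^T x_i + b_1\|_2^2\right) + \frac{c_1}{2c^2}\xi_1^T \Omega_1 \xi_1 + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) + c_5\left(\omega_1^T M^T + e^T b_1\right) L \left(M\omega_1 + e b_1\right) \\
\text{s.t.}\ & -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2, \quad \xi_1 \ge 0,
\end{aligned}
\tag{24}
$$
and
$$
\begin{aligned}
\min_{\omega_2, b_2, \xi_2}\ & \sum_{i=1}^{m_2} \bar{g}\left(\|\omega_2^T x_i + b_2\|_2^2\right) + \frac{c_2}{2c^2}\xi_2^T \Omega_2 \xi_2 + \frac{c_4}{2}\left(\|\omega_2\|_2^2 + b_2^2\right) + c_6\left(\omega_2^T M^T + e^T b_2\right) L \left(M\omega_2 + e b_2\right) \\
\text{s.t.}\ & (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1, \quad \xi_2 \ge 0.
\end{aligned}
\tag{25}
$$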
Let $\theta_1 = h(\mu_1) = \|\omega_1^T x_i + b_1\|_2^2$. By Formula (19), the first term of (17) can be expressed as:
$$
\min\left(\|\omega_1^T x_i + b_1\|_2^p, \varepsilon_1\right) = \bar{g}\left(\|\omega_1^T x_i + b_1\|_2^2\right) = \inf_{f_{ii} \ge 0}\left(f_{ii} h(\mu_1) - g^*(f_{ii})\right) = \inf_{f_{ii} \ge 0} f_{ii}\theta_1 - g^*(f_{ii}). \tag{26}
$$
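Therefore, the concave dual function of $\bar{g}(\theta_1)$ is given as:
$$
g^*(f_{ii}) = \inf_{\theta_1}\left[f_{ii}\theta_1 - \bar{g}(\theta_1)\right] = \inf_{\theta_1} \begin{cases} f_{ii}\theta_1 - \theta_1^{\frac{p}{2}}, & \theta_1^{\frac{p}{2}} < \varepsilon_1, \\ f_{ii}\theta_1 - \varepsilon_1, & \theta_1^{\frac{p}{2}} \ge \varepsilon_1. \end{cases} \tag{27}
$$
Optimizing over $\theta_1$ in (27), where the first case is stationary at $\theta_1 = \left(\frac{2}{p} f_{ii}\right)^{\frac{2}{p-2}}$ and the second case attains its infimum at $\theta_1 = \varepsilon_1^{\frac{2}{p}}$, gives:
$$
g^*(f_{ii}) = \begin{cases} f_{ii}\left(\dfrac{2}{p} f_{ii}\right)^{\frac{2}{p-2}} - \left(\dfrac{2}{p} f_{ii}\right)^{\frac{p}{p-2}}, & \theta_1^{\frac{p}{2}} < \varepsilon_1, \\ f_{ii}\,\varepsilon_1^{\frac{2}{p}} - \varepsilon_1, & \theta_1^{\frac{p}{2}} \ge \varepsilon_1. \end{cases} \tag{28}
$$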
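Finally, the first term of the objective function (17) can be further written as:
$$
\min\left(\|\omega_1^T x_i + b_1\|_2^p, \varepsilon_1\right) = \inf_{f_{ii} \ge 0} L_i(\omega_1, b_1, f_{ii}, \varepsilon_1), \tag{29}
$$
where
$$
L_i(\omega_1, b_1, f_{ii}, \varepsilon_1) = \begin{cases} f_{ii}\theta_1 - f_{ii}\left(\dfrac{2}{p} f_{ii}\right)^{\frac{2}{p-2}} + \left(\dfrac{2}{p} f_{ii}\right)^{\frac{p}{p-2}}, & \theta_1^{\frac{p}{2}} < \varepsilon_1, \\ f_{ii}\theta_1 - f_{ii}\,\varepsilon_1^{\frac{2}{p}} + \varepsilon_1, & \theta_1^{\frac{p}{2}} \ge \varepsilon_1. \end{cases}
$$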
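Therefore, Formula (17) can be rewritten as:
$$
\begin{aligned}
&\min_{\omega_1, b_1} \sum_{i=1}^{m_1} \min\left(\|\omega_1^T x_i + b_1\|_2^p, \varepsilon_1\right) + \frac{c_1}{2c^2}\xi_1^T \Omega_1 \xi_1 + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) + c_5\left(\omega_1^T M^T + e^T b_1\right) L \left(M\omega_1 + e b_1\right) \\
=\ &\min_{\omega_1, b_1} \sum_{i=1}^{m_1} \inf_{f_{ii} \ge 0} L_i(\omega_1, b_1, f_{ii}, \varepsilon_1) + \frac{c_1}{2c^2}\xi_1^T \Omega_1 \xi_1 + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) + c_5\left(\omega_1^T M^T + e^T b_1\right) L \left(M\omega_1 + e b_1\right) \\
=\ &\min_{\omega_1, b_1, f_{ii} \ge 0} \sum_{i=1}^{m_1} L_i(\omega_1, b_1, f_{ii}, \varepsilon_1) + \frac{c_1}{2c^2}\xi_1^T \Omega_1 \xi_1 + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) + c_5\left(\omega_1^T M^T + e^T b_1\right) L \left(M\omega_1 + e b_1\right).
\end{aligned}
\tag{30}
$$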
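Similarly, Formula (18) can be rewritten as:
$$
\begin{aligned}
&\min_{\omega_2, b_2} \sum_{i=1}^{m_2} \min\left(\|\omega_2^T x_i + b_2\|_2^p, \varepsilon_3\right) + \frac{c_2}{2c^2}\xi_2^T \Omega_2 \xi_2 + \frac{c_4}{2}\left(\|\omega_2\|_2^2 + b_2^2\right) + c_6\left(\omega_2^T M^T + e^T b_2\right) L \left(M\omega_2 + e b_2\right) \\
=\ &\min_{\omega_2, b_2} \sum_{i=1}^{m_2} \inf_{d_{ii} \ge 0} L_i(\omega_2, b_2, d_{ii}, \varepsilon_3) + \frac{c_2}{2c^2}\xi_2^T \Omega_2 \xi_2 + \frac{c_4}{2}\left(\|\omega_2\|_2^2 + b_2^2\right) + c_6\left(\omega_2^T M^T + e^T b_2\right) L \left(M\omega_2 + e b_2\right) \\
=\ &\min_{\omega_2, b_2, d_{ii} \ge 0} \sum_{i=1}^{m_2} L_i(\omega_2, b_2, d_{ii}, \varepsilon_3) + \frac{c_2}{2c^2}\xi_2^T \Omega_2 \xi_2 + \frac{c_4}{2}\left(\|\omega_2\|_2^2 + b_2^2\right) + c_6\left(\omega_2^T M^T + e^T b_2\right) L \left(M\omega_2 + e b_2\right).
\end{aligned}
\tag{31}
$$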
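The objective functions (30) and (31) are solved by an alternating optimization algorithm that learns the optimal classifiers. We compute the gradient of the function $\bar{g}(\theta)$ with respect to $\theta$:
$$
\frac{\partial \bar{g}(\theta)}{\partial \theta} = \begin{cases} \dfrac{p}{2}\theta^{\frac{p}{2} - 1}, & 0 < \theta < \varepsilon^{\frac{2}{p}}, \\ 0, & \theta > \varepsilon^{\frac{2}{p}}. \end{cases} \tag{32}
$$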
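With $\theta_1 = h(\mu_1) = \|\omega_1^T x_i + b_1\|_2^2$ and $\omega_1$ and $b_1$ fixed, we obtain:
$$
f_{ii} = \left.\frac{\partial \bar{g}(\theta_1)}{\partial \theta_1}\right|_{\theta_1 = \|\omega_1^T x_i + b_1\|_2^2} = \begin{cases} \dfrac{p}{2}\|\omega_1^T x_i + b_1\|_2^{p-2}, & 0 < \|\omega_1^T x_i + b_1\|_2^p < \varepsilon_1, \\ 0, & \text{else}. \end{cases} \tag{33}
$$
Similarly, with $\theta_2 = h(\mu_2) = \|\omega_2^T x_i + b_2\|_2^2$ and $\omega_2$ and $b_2$ fixed, we obtain:
$$
d_{ii} = \left.\frac{\partial \bar{g}(\theta_2)}{\partial \theta_2}\right|_{\theta_2 = \|\omega_2^T x_i + b_2\|_2^2} = \begin{cases} \dfrac{p}{2}\|\omega_2^T x_i + b_2\|_2^{p-2}, & 0 < \|\omega_2^T x_i + b_2\|_2^p < \varepsilon_3, \\ 0, & \text{else}. \end{cases} \tag{34}
$$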
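The weight update for the capped distance term is simple enough to write out directly. The sketch below follows the case distinction literally (including returning 0 when the residual is exactly 0); the function names, the residual values, and the choices $p = 1$ and $\varepsilon = 2$ are assumptions made only for this example.

```python
def capped_weight(residual, p, eps):
    """Weight (p/2) * |r|^(p-2) if 0 < |r|^p < eps, and 0 otherwise."""
    r = abs(residual)
    if r == 0.0 or r ** p >= eps:
        return 0.0
    return 0.5 * p * r ** (p - 2.0)

def capped_distance(residual, p, eps):
    """Capped distance min(|r|^p, eps) used in the first term of the objective."""
    return min(abs(residual) ** p, eps)

# Hypothetical signed distances w^T x_i + b of four samples to a hyperplane.
residuals = [0.2, 0.8, 1.5, 6.0]
p, eps = 1.0, 2.0
weights = [capped_weight(r, p, eps) for r in residuals]
capped = [capped_distance(r, p, eps) for r in residuals]
```

The last sample lies beyond the cap, so its weight is zero and it no longer influences the weighted least-squares step, while its contribution to the objective is frozen at $\varepsilon$.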
To see the role of these weights more clearly, consider the distance from sample $x_i$ to the hyperplane. If this distance is so large that $\|\omega_1^T x_i + b_1\|_2^p \ge \varepsilon_1$, then $f_{ii} = 0$ by (33), and the sample $x_i$ is treated as an outlier and discarded. The weight $d_{ii}$ behaves in the same way as $f_{ii}$. When the variables $f_{ii}$ and $d_{ii}$ are fixed in order to solve for the classifier parameters $\omega_1$, $\omega_2$, $b_1$, and $b_2$, the optimization problems (30) and (31) can be written as:
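$$
\min_{\omega_1, b_1} \sum_{i=1}^{m_1} f_{ii}\left(\|\omega_1^T x_i + b_1\|_2^2\right) + \frac{c_1}{2c^2}\xi_1^T \Omega_1 \xi_1 + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) + c_5\left(\omega_1^T M^T + e^T b_1\right) L \left(M\omega_1 + e b_1\right) \tag{35}
$$
and
$$
\min_{\omega_2, b_2} \sum_{i=1}^{m_2} d_{ii}\left(\|\omega_2^T x_i + b_2\|_2^2\right) + \frac{c_2}{2c^2}\xi_2^T \Omega_2 \xi_2 + \frac{c_4}{2}\left(\|\omega_2\|_2^2 + b_2^2\right) + c_6\left(\omega_2^T M^T + e^T b_2\right) L \left(M\omega_2 + e b_2\right). \tag{36}
$$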
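Let $F = \mathrm{diag}(f_{11}, \dots, f_{m_1 m_1})$ be an $m_1 \times m_1$ diagonal matrix, and let $D = \mathrm{diag}(d_{11}, \dots, d_{m_2 m_2})$ be an $m_2 \times m_2$ diagonal matrix. The optimization problems (35) and (36) can be rewritten as:
$$
\begin{aligned}
\min_{\omega_1, b_1, \xi_1}\ & (A\omega_1 + e_1 b_1)^T F (A\omega_1 + e_1 b_1) + \frac{c_1}{2c^2}\xi_1^T \Omega_1 \xi_1 + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) + c_5\left(\omega_1^T M^T + e^T b_1\right) L \left(M\omega_1 + e b_1\right) \\
\text{s.t.}\ & -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2,
\end{aligned}
\tag{37}
$$
and
$$
\begin{aligned}
\min_{\omega_2, b_2, \xi_2}\ & (B\omega_2 + e_2 b_2)^T D (B\omega_2 + e_2 b_2) + \frac{c_2}{2c^2}\xi_2^T \Omega_2 \xi_2 + \frac{c_4}{2}\left(\|\omega_2\|_2^2 + b_2^2\right) + c_6\left(\omega_2^T M^T + e^T b_2\right) L \left(M\omega_2 + e b_2\right) \\
\text{s.t.}\ & (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1.
\end{aligned}
\tag{38}
$$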
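The Lagrange function of the optimization problem (37) can be written as:
$$
\begin{aligned}
\mathcal{L}(\omega_1, b_1, \xi_1, \alpha) ={}& \frac{1}{2}(A\omega_1 + e_1 b_1)^T F (A\omega_1 + e_1 b_1) + \frac{c_1}{2c^2}\xi_1^T \Omega_1 \xi_1 + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) \\
& + c_5\left(\omega_1^T M^T + e^T b_1\right) L \left(M\omega_1 + e b_1\right) - \alpha^T\left(-(B\omega_1 + e_2 b_1) + \xi_1 - e_2\right),
\end{aligned}
\tag{39}
$$
where $\alpha$ is the vector of Lagrange multipliers. Differentiating the Lagrange function with respect to $\omega_1$, $b_1$, and $\xi_1$ yields the following Karush–Kuhn–Tucker (KKT) conditions:
$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \omega_1} &= A^T F (A\omega_1 + e_1 b_1) + c_3\omega_1 + c_5 M^T L (M\omega_1 + e b_1) + B^T \alpha = 0, \\
\frac{\partial \mathcal{L}}{\partial b_1} &= e_1^T F (A\omega_1 + e_1 b_1) + c_3 b_1 + c_5 e^T L (M\omega_1 + e b_1) + e_2^T \alpha = 0, \\
\frac{\partial \mathcal{L}}{\partial \xi_1} &= c_1 \Omega_1 \xi_1 - \alpha = 0, \\
& \alpha^T\left(-(B\omega_1 + e_2 b_1) + \xi_1 - e_2\right) = 0, \\
& \alpha \ge 0.
\end{aligned}
\tag{40}
$$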
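Let
$$
H = [A \ \ e_1], \quad E = [B \ \ e_2], \quad Z = [M \ \ e] \quad \text{and} \quad \bar{\theta}_1 = \begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix}. \tag{41}
$$
Thus, we have
$$
\begin{bmatrix} A^T \\ e_1^T \end{bmatrix} F \begin{bmatrix} A & e_1 \end{bmatrix} \begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} + c_3 \begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} + c_5 \begin{bmatrix} M^T \\ e^T \end{bmatrix} L \begin{bmatrix} M & e \end{bmatrix} \begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} + \begin{bmatrix} B^T \\ e_2^T \end{bmatrix} \alpha = 0. \tag{42}
$$
Further, we get
$$
\left(H^T F H + c_3 I + c_5 Z^T L Z\right)\bar{\theta}_1 + E^T \alpha = 0, \tag{43}
$$
where $I$ is an identity matrix of appropriate dimensions. Since $F$ and $L$ are positive semidefinite and $c_3 > 0$, the matrix $H^T F H + c_3 I + c_5 Z^T L Z$ is positive definite. Therefore, we have
$$
\bar{\theta}_1 = [\omega_1^T, b_1]^T = -\left(H^T F H + c_3 I + c_5 Z^T L Z\right)^{-1} E^T \alpha. \tag{44}
$$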
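Furthermore, we obtain the dual problem of (8) as follows:
$$
\begin{aligned}
\min_{\alpha}\ & \frac{1}{2}\alpha^T\left(E\left(H^T F H + c_3 I + c_5 Z^T L Z\right)^{-1} E^T + c_1 \Omega_1^{-1}\right)\alpha - e_2^T \alpha \\
\text{s.t.}\ & 0 \le \alpha \le c_1 e_2.
\end{aligned}
\tag{45}
$$
Similarly, the dual problem of (9) can be written as:
$$
\begin{aligned}
\min_{\beta}\ & \frac{1}{2}\beta^T\left(H\left(E^T D E + c_4 I + c_6 Z^T L Z\right)^{-1} H^T + c_2 \Omega_2^{-1}\right)\beta - e_1^T \beta \\
\text{s.t.}\ & 0 \le \beta \le c_2 e_1,
\end{aligned}
\tag{46}
$$
where $\beta$ is the vector of Lagrange multipliers, and the augmented vector is
$$
\bar{\theta}_2 = [\omega_2^T, b_2]^T = \left(E^T D E + c_4 I + c_6 Z^T L Z\right)^{-1} H^T \beta. \tag{47}
$$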
Once the vectors $\bar{\theta}_1$ and $\bar{\theta}_2$ are obtained, a new data point $x \in \mathbb{R}^n$ is assigned to the positive or negative class, depending on which of the two hyperplanes it lies closest to, i.e.,
$$
f(x) = \arg\min_{k = 1, 2} \frac{|\omega_k^T x + b_k|}{\|\omega_k\|},
$$
where $|\cdot|$ is the absolute value, $\|\cdot\|_p$ denotes the $L_p$-norm for $p > 0$, and $\|\cdot\|_2$ is written as $\|\cdot\|$ for brevity.
Based on the above discussion, our method is summarized in Algorithm 1.
Algorithm 1 Solving WMRTBSVM
  • Input: Data matrices $A \in \mathbb{R}^{m_1 \times n}$ and $B \in \mathbb{R}^{m_2 \times n}$; parameters $c_i$ ($i = 1, 2, 3, 4, 5, 6$); cut-off levels $\varepsilon_i$ ($i = 1, 2, 3, 4$).
  • Output: $\theta_1^*$ and $\theta_2^*$, the optimal values of $\theta_1$ and $\theta_2$.
  • Process:
  • 1. Initialize $F \in \mathbb{R}^{m_1 \times m_1}$, $\Omega_1 \in \mathbb{R}^{m_1 \times m_1}$, $D \in \mathbb{R}^{m_2 \times m_2}$ and $\Omega_2 \in \mathbb{R}^{m_2 \times m_2}$.
  • 2. Compute $\alpha$ and $\beta$ by solving the dual problems (45) and (46).
  • 3. Compute $\theta_1$ and $\theta_2$ by
               $\theta_1 = -(H^T F H + c_3 I + c_5 Z^T L Z)^{-1} E^T \alpha$
              and
               $\theta_2 = (E^T D E + c_4 I + c_6 Z^T L Z)^{-1} H^T \beta$.
  • 4. Update the matrices F, D, $\Omega_1$ and $\Omega_2$ by (24), (25), (33) and (34).
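To make the reweighting in step 4 more concrete, the sketch below shows one plausible form of the diagonal update for F. The update formulas (24) and (33) are not reproduced in this part of the paper, so the expression used here, $f_{ii} = \frac{p}{2}|r_i|^{p-2}$ when $|r_i|^p < \varepsilon$ and $0$ otherwise, is an assumption inferred from the stationarity condition in the proof of Theorem A2. The function name `capped_weights` and the small safeguard `delta` are hypothetical.

```python
def capped_weights(residuals, p, eps, delta=1e-8):
    """Diagonal reweighting entries for a capped L2,p-norm distance.

    Assumed form (not quoted from the paper):
        f_ii = (p / 2) * |r_i|**(p - 2)   if |r_i|**p < eps
        f_ii = 0                          otherwise (sample is capped)
    `delta` only guards against division by zero when r_i = 0 and p < 2.
    """
    weights = []
    for r in residuals:
        a = abs(r)
        if a ** p < eps:
            weights.append(0.5 * p * max(a, delta) ** (p - 2))
        else:
            # Residual exceeds the cut-off level, so the sample no longer
            # influences the next weighted least-squares step.
            weights.append(0.0)
    return weights


# One small residual and one outlier, with p = 1 and cut-off level 2
example_weights = capped_weights([0.5, 3.0], p=1.0, eps=2.0)
```

Under this assumed form, a sample whose residual exceeds the cut-off level receives zero weight, which is how the capped distance limits the influence of outliers.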
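To improve the computational efficiency of WMTBSVM, we further propose its least squares version:
$$
\min_{\omega_1, b_1} \; \sum_{i=1}^{m_1} \min\left(\|\omega_1^T x_i + b_1\|_2^p, \varepsilon_1\right) + c_1 \sum_{i=1}^{m_2} \left[1 - \exp\left(-\frac{\xi_{1,i}^2}{2c^2}\right)\right]^{\theta} + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) + c_5 f_1^T L f_1 \quad \text{s.t.} \quad -(B\omega_1 + e_2 b_1) + \xi_1 = e_2, \; \xi_1 \ge 0,
$$
and
$$
\min_{\omega_2, b_2} \; \sum_{i=1}^{m_2} \min\left(\|\omega_2^T x_i + b_2\|_2^p, \varepsilon_3\right) + c_2 \sum_{i=1}^{m_1} \left[1 - \exp\left(-\frac{\xi_{2,i}^2}{2c^2}\right)\right]^{\theta} + \frac{c_4}{2}\left(\|\omega_2\|_2^2 + b_2^2\right) + c_6 f_2^T L f_2 \quad \text{s.t.} \quad (A\omega_2 + e_1 b_2) + \xi_2 = e_1, \; \xi_2 \ge 0.
$$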
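The two robust ingredients of these objectives, the capped distance term and the Welsch loss raised to the power $\theta$, can be evaluated per sample as in the following sketch. The function names are hypothetical, and the loss is written as $[1 - \exp(-\xi^2 / (2c^2))]^{\theta}$, which is how the garbled source formula is read here.

```python
import math


def capped_distance(residual, p, eps):
    """Capped p-th power distance: min(|r|**p, eps)."""
    return min(abs(residual) ** p, eps)


def welsch_theta_loss(xi, c, theta):
    """Welsch loss with theta-power: (1 - exp(-xi**2 / (2 c**2)))**theta."""
    return (1.0 - math.exp(-(xi * xi) / (2.0 * c * c))) ** theta


# An outlier with residual 10 contributes at most eps to the distance term,
# and at most 1 to the loss term, no matter how far away it lies.
outlier_distance = capped_distance(10.0, p=2, eps=1.0)
outlier_loss = welsch_theta_loss(1000.0, c=1.0, theta=0.5)
```

Both terms are bounded, which is the property the text relies on when it argues that distant outliers cannot dominate the objective.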
Like (37) and (38) in WMTBSVM, (48) and (49) can be rewritten as follows:
$$
\min_{\omega_1, b_1} \; (A\omega_1 + e_1 b_1)^T F (A\omega_1 + e_1 b_1) + \frac{c_1}{2c^2} \xi_1^T \Omega_1 \xi_1 + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) + c_5 (M\omega_1 + e b_1)^T L (M\omega_1 + e b_1) \quad \text{s.t.} \quad -(B\omega_1 + e_2 b_1) + \xi_1 = e_2,
$$
and
$$
\min_{\omega_2, b_2} \; (B\omega_2 + e_2 b_2)^T D (B\omega_2 + e_2 b_2) + \frac{c_2}{2c^2} \xi_2^T \Omega_2 \xi_2 + \frac{c_4}{2}\left(\|\omega_2\|_2^2 + b_2^2\right) + c_6 (M\omega_2 + e b_2)^T L (M\omega_2 + e b_2) \quad \text{s.t.} \quad (A\omega_2 + e_1 b_2) + \xi_2 = e_1.
$$
Substituting the equality constraints into the objective functions gives
$$
\min_{\omega_1, b_1} \; (A\omega_1 + e_1 b_1)^T F (A\omega_1 + e_1 b_1) + \frac{c_1}{2c^2} (e_2 + B\omega_1 + e_2 b_1)^T \Omega_1 (e_2 + B\omega_1 + e_2 b_1) + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) + c_5 (M\omega_1 + e b_1)^T L (M\omega_1 + e b_1)
$$
and
$$
\min_{\omega_2, b_2} \; (B\omega_2 + e_2 b_2)^T D (B\omega_2 + e_2 b_2) + \frac{c_2}{2c^2} (e_1 - A\omega_2 - e_1 b_2)^T \Omega_2 (e_1 - A\omega_2 - e_1 b_2) + \frac{c_4}{2}\left(\|\omega_2\|_2^2 + b_2^2\right) + c_6 (M\omega_2 + e b_2)^T L (M\omega_2 + e b_2).
$$
The solution of (52) can be expressed as:
$$
\bar{\theta}_1 = -\left(\frac{2c^2}{c_1} H^T F H + E^T \Omega_1 E + \frac{c_3}{c_1} I + c_5 Z^T L Z\right)^{-1} E^T \Omega_1 e_2, \quad \bar{\theta}_2 = \left(\frac{2c^2}{c_2} E^T D E + H^T \Omega_2 H + \frac{c_4}{c_2} I + c_6 Z^T L Z\right)^{-1} H^T \Omega_2 e_1,
$$
where H, F, Z, E, D and $\bar{\theta}$ are the same as those of WMTBSVM.
Once the vectors $\bar{\theta}_1$ and $\bar{\theta}_2$ are obtained, a new data point $x \in \mathbb{R}^n$ is assigned to the positive or negative class, depending on which of the two hyperplanes it lies closest to, i.e.,
$$
f(x) = \arg\min_{k=1,2} \frac{|x^T \omega_k + b_k|}{\|\omega_k\|},
$$
where $|\cdot|$ is the absolute value operation and $\|\cdot\|_p$ denotes the $L_p$-norm for $p > 0$; when $p = 2$, $\|\cdot\|_2$ is written as $\|\cdot\|$ for brevity. Based on the above discussion, the procedure is summarized in Algorithm 2.
Algorithm 2 Solving WMLSRTBSVM
  • Input: Data matrices $A \in \mathbb{R}^{m_1 \times n}$ and $B \in \mathbb{R}^{m_2 \times n}$; parameters $c_i$ ($i = 1, 2, 3, 4, 5, 6$); cut-off levels $\varepsilon_i$ ($i = 1, 2, 3, 4$).
  • Output: $\theta_1^*$ and $\theta_2^*$, the optimal values of $\theta_1$ and $\theta_2$.
  • Process:
  • 1. Initialize $F \in \mathbb{R}^{m_1 \times m_1}$, $\Omega_1 \in \mathbb{R}^{m_1 \times m_1}$, $D \in \mathbb{R}^{m_2 \times m_2}$ and $\Omega_2 \in \mathbb{R}^{m_2 \times m_2}$.
  • 2. Using the KKT conditions, obtain $\alpha$ and $\beta$ from (52a) and (52b).
  • 3. Compute $\theta_1$ and $\theta_2$ by
          $\theta_1 = -\left(\frac{2c^2}{c_1} H^T F H + E^T \Omega_1 E + \frac{c_3}{c_1} I + c_5 Z^T L Z\right)^{-1} E^T \Omega_1 e_2$,
         and
          $\theta_2 = \left(\frac{2c^2}{c_2} E^T D E + H^T \Omega_2 H + \frac{c_4}{c_2} I + c_6 Z^T L Z\right)^{-1} H^T \Omega_2 e_1$.
  • 4. Update the matrices F, D, $\Omega_1$ and $\Omega_2$ by (24), (25), (33) and (34).

3.3. Convergence Analysis

In this subsection, we prove the convergence of the proposed algorithms (see Appendix A).

3.4. Complexity Analysis

In this section, we briefly analyze the computational complexity of the proposed Algorithms 1 and 2, which is mainly determined by matrix multiplication and matrix inversion. Assume that the dataset contains $m$ samples of dimension $n$, of which $m_1$ are positive and $m_2$ are negative, so that $A \in \mathbb{R}^{m_1 \times n}$ and $B \in \mathbb{R}^{m_2 \times n}$.
In (44) and (47), $\bar{\theta}_1 = [\omega_1^T, b_1]^T = -(H^T F H + c_3 I + c_5 Z^T L Z)^{-1} E^T \alpha$ and $\bar{\theta}_2 = [\omega_2^T, b_2]^T = (E^T D E + c_4 I + c_6 Z^T L Z)^{-1} H^T \beta$. The cost of the matrix multiplications is $O(mn^2)$ in both cases, while the cost of the matrix inversion is $O(n^3)$. Therefore, the total computational cost of Algorithm 1 is bounded by $O(2T(mn^2 + n^3))$, where T is the number of iterations, which is usually less than 10 for algorithms similar to ours. In addition, in our experiments the number of samples m is generally much larger than the sample dimension n, so the total computational cost of Algorithm 1 is $O(2Tmn^2)$.
In (53), the costs of the matrix multiplications are $O(m_1 n^2)$ and $O(m_2 n^2)$, respectively, and the cost of the matrix inversion is $O(n^3)$. Therefore, the total computational cost of Algorithm 2 is bounded by $O(mn^2 + n^3)$, where $m > n$. Consequently, the total computational cost of this algorithm is $O(mn^2)$.
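As a rough illustration of these bounds, the sketch below simply evaluates the leading-order operation counts stated above, ignoring constant factors hidden in the O-notation. The function names and the example sizes are hypothetical and are not taken from the experiments.

```python
def algorithm1_cost(m, n, iterations):
    """Leading-order operation count 2T(m n^2 + n^3) quoted for Algorithm 1."""
    return 2 * iterations * (m * n ** 2 + n ** 3)


def algorithm2_cost(m, n):
    """Leading-order operation count m n^2 + n^3 quoted for Algorithm 2."""
    return m * n ** 2 + n ** 3


# With m = 1000 samples and n = 10 features, the m n^2 term dominates n^3
# by a factor of m / n = 100, which is why the n^3 term is dropped when m >> n.
cost_alg1 = algorithm1_cost(1000, 10, iterations=10)
cost_alg2 = algorithm2_cost(1000, 10)
```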

4. Experimental Results and Analysis

In this section, we test the performance of our proposed model. For a fair comparison, we implemented six classification algorithms in MATLAB R2021a. The experimental environment consisted of a Windows 11 machine (CPU: Intel Core i5; RAM: 16.00 GB; OS: 64-bit Windows 11).

4.1. Experimental Setting

To evaluate the validity and reliability of our proposed model, we compared WMRTBSVM and WMLSRTBSVM with related methods, namely the twin support vector machine (TSVM), the twin bounded support vector machine (TBSVM), the least squares twin support vector machine (LSTSVM), and the capped $L_1$-norm twin support vector machine (CTSVM). The conventional accuracy (ACC) was used to measure the classification performance of all algorithms, which is defined as follows:
$$
ACC = \frac{TP + TN}{TP + FN + TN + FP},
$$
where TP and TN denote the numbers of true positives and true negatives, respectively, and FP and FN denote the numbers of false positives and false negatives, respectively. The higher the ACC value, the better the classification performance of the model.
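The accuracy measure can be computed directly from the predicted and true labels, as in the following minimal sketch. It assumes labels are encoded as +1 and -1; the function name `accuracy` is chosen here for illustration.

```python
def accuracy(y_true, y_pred):
    """ACC = (TP + TN) / (TP + FN + TN + FP) for labels in {+1, -1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    return (tp + tn) / (tp + fn + tn + fp)


# Two of the four predictions are correct, so ACC = 0.5
acc = accuracy([1, 1, -1, -1], [1, -1, -1, 1])
```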
In the experiments, data preprocessing was carried out first. We divided each dataset into a training set and a test set, and all sample data were normalized to reduce the difference in features among different samples. To reduce the effect of randomness on the test results, the experimental parameters were selected by 10-fold cross-validation, each dataset was tested 10 times, and the classification accuracy was averaged over the 10 runs. In order to obtain the best generalization ability, the parameters involved in the experiments were selected as follows:
The $c_i$ ($i = 1, 2, \dots, 6$) are chosen from $\{2^i \mid i = -7, -6, \dots, 6, 7\}$, $\varepsilon_i$ ($i = 1, 2, 3, 4$) is set to $10^{-5}$, and $\sigma$ and $\varepsilon$ are chosen from $\{10^i \mid i = -7, -6, \dots, 6, 7\}$.
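The candidate grids can be generated as in the sketch below. It assumes the exponents run from -7 to 7 (the minus signs are lost in the extracted text), and the helper `candidate_settings` is a hypothetical name for a lazy enumeration of parameter combinations; the paper does not describe how its grid search was implemented.

```python
from itertools import product

# Assumed exponent range: i = -7, -6, ..., 6, 7 (15 values per parameter)
C_GRID = [2.0 ** i for i in range(-7, 8)]
SIGMA_GRID = [10.0 ** i for i in range(-7, 8)]


def candidate_settings(names, grid):
    """Lazily yield every assignment of grid values to the given parameter names."""
    for values in product(grid, repeat=len(names)):
        yield dict(zip(names, values))


# Enumerating only two of the six c_i already gives 15 * 15 = 225 combinations;
# a full grid over all six would have 15**6 settings, so laziness matters.
n_settings = sum(1 for _ in candidate_settings(["c1", "c3"], C_GRID))
```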

4.2. General Experimental Results

In order to verify the classification performance of the proposed method and other related algorithms in a noise-free setting, we ran them on twelve datasets from the UCI Machine Learning Repository. We split each dataset into a training set and a testing set with a sample ratio of 7:3. That is, in each experiment, we randomly selected 70% of the points of each class as the training set and used the rest as the testing set. In addition, we used grid search with 10-fold cross-validation to find the optimal parameters. The process was repeated 10 times. The general experimental results are shown in Table 1, with the best results for each testing set shown in bold. Here, ACC is the average classification accuracy on the testing set, and “time (s)” is the average running time, in seconds, of each algorithm under its optimal parameters.
UCI datasets: Australian, Balance, Backnote, Cancer, German, Hepat, Pima Indian (Pima), QSAR, Spect, Vote, Wisconsin diagnostic breast cancer (WDBC), and Wholesale. See Table 2 for details of the twelve UCI datasets.
As shown in Table 1, we observe that the classification accuracy of WMRTBSVM and WMLSRTBSVM is generally higher than that of other methods. Additionally, the classification accuracy of CTSVM is generally higher than that of TSVM, TBSVM, and LSTBSVM. CTSVM, WMRTBSVM, and WMLSRTBSVM all contain capped norm distances. In general, LSTBSVM and WMLSRTBSVM have shorter running times, but WMLSRTBSVM has higher classification accuracy. Based on this, we can objectively conclude that the use of a capped   L 2 , p -norm distance metric in the TBSVM framework can improve classification performance, and the addition of the Welsch Loss with p-power can further enhance classification performance.

4.3. Convergence Analysis

In Section 3.3, we theoretically proved that the iterative optimization algorithm we designed is convergent. In this section, we report experiments on the Cancer dataset that further verify its convergence. As shown in Figure 3, the value of the objective function decreases with each iteration, and the algorithm reaches its final value in fewer than 10 iterations on the Cancer dataset. This supports the feasibility and effectiveness of our algorithm.

4.4. Robustness Analysis

We conducted experiments on both artificial datasets and UCI datasets in a noisy environment. The dataset includes one synthetic dataset and twelve benchmark datasets from the UCI Machine Learning Repository. Please refer to Figure 4 and Table 2 for details on the artificial and UCI datasets.
Artificial dataset. The dataset consists of 104 two-dimensional points, with 52 samples in each class. These points are generated by perturbing points located on two intersecting planes, where each plane corresponds to a class of data. We used “∘” and “+” to distinguish between the two classes. To test the effect of outliers on classification performance, we added four outliers to the dataset, two of which belong to class $+1$ and two to class $-1$. This is illustrated in Figure 4.
In order to visually evaluate the classification performance and robustness differences between WMRTBSVM, WMLSRTBSVM, and the other four algorithms, we conducted experiments on artificial datasets with four outliers. The experimental results are shown in Figure 5.
From the results depicted in Figure 5, we can see intuitively that WMRTBSVM and WMLSRTBSVM perform better. The accuracies of the six algorithms (TSVM, TBSVM, LSTBSVM, CTSVM, WMRTBSVM, and WMLSRTBSVM) were 62.23%, 65.10%, 71.96%, 77.00%, 80.08%, and 81.54%, respectively. These results indicate that WMRTBSVM and WMLSRTBSVM deal with outliers better than the other methods. The classification performance of CTSVM is also good, which may be because the capped $L_1$-norm distance neutralizes the negative impact of outliers. Likewise, the good classification accuracy of WMRTBSVM and WMLSRTBSVM after introducing outliers may be due to the use of the capped $L_{2,p}$-norm distance. These results demonstrate the robustness of WMRTBSVM and WMLSRTBSVM to outliers.
In addition, we evaluated the robustness of WMRTBSVM and WMLSRTBSVM by introducing Gaussian noise of 10%, 30%, and 50% into the UCI datasets. Table 3, Table 4 and Table 5 show the experimental results on the datasets with 10%, 30%, and 50% Gaussian noise, respectively.
Table 3, Table 4 and Table 5 present the comparison of the 6 algorithms on the 12 UCI datasets with 10%, 30% and 50% Gaussian noise, respectively. The experimental results reveal that the classification accuracy of each algorithm decreases after the introduction of noise. However, in most cases, WMTBSVM and WMLSTBSVM display higher classification accuracy than the other algorithms, particularly when the noise exceeds 30%. Moreover, LSTBSVM and WMLSTBSVM require less running time. Overall, WMTBSVM and WMLSTBSVM are superior to the other four algorithms in terms of accuracy and robustness. This implies that WMTBSVM and WMLSTBSVM are robust learning algorithms that facilitate the classification of noise-contaminated samples.
Based on the results shown in Figure 6, we observe that the accuracy of the six algorithms decreases to varying degrees as noise increases from 0% to 10%, 30%, and 50%. This indicates that the algorithms’ robustness is affected by the number of noise points. However, our proposed models, WMTBSVM (represented by the red curve) and WMLSTBSVM (represented by the blue curve), maintain the highest accuracy. Even when noise points reach 50%, our algorithms still show clear advantages over the others. In the smaller datasets (a: $690 \times 14$, c: $440 \times 7$, and d: $699 \times 9$), the CTSVM (magenta curve), WMTBSVM (red curve), and WMLSTBSVM (blue curve) curves vary relatively smoothly. This may be attributed to the truncation loss used in these algorithms. The three truncation-based algorithms also performed well on the larger datasets (b: $1372 \times 4$, e: $1000 \times 4$, and f: $1055 \times 41$). Overall, however, WMTBSVM and WMLSTBSVM showed the best performance, likely due to their use of the Welsch loss with p-power.

4.5. Statistical Analysis

This section analyzes whether the differences among the six algorithms on the 12 UCI datasets are significant, using the Friedman test [39]. The Friedman test is a simple, safe, and robust non-parametric test whose null hypothesis is that all algorithms have the same performance. If the null hypothesis is rejected, the Nemenyi post-hoc test [39] can be performed. We calculated the average ranking and accuracy of the six algorithms on the 12 datasets, and the results are presented in Table 6.
To begin with, taking the Gaussian kernel datasets with 30% unlabeled samples as an example, we calculate the Friedman statistic by using the following formulation:
$$
\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_j R_j^2 - \frac{k(k+1)^2}{4} \right] = 44.49,
$$
where k is the number of algorithms, N is the number of UCI datasets, and $R_j$ is the average rank of the jth algorithm on the employed datasets. Notice that $k = 6$ and $N = 12$ in our paper. Furthermore, based on the $\chi_F^2$ statistic, which follows a $\chi^2$ distribution with $(k-1)$ degrees of freedom, we have
$$
F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2} = 11.344,
$$
where $F_F$ obeys the F-distribution with $(k-1)$ and $(k-1)(N-1)$ degrees of freedom. In addition, for $\alpha = 0.01$, we obtain $F_{\alpha}(5, 55) = 3.340$. Since the value of $F_F$ is greater than $F_{\alpha}$, we can reject the null hypothesis. From Table 6, we see that the average ranks of WMTBSVM and WMLSTBSVM are much lower than those of the other algorithms, which means that our WMTBSVM and WMLSTBSVM are more effective than the other algorithms.
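The two statistics above are straightforward to compute once the average ranks are known. The sketch below implements the formulas as written; the function names are hypothetical, and the ranks in the usage example are made-up values chosen so the result can be checked by hand, not the ranks from Table 6.

```python
def friedman_chi2(avg_ranks, n_datasets):
    """Friedman statistic: 12N / (k(k+1)) * (sum_j R_j^2 - k(k+1)^2 / 4)."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


def f_statistic(chi2, n_datasets, k):
    """F_F = (N - 1) * chi2 / (N (k - 1) - chi2)."""
    return (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)


# Illustrative input only: three algorithms ranked identically on all 5 datasets,
# which gives the largest possible statistic N (k - 1) = 10.
chi2_perfect = friedman_chi2([1.0, 2.0, 3.0], n_datasets=5)

# If all algorithms share the same average rank, the statistic is zero.
chi2_equal = friedman_chi2([2.0, 2.0, 2.0], n_datasets=5)
```

The resulting $F_F$ value would then be compared with the critical value of the F-distribution with $(k-1)$ and $(k-1)(N-1)$ degrees of freedom, as described in the text.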
Furthermore, we compared the six algorithms in pairs using the Nemenyi post-hoc test. The difference in performance between two algorithms is significant when their average rank difference is larger than the critical value; otherwise, the difference is not significant. By dividing the Studentized range statistic by $\sqrt{2}$, we obtain $q_{\alpha=0.01} = 2.209$. Therefore, we calculate the critical difference (CD) by the following formula:
$$
CD = q_{\alpha=0.01} \sqrt{\frac{k(k+1)}{6N}} = 2.209 \times \sqrt{\frac{6(6+1)}{6 \times 12}} = 1.701.
$$
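A minimal sketch of the critical-difference computation and the resulting pairwise decision is given below. The function names are hypothetical, and the inputs in the usage example are illustrative values picked so that the square root equals one.

```python
import math


def critical_difference(q_alpha, k, n_datasets):
    """Nemenyi critical difference: q_alpha * sqrt(k (k + 1) / (6 N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


def significantly_different(rank_a, rank_b, cd):
    """Two algorithms differ significantly if their average ranks differ by more than CD."""
    return abs(rank_a - rank_b) > cd


# Illustrative input only: k = 2 and N = 1 give sqrt(2 * 3 / 6) = 1, so CD = q_alpha.
cd_example = critical_difference(1.5, k=2, n_datasets=1)
differs = significantly_different(1.0, 3.0, cd_example)
```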
From Figure 7, we see that WMTBSVM and WMLSTBSVM perform significantly better than TSVM, TBSVM, LSTBSVM, and CTSVM. It can further be seen that there is no significant difference between the proposed methods WMTBSVM and WMLSTBSVM, as their rank difference is smaller than the CD value. Therefore, through statistical analysis, we can safely conclude that the proposed methods WMTBSVM and WMLSTBSVM have better performance.

5. Conclusions

In this paper, a generalized adaptive robust loss function $V_{\theta}(x)$ is designed. $V_{\theta}(x)$ has several desirable characteristics, such as symmetry, boundedness, and non-convexity. By setting appropriate parameters, it improves the adaptability and robustness of WMTBSVM, and we achieve better generalization performance. Secondly, we introduce the capped $L_{2,p}$-norm distance measure into WMRTBSVM to improve the generalization performance and robustness of the model, especially when the outliers are far from the normal data distribution. This is done by setting appropriate values of p and of the upper bound parameter. We also add MR into WMTBSVM to improve the discriminability and classification ability of our model. To improve the computational efficiency of WMRTBSVM, we use the least squares method to obtain WMLSRTBSVM. Two effective iterative optimization algorithms are designed, and theoretical support is given for both WMRTBSVM and WMLSRTBSVM. We mainly conducted accuracy experiments on artificial datasets and UCI datasets. The experimental results show that WMRTBSVM and WMLSRTBSVM have better classification performance and robustness. In future work, we hope to apply WMRTBSVM and WMLSRTBSVM to multi-class classification tasks to further study their performance and our theoretical work. We also plan to study how to combine our method with sparse kernel SVM to develop faster algorithms with better performance. In addition, we hope that the generalized adaptive robust loss function $V_{\theta}(x)$ can be combined with other loss functions to further improve the adaptability and robustness of related algorithms. Ultimately, we hope that $V_{\theta}(x)$ can be applied to ensemble learning to deal with unbalanced datasets.

Author Contributions

B.M.: writing—original draft, conceptualization, writing—reviewing and editing, software, data curation. G.Y.: writing—original draft, supervision, validation, project administration, funding acquisition. J.M.: writing—original draft, conceptualization, writing—reviewing and editing, software, data curation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of Ningxia Provincial of China (No. 2022AAC03260, No. 2023AAC02053), in part by the Key Research and Development Program of Ningxia (Introduction of Talents Project) (No. 2022BSB03046), in part by the Fundamental Research Funds for the Central Universities (No. 2021KYQD23, No. 2022XYZSX03), in part by the National Natural Science Foundation of China (No. 11861002).

Institutional Review Board Statement

Not Applicable.

Informed Consent Statement

Not Applicable.

Data Availability Statement

All of the benchmark datasets used in our numerical experiments are from the UCI Machine Learning Repository, and are available at http://archive.ics.uci.edu/ml/ (accessed on 21 March 2023).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Convergence Analysis

Lemma A1.
For any scalar t, when $0 < p \le 2$, the inequality $2|t|^p - p t^2 + p - 2 \le 0$ holds.
Proof. 
Let $f(t) = 2|t|^{p/2} - p t + p - 2$. The first and second derivatives of $f(t)$ are, respectively:
$$
f'(t) = p\left(t^{\frac{p-2}{2}} - 1\right)
$$
and
$$
f''(t) = \frac{p(p-2)}{2}\, t^{\frac{p-4}{2}}.
$$
If $t > 0$ and $0 < p \le 2$, then $f''(t) \le 0$ and $t = 1$ is the only point at which $f'(t) = 0$. Note that $f(1) = 0$; thus, when $t > 0$ and $0 < p \le 2$, we have $f(t) \le 0$. Hence $f(t^2) \le 0$, which indicates that $2|t|^p - p t^2 + p - 2 \le 0$ holds. □
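As a sanity check rather than a proof, the inequality of Lemma A1 can be evaluated numerically on a grid of t and p values. The grid ranges, the function names, and the tolerance used to absorb floating-point rounding are choices made for this sketch.

```python
def lemma_a1_lhs(t, p):
    """Left-hand side 2|t|^p - p t^2 + p - 2 of the inequality in Lemma A1."""
    return 2.0 * abs(t) ** p - p * t * t + p - 2.0


def largest_value(t_values, p_values):
    """Largest value of the left-hand side over the grid; Lemma A1 says it is <= 0."""
    return max(lemma_a1_lhs(t, p) for t in t_values for p in p_values)


# t in [-5, 5] with step 0.01, p in (0, 2] with step 0.05
t_grid = [i / 100.0 for i in range(-500, 501)]
p_grid = [j / 20.0 for j in range(1, 41)]
worst = largest_value(t_grid, p_grid)
```

On this grid the largest value is attained at $|t| = 1$, where the expression equals zero up to rounding error, in agreement with the role of $t = 1$ in the proof.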
Lemma A2.
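For any nonzero vectors α and β, when $0 < p \le 2$, the following inequality holds:
$$
\|\alpha\|_2^p - \frac{p}{2}\|\beta\|_2^{p-2}\|\alpha\|_2^2 \le \|\beta\|_2^p - \frac{p}{2}\|\beta\|_2^{p-2}\|\beta\|_2^2.
$$
Proof. 
According to Lemma A1 with $t = \|\alpha\|_2 / \|\beta\|_2$, we obtain:
$$
2\left(\frac{\|\alpha\|_2}{\|\beta\|_2}\right)^p - p\left(\frac{\|\alpha\|_2}{\|\beta\|_2}\right)^2 + p - 2 \le 0
$$
          ⇒
$$
2\|\alpha\|_2^p - p\|\beta\|_2^{p-2}\|\alpha\|_2^2 \le (2 - p)\|\beta\|_2^p
$$
          ⇒
$$
\|\alpha\|_2^p - \frac{p}{2}\|\beta\|_2^{p-2}\|\alpha\|_2^2 \le \|\beta\|_2^p - \frac{p}{2}\|\beta\|_2^{p-2}\|\beta\|_2^2.
$$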
Theorem A1.
Algorithm 1 monotonically decreases the objectives (17) and (18) in each iteration until it converges.
Proof. 
Recall our framework
$$
J = \min_{\omega_1, b_1, \xi_1} \sum_{i=1}^{m_1} \min\left(\|\omega_1^T x_i + b_1\|_2^p, \varepsilon_1\right) + c_1 \sum_{i=1}^{m_2} \left[1 - \exp\left(-\frac{\xi_{1,i}^2}{2c^2}\right)\right]^{\theta} + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) + c_5 f_1^T L f_1
$$
$$
= J_1 + J_2 + J_3 + J_4,
$$
$$
J = \min_{z_1} \sum_{i=1}^{m_1} \min\left(\|h_i z_1\|_2^p, \varepsilon_1\right) + \frac{c_3}{2} z_1^T z_1 + J_2 + J_4,
$$
where $h_i = (x_i, 1)$ and $z_1 = (\omega_1^T, b_1)^T$. When $\|h_i z_1\|_2^p$ is smaller than $\varepsilon_1$, the above equation is equivalent to:
$$
J = \min_{z_1} \sum_{i=1}^{m_1} \|h_i z_1\|_2^p + \frac{c_3}{2} z_1^T z_1 + J_2 + J_4.
$$
Suppose $z_1^{(k+1)}$ is the solution obtained in the $(k+1)$th iteration of the algorithm. Based on (47), we have:
$$
z_1^{(k+1)} = \arg\min_{z_1} \; (H z_1)^T F^{(k)} H z_1 + c_3 z_1^T z_1 + J_2 + J_4.
$$
Hence, at the kth iteration:
$$
(H z_1^{(k+1)})^T F^{(k)} H z_1^{(k+1)} + c_3 (z_1^{(k+1)})^T z_1^{(k+1)} + J_2^{(k+1)} + J_4^{(k+1)}
$$
                              ≤
$$
(H z_1^{(k)})^T F^{(k)} H z_1^{(k)} + c_3 (z_1^{(k)})^T z_1^{(k)} + J_2^{(k)} + J_4^{(k)},
$$
which is equivalent to:
$$
\frac{p}{2}\|H z_1^{(k)}\|_2^{p-2}\|H z_1^{(k+1)}\|_2^2 + c_3 (z_1^{(k+1)})^T z_1^{(k+1)} + J_2^{(k+1)} + J_4^{(k+1)}
$$
                              ≤
$$
\frac{p}{2}\|H z_1^{(k)}\|_2^{p-2}\|H z_1^{(k)}\|_2^2 + c_3 (z_1^{(k)})^T z_1^{(k)} + J_2^{(k)} + J_4^{(k)}.
$$
Based on Lemma A2, we obtain:
$$
\|H z_1^{(k+1)}\|_2^p - \frac{p}{2}\|H z_1^{(k)}\|_2^{p-2}\|H z_1^{(k+1)}\|_2^2 \le \|H z_1^{(k)}\|_2^p - \frac{p}{2}\|H z_1^{(k)}\|_2^{p-2}\|H z_1^{(k)}\|_2^2.
$$
Adding the Formulas (A6) and (A7), we have:
$$
\|H z_1^{(k+1)}\|_2^p + c_3 (z_1^{(k+1)})^T z_1^{(k+1)} + J_2^{(k+1)} + J_4^{(k+1)} \le \|H z_1^{(k)}\|_2^p + c_3 (z_1^{(k)})^T z_1^{(k)} + J_2^{(k)} + J_4^{(k)}.
$$
Thus, we have $J(z_1^{(k+1)}) \le J(z_1^{(k)})$. If $\|h_i z_1\|_2^p$ is larger than $\varepsilon_1$, the capped term is constant and we obtain $J(z_1^{(k+1)}) = J(z_1^{(k)})$. Therefore, $J(z_1^{(k+1)}) \le J(z_1^{(k)})$ holds in both cases, meaning that Algorithm 1 decreases the objective of problem (17) until convergence. For problem (18), the proof is the same. Since the objectives (17) and (18) are bounded below by 0, Algorithm 1 will converge. □
Lemma A3.
For all positive real numbers a and b, the following inequality holds:
$$
\sqrt{a} - \frac{a}{2\sqrt{b}} \le \sqrt{b} - \frac{b}{2\sqrt{b}}.
$$
Theorem A2.
Algorithm 1 converges to a local minimum solution of problems (17) and (18).
Proof. 
Recall our framework
$$
J = \min_{\omega_1, b_1, \xi_1} \sum_{i=1}^{m_1} \min\left(\|\omega_1^T x_i + b_1\|_2^p, \varepsilon_1\right) + c_1 \sum_{i=1}^{m_2} \left[1 - \exp\left(-\frac{\xi_{1,i}^2}{2c^2}\right)\right]^{\theta} + \frac{c_3}{2}\left(\|\omega_1\|_2^2 + b_1^2\right) + c_5 f_1^T L f_1,
$$
$$
J = \min_{z_1} \sum_{i=1}^{m_1} \min\left(\|h_i z_1\|_2^p, \varepsilon_1\right) + \frac{c_3}{2} z_1^T z_1 + J_2 + J_4,
$$
where $h_i = (x_i, 1)$ and $z_1 = (\omega_1^T, b_1)^T$. First, we consider $J_2 = c_1 \sum_{i=1}^{m_2} \left[1 - \exp\left(-\frac{\xi_{1,i}^2}{2c^2}\right)\right]^{\theta}$, and define two functions
$$
\mathbb{R}_+ \to \mathbb{R}_+: \text{concave function } \theta(v) = \vartheta(\sqrt{v}), \; v \in [0, \infty), \quad \theta'(v) = \frac{\vartheta'(\sqrt{v})}{2\sqrt{v}},
$$
$$
\mathbb{R}_- \to \mathbb{R}_+: (-\theta')^{-1}.
$$
Based on conjugate function theory, there exists a convex conjugate function of the convex function $-\theta(v)$ on $\mathbb{R}_-$:
$$
(-\theta)^*(z) = \sup_{v \ge 0} \{ z v + \theta(v) \}, \quad z < 0,
$$
where
$$
(-\theta)^*(z) = z\,(-\theta')^{-1}(z) + \theta\left[(-\theta')^{-1}(z)\right], \quad z < 0.
$$
Because the conjugate of the conjugate of a convex function is the convex function itself, we have
$$
(-\theta)(v) = \sup_{z < 0} \{ z v - (-\theta)^*(z) \}, \quad v \ge 0.
$$
Let $z = -\frac{1}{2}s$, and define a convex function $\psi(s) = (-\theta)^*\left(-\frac{1}{2}s\right)$; then
$$
-\theta(v) = \sup_{s > 0} \left\{ -\frac{1}{2} s v - \psi(s) \right\}, \quad v \ge 0,
$$
which is equivalent to
$$
\theta(v) = \inf_{s > 0} \left\{ \frac{1}{2} s v + \psi(s) \right\}, \quad v \ge 0.
$$
In (A18), $\frac{1}{2} s v + \psi(s)$ is convex in $s > 0$, so we can obtain the minimum solution $s^* = 2\theta'(v)$ by differentiation. Define $\varphi(v) = 1 - \exp(-v^2)$, where $v = \frac{\xi_1}{\sqrt{2}c}$. Since $\varphi(v) = \theta(v^2)$, we have:
$$
\varphi(v) = \theta(v^2) = \inf_{s > 0} \left\{ \frac{1}{2} s v^2 + \psi(s) \right\}, \quad \forall v.
$$
When $v > 0$, there exists a minimum solution $s^* = 2\theta'(v^2)$ on the right-hand side of the above equation, i.e.,
$$
s^* = \frac{\varphi'(v)}{v}.
$$
Combining the Formulas (A19) and (A20):
$$
\inf_{s > 0} \left\{ \frac{1}{2} s v^2 + \psi(s) \right\} = \frac{1}{2} s^* v^2 + \psi(s^*), \quad \forall v,
$$
where $s^* = 2\exp(-v^2)$. Then, we can say that Algorithm 1 will converge to a local minimum solution of $J_2$. For $J_4 = c_5 f_1^T L f_1$, in the $(k+1)$th iteration, we have:
$$
J_4^{(k+1)} \le J_4^{(k)}.
$$
With Lemma A3, we set
$$
a = |J_4^{(k+1)}|^2,
$$
$$
b = |J_4^{(k)}|^2,
$$
then, we can easily obtain the following inequality:
$$
|J_4^{(k+1)}| - \frac{|J_4^{(k+1)}|^2}{2|J_4^{(k)}|} \le |J_4^{(k)}| - \frac{|J_4^{(k)}|^2}{2|J_4^{(k)}|}.
$$
Combining (A22) and (A24), we can obtain
$$
|J_4^{(k+1)}| \le |J_4^{(k)}|.
$$
Then, we can say that Algorithm 1 will converge to a local minimum solution of $J_4$. For
$$
J_1 + J_3 = \min_{z_1} \sum_{i=1}^{m_1} \min\left(\|h_i z_1\|_2^p, \varepsilon_1\right) + \frac{c_3}{2} z_1^T z_1,
$$
define the Lagrangian function of (A26) as $\tau(z_1)$. By the KKT condition of (A26), we have:
$$
c_3 z_1 + \sum_i \begin{cases} p\,\|h_i z_1\|_2^{p-1} h_i^T, & 0 \le \|h_i z_1\|_2^p < \varepsilon_1, \\ 0, & \text{otherwise} \end{cases} \; = 0.
$$
We substitute the $f_{ii}$ in (33) into the above equation:
$$
2 H^T F H z_1 + c_3 z_1 = 0.
$$
Combining (A28) and (47), we obtain:
$$
(H z_1)^T F (H z_1) + c_3 z_1^T z_1.
$$
Similarly, from the Lagrangian function of Formula (A29), we obtain:
$$
2 H^T F H z_1 + c_3 z_1 = 0.
$$
Then, we can say that Algorithm 1 will converge to a local minimum solution of $J_1 + J_3$. Furthermore, we can say that Algorithm 1 will converge to a local minimum solution of J. □

References

  1. Brown, M.P.; Grundy, W.N.; Lin, D.; Cristianini, N.; Sugnet, C.W.; Furey, T.S.; Ares, M., Jr.; Haussler, D. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA 2000, 97, 262–267.
  2. Ma, S.; Cheng, B.; Shang, Z.; Liu, G. Scattering transform and LSPTSVM based fault diagnosis of rotating machinery. Mech. Syst. Signal Process. 2018, 104, 55–170.
  3. Suykens, J.A.K.; Vandewalle, J. Least squares support vector machine classifiers. Neural Process. Lett. 1999, 9, 293–300.
  4. Kumar, M.A.; Gopal, M. Least squares twin support vector machines for pattern classification. Expert Syst. Appl. 2009, 36, 7535–7543.
  5. Jayadeva, N.; Khemchandani, R.; Chandra, S. Twin support vector machines for pattern classification. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 905–910.
  6. Shao, Y.H.; Zhang, C.H.; Wang, X.B.; Deng, N.Y. Improvements on twin support vector machines. IEEE Trans. Neural Netw. 2011, 22, 962–968.
  7. Chen, X.; Yang, J.; Ye, Q.; Liang, J. Recursive projection twin support vector machine via within-class variance minimization. Pattern Recognit. 2011, 44, 2643–2655.
  8. Xu, Y.; Yang, Z.; Pan, X. A novel twin support-vector machine with pinball loss. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 359–370.
  9. Tanveer, M.; Tiwari, A.; Choudhary, R.; Jalan, S. Sparse pinball twin support vector machines. Appl. Soft Comput. 2019, 78, 164–175.
  10. Shao, Y.H.; Deng, N.Y.; Yang, Z.M. Least squares recursive projection twin support vector machine for classification. Pattern Recognit. 2012, 45, 2299–2307.
  11. Chen, S.G.; Wu, X.J. A new fuzzy twin support vector machine for pattern classification. Int. J. Mach. Learn. Cybern. 2018, 9, 1553–1564.
  12. Hou, Y.Y.; Li, J.; Chen, X.B.; Ye, C.Q. Quantum adversarial metric learning model based on triplet loss function. arXiv 2023, arXiv:2303.08293.
  13. Zhu, J.; Rosset, S.; Tibshirani, R.; Hastie, T. 1-norm support vector machines. Adv. Neural Inf. Process. Syst. 2003, 16.
  14. Mangasarian, O.L.; Bennett, K.P.; Parrado-Hernández, E. Exact 1-Norm Support Vector Machines via Unconstrained Convex Differentiable Minimization. J. Mach. Learn. Res. 2006, 7, 1517–1530.
  15. Gao, S.; Ye, Q.; Ye, N. 1-Norm least squares twin support vector machines. Neurocomputing 2011, 74, 3590–3597.
  16. Ye, Q.; Zhao, H.; Li, Z.; Yang, X.; Gao, S.; Yin, T.; Ye, N. L1-Norm distance minimization-based fast robust twin support vector k-plane clustering. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 4494–4503.
  17. Yan, H.; Ye, Q.; Zhang, T.A.; Yu, D.J.; Yuan, X.; Xu, Y.; Fu, L. Least squares twin bounded support vector machines based on L1-norm distance metric for classification. Pattern Recognit. 2018, 74, 434–447.
  18. Hazarika, B.B.; Gupta, D. 1-Norm random vector functional link networks for classification problems. Complex Intell. Syst. 2022, 8, 3505–3521.
  19. Jiang, W.; Nie, F.; Huang, H. Robust dictionary learning with capped L1-norm. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015.
  20. Nie, F.; Huo, Z.; Huang, H. Joint capped norms minimization for robust matrix recovery. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017.
  21. Wu, M.J.; Liu, J.X.; Gao, Y.L.; Kong, X.Z.; Feng, C.M. Feature selection and clustering via robust graph-laplacian PCA based on capped L1-norm. In Proceedings of the 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Kansas City, MO, USA, 13–16 November 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1741–1745.
  22. Zhao, M.; Chow, T.W.; Zhang, H.; Li, Y. Rolling fault diagnosis via robust semi-supervised model with capped L2,1-norm regularization. In Proceedings of the IEEE International Conference on Industrial Technology, Toronto, ON, Canada, 22–25 March 2017; pp. 1064–1069.
  23. Xiang, S.; Nie, F.; Meng, G.; Pan, C.; Zhang, C. Discriminative least squares regression for multiclass classification and feature selection. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 1738–1754.
  24. Nie, F.; Wang, X.; Huang, H. Multiclass capped Lp-norm SVM for robust classifications. In Proceedings of the 32th AAAI Conference on Artificial Intelligence, New Orleans, LO, USA, 2–7 February 2018.
  25. Wang, C.; Ye, Q.; Luo, P.; Ye, N.; Fu, L. Robust capped L1-norm twin support vector machine. Neural Netw. 2019, 114, 47–59.
  26. Ma, X.; Ye, Q.; Yan, H. L2,p-norm distance twin support vector machine. IEEE Access 2017, 5, 23473–23483.
  27. Ma, X.; Liu, Y.; Ye, Q. P-Order L2-Norm Distance Twin Support Vector Machine. In Proceedings of the 4th IAPR Asian Conference on Pattern Recognition (ACPR), Nanjing, China, 26–29 November 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 617–622.
  28. Zhang, L.; Luo, M.; Li, Z.; Nie, F.; Zhang, H.; Liu, J.; Zheng, Q. Large-scale robust semisupervised classification. IEEE Trans. Cybern. 2018, 49, 907–917.
  29. Chapelle, O.; Scholkopf, B.; Zien, A. Semi-supervised learning. IEEE Trans. Neural Netw. 2009, 20, 542.
  30. Belkin, M. Problems of Learning on Manifolds. Ph.D. Thesis, The University of Chicago, Chicago, IL, USA, 2003.
  31. Rossi, L.; Torsello, A.; Hancock, E.R. Unfolding kernel embeddings of graphs: Enhancing class separation through manifold learning. Pattern Recognit. 2015, 48, 3357–3370.
  32. Qi, Z.; Tian, Y.; Shi, Y. Laplacian twin support vector machine for semi-supervised classification. Neural Netw. 2012, 35, 46–53.
  33. Xie, X.; Sun, F.; Qian, J.; Guo, L.; Zhang, R.; Ye, X.; Wang, Z. Laplacian Lp-norm least squares twin support vector machine. Pattern Recognit. 2023, 136, 109192.
  34. Wen, J.; Lai, Z.; Wong, W.K.; Cui, J.; Wan, M. Optimal feature selection for robust classification via L2,1-norms regularization. In Proceedings of the Twenty-Second International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 24–28 August 2014; pp. 517–521.
  35. Wang, H.; Nie, F.; Huang, H. Learning robust locality preserving projection via p-order minimization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; AAAI Press: Washington, DC, USA, 2015; pp. 3059–3065.
  36. Ke, J.; Gong, C.; Liu, T.; Zhao, L.; Yang, J.; Tao, D. Laplacian Welsch Regularization for Robust Semisupervised Learning. IEEE Trans. Cybern. 2020, 52, 164–177.
  37. Yuan, C.; Yang, L.-M. Capped L2,P-norm metric based robust least squares twin support vector machine for pattern classification. Neural Netw. 2021, 142, 457–478.
  38. Kwak, N. Principal component analysis based on L1-norm maximization. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1672–1680.
  39. Demšar, J.; Schuurmans, D. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30.
Figure 1. $L_2$ loss vs. $L_1$ loss vs. Welsch loss vs. $V_\theta(x)$ loss.
Figure 2. Welsch loss with $\theta$-power under different $\theta$.
Figure 3. Convergence of WMTBSVM.
Figure 4. Distribution of artificial datasets with outliers.
Figure 5. The classification performance of six algorithms on the artificial datasets.
Figure 6. Accuracies of six algorithms under different noise levels.
Figure 7. Visualization of post-hoc tests for data from Table 6. (a) Gaussian kernel with 10% unlabeled samples. (b) Gaussian kernel with 30% unlabeled samples. (c) Gaussian kernel with 50% unlabeled samples.
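Table 1. Experimental results on UCI datasets without noise.

| Dataset (N × n) | Metric | TSVM | TBSVM | LSTBSVM | CTSVM | WMTBSVM | WMLSTBSVM |
|---|---|---|---|---|---|---|---|
| Australian (690 × 14) | ACC (%) | 85.44 | 86.91 | 86.03 | 86.21 | 86.44 | 87.18 |
| | Time (s) | 14.698 | 1.828 | 0.061 | 3.798 | 1.584 | 0.766 |
| Balance (576 × 4) | ACC (%) | 93.57 | 93.57 | 93.25 | 92.36 | 94.82 | 93.57 |
| | Time (s) | 0.695 | 0.725 | 0.051 | 3.270 | 1.017 | 0.616 |
| Backnote (1372 × 4) | ACC (%) | 87.23 | 87.30 | 87.90 | 86.92 | 88.35 | 88.15 |
| | Time (s) | 15.134 | 12.791 | 5.089 | 7.105 | 5.992 | 2.492 |
| Cancer (699 × 9) | ACC (%) | 95.65 | 95.94 | 95.22 | 96.16 | 94.17 | 95.62 |
| | Time (s) | 2.640 | 2.063 | 1.064 | 3.843 | 2.312 | 0.843 |
| German (1000 × 24) | ACC (%) | 73.80 | 73.90 | 74.00 | 75.70 | 77.60 | 76.10 |
| | Time (s) | 5.495 | 3.983 | 1.075 | 2.655 | 2.666 | 1.536 |
| Hepat (155 × 19) | ACC (%) | 77.33 | 80.67 | 80.51 | 80.18 | 83.42 | 82.67 |
| | Time (s) | 0.480 | 0.627 | 0.297 | 2.378 | 0.554 | 0.200 |
| Pima (768 × 8) | ACC (%) | 75.92 | 76.67 | 76.71 | 75.92 | 77.05 | 76.45 |
| | Time (s) | 4.282 | 1.730 | 0.669 | 3.827 | 2.011 | 0.888 |
| QSAR (1055 × 41) | ACC (%) | 85.96 | 85.38 | 85.30 | 86.25 | 86.90 | 86.90 |
| | Time (s) | 7.630 | 6.843 | 2.113 | 1.946 | 3.860 | 1.958 |
| Spect (267 × 44) | ACC (%) | 80.77 | 80.38 | 80.77 | 81.25 | 81.92 | 83.08 |
| | Time (s) | 0.512 | 0.224 | 0.152 | 1.794 | 1.045 | 0.308 |
| Vote (432 × 16) | ACC (%) | 95.95 | 94.71 | 94.79 | 95.48 | 95.71 | 95.95 |
| | Time (s) | 2.808 | 0.450 | 0.156 | 2.750 | 1.36 | 0.404 |
| WDBC (569 × 30) | ACC (%) | 96.43 | 95.89 | 95.93 | 96.54 | 97.25 | 96.43 |
| | Time (s) | 3.722 | 0.564 | 0.254 | 2.674 | 1.613 | 0.688 |
| Wholesale (440 × 7) | ACC (%) | 82.79 | 88.60 | 86.05 | 90.00 | 89.37 | 90.47 |
| | Time (s) | 1.120 | 1.648 | 0.745 | 2.560 | 1.227 | 0.500 |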
Table 2. Characteristics of UCI datasets.

| Dataset | Samples | Attributes | Dataset | Samples | Attributes |
|---|---|---|---|---|---|
| Australian | 690 | 14 | Pima | 768 | 8 |
| Balance | 576 | 4 | QSAR | 1055 | 41 |
| Backnote | 1372 | 4 | Spect | 267 | 44 |
| Cancer | 699 | 9 | Vote | 432 | 16 |
| German | 1000 | 24 | Wholesale | 440 | 7 |
| Hepat | 155 | 19 | WDBC | 569 | 30 |
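Table 3. Experimental results on UCI datasets with 10% noise.

| Dataset (N × n) | Metric | TSVM | TBSVM | LSTBSVM | CTSVM | WMTBSVM | WMLSTBSVM |
|---|---|---|---|---|---|---|---|
| Australian (690 × 14) | ACC (%) | 85.29 | 86.32 | 86.40 | 85.85 | 86.41 | 85.44 |
| | Time (s) | 3.702 | 1.244 | 0.552 | 3.564 | 1.745 | 0.842 |
| Balance (576 × 4) | ACC (%) | 93.04 | 93.39 | 92.43 | 91.11 | 93.75 | 93.21 |
| | Time (s) | 1.410 | 1.440 | 0.654 | 3.046 | 1.117 | 0.593 |
| Backnote (1372 × 4) | ACC (%) | 86.35 | 85.99 | 83.27 | 86.46 | 84.89 | 86.93 |
| | Time (s) | 15.068 | 8.845 | 4.090 | 7.406 | 6.062 | 2.436 |
| Cancer (699 × 9) | ACC (%) | 94.94 | 95.51 | 95.00 | 95.78 | 94.00 | 95.46 |
| | Time (s) | 2.143 | 1.973 | 0.862 | 2.69 | 1.741 | 0.856 |
| German (1000 × 24) | ACC (%) | 73.10 | 73.40 | 73.51 | 74.40 | 75.30 | 73.21 |
| | Time (s) | 5.051 | 4.120 | 1.575 | 1.846 | 4.038 | 1.661 |
| Hepat (155 × 19) | ACC (%) | 76.00 | 78.67 | 77.42 | 77.59 | 81.33 | 81.33 |
| | Time (s) | 0.209 | 3.999 | 1.483 | 2.167 | 0.607 | 0.270 |
| Pima (768 × 8) | ACC (%) | 75.60 | 75.92 | 76.11 | 76.24 | 76.18 | 76.33 |
| | Time (s) | 2.565 | 1.505 | 0.969 | 4.267 | 1.875 | 1.016 |
| QSAR (1055 × 41) | ACC (%) | 83.37 | 82.98 | 83.13 | 83.87 | 84.12 | 82.44 |
| | Time (s) | 9.977 | 6.863 | 3.111 | 4.68 | 3.659 | 1.844 |
| Spect (267 × 44) | ACC (%) | 78.08 | 79.23 | 79.77 | 80.69 | 81.15 | 81.92 |
| | Time (s) | 0.350 | 0.287 | 0.049 | 2.077 | 1.052 | 0.287 |
| Vote (432 × 16) | ACC (%) | 95.24 | 94.48 | 94.79 | 95.00 | 95.24 | 95.48 |
| | Time (s) | 2.940 | 0.447 | 0.148 | 3.438 | 1.119 | 0.452 |
| WDBC (569 × 30) | ACC (%) | 93.96 | 93.71 | 94.81 | 95.11 | 96.82 | 95.07 |
| | Time (s) | 5.201 | 0.552 | 0.254 | 2.856 | 2.270 | 0.682 |
| Wholesale (440 × 7) | ACC (%) | 79.53 | 83.49 | 84.64 | 87.47 | 88.15 | 90.23 |
| | Time (s) | 0.523 | 2.312 | 1.050 | 2.199 | 1.273 | 0.552 |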
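Table 4. Experimental results on UCI datasets with 30% noise.

| Dataset (N × n) | Metric | TSVM | TBSVM | LSTBSVM | CTSVM | WMTBSVM | WMLSTBSVM |
|---|---|---|---|---|---|---|---|
| Australian (690 × 14) | ACC (%) | 82.44 | 83.14 | 83.71 | 84.32 | 84.85 | 85.15 |
| | Time (s) | 1.833 | 0.669 | 0.350 | 3.946 | 1.677 | 0.841 |
| Balance (576 × 4) | ACC (%) | 91.21 | 92.86 | 92.10 | 88.57 | 93.39 | 93.04 |
| | Time (s) | 1.432 | 1.624 | 0.751 | 3.409 | 1.063 | 0.555 |
| Backnote (1372 × 4) | ACC (%) | 79.78 | 80.07 | 81.46 | 83.96 | 84.60 | 84.79 |
| | Time (s) | 11.367 | 5.357 | 4.088 | 7.104 | 4.252 | 3.062 |
| Cancer (699 × 9) | ACC (%) | 94.78 | 92.22 | 92.51 | 91.84 | 93.13 | 92.32 |
| | Time (s) | 2.389 | 1.503 | 0.653 | 3.571 | 1.460 | 0.824 |
| German (1000 × 24) | ACC (%) | 71.82 | 71.43 | 72.00 | 72.90 | 74.80 | 72.70 |
| | Time (s) | 0.821 | 0.741 | 0.376 | 1.530 | 4.549 | 1.591 |
| Hepat (155 × 19) | ACC (%) | 73.33 | 74.00 | 74.82 | 75.41 | 80.67 | 80.00 |
| | Time (s) | 0.233 | 2.711 | 1.032 | 2.540 | 0.537 | 0.190 |
| Pima (768 × 8) | ACC (%) | 71.63 | 71.29 | 70.16 | 74.16 | 75.00 | 75.00 |
| | Time (s) | 15.676 | 3.011 | 1.571 | 1.031 | 1.957 | 0.982 |
| QSAR (1055 × 41) | ACC (%) | 77.37 | 75.77 | 76.91 | 80.10 | 82.12 | 82.35 |
| | Time (s) | 8.300 | 5.138 | 3.108 | 4.678 | 3.468 | 1.850 |
| Spect (267 × 44) | ACC (%) | 74.00 | 77.69 | 78.00 | 77.31 | 81.15 | 81.15 |
| | Time (s) | 0.438 | 0.477 | 0.047 | 2.004 | 1.058 | 0.343 |
| Vote (432 × 16) | ACC (%) | 94.05 | 93.52 | 93.61 | 94.29 | 95.00 | 95.20 |
| | Time (s) | 3.557 | 0.402 | 0.148 | 3.029 | 1.137 | 0.424 |
| WDBC (569 × 30) | ACC (%) | 91.71 | 92.29 | 93.00 | 92.93 | 95.29 | 93.89 |
| | Time (s) | 10.108 | 0.501 | 0.255 | 2.670 | 2.184 | 0.665 |
| Wholesale (440 × 7) | ACC (%) | 68.56 | 68.12 | 67.38 | 85.60 | 87.81 | 89.77 |
| | Time (s) | 2.876 | 2.045 | 1.151 | 2.958 | 1.321 | 0.476 |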
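Table 5. Experimental results on UCI datasets with 50% noise.

| Dataset (N × n) | Metric | TSVM | TBSVM | LSTBSVM | CTSVM | WMTBSVM | WMLSTBSVM |
|---|---|---|---|---|---|---|---|
| Australian (690 × 14) | ACC (%) | 78.88 | 80.71 | 78.15 | 80.74 | 84.26 | 84.71 |
| | Time (s) | 2.472 | 0.621 | 0.359 | 3.392 | 1.590 | 0.759 |
| Balance (576 × 4) | ACC (%) | 80.32 | 80.43 | 81.37 | 86.96 | 90.54 | 92.50 |
| | Time (s) | 3.641 | 1.115 | 0.648 | 3.120 | 1.069 | 0.619 |
| Backnote (1372 × 4) | ACC (%) | 76.79 | 77.01 | 78.21 | 78.62 | 82.74 | 80.57 |
| | Time (s) | 12.224 | 7.318 | 3.086 | 7.317 | 4.695 | 2.484 |
| Cancer (699 × 9) | ACC (%) | 84.35 | 84.64 | 85.00 | 89.42 | 91.70 | 90.59 |
| | Time (s) | 1.854 | 1.351 | 0.756 | 3.579 | 1.505 | 0.883 |
| German (1000 × 24) | ACC (%) | 70.90 | 71.00 | 70.10 | 70.80 | 72.20 | 70.50 |
| | Time (s) | 15.293 | 6.912 | 3.073 | 2.660 | 2.641 | 1.560 |
| Hepat (155 × 19) | ACC (%) | 70.67 | 71.39 | 71.63 | 72.33 | 77.00 | 75.67 |
| | Time (s) | 0.232 | 0.648 | 0.299 | 1.883 | 0.614 | 0.174 |
| Pima (768 × 8) | ACC (%) | 65.79 | 62.26 | 64.61 | 68.29 | 73.39 | 73.53 |
| | Time (s) | 3.762 | 2.556 | 1.272 | 4.378 | 1.946 | 0.924 |
| QSAR (1055 × 41) | ACC (%) | 62.91 | 63.58 | 64.28 | 77.31 | 80.58 | 76.63 |
| | Time (s) | 0.767 | 10.486 | 4.125 | 4.378 | 4.198 | 1.747 |
| Spect (267 × 44) | ACC (%) | 69.28 | 66.92 | 66.92 | 71.15 | 79.66 | 80.38 |
| | Time (s) | 0.772 | 0.800 | 0.321 | 1.694 | 1.038 | 0.314 |
| Vote (432 × 16) | ACC (%) | 83.81 | 91.38 | 92.29 | 94.24 | 94.76 | 94.52 |
| | Time (s) | 3.793 | 0.400 | 0.150 | 2.614 | 1.173 | 0.380 |
| WDBC (569 × 30) | ACC (%) | 84.50 | 82.23 | 81.32 | 89.11 | 92.57 | 90.54 |
| | Time (s) | 10.195 | 0.515 | 0.054 | 3.250 | 2.128 | 0.649 |
| Wholesale (440 × 7) | ACC (%) | 68.67 | 68.14 | 71.93 | 83.74 | 85.88 | 88.37 |
| | Time (s) | 1.105 | 0.539 | 0.244 | 2.311 | 1.387 | 0.551 |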
Table 6. Average accuracy and ranks of six algorithms with Gaussian kernel on UCI datasets with different proportions of unlabeled samples.

| Case | Measure | TSVM | TBSVM | LSTBSVM | CTSVM | WMTBSVM | WMLSTBSVM |
|---|---|---|---|---|---|---|---|
| Gaussian kernel | Avg. ACC, 10% | 85.54 | 85.26 | 85.11 | 85.80 | 86.45 | 86.84 |
| | Avg. rank, 10% | 4.88 | 4.17 | 4.17 | 2.92 | 2.25 | 2.63 |
| | Avg. ACC, 30% | 80.64 | 81.03 | 82.31 | 83.45 | 85.65 | 86.04 |
| | Avg. rank, 30% | 5.17 | 5.08 | 4.08 | 3.50 | 1.50 | 1.67 |
| | Avg. ACC, 50% | 75.57 | 74.97 | 75.48 | 80.23 | 83.77 | 84.37 |
| | Avg. rank, 50% | 4.92 | 4.96 | 4.79 | 3.0 | 1.42 | 1.92 |
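As a rough illustration of how the average ranks in Table 6 feed a post-hoc comparison of the kind visualized in Figure 7, the sketch below computes the Friedman statistic, its Iman-Davenport correction, and the Nemenyi critical difference, following the procedure described by Demšar [39]. Several things here are assumptions rather than details confirmed by this section: that the ranks were averaged over the N = 12 UCI datasets of Table 2, that the significance level is 0.05, and that a Nemenyi-style test was the post-hoc procedure actually used. The function names are hypothetical.

```python
import math

# Average ranks copied from Table 6 (Gaussian kernel), in the column order
# TSVM, TBSVM, LSTBSVM, CTSVM, WMTBSVM, WMLSTBSVM.
AVG_RANKS = {
    "10%": [4.88, 4.17, 4.17, 2.92, 2.25, 2.63],
    "30%": [5.17, 5.08, 4.08, 3.50, 1.50, 1.67],
    "50%": [4.92, 4.96, 4.79, 3.0, 1.42, 1.92],
}
N_DATASETS = 12   # assumption: the 12 UCI datasets listed in Table 2
Q_ALPHA = 2.850   # Nemenyi critical value for k = 6, alpha = 0.05 (Demsar 2006)


def friedman_chi2(avg_ranks, n_datasets):
    """Friedman chi-square statistic computed from average ranks."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


def iman_davenport_f(chi2, n_datasets, k):
    """Iman-Davenport F statistic, with (k-1) and (k-1)(N-1) degrees of freedom."""
    return (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)


def nemenyi_cd(k, n_datasets, q_alpha):
    """Critical difference between average ranks for the Nemenyi test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


if __name__ == "__main__":
    k = 6
    cd = nemenyi_cd(k, N_DATASETS, Q_ALPHA)
    print(f"Nemenyi critical difference: {cd:.3f}")
    for case, ranks in AVG_RANKS.items():
        chi2 = friedman_chi2(ranks, N_DATASETS)
        f_stat = iman_davenport_f(chi2, N_DATASETS, k)
        print(f"{case}: chi2 = {chi2:.2f}, F_F = {f_stat:.2f}")
```

Under these assumptions the critical difference is about 2.18, so two algorithms would be judged significantly different when their average ranks differ by more than that. In the 50% case, for example, WMTBSVM (1.42) and TSVM (4.92) differ by 3.50.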
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


Ma, B.; Ma, J.; Yu, G. A Novel Robust Metric Distance Optimization-Driven Manifold Learning Framework for Semi-Supervised Pattern Classification. Axioms 2023, 12, 737. https://doi.org/10.3390/axioms12080737
