A Novel Robust Metric Distance Optimization-Driven Manifold Learning Framework for Semi-Supervised Pattern Classification

Ma, Bao; Ma, Jun; Yu, Guolin

doi:10.3390/axioms12080737



by Bao Ma, Jun Ma * and Guolin Yu

School of Mathematics and Information Sciences, North Minzu University, Yinchuan 750021, China

* Author to whom correspondence should be addressed.

Axioms 2023, 12(8), 737; https://doi.org/10.3390/axioms12080737

Submission received: 24 June 2023 / Revised: 23 July 2023 / Accepted: 24 July 2023 / Published: 27 July 2023

(This article belongs to the Special Issue Mathematics of Neural Networks: Models, Algorithms and Applications)


Abstract:

In this work, we address the problem of improving the classification performance of machine learning models, especially in the presence of noisy and outlier data. To this end, we first innovatively design a generalized adaptive robust loss function called

V_{θ} (x)

. Intuitively,

V_{θ} (x)

can improve the robustness of the model by selecting different robust loss functions for different learning tasks during the learning process via the adaptive parameter

θ

. Compared with other robust loss functions,

V_{θ} (x)

has some desirable salient properties, such as symmetry, boundedness, robustness, nonconvexity, and adaptivity, making it suitable for a wide range of machine learning applications. Secondly, a new robust semi-supervised learning framework for pattern classification is proposed. In this learning framework, the proposed robust loss function

V_{θ} (x)

and capped

L_{2, p}

-norm robust distance metric are introduced to improve the robustness and generalization performance of the model, especially when the outliers are far from the normal data distributions. Simultaneously, based on this learning framework, the Welsch manifold robust twin bounded support vector machine (WMRTBSVM) and its least-squares version are developed. Finally, two effective iterative optimization algorithms are designed, their convergence is proved, and their complexity is calculated. Experimental results on several datasets with different noise settings and different evaluation criteria show that our methods have better classification performance and robustness. With the Cancer dataset, when there is no noise, the classification accuracy of our proposed methods is

94.17 %

and

95.62 %

, respectively. When the Gaussian noise is

50 %

, the classification accuracy of our proposed methods is

91.76 %

and

90.59 %

, respectively, demonstrating that our method has satisfactory classification performance and robustness.

Keywords:

robust distance metric; loss function; manifold regularization; semi-supervised learning; pattern classification

1. Introduction

Data collection and reasonable processing are becoming increasingly crucial as modern computer technology advances. As an excellent machine learning tool, support vector machine (SVM) [1,2,3] has been widely used in bioinformatics, computer vision, data mining, robotics, and other fields in recent years. The main idea behind SVM classification, which is based on statistical learning theory and optimization theory, is to construct a pair of parallel hyperplanes that maximize the minimum distance between two classes of samples. SVMs implement the structural risk minimization (SRM) principle in addition to empirical risk minimization. Although SVM can achieve good classification performance, it needs to solve a large-scale quadratic programming problem (QPP), and training it takes a lot of time, which seriously hinders the application of SVM in large-scale classification tasks [4]. Furthermore, when dealing with complicated data, the simple SVM model runs into various issues, such as the “XOR” problem, which stymie its development and practical implementation.

To overcome the difficulty of solving a single large QPP in SVM, Jayadeva et al. [5] proposed a twin support vector machine (TSVM) for pattern classification based on the generalized eigenvalue proximal support vector machine (GEPSVM). Since TSVM solves two smaller QPPs instead of a single large QPP, it can theoretically learn four times faster than a standard SVM. The main goal of TSVM is to find two nonparallel hyperplanes, each of which is as close as possible to the corresponding class in the sample data, while being as far away from the other class as possible. Further, to overcome the problem that TSVM only considers empirical risk minimization without considering the principle of structural risk minimization, Shao et al. [6] proposed a twin bounded support vector machine (TBSVM) by introducing two regularization terms. Compared with TSVM, a significant advantage of TBSVM is that it implements the principle of structural risk minimization, which embodies the essence of statistical learning theory, so this improvement can improve the classification performance of TSVM. In recent years, some TSVM-based variant algorithms have been proposed for pattern classification tasks, such as least squares twin support vector machine (LSTSVM) [4], recursive projection twin support vector machine (RPTSVM) [7], pinball twin support vector machine (Pin-TSVM) [8], sparse pinball twin support vector machine (SPTWSVM) [9], least squares recursive projection twin support vector machine (LSRPTSVM) [10], fuzzy twin support vector machine (FBTSVM) [11], and so on, which greatly promoted the development of TSVM.

It is well known that distance metrics play a crucial role in many machine learning algorithms [12]. Although the above algorithms show good performance in pattern classification, it is worth noting that most of them adopt the

L_{2}

-norm distance metric, whose squaring operation will exaggerate the impact of outliers on model performance. To effectively alleviate the impact of the

L_{2}

-norm distance metric on the robustness of the algorithm, the

L_{1}

-norm distance metric with bounded derivative has received extensive attention and research in many fields of machine learning in recent years [13,14,15,16,17,18]. For example, Zhu et al. [13] proposed 1-norm SVM (1-SVM) based on an SVM learning framework. Mangasarian [14] proposed an exact

L_{1}

-norm support vector machine based on unconstrained convex differentiable minimization. Gao [15] developed a new 1-norm least squares TSVM (NELSTSVM). Ye et al. [16] proposed an

L_{1}

-norm distance minimization-based robust TSVM. Yan et al. [17] proposed 1-norm projection TSVM (1-PTSVM), and so on. As mentioned earlier, the

L_{1}

-norm is a better alternative to the squared

L_{2}

-norm in terms of enhancing the robustness of the algorithm. However, when the outliers are large, the existing classification methods based on

L_{1}

-norm distance often cannot achieve satisfactory classification results.

Recently, more and more researchers have paid attention to the capped

L_{1}

-norm and achieved some excellent research results [19,20,21,22,23,24]. Research shows that capped

L_{1}

-norm is considered to be a better approximation of

L_{0}

-norm and more robust than

L_{1}

-norm. Some excellent algorithms based on capped

L_{1}

-norm have been proposed for robust classification tasks. For example, Wang et al. [25] proposed a new robust TSVM (CTSVM) by applying capped

L_{1}

-norm. CTSVM retains the advantages of TSVM and improves the robustness of classification. The experimental results on multiple datasets show that the CTSVM algorithm has good robustness and effectiveness to outliers. The capped

L_{1}

-norm metrics are neither convex nor smooth, which makes them difficult to optimize. There are two general strategies for solving nonconvex optimization problems. The first strategy is to design efficient algorithms, such as the bump process algorithm and the abnormal path algorithm. The second strategy is to smooth the metric function to reduce the complexity of the algorithm. To overcome the shortcomings of capped

L_{1}

-norm, many scholars proposed capped

L_{2, p}

-norm for robust learning [26,27]. Zhang et al. [28] proposed a new large-scale semi-supervised classification algorithm based on ridge regression and capped

L_{2, p}

-norm loss function. It is worth noting that by setting the appropriate p-value, the capped

L_{1}

-norm and capped

L_{2}

-norm are special forms of capped

L_{2, p}

-norm: when

p = 1

or

p = 2

, the capped

L_{2, p}

-norm corresponds to the capped

L_{1}

-norm or capped

L_{2}

-norm. These algorithms show that the capped distance metric is robust against outliers. However, there are few extensions and related applications of the capped

L_{2, p}

-norm for twin support vector machine.

In the current scenario, although data collection is easy, obtaining labeled data is difficult [29]. To address this issue, researchers have proposed semi-supervised learning (SSL) [29], which uses a small amount of labeled data together with a larger amount of unlabeled data to build more reliable classifiers. Graph-based SSL algorithms are a significant branch of SSL. Their learning strategy is to first connect labeled and unlabeled data points by edges and then build from these edges a graph that represents the similarity between samples. Manifold regularization (MR)-based SSL [30] is one of the graph-based SSL methods; it preserves the manifold structure to improve the discriminative property of the data [31]. Its learning strategy is to mine the geometric distribution information of the data and represent it in the form of regularization terms. Reference [31] first introduced MR to SSL by proposing the Laplacian support vector machine (Lap-SVM) and Laplacian regularized least squares (Lap-RLS). Qi et al. [32] developed a Laplacian TSVM (LapTSVM) based on the pair of nonparallel hyperplanes of TSVM. Although the classifier’s generalization performance is improved, the method’s parameter adjustment may be affected by different datasets, and it may not handle large-scale problems effectively due to its high computational complexity. Xie et al. [33] proposed a novel Laplacian

L_{p}

-norm least squares twin support vector machine (Lap-

L_{p}

LSTSVM). The experimental results on both synthetic and real-world datasets show that Lap-

L_{p}

LSTSVM outperforms other state-of-the-art methods and can also deal with noisy datasets [34,35].

To summarize, prior research on improving the classification performance of TBSVM while considering both robustness and discriminability is limited. In response, we introduce the WMRTBSVM and WMLSRTBSVM models. Specifically, we replace the squared-distance term in TBSVM with the capped

L_{2, p}

-norm distance, and we replace the hinge loss term (the second term in TBSVM) with the Welsch loss with θ-power. This improves the model’s classification performance and robustness. Furthermore, we incorporate a manifold structure into the model to further enhance its classification performance and discriminability. The main contributions of this paper are summarized as follows:

(1): A generalized adaptive robust loss function called $V_{θ} (x)$ is innovatively designed. Intuitively, $V_{θ} (x)$ can improve the robustness of the model by selecting different robust loss functions for different learning tasks during the learning process via the adaptive parameter $θ$ . Compared with other robust loss functions, $V_{θ} (x)$ has some desirable salient properties, such as symmetry, boundedness, robustness, nonconvexity, and adaptivity.
(2): A novel robust manifold learning framework for semi-supervised pattern classification is proposed. In this learning framework, the proposed robust loss function $V_{θ} (x)$ and capped $L_{2, p}$ -norm robust distance metric are introduced to improve the robustness and generalization performance of the model, especially when the outliers are far from the normal data distributions.
(3): Two effective iterative optimization algorithms are designed for solving our methods by the half-quadratic (HQ) optimization algorithm, and the convergence of the algorithms is demonstrated.
(4): Experimental results on artificial and benchmark datasets with different noise settings and different evaluation criteria show that our methods have better classification performance and robustness.

In Section 2, we introduce the formulas involved in TBSVM and manifold regularization since our model is based on these two approaches. In Section 3, we present a novel robust manifold learning framework for semi-supervised pattern classification. Finally, we discuss experiments and conclusions in Section 4 and Section 5, respectively.


2. Related Works

This section presents a review of related works, which include TBSVM and manifold regularization. The binary classification problem in the n-dimensional real vector space

R^{n}

is considered. All vectors are represented as columns. Given a training dataset

T = {(x_{1}, y_{1}), \dots, (x_{m}, y_{m})}

, where

x_{i} \in R^{n}

is the input and

y_{i} \in {- 1, 1}

is the corresponding output for

i = 1, \dots, m

. T is composed of

m_{1}

positive class and

m_{2}

negative class samples, where m =

m_{1}

+

m_{2}

. The data samples from class i form the data matrix

X_{i} \in R^{m_{i} \times n}

, where each row represents a sample.

A \in R^{m_{1} \times n}

represents all positive class samples (i.e.,

y_{i} = 1

), and

B \in R^{m_{2} \times n}

represents all negative class samples (i.e.,

y_{i} = - 1

).

2.1. TBSVM

In this subsection, we provide a brief review of the twin bounded support vector machine (TBSVM). The optimization objective of TBSVM is to ensure that each hyperplane is as close as possible to the samples in the corresponding class and as far away as possible from the samples in the other class. For the linear case, TBSVM defines two nonparallel hyperplanes:

\begin{matrix} f_{1} (x) = ω_{1}^{T} x + b_{1} = 0 \quad \text{and} \quad f_{2} (x) = ω_{2}^{T} x + b_{2} = 0 . \end{matrix}

(1)

To improve the classification ability of TSVM and realize the principle of structural risk minimization, an improved version of TSVM named TBSVM is obtained by introducing an

L_{2}

-regularization term based on TSVM:

\begin{matrix} min_{ω_{1}, b_{1}, ξ_{1}} \frac{1}{2} ∥ A ω_{1} + e_{1} b_{1} ∥_{2}^{2} + c_{1} e_{2}^{T} ξ_{1} + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) \\ \text{s.t.} \quad - (B ω_{1} + e_{2} b_{1}) + ξ_{1} \geq e_{2}, ξ_{1} \geq 0, \end{matrix}

(2)

and

\begin{matrix} min_{ω_{2}, b_{2}, ξ_{2}} \frac{1}{2} ∥ B ω_{2} + e_{2} b_{2} ∥_{2}^{2} + c_{2} e_{1}^{T} ξ_{2} + \frac{c_{4}}{2} (∥ ω_{2} ∥_{2}^{2} + b_{2}^{2}) \\ \text{s.t.} \quad (A ω_{2} + e_{1} b_{2}) + ξ_{2} \geq e_{1}, ξ_{2} \geq 0 . \end{matrix}

(3)

To avoid the singularity problems caused by matrix inversion, positive perturbations

λ_{1} I

and

λ_{2} I

are introduced, where

λ_{1}

and

λ_{2}

are small positive constants, and 0 and I represent the zero matrix and the identity matrix, respectively, of appropriate dimensions. Therefore, based on duality theory, we can obtain the dual problems of (2) and (3):

\begin{matrix} min_{α} \frac{1}{2} α^{T} G {(H^{T} H + c_{3} I)}^{- 1} G^{T} α - e_{2}^{T} α \\ \text{s.t.} \quad 0 \leq α \leq c_{1} e_{2}, \end{matrix}

(4)

and

\begin{matrix} min_{β} \frac{1}{2} β^{T} H {(G^{T} G + c_{4} I)}^{- 1} H^{T} β - e_{1}^{T} β \\ \text{s.t.} \quad 0 \leq β \leq c_{2} e_{1}, \end{matrix}

(5)

where

c_{1}, c_{2}, c_{3}, c_{4} > 0

represent regularization parameters,

e_{1} \in R^{m_{1}}

and

e_{2} \in R^{m_{2}}

are vectors of ones, and

ξ_{1}

and

ξ_{2}

are slack vectors. The superscript T denotes transposition, and the augmented matrices are

G = [B e_{2}]

and

H = [A e_{1}]

. In the dual problems,

α \in R^{m_{2}}

and

β \in R^{m_{1}}

are the vectors of Lagrange multipliers. By solving (4) and (5), two nonparallel hyperplanes can be obtained:

\begin{matrix} \begin{bmatrix} ω_{1} \\ b_{1} \end{bmatrix} = - {(H^{T} H + c_{3} I)}^{- 1} G^{T} α \quad \text{and} \quad \begin{bmatrix} ω_{2} \\ b_{2} \end{bmatrix} = {(G^{T} G + c_{4} I)}^{- 1} H^{T} β . \end{matrix}

A new data point

x \in R^{n}

is then assigned to the positive or negative class, depending on which of the two hyperplanes (1) it lies closest to, i.e.,

f (x) = \arg\min_{k = 1, 2} \frac{| ω_{k}^{T} x + b_{k} |}{∥ ω_{k} ∥},

where

| . |

is the absolute value operation,

{∥ . ∥}_{p}

means the

L_{p}

-norm for

p > 0

, when

p = 2

,

{∥ . ∥}_{2}

is written as

∥ . ∥

for brevity.
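The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `assign_class`, the toy hyperplane values, and the tie-breaking choice (ties go to the positive class) are all assumptions made for the example.

```python
import math

def assign_class(x, w1, b1, w2, b2):
    """Return +1 if x is at least as close to hyperplane 1 as to hyperplane 2, else -1.

    Ties are resolved in favour of the positive class (an assumption of this sketch).
    """
    def distance(w, b):
        # Perpendicular distance |w^T x + b| / ||w||_2 from x to the hyperplane
        norm = math.sqrt(sum(wi * wi for wi in w))
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

    return 1 if distance(w1, b1) <= distance(w2, b2) else -1

# Hypothetical hyperplanes: x_1 = 1 for the positive class, x_1 = -1 for the negative class
w1, b1 = [1.0, 0.0], -1.0
w2, b2 = [1.0, 0.0], 1.0

label_a = assign_class([0.8, 3.0], w1, b1, w2, b2)   # closer to hyperplane 1
label_b = assign_class([-2.0, 0.5], w1, b1, w2, b2)  # closer to hyperplane 2
```

Because each class has its own hyperplane, the rule compares two distances rather than checking the sign of a single decision function as in a standard SVM.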

2.2. Manifold Regularization

In this subsection, we briefly review graph-based semi-supervised learning (SSL). Manifold regularization (MR) is one of the graph-based SSL methods, whose learning strategy is to mine the geometric distribution information of the data and represent it in the form of regularization terms. In [30], the authors point out that data distributions on manifolds are often complex and may exhibit nonlinear structures, and traditional methods may not be able to effectively capture their intrinsic structures and characteristics. Based on this, the authors propose a regularization method based on the Laplacian graph. On the basis of ensuring smoothness, the method maintains the Euclidean distance relationship of the original data sample as far as possible, enabling it to better reflect the distribution of data in the manifold space.

Consider a binary semi-supervised classification problem in the n-dimensional real space

R^{n}

. The set of training data is represented by

T = {(x_{1}, y_{1}), \dots, (x_{l}, y_{l}), x_{l + 1}, \dots, x_{l + u}}

, where

l + u = n

, dataset

X_{l} = {x_{i}}_{i = 1}^{l} \in R^{l \times n}

are the labeled data with corresponding labels

Y_{l} = {y_{i}}_{i = 1}^{l} \in {- 1, 1}

, and dataset

X_{u} = {x_{i}}_{i = l + 1}^{l + u} \in R^{u \times n}

are the unlabeled data with corresponding labels

Y_{u} = 0

, where

X = X_{l} \cup X_{u}

represents the whole dataset. We model

X

as a graph

G

,

W

is the adjacency matrix of graph

G

,

\begin{matrix} w_{i j} : = \begin{cases} exp (\frac{- ∥ x_{i} - x_{j} ∥^{2}}{2 σ^{2}}), & x_{i} \in N_{k} (x_{j}) \text{ or } x_{j} \in N_{k} (x_{i}), \\ 0, & \text{otherwise}, \end{cases} \end{matrix}

denotes the similarity between examples

x_{i}

and

x_{j}

, where

N_{k} (x_{j})

represents the k nearest neighbors of

x_{j}

. Based on the adjacency matrix

W

, the Laplacian matrix

L

of the graph

G

can be computed by

L = D - W

, where

D = \mathrm{diag} (\sum_{j = 1}^{n} W_{1 j}, \sum_{j = 1}^{n} W_{2 j}, \dots, \sum_{j = 1}^{n} W_{n j})

.
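The construction of W, D, and L described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the function name `knn_graph_laplacian`, the toy points, and the values of k and σ are hypothetical, and a practical implementation would typically use a numerical library and sparse matrices.

```python
import math

def knn_graph_laplacian(points, k, sigma):
    """Build the Gaussian-weighted kNN adjacency matrix W and the Laplacian L = D - W."""
    n = len(points)

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Indices of the k nearest neighbours of every point (the point itself is excluded)
    neighbours = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sq_dist(points[i], points[j]))
        neighbours.append(set(order[:k]))

    # w_ij is non-zero if x_i is a neighbour of x_j OR x_j is a neighbour of x_i
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and (i in neighbours[j] or j in neighbours[i]):
                W[i][j] = math.exp(-sq_dist(points[i], points[j]) / (2.0 * sigma ** 2))

    # D is diagonal with the row sums of W; L = D - W
    degrees = [sum(row) for row in W]
    L = [[(degrees[i] if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    return W, L

# Two well-separated toy clusters (hypothetical data)
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]]
W, L = knn_graph_laplacian(points, k=1, sigma=1.0)
```

The "or" in the neighbourhood rule makes W symmetric, and every row of L sums to zero by construction; with k = 1 no edge joins the two clusters in this toy example.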

In RKHS, the optimization of manifold regularization can be written as follows:

\begin{matrix} f^{*} = \arg\min_{f \in H} R^{emp} (f) + γ_{H} {∥ f ∥}_{H}^{2} + γ_{M} {∥ f ∥}_{M}^{2}, \end{matrix}

where

R^{emp} (f)

denotes the empirical risk on the labeled data

Y

, that is, the loss term.

γ_{H}

and

γ_{M}

are non-negative regularization parameters.

{∥ f ∥}_{H}^{2}

is the regularization term to prevent overfitting.

{∥ f ∥}_{M}^{2}

is the smoothness term, which can be expressed as:

\begin{matrix} {∥ f ∥}_{M}^{2} = \frac{1}{{(l + u)}^{2}} \sum_{i, j = 1}^{l + u} w_{i j} {(f (x_{i}) - f (x_{j}))}^{2} = f^{T} L f . \end{matrix}

(6)
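For readers who want to see why a pairwise sum turns into a quadratic form in the Laplacian, the following short derivation may help. It is a standard identity rather than a result of this paper, it assumes that W is symmetric (which the "or" rule in the definition of w_{ij} guarantees), and it writes f_{i} for f(x_{i}) and D_{ii} for the i-th diagonal entry of D. The constant factors it exposes can be absorbed into the regularization parameter γ_{M}.

```latex
\begin{aligned}
\sum_{i,j=1}^{l+u} w_{ij}\,(f_i - f_j)^2
  &= \sum_{i,j} w_{ij} f_i^2 + \sum_{i,j} w_{ij} f_j^2 - 2\sum_{i,j} w_{ij} f_i f_j \\
  &= 2\sum_{i} D_{ii} f_i^2 - 2\, f^{T} W f \\
  &= 2\, f^{T} (D - W)\, f \;=\; 2\, f^{T} L f .
\end{aligned}
```

Hence the smoothness term is proportional to f^{T} L f, and it is small exactly when f takes similar values on strongly connected samples.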

3. Main Contributions

In this section, we begin by outlining the key motivation behind our proposed model. We then present the model formulation and describe its components in detail. Finally, we provide a convergence analysis of the proposed model in Section 3.3.

3.1. Generalized Adaptive Robust Loss Function

To improve the robustness, classification performance, and generalization ability of the TBSVM framework, we propose a new robust loss function called the generalized adaptive robust loss function

V_{θ} (x)

. The

V_{θ} (x)

loss function is symmetric, bounded, and non-negative. The

V_{θ} (x)

is defined for any

x \in R

as follows:

\begin{matrix} V_{θ} (x) = \frac{c^{2}}{2} [1 - exp (- \frac{x^{2}}{2 c^{2}})]^{θ}, \end{matrix}

(7)

where

θ > 0

is the power parameter, and c is a trade-off parameter that penalizes outliers.

Remark 1.

When

θ = 1

, the

V_{θ} (x)

-Loss will degenerate into Welsch Loss [36]. That is, Welsch Loss is a special case of

V_{θ} (x)

-Loss.

Property 1.

V_{θ} (x)

is bounded, non-negative, symmetric, non-smooth, and non-convex. Moreover, its value is bounded above by the constant $c^{2} / 2$ and does not keep growing as $| x |$ increases, which limits the influence of large residuals and makes the loss function more robust.

In Figure 1, we compare the robustness of different loss functions, namely

L_{2}

-loss,

L_{1}

-loss, Welsch loss, and

V_{θ} (x)-loss

(

c = 1

), against outliers. As shown in the figure, the Welsch Loss with

θ

-power (red curves) is the most robust, highlighting its effectiveness in suppressing the impact of noisy outliers on the model performance. In Figure 2, we plot the loss curve of the Welsch Loss with

θ

-power under different values of the parameter

θ

. We observe that as

θ

decreases (from 4 to 2, 1, and

0.5

), the function becomes narrower while remaining symmetric and bounded, further demonstrating its suitability for handling noise and outliers.
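The properties discussed in this subsection can be checked numerically with a direct transcription of Eq. (7). This is a minimal sketch; the function names `v_theta` and `welsch` and the sample arguments are assumptions made for illustration.

```python
import math

def v_theta(x, theta=1.0, c=1.0):
    """Generalized adaptive robust loss of Eq. (7) for a scalar residual x."""
    return (c ** 2 / 2.0) * (1.0 - math.exp(-x ** 2 / (2.0 * c ** 2))) ** theta

def welsch(x, c=1.0):
    """Welsch loss, i.e. the theta = 1 special case named in Remark 1."""
    return (c ** 2 / 2.0) * (1.0 - math.exp(-x ** 2 / (2.0 * c ** 2)))

# A small residual is penalised mildly; a huge residual saturates at c^2 / 2
small = v_theta(0.5, theta=2.0)
large = v_theta(1e3, theta=2.0)

# Smaller theta gives a larger loss near the origin, i.e. a narrower valley
narrow = v_theta(0.5, theta=0.5)
wide = v_theta(0.5, theta=4.0)
```

Since the loss saturates, an arbitrarily large residual contributes at most c²/2 to the objective, which is the mechanism behind the robustness claim.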

3.2. Our Method

In this subsection, we present our model and provide an explanation of it. For the binary classification task, we aim to find a pair of optimal classification hyperplanes to separate the positive and negative samples. Specifically, we consider a pair of constrained optimization problems:

\begin{matrix} min_{ω_{1}, b_{1}, ξ_{1}} \sum_{i = 1}^{m_{1}} min (∥ ω_{1} x_{i} + b_{1} ∥_{2}^{p}, ε_{1}) + c_{1} \sum_{i = 1}^{m_{2}} [1 - exp (- \frac{ξ_{1, i}^{2}}{2 c^{2}})]^{θ} + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) + c_{5} f_{1}^{T} L f_{1} \\ \text{s.t.} \quad - (B ω_{1} + e_{2} b_{1}) + ξ_{1} \geq e_{2}, ξ_{1} \geq 0, \end{matrix}

(8)

and

\begin{matrix} min_{ω_{2}, b_{2}, ξ_{2}} \sum_{i = 1}^{m_{2}} min (∥ ω_{2} x_{i} + b_{2} ∥_{2}^{p}, ε_{2}) + c_{2} \sum_{i = 1}^{m_{1}} [1 - exp (- \frac{ξ_{2, i}^{2}}{2 c^{2}})]^{θ} + \frac{c_{4}}{2} (∥ ω_{2} ∥_{2}^{2} + b_{2}^{2}) + c_{6} f_{2}^{T} L f_{2} \\ \text{s.t.} \quad (A ω_{2} + e_{1} b_{2}) + ξ_{2} \geq e_{1}, ξ_{2} \geq 0, \end{matrix}

(9)

where,

c_{1}, c_{2}, c_{3}, c_{4}, c_{5}

, and

c_{6}

are positive regularization parameters, while c is an adjustment parameter that controls the degree of penalty for outliers. As stated in (6):

\begin{matrix} ∥ f_{1} ∥_{M}^{2} = \frac{1}{{(l + u)}^{2}} \sum_{i, j = 1}^{l + u} W_{i j} {(f_{1} (x_{i}) - f_{1} (x_{j}))}^{2} = f_{1}^{T} L f_{1} \end{matrix}

and

\begin{matrix} ∥ f_{2} ∥_{M}^{2} = \frac{1}{{(l + u)}^{2}} \sum_{i, j = 1}^{l + u} W_{i j} {(f_{2} (x_{i}) - f_{2} (x_{j}))}^{2} = f_{2}^{T} L f_{2} . \end{matrix}

where

L = D - W

refers to the Graph Laplacian matrix. D is a diagonal matrix associated with W, where the diagonal element is

D_{i i} = \sum_{j = 1}^{l + u} W_{i j}

. The vector

f_{1} = {[f_{1} (x_{1}), \dots, f_{1} (x_{l + u})]}^{T}

equals

M ω_{1} + e b_{1}

, while

f_{2} = {[f_{2} (x_{1}), \dots, f_{2} (x_{l + u})]}^{T}

equals

M ω_{2} + e b_{2}

, where

M \in R^{(l + u) \times n}

represents all training data, including labeled and unlabeled data, and e is a vector of ones of appropriate dimension. Thus, the primal problems (8) and (9) can be written as:

\begin{matrix} min_{ω_{1}, b_{1}, ξ_{1}} \sum_{i = 1}^{m_{1}} min (∥ ω_{1} x_{i} + b_{1} ∥_{2}^{p}, ε_{1}) + c_{1} \sum_{i = 1}^{m_{2}} [1 - exp (- \frac{ξ_{1, i}^{2}}{2 c^{2}})]^{θ} + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) + c_{5} (ω_{1}^{T} M^{T} + e^{T} b_{1}) L (M ω_{1} + e b_{1}) \\ \text{s.t.} \quad - (B ω_{1} + e_{2} b_{1}) + ξ_{1} \geq e_{2}, ξ_{1} \geq 0, \end{matrix}

(10)

and

\begin{matrix} min_{ω_{2}, b_{2}, ξ_{2}} \sum_{i = 1}^{m_{2}} min (∥ ω_{2} x_{i} + b_{2} ∥_{2}^{p}, ε_{2}) + c_{2} \sum_{i = 1}^{m_{1}} [1 - exp (- \frac{ξ_{2, i}^{2}}{2 c^{2}})]^{θ} + \frac{c_{4}}{2} (∥ ω_{2} ∥_{2}^{2} + b_{2}^{2}) + c_{6} (ω_{2}^{T} M^{T} + e^{T} b_{2}) L (M ω_{2} + e b_{2}) \\ \text{s.t.} \quad (A ω_{2} + e_{1} b_{2}) + ξ_{2} \geq e_{1}, ξ_{2} \geq 0 . \end{matrix}

(11)

Since the two problems are quite similar, we can solve one of them and obtain a solution for the other in a similar manner. For the purpose of illustration, let us consider solving (10) in two parts:

\begin{cases} P (ω_{1}, b_{1}) = \sum_{i = 1}^{m_{1}} min (∥ ω_{1} x_{i} + b_{1} ∥_{2}^{p}, ε_{1}) + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) + c_{5} (ω_{1}^{T} M^{T} + e^{T} b_{1}) L (M ω_{1} + e b_{1}), \\ R (ω_{1}, b_{1}) = c_{1} \sum_{i = 1}^{m_{2}} [1 - exp (- \frac{ξ_{1, i}^{2}}{2 c^{2}})]^{θ} . \end{cases}

(12)

Then, we can rewrite the Formula (10) as:

\begin{matrix} max_{ω_{1}, b_{1}, ξ_{1}} M (ω_{1}, b_{1}, ξ_{1}) = \bar{R} (ω_{1}, b_{1}) - P (ω_{1}, b_{1}), \end{matrix}

(13)

where

\bar{R} (ω_{1}, b_{1}) = c_{1} \sum_{i = 1}^{m_{2}} [exp (- \frac{ξ_{1, i}^{2}}{2 c^{2}})]^{θ}

. We define a convex function

\begin{matrix} g (v) = - v log (- v) + v, v < 0 . \end{matrix}

(14)

From the theory of conjugate functions, we obtain:

\begin{matrix} exp (- \frac{ξ_{1}^{2}}{2 c^{2}})^{θ} = sup_{v < 0} [v \frac{ξ_{1}^{2}}{2 c^{2}} - g (v)]^{θ}, v = - exp (- \frac{ξ_{1}^{2}}{2 c^{2}})^{θ} . \end{matrix}

(15)
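The conjugate representation behind this half-quadratic step can be checked numerically in its simplest setting. The sketch below covers only the case θ = 1 and writes z for ξ²/(2c²); in that case the supremum of v·z − g(v) over v < 0 equals exp(−z) and is attained at v = −exp(−z). The helper name `sup_over_grid`, the grid resolution, and the test value of z are assumptions made for illustration.

```python
import math

def g(v):
    """Convex function of Eq. (14), defined for v < 0."""
    return -v * math.log(-v) + v

def sup_over_grid(z, steps=20000):
    """Approximate sup_{v<0} [v*z - g(v)] on a uniform grid of v in [-1, 0).

    For z >= 0 the maximiser -exp(-z) lies in [-1, 0), so this grid suffices.
    """
    best = -math.inf
    for k in range(1, steps + 1):
        v = -k / steps
        best = max(best, v * z - g(v))
    return best

z = 0.7                          # stands for xi^2 / (2 c^2)
approx = sup_over_grid(z)        # numerical supremum
v_star = -math.exp(-z)           # closed-form maximiser
exact = v_star * z - g(v_star)   # value of the objective at the maximiser
```

Fixing v at its maximiser turns the exponential term into a quadratic in ξ, which is why the reformulated problems below contain a weighted quadratic term in the slack variables.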

Then, we obtain:

\begin{matrix} max_{ω_{1}, b_{1}, ξ_{1}} M (ω_{1}, b_{1}, ξ_{1}) = \sum_{i = 1}^{m_{2}} ([v_{i} \frac{ξ_{1, i}^{2}}{2 c^{2}} - g (v_{i})])^{θ} - P (ω_{1}, b_{1}) . \end{matrix}

(16)

Thus, (10) and (11) can be rewritten as:

\begin{matrix} min_{ω_{1}, b_{1}, ξ_{1}} \sum_{i = 1}^{m_{1}} min (∥ ω_{1} x_{i} + b_{1} ∥_{2}^{p}, ε_{1}) + \frac{c_{1}}{2 c^{2}} ξ_{1}^{T} Ω_{1} ξ_{1} + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) + c_{5} (ω_{1}^{T} M^{T} + e^{T} b_{1}) L (M ω_{1} + e b_{1}) \\ \text{s.t.} \quad - (B ω_{1} + e_{2} b_{1}) + ξ_{1} \geq e_{2}, ξ_{1} \geq 0, \end{matrix}

(17)

and

\begin{matrix} min_{ω_{2}, b_{2}, ξ_{2}} \sum_{i = 1}^{m_{2}} min (∥ ω_{2} x_{i} + b_{2} ∥_{2}^{p}, ε_{2}) + \frac{c_{2}}{2 c^{2}} ξ_{2}^{T} Ω_{2} ξ_{2} + \frac{c_{4}}{2} (∥ ω_{2} ∥_{2}^{2} + b_{2}^{2}) + c_{6} (ω_{2}^{T} M^{T} + e^{T} b_{2}) L (M ω_{2} + e b_{2}) \\ \text{s.t.} \quad (A ω_{2} + e_{1} b_{2}) + ξ_{2} \geq e_{1}, ξ_{2} \geq 0, \end{matrix}

(18)

where

Ω_{j} = \mathrm{diag} (- v_{j, i}^{s}, 0)

,

j = 1, 2

. To optimize the objective function smoothly, we introduce concave duality, as illustrated in Lemma 1 [37,38].

Lemma 1.

Let

g (θ) : R^{n} \to R

be a continuous nonconvex function, suppose

h (θ) : R^{n} \to Ξ

is a map with range Ξ. We assume that a concave function

\bar{g} (u)

exists defined on Ξ, such that

g (θ) = \bar{g} (h (θ))

holds.

Therefore, the nonconvex function

g (θ)

can be expressed as:

g (θ) = inf_{v \in R^{n}} [v^{T} h (θ) - g^{*} (v)] .

(19)

According to concave duality,

g^{*} (v)

is the concave dual of

\bar{g} (u)

given as:

g^{*} (v) = inf_{u \in Ξ} [v^{T} u - \bar{g} (u)] .

(20)

In addition, the infimum on the right-hand side of (19) is attained at:

v^{*} = \frac{\partial \bar{g} (u)}{\partial u} |_{u = h (θ)} .

(21)

Based on Lemma 1, we define the function

\bar{g} (θ) : R \to R

for any

θ > 0

as

\begin{matrix} \bar{g} (θ) = min (θ^{\frac{p}{2}}, ε) . \end{matrix}

(22)

Assuming that

h (μ) = μ^{2}

, we obtain

\begin{matrix} min (∥ ω x_{i} + b ∥_{2}^{p}, ε) = \bar{g} (h (μ)), \quad μ = ∥ ω x_{i} + b ∥_{2} . \end{matrix}

(23)
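The composition in Eqs. (22) and (23) can be illustrated with scalar residuals. This is a minimal sketch in which `g_bar` and `capped_l2p` are hypothetical names, the residual ω x_i + b is represented by a single number, and the values of p and ε are chosen only for the example.

```python
def g_bar(theta, p, eps):
    """Eq. (22): g_bar(theta) = min(theta^(p/2), eps) for theta >= 0."""
    return min(theta ** (p / 2.0), eps)

def capped_l2p(residual, p, eps):
    """Capped L_{2,p} distance min(|r|^p, eps), computed as g_bar(h(mu)) with h(mu) = mu^2."""
    mu = abs(residual)          # mu = ||w x_i + b||_2 for a scalar residual
    return g_bar(mu ** 2, p, eps)

inlier_p1 = capped_l2p(0.5, p=1.0, eps=2.0)    # capped L1 behaviour: |r|
outlier_p1 = capped_l2p(10.0, p=1.0, eps=2.0)  # capped at eps
inlier_p2 = capped_l2p(0.5, p=2.0, eps=2.0)    # capped L2 behaviour: r^2
```

Setting p = 1 or p = 2 recovers the capped L1 and capped L2 distances, while the cap ε bounds the contribution of any single sample regardless of how far it lies from the hyperplane.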

Based on (23), the first term of (17) and (18) can be rewritten as:

\begin{matrix} min_{ω_{1}, b_{1}, ξ_{1}} \sum_{i = 1}^{m_{1}} \bar{g} (∥ ω_{1} x_{i} + b_{1} ∥_{2}^{2}) + \frac{c_{1}}{2 c^{2}} ξ_{1}^{T} Ω_{1} ξ_{1} + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) + c_{5} (ω_{1}^{T} M^{T} + e^{T} b_{1}) L (M ω_{1} + e b_{1}) \\ \text{s.t.} \quad - (B ω_{1} + e_{2} b_{1}) + ξ_{1} \geq e_{2}, ξ_{1} \geq 0, \end{matrix}

(24)

and

\begin{matrix} min_{ω_{2}, b_{2}, ξ_{2}} \sum_{i = 1}^{m_{2}} \bar{g} (∥ ω_{2} x_{i} + b_{2} ∥_{2}^{2}) + \frac{c_{2}}{2 c^{2}} ξ_{2}^{T} Ω_{2} ξ_{2} + \frac{c_{4}}{2} (∥ ω_{2} ∥_{2}^{2} + b_{2}^{2}) + c_{6} (ω_{2}^{T} M^{T} + e^{T} b_{2}) L (M ω_{2} + e b_{2}) \\ \text{s.t.} \quad (A ω_{2} + e_{1} b_{2}) + ξ_{2} \geq e_{1}, ξ_{2} \geq 0 . \end{matrix}

(25)

Let

θ_{1} = h (μ_{1}) = {∥ ω_{1} x_{i} + b_{1} ∥}_{2}^{2}

. By Formula (19), the first term of (17) can be expressed as:

min (∥ ω_{1} x_{i} + b_{1} ∥_{2}^{p}, ε_{1}) = \bar{g} (∥ ω_{1} x_{i} + b_{1} ∥_{2}^{2}) = inf_{f_{i i} \geq 0} (f_{i i} h (μ_{1}) - g^{*} (f_{i i})) = inf_{f_{i i} \geq 0} f_{i i} θ_{1} - g^{*} (f_{i i}) .

(26)

Therefore, the concave dual function of

\bar{g} (θ_{1})

is given as:

g^{*} (f_{i i}) = inf_{θ_{1}} [f_{i i} θ_{1} - \bar{g} (θ_{1})] = inf_{θ_{1}} \begin{cases} f_{i i} θ_{1} - θ_{1}^{\frac{p}{2}}, & θ_{1}^{\frac{p}{2}} < ε_{1}, \\ f_{i i} θ_{1} - ε_{1}, & θ_{1}^{\frac{p}{2}} \geq ε_{1} . \end{cases}

(27)

Optimizing over

θ_{1}

in (27) gives:

g^{*} (f_{i i}) = \begin{cases} f_{i i} {(\frac{2}{p} f_{i i})}^{\frac{2}{p - 2}} - {(\frac{2}{p} f_{i i})}^{\frac{p}{p - 2}}, & θ_{1}^{\frac{p}{2}} < ε_{1}, \\ f_{i i} ε_{1}^{\frac{2}{p}} - ε_{1}, & θ_{1}^{\frac{p}{2}} \geq ε_{1} . \end{cases}

(28)

Finally, the first term of the objective function (17) can be further written as:

min (∥ ω_{1} x_{i} + b_{1} ∥_{2}^{p}, ε_{1}) = inf_{f_{i i} \geq 0} L_{i} (ω_{1}, b_{1}, f_{i i}, ε_{1}),

where

L_{i} (ω_{1}, b_{1}, f_{i i}, ε_{1}) = \begin{cases} f_{i i} θ_{1} - f_{i i} {(\frac{2}{p} f_{i i})}^{\frac{2}{p - 2}} + {(\frac{2}{p} f_{i i})}^{\frac{p}{p - 2}}, & θ_{1}^{\frac{p}{2}} < ε_{1}, \\ f_{i i} θ_{1} - f_{i i} ε_{1}^{\frac{2}{p}} + ε_{1}, & θ_{1}^{\frac{p}{2}} \geq ε_{1} . \end{cases}

(29)

Therefore, Formula (17) can be rewritten as:

\begin{matrix} min_{ω_{1}, b_{1}} \sum_{i = 1}^{m_{1}} min (∥ ω_{1} x_{i} + b_{1} ∥_{2}^{p}, ε_{1}) + \frac{c_{1}}{2 c^{2}} ξ_{1}^{T} Ω_{1} ξ_{1} + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) + c_{5} (ω_{1}^{T} M^{T} + e^{T} b_{1}) L (M ω_{1} + e b_{1}) \\ ⟺ \\ min_{ω_{1}, b_{1}} \sum_{i = 1}^{m_{1}} inf_{f_{i i} \geq 0} L_{i} (ω_{1}, b_{1}, f_{i i}, ε_{1}) + \frac{c_{1}}{2 c^{2}} ξ_{1}^{T} Ω_{1} ξ_{1} + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) + c_{5} (ω_{1}^{T} M^{T} + e^{T} b_{1}) L (M ω_{1} + e b_{1}) \\ ⟺ \\ min_{ω_{1}, b_{1}, f_{i i} \geq 0} \sum_{i = 1}^{m_{1}} L_{i} (ω_{1}, b_{1}, f_{i i}, ε_{1}) + \frac{c_{1}}{2 c^{2}} ξ_{1}^{T} Ω_{1} ξ_{1} + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) + c_{5} (ω_{1}^{T} M^{T} + e^{T} b_{1}) L (M ω_{1} + e b_{1}) . \end{matrix}

(30)

Similarly, Formula (18) can be rewritten as:

\begin{matrix} min_{ω_{2}, b_{2}} \sum_{i = 1}^{m_{2}} min (∥ ω_{2} x_{i} + b_{2} ∥_{2}^{p}, ε_{2}) + \frac{c_{2}}{2 c^{2}} ξ_{2}^{T} Ω_{2} ξ_{2} + \frac{c_{4}}{2} (∥ ω_{2} ∥_{2}^{2} + b_{2}^{2}) + c_{6} (ω_{2}^{T} M^{T} + e^{T} b_{2}) L (M ω_{2} + e b_{2}) \\ ⟺ \\ min_{ω_{2}, b_{2}} \sum_{i = 1}^{m_{2}} inf_{d_{i i} \geq 0} L_{i} (ω_{2}, b_{2}, d_{i i}, ε_{2}) + \frac{c_{2}}{2 c^{2}} ξ_{2}^{T} Ω_{2} ξ_{2} + \frac{c_{4}}{2} (∥ ω_{2} ∥_{2}^{2} + b_{2}^{2}) + c_{6} (ω_{2}^{T} M^{T} + e^{T} b_{2}) L (M ω_{2} + e b_{2}) \\ ⟺ \\ min_{ω_{2}, b_{2}, d_{i i} \geq 0} \sum_{i = 1}^{m_{2}} L_{i} (ω_{2}, b_{2}, d_{i i}, ε_{2}) + \frac{c_{2}}{2 c^{2}} ξ_{2}^{T} Ω_{2} ξ_{2} + \frac{c_{4}}{2} (∥ ω_{2} ∥_{2}^{2} + b_{2}^{2}) + c_{6} (ω_{2}^{T} M^{T} + e^{T} b_{2}) L (M ω_{2} + e b_{2}) . \end{matrix}

(31)

The objective functions (30) and (31) are solved by learning optimal classifiers through alternating optimization algorithms. We calculate the gradient of the function

\bar{g} (θ)

with respect to

θ

, expressed as:

\frac{\partial \bar{g} (θ)}{\partial θ} = \{\begin{matrix} \frac{p}{2} θ^{\frac{p}{2} - 1}, 0 < θ < ε^{\frac{2}{p}}, \\ 0, θ > ε^{\frac{2}{p}} . \end{matrix}

(32)

If

θ_{1} = h (μ_{1}) = {∥ ω_{1} x_{i} + b_{1} ∥}_{2}^{2}

, we fix

ω_{1}

and

b_{1}

:

f_{i i} = \frac{\partial \bar{g} (θ_{1})}{\partial θ_{1}} |_{θ_{1} = {∥ ω_{1} x_{i} + b_{1} ∥}_{2}^{2}} = \{\begin{matrix} \frac{p}{2} ∥ ω_{1} x_{i} + b_{1} ∥_{2}^{p - 2}, 0 < {∥ ω_{1} x_{i} + b_{1} ∥}_{2}^{p} < ε_{1}, \\ 0, else . \end{matrix}

(33)

Similarly, if

θ_{2} = h (μ_{2}) = {∥ ω_{2} x_{i} + b_{2} ∥}_{2}^{2}

, we fix

ω_{2}

and

b_{2}

:

d_{i i} = \frac{\partial \bar{g} (θ_{2})}{\partial θ_{2}} |_{θ_{2} = {∥ ω_{2} x_{i} + b_{2} ∥}_{2}^{2}} = \{\begin{matrix} \frac{p}{2} ∥ ω_{2} x_{i} + b_{2} ∥_{2}^{p - 2}, 0 < {∥ ω_{2} x_{i} + b_{2} ∥}_{2}^{p} < ε_{3}, \\ 0, else . \end{matrix}

(34)
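The weight update in (33) and (34) can be illustrated with a minimal sketch. It assumes that the residual norm of each sample (the value of the norm of ω x_i + b) has already been computed; the function names `capped_weight` and `update_weights` are hypothetical and do not come from the original implementation.

```python
from typing import List


def capped_weight(residual_norm: float, p: float, eps: float) -> float:
    """Weight from (33)/(34): (p/2) * r**(p-2) if 0 < r**p < eps, otherwise 0."""
    if residual_norm > 0.0 and residual_norm ** p < eps:
        return 0.5 * p * residual_norm ** (p - 2.0)
    # Samples at or beyond the cap (and exact zeros) receive weight 0.
    return 0.0


def update_weights(residual_norms: List[float], p: float, eps: float) -> List[float]:
    """Diagonal entries of F (or D) for a list of residual norms."""
    return [capped_weight(abs(r), p, eps) for r in residual_norms]


if __name__ == "__main__":
    # With p = 1 and cap eps = 4, the sample with residual norm 10 is cut off.
    print(update_weights([0.5, 1.0, 10.0], p=1.0, eps=4.0))
```

A sample whose capped distance reaches the threshold receives weight zero, so it no longer contributes to the weighted quadratic term in the next subproblem.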

To understand the relationship between parameters more clearly, we set the distance from sample

x_{i}

to the hyperplane as X. If

X > ε_{1}

and

f_{i i}

almost equals 0, then the sample

x_{i}

is considered an outlier and is discarded. Furthermore,

d_{i i}

is similar to

f_{i i}

. When the variables

f_{i i}

and

d_{i i}

are fixed to solve the classifier-related parameters

ω_{1}

,

ω_{2}

,

b_{1}

, and

b_{2}

, the optimization problems (30) and (31) can be written as:

min_{ω_{1}, b_{1}} \sum_{i = 1}^{m_{1}} f_{i i} (∥ ω_{1} x_{i} + b_{1} ∥_{2}^{2}) + \frac{c_{1}}{2 c^{2}} ξ_{1}^{T} Ω_{1} ξ_{1} + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) + c_{5} (ω_{1}^{T} M^{T} + e^{T} b_{1}) L (ω_{1} M + e b_{1})

(35)

and

min_{ω_{2}, b_{2}} \sum_{i = 1}^{m_{2}} d_{i i} (∥ ω_{2} x_{i} + b_{2} ∥_{2}^{2}) + \frac{c_{2}}{2 c^{2}} ξ_{2}^{T} Ω_{2} ξ_{2} + \frac{c_{4}}{2} (∥ ω_{2} ∥_{2}^{2} + b_{2}^{2}) + c_{6} (ω_{2}^{T} M^{T} + e^{T} b_{2}) L (ω_{2} M + e b_{2})

(36)

Let

F = d i a g (f_{11}, \dots, f_{m_{1}, m_{1}})

be an

m_{1} \times m_{1}

diagonal matrix, and

D = d i a g (d_{11}, \dots,

d_{m_{2}, m_{2}})

be an

m_{2} \times m_{2}

diagonal matrix. The optimization problems (35) and (36) can be rewritten as:

\begin{matrix} min_{ω_{1}, b_{1}, ξ_{1}} {(A ω_{1} + e_{1} b_{1})}^{T} F (A ω_{1} + e_{1} b_{1}) + \frac{c_{1}}{2 c^{2}} ξ_{1}^{T} Ω_{1} ξ_{1} + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) + c_{5} (ω_{1}^{T} M^{T} + e^{T} b_{1}) L (ω_{1} M + e b_{1}) \\ s . t . - (B ω_{1} + e_{2} b_{1}) + ξ_{1} \geq e_{2}, \end{matrix}

(37)

and

\begin{matrix} min_{ω_{2}, b_{2}, ξ_{2}} {(B ω_{2} + e_{2} b_{2})}^{T} D (B ω_{2} + e_{2} b_{2}) + \frac{c_{2}}{2 c^{2}} ξ_{2}^{T} Ω_{2} ξ_{2} + \frac{c_{4}}{2} (∥ ω_{2} ∥_{2}^{2} + b_{2}^{2}) + c_{6} (ω_{2}^{T} M^{T} + e^{T} b_{2}) L (ω_{2} M + e b_{2}) \\ s . t . (A ω_{2} + e_{1} b_{2}) + ξ_{2} \geq e_{1} . \end{matrix}

(38)

The Lagrange function of the optimization problem (37) can be written as:

\begin{matrix} L (ω_{1}, b_{1}, ξ_{1}, α) = \frac{1}{2} {(A ω_{1} + e_{1} b_{1})}^{T} F (A ω_{1} + e_{1} b_{1}) + \frac{c_{1}}{2 c^{2}} ξ_{1}^{T} Ω_{1} ξ_{1} + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) \\ + c_{5} (ω_{1}^{T} M^{T} + e^{T} b_{1}) L (ω_{1} M + e b_{1}) - α^{T} (- (B ω_{1} + e_{2} b_{1}) + ξ_{1} - e_{2}), \end{matrix}

(39)

where

α

is the vector of Lagrange multipliers. Differentiating the Lagrange function with respect to

ω_{1}

,

b_{1}

, and

ξ_{1}

and setting the derivatives to zero, we obtain the following Karush–Kuhn–Tucker (KKT) conditions:

\{\begin{matrix} \frac{\partial L}{\partial ω_{1}} = A^{T} F (A ω_{1} + e_{1} b_{1}) + c_{3} ω_{1} + c_{5} M^{T} L (M ω_{1} + e b_{1}) + B^{T} α = 0, \\ \frac{\partial L}{\partial b_{1}} = e_{1}^{T} F (A ω_{1} + e_{1} b_{1}) + c_{3} b_{1} + c_{5} e^{T} L (M ω_{1} + e b_{1}) + e_{2}^{T} α = 0, \\ \frac{\partial L}{\partial ξ_{1}} = c_{1} Ω_{1} ξ_{1} - α = 0, \\ α^{T} (- (B ω_{1} + e_{2} b_{1}) + ξ_{1} - e_{2}) = 0, \\ α \geq 0 . \end{matrix}

(40)

Let

H = [\begin{matrix} A & e_{1} \end{matrix}], E = [\begin{matrix} B & e_{2} \end{matrix}], Z = [\begin{matrix} M & e \end{matrix}] and {\bar{θ}}_{1} = [\begin{matrix} ω_{1} \\ b_{1} \end{matrix}] .

(41)

Thus, we have

[\begin{matrix} A^{T} \\ e_{1}^{T} \end{matrix}] F [\begin{matrix} A & e_{1} \end{matrix}] [\begin{matrix} ω_{1} \\ b_{1} \end{matrix}] + c_{3} [\begin{matrix} ω_{1} \\ b_{1} \end{matrix}] + c_{5} [\begin{matrix} M^{T} \\ e^{T} \end{matrix}] L [\begin{matrix} M & e \end{matrix}] [\begin{matrix} ω_{1} \\ b_{1} \end{matrix}] + [\begin{matrix} B^{T} \\ e_{2}^{T} \end{matrix}] α = 0 .

(42)

Further, we can get

(H^{T} F H + c_{3} I + c_{5} Z^{T} L Z) {\bar{θ}}_{1} + E^{T} α = 0,

(43)

where I is an identity matrix of appropriate dimensions. According to matrix theory, it can be easily proved that

H^{T} F H + c_{3} I + c_{5} Z^{T} L Z

is a positive definite matrix. Therefore, we have

{\bar{θ}}_{1} = {[ω_{1}, b_{1}]}^{T} = - {(H^{T} F H + c_{3} I + c_{5} Z^{T} L Z)}^{- 1} E^{T} α .

(44)

Furthermore, we can obtain the dual problem of (8) as follows:

\begin{matrix} min_{α} \frac{1}{2} α^{T} (E {(H^{T} F H + c_{3} I + c_{5} Z^{T} L Z)}^{- 1} E^{T} + c_{1} Ω_{1}^{- 1}) α - e_{2}^{T} α \\ s . t . 0 \leq α \leq c_{1} e_{2} . \end{matrix}

(45)

Similarly, the dual problem of (9) can be written as:

\begin{matrix} min_{β} \frac{1}{2} β^{T} (H {(E^{T} D E + c_{4} I + c_{6} Z^{T} L Z)}^{- 1} H^{T} + c_{2} Ω_{2}^{- 1}) β - e_{1}^{T} β \\ s . t . 0 \leq β \leq c_{2} e_{1}, \end{matrix}

(46)

where

β

is the Lagrange multiplier and the augmented vector

{\bar{θ}}_{2} = {[ω_{2}, b_{2}]}^{T} = {(E^{T} D E + c_{4} I + c_{6} Z^{T} L Z)}^{- 1} H^{T} β .

(47)
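The dual problems (45) and (46) are box-constrained quadratic programs of the form: minimize one half of α^T Q α minus the sum of the entries of α, subject to 0 ≤ α ≤ c. As a rough illustration only, the sketch below solves such a problem by projected gradient descent. This solver is a stand-in chosen for brevity, not the method used in the original experiments; the name `solve_box_qp` and the dense list-of-lists representation of Q are assumptions, and Q is assumed symmetric, positive semidefinite, and nonzero.

```python
from typing import List


def solve_box_qp(Q: List[List[float]], c: float, iters: int = 2000) -> List[float]:
    """Minimise 0.5 * a^T Q a - sum(a) subject to 0 <= a_i <= c (projected gradient)."""
    n = len(Q)
    # The maximum absolute row sum bounds the largest eigenvalue of symmetric Q,
    # so 1 / bound is a safe step size.
    bound = max(sum(abs(v) for v in row) for row in Q)
    step = 1.0 / bound
    a = [0.0] * n
    for _ in range(iters):
        grad = [sum(Q[i][j] * a[j] for j in range(n)) - 1.0 for i in range(n)]
        # Gradient step followed by projection onto the box [0, c]^n.
        a = [min(c, max(0.0, a[i] - step * grad[i])) for i in range(n)]
    return a


if __name__ == "__main__":
    # Hypothetical 2 x 2 example: the second multiplier hits the upper bound c.
    print(solve_box_qp([[2.0, 0.0], [0.0, 0.5]], c=1.5))
```

Once α is available, the augmented vector follows from (44) by one linear solve; the same pattern applies to β and (47).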

Once vectors

{\bar{θ}}_{1}

and

{\bar{θ}}_{2}

are obtained, a new data point

x \in R^{n}

is then assigned to the positive or negative class, depending on which of the two hyperplanes it lies closest to, i.e.,

f (x) = arg min_{k = 1, 2} \frac{| x ω_{k} + b_{k} |}{∥ ω_{k} ∥},

where

| . |

is the absolute value operation,

{∥ . ∥}_{p}

denotes the

L_{p}

-norm for

p > 0

; when

p = 2

,

{∥ . ∥}_{2}

is written as

∥ . ∥

for brevity.
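The decision rule above can be sketched as follows. The function names are hypothetical, each hyperplane is passed as a pair (ω, b), and ties are broken in favor of the positive class, which is an assumption not stated in the text.

```python
import math
from typing import Sequence, Tuple

Plane = Tuple[Sequence[float], float]


def distance_to_plane(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Perpendicular distance |x w + b| / ||w||_2 from x to the hyperplane (w, b)."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(xi * wi for xi, wi in zip(x, w)) + b) / norm_w


def predict(x: Sequence[float], plane_pos: Plane, plane_neg: Plane) -> int:
    """Return +1 if x is closer to the positive-class plane, otherwise -1."""
    d_pos = distance_to_plane(x, plane_pos[0], plane_pos[1])
    d_neg = distance_to_plane(x, plane_neg[0], plane_neg[1])
    return 1 if d_pos <= d_neg else -1


if __name__ == "__main__":
    # Hypothetical planes x_1 = 0 (positive class) and x_1 = 2 (negative class).
    print(predict([0.1, 5.0], ([1.0, 0.0], 0.0), ([1.0, 0.0], -2.0)))
```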

Based on the above discussion, our algorithm is presented in Algorithm 1.

Algorithm 1 Solving WMRTBSVM

Input: Data matrices $A \in R^{m_{1} \times n}$ and $B \in R^{m_{2} \times n}$ ; Parameters $c_{i}, (i = 1, 2, 3, 4, 5, 6)$ , cut off level $ε_{i}, (i = 1, 2, 3, 4)$ .
Output: $θ_{1}^{*}$ and $θ_{2}^{*}$ are the optimal values for $θ_{1}$ and $θ_{2}$ .
Process:
1. Initialize $F \in R^{m_{1} \times m_{1}}$ and $Ω_{1} \in R^{m_{1} \times m_{1}}$ ; $D \in R^{m_{2} \times m_{2}}$ and $Ω_{2} \in R^{m_{2} \times m_{2}}$ .
2. Compute $α$ and $β$ by solving the dual problems (45) and (46).
3. Get $θ_{1}$ and $θ_{2}$ by
$θ_{1} = - {(H^{T} F H + c_{3} I + c_{5} Z^{T} L Z)}^{- 1} E^{T} α$
and
$θ_{2} = {(E^{T} D E + c_{4} I + c_{6} Z^{T} L Z)}^{- 1} H^{T} β$ .
4. Update the matrices $F$ and $D$ by (33) and (34), and $Ω_{1}$ and $Ω_{2}$ by (24) and (25); repeat steps 2–4 until convergence.

To improve the computational efficiency of WMRTBSVM, we further propose its least squares version, WMLSRTBSVM:

\begin{matrix} min_{ω_{1}, b_{1}} \sum_{i = 1}^{m_{1}} min (∥ ω_{1} x_{i} + b_{1} ∥_{2}^{p}, ε_{1}) + c_{1} \sum_{i = 1}^{m_{2}} {[1 - exp (- \frac{ξ_{1, i}^{2}}{2 c^{2}})]}^{θ} + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) + c_{5} f_{1}^{T} L f_{1} \\ s . t - (B ω_{1} + e_{2} b_{1}) + ξ_{1} = e_{2}, ξ_{1} \geq 0, \end{matrix}

(48)

and

\begin{matrix} min_{ω_{2}, b_{2}} \sum_{i = 1}^{m_{2}} min (∥ ω_{2} x_{i} + b_{2} ∥_{2}^{p}, ε_{3}) + c_{2} \sum_{i = 1}^{m_{1}} {[1 - exp (- \frac{ξ_{2, i}^{2}}{2 c^{2}})]}^{θ} + \frac{c_{4}}{2} (∥ ω_{2} ∥_{2}^{2} + b_{2}^{2}) + c_{6} f_{2}^{T} L f_{2} \\ s . t . (A ω_{2} + e_{1} b_{2}) + ξ_{2} = e_{1}, ξ_{2} \geq 0 . \end{matrix}

(49)

Like (37) and (38) in WMRTBSVM, (48) and (49) can be rewritten as follows:

\begin{matrix} min_{ω_{1}, b_{1}} {(A ω_{1} + e_{1} b_{1})}^{T} F (A ω_{1} + e_{1} b_{1}) + \frac{c_{1}}{2 c^{2}} ξ_{1}^{T} Ω_{1} ξ_{1} + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) + c_{5} (ω_{1}^{T} M^{T} + e^{T} b_{1}) L (ω_{1} M + e b_{1}) \\ s . t . - (B ω_{1} + e_{2} b_{1}) + ξ_{1} = e_{2}, \end{matrix}

(50)

and

\begin{matrix} min_{ω_{2}, b_{2}} {(B ω_{2} + e_{2} b_{2})}^{T} D (B ω_{2} + e_{2} b_{2}) + \frac{c_{2}}{2 c^{2}} ξ_{2}^{T} Ω_{2} ξ_{2} + \frac{c_{4}}{2} (∥ ω_{2} ∥_{2}^{2} + b_{2}^{2}) + c_{6} (ω_{2}^{T} M^{T} + e^{T} b_{2}) L (ω_{2} M + e b_{2}) \\ s . t . (A ω_{2} + e_{1} b_{2}) + ξ_{2} = e_{1} . \end{matrix}

(51)

By bringing the equality constraint into the objective function,

\begin{matrix} min_{ω_{1}, b_{1}} {(A ω_{1} + e_{1} b_{1})}^{T} F (A ω_{1} + e_{1} b_{1}) + \frac{c_{1}}{2 c^{2}} ∥ e_{2} + B ω_{1} + e_{2} b_{1} ∥_{2}^{2} \\ + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) + c_{5} (ω_{1}^{T} M^{T} + e^{T} b_{1}) L (ω_{1} M + e b_{1}) \end{matrix}

(52a)

and

\begin{matrix} min_{ω_{2}, b_{2}} {(B ω_{2} + e_{2} b_{2})}^{T} D (B ω_{2} + e_{2} b_{2}) + \frac{c_{2}}{2 c^{2}} ∥ e_{1} - A ω_{2} - e_{1} b_{2} ∥_{2}^{2} \\ + \frac{c_{4}}{2} (∥ ω_{2} ∥_{2}^{2} + b_{2}^{2}) + c_{6} (ω_{2}^{T} M^{T} + e^{T} b_{2}) L (ω_{2} M + e b_{2}) \end{matrix}

(52b)

The solutions of (52a) and (52b) can be expressed as:

\begin{matrix} {\bar{θ}}_{1} = - {(\frac{2 c^{2}}{c_{1}} H^{T} F H + E^{T} Ω_{1} E + \frac{c_{3}}{c_{1}} I + c_{5} Z^{T} L Z)}^{- 1} E^{T} Ω_{1} e_{2}, \\ {\bar{θ}}_{2} = - {(\frac{2 c^{2}}{c_{2}} E^{T} D E + H^{T} Ω_{2} H + \frac{c_{4}}{c_{2}} I + c_{6} Z^{T} L Z)}^{- 1} H^{T} Ω_{2} e_{1}, \end{matrix}

(53)

where H, F, Z,

\bar{θ}

, E, and D are the same as those of WMRTBSVM.

Once vectors

{\bar{θ}}_{1}

and

{\bar{θ}}_{2}

are obtained, a new data point

x \in R^{n}

is then assigned to the positive or negative class, depending on which of the two hyperplanes it lies closest to, i.e.,

f (x) = arg min_{k = 1, 2} \frac{| x ω_{k} + b_{k} |}{∥ ω_{k} ∥},

where

| . |

is the absolute value operation;

{∥ . ∥}_{p}

denotes the

L_{p}

-norm for

p > 0

; when

p = 2

,

{∥ . ∥}_{2}

is written as

∥ . ∥

for brevity. Based on the above discussion, our algorithm is presented in Algorithm 2.

Algorithm 2 Solving WMLSRTBSVM

Input: Data matrices $A \in R^{m_{1} \times n}$ and $B \in R^{m_{2} \times n}$ ; Parameters $c_{i}, (i = 1, 2, 3, 4, 5, 6)$ , cut off level $ε_{i}, (i = 1, 2, 3, 4)$ .
Output: $θ_{1}^{*}$ and $θ_{2}^{*}$ are the optimal values for $θ_{1}$ and $θ_{2}$ .
Process:
1. Initialize $F \in R^{m_{1} \times m_{1}}$ and $Ω_{1} \in R^{m_{1} \times m_{1}}$ ; $D \in R^{m_{2} \times m_{2}}$ and $Ω_{2} \in R^{m_{2} \times m_{2}}$ .
2. Substitute the equality constraints into the objectives to obtain the unconstrained problems (52a) and (52b).
3. Get $θ_{1}$ and $θ_{2}$ by
$θ_{1} = - {(\frac{2 c^{2}}{c_{1}} H^{T} F H + E^{T} Ω_{1} E + \frac{c_{3}}{c_{1}} I + c_{5} Z^{T} L Z)}^{- 1} E^{T} Ω_{1} e_{2}$ ,
and
$θ_{2} = - {(\frac{2 c^{2}}{c_{2}} E^{T} D E + H^{T} Ω_{2} H + \frac{c_{4}}{c_{2}} I + c_{6} Z^{T} L Z)}^{- 1} H^{T} Ω_{2} e_{1}$ .
4. Update the matrices $F$ and $D$ by (33) and (34), and $Ω_{1}$ and $Ω_{2}$ by (24) and (25); repeat steps 2–4 until convergence.
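Step 3 of Algorithm 2 amounts to one linear solve per hyperplane. The sketch below evaluates the first expression in (53) under simplifying assumptions that are not part of the original method: the manifold term is dropped (c_5 = 0), the diagonal matrices F and Ω_1 are passed as plain lists of their diagonal entries, and the linear system is solved with a small Gaussian elimination routine instead of an explicit inverse. All function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def solve_linear(M: Matrix, rhs: List[float]) -> List[float]:
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def weighted_gram(X: Matrix, weights: List[float]) -> Matrix:
    """Return X^T diag(weights) X for a list-of-rows matrix X."""
    d = len(X[0])
    return [[sum(w * row[i] * row[j] for row, w in zip(X, weights))
             for j in range(d)] for i in range(d)]


def ls_plane(A: Matrix, B: Matrix, f: List[float], omega: List[float],
             c: float, c1: float, c3: float) -> List[float]:
    """First plane of (53) with the manifold term omitted; returns [w_1, ..., w_n, b]."""
    H = [row + [1.0] for row in A]  # H = [A e_1]
    E = [row + [1.0] for row in B]  # E = [B e_2]
    d = len(H[0])
    HFH = weighted_gram(H, f)
    EOE = weighted_gram(E, omega)
    M = [[(2.0 * c * c / c1) * HFH[i][j] + EOE[i][j] + (c3 / c1 if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    # Right-hand side -E^T Omega_1 e_2.
    rhs = [-sum(w * row[i] for row, w in zip(E, omega)) for i in range(d)]
    return solve_linear(M, rhs)


if __name__ == "__main__":
    # Hypothetical 1-D data: class A near 0, class B near 2.
    print(ls_plane([[0.0], [0.1], [-0.1]], [[2.0], [2.1]],
                   f=[1.0, 1.0, 1.0], omega=[1.0, 1.0], c=1.0, c1=1.0, c3=0.01))
```

In this toy example the resulting plane passes close to the class-A points and evaluates to roughly -1 at the class-B points, which is what the equality constraint in (50) asks for.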

3.3. Convergence Analysis

In this subsection, we prove the convergence of the proposed algorithms (see Appendix A).

3.4. Complexity Analysis

In this section, we briefly analyze the complexity of our proposed Algorithms 1 and 2. The computational complexity is mainly determined by matrix multiplication and matrix inversion. In Algorithms 1 and 2, we assume that the dataset lies in

R^{m \times n}

, where there are

m_{1}

and

m_{2}

positive and negative samples, respectively, and

A \in R^{m_{1} \times n}

and

B \in R^{m_{2} \times n}

.

In (44) and (47),

{\bar{θ}}_{1} = {[ω_{1}, b_{1}]}^{T} = - {(H^{T} F H + c_{3} I + c_{5} Z^{T} L Z)}^{- 1} E^{T} α

and

{\bar{θ}}_{2} = {[ω_{2}, b_{2}]}^{T} = {(E^{T} D E + c_{4} I + c_{6} Z^{T} L Z)}^{- 1} H^{T} β

. The computational costs of matrix multiplication are both

O (m \times {(n)}^{2})

, while the computational cost of matrix inversion is

O ({(n)}^{3})

. Therefore, the upper bound of the total computational cost of Algorithm 1 is

O (2 T (m \times {(n)}^{2} + {(n)}^{3}))

, where T is the number of iterations, which is usually less than 10 for algorithms similar to ours. In addition, in our experiments, the number of samples m is generally much larger than the sample dimension n, so the total computational cost of Algorithm 1 is

O (2 T (m \times {(n)}^{2}))

.

In (53), the computational costs of matrix multiplication are

O (m_{1} \times {(n)}^{2})

and

O (m_{2} \times {(n)}^{2})

, respectively, and the computational cost of matrix inversion is

O ({(n)}^{3})

. Therefore, the upper bound of the total computational cost of Algorithm 2 is

O ((m \times {(n)}^{2} + {(n)}^{3}))

, where

m > n

. Consequently, the total computational cost of this algorithm is

O ((m \times {(n)}^{2}))

.

4. Experimental Results and Analysis

In this section, we test the performance of our proposed model. For a fair comparison, we implemented six classification algorithms in MATLAB R2021a. The experimental environment consisted of a Windows 11 machine (CPU: Intel Core i5; RAM: 16.00 GB; OS: 64-bit Windows 11).

4.1. Experimental Setting

To evaluate the validity and reliability of our proposed models, we compared WMRTBSVM and WMLSRTBSVM with related methods, including twin support vector machine (TSVM), twin bounded support vector machine (TBSVM), least squares twin bounded support vector machine (LSTBSVM), and CTSVM. Furthermore, the conventional accuracy (

A C C

) was used to measure the classification performance of all algorithms, which is defined as follows:

\begin{matrix} A C C = \frac{T P + T N}{T P + F N + T N + F P}, \end{matrix}

(54)

where TP and TN denote the numbers of true positives and true negatives, respectively, and FP and FN denote the numbers of false positives and false negatives, respectively. The higher the ACC value, the better the model performs.
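As a small illustration of (54), the sketch below counts the four outcomes from label lists and computes ACC. The function names and the label convention (+1 for positive, -1 for negative) are assumptions made for this example.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for labels in {+1, -1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """ACC = (TP + TN) / (TP + FN + TN + FP), as in (54)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples to evaluate")
    return (tp + tn) / total


if __name__ == "__main__":
    counts = confusion_counts([1, 1, -1, -1], [1, -1, -1, 1])
    print(counts, accuracy(*counts))
```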

In the experiment, data preprocessing is carried out first. We divided each dataset into a training set and a test set, and all sample data were normalized to reduce the differences in feature scales among samples. To reduce the randomness of the test results, the experimental parameters were selected by 10-fold cross-validation, each dataset was tested 10 times, and the classification accuracy was averaged over the 10 runs. To obtain the best generalization ability, the parameters involved in the experiment were selected as follows:

The value range of the

c_{i} (i = 1, 2, \dots, 6)

is

{2^{i} | i = - 7, - 6, \dots, 6, 7}

,

ε_{i} (i = 1, 2, 3, 4)

=

10^{- 5}

, and

σ

and

ε

take values in

{10^{i} | i = - 7, - 6, \dots, 6, 7}

.
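The candidate values described above can be generated as follows. This is only a sketch of the grids themselves; the variable names are hypothetical, and how the six penalty parameters were tied or searched jointly is not specified here, so the helper merely reports how large an exhaustive grid would be.

```python
# Candidate values for each c_i: 2^-7, 2^-6, ..., 2^7.
C_GRID = [2.0 ** i for i in range(-7, 8)]

# Candidate values for sigma and epsilon: 10^-7, 10^-6, ..., 10^7.
POWER10_GRID = [10.0 ** i for i in range(-7, 8)]

# Fixed cut-off level used for every epsilon_i.
EPS_CUTOFF = 1e-5


def exhaustive_grid_size(n_params: int, grid_len: int = len(C_GRID)) -> int:
    """Number of combinations if n_params parameters are searched independently."""
    return grid_len ** n_params


if __name__ == "__main__":
    print(len(C_GRID), C_GRID[0], C_GRID[-1])
    print(exhaustive_grid_size(6))
```

Each grid contains 15 values, so an independent search over all six c_i would involve 15^6 combinations; in practice some parameters are often tied to keep the search tractable, which is a common convention rather than something stated in the text.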

4.2. General Experimental Results

In order to verify the classification performance of the proposed methods and other related algorithms in a noise-free setting, we ran them on twelve datasets from the UCI Machine Learning Repository. We split each dataset into a training set and a testing set with a sample ratio of 7:3. That is, in each experiment, we randomly selected 70% of the points of both classes as the training set and used the rest as the testing set. In addition, we used grid search with 10-fold cross-validation to find the optimal parameters. The process was repeated 10 times. The general experimental results are shown in Table 1, with the best results for each testing set shown in bold. Here, ACC is the average classification accuracy on the testing set, and “time (s)” represents the average running time in seconds obtained by each algorithm with the optimal parameters.
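The 7:3 split described above, drawn separately from each class, can be sketched as follows. The function name, the fixed seed, and the rounding of the per-class training size are assumptions; the original experiments were run in MATLAB and may differ in these details.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(labels: Sequence[int], train_ratio: float = 0.7,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), sampling train_ratio of each class."""
    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_ratio * len(idx))
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)


if __name__ == "__main__":
    labels = [1] * 10 + [-1] * 20
    train_idx, test_idx = stratified_split(labels)
    print(len(train_idx), len(test_idx))
```

Repeating the call with ten different seeds and averaging the test accuracy mirrors the repeated-run protocol described in the text.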

UCI datasets: Australian, Balance, Banknote, Cancer, German, Hepat, Pima Indian (Pima), QSAR, Spect, Vote, Wisconsin diagnostic breast cancer (WDBC), and Wholesale. See Table 2 for details of the twelve UCI datasets.

As shown in Table 1, we observe that the classification accuracy of WMRTBSVM and WMLSRTBSVM is generally higher than that of other methods. Additionally, the classification accuracy of CTSVM is generally higher than that of TSVM, TBSVM, and LSTBSVM. CTSVM, WMRTBSVM, and WMLSRTBSVM all contain capped norm distances. In general, LSTBSVM and WMLSRTBSVM have shorter running times, but WMLSRTBSVM has higher classification accuracy. Based on this, we can objectively conclude that the use of a capped

L_{2, p}

-norm distance metric in the TBSVM framework can improve classification performance, and the addition of the Welsch Loss with p-power can further enhance classification performance.

4.3. Convergence Analysis

In Section 3.3, we theoretically proved that the iterative optimization algorithm we designed is convergent. In this section, we conducted experiments on the Cancer dataset to further verify its convergence. As shown in Figure 3, the value of the objective function decreases with each iteration. In addition, the algorithm reached its converged value in fewer than 10 iterations on the Cancer dataset. This supports the feasibility and effectiveness of our algorithm.

4.4. Robustness Analysis

We conducted experiments on both artificial datasets and UCI datasets in a noisy environment. The datasets include one synthetic dataset and twelve benchmark datasets from the UCI Machine Learning Repository. Please refer to Figure 4 and Table 2 for details on the artificial and UCI datasets.

Artificial dataset: The dataset consists of 104 two-dimensional points, with 52 samples in each class. These points are generated by perturbing points located on two intersecting planes, where each plane corresponds to a class of data. We used “∘” and “+” to distinguish between the two classes. To test the effect of outliers on classification performance, we added four outliers to the dataset, two of which belong to class

+ 1

, and two belong to class

- 1

. This is illustrated in Figure 4.

In order to visually evaluate the classification performance and robustness differences between WMRTBSVM, WMLSRTBSVM, and the other four algorithms, we conducted experiments on artificial datasets with four outliers. The experimental results are shown in Figure 5.

From the results depicted in Figure 5, we can see intuitively that WMRTBSVM and WMLSRTBSVM have better performance. The accuracies of the six algorithms (TSVM, TBSVM, LSTBSVM, CTSVM, WMRTBSVM, and WMLSRTBSVM) were

62.23 %

,

65.10 %

,

71.96 %

,

77.00 %

,

80.08 %

, and

81.54 %

, respectively. These results indicate that WMRTBSVM and WMLSRTBSVM deal with outliers better than the other methods. Additionally, CTSVM also classifies well, possibly because its capped

L_{1}

-norm distance neutralizes the negative impact of outliers. The experimental results demonstrate that WMRTBSVM and WMLSRTBSVM retain good classification accuracy after outliers are introduced, which may be due to the use of the capped

L_{2, p}

-norm distance. The robustness of WMRTBSVM and WMLSRTBSVM to outliers has been demonstrated effectively.

In addition, we also evaluated the robustness of WMRTBSVM and WMLSRTBSVM by introducing Gaussian noise of

10 %

,

30 %

, and

50 %

in the UCI datasets. Table 3, Table 4 and Table 5 show the experimental results on the dataset with

10 %

,

30 %

, and

50 %

Gaussian noises, respectively.

Table 3, Table 4 and Table 5 present the comparison of the 6 algorithms on the 12 UCI datasets with

10 %

,

30 %

and

50 %

Gaussian noise, respectively. The experimental results reveal that the classification accuracy of each algorithm decreases after the introduction of noise. However, in most cases, WMRTBSVM and WMLSRTBSVM display higher classification accuracy than the other algorithms, particularly when the noise surpasses

30 %

. Moreover, LSTBSVM and WMLSRTBSVM require less running time. Overall, WMRTBSVM and WMLSRTBSVM are superior to the other four algorithms in terms of accuracy and robustness. This implies that WMRTBSVM and WMLSRTBSVM are robust learning algorithms that facilitate the classification of noise-contaminated samples.

Based on the results shown in Figure 6, we observe that the accuracy of the six algorithms decreases to varying degrees as noise increases from

0 %

to

10 %

,

30 %

, and

50 %

. This indicates that the algorithms’ robustness is impacted by the number of noise points. However, our proposed models, WMRTBSVM (represented by the red curve) and WMLSRTBSVM (represented by the blue curve), maintain the highest accuracy. Even when noise points reach

50 %

, our algorithms still show clear advantages over the others. In the smaller datasets (a:

690 \times 14

, c:

440 \times 7

, and d:

699 \times 9

), the CTSVM (represented by the magenta curve), WMRTBSVM (represented by the red curve), and WMLSRTBSVM (represented by the blue curve) curves show relatively smooth variations. This may be attributed to the truncation loss used in the algorithms. The performance of the three truncation-based algorithms was also good in the larger datasets (b:

1372 \times 4

, e:

1000 \times 4

, and f:

1055 \times 41

). However, overall, WMRTBSVM and WMLSRTBSVM showed the best performance, likely due to their use of the Welsch loss with p-power.

4.5. Statistical Analysis

This section describes the analysis of the significant differences among the six algorithms on the 12 UCI datasets using the Friedman test [39]. The Friedman test is a simple, safe, and robust non-parametric test whose null hypothesis is that all algorithms have the same performance. If the null hypothesis is rejected, we can perform the Nemenyi post-hoc test [39]. We calculated the average ranking and accuracy of the six algorithms on the twelve datasets, and the results are presented in Table 6.

To begin with, taking Gaussian kernel datasets with

30 %

unlabeled samples as an example, we calculate the Friedman statistic variable by using the following formulation:

\begin{matrix} X_{F}^{2} = \frac{12 N}{k (k + 1)} [\sum_{j} R_{j}^{2} - \frac{k {(k + 1)}^{2}}{4}] = 44.49, \end{matrix}

(55)

where k is the number of algorithms, N is the number of UCI datasets, and

R_{j}

is the average rank of the jth algorithm on the employed datasets. Notice that

k = 6

and

N = 12

in our paper. Furthermore, according to the

X_{F}^{2}

distribution with

(k - 1)

degrees of freedom, we have

\begin{matrix} F_{F} = \frac{(N - 1) X_{F}^{2}}{N (k - 1) - X_{F}^{2}} = 11.344, \end{matrix}

(56)

where

F_{F} ((k - 1), (k - 1) (N - 1))

obeys the F-distribution with

(k - 1)

and

(k - 1) (N - 1)

degrees of freedom. In addition, for

α = 0.01

, we obtain

F_{α} (5, 55) = 3.340

. Obviously, the value of

F_{F}

is greater than

F_{α}

; thus, we can reject the null hypothesis. From Table 6, we see that the average ranks of WMRTBSVM and WMLSRTBSVM are much lower than those of the other algorithms, which means that WMRTBSVM and WMLSRTBSVM are more effective than the other algorithms.
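The two statistics used above can be computed from the average ranks alone. The sketch below is illustrative: the function name and the example ranks are hypothetical, and the F statistic is written in the standard Iman-Davenport form with denominator N(k - 1) minus the Friedman statistic.

```python
from typing import Sequence, Tuple


def friedman_statistics(avg_ranks: Sequence[float], n_datasets: int) -> Tuple[float, float]:
    """Return (chi-square statistic, F statistic) from the average ranks of k algorithms."""
    k = len(avg_ranks)
    n = n_datasets
    sum_sq = sum(r * r for r in avg_ranks)
    chi2 = 12.0 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)
    denom = n * (k - 1) - chi2
    if denom <= 0.0:
        # Happens when every dataset ranks the algorithms identically.
        raise ValueError("F statistic undefined for perfectly consistent rankings")
    f_stat = (n - 1) * chi2 / denom
    return chi2, f_stat


if __name__ == "__main__":
    # Hypothetical average ranks of three algorithms over ten datasets.
    print(friedman_statistics([1.2, 2.0, 2.8], n_datasets=10))
```

The resulting F value is compared with the critical value of the F-distribution with (k - 1) and (k - 1)(N - 1) degrees of freedom at the chosen significance level.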

Furthermore, we compared the six algorithms in pairs using the Nemenyi post-hoc test. The difference in performance between two algorithms is significant when the difference between their average ranks is larger than the critical value; otherwise, the difference is not significant. By dividing the Studentized range statistic by

\sqrt{2}

, we obtain

q_{α = 0.01} = 2.209

. Therefore, we calculate the critical difference

(C D)

by the following formula:

\begin{matrix} C D = q_{α = 0.01} \sqrt{\frac{k (k + 1)}{6 N}} = 2.209 \times \sqrt{\frac{6 (6 + 1)}{6 \times 12}} = 1.701 . \end{matrix}

(57)

From Figure 7, we see that WMRTBSVM and WMLSRTBSVM perform significantly better than TSVM, TBSVM, LSTBSVM, and CTSVM. It can further be seen that there is no significant difference between the proposed methods WMRTBSVM and WMLSRTBSVM, as their rank difference is smaller than the CD value. Therefore, the statistical analysis supports the conclusion that the proposed methods WMRTBSVM and WMLSRTBSVM have better performance.
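The pairwise comparison can be sketched in a few lines: compute the critical difference from formula (57) and flag every pair of algorithms whose average ranks differ by more than it. The function names, the algorithm labels, and the ranks in the example are hypothetical and are not taken from Table 6.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def critical_difference(q_alpha: float, k: int, n_datasets: int) -> float:
    """CD = q_alpha * sqrt(k (k + 1) / (6 N)), as in (57)."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


def significant_pairs(avg_ranks: Dict[str, float], cd: float) -> List[Tuple[str, str]]:
    """Pairs of algorithms whose average-rank difference exceeds the critical difference."""
    return [(a, b) for a, b in combinations(sorted(avg_ranks), 2)
            if abs(avg_ranks[a] - avg_ranks[b]) > cd]


if __name__ == "__main__":
    cd = critical_difference(q_alpha=2.0, k=3, n_datasets=8)
    print(cd, significant_pairs({"A": 1.0, "B": 1.5, "C": 3.0}, cd))
```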

5. Conclusions

In this paper, a generalized adaptive robust loss function

V_{θ} (x)

is designed.

V_{θ} (x)

has several significant and satisfactory characteristics, such as symmetry, boundedness, and non-convexity. By setting appropriate parameters to improve the adaptability and robustness of WMRTBSVM, we achieve better generalization performance and robustness. Secondly, we introduce the capped

L_{2, p}

-norm distance measure into WMRTBSVM to improve the generalization performance and robustness of the model. This is done by setting appropriate values of p and of the upper bound parameter, and it helps especially when the outliers are far from the normal data distribution. We also add MR into WMRTBSVM to improve the discriminability and classification ability of our model. To improve the computational efficiency of WMRTBSVM, we use the least squares method to obtain WMLSRTBSVM. Two effective iterative optimization algorithms are designed, and theoretical support is given for both WMRTBSVM and WMLSRTBSVM. We mainly conducted accuracy experiments on artificial datasets and UCI datasets. The experimental results show that WMRTBSVM and WMLSRTBSVM have better classification performance and robustness. In future work, we hope to apply WMRTBSVM and WMLSRTBSVM to multi-class classification tasks to further study their performance and our theoretical work. We also plan to study how to combine our method with sparse kernel SVM to develop faster algorithms with better performance. In addition, we designed the generalized adaptive robust loss function

V_{θ} (x)

, which we hope can be combined with other loss functions to further improve the adaptability and robustness of related algorithms. Ultimately, we hope that

V_{θ} (x)

can be applied to ensemble learning to deal with unbalanced datasets.

Author Contributions

B.M.: writing—original draft, conceptualization, writing—reviewing and editing, software, data curation. G.Y.: writing—original draft, supervision, validation, project administration, funding acquisition. J.M.: writing—original draft, conceptualization, writing—reviewing and editing, software, data curation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of Ningxia Provincial of China (No. 2022AAC03260, No. 2023AAC02053), in part by the Key Research and Development Program of Ningxia (Introduction of Talents Project) (No. 2022BSB03046), in part by the Fundamental Research Funds for the Central Universities (No. 2021KYQD23, No. 2022XYZSX03), in part by the National Natural Science Foundation of China (No. 11861002).

Institutional Review Board Statement

Not Applicable.

Informed Consent Statement

Not Applicable.

Data Availability Statement

All of the benchmark datasets used in our numerical experiments are from the UCI Machine Learning Repository, and are available at http://archive.ics.uci.edu/ml/ (accessed on 21 March 2023).

Conflicts of Interest

There is no conflict of interest in this study.

Appendix A. Convergence Analysis

Lemma A1.

For any scalar t, when

0 < p \leq 2

, inequality

2 {| t |}^{p} - p t^{2} + p - 2 \leq 0

holds.

Proof.

Let

f (t) = 2 {| t |}^{\frac{p}{2}} - p t + p - 2

. The first and second derivatives of

f (t)

are, respectively:

\begin{matrix} f^{^{'}} (t) = p (t^{\frac{p - 2}{2}} - 1) \end{matrix}

and

\begin{matrix} f^{^{″}} (t) = \frac{p (p - 2)}{2} t^{\frac{p - 4}{2}} . \end{matrix}

If

t > 0

and

0 < p \leq 2

, then

f^{^{″}} (t) \leq 0

and

t = 1

is the only point at which

f^{^{'}} (t) = 0

. Note that

f (1) = 0

; thus, when

t > 0

and

0 < p \leq 2

, we have

f (t) \leq 0

. Thus

f (t^{2}) \leq 0

, which indicates

{2 | t |}^{p} - p t^{2} + p - 2 \leq 0

holds. □
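As a quick sanity check of Lemma A1 (not a substitute for the proof), the inequality can be evaluated numerically on a grid of t and p values. The function names and the grid are arbitrary choices made for this sketch.

```python
from typing import Iterable


def lemma_a1_lhs(t: float, p: float) -> float:
    """Left-hand side 2|t|^p - p t^2 + p - 2 of the inequality in Lemma A1."""
    return 2.0 * abs(t) ** p - p * t * t + p - 2.0


def max_lhs(ps: Iterable[float], ts: Iterable[float]) -> float:
    """Largest value of the left-hand side over the given grid."""
    ts = list(ts)
    return max(lemma_a1_lhs(t, p) for p in ps for t in ts)


if __name__ == "__main__":
    grid_t = [i / 100.0 for i in range(-500, 501)]
    grid_p = [0.25, 0.5, 1.0, 1.5, 2.0]
    # The maximum should not exceed 0 (up to floating-point rounding).
    print(max_lhs(grid_p, grid_t) <= 1e-12)
```

The maximum is attained at |t| = 1, where the left-hand side equals zero for every p.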

Lemma A2.

For any nonzero vectors α, β, when

0 < p \leq 2

, the following inequality holds.

\begin{matrix} {∥ α ∥}_{2}^{p} - \frac{p}{2} {∥ β ∥}_{2}^{p - 2} {∥ α ∥}_{2}^{2} \leq {∥ β ∥}_{2}^{p} - \frac{p}{2} {∥ β ∥}_{2}^{p - 2} {∥ β ∥}_{2}^{2} . \end{matrix}

Proof.

According to Lemma A1, we obtain:

2 {(\frac{{∥ α ∥}_{2}}{{∥ β ∥}_{2}})}^{p} - p {(\frac{{∥ α ∥}_{2}}{{∥ β ∥}_{2}})}^{2} + p - 2 \leq 0

⇒

2 {∥ α ∥}_{2}^{p} - p {∥ β ∥}_{2}^{p - 2} {∥ α ∥}_{2}^{2} \leq (2 - p) {∥ β ∥}_{2}^{p}

⇒

{∥ α ∥}_{2}^{p} - \frac{p}{2} {∥ β ∥}_{2}^{p - 2} {∥ α ∥}_{2}^{2} \leq {∥ β ∥}_{2}^{p} - \frac{p}{2} {∥ β ∥}_{2}^{p - 2} {∥ β ∥}_{2}^{2} .

□

Theorem A1.

Algorithm 1 monotonically decreases the objectives (17) and (18) in each iteration until it converges.

Proof.

Recall our framework

\begin{matrix} J = min_{ω_{1}, b_{1}, ξ_{1}} \sum_{i = 1}^{m_{1}} min (∥ ω_{1} x_{i} + b_{1} ∥_{2}^{p}, ε_{1}) + c_{1} \sum_{i = 1}^{m_{2}} {[1 - exp (- \frac{ξ_{1, i}^{2}}{2 c^{2}})]}^{θ} + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) + c_{5} f_{1}^{T} L f_{1} \end{matrix}

(A1)

\begin{matrix} = J_{1} + J_{2} + J_{3} + J_{4}, \end{matrix}

\begin{matrix} J = min_{z} \sum_{i = 1}^{m_{1}} min (∥ h_{i} z_{1} ∥_{2}^{p}, ε_{1}) + \frac{c_{3}}{2} z_{1}^{T} z_{1} + J_{2} + J_{4}, \end{matrix}

(A2)

where

h_{i} = (x_{i}, 1)

,

z_{1} = {(w_{1}^{T}, b_{1})}^{T}

. When

∥ h_{i} z_{1} ∥_{2}^{p}

is smaller than

ε_{1}

, the above equation is equivalent to:

\begin{matrix} J = min_{z} \sum_{i = 1}^{m_{1}} {∥ h_{i} z_{1} ∥}_{2}^{p} + \frac{c_{3}}{2} z_{1}^{T} z_{1} + J_{2} + J_{4}, \end{matrix}

(A3)

Suppose

z_{1}^{k + 1}

is the solution of the

(k + 1)

th iteration of the algorithm, based on (47) we have:

\begin{matrix} z_{1}^{k + 1} = min_{z} \frac{1}{2} {(H z_{1}^{(k + 1)})}^{T} F^{(k + 1)} H z_{1}^{(k + 1)} + c_{3} {(z_{1}^{(k + 1)})}^{T} z_{1}^{(k + 1)} + J_{2}^{(k + 1)} + J_{4}^{(k + 1)} . \end{matrix}

(A4)

At the kth iteration:

\begin{matrix} {(H z_{1}^{(k + 1)})}^{T} F^{(k + 1)} H z_{1}^{(k + 1)} + c_{3} {(z_{1}^{(k + 1)})}^{T} z_{1}^{(k + 1)} + J_{2}^{(k + 1)} + J_{4}^{(k + 1)} \end{matrix}

(A5)

≤

\begin{matrix} {(H z_{1}^{(k)})}^{T} F^{(k)} H z_{1}^{(k)} + c_{3} {(z_{1}^{(k)})}^{T} z_{1}^{(k)} + J_{2}^{(k)} + J_{4}^{(k)} . \end{matrix}

which is equivalent to:

\begin{matrix} \frac{p}{2} ∥ H z_{1}^{(k + 1)} ∥_{2}^{p} - \frac{p}{2} {∥ H z_{1}^{(k + 1)} ∥}_{2}^{p - 2} + c_{3} {(z_{1}^{(k + 1)})}^{T} z_{1}^{(k + 1)} + J_{2}^{(k + 1)} + J_{4}^{(k + 1)} \end{matrix}

(A6)

≤

\begin{matrix} \frac{p}{2} ∥ H z_{1}^{(k)} ∥_{2}^{p} - \frac{p}{2} {∥ H z_{1}^{(k)} ∥}_{2}^{p - 2} + c_{3} {(z_{1}^{(k)})}^{T} z_{1}^{(k)} + J_{2}^{(k)} + J_{4}^{(k)} . \end{matrix}

Based on Lemma A2, we obtain:

\begin{matrix} ∥ H z_{1}^{(k + 1)} ∥_{2}^{p} - \frac{p}{2} ∥ H z_{1}^{(k + 1)} ∥_{2}^{p - 2} ∥ H z_{1}^{(k + 1)} ∥_{2}^{2} \leq ∥ H z_{1}^{(k)} ∥_{2}^{p} - \frac{p}{2} ∥ H z_{1}^{(k)} ∥_{2}^{p - 2} {∥ H z_{1}^{(k)} ∥}_{2}^{2} . \end{matrix}

(A7)

Here, according to the Formulas (A6) and (A7), we have:

\begin{matrix} ∥ H z_{1}^{(k + 1)} ∥_{2}^{p} + c_{3} {(z_{1}^{(k + 1)})}^{T} z_{1}^{(k + 1)} + J_{2}^{(k + 1)} + J_{4}^{(k + 1)} \leq {∥ H z_{1}^{(k)} ∥}_{2}^{p} + c_{3} {(z_{1}^{(k)})}^{T} z_{1}^{(k)} + J_{2}^{(k)} + J_{4}^{(k)} . \end{matrix}

(A8)

Thus, we have

J (z_{1}^{(k + 1)}) \leq J (z_{1}^{(k)})

. If

∥ h_{i} z_{1} ∥_{2}^{p}

is larger than

ε_{1}

, we obtain

J (z_{1}^{(k + 1)}) = J (z_{1}^{(k)})

. Therefore,

J (z_{1}^{(k + 1)}) \leq J (z_{1}^{(k)})

holds, meaning that Algorithm 1 decreases the objective of problem (17) until convergence. The same argument applies to problem (18). Since the objectives (17) and (18) are bounded below by 0, Algorithm 1 converges. □

Lemma A3.

For all positive real numbers a and b, the following inequality holds:

\begin{matrix} \sqrt{a} - \frac{a}{2 \sqrt{b}} \leq \sqrt{b} - \frac{b}{2 \sqrt{b}} . \end{matrix}

(A9)
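Lemma A3 is stated without proof. One short way to see it, offered here as a supplementary sketch, is to multiply both sides by 2√b, which is positive, and rearrange into a perfect square:

```latex
\sqrt{a} - \frac{a}{2\sqrt{b}} \le \sqrt{b} - \frac{b}{2\sqrt{b}}
\iff 2\sqrt{a}\sqrt{b} - a \le 2b - b = b
\iff 0 \le a - 2\sqrt{a}\sqrt{b} + b = \left(\sqrt{a} - \sqrt{b}\right)^{2}.
```

The last inequality always holds, with equality exactly when a = b.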

Theorem A2.

Algorithm 1 converges to a local minimum of problems (17) and (18).

Proof.

Recall our framework

\begin{matrix} J = min_{ω_{1}, b_{1}, ξ_{1}} \sum_{i = 1}^{m_{1}} min (∥ ω_{1} x_{i} + b_{1} ∥_{2}^{p}, ε_{1}) + c_{1} \sum_{i = 1}^{m_{2}} {[1 - exp (- \frac{ξ_{1, i}^{2}}{2 c^{2}})]}^{θ} + \frac{c_{3}}{2} (∥ ω_{1} ∥_{2}^{2} + b_{1}^{2}) + c_{5} f_{1}^{T} L f_{1}, \end{matrix}

(A10)

\begin{matrix} J = min_{z} \sum_{i = 1}^{m_{1}} min (∥ h_{i} z_{1} ∥_{2}^{p}, ε_{1}) + \frac{c_{3}}{2} z_{1}^{T} z_{1} + J_{2} + J_{4}, \end{matrix}

(A11)

where

h_{i} = (x_{i}, 1)

,

z_{1} = {(ω_{1}^{T}, b_{1})}^{T}

. We first consider

J_{2} = c_{1} \sum_{i = 1}^{m_{2}} {[1 - exp (- \frac{ξ_{1, i}^{2}}{2 c^{2}})]}^{θ}

and define two functions

\begin{matrix} R^{+} \to R^{+} : \text{concave function } θ (V) = ϑ (\sqrt{V}), V \in [0, \infty), θ^{^{'}} (V) = \frac{ϑ^{^{'}} (\sqrt{V})}{2 \sqrt{V}}, \end{matrix}

(A12)

\begin{matrix} R^{-} \to R^{+} : {(- θ^{^{'}})}^{- 1} . \end{matrix}

(A13)

Based on conjugate function theory, there exists a convex conjugate function of the convex function

- θ (v)

in

R^{-}

:

\begin{matrix} {(- θ)}^{*} (z) = sup_{v \geq 0} {z v + θ (v)}, z < 0, \end{matrix}

(A14)

where

\begin{matrix} {(- θ)}^{*} (z) = z {(- θ^{^{'}})}^{- 1} (z) + θ [{(- θ^{^{'}})}^{- 1} (z)], z < 0 . \end{matrix}

(A15)

Because the conjugate of the conjugate of a closed convex function is the function itself, we have

\begin{matrix} (- θ) (v) = sup_{z < 0} {z v - {(- θ)}^{*} (z)}, v \geq 0 . \end{matrix}

(A16)

Let

z = - \frac{1}{2} s

, and define the convex function

ψ (s) = {(- θ)}^{*} (- \frac{1}{2} s)

. Then (A16) becomes

\begin{matrix} - θ (v) = sup_{s > 0} {- \frac{1}{2} s v - ψ (s)}, v \geq 0, \end{matrix}

(A17)

which is equivalent to

\begin{matrix} θ (v) = inf_{s > 0} {\frac{1}{2} s v + ψ (s)}, \forall v \geq 0 . \end{matrix}

(A18)

In (A18), the function

\frac{1}{2} s v + ψ (s)

of

s > 0

is convex, so we can obtain its minimizer

s^{*} = 2 θ^{^{'}} (v)

by differentiation. Define

φ (v) = 1 - exp (- v^{2})

, where

v = \frac{ξ_{1, i}}{\sqrt{2} c}

. Since

φ (v) = θ (v^{2})

, we have:

\begin{matrix} φ (v) = θ (v^{2}) = inf_{s > 0} {\frac{1}{2} s v^{2} + ψ (s)}, \forall v . \end{matrix}

(A19)

When

v > 0

, there exists a minimizer

s^{*} = 2 θ^{^{'}} (v^{2})

on the right-hand side of the above equation, i.e.,

\begin{matrix} s^{*} = \frac{φ^{^{'}} (v)}{v} \end{matrix}

(A20)
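The half-quadratic representation in (A19) and (A20) can be checked numerically for the Welsch-type loss φ(v) = 1 - exp(-v^2), for which (A20) gives s* = 2 exp(-v^2). The sketch below is an illustration under stated assumptions: it takes θ(u) = 1 - exp(-u) and uses a closed form for the dual potential ψ that was derived for this example, namely ψ(s) = 1 - s/2 + (s/2) log(s/2) for 0 < s ≤ 2 and ψ(s) = 0 for s > 2. Neither the helper names nor this expression for ψ appear in the source.

```python
import numpy as np

def phi(v):
    """Welsch-type loss phi(v) = 1 - exp(-v^2)."""
    return 1.0 - np.exp(-v ** 2)

def psi(s):
    """Dual potential for theta(u) = 1 - exp(-u) (derived for this sketch, an assumption)."""
    half = s / 2.0
    return np.where(s <= 2.0, 1.0 - half + half * np.log(half), 0.0)

# Dense grid over the auxiliary variable s > 0.
s_grid = np.linspace(1e-6, 4.0, 400_001)

results = []
for v in (0.1, 0.5, 1.0, 2.0):
    values = 0.5 * s_grid * v ** 2 + psi(s_grid)   # objective inside the infimum in (A19)
    s_best = s_grid[np.argmin(values)]              # numerical minimizer
    s_star = 2.0 * np.exp(-v ** 2)                  # minimizer predicted by (A20)
    results.append((v, values.min(), phi(v), s_best, s_star))

for v, inf_value, target, s_best, s_star in results:
    print(f"v={v}: inf={inf_value:.6f}, phi={target:.6f}, "
          f"argmin={s_best:.4f}, predicted={s_star:.4f}")
```

In this reading, s* acts as a per-sample weight that shrinks quickly as the residual v grows, which is how large residuals lose influence in a re-weighted update.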

Combining (A19) and (A20) gives:

\begin{matrix} inf_{s > 0} {\frac{1}{2} s v^{2} + ψ (s)} = \frac{1}{2} s^{*} v^{2} + ψ (s^{*}), \forall v, \end{matrix}

(A21)

where

s^{*} = 2 exp (- v^{2})

. Hence Algorithm 1 converges to a local minimum of

J_{2}

. For

J_{4} = c_{5} f_{1}^{T} L f_{1}

, in the

(k + 1)

th iteration, we have:

\begin{matrix} J_{4}^{(k + 1)} \leq J_{4}^{(k)} . \end{matrix}

(A22)

With Lemma A3, we set

\begin{matrix} a = | J_{4}^{(k + 1)} |^{2}, \end{matrix}

(A23)

\begin{matrix} b = | J_{4}^{(k)} |^{2}, \end{matrix}

then, we can easily obtain the following inequality:

\begin{matrix} J_{4}^{(k + 1)} - \frac{| J_{4}^{(k + 1)} |^{2}}{2 J_{4}^{(k)}} \leq J_{4}^{(k)} - \frac{| J_{4}^{(k)} |^{2}}{2 J_{4}^{(k)}} . \end{matrix}

(A24)

Combining (A22) and (A24), we can obtain

\begin{matrix} | J_{4}^{(k + 1)} | \leq | J_{4}^{(k)} | . \end{matrix}

(A25)

Hence Algorithm 1 converges to a local minimum of

J_{4}

. Next, consider

\begin{matrix} J_{1} + J_{3} = min_{z} \sum_{i = 1}^{m_{1}} min (∥ h_{i} z_{1} ∥_{2}^{p}, ε_{1}) + \frac{c_{3}}{2} z_{1}^{T} z_{1} . \end{matrix}

(A26)

Define the Lagrangian function of (A26) as

τ (z_{1})

. By the KKT condition of (A26), we have:

c_{3} z_{1} + \{\begin{matrix} Σ p ∥ h_{i} z_{1} ∥_{2}^{p - 1} h_{i}^{T}, 0 \leq {∥ h_{i} z_{1} ∥}_{2}^{p} < ε_{1}, \\ 0, \text{otherwise} . \end{matrix}

(A27)

Substituting

f_{i i}

from (33) into the above equation gives:

\begin{matrix} 2 H^{T} F H z_{1} + c_{3} z_{1} = 0 . \end{matrix}

(A28)

Combining (A28) and (47), we obtain:

\begin{matrix} {(H z_{1})}^{T} F (H z_{1}) + c_{3} z_{1}^{T} z_{1} . \end{matrix}

(A29)

Similarly, setting the derivative of the Lagrangian function of (A29) to zero gives:

\begin{matrix} 2 H^{T} F H z_{1} + c_{3} z_{1} = 0 . \end{matrix}

(A30)

Because (A30) coincides with (A28), a solution found by Algorithm 1 satisfies the KKT condition of (A26). Hence Algorithm 1 converges to a local minimum of

J_{1} + J_{3}

. Consequently, Algorithm 1 converges to a local minimum of J. □


Figure 1. L_{2} loss vs. L_{1} loss vs. Welsch loss vs. V_{θ} (x) loss.

Figure 2. Welsch loss with θ-power under different θ.

Figure 3. Convergence of WMTBSVM.

Figure 4. Distribution of artificial datasets with outliers.

Figure 5. The classification performance of six algorithms on the artificial datasets.

Figure 6. Accuracies of six algorithms under different noise levels.

Figure 7. Visualization of post-hoc tests for data from Table 6. (a) Gaussian kernel with 10% unlabeled samples. (b) Gaussian kernel with 30% unlabeled samples. (c) Gaussian kernel with 50% unlabeled samples.

Table 1. Experimental results on UCI datasets without noise.

	TSVM	TBSVM	LSTBSVM	CTSVM	WMTBSVM	WMLSTBSVM
Datasets	ACC (%)	ACC (%)	ACC (%)	ACC (%)	ACC (%)	ACC (%)
(N × n)	Times (s)	Times (s)	Times (s)	Times (s)	Times (s)	Times (s)
Australian	85.44	86.91	86.03	86.21	86.44	$87.18$
(690 × 14)	14.698	1.828	0.061	3.798	1.584	0.766
Balance	93.57	93.57	93.25	92.36	$94.82$	93.57
(576 × 4)	0.695	0.725	0.051	3.270	1.017	0.616
Backnote	87.23	87.30	87.90	86.92	$88.35$	88.15
(1372 × 4)	15.134	12.791	5.089	7.105	5.992	2.492
Cancer	95.65	95.94	95.22	$96.16$	94.17	95.62
(699 × 9)	2.640	2.063	1.064	3.843	2.312	0.843
German	73.80	73.90	74.00	75.70	$77.60$	76.10
(1000 × 24)	5.495	3.983	1.075	2.655	2.666	1.536
Hepat	77.33	80.67	80.51	80.18	$83.42$	82.67
(155 × 19)	0.480	0.627	0.297	2.378	0.554	0.200
Pima	75.92	76.67	76.71	75.92	$77.05$	76.45
(768 × 8)	4.282	1.730	0.669	3.827	2.011	0.888
QSAR	85.96	85.38	85.30	86.25	$86.90$	86.90
(1055 × 41)	7.630	6.843	2.113	1.946	3.860	1.958
Spect	80.77	80.38	80.77	81.25	81.92	$83.08$
(267 × 44)	0.512	0.224	0.152	1.794	1.045	0.308
Vote	95.95	94.71	94.79	95.48	95.71	$95.95$
(432 × 16)	2.808	0.450	0.156	2.750	1.36	0.404
WDBC	96.43	95.89	95.93	96.54	$97.25$	96.43
(569 × 30)	3.722	0.564	0.254	2.674	1.613	0.688
Wholesale	82.79	88.60	86.05	90.00	89.37	$90.47$
(440 × 7)	1.120	1.648	0.745	2.560	1.227	0.500

Table 2. Characteristics of UCI Datasets.

Datasets	Samples	Attributes	Datasets	Samples	Attributes
Australian	690	14	Pima	768	8
Balance	576	4	QSAR	1055	41
Backnote	1372	4	Spect	267	44
Cancer	699	9	Vote	432	16
German	1000	24	Wholesale	440	7
Hepat	155	19	WDBC	569	30

Table 3. Experimental results on UCI datasets with 10% noise.

	TSVM	TBSVM	LSTBSVM	CTSVM	WMTBSVM	WMLSTBSVM
Datasets	ACC (%)	ACC (%)	ACC (%)	ACC (%)	ACC (%)	ACC (%)
(N × n)	Times (s)	Times (s)	Times (s)	Times (s)	Times (s)	Times (s)
Australian	85.29	86.32	86.40	85.85	$86.41$	85.44
(690 × 14)	3.702	1.244	0.552	3.564	1.745	0.842
Balance	93.04	93.39	92.43	91.11	$93.75$	93.21
(576 × 4)	1.410	1.440	0.654	3.046	1.117	0.593
Backnote	86.35	85.99	83.27	86.46	84.89	$86.93$
(1372 × 4)	15.068	8.845	4.090	7.406	6.062	2.436
Cancer	94.94	95.51	95.00	$95.78$	94.00	95.46
(699 × 9)	2.143	1.973	0.862	2.69	1.741	0.856
German	73.10	73.40	73.51	74.40	$75.30$	73.21
(1000 × 24)	5.051	4.120	1.575	1.846	4.038	1.661
Hepat	76.00	78.67	77.42	77.59	$81.33$	81.33
(155 × 19)	0.209	3.999	1.483	2.167	0.607	0.270
Pima	75.60	75.92	76.11	76.24	76.18	$76.33$
(768 × 8)	2.565	1.505	0.969	4.267	1.875	1.016
QSAR	83.37	82.98	83.13	83.87	$84.12$	82.44
(1055 × 41)	9.977	6.863	3.111	4.68	3.659	1.844
Spect	78.08	79.23	79.77	80.69	$81.15$	81.92
(267 × 44)	0.350	0.287	0.049	2.077	1.052	0.287
Vote	95.24	94.48	94.79	95.00	95.24	$95.48$
(432 × 16)	2.940	0.447	0.148	3.438	1.119	0.452
WDBC	93.96	93.71	94.81	95.11	$96.82$	95.07
(569 × 30)	5.201	0.552	0.254	2.856	2.270	0.682
Wholesale	79.53	83.49	84.64	87.47	88.15	$90.23$
(440 × 7)	0.523	2.312	1.050	2.199	1.273	0.552

Table 4. Experimental results on UCI datasets with 30% noise.

	TSVM	TBSVM	LSTBSVM	CTSVM	WMTBSVM	WMLSTBSVM
Datasets	ACC (%)	ACC (%)	ACC (%)	ACC (%)	ACC (%)	ACC (%)
(N × n)	Times (s)	Times (s)	Times (s)	Times (s)	Times (s)	Times (s)
Australian	82.44	83.14	83.71	84.32	84.85	$85.15$
(690 × 14)	1.833	0.669	0.350	3.946	1.677	0.841
Balance	91.21	92.86	92.10	88.57	93.39	$93.04$
(576 × 4)	1.432	1.624	0.751	3.409	1.063	0.555
Backnote	79.78	80.07	81.46	83.96	$84.60$	84.79
(1372 × 4)	11.367	5.357	4.088	7.104	4.252	3.062
Cancer	94.78	92.22	92.51	91.84	$93.13$	92.32
(699 × 9)	2.389	1.503	0.653	3.571	1.460	0.824
German	71.82	71.43	72.00	72.90	$74.80$	72.70
(1000 × 24)	0.821	0.741	0.376	1.530	4.549	1.591
Hepat	73.33	74.00	74.82	75.41	$80.67$	80.00
(155 × 19)	0.233	2.711	1.032	2.540	0.537	0.190
Pima	71.63	71.29	70.16	74.16	$75.00$	75.00
(768 × 8)	15.676	3.011	1.571	1.031	1.957	0.982
QSAR	77.37	75.77	76.91	80.10	82.12	$82.35$
(1055 × 41)	8.300	5.138	3.108	4.678	3.468	1.850
Spect	74.00	77.69	78.00	77.31	$81.15$	81.15
(267 × 44)	0.438	0.477	0.047	2.004	1.058	0.343
Vote	94.05	93.52	93.61	94.29	95.00	$95.20$
(432 × 16)	3.557	0.402	0.148	3.029	1.137	0.424
WDBC	91.71	92.29	93.00	92.93	$95.29$	93.89
(569 × 30)	10.108	0.501	0.255	2.670	2.184	0.665
Wholesale	68.56	68.12	67.38	85.60	87.81	$89.77$
(440 × 7)	2.876	2.045	1.151	2.958	1.321	0.476

Table 5. Experimental results on UCI datasets with 50% noise.

	TSVM	TBSVM	LSTBSVM	CTSVM	WMTBSVM	WMLSTBSVM
Datasets	ACC (%)	ACC (%)	ACC (%)	ACC (%)	ACC (%)	ACC (%)
(N × n)	Times (s)	Times (s)	Times (s)	Times (s)	Times (s)	Times (s)
Australian	78.88	80.71	78.15	80.74	84.26	$84.71$
(690 × 14)	2.472	0.621	0.359	3.392	1.590	0.759
Balance	80.32	80.43	81.37	86.96	90.54	$92.50$
(576 × 4)	3.641	1.115	0.648	3.120	1.069	0.619
Backnote	76.79	77.01	78.21	78.62	$82.74$	80.57
(1372 × 4)	12.224	7.318	3.086	7.317	4.695	2.484
Cancer	84.35	84.64	85.00	89.42	$91.70$	90.59
(699 × 9)	1.854	1.351	0.756	3.579	1.505	0.883
German	70.90	71.00	70.10	70.80	72.20	$70.50$
(1000 × 24)	15.293	6.912	3.073	2.660	2.641	1.560
Hepat	70.67	71.39	71.63	72.33	$77.00$	75.67
(155 × 19)	0.232	0.648	0.299	1.883	0.614	0.174
Pima	65.79	62.26	64.61	68.29	73.39	$73.53$
(768 × 8)	3.762	2.556	1.272	4.378	1.946	0.924
QSAR	62.91	63.58	64.28	77.31	$80.58$	76.63
(1055 × 41)	0.767	10.486	4.125	4.378	4.198	1.747
Spect	69.28	66.92	66.92	71.15	79.66	$80.38$
(267 × 44)	0.772	0.800	0.321	1.694	1.038	0.314
Vote	83.81	91.38	92.29	94.24	$94.76$	94.52
(432 × 16)	3.793	0.400	0.150	2.614	1.173	0.380
WDBC	84.50	82.23	81.32	89.11	$92.57$	90.54
(569 × 30)	10.195	0.515	0.054	3.250	2.128	0.649
Wholesale	68.67	68.14	71.93	83.74	85.88	$88.37$
(440 × 7)	1.105	0.539	0.244	2.311	1.387	0.551

Table 6. Average accuracy and ranks of six algorithms with Gaussian kernel on UCI datasets with different proportions of unlabeled samples.

Cases	Measure	TSVM	TBSVM	LSTBSVM	CTSVM	WMTBSVM	WMLSTBSVM
Gaussian kernel	Avg. ACC (10%)	85.54	85.26	85.11	85.80	86.45	86.84
	Avg. rank (10%)	4.88	4.17	4.17	2.92	2.25	2.63
	Avg. ACC (30%)	80.64	81.03	82.31	83.45	85.65	86.04
	Avg. rank (30%)	5.17	5.08	4.08	3.50	1.50	1.67
	Avg. ACC (50%)	75.57	74.97	75.48	80.23	83.77	84.37
	Avg. rank (50%)	4.92	4.96	4.79	3.0	1.42	1.92
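One common way to read average ranks like those in Table 6 is through the Friedman test and its Iman-Davenport F-form. The sketch below computes both statistics from the tabulated average ranks. It assumes that the ranks are averaged over the 12 UCI datasets listed in Table 1 and that k = 6 algorithms are compared; the function name `friedman_from_avg_ranks` is hypothetical, and the values are not claimed to match those reported by the authors.

```python
def friedman_from_avg_ranks(avg_ranks, n_datasets):
    """Return (Friedman chi-square, Iman-Davenport F) from average ranks."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    chi2 = 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)
    f_stat = (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)
    return chi2, f_stat

N_DATASETS = 12  # assumption: the 12 datasets of Table 1

# Average ranks copied from Table 6 (TSVM, TBSVM, LSTBSVM, CTSVM, WMTBSVM, WMLSTBSVM).
avg_ranks = {
    "10%": [4.88, 4.17, 4.17, 2.92, 2.25, 2.63],
    "30%": [5.17, 5.08, 4.08, 3.50, 1.50, 1.67],
    "50%": [4.92, 4.96, 4.79, 3.0, 1.42, 1.92],
}

results = {label: friedman_from_avg_ranks(ranks, N_DATASETS)
           for label, ranks in avg_ranks.items()}

for label, (chi2, f_stat) in results.items():
    print(f"{label}: chi2_F = {chi2:.3f}, F_F = {f_stat:.3f}")
```

The F statistic would then be compared with the critical value of the F distribution with (k - 1) and (k - 1)(N - 1) degrees of freedom, here 5 and 55, at the chosen significance level.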

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
