1. Introduction
In some applications, such as pattern recognition and data mining, dimensionality reduction methods are often used since they can reduce space-time complexity, denoise, and make the model more robust. Principal component analysis (PCA) [
1,
2,
3] and linear discriminant analysis (LDA) [
2,
4,
5,
6,
7] are two typical linear algorithms. Following them, researchers have proposed many variants, such as kernel PCA [
8], generalized discriminant analysis (GDA) [
9], and linear discriminant analysis for robust dimensionality reduction (RLDA) [
10].
For nonlinear dimensionality reduction problems, manifold learning provides an effective solution. By supposing that data are located on a low-dimensional manifold, data samples observed in high-dimensional space can be represented in a low-dimensional space. Some representative manifold learning algorithms are ISOMAP [
11], locally linear embedding (LLE) [
12] and Laplacian Eigenmaps (LE) [
13].
From the algorithmic perspective, the algorithms mentioned above can be categorized as global methods or local methods. Global methods learn the low-dimensional representations by using global information of data. Both PCA and LDA are global methods. Global methods are often effective and efficient, and are therefore widely used in many real-world applications. However, when dealing with nonlinear data, global methods cannot capture the genuine distribution of data very well. Local methods based on the manifold learning idea, such as LLE and LE, pay special attention to the intrinsic structure of data. Nevertheless, most of these manifold learning methods disregard label information when recovering the low-dimensional manifold structure, because they are inherently unsupervised.
Although the above-mentioned methods are defined from different perspectives of dimensionality reduction, graph embedding provides a unified framework for understanding and comparing them [
14]. Furthermore, by integrating label information in the computation of intrinsic and penalty graphs within the graph embedding framework, a supervised dimensionality reduction method called marginal Fisher analysis (MFA) is proposed [
14].
In the traditional dimensionality reduction algorithms described above, data are generally assumed to be independent and identically distributed (i.i.d.). However, in real-world applications, there are often relations or links between data, for instance, geometrical or semantic similarity, links among web pages, and citation relations between scientific papers. Such relationships usually indicate that the related samples are likely to be similar or to belong to the same class. Nevertheless, although some dimensionality reduction methods seek to preserve the locality of data [
15,
16,
17,
18], the useful relationships are often simply ignored during the learning process of most existing dimensionality reduction methods.
Recently, relational learning has often been used in practical applications, for instance, web mining [
19] and social network analysis [
20]. In addition, relational information is also considered in social network discovery, document classification, sequential data analysis and semi-supervised graph embedding [
21,
22,
23].
In the domain of dimensionality reduction, some algorithms have already been proposed in which the relational information is integrated into the representation learning process. In [
24], Duin et al. propose the relational discriminant analysis (RDA). In RDA, relationships among data are measured by the Euclidean distance between objects and prototypes or support objects of each class. However, RDA uses mean squared error as the objective function, and cannot perform well on multi-class learning problems. In [
25], Li et al. propose the probabilistic relational PCA (PRPCA) to build a probabilistic model associated with PCA and relational learning. RDA and PRPCA effectively integrate relational learning into the dimensionality reduction algorithms. Nevertheless, how to better inject relationships into traditional dimensionality reduction models is still worth exploring.
Recently, a few deep relational learning algorithms have been proposed. Specifically, Gao et al. design a deep learning model based on relational network for hyperspectral image few-shot classification [
26], Chen et al. apply local relation learning for face forgery detection [
27], and Cho et al. develop a weakly supervised anomaly detection method via context-motion relational learning [
28]. In addition, some relational learning methods are used in the one-shot [
29] or zero-shot [
30] learning scenarios. However, many real-world applications have only very few data. Hence, shallow relational learning algorithms are still needed.
In this paper, we propose a novel and general framework for dimensionality reduction, called relational Fisher analysis (RFA) [
31]. Besides the intrinsic and penalty graphs in the graph embedding framework, we further construct a relational graph that captures the relational information encoded within data. Through this graph, the proposed RFA takes into account the impact of the relational information in the representation learning process. An effective iterative trace ratio algorithm is proposed to optimize RFA. Furthermore, we use the kernel trick to extend RFA to its kernelized version, KRFA. Additionally, we theoretically prove that the optimization algorithm of RFA converges. To evaluate the effectiveness of RFA, we conduct extensive experiments on many real-world applications. The results demonstrate that the proposed RFA outperforms most of the classic dimensionality reduction algorithms on the datasets we use. The effectiveness of KRFA is also tested.
This paper is based on one of our previous conference papers [
31], with significant improvements. Specifically, we propose the KRFA algorithm and add more extensive experiments with comparison to the related approaches. The rest of this paper is organized as follows: In
Section 2, we introduce several related ideas, including graph embedding, trace ratio problem and relational learning, which are highly relevant to our work. In
Section 3, we focus on our proposed method RFA, including the notation, the formulation and the iterative optimization method of RFA.
Section 4 proves the convergence of RFA and presents how to extend RFA to a kernel version, KRFA. In
Section 5, we compare our methods RFA and KRFA to other commonly used dimensionality reduction methods with extensive experiments, which demonstrate the effectiveness of our proposed methods. Finally, we summarize this paper in
Section 6.
2. Related Work
In this section, we first briefly introduce some traditional dimensionality reduction methods, and then offer a detailed description about relevant ideas including graph embedding, trace ratio problem and relational learning, respectively. Finally, we specify how those ideas are used in this work.
2.1. Traditional Dimensionality Reduction Methods
This subsection presents the basic ideas of some traditional dimensionality reduction methods, namely PCA, LDA and several locality-based manifold learning methods including LLE and LE, together with their advantages and drawbacks.
2.1.1. PCA
The main idea of PCA [
1] is to seek projection directions with maximal variances of the low-dimensional embeddings. It effectively extracts and retains the principal components of the original data. However, as PCA is an unsupervised dimensionality reduction method, low-dimensional embeddings obtained from this method cannot perfectly maintain the discrimination between data of different classes.
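As a minimal illustration of the idea (a sketch under our own assumptions, not an implementation from the cited works), the following NumPy code computes PCA by eigendecomposition of the sample covariance matrix. Samples are stored as columns, matching the notation used later in this paper; the function name `pca_fit` and the synthetic data are hypothetical.

```python
import numpy as np

def pca_fit(X, d):
    """PCA sketch. X has shape (D, N) with samples as columns.

    Returns the mean (D, 1), a projection matrix W (D, d) whose columns are
    the d leading eigenvectors of the sample covariance, and their variances.
    """
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean                                  # center the data
    cov = Xc @ Xc.T / (X.shape[1] - 1)             # sample covariance (D, D)
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]          # indices of the d largest
    return mean, eigvecs[:, order], eigvals[order]

# Synthetic data (an assumption for illustration): 5 features with decreasing scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200)) * np.array([[3.0], [2.0], [1.0], [0.5], [0.1]])
mean, W, variances = pca_fit(X, d=2)
Z = W.T @ (X - mean)                               # low-dimensional embeddings (2, N)
```

The variance of each row of `Z` equals the corresponding eigenvalue, which is the sense in which the projection directions have maximal variance. No label information enters the computation.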
2.1.2. LDA
LDA [
7], well known as a supervised dimensionality reduction method, aims to seek projection directions to minimize the intra-class scattering and maximize the inter-class scattering for the low-dimensional embeddings. However, for LDA, if the dimensionality of data is far greater than the data size, the intra-class scattering matrix may suffer from the singularity problem and thus it influences the solution of this dimensionality reduction algorithm. Furthermore, since the rank of the inter-class scattering matrix is at most
$C-1$, the number of available projection directions of LDA is at most
$C-1$, where
C is the number of classes.
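The rank bound can be checked numerically. The sketch below builds the intra-class and inter-class scatter matrices in their standard form on synthetic data; the function name `lda_scatter`, the data and the rank tolerance are assumptions made for illustration.

```python
import numpy as np

def lda_scatter(X, y):
    """Return (S_w, S_b) for X of shape (D, N) with samples as columns."""
    D = X.shape[0]
    mu = X.mean(axis=1, keepdims=True)             # global mean
    S_w = np.zeros((D, D))
    S_b = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[:, y == c]
        mu_c = Xc.mean(axis=1, keepdims=True)      # class mean
        S_w += (Xc - mu_c) @ (Xc - mu_c).T         # intra-class scatter
        S_b += Xc.shape[1] * (mu_c - mu) @ (mu_c - mu).T  # inter-class scatter
    return S_w, S_b

# Synthetic three-class data in six dimensions (illustrative assumption).
rng = np.random.default_rng(0)
C, D, n_per_class = 3, 6, 40
y = np.repeat(np.arange(C), n_per_class)
X = rng.normal(size=(D, C * n_per_class)) + 4.0 * rng.normal(size=(D, C))[:, y]

S_w, S_b = lda_scatter(X, y)
mu = X.mean(axis=1, keepdims=True)
S_total = (X - mu) @ (X - mu).T
# Explicit tolerance so that round-off does not inflate the numerical rank.
rank_b = np.linalg.matrix_rank(S_b, tol=1e-8 * np.linalg.norm(S_b, 2))
```

Because the weighted class-mean deviations sum to zero, `S_b` is built from at most C - 1 linearly independent vectors, so `rank_b` cannot exceed C - 1 regardless of D.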
2.1.3. Manifold Learning Methods
Both PCA and LDA are global methods that use global information to project the original data into a subspace and obtain the low-dimensional data representations. However, for highly nonlinear data structures, these linear methods cannot learn the nonlinear relationships between data, and thus the results are not ideal. By assuming that the high-dimensional data have a low-dimensional manifold structure, manifold learning algorithms can nonlinearly map the high-dimensional data onto their low-dimensional manifold. Among manifold learning methods, methods based on local geometric information, such as LLE, LE and locality preserving projection (LPP) [
32] are widely used. The ideas behind them are as follows.
LLE [
12] preserves the linear reconstruction characteristics in a local neighborhood of each datum. Hence, the low-dimensional embeddings obtained by LLE present the local geometrical structure of the data manifold. LE [
13] preserves the similarities of the neighboring data points based on an adjacency matrix and a graph Laplacian matrix. However, for LLE and LE, as the nonlinear mapping function between the high-dimensional and low-dimensional spaces is not learned, we cannot easily obtain the low-dimensional representations of new data. To this end, LPP [
32] performs a linear approximation of LE, and successfully overcomes its drawback as mentioned above.
2.2. Graph Embedding
In [
14], Yan et al. show that some commonly used dimensionality reduction algorithms could be transformed into a unified framework despite their different motivations, and the unified framework is called graph embedding. This framework derives a low-dimensional feature space, which preserves the adjacency relationship between sample pairs. The general objective function of this framework is presented in Equation (
1), where
$\mathbf{W}$ denotes the similarity matrix of the undirected weighted graph
$\mathbf{G}=\{\mathbf{X},\mathbf{W}\}$ and
$\mathbf{B}$ is the constraint matrix defined to avoid a trivial solution of the objective function,
where
${\mathbf{S}}_{W}$ and
${\mathbf{S}}_{B}$ are matrices constructed with respect to
$\mathbf{X}$,
$\mathbf{W}$ and
$\mathbf{B}$, respectively.
We note that the unified graph embedding framework also gives researchers a way to design new dimensionality reduction algorithms. In particular, Yan et al. propose a novel dimensionality reduction method by defining an intrinsic graph which characterizes the intra-class compactness and a penalty graph which characterizes the inter-class separability in the graph embedding framework, and call it marginal Fisher analysis (MFA).
2.3. Trace Ratio Problems
As presented in the above subsection, within the context of graph embedding, the dimensionality reduction methods can be viewed as trying to obtain the transformation matrix
$\mathbf{W}$ that makes
$Tr\left({\mathbf{W}}^{T}{\mathbf{S}}_{p}\mathbf{W}\right)$ maximum and
$Tr\left({\mathbf{W}}^{T}{\mathbf{S}}_{l}\mathbf{W}\right)$ minimum. This is often formulated as a trace ratio optimization problem, that is
$\max_{\mathbf{W}}Tr\left({\mathbf{W}}^{T}{\mathbf{S}}_{p}\mathbf{W}\right)/Tr\left({\mathbf{W}}^{T}{\mathbf{S}}_{l}\mathbf{W}\right)$ [
33]. Generally, there are two kinds of solutions for this problem: (1) Simplifying the problem into a ratio trace problem:
$\max_{\mathbf{W}}Tr\left[{\left({\mathbf{W}}^{T}{\mathbf{S}}_{l}\mathbf{W}\right)}^{-1}\left({\mathbf{W}}^{T}{\mathbf{S}}_{p}\mathbf{W}\right)\right]$, then using generalized eigenvalue decomposition (GED) to obtain the transformation matrix
$\mathbf{W}$; (2) Directly optimizing the objective function through an iterative procedure, with each step presented as a trace difference problem:
$Tr\left[{\mathbf{W}}^{T}({\mathbf{S}}_{p}-{\lambda}^{n}{\mathbf{S}}_{l})\mathbf{W}\right]$. However, for the first solution, optimizing the ratio trace formulation may deviate from the original objective, which yields a closed-form but inexact solution and may introduce uncertainty into subsequent classification or clustering. For the second solution, Wang et al. [
33] propose an efficient iterative procedure by solving the trace difference problem in each iterative step, named iterative procedure (ITR). It is proven that ITR could converge to the optimal solution and solve the trace ratio problem. With the orthogonal assumption on the projection matrix, objective function of the ITR optimization can be defined as
In [
34], Nie et al. address the graph-based feature selection framework using the iterative process of the trace ratio problem. In [
35], Zhong et al. analyze the iterative procedures for the trace ratio problem and prove necessary and sufficient conditions of the existence of the optimal solution of trace ratio problems, which are that there is a sequence
$\{{\lambda}_{1}^{\ast},{\lambda}_{2}^{\ast},\dots ,{\lambda}_{n}^{\ast}\}$ that converges to
${\lambda}^{\ast}$ as
$n\to +\infty $, where
${\lambda}^{\ast}$ is the optimal value of Equation (
2). Based on these previous works, we also formulate RFA as a trace ratio problem and theoretically prove the convergence of its optimization algorithm.
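To make the iterative trace difference procedure concrete, the following is a minimal NumPy sketch of an ITR-style iteration for the maximization form of the trace ratio problem under the orthogonality constraint. It is a sketch under stated assumptions rather than the implementation of [33]: the function name `trace_ratio_itr`, the random initialization, the stopping tolerance and the test matrices are ours, and the second matrix is assumed positive definite so that the denominator is always positive.

```python
import numpy as np

def trace_ratio_itr(S_p, S_l, d, max_iter=100, tol=1e-10, seed=0):
    """Maximize tr(W^T S_p W) / tr(W^T S_l W) subject to W^T W = I."""
    D = S_p.shape[0]
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.normal(size=(D, d)))   # random orthonormal start
    lam = np.trace(W.T @ S_p @ W) / np.trace(W.T @ S_l @ W)
    history = [lam]
    for _ in range(max_iter):
        # Trace difference step: take the d leading eigenvectors of S_p - lam * S_l.
        _, vecs = np.linalg.eigh(S_p - lam * S_l)  # eigenvalues ascending
        W = vecs[:, -d:]
        new_lam = np.trace(W.T @ S_p @ W) / np.trace(W.T @ S_l @ W)
        history.append(new_lam)
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return W, lam, history

# Illustrative symmetric matrices: S_p is psd, S_l is positive definite.
rng = np.random.default_rng(1)
A = rng.normal(size=(8, 8))
B = rng.normal(size=(8, 8))
S_p = A @ A.T
S_l = B @ B.T + np.eye(8)
W, lam, history = trace_ratio_itr(S_p, S_l, d=3)
# At a fixed point, the maximal trace difference vanishes.
residual = np.sum(np.linalg.eigvalsh(S_p - lam * S_l)[-3:])
```

Each step can only increase the ratio, because the new matrix makes the trace difference at least as large as the zero value attained by the previous iterate; this is why `history` is non-decreasing.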
2.4. Relational Learning
In many real-world applications, data generally share some kinds of relations, such as geometrical or semantic similarity, links or citations. The relational information encoded inside data provides valuable evidence for tasks such as classification and retrieval. To this end, relational learning is often integrated into representation learning models.
In [
36], Duin et al. prove that it is possible to use only proximity measures (distances or similarities) to represent the samples rather than mapping the feature vectors to the low-dimensional space. In addition, they propose a proximity description-based dimensionality reduction method called relational discriminant analysis (RDA) in [
24]. Instead of data, RDA uses similarities to a subset of objects in the training data as features. In this case, dimensionality reduction can be conducted either by selection methods (such as random selection [
37], systematic selection [
38]), or by feature extraction methods (such as multidimensional scaling [
39], Sammon mapping [
40] and Niemann mapping [
41]).
In [
25], Li et al. model the covariance of data with the relationships between instances and propose a Gaussian latent variable model which successfully integrates relational information into the dimensionality reduction process, called probabilistic relational PCA (PRPCA). In PRPCA, relational information is defined by the relevance between data samples. We take scientific paper citations as an example. If one paper cites another, the two papers most likely have similar topics. To take the mutual influence between cited papers into account, Li et al. further construct a matrix
$\mathsf{\Phi}={\mathsf{\Delta}}^{-1}$, which satisfies the condition that similar instances often have a lower probability density at the latent space. To this end, PRPCA, based on the relational covariance
$\mathsf{\Phi}$, successfully applies the relational information to the dimensionality reduction algorithms.
Relational learning is also commonly used for data mining, information retrieval and other machine learning-related applications. Paccanaro et al. [
42] propose a method, called linear relational embedding, for the distributed representations of data, where data consist of the relationship of concepts. Wang et al. [
43] exploit the fact that existing relations between items are often useful in recommendation systems and propose a model called relational collaborative topic regression (RCTR), which expands the traditional CTR model by integrating feedback information, item content information and relational information. Xuan et al. [
44] propose a nonparametric relational topic model using stochastic processes instead of fixed-dimensional probability distributions.
Based on the classical works mentioned above, we propose a general and effective dimensionality reduction framework named relational Fisher analysis (RFA). This framework uses graph embedding [
14] as theoretical foundation and integrates the relational information [
24,
25] encoded inside data into the dimensionality reduction process. Besides the intrinsic graph and the penalty graph as defined using graph embedding, we further construct a relational graph based on the existing relationships between data, which enables the desired low-dimensional space to preserve the intrinsic information, reduce the penalty information and further learn and preserve the relational information among the data samples. In addition, through derivation and equivalent transformations, the objective function of our proposed method can be transformed into the trace ratio form for optimization. Based on a systematic analysis of the two optimization methods for trace ratio problems [
33], we propose a novel iterative algorithm which uses the value of the trace ratio as criterion for the algorithmic convergence. In addition, by further introducing the ITRScore defined in [
34] into the iterative process, optimal projection directions are learned, which improves the effectiveness of the proposed RFA model.
3. Methodology
In this section, we first present some notations used in our work. The iterative steps, the optimization method and the proof of global convergence of RFA are then introduced in detail.
3.1. Notation
Matrices are represented in uppercase bold letters, for instance, $\mathbf{A}$, while vectors are represented in boldface lowercase letters, for instance, $\mathbf{a}$, and ${\mathbf{a}}_{i}$ is the i-th element of $\mathbf{a}$. ${\mathbf{A}}_{i\ast}$ and ${\mathbf{A}}_{\ast j}$ denote the i-th row and j-th column of a matrix $\mathbf{A}$; therefore, the element of the i-th row and j-th column of the matrix is represented by ${\mathbf{A}}_{ij}$. The trace of $\mathbf{A}$ is denoted by $\mathrm{tr}\left(\mathbf{A}\right)$ and the transpose of $\mathbf{A}$ is denoted by ${\mathbf{A}}^{T}$. In addition, $|{\mathbf{A}}_{ij}|$ is the absolute value of ${\mathbf{A}}_{ij}$, and ${\parallel \mathbf{A}\parallel}_{F}$ is the Frobenius norm of $\mathbf{A}$. If $\mathbf{A}$ is positive definite, we write $\mathbf{A}\succ 0$; if it is positive semidefinite (psd), we write $\mathbf{A}\succeq 0$.
In a learning task that contains multiple classes of data, we usually have a dataset $\{\{{\mathbf{X}}_{\ast i},{\mathbf{y}}_{i}\}\in {\Re}^{D}\times {\Re}^{1},i=1,2,\dots ,N\}$, where each ${\mathbf{X}}_{\ast i}$ represents a sample and ${\mathbf{y}}_{i}\in \{1,2,\dots ,C\}$ is the class of that sample, $C\geqslant 2$ is the total number of classes and N is the total number of samples. For a linear dimensionality reduction task, we hope to find a projection matrix $\mathbf{W}$ and obtain the d-dimensional representation ${\mathbf{Z}}_{\ast i}$ of ${\mathbf{X}}_{\ast i}$ by ${\mathbf{Z}}_{\ast i}=\mathbf{W}\ast {\mathbf{X}}_{\ast i}$, where ${\mathbf{Z}}_{\ast i}\in {\Re}^{d},i=1,2,\dots ,N$, and $d<D$ is the dimensionality of the output.
3.2. Formulation of RFA
As discussed in
Section 2, graph embedding has already been proven to be a general framework for dimensionality reduction. However, graph embedding has two shortcomings. First, it does not capture and preserve the relational information between data. Second, the graph embedding framework is solved by generalized eigenvalue decomposition, which is only an approximate approach. Inspired by graph embedding, we propose a new dimensionality reduction framework called RFA, which integrates relationships among data into the dimensionality reduction model and can alleviate the two problems mentioned above. The formulation of the proposed RFA is described as follows.
We use
$\mathbf{R}\in {\Re}^{N\times N}$ to denote the relational matrix. The dimensionality reduction framework RFA is modeled as
where
${\mathbf{L}}_{I}$ and
${\mathbf{L}}_{P}$ define the intrinsic and penalty graphs, respectively, and
$\lambda \geqslant 0$ is a hyperparameter. Specifically, we only consider undirected graphs and assume that
${\mathbf{L}}_{I}$,
${\mathbf{L}}_{P}$ and
$\mathbf{R}$ are symmetric and psd. Based on this formulation and these assumptions, the generality of RFA can be explained from the following two points:
(1) If $\lambda =0$, our algorithm can be simplified to a basic graph embedding model, so that some commonly used dimensionality reduction algorithms can be regarded as special cases of RFA;
(2) Otherwise, if $\mathcal{L}$ only contains relational information, RFA can be considered to use relational learning to reduce the dimensionality of data. For instance, the MDS algorithm is a special RFA algorithm under this condition.
3.3. Optimization of RFA
We reformulate Problem (
3) as
where
${\mathbf{S}}_{I}=\mathbf{X}{\mathbf{L}}_{I}{\mathbf{X}}^{T}$,
${\mathbf{S}}_{P}=\mathbf{X}{\mathbf{L}}_{P}{\mathbf{X}}^{T}$ and
${\mathbf{S}}_{R}=\mathbf{X}\mathbf{R}{\mathbf{X}}^{T}$.
As
$\mathbf{R}$ is psd,
${\mathbf{S}}_{R}$ is as well. We suppose
${\mathbf{S}}_{R}=\mathbf{U}\mathsf{\Lambda}{\mathbf{U}}^{T}$. We have
where
$\mathbf{W}=\mathbf{U}{\mathsf{\Lambda}}^{-1/2}\mathbf{V}$,
${\tilde{\mathbf{S}}}_{I}={\mathsf{\Lambda}}^{-1/2}{\mathbf{U}}^{T}{\mathbf{S}}_{I}\mathbf{U}{\mathsf{\Lambda}}^{-1/2}$ and
${\tilde{\mathbf{S}}}_{P}={\mathsf{\Lambda}}^{-1/2}{\mathbf{U}}^{T}{\mathbf{S}}_{P}\mathbf{U}{\mathsf{\Lambda}}^{-1/2}$.
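The effect of this change of variables can be verified directly. The derivation below is a worked check under two assumptions that we state explicitly: the exponents are $-1/2$, so that the relational term is normalized, and $\mathsf{\Lambda}$ is invertible, i.e., ${\mathbf{S}}_{R}$ is positive definite or its zero-eigenvalue directions have been discarded beforehand. It uses only ${\mathbf{U}}^{T}\mathbf{U}=\mathbf{I}$.

```latex
\begin{aligned}
\mathbf{W}^{T}\mathbf{S}_{R}\mathbf{W}
  &= \mathbf{V}^{T}\mathsf{\Lambda}^{-1/2}\mathbf{U}^{T}
     \left(\mathbf{U}\mathsf{\Lambda}\mathbf{U}^{T}\right)
     \mathbf{U}\mathsf{\Lambda}^{-1/2}\mathbf{V}
   = \mathbf{V}^{T}\mathbf{V}, \\
\mathbf{W}^{T}\mathbf{S}_{I}\mathbf{W}
  &= \mathbf{V}^{T}\left(\mathsf{\Lambda}^{-1/2}\mathbf{U}^{T}\mathbf{S}_{I}
     \mathbf{U}\mathsf{\Lambda}^{-1/2}\right)\mathbf{V}
   = \mathbf{V}^{T}\tilde{\mathbf{S}}_{I}\mathbf{V}, \\
\mathbf{W}^{T}\mathbf{S}_{P}\mathbf{W}
  &= \mathbf{V}^{T}\tilde{\mathbf{S}}_{P}\mathbf{V}.
\end{aligned}
```

In other words, the relational term becomes a plain quadratic term in $\mathbf{V}$, while the intrinsic and penalty terms keep their form with the transformed matrices.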
We let
$\frac{\partial \mathcal{L}}{\partial \mathbf{V}}=\mathbf{0}$. We have
Equation (
7) can be rewritten as
We let
$\eta =\frac{\mathrm{tr}\left({\mathbf{V}}^{T}{\tilde{\mathbf{S}}}_{I}\mathbf{V}\right)}{\mathrm{tr}\left({\mathbf{V}}^{T}{\tilde{\mathbf{S}}}_{P}\mathbf{V}\right)}$ and
$\tilde{\lambda}=\lambda \mathrm{tr}\left({\mathbf{V}}^{T}{\tilde{\mathbf{S}}}_{P}\mathbf{V}\right)$. We obtain
It can be seen from Equation (
9) that the columns of matrix
$\mathbf{V}$ are the eigenvectors of
${\tilde{\mathbf{S}}}_{I}-\eta {\tilde{\mathbf{S}}}_{P}$, where
$\eta $ is a parameter related to
$\mathbf{V}$.
Without loss of generality, we assume
${\mathbf{V}}^{T}\mathbf{V}={\mathbf{I}}_{d}$, where
${\mathbf{I}}_{d}$ is an identity matrix. Hence, we have the following constrained trace ratio problem [
33,
45]:
Problem (
10) can be solved with the iterative method similar to that in [
33,
45]. The specific steps are as follows:
(1)
Removing the null space of ${\mathbf{S}}_{t}={\tilde{\mathbf{S}}}_{I}+{\tilde{\mathbf{S}}}_{P}$ [46]. We assume that
${\mathbf{S}}_{t}=\tilde{\mathbf{U}}\tilde{\mathsf{\Lambda}}{\tilde{\mathbf{U}}}^{T}$, where
$\tilde{\mathsf{\Lambda}}$ is a diagonal matrix and
$\tilde{\mathbf{U}}$ contains the eigenvectors of
${\mathbf{S}}_{t}$ corresponding to nonzero eigenvalues. Therefore, Formula (
10) can be transformed to
where
$\mathbf{V}=\tilde{\mathbf{U}}\tilde{\mathbf{W}},\tilde{\mathbf{W}}\in {\Re}^{r\times d}$,
${\widehat{\mathbf{S}}}_{I}={\tilde{\mathbf{U}}}^{T}{\tilde{\mathbf{S}}}_{I}\tilde{\mathbf{U}}$ and
${\widehat{\mathbf{S}}}_{P}={\tilde{\mathbf{U}}}^{T}{\tilde{\mathbf{S}}}_{P}\tilde{\mathbf{U}}$. We can further rewrite the problem (
11) as
where
${\widehat{\mathbf{S}}}_{T}={\tilde{\mathbf{U}}}^{T}({\tilde{\mathbf{S}}}_{I}+{\tilde{\mathbf{S}}}_{P})\tilde{\mathbf{U}}={\widehat{\mathbf{S}}}_{I}+{\widehat{\mathbf{S}}}_{P}$. Since
${\widehat{\mathbf{S}}}_{T}$ is positive definite, for any orthonormal matrix
$\tilde{\mathbf{W}}$, Problem (
12) satisfies that the denominator is positive.
(2)
Efficient iterative optimization. The original trace ratio problem (
12) can be rewritten as a trace difference problem:
where
$\tilde{\eta}$ is a parameter which can be calculated in the iterative process. In the iterative process, we first randomly initialize the target matrix
$\tilde{\mathbf{W}}$ to be an arbitrary orthogonal matrix as
${\tilde{\mathbf{W}}}_{0}\in {\Re}^{r\times d}$, and then calculate
${\tilde{\eta}}_{0}=\frac{\mathrm{tr}\left({\tilde{\mathbf{W}}}_{0}^{T}{\widehat{\mathbf{S}}}_{I}{\tilde{\mathbf{W}}}_{0}\right)}{\mathrm{tr}\left({\tilde{\mathbf{W}}}_{0}^{T}{\widehat{\mathbf{S}}}_{T}{\tilde{\mathbf{W}}}_{0}\right)}$. By using the calculated
${\tilde{\eta}}_{0}$, we can obtain
${\tilde{\mathbf{W}}}_{1}$ by solving Problem (
13). In the end, through several iterations, we obtain
${\tilde{\mathbf{W}}}_{T}$, where
T is the number of iterations, chosen such that
$|{\tilde{\eta}}_{T}-{\tilde{\eta}}_{T-1}|<\epsilon $ (
$\epsilon ={10}^{-5}$ is used in our experiments). Then,
${\tilde{\mathbf{W}}}_{T}$ is the optimal solution of Problem (
12). In the next section, we prove that RFA converges globally. To improve the effectiveness of our method, we select some superior projection directions for each
${\tilde{\mathbf{W}}}_{t}$, as performed in [
47]. Our selection criterion is
where
$\Phi $ is a set of
$r\times d$ matrices with columns formed by eigenvectors of
${\widehat{\mathbf{S}}}_{I}-{\tilde{\eta}}_{t-1}{\widehat{\mathbf{S}}}_{T}$. We use the eigenvectors corresponding to the
d smallest ITR-scores [
48] to initialize the selection.
Algorithm 1 specifically describes the iterative procedure of Problem (
12).
Algorithm 1 Optimization of Problem (12)
1: Initialization: Initialize $\tilde{\mathbf{W}}$ as an orthonormal matrix ${\tilde{\mathbf{W}}}_{0}\in {\Re}^{r\times d}$ and let ${\tilde{\eta}}_{0}=0$.
2: Iterations:
3: for $t=0$ to MaxIt do
4: (1) Compute ${\tilde{\eta}}_{t}$ as
5: (2) Solve the eigenvalue decomposition problem:
6:
7: (4) Use ${\left\{{\widehat{\mathbf{W}}}_{\ast i}\right\}}_{i=1}^{r}$ to initialize ${\tilde{\mathbf{W}}}_{t}$ and solve the following problem:
where $\Phi $ is a set of matrices with columns formed by ${\left\{{\widehat{\mathbf{W}}}_{\ast i}\right\}}_{i=1}^{r}$.
8: if $|{\tilde{\eta}}_{t}-{\tilde{\eta}}_{t-1}|<\epsilon $ ($\epsilon ={10}^{-5}$ is used in our experiments) then
9: Break.
10: end if
11: end for
12: Output: ${\tilde{\mathbf{W}}}_{t}$.
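The overall procedure (null-space removal followed by the iterative trace difference steps) can be sketched in NumPy as follows. This is a simplified illustration under our own assumptions, not the exact algorithm: the ITR-score-based selection of projection directions is replaced by simply taking the eigenvectors with the d smallest eigenvalues, the first ratio is computed from a random orthonormal start, and the function name `rfa_trace_ratio_min`, the null-space threshold and the test matrices are hypothetical.

```python
import numpy as np

def rfa_trace_ratio_min(S_I, S_P, d, max_it=100, eps=1e-5, seed=0):
    """Minimize tr(W^T S_I W) / tr(W^T S_T W) with S_T = S_I + S_P, W^T W = I."""
    # Step (1): remove the null space of S_t = S_I + S_P.
    S_t = S_I + S_P
    vals, vecs = np.linalg.eigh(S_t)
    keep = vals > 1e-10 * vals.max()               # threshold is an assumption
    U = vecs[:, keep]                              # (D, r) basis of the range of S_t
    S_I_hat = U.T @ S_I @ U
    S_T_hat = U.T @ S_t @ U                        # positive definite after reduction
    r = U.shape[1]

    # Step (2): iterative optimization in the reduced space.
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.normal(size=(r, d)))   # orthonormal initialization
    etas = []
    eta_prev = None
    for _ in range(max_it):
        eta = np.trace(W.T @ S_I_hat @ W) / np.trace(W.T @ S_T_hat @ W)
        etas.append(eta)
        if eta_prev is not None and abs(eta - eta_prev) < eps:
            break
        # Trace difference step: eigenvectors with the d smallest eigenvalues.
        _, E = np.linalg.eigh(S_I_hat - eta * S_T_hat)
        W = E[:, :d]
        eta_prev = eta
    return U @ W, etas                             # map back to the original space

# Illustrative psd matrices sharing a 5-dimensional range inside a 6-dimensional space.
rng = np.random.default_rng(1)
Q = rng.normal(size=(6, 5))
X1 = Q @ rng.normal(size=(5, 30))
X2 = Q @ rng.normal(size=(5, 30))
S_I, S_P = X1 @ X1.T, X2 @ X2.T
V, etas = rfa_trace_ratio_min(S_I, S_P, d=2)
```

In this simplified form the sequence of ratios is non-increasing, since each new matrix makes the trace difference no larger than the zero value attained by the previous iterate.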

5. Experiments
In this section, extensive experiments are conducted to validate the effectiveness of our RFA and KRFA methods. For the linear case, we conduct experiments on document analysis, handwritten digit recognition, face recognition and webpage classification problems. For KRFA, we test its performance on several benchmark datasets. The results of comparative experiments are presented below.
5.1. Performance of RFA
To evaluate the performance of RFA, we selected several related dimensionality reduction methods for comparison with RFA. These methods were LDA, MFA, RDA, and PRPCA, respectively. Among them, LDA, MFA and RDA are supervised methods, while PRPCA is unsupervised. We note that, due to the flexibility in the design of the relational matrix, RFA could be either global or local. We describe this detail in the following part.
We followed MFA to construct the intrinsic and penalty graphs [
14]. The numbers of nearest neighbors for constructing the intrinsic graph (
${k}_{i}$) and the penalty graph (
${k}_{p}$) were set to 5 and 20, respectively, for all the datasets.
We applied RFA to document understanding, face recognition and several other recognition tasks. In order to test the performance of RFA on document recognition tasks, we used the Ibn Sina ancient Arabic document dataset [
49], USPS handwritten digits dataset (
http://www.cs.nyu.edu/∼roweis/data.html, accessed on 19 October 2023), two handwritten digits datasets (Optdigits and Pendigits) and one English letter dataset (Letter) from the UCI machine learning repository [
50]. For face recognition tasks, we used two face datasets [
51,
52,
53,
54], the CMU PIE (
http://www.facerec.org/databases/, accessed on 19 October 2023) and the YaleB (
http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html, accessed on 19 October 2023) datasets. At the same time, we used some UCI datasets, including Shuttle, Thyroid, Vowel and Waveform21, to evaluate RFA.
In our experiments, we used the classical graph Laplacian matrix,
$\mathbf{L}=\mathbf{D}-\mathbf{M}$, to define the relational matrix, whose weight matrix
$\mathbf{M}$ is defined as follows:
where
${\mathcal{N}}_{k}\left({\mathbf{x}}_{i}\right)$ is the set consisting of the
k nearest neighbors of
${\mathbf{x}}_{i}$. For each dataset, the value of
k was selected based on 5-fold cross-validation. Moreover,
$\mathbf{D}$ is the diagonal degree matrix with
${\mathbf{D}}_{ii}={\sum}_{j}{\mathbf{M}}_{ij}$.
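A possible construction of such a relational matrix is sketched below. The weighting scheme is an assumption made for illustration: two samples receive weight 1 if either is among the k nearest neighbors of the other, and 0 otherwise; the function name `knn_laplacian` and the random data are likewise hypothetical.

```python
import numpy as np

def knn_laplacian(X, k):
    """Graph Laplacian L = D - M of a symmetrized binary k-NN graph.

    X has shape (D, N) with samples as columns.
    """
    N = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X   # squared Euclidean distances
    np.fill_diagonal(dist2, np.inf)                     # a sample is not its own neighbor
    nn = np.argsort(dist2, axis=1)[:, :k]               # k nearest neighbors per sample
    M = np.zeros((N, N))
    M[np.repeat(np.arange(N), k), nn.ravel()] = 1.0
    M = np.maximum(M, M.T)                              # symmetrize (undirected graph)
    degree = np.diag(M.sum(axis=1))                     # diagonal degree matrix
    return degree - M

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 30))
L = knn_laplacian(X, k=5)
min_eig = np.linalg.eigvalsh(L).min()
```

With non-negative symmetric weights, the resulting matrix is symmetric and psd, which matches the assumption placed on the relational matrix in the formulation of RFA.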
We note that the Laplacian weight matrix formed by the above formula contains the relationship information between the sample and a certain number of its neighbors, which allows the integration of the local relationships between data into the supervised representation learning algorithm. At the same time, as a general model, the relationship matrix $\mathbf{R}$ in RFA can be of various forms. For example, $\mathbf{R}$ can be the centralization matrix $\mathbf{H}={\mathbf{I}}_{N}-\frac{1}{N}\mathbf{1}{\mathbf{1}}^{T}$, where N denotes the data size and $\mathbf{1}$ is a column vector of length N with all ones.
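To see why the centralization matrix yields a global relational term, note that with this choice the product of the data matrix, the relational matrix and the transposed data matrix equals the total scatter matrix of the data. The short check below uses random data as an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 4, 50
X = rng.normal(size=(D, N))                  # samples as columns

H = np.eye(N) - np.ones((N, N)) / N          # centralization matrix
S_R = X @ H @ X.T                            # relational scatter with R = H

mean = X.mean(axis=1, keepdims=True)
S_total = (X - mean) @ (X - mean).T          # total scatter matrix
```

Here `S_R` and `S_total` agree up to round-off, and `H` is symmetric and idempotent, hence psd, so it is an admissible relational matrix.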
For the PRPCA algorithm, we performed this experiment using the codes provided by the authors. For the RDA algorithm, we randomly selected the prototypes [
24]. For algorithms other than RDA, we used the 1-nearest neighbor classifier to evaluate their classification performance.
5.1.1. Comparison on Handwriting Datasets
We first tested the performance of RFA on the handwriting datasets. For clarity, the details of the used datasets are shown in
Table 1.
Ibn Sina dataset: This dataset [
55] is an ancient manuscript dataset, one image of which is shown in
Figure 1, and we tried to identify the Arabic subwords on this dataset. In the experiment, we used a 50-page manuscript as the training set and 10 pages as the test set. We extracted the square-root velocity (SRV) representation [
56] of the Arabic subwords. Then, we removed the outlier classes, including the classes that had fewer than 10 samples. Finally, we obtained a 174-class Arabic subword dataset with 17,543 samples for training and 3125 samples for testing.
In this experiment, we set the number of nearest neighbors,
k, of RFA to 8 and compared RFA with the LDA and MFA dimensionality reduction algorithms. The classification accuracies after using these three dimensionality reduction methods to map the data to different dimensionalities are shown in
Figure 2. We can see that RFA is far better than the LDA algorithm and slightly better than the MFA algorithm. At the same time, when the dimensionality is from 50 to
$C-1$ (
C is the number of classes), the accuracy of the MFA algorithm fluctuates and tends to decline, while the classification performance of our RFA algorithm is relatively stable, which means that our algorithm is more robust than MFA.
USPS dataset: The USPS dataset is a U.S. Postal Service handwritten digit dataset that contains 7291 training samples and 2007 test samples from 10 classes, and the dimensionality of the data features is 256. Some handwritten digits in the USPS dataset are shown in
Figure 3.
In this experiment, we set the number of nearest neighbors,
k, of RFA to 24. The classification accuracies obtained by RFA and the compared algorithms are shown in
Figure 4. Because the LDA subspace has at most
$C-1$ dimensions (
C is the number of classes), LDA is only presented with a black star in the figure. We can see that when the dimension is 9, RFA obtains results comparable to those of LDA and MFA. When the dimension increases, the results of RFA are always the best. At the same time, the classification accuracy obtained by RDA is low.
Figure 5 presents a 3D visualization of the data representations learned by RFA, which shows that RFA yields better classification boundaries between the classes. In
Figure 6, we show 2D projections of the data learned by both RFA and MFA to further illustrate the effectiveness of RFA. The samples processed by RFA are less likely to overlap at the class boundaries, indicating that, compared to MFA, RFA preserves more of the properties that help distinguish the samples.
In addition, we tested the robustness of RFA with respect to
${k}_{i}$ and
${k}_{p}$ (used to construct the intrinsic and penalty graphs). From
Figure 4, it can be seen that RFA obtained the best result when the subspace dimension was 35, with parameter settings
${k}_{i}$ = 5,
${k}_{p}$ = 20 and
k = 24. We fixed
k and one of
${k}_{i}$ and
${k}_{p}$ to obtain the results when the other parameter took different values.
Figure 7 and
Figure 8 show that RFA is robust to the choice of these parameters.
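To make the roles of ${k}_{i}$ and ${k}_{p}$ concrete, the following is a minimal sketch of an MFA-style construction of intrinsic and penalty graphs. It is an assumption for illustration only: the function name `knn_graphs` is hypothetical, the penalty graph uses a simplified per-sample rule, and the graphs actually used by RFA may be built differently.

```python
import numpy as np

def knn_graphs(X, y, k_i, k_p):
    """Return symmetric 0/1 intrinsic and penalty adjacency matrices.

    Intrinsic graph: each sample is linked to its k_i nearest same-class samples.
    Penalty graph:   each sample is linked to its k_p nearest other-class samples
                     (simplified per-sample variant, assumed for illustration).
    """
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared Euclidean distances
    np.fill_diagonal(dist, np.inf)                     # a sample is not its own neighbor
    W_int = np.zeros((n, n))
    W_pen = np.zeros((n, n))
    for i in range(n):
        same = np.flatnonzero(y == y[i])
        same = same[same != i]
        diff = np.flatnonzero(y != y[i])
        near_same = same[np.argsort(dist[i, same])[:k_i]]
        near_diff = diff[np.argsort(dist[i, diff])[:k_p]]
        W_int[i, near_same] = W_int[near_same, i] = 1.0
        W_pen[i, near_diff] = W_pen[near_diff, i] = 1.0
    return W_int, W_pen

# Toy data: two classes on a line (synthetic, not from the paper).
X = np.array([[0.0], [1.0], [2.5], [10.0], [11.0], [12.5]])
y = np.array([0, 0, 0, 1, 1, 1])
W_int, W_pen = knn_graphs(X, y, k_i=1, k_p=1)
print(W_int)
print(W_pen)
```

Larger ${k}_{i}$ and ${k}_{p}$ add more edges to the respective graphs, which is the effect whose influence on accuracy the robustness experiment examines.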
We also selected three document-recognition-related datasets from the UCI machine learning repository to further test the effectiveness of RFA: Optdigits, Pendigits and Letter. Optdigits was preprocessed by NIST [
57] programs to obtain 5620 instances of 8 × 8 dimensions. The Pendigits dataset contains preprocessed 16-dimensional samples written by 44 different writers, including 7494 training samples and 3498 test samples. Letter consists of 20,000 characters in 20 fonts covering the 26 capital letters of the English alphabet. We used 5-fold cross-validation for these experiments.
Table 2,
Table 3 and
Table 4 show the classification accuracy and standard deviation obtained on these three datasets; the best results are in boldface. We can see that RFA performs consistently better than the other compared methods.
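The 5-fold protocol that produces a mean accuracy and a standard deviation can be sketched as follows. This is a minimal illustration, not the authors' code: `kfold_indices`, `cross_validate` and `one_nn_accuracy` are hypothetical names, the 1-nearest-neighbor classifier is a placeholder, and the data are synthetic.

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Yield (train, test) index arrays for k shuffled, disjoint folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for f in range(k):
        train = np.concatenate([folds[g] for g in range(k) if g != f])
        yield train, folds[f]

def cross_validate(evaluate, n, k=5, seed=0):
    """Return mean and standard deviation of evaluate(train, test) over k folds."""
    accs = np.array([evaluate(tr, te) for tr, te in kfold_indices(n, k, seed)])
    return accs.mean(), accs.std()

# Synthetic two-class data standing in for features after dimensionality reduction.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
y = np.repeat([0, 1], 20)

def one_nn_accuracy(train, test):
    """Accuracy of a 1-nearest-neighbor classifier (placeholder classifier)."""
    d = ((X[test][:, None, :] - X[train][None, :, :]) ** 2).sum(axis=-1)
    pred = y[train][d.argmin(axis=1)]
    return float((pred == y[test]).mean())

mean_acc, std_acc = cross_validate(one_nn_accuracy, n=len(y))
print(f"accuracy: {mean_acc:.4f} +/- {std_acc:.4f}")
```

In an actual experiment, the projection would be learned on each training fold only and then applied to the held-out fold, so that no test information leaks into the dimensionality reduction step.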
5.1.2. Comparison on Face Datasets
Here, we tested RFA on face recognition problems using the PIE and YaleB datasets. The details of these two datasets are shown in
Table 5. For the corresponding experimental settings, we set the number of nearest neighbors
k on the PIE dataset to 8 and
k on the YaleB dataset to 18.
Considering that RDA cannot perform well on multiclass classification problems, we used LPP instead as a compared method in these experiments. We used 5-fold cross-validation to decide the value of parameter
k for graph construction in LPP. We performed the experiments in different low-dimensional spaces on the PIE and YaleB datasets, and the experimental results are shown in
Table 6 and
Table 7.
We can see from the experimental results that RFA performs very well. Although LPP is an effective dimensionality reduction method for face recognition, RFA is significantly better than LPP. Moreover, RFA obtains results comparable to those of LDA and MFA. These results demonstrate the effectiveness of RFA in face recognition applications.
Additionally, the convergence of RFA was verified on these two datasets. As illustrated in
Figure 9 and
Figure 10, the value of
$\eta $ (trace ratio) decreases over the iterations until it reaches the globally optimal value
${\eta}^{\ast}$ on both datasets, which clearly shows the convergence of RFA.
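The monotone decrease of a trace ratio can be reproduced with a generic iterative trace-ratio scheme. The sketch below is not RFA's exact update rule: `trace_ratio_min`, `S_num` and `S_den` are hypothetical names, and the two matrices are random stand-ins for whatever numerator and denominator matrices a method defines.

```python
import numpy as np

def trace_ratio_min(S_num, S_den, d, n_iter=20, seed=0):
    """Minimise tr(W^T S_num W) / tr(W^T S_den W) over orthonormal W (n x d).

    S_num is symmetric, S_den is symmetric positive definite.
    Returns the final W and the trace ratio recorded at each iteration.
    """
    rng = np.random.default_rng(seed)
    n = S_num.shape[0]
    W, _ = np.linalg.qr(rng.standard_normal((n, d)))   # random orthonormal start
    history = []
    for _ in range(n_iter):
        eta = np.trace(W.T @ S_num @ W) / np.trace(W.T @ S_den @ W)
        history.append(eta)
        # Eigenvectors of the d smallest eigenvalues of S_num - eta * S_den
        # minimise tr(W^T (S_num - eta * S_den) W), whose value is then <= 0,
        # so the next trace ratio cannot exceed the current one.
        _, vecs = np.linalg.eigh(S_num - eta * S_den)  # eigenvalues in ascending order
        W = vecs[:, :d]
    return W, np.array(history)

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 10))
Y = rng.standard_normal((10, 10))
S_num = X @ X.T                  # symmetric positive semidefinite
S_den = Y @ Y.T + np.eye(10)     # symmetric positive definite
W, history = trace_ratio_min(S_num, S_den, d=3)
print(history)
```

Plotting `history` against the iteration index gives a non-increasing curve that flattens once the ratio stops changing, which is the kind of behavior a convergence plot of $\eta $ displays.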
5.1.3. Comparison on Other UCI Datasets
To evaluate the generalization ability of RFA, we conducted experiments on UCI datasets from other fields. The details of the datasets used are shown in
Table 8. For the corresponding experimental settings of these four datasets, we set the number of nearest neighbors
k to 15. For a fair comparison, the subspace dimension of each method was set to
$C-1$.
The results shown in
Table 9 demonstrate the advantage of RFA over the related approaches and indicate that it is effective across a range of applications.
5.1.4. Comparison on Document Classification and Webpage Classification Problems
Since RFA is a general dimensionality reduction framework, its relational matrix can be constructed with different strategies. In the previous sections, we considered the relationship between samples based on their class labels or similarity. However, in some complicated problems, relationships may be presented in other forms. For example, as indicated in [
25], if there is a reference relationship between two papers, they are likely to have the same topic. However, due to the sparse nature of the bag-of-words representation, the similarity between these two papers may be very low. Thus, to further test RFA, we designed a relational matrix based on the citation relevance between data samples to model RFA, and tested its effectiveness on document classification and webpage classification problems.
For this experiment, we used two datasets, Citeseer and WebKB (
https://linqsdata.soe.ucsc.edu/public/lbc/, accessed on 19 October 2023). We note that the WebKB dataset contains four subsets: Cornell, Texas, Washington and Wisconsin, and we show the experimental results of these four subsets separately. Each dataset contains bag-of-words representations of documents or webpages and citation links between the instances. Citeseer contains 3312 scientific documents from 6 different classes, and there are 4732 citation relations between the documents. WebKB consists of 877 webpages from 5 different classes, and there are 1608 page links within this dataset. We adopted the same strategy as in PRPCA to construct the relational matrix:
(1) Constructing the adjacency graph $\mathbf{A}$ according to the relevance between data samples. If there was a citation or link between samples i and j, then ${\mathbf{A}}_{ij}=1$; else, ${\mathbf{A}}_{ij}=0$.
(2) Letting $\tilde{\mathbf{D}}$ be the diagonal matrix with
${\tilde{\mathbf{D}}}_{ii}={\sum}_{j}{\mathbf{A}}_{ij}={\left(\mathbf{A}\mathbf{A}\right)}_{ii}$, then
$\mathbf{B}=\mathbf{A}\mathbf{A}-\tilde{\mathbf{D}}$.
(3) Defining $\mathbf{G}=2\mathbf{A}+\mathbf{B}$ as the relational matrix in RFA.
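The three steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: links are treated as undirected, self-links are ignored, step (2) is read as $\mathbf{B}=\mathbf{A}\mathbf{A}-\tilde{\mathbf{D}}$ with $\tilde{\mathbf{D}}$ the diagonal of $\mathbf{A}\mathbf{A}$, and the function name `relational_matrix` is hypothetical.

```python
import numpy as np

def relational_matrix(links, n):
    """Build G = 2A + B from a list of (i, j) links over n samples."""
    A = np.zeros((n, n))
    for i, j in links:
        if i != j:                 # ignore self-links (assumption)
            A[i, j] = 1.0
            A[j, i] = 1.0          # a link in either direction relates both samples
    AA = A @ A                     # (AA)_ij counts the common neighbors of i and j
    D = np.diag(A.sum(axis=1))     # D_ii = sum_j A_ij = (AA)_ii for binary symmetric A
    B = AA - D                     # zero diagonal, common-neighbor counts elsewhere
    return 2.0 * A + B

# Toy example: a chain 0 - 1 - 2 plus an isolated sample 3.
G = relational_matrix([(0, 1), (1, 2)], n=4)
print(G)
```

In the toy example, samples 0 and 2 are not linked directly but share neighbor 1, so they receive a nonzero entry through $\mathbf{B}$, while directly linked pairs receive the larger weight from $2\mathbf{A}$.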
We took PRPCA as the baseline method in this part. Experimental results are illustrated in
Figure 11; we can see that RFA achieves results comparable to those of PRPCA on all five datasets and is even better than PRPCA on some of them.
5.2. Performance of KRFA
To evaluate the effectiveness of KRFA, we tested its performance on several benchmark datasets from the UCI machine learning repository. The details of these datasets are shown in
Table 10. For the corresponding experimental settings, we set the number of nearest neighbors
k for these five datasets to 15. To avoid the singularity problem, we applied KPCA to retain 98% of the variance before performing KMFA and KRFA. We used a Gaussian kernel in the experiment and, for a fair comparison, set the subspace dimension of each method to
$C-1$.
Table 11 shows the comparison results obtained by KRFA, KMFA and RFA.
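The KPCA preprocessing step with a Gaussian kernel can be sketched as below. This is an illustrative sketch, not the authors' implementation: `gaussian_kernel` and `kpca_retain_variance` are hypothetical names, the kernel width `sigma` is an assumed free parameter whose value is not given in this section, and the data are synthetic.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gaussian (RBF) kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def kpca_retain_variance(K, ratio=0.98):
    """Project training samples onto the leading kernel principal components
    that together retain at least `ratio` of the variance in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    Kc = H @ K @ H                              # kernel matrix of centered features
    vals, vecs = np.linalg.eigh(Kc)             # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]      # reorder to descending
    vals = np.clip(vals, 0.0, None)             # remove tiny negative round-off
    cum = np.cumsum(vals) / vals.sum()
    m = int(np.searchsorted(cum, ratio) + 1)    # smallest m with cum[m - 1] >= ratio
    Z = vecs[:, :m] * np.sqrt(vals[:m])         # coordinates of the training samples
    return Z, m

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))                # synthetic data
K = gaussian_kernel(X, sigma=2.0)
Z, m = kpca_retain_variance(K, ratio=0.98)
print(Z.shape, m)
```

Discarding the directions with near-zero eigenvalues is what removes the rank deficiency that would otherwise make the matrices in the subsequent kernel method singular.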
As shown in
Table 11, the proposed KRFA obtained results comparable to, and in some cases better than, those of KMFA. Furthermore, the results of KRFA were better than those of RFA on all the datasets used. This superiority is especially evident on the Satimage dataset, where the performance of RFA was unsatisfactory, whereas KRFA performed effective nonlinear dimensionality reduction and thus obtained good results on the subsequent classification task. These two points clearly demonstrate the nonlinear dimensionality reduction ability of KRFA.