1. Introduction
Due to the gradually increasing requirements on system performance, product quality, and economic benefit, modern industrial processes have become ever more complicated [1]. Hence, modern industrial processes urgently need advanced fault detection and isolation (FDI) techniques. Reviewing the existing FDI methods, there are two main subclasses: the data-based ones and the model-based ones. In particular, the data-based methods have advanced rapidly with the continuous development of data collection and storage technologies, and have received broad attention in both academia and industry [2,3,4,5]. Recently, many mature data-based approaches have found successful applications in the FDI field, which is also the focus of this paper.
Among the data-based methods mentioned above, it is noteworthy that Fisher discriminant analysis (FDA) [6,7] and principal component analysis (PCA) [8,9] are the two most popular ones. In general, PCA and its extensions are unsupervised feature extraction methods that ignore the correlations among different faults; hence, they are more suitable for fault detection than for fault classification. In contrast, FDA is a supervised dimension reduction and feature extraction method: it selects a set of vectors that maximize the between-class distance while simultaneously minimizing the within-class distance, and it has become a research hotspot for classifying detected process abnormalities [10].
Despite the fact that FDA is a popular fault classification model, it tends to show degraded performance if the samples in a class exhibit multimodality. In response to this problem, a method named local Fisher discriminant analysis (LFDA) was originally presented in [11] for dimensionality reduction. LFDA aims to overcome the limitation of locality preserving projection (LPP), which cannot select the most discriminative basis vectors to construct the subspace. Simultaneously, LFDA addresses the unsatisfactory classification performance of traditional FDA on multimodal problems. By combining locality-preserving projection and Fisher discriminant analysis, LFDA enhances the discriminative power of features: it preserves similarity within the local structure of the data while optimizing global discriminability through the Fisher criterion, making it particularly suitable for handling multimodal data. Yu applied this method to complex chemical process monitoring [12] and showed high sensitivity in diagnosing multiple faults. After that, the authors of [13] extended LFDA to its multiway variation, which performed much better in classifying faults and detecting abnormal operating conditions in fed-batch operation. Furthermore, the authors of [14] developed the JFDA method to describe the process dataset from both the global and local views in a high-dimensional space and obtained satisfactory diagnosis performance on the Tennessee Eastman (TE) process. Recently, Zhong et al. [15] proposed the sparse LFDA model, which can exploit local data information from multiple dimensions and ease the problems of multimodality and nonlinearity. Ma et al. [16] presented a hierarchical strategy based on LFDA and canonical variable analysis (CVA) for hot strip mill process monitoring.
Nevertheless, the above LFDA-based methods typically assume that the relationships between samples have been correctly described. However, this assumption can be easily violated since the relations among the operating variables are intricate. In such a context, a variable selection strategy is necessary for feature filtering and better model interpretability, and it has drawn great attention in both industry and academia. Moreover, when the dimension of the dataset exceeds the number of samples (i.e., the small-sample-size (SSS) problem), LFDA-based methods confront a singularity problem due to the irreversibility of the within-class scatter matrix. Given the reliability of LFDA in the field of fault diagnosis and classification, the following two problems may be formidable challenges: one major challenge is how to implement proper improvements on LFDA to relax the hypothesis that every sample of the training data is labeled accurately beforehand; another challenge lies in the adverse effects on model classification performance caused by a singular data matrix.
To deal with the first predicament, an alternative way is to utilize the useful information from labeled and unlabeled samples simultaneously [17,18], so that both the supervised information and the intrinsic global structure are considered. Such methods are called semisupervised methods, and they have also been applied to the fault classification field. For example, ref. [18] proposed the semisupervised form of FDA (SFDA), which incorporates additional unlabeled samples when constructing the fault diagnosis model, showing better fault classification than FDA and PCA. Yan et al. presented semisupervised mixture discriminant analysis (SMDA) [19] to monitor batch processes, where both known and unknown faults were diagnosed correctly. Recently, researchers combined active learning with the semisupervised EDA model and applied it successfully to process industries [20], improving the applicability of the traditional EDA model in real industrial processes. However, semisupervised LFDA for fault classification has not yet been studied in the existing literature.
When encountering the likely and common SSS problem in practical scenarios, none of these approaches can be applied directly since the within-class scatter matrix in question is singular; if an appropriate technique is utilized to solve the SSS problem, the model could be more broadly applicable. Fortunately, several modified methods have been provided in [21,22] to relieve the SSS problem to some extent. Nevertheless, the intrinsic limitation of FDA has not been eliminated, and the SFDA model cannot completely solve the SSS problem either. Afterwards, Zhang et al. [23] proposed a novel FDA model from the matrix exponential perspective, which settles the SSS problem thoroughly and shows superior performance in various experiments. Soon after, Adil et al. [24] applied the EDA model to fault diagnosis in industrial processes with improved classification accuracy. Note that the authors of [25] also incorporated the exponential technique into LDE and showed better performance than current discriminant analysis techniques in face recognition. More recently, Yu et al. [26] developed the exponential slow feature analysis (SFA) model for adaptive monitoring, which can correctly identify various operation statuses in different simulation processes. Nevertheless, none of these advanced approaches aims to overcome, or provides a practical solution to, the SSS problem in LFDA.
Based on the above discussion and the current research status, we present the sparse-variable-selection-based exponential local Fisher discriminant analysis, referred to as SELFDA, which has not been reported in the fault diagnosis field before. The salient contributions of this paper lie in the following aspects:
The SELFDA can maximize the between-class separability and preserve the within-class local structure simultaneously through the localization factor. That is, the multimodality of the operating data is preserved along the sample dimension.
The least absolute shrinkage and selection operator (LASSO) is used to select the responsible variables for the SELFDA model effectively. Then the sparse discriminant optimization problem is formulated and solved by the minimization-maximization method. Thus, the data characteristics can be well exploited from the variable dimension.
Besides, the matrix exponential strategy is integrated into the framework of LFDA. As a consequence, the SELFDA method can function well when encountering the common SSS problem, regardless of the dimension of the input samples.
Although SELFDA is an LFDA-based method, it jointly overcomes the two limitations of conventional LFDA. Thus, SELFDA is more feasible and universal in engineering practice. To the best of our knowledge, this paper is also the first to leverage SELFDA for fault classification of a real-world diesel engine.
The outline of this work is structured as follows. In Section 2, the classic LFDA model is reviewed. The motivation of this study as well as the specific description of the proposed SELFDA algorithm is presented in detail in Section 3. Experiments are conducted on a simulation process and a real-world diesel working process in Section 4. Finally, conclusions are drawn in Section 5.
2. Revisit of LFDA
Assume that ${X}_{L}=\{{x}_{1},{x}_{2},\dots ,{x}_{l}\}\in {R}^{m\times l}$ is the labeled dataset matrix, where each vector ${x}_{i}$ comes from the $m$-dimensional space ${R}^{m}$, and that there are ${n}_{k}$ labeled samples in the $k$th ($1\le k\le K$) class ${C}_{k}$.
Let ${S}_{b}$ and ${S}_{w}$ be the between-class scatter matrix and the within-class scatter matrix, respectively. In order to better understand LFDA, the pairwise form of FDA [18] is necessary prerequisite knowledge:

${S}_{b}=\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}{W}_{i,j}^{b}({x}_{i}-{x}_{j}){({x}_{i}-{x}_{j})}^{\mathrm{T}},\qquad {S}_{w}=\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}{W}_{i,j}^{w}({x}_{i}-{x}_{j}){({x}_{i}-{x}_{j})}^{\mathrm{T}}$

where

${W}_{i,j}^{b}=\begin{cases}1/n-1/{n}_{k}, & {x}_{i},{x}_{j}\in {C}_{k}\\ 1/n, & \text{otherwise}\end{cases},\qquad {W}_{i,j}^{w}=\begin{cases}1/{n}_{k}, & {x}_{i},{x}_{j}\in {C}_{k}\\ 0, & \text{otherwise}\end{cases}$
Based on the pairwise forms of FDA, the pairwise forms of LFDA can be rewritten as below [11]:

${S}_{lb}=\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}{W}_{i,j}^{lb}({x}_{i}-{x}_{j}){({x}_{i}-{x}_{j})}^{\mathrm{T}},\qquad {S}_{lw}=\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}{W}_{i,j}^{lw}({x}_{i}-{x}_{j}){({x}_{i}-{x}_{j})}^{\mathrm{T}}$

The weighting matrices ${W}_{i,j}^{lb}$ and ${W}_{i,j}^{lw}$ are given as

${W}_{i,j}^{lb}=\begin{cases}{A}_{i,j}\left(1/n-1/{n}_{k}\right), & {x}_{i},{x}_{j}\in {C}_{k}\\ 1/n, & \text{otherwise}\end{cases},\qquad {W}_{i,j}^{lw}=\begin{cases}{A}_{i,j}/{n}_{k}, & {x}_{i},{x}_{j}\in {C}_{k}\\ 0, & \text{otherwise}\end{cases}$
where ${A}_{i,j}$ is the $(i,j)$th element of the affinity matrix $A$, representing the affinity between the $i$th sample and the $j$th sample. Then the projection vectors of LFDA are obtained from the objective function described below:

${v}_{l,i}=\underset{v}{\mathrm{argmax}}\;\frac{{v}^{\mathrm{T}}{S}_{lb}v}{{v}^{\mathrm{T}}{S}_{lw}v}$
where ${v}_{l,i}$ represents the $i$th discriminant vector in LFDA. It has already been proved that the solution of the above optimization problem is given by the generalized eigenvalue problem:

${S}_{lb}{w}_{l,i}={\lambda}_{l,i}{S}_{lw}{w}_{l,i}$

where ${\lambda}_{l,i},{w}_{l,i},i=1,2,\dots ,m$ are the generalized eigenvalues and corresponding eigenvectors, respectively.
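As a concrete illustration, the LFDA procedure reviewed above can be sketched in Python. The heat-kernel affinity with local scaling used for ${A}_{i,j}$, the small ridge added for numerical stability, and all function and parameter names are assumptions of this sketch, not taken from the paper:

```python
import numpy as np
from scipy.linalg import eigh

def lfda(X, y, dim, knn=7):
    """Sketch of LFDA: local scatter matrices + generalized eigenproblem."""
    n, m = X.shape
    S_lb = np.zeros((m, m))
    S_lw = np.zeros((m, m))
    for c in np.unique(y):
        Xc = X[y == c]
        nc = len(Xc)
        # within-class affinity A via heat kernel with local scaling
        D = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=2)
        k = min(knn, nc - 1)
        sigma = np.sort(D, axis=1)[:, k] if k > 0 else np.ones(nc)
        A = np.exp(-D**2 / (np.outer(sigma, sigma) + 1e-12))
        # pairwise LFDA weights for same-class pairs
        W_lw = A / nc
        W_lb = A * (1.0 / n - 1.0 / nc)
        for i in range(nc):
            for j in range(nc):
                d = np.outer(Xc[i] - Xc[j], Xc[i] - Xc[j])
                S_lw += 0.5 * W_lw[i, j] * d
                S_lb += 0.5 * W_lb[i, j] * d
    # pairs from different classes contribute weight 1/n to S_lb only
    for i in range(n):
        for j in range(n):
            if y[i] != y[j]:
                d = np.outer(X[i] - X[j], X[i] - X[j])
                S_lb += 0.5 * (1.0 / n) * d
    # generalized eigenproblem S_lb w = lambda S_lw w (ridge keeps S_lw invertible)
    evals, evecs = eigh(S_lb, S_lw + 1e-6 * np.eye(m))
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:dim]]
```

The returned columns play the role of the leading generalized eigenvectors ${w}_{l,i}$; the plain double loop keeps the sketch readable at an $O({n}^{2})$ cost.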
3. Methodology
3.1. Problem Statement and Motivation
Statement and Motivation 1: Generally speaking, there are always plentiful variables in real cases, especially in complex industrial processes, which often results in inaccurate classification models. To improve the suboptimal classification performance caused by the imbalanced numbers of normal and faulty samples, the model should explore the useful discriminant information and manifold structures from both the labeled and unlabeled samples, which is beneficial for fault classification.
However, most of the existing LFDA models are unable to carry out variable selection in the fault diagnosis area; they incorporate all process variables without emphasizing the key faulty ones. Therefore, the LASSO method is used to select the responsible variables for the SELFDA model effectively. Then the sparse discriminant optimization problem is formulated and solved by the feasible gradient direction method. Thus, the data characteristics can be well exploited from the variable dimension.
Statement and Motivation 2: Since the within-class scatter matrix ${S}_{lw}$ of LFDA is noninvertible when encountering the frequent SSS problem, the discriminant information corresponding to the zero eigenvalues of ${S}_{lw}$ is ignored by LFDA; hence, LFDA cannot be applied to the SSS case. As a consequence, the application scope of fault classification methods developed upon LFDA is largely narrowed, which creates bottlenecks for the popularization and implementation of these methods in actual industrial processes. Therefore, an efficient LFDA-based classification model able to cope with the SSS problem is badly needed.
As described in the former sections, if the dimension of the dataset exceeds the number of samples, then ${S}_{lw}$ in (9) is noninvertible. Thus, the optimization problem of (8) is unsolvable and faults cannot be accurately classified through LFDA-based methods. Actually, the SSS problem is quite common in complex industrial processes since faulty samples are insufficient and hard to obtain in most cases. Therefore, alleviating the SSS problem is imperative for fault classification models. In the proposed method, the matrix exponential strategy is employed to develop the model, which can completely solve the SSS issue without compulsively reducing the data dimensionality, and is more practical and robust in real applications. What is more, it also inherits the discriminant nature of LFDA and allows for enhanced classification performance through the distance diffusion mapping.
3.2. SELFDA
(1) Sparse Local Fisher Discriminant Analysis (SLFDA): Based on LFDA, SLFDA with LASSO sparsity makes the model more concise and interpretable; the objective function (9) can be reformulated as below:

${\widehat{w}}_{k}=\underset{\widehat{w}}{\mathrm{argmax}}\;{\widehat{w}}^{\mathrm{T}}{\widehat{S}}^{b}\widehat{w}\quad \mathrm{s.t.}\quad {\widehat{w}}^{\mathrm{T}}{\widehat{S}}^{w}\widehat{w}=1$

where ${\widehat{w}}_{k}$ is the $k$th discriminant direction of SLFDA.
The LFDA model can realize variable selection by adding an L0 penalty term, which yields an NP-hard problem. Hence, the LFDA in (10) can be reformulated by adding a LASSO penalty as follows:

${\widehat{w}}_{k}=\underset{\widehat{w}}{\mathrm{argmax}}\;{\widehat{w}}^{\mathrm{T}}{\widehat{S}}^{b}\widehat{w}-\lambda {\Vert \widehat{w}\Vert}_{1}\quad \mathrm{s.t.}\quad {\widehat{w}}^{\mathrm{T}}{\widehat{S}}^{w}\widehat{w}=1,\;{\widehat{w}}^{\mathrm{T}}{\widehat{S}}^{b}{\widehat{w}}_{i}=0\;(i=1,\dots ,k-1)$

where $\lambda$ is the LASSO penalty factor. In general, the interpretability and discriminant performance of the SLFDA model first increase with $\lambda$ within a certain range, and then decrease.
(2) Solution of SLFDA: The orthogonality constraint in (11) is difficult to satisfy directly. To address this problem, a new between-class scatter matrix ${\widehat{S}}_{k}^{b}$ is designed to replace ${\widehat{S}}^{b}$, so that the $k$th discriminant direction can be calculated with ${\widehat{S}}_{k}^{b}={\mathsf{\Psi}}^{\mathrm{T}}{P}_{k}^{\perp}\mathsf{\Psi}$. If $k=1$, let ${P}_{1}^{\perp}=I$, then ${\widehat{S}}_{1}^{b}={\widehat{S}}^{b}$. Otherwise, ${\widehat{S}}^{b}$ can be expressed as ${\widehat{S}}^{b}={\mathsf{\Psi}}^{\mathrm{T}}\mathsf{\Psi}$ through Cholesky decomposition. Then, a new orthogonal projection matrix ${P}_{k}^{\perp}$ projects onto the space orthogonal to $\mathsf{\Psi}{w}_{i}$ ($i=1,2,\dots ,k-1$), which can be written as:

${P}_{k}^{\perp}=I-\mathsf{\Psi}W{\left(\mathsf{\Psi}W\right)}^{+}$

where $W=[{\widehat{w}}_{1},{\widehat{w}}_{2},\dots ,{\widehat{w}}_{k-1}]$ and ${\left(\mathsf{\Psi}W\right)}^{+}$ is the Moore–Penrose pseudoinverse of $\mathsf{\Psi}W$.
Generally speaking, problem (11) can be worked out by the Lagrangian multiplier method regardless of the LASSO penalty, as in (13). Equation (13) can be turned into the form of (14) by multiplying by ${\widehat{w}}_{i}^{\mathrm{T}}$; then, the left part of (14) can be rewritten as (15). Set ${m}_{i}=\mathsf{\Psi}{\widehat{w}}_{i}$, so that ${m}_{i}^{\mathrm{T}}{m}_{i}={\widehat{w}}_{i}^{\mathrm{T}}{\widehat{S}}^{b}{\widehat{w}}_{i}={\lambda}_{i}{\widehat{w}}_{i}^{\mathrm{T}}{\widehat{S}}^{w}{\widehat{w}}_{i}={\lambda}_{i}$, and ${m}_{i}^{\mathrm{T}}{m}_{j}=0$ when $i\ne j$. Set $M=\mathsf{\Psi}W$, then $M=[{m}_{1},{m}_{2},\dots ,{m}_{k-1}]$. Therefore, Equation (15) can be changed into (16).
Since ${\left({M}^{\mathrm{T}}M\right)}^{-1}$ in (16) is a diagonal matrix consisting of the entries $\frac{1}{{\lambda}_{i}}$, the term ${m}_{i}^{\mathrm{T}}M{\left({M}^{\mathrm{T}}M\right)}^{-1}{M}^{\mathrm{T}}$ can be expressed as in (17). Then we have ${\lambda}_{k}{\widehat{w}}_{i}^{\mathrm{T}}{\widehat{S}}^{w}{\widehat{w}}_{k}=0$, that is to say, ${\widehat{w}}_{i}^{\mathrm{T}}{\widehat{S}}^{w}{\widehat{w}}_{k}=0$. After that, Equation (11) can be recast as the problem in (18).
Since the problem in (18) is nonconvex, the minimization-maximization (MM) method, a common choice for handling nonconvex functions, is adopted. Thus, (18) is finally transformed into the following iterative optimization problem by the MM algorithm:
where $\theta$ is the parameter vector used to maximize the objective function, and ${\widehat{w}}_{k}^{i}$ is the optimal solution of the last iteration. In this way, the $k$th discriminant direction can be approximated by iterated operation, which can be solved by the feasible gradient direction method [15].
Then, the regularized forms of the scatter matrices ${S}_{rlb}$ and ${S}_{rlw}$ are defined in the following form:

${S}_{rlb}=(1-\beta ){\widehat{S}}_{k}^{b}+\beta {S}_{t},\qquad {S}_{rlw}=(1-\beta ){\widehat{S}}^{w}+\beta {I}_{m}$

where ${S}_{t}$ is the total scatter matrix of the whole dataset [18], ${I}_{m}$ is an identity matrix, and $\beta \in [0,1]$ denotes the weighting factor. In most conditions, one may choose different $\beta$ values to increase the flexibility of the model.
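The regularized blend above can be written as a small helper (the function name is illustrative, not from the paper); note that any $\beta >0$ makes ${S}_{rlw}$ full rank even when the within-class scatter itself is singular:

```python
import numpy as np

def regularize_scatters(S_b, S_w, S_t, beta):
    """Regularized scatter blend: convex combination with S_t and the identity."""
    m = S_w.shape[0]
    S_rlb = (1.0 - beta) * S_b + beta * S_t
    S_rlw = (1.0 - beta) * S_w + beta * np.eye(m)  # identity term restores full rank
    return S_rlb, S_rlw
```

With $\beta =0$ the original scatters are recovered; with $\beta =1$ the problem degenerates to an eigenproblem on the total scatter alone.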
(3) Derivation Procedure of SELFDA: In order to extract the discriminant information contained in the null space of the within-class scatter matrix, the matrix exponential strategy is adopted here. Analogous to kernel methods, in the SELFDA model, suppose there is a nonlinear distance diffusion mapping $\phi$; then the scatter matrices ${S}_{rlb}$ and ${S}_{rlw}$ can be mapped into a new high-dimensional space. Specifically, taking the covariance matrix ${S}_{rlb}$ as an example, its exponential form can be calculated as below:

$exp\left({S}_{rlb}\right)=\sum_{n=0}^{\infty}\frac{{S}_{rlb}^{n}}{n!}=Q\,exp\left(\mathsf{\Lambda}\right){Q}^{\mathrm{T}}$

where ${S}_{rlb}=Q\mathsf{\Lambda}{Q}^{\mathrm{T}}$, $Q$ is an orthogonal matrix, and $\mathsf{\Lambda}$ is a diagonal matrix.
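The eigendecomposition route above can be checked numerically against SciPy's general matrix exponential; the matrix below is a random symmetric stand-in for ${S}_{rlb}$, chosen only for illustration:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
S = (A @ A.T) / 5.0            # symmetric PSD stand-in with moderate eigenvalues

# S = Q Lambda Q^T  =>  exp(S) = Q exp(Lambda) Q^T
evals, Q = np.linalg.eigh(S)
exp_S = Q @ np.diag(np.exp(evals)) @ Q.T

assert np.allclose(exp_S, expm(S))            # matches the series definition
assert np.all(np.linalg.eigvalsh(exp_S) > 0)  # exp(S) is always full rank
```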
Similar to the LFDA introduced in Section 2, the projection directions of SELFDA can be obtained by solving, with the exponentials of ${S}_{rlb}$ and ${S}_{rlw}$, the following optimization problem:

${w}_{ce}=\underset{w}{\mathrm{argmax}}\;\frac{{w}^{\mathrm{T}}exp\left({S}_{rlb}\right)w}{{w}^{\mathrm{T}}exp\left({S}_{rlw}\right)w}$
The matrices ${S}_{rlb}$ and ${S}_{rlw}$ should be normalized beforehand to prevent the appearance of excessively large values originating from $exp\left({S}_{rlb}\right)$ and $exp\left({S}_{rlw}\right)$. Since $exp\left({S}_{rlb}\right)$ and $exp\left({S}_{rlw}\right)$ are full-rank matrices according to Theorem 3 in [23], the SSS problem of LFDA is solved, because (25) does not need to consider the singularity of the within-class scatter matrix.
Similarly, the solution of (25) is acquired through the following Lagrangian multiplier method:

$exp\left({S}_{rlb}\right){w}_{ce,i}={\lambda}_{ce,i}\,exp\left({S}_{rlw}\right){w}_{ce,i}$

where ${\lambda}_{ce,i}$ and ${w}_{ce,i}$ are the eigenvalues and the corresponding eigenvectors, respectively. The first $d$ eigenvectors ${W}_{d}=[{w}_{ce,1},\dots ,{w}_{ce,d}]$ are used to span the subspace.
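A minimal sketch of this step, assuming trace normalization (the text only states that the matrices are normalized beforehand) and an illustrative function name:

```python
import numpy as np
from scipy.linalg import expm, eigh

def selfda_directions(S_rlb, S_rlw, d):
    """Projection directions from exp(S_rlb) w = lambda exp(S_rlw) w."""
    Eb = expm(S_rlb / np.trace(S_rlb))   # trace normalization avoids overflow
    Ew = expm(S_rlw / np.trace(S_rlw))
    # Ew is positive definite even if S_rlw is singular, so no SSS issue here
    evals, evecs = eigh(Eb, Ew)
    order = np.argsort(evals)[::-1]      # keep the leading eigenvectors
    return evecs[:, order[:d]]
```

Because $exp\left({S}_{rlw}\right)$ is always positive definite, the generalized eigenproblem remains solvable even when the number of samples is smaller than the dimension.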
3.3. Discriminant Power of SELFDA
It is noted that SELFDA is equivalent to transforming the original data into a new space by an exponential mapping; after that, the LFDA model with comprehensive sample information is carried out in this new space, which shows some characteristics analogous to kernel mapping. The only difference between them is that the kernel approach maps the feature vectors while SELFDA maps the scatter matrices. Consequently, SELFDA may show enhanced performance over LFDA in nonlinear circumstances.
The LFDA tends to search for the optimal discriminant direction, which minimizes the within-class distance and maximizes the between-class distance simultaneously. Mathematically, the $trace$ of the scatter matrices can be given as:

$trace\left({S}_{rlb}\right)=\sum_{i=1}^{m}{\lambda}_{rlb,i},\qquad trace\left({S}_{rlw}\right)=\sum_{i=1}^{m}{\lambda}_{rlw,i}$
The eigenvalues of ${S}_{rlb}$ are often used to describe the separation between classes, while the eigenvalues of ${S}_{rlw}$ are often used to describe the closeness of the samples within classes. Hence, the discriminant vector corresponding to a bigger ratio ${\lambda}_{rlb,i}/{\lambda}_{rlw,i}$ owns stronger discriminant power.
The eigenvalues of the exponential matrix can be obtained by the following equation:

$exp\left({S}_{rlb}\right){w}_{rlb}={e}^{{\lambda}_{rlb}}{w}_{rlb}$

where ${w}_{rlb}$ is an eigenvector of ${S}_{rlb}$, and ${\lambda}_{rlb}$ is its corresponding eigenvalue.
Therefore, the $trace$ of the SELFDA scatter matrices can be calculated as follows:

$trace\left(exp\left({S}_{rlb}\right)\right)=\sum_{i=1}^{m}{e}^{{\lambda}_{rlb,i}},\qquad trace\left(exp\left({S}_{rlw}\right)\right)=\sum_{i=1}^{m}{e}^{{\lambda}_{rlw,i}}$
It is noticeable that for most of the eigenvalues in (29) and (30), we have the inequality ${\lambda}_{rlb,i}>{\lambda}_{rlw,i}$ and ${e}^{{\lambda}_{rlb,i}}>{e}^{{\lambda}_{rlw,i}}$.
Since ${\lambda}_{rlb,i}>{\lambda}_{rlw,i}>0$, then ${\lambda}_{rlb,i}-{\lambda}_{rlw,i}>0$, and consequently, since $\forall a>0,{e}^{a}>1+a$, we have

$\frac{{e}^{{\lambda}_{rlb,i}}}{{e}^{{\lambda}_{rlw,i}}}={e}^{{\lambda}_{rlb,i}-{\lambda}_{rlw,i}}>1+{\lambda}_{rlb,i}-{\lambda}_{rlw,i}$

Also, we have $1+{\lambda}_{rlb,i}-{\lambda}_{rlw,i}>1+\frac{{\lambda}_{rlb,i}-{\lambda}_{rlw,i}}{{\lambda}_{rlw,i}}=\frac{{\lambda}_{rlb,i}}{{\lambda}_{rlw,i}}$, since ${\lambda}_{rlw,i}>1$ for large eigenvalues. We obtain the following inequality through transitivity:

$\frac{{e}^{{\lambda}_{rlb,i}}}{{e}^{{\lambda}_{rlw,i}}}>\frac{{\lambda}_{rlb,i}}{{\lambda}_{rlw,i}}$
From the above analysis, we know that the diffusion scale applied to the between-class distance is bigger than that applied to the within-class distance. That is, SELFDA can enlarge the margin between different categories compared with LFDA, which is desirable for fault classification.
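The inequality chain can be verified numerically; the two eigenvalues below are arbitrary illustrative values with ${\lambda}_{rlb,i}>{\lambda}_{rlw,i}>1$:

```python
import math

lam_b, lam_w = 4.0, 1.5                        # illustrative eigenvalues only

ratio_exp = math.exp(lam_b) / math.exp(lam_w)  # ratio after diffusion mapping
ratio_lin = lam_b / lam_w                      # original LFDA eigenvalue ratio

# e^(a-b) > 1 + (a-b) > a/b: the mapped ratio dominates the original one
assert ratio_exp > 1 + lam_b - lam_w > ratio_lin
```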
In order to better understand the working process of the SELFDA algorithm, the pseudocode is given in Algorithm 1.
Algorithm 1 SELFDA
Input: Training data $X\in {R}^{m\times n}$. Output: The projection of the data matrix.
1: Establish the SELFDA model according to (10)–(20)
2: for $i,j=1$ to $n$, $\beta \in [0,1]$ do
3: ${S}_{lb}\leftarrow \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}{W}_{i,j}^{lb}({x}_{i}-{x}_{j}){({x}_{i}-{x}_{j})}^{\mathrm{T}}$
4: ${S}_{lw}\leftarrow \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}{W}_{i,j}^{lw}({x}_{i}-{x}_{j}){({x}_{i}-{x}_{j})}^{\mathrm{T}}$
5: ${S}_{rlb}\leftarrow (1-\beta ){\widehat{S}}_{k}^{b}+\beta {S}_{t}$
6: ${S}_{rlw}\leftarrow (1-\beta ){\widehat{S}}^{w}+\beta {I}_{m}$
7: Implement the matrix exponential on ${S}_{rlb}$ and ${S}_{rlw}$ to construct the optimization problem of (25)
8: $\{{\lambda}_{ce,i},{w}_{ce,i}\}\leftarrow$ Solve (25) via $exp\left({S}_{rlb}\right){w}_{ce,i}={\lambda}_{ce,i}\,exp\left({S}_{rlw}\right){w}_{ce,i}$
9: Rank the eigenvectors ${w}_{ce,i}$ according to the eigenvalues ${\lambda}_{ce,i}$ in descending order
10: end for
11: for $d<n$ do
12: Choose the first $d$ eigenvectors associated with the first $d$ eigenvalues, ${W}_{d}=[{w}_{ce,1},\dots ,{w}_{ce,d}]$
13: end for
14: The projection of $X$ into the discriminant subspace is given by ${W}_{d}^{\mathrm{T}}X$
3.4. SELFDA-Based Fault Diagnosis Scheme
After introducing the theoretical framework of SELFDA, Bayesian inference is borrowed to realize fault classification. Suppose all the samples follow Gaussian distributions; thereby, the prior probability of each category is $P\left(x\in {C}_{k}\right)=\frac{1}{K}$. Hence, the conditional probability density function (PDF) of a new testing sample $x$ is given as:

$P\left(x|{C}_{k}\right)=\frac{1}{{\left(2\pi \right)}^{d/2}{\left|{W}_{d}^{\mathrm{T}}{S}_{k}{W}_{d}\right|}^{1/2}}exp\left[-\frac{1}{2}{\left({W}_{d}^{\mathrm{T}}\left(x-{\overline{x}}_{k}\right)\right)}^{\mathrm{T}}{\left({W}_{d}^{\mathrm{T}}{S}_{k}{W}_{d}\right)}^{-1}\left({W}_{d}^{\mathrm{T}}\left(x-{\overline{x}}_{k}\right)\right)\right]$

where ${\overline{x}}_{k}$ and ${S}_{k}$ are the mean vector and within-class scatter matrix of the labeled training dataset in ${C}_{k}$, and ${W}_{d}$ is the matrix used for dimensionality reduction of the data. In light of the Bayes rule, the posterior probability of $x$ belonging to the $i$th fault category is expressed as:

$P\left({C}_{i}|x\right)=\frac{P\left(x|{C}_{i}\right)P\left(x\in {C}_{i}\right)}{\sum_{k=1}^{K}P\left(x|{C}_{k}\right)P\left(x\in {C}_{k}\right)}$
As a result, each testing sample is classified into a certain fault type by the classification criterion defined in (36).
To simplify the discriminant task in practice, a discriminant function can be redefined as follows:

${g}_{i}\left(x\right)=-\frac{1}{2}{\left({W}_{d}^{\mathrm{T}}\left(x-{\overline{x}}_{i}\right)\right)}^{\mathrm{T}}{\left({W}_{d}^{\mathrm{T}}{S}_{i}{W}_{d}\right)}^{-1}\left({W}_{d}^{\mathrm{T}}\left(x-{\overline{x}}_{i}\right)\right)-\frac{1}{2}\mathrm{ln}\left|{W}_{d}^{\mathrm{T}}{S}_{i}{W}_{d}\right|$
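With equal priors, comparing posteriors reduces to comparing class-wise Gaussian log-density scores, which is what a discriminant function captures. A sketch of the resulting decision rule (all names are illustrative, not from the paper):

```python
import numpy as np

def classify(x, Wd, means, covs):
    """Assign x to the class with the largest Gaussian log-density score.

    Wd: (m, d) projection matrix; means[k]: (m,) class mean;
    covs[k]: (d, d) covariance of class k in the projected space.
    """
    scores = []
    for mu, S in zip(means, covs):
        z = Wd.T @ (x - mu)
        # log Gaussian density with class-independent constants dropped
        g = -0.5 * z @ np.linalg.solve(S, z) - 0.5 * np.log(np.linalg.det(S))
        scores.append(g)
    return int(np.argmax(scores))  # class with the largest posterior
```

For well-separated classes with equal covariances, this rule simply picks the class whose projected mean is nearest in the Mahalanobis sense.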
In conclusion, the flowchart of the SELFDA method is briefly shown in Figure 1.