Article

Discriminant Analysis under f-Divergence Measures

Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
* Author to whom correspondence should be addressed.
Entropy 2022, 24(2), 188; https://doi.org/10.3390/e24020188
Submission received: 18 November 2021 / Revised: 24 January 2022 / Accepted: 25 January 2022 / Published: 27 January 2022

Abstract

In statistical inference, the information-theoretic performance limits can often be expressed in terms of a statistical divergence between the underlying statistical models (e.g., in binary hypothesis testing, the error probability is related to the total variation distance between the statistical models). As the data dimension grows, computing the statistics involved in decision-making and the attendant performance limits (divergence measures) faces complexity and stability challenges. Dimensionality reduction addresses these challenges at the expense of compromising the performance (the divergence reduces by the data-processing inequality). This paper considers linear dimensionality reduction such that the divergence between the models is maximally preserved. Specifically, this paper focuses on Gaussian models where we investigate discriminant analysis under five f-divergence measures (Kullback–Leibler, symmetrized Kullback–Leibler, Hellinger, total variation, and χ²). We characterize the optimal design of the linear transformation of the data onto a lower-dimensional subspace for zero-mean Gaussian models and employ numerical algorithms to find the design for general Gaussian models with non-zero means. There are two key observations for zero-mean Gaussian models. First, projections are not necessarily along the largest modes of the covariance matrix of the data, and, in some situations, they can even be along the smallest modes. Second, under specific regimes, the optimal design of the subspace projection is identical under all the f-divergence measures considered, rendering a degree of universality to the design, independent of the inference problem of interest.

1. Introduction

1.1. Motivation

Consider a simple binary hypothesis testing problem in which we observe an n-dimensional sample X and aim to discern the underlying model according to:
$$H_0:\ X \sim P \qquad \text{vs.} \qquad H_1:\ X \sim Q.$$
The optimal decision rule (in the Neyman–Pearson sense) involves computing the likelihood ratio dP/dQ(X), and the performance limit (the sum of the type I and type II error probabilities) is related to the total variation distance between P and Q. We emphasize that our focus is on settings in which the n elements of X are not statistically independent, in which case the likelihood ratio dP/dQ(X) cannot be decomposed into the product of coordinate-level likelihood ratios. One of the key practical obstacles to solving such problems is the computational cost of finding and performing the statistical tests. This creates a gap between the performance that is information-theoretically viable (under unbounded complexity) and the performance possible under bounded computational complexity [1,2].
Dimensionality reduction techniques have become an integral part of statistical analysis in high dimensions [3,4,5,6]. In particular, linear dimensionality reduction methods have been developed and used for over a century for various reasons, such as their low computational complexity and simple geometric interpretation, as well as for a multitude of applications, such as data compression, storage, and visualization, to name only a few. These methods linearly map the high-dimensional data to lower dimensions while ensuring that the desired features of the data are preserved. There exist two broad sets of approaches to linear dimensionality reduction in one dataset X, which we review next.

1.2. Related Literature

(1) Feature extraction: In one set of approaches, the objective is to select and extract informative and non-redundant features in the dataset X. These approaches are generally unsupervised. The most widely used among them are principal component analysis (PCA) and its variations [7,8,9], multidimensional scaling (MDS) [10,11,12,13], and sufficient dimensionality reduction (SDR) [14]. The objective of PCA is to retain as much of the variation in the data as possible in a lower dimension by minimizing the reconstruction error. In contrast, MDS aims to maximize an aggregate scatter metric of the projection. Finally, the objective of SDR is to design an orthogonal mapping of the data that makes the data X and the responses conditionally independent (given the projected data). There exist extensive variations of these three approaches, and we refer the reader to Reference [6] for more discussion.
(2) Class separation: In another set of approaches, the objective is to perform classification in the lower-dimensional space. These approaches are supervised. Depending on the problem formulation and the underlying assumptions, the resulting decision boundaries between the models can be linear or non-linear. One approach pertinent to this paper's scope is discriminant analysis (DA), which leverages the distinction between given models and designs a mapping such that its lower-dimensional output exhibits maximum separation across different models [15,16,17,18,19,20]. In general, this approach generates two matrices: the within-class and between-class scatter matrices. The within-class scatter matrix captures the scatter of the samples around their respective class means, whereas the between-class scatter matrix captures the scatter of the samples around the mixture mean of all the models. Subsequently, a univariate function of these matrices is formed such that it increases when the between-class scatter becomes larger, or when the within-class scatter becomes smaller. Examples of such functions are classification indices such as the ratio of their determinants, the difference of their determinants, and the ratio of their traces [17]. These approaches focus on reducing the dimension to one and maximizing separability between two classes. There exist, however, studies that consider reducing to dimensions higher than one and separation across more than two classes. Finally, depending on the structure of the class-conditional densities, the resulting shape of the decision boundaries gives rise to linear and quadratic DA.
An f-divergence between a pair of probability measures quantifies the dissimilarity between them. Shannon [21] introduced mutual information as a divergence measure, which was later studied comprehensively by Kullback and Leibler [22] and Kolmogorov [23], establishing the importance of such measures in information theory, probability theory, and related disciplines. The family of f-divergences, independently introduced by Csiszár [24], Ali and Silvey [25], and Morimoto [26], generalizes the Kullback–Leibler divergence and enables characterizing the information-theoretic performance limits of a wide range of inference, learning, source coding, and channel coding problems. For instance, References [27,28,29,30] consider their application to various statistical decision-making problems [31,32,33,34]. More recent developments on the properties of f-divergence measures can be found in References [31,35,36,37].

1.3. Contributions

The contribution of this paper differs from the existing literature on DA in two main respects. First, DA generally focuses on the classification problem for determining the underlying model of the data. Secondly, motivated by the complexities of finding the optimal decision rules for classification (e.g., density estimation), the existing criteria used for separation are selected heuristically. In this paper, we study this problem by adopting the family of f-divergences as measures of the distinction between a pair of probability distributions. Such a choice has three main features: (i) it enables designing linear mappings for a wider range of inference problems (beyond classification); (ii) it provides designs that are optimal for the inference problem at hand; and (iii) it enables characterizing the information-theoretic performance limits after the linear mapping. Our analyses are focused on Gaussian models. Even though the design of the linear mapping differs across f-divergence measures, we have two main observations in the case of zero-mean Gaussian models: (i) the optimal design of the linear mapping is not necessarily along the most dominant components of the data covariance matrix; and (ii) in certain regimes, irrespective of the choice of the f-divergence measure, the design of the linear map that retains the maximal divergence between the two models is the same. In such cases, the optimal design of the linear map is independent of the inference problem at hand, rendering a degree of universality (in the considered space of Gaussian probability measures).
The remainder of the paper is organized as follows. Section 2 introduces the linear dimensionality reduction model and provides an overview of the f-divergence measures considered in this paper. Section 3 formulates the problem and sets up the machinery that facilitates the mathematical analysis in the subsequent sections. In Section 4, we provide a motivating operational interpretation for each f-divergence measure and then characterize an optimal design of the linear mapping for zero-mean Gaussian models. Section 5 presents numerical simulations for inference problems associated with each f-divergence measure of interest for zero-mean Gaussian models. Section 6 generalizes the theory to non-zero-mean Gaussian models and discusses numerical algorithms that help characterize the design of the linear map, and Section 7 concludes the paper. A list of abbreviations used in this paper is provided at the end of the paper.

2. Preliminaries

Consider a pair of n-dimensional Gaussian models:
$$P:\ \mathcal{N}(\mu_P, \Sigma_P), \qquad\text{and}\qquad Q:\ \mathcal{N}(\mu_Q, \Sigma_Q),$$
where μ_P, μ_Q and Σ_P, Σ_Q are two distinct mean vectors and covariance matrices, respectively, and P and Q denote their associated probability measures. Nature selects one model and generates a random variable X ∈ R^n. We perform linear dimensionality reduction on X via a matrix A ∈ R^{r×n}, where r < n, rendering
Y   = A · X .
After linear mapping, the two possible distributions of Y induced by matrix A are denoted by P A and Q A , where
$$P_A:\ \mathcal{N}\big(A\cdot\mu_P,\ A\cdot\Sigma_P\cdot A^\top\big), \qquad Q_A:\ \mathcal{N}\big(A\cdot\mu_Q,\ A\cdot\Sigma_Q\cdot A^\top\big).$$
Motivated by inference problems that we discuss in Section 3, our objective is to design the linear mapping parameterized by matrix A that ensures that the two possible distributions of Y, i.e.,  P A and Q A , are maximally distinguishable. That is, to design A as a function of the statistical models (i.e., μ P , μ Q , Σ P and Σ Q ) such that relevant notions of f-divergences between P A and Q A are maximized. We use a number of f-divergence measures for capturing the distinction between P A and Q A , each with a distinct operational meaning under specific inference problems. For this purpose, we denote the f-divergence of Q A from P A by D f ( A ) , where
$$D_f(A)\ :=\ \mathbb{E}_{P_A}\!\left[f\!\left(\frac{\mathrm{d}Q_A}{\mathrm{d}P_A}\right)\right].$$
We use the shorthand D_f(A) for the canonical notation D_f(Q_A ‖ P_A) to emphasize the dependence on A and for notational simplicity. E_{P_A} denotes the expectation with respect to P_A, and f : (0, +∞) → R is a convex function that is strictly convex at 1 and satisfies f(1) = 0. Strict convexity at 1 ensures that the f-divergence between a pair of probability measures is zero if and only if the probability measures are identical. Given the linear dimensionality reduction model in (3), the objective is to solve
$$\mathcal{P}:\quad \max_{A \in \mathbb{R}^{r\times n}}\ D_f(A),$$
for the following choices of the f-divergence measures.
  • Kullback–Leibler (KL) divergence for f(t) = t log t:
$$D_{\mathrm{KL}}(A)\ :=\ \mathbb{E}_{Q_A}\!\left[\log\frac{\mathrm{d}Q_A}{\mathrm{d}P_A}\right].$$
    We also denote the KL divergence from P_A to Q_A by D_KL(P_A ‖ Q_A).
  • Symmetric KL divergence for f(t) = (t − 1) log t:
$$D_{\mathrm{SKL}}(A)\ :=\ D_{\mathrm{KL}}(Q_A \,\|\, P_A) + D_{\mathrm{KL}}(P_A \,\|\, Q_A).$$
  • Squared Hellinger distance for f(t) = (1 − √t)²:
$$H^2(A)\ :=\ \int_{\mathbb{R}^r}\left(\sqrt{\mathrm{d}Q_A} - \sqrt{\mathrm{d}P_A}\right)^{2}.$$
  • Total variation distance for f(t) = ½·|t − 1|:
$$d_{\mathrm{TV}}(A)\ :=\ \frac{1}{2}\int_{\mathbb{R}^r}\big|\mathrm{d}Q_A - \mathrm{d}P_A\big|.$$
  • χ²-divergence for f(t) = (t − 1)²:
$$\chi^2(A)\ :=\ \int_{\mathbb{R}^r}\frac{\big(\mathrm{d}Q_A - \mathrm{d}P_A\big)^{2}}{\mathrm{d}P_A}.$$
    We also denote the χ²-divergence from P_A to Q_A by χ²(P_A ‖ Q_A).
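To make these definitions concrete, the short sketch below (an illustration added here, not part of the original text) estimates D_f(A) by Monte Carlo for a pair of projected Gaussian models and cross-checks the KL case against the closed form derived in Section 4.1.2; the dimensions, spectrum, and sample size are arbitrary placeholder choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Toy models: P = N(0, I_n), Q = N(0, Sigma), reduced by a semi-orthogonal A (r x n).
n, r = 5, 2
Sigma = np.diag([0.3, 0.6, 1.0, 1.8, 2.5])            # hypothetical spectrum
A = np.linalg.qr(rng.standard_normal((n, r)))[0].T     # rows are orthonormal

P_A = multivariate_normal(np.zeros(r), A @ A.T)        # = N(0, I_r)
Q_A = multivariate_normal(np.zeros(r), A @ Sigma @ A.T)

def f_kl(t):
    # generator f(t) = t log t, so D_f(A) = D_KL(Q_A || P_A)
    return t * np.log(t)

# Monte-Carlo estimate of E_{P_A}[ f(dQ_A / dP_A) ]
X = P_A.rvs(size=200_000, random_state=rng)
ratio = np.exp(Q_A.logpdf(X) - P_A.logpdf(X))
print("Monte-Carlo estimate:", f_kl(ratio).mean())

# Cross-check against the closed form of Section 4.1.2 (mu = 0):
gammas = np.linalg.eigvalsh(A @ Sigma @ A.T)
print("Closed form:", 0.5 * (gammas - np.log(gammas) - 1).sum())
```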

3. Problem Formulation

In this section, in order to avoid cumbersome expressions, we focus on the setting in which one of the covariance matrices is the identity matrix and the other one is a covariance matrix Σ. One key observation is that the design of A under the different measures has strong similarities. We first note that, by defining Ā := A·Σ_P^{1/2}, μ := Σ_P^{−1/2}·(μ_Q − μ_P), and Σ := Σ_P^{−1/2}·Σ_Q·Σ_P^{−1/2}, designing A for maximally distinguishing
$$\mathcal{N}\big(A\cdot\mu_P,\ A\cdot\Sigma_P\cdot A^\top\big) \qquad\text{and}\qquad \mathcal{N}\big(A\cdot\mu_Q,\ A\cdot\Sigma_Q\cdot A^\top\big)$$
is equivalent to designing Ā for maximally distinguishing
$$\mathcal{N}\big(0,\ \bar{A}\cdot\bar{A}^\top\big) \qquad\text{and}\qquad \mathcal{N}\big(\bar{A}\cdot\mu,\ \bar{A}\cdot\Sigma\cdot\bar{A}^\top\big).$$
Hence, without loss of generality, we focus on the setting where μ P = 0 , Σ P = I n , and  Σ Q = Σ . Next, we show that determining an optimal design for A can be confined to the class of semi-orthogonal matrices.
Theorem 1.
For every A , there exists a semi-orthogonal matrix A ¯ such that D f ( A ¯ ) = D f ( A ) .
Proof. 
See Appendix A.    □
This observation indicates that we can reduce the unconstrained problem in (6) to the following constrained problem:
$$\mathcal{Q}:\quad \max_{A \in \mathbb{R}^{r\times n}}\ D_f(A) \qquad \text{s.t.}\quad A\cdot A^\top = I_r.$$
We show that the design of A in the case of μ = 0, under the considered f-divergence measures, directly relates to analyzing the eigenspace of the matrix Σ. For this purpose, we denote the non-negative eigenvalues of Σ, ordered in descending order, by {λ_i : i ∈ [n]}, where for an integer m we have defined [m] := {1, …, m}. For an arbitrary permutation function π : [n] → [n], we denote the permutation of {λ_i : i ∈ [n]} with respect to π by {λ_{π(i)} : i ∈ [n]}. We also denote the eigenvalues of A·Σ·Aᵀ, ordered in descending order, by {γ_i : i ∈ [r]}. Throughout the analysis, we frequently use the Poincaré separation theorem [38] for relating the row space of matrix A to the eigenvalues of Σ.
Theorem 2
(Poincaré Separation Theorem). Let Σ be a real symmetric n × n matrix and A be a semi-orthogonal r × n matrix. The eigenvalues of Σ, denoted by {λ_i : i ∈ [n]} (sorted in descending order), and the eigenvalues of A·Σ·Aᵀ, denoted by {γ_i : i ∈ [r]} (sorted in descending order), satisfy
$$\lambda_{n-(r-i)}\ \le\ \gamma_i\ \le\ \lambda_i, \qquad \forall\, i \in [r].$$
Finally, we define the following functions, which we will refer to frequently throughout the paper:
$$h_1(A)\ :=\ A\cdot\Sigma\cdot A^\top,$$
$$h_2(A)\ :=\ \mu^\top\cdot A^\top\cdot A\cdot\mu,$$
$$h_3(A)\ :=\ \mu^\top\cdot A^\top\cdot\big[h_1(A)\big]^{-1}\cdot A\cdot\mu.$$
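As a reference implementation of the reduction above and of the helper functions h1, h2, and h3, the sketch below (illustrative only; the numeric models supplied by a caller are placeholders) standardizes a general Gaussian pair and evaluates the three functions for a given semi-orthogonal A.

```python
import numpy as np
from scipy.linalg import sqrtm

def standardize(mu_P, Sigma_P, mu_Q, Sigma_Q):
    """Reduce a general Gaussian pair to the canonical setting of Section 3:
    P = N(0, I_n) and Q = N(mu, Sigma)."""
    S = np.real(sqrtm(Sigma_P))                 # Sigma_P^{1/2}
    S_inv = np.linalg.inv(S)
    mu = S_inv @ (mu_Q - mu_P)                  # Sigma_P^{-1/2} (mu_Q - mu_P)
    Sigma = S_inv @ Sigma_Q @ S_inv             # Sigma_P^{-1/2} Sigma_Q Sigma_P^{-1/2}
    return mu, Sigma

def h1(A, Sigma):
    return A @ Sigma @ A.T                      # h_1(A) = A Sigma A^T

def h2(A, mu):
    return float((A @ mu) @ (A @ mu))           # h_2(A) = mu^T A^T A mu

def h3(A, mu, Sigma):
    Amu = A @ mu
    return float(Amu @ np.linalg.solve(h1(A, Sigma), Amu))   # mu^T A^T [h_1(A)]^{-1} A mu
```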
In the next sections, we analyze the design of A under different f-divergence measures. In particular, in Section 4 and Section 5, we focus on zero-mean Gaussian models for P and Q where we provide an operational interpretation of the measure in the dichotomous mode in (4). Subsequently, we will discuss the generalization to non-zero mean Gaussian models in Section 6.

4. Main Results for Zero-Mean Gaussian Models

In this section, we analyze problem Q defined in (14) for each of the f-divergence measures separately. Specifically, for each case, we briefly provide an inference problem as a motivating example, in the context of which we relate the optimal performance limit of that inference problem to the f-divergence of interest. These analyses are provided in Section 4.1, Section 4.2, Section 4.3, Section 4.4 and Section 4.5. Subsequently, we provide the main results on the optimal design of the linear mapping matrix A in Section 4.6.

4.1. Kullback Leibler Divergence

4.1.1. Motivation

The KL divergence, being the expected value of the log-likelihood ratio, captures, at least partially, the performance of a wide range of inference problems. One specific problem whose performance is completely captured by D KL ( A ) is the quickest change-point detection. Consider an observation process (time-series) { X t : t N } in which the observations X t R n are generated by a distribution with probability measure P specified in (2). This distribution changes to Q at an unknown (random or deterministic) time κ , i.e.,
$$X_t \sim P \ \ \text{for}\ t < \kappa, \qquad\text{and}\qquad X_t \sim Q \ \ \text{for}\ t \ge \kappa.$$
Change-point detection algorithms sample the observation process sequentially and aim to detect the change point with minimal delay after it occurs, subject to a false alarm constraint. Hence, the two key figures of merit capturing the performance of a sequential change-point detection algorithm are the average detection delay (ADD) and the rate of false alarms. Whether the change-point κ is random or deterministic gives rise to two broad classes of quickest change-point detection problems, namely the Bayesian setting (κ is random) and the minimax setting (κ is deterministic). Irrespective of their discrepancies in settings and the nature of performance guarantees, the ADD of the (asymptotically) optimal algorithms is of the form [39]:
$$\mathrm{ADD} \sim \frac{c_1}{D_{\mathrm{KL}}(Q \,\|\, P)}.$$
Hence, after the linear mapping induced by matrix A , for the ADD , we have
$$\mathrm{ADD} \sim \frac{c_2}{D_{\mathrm{KL}}(Q_A \,\|\, P_A)},$$
where c 1 and c 2 are constants specified by the false alarm constraints. Clearly, the design of A that minimizes the ADD will be maximizing the disparity between the pre- and post-change distributions P A and Q A , respectively.

4.1.2. Connection between D KL and A

By noting that A is a semi-orthogonal matrix and recalling that the eigenvalues of h_1(A) are denoted by {γ_i : i ∈ [r]}, simple algebraic manipulations simplify D_KL(Q_A ‖ P_A) to:
$$D_{\mathrm{KL}}(Q_A \,\|\, P_A) = \frac{1}{2}\left[\log\frac{1}{|h_1(A)|} - r + \mathrm{Tr}\big(h_1(A)\big) + h_2(A)\right].$$
By setting μ = 0 and leveraging Theorem 2, an optimal design for A that solves (14) can be found as the solution to:
$$\max_{\{\gamma_i\,:\,i \in [r]\}}\ \sum_{i=1}^{r} g_{\mathrm{KL}}(\gamma_i) \qquad \text{s.t.}\quad \lambda_{n-(r-i)} \le \gamma_i \le \lambda_i\quad \forall\, i \in [r],$$
where we have defined
$$g_{\mathrm{KL}}(x)\ :=\ \frac{1}{2}\,\big(x - \log x - 1\big).$$
Likewise, an optimal design for A that maximizes D_KL(P_A ‖ Q_A) when μ = 0 can be found by replacing g_KL(γ_i) with g_KL(1/γ_i) in (23). In either case, the optimal design of A is constructed by choosing r eigenvectors of Σ as the rows of A. The results and observations are formalized in Section 4.6.
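As an added illustration of this construction, the sketch below builds the KL-optimal projection for the zero-mean case by stacking, as rows of A, the eigenvectors of Σ whose eigenvalues have the largest g_KL values; since g_KL is convex with its minimum at 1, this selection coincides with the one formalized in Theorem 3 and Algorithm 1. The example spectrum is arbitrary.

```python
import numpy as np

def g_kl(x):
    # g_KL(x) = (x - log x - 1) / 2: per-eigenvalue contribution to D_KL(Q_A || P_A)
    return 0.5 * (x - np.log(x) - 1.0)

def kl_optimal_projection(Sigma, r):
    """Rows of the returned A are the eigenvectors of Sigma whose eigenvalues
    maximize the sum of g_KL values (zero-mean Gaussian models)."""
    lam, V = np.linalg.eigh(Sigma)              # columns of V are eigenvectors
    idx = np.argsort(g_kl(lam))[::-1][:r]       # r eigenvalues with the largest g_KL
    return V[:, idx].T                          # semi-orthogonal, r x n

# Example: with a spectrum straddling 1, the selection mixes the largest and
# smallest modes of Sigma instead of simply taking the top-r modes.
Sigma = np.diag([0.05, 0.3, 0.9, 1.1, 3.0])
A = kl_optimal_projection(Sigma, 2)             # picks the modes with eigenvalues 0.05 and 3.0
print(A @ Sigma @ A.T)
```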

4.2. Symmetric KL Divergence

4.2.1. Motivation

The KL divergence discussed in Section 4.1 is an asymmetric measure of separation between two probability measures. It is symmetrized by adding the two directed divergence measures taken in reverse directions. The symmetric KL divergence has applications in model selection problems in which the model selection criterion is based on a measure of disparity between the true model and the approximating models. As shown in Reference [40], using the symmetric KL divergence outperforms using either individual directed KL divergence, since it better reflects the risks associated with both underfitting and overfitting of the models.

4.2.2. Connection between D SKL and A

For a given A , the symmetric KL divergence of interest specified in (8) is given by
$$D_{\mathrm{SKL}}(A) = \frac{1}{2}\cdot\Big[\mathrm{Tr}\big([h_1(A)]^{-1} + h_1(A)\big) + h_2(A) + h_3(A)\Big] - r.$$
By setting μ = 0 , and leveraging Theorem 2, the problem of finding an optimal design for A that solves (14) can be found as the solution to:
$$\max_{\{\gamma_i\,:\,i \in [r]\}}\ \sum_{i=1}^{r} g_{\mathrm{SKL}}(\gamma_i) \qquad \text{s.t.}\quad \lambda_{n-(r-i)} \le \gamma_i \le \lambda_i\quad \forall\, i \in [r],$$
where we have defined
$$g_{\mathrm{SKL}}(x)\ :=\ \frac{1}{2}\left(x + \frac{1}{x} - 2\right).$$

4.3. Squared Hellinger Distance

4.3.1. Motivation

Squared Hellinger distance facilitates analysis in high dimensions, especially when other measures fail to take closed-form expressions. We will discuss an important instance of this in the next subsection in the analysis of d TV . Squared Hellinger distance is symmetric, and it is confined in the range [ 0 , 2 ] .

4.3.2. Connection between H 2 and A

For a given matrix A , we have the following closed-form expression:
$$H^2(A) = 2 - 2\,\frac{\big|4\cdot h_1(A)\big|^{\frac{1}{4}}}{\big|h_1(A) + I_r\big|^{\frac{1}{2}}}\cdot\exp\!\left(-\frac{\mu^\top A^\top\big[h_1(A) + I_r\big]^{-1} A\,\mu}{4}\right).$$
By setting μ = 0 , and leveraging Theorem 2, the problem of finding an optimal design for A that solves (14) can be found as the solution to:
$$\max_{\{\gamma_i\,:\,i \in [r]\}}\ \prod_{i=1}^{r} g_{\mathrm{H}}(\gamma_i) \qquad \text{s.t.}\quad \lambda_{n-(r-i)} \le \gamma_i \le \lambda_i\quad \forall\, i \in [r],$$
where we have defined
$$g_{\mathrm{H}}(x)\ :=\ \frac{(x+1)^{2}}{x}.$$

4.4. Total Variation Distance

4.4.1. Motivation

The total variation distance appears as the key performance metric in binary hypothesis testing and in high-dimensional inference, e.g., Le Cam’s method for the binary quantization and testing of the individual dimensions (which is in essence binary hypothesis testing). In particular, for the simple binary hypothesis testing model in (65), the minimum total probability of error (sum of type-I and type-II error probabilities) is related to the total variation d TV ( A ) . Specifically, for a decision rule d : X { H 0 , H 1 } , the following holds:
$$\inf_{d}\ \big[\,P_A(d = H_1) + Q_A(d = H_0)\,\big]\ =\ 1 - d_{\mathrm{TV}}(A).$$
The total variation between two Gaussian distributions does not have a closed-form expression. Hence, unlike the other settings, an optimal solution to (6) in this context cannot be obtained analytically. Alternatively, in order to gain intuition into the structure of a near optimal matrix A , we design A such that it optimizes known bounds on d TV ( A ) . In particular, we use two sets of bounds on d TV ( A ) . One set is due to bounding it via the Hellinger distance, and another set is due to a recent study that established upper and lower bounds that are identical up to a constant factor [41].

4.4.2. Connection between d TV and A

(1) Bounding by Hellinger Distance: The total variation distance can be bounded by the Hellinger distance according to
$$\frac{1}{2}\,H^2(A)\ \le\ d_{\mathrm{TV}}(A)\ \le\ H(A)\,\sqrt{1 - \frac{H^2(A)}{4}}.$$
It can be readily verified that these bounds are monotonically increasing with H 2 ( A ) in the interval [ 0 , 2 ] . Hence, they are maximized simultaneously by maximizing the squared Hellinger distance as discussed in Section 4.3. We refer to this bound as the Hellinger bound.
(2) Matching Bounds up to a Constant: The second set of bounds that we used are provided in Reference [41]. These bounds relate the total variation between two Gaussian models to the Frobenius norm (FB) of a matrix related to their covariance matrices. Specifically, these FB-based bounds on the total variation d TV ( A ) are given by
$$\frac{1}{100}\ \le\ \frac{d_{\mathrm{TV}}(A)}{\min\Big\{1,\ \sqrt{\sum_{i=1}^{r} g_{\mathrm{TV}}(\gamma_i)}\Big\}}\ \le\ \frac{3}{2},$$
where we have defined
$$g_{\mathrm{TV}}(x)\ :=\ \left(\frac{1}{x} - 1\right)^{2}.$$
Since the lower and upper bounds on d TV ( A ) are identical up to a constant, they will be maximized by the same design of A .
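To illustrate that the Hellinger-based and FB-based criteria need not agree when the spectrum straddles 1 (see also Remark 3 in Section 4.6), the short sketch below (an added example with an arbitrary spectrum) ranks eigenvalues by the two criteria; for r = 2, the Hellinger-based ranking keeps {2.6, 0.45} while the FB-based ranking keeps {0.45, 0.6}.

```python
import numpy as np

# Ranking eigenvalues of Sigma by the Hellinger-based criterion g_H versus the
# FB-based criterion g_TV: for spectra straddling 1 the two can disagree.
g_h = lambda x: (x + 1.0) ** 2 / x          # Section 4.3.2
g_tv = lambda x: (1.0 / x - 1.0) ** 2       # Section 4.4.2

lam = np.array([0.45, 0.6, 1.8, 2.6])
print(lam[np.argsort(g_h(lam))[::-1]])      # ordering preferred by the Hellinger bound
print(lam[np.argsort(g_tv(lam))[::-1]])     # ordering preferred by the FB-based bound
```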

4.5. χ 2 -Divergence

4.5.1. Motivation

χ²-divergence appears in a wide range of statistical estimation problems for the purpose of finding a lower bound on the estimation noise variance. For instance, consider the canonical problem of estimating a latent variable θ from the observed data X, and denote two candidate estimates by p(X) and q(X). Define P and Q as the probability measures of p(X) and q(X), respectively. According to the Hammersley–Chapman–Robbins (HCR) bound on the quadratic loss function, for any estimator θ̂, we have
$$\mathrm{var}_\theta(\hat\theta)\ \ge\ \sup_{p \ne q}\ \frac{\big(\mathbb{E}_Q[q(X)] - \mathbb{E}_P[p(X)]\big)^{2}}{\chi^2(Q \,\|\, P)},$$
which, for unbiased estimators p and q, simplifies to the Cramér-Rao lower bound
$$\mathrm{var}_\theta(\hat\theta)\ \ge\ \sup_{p \ne q}\ \frac{(q - p)^{2}}{\chi^2(Q \,\|\, P)},$$
depending on P and Q only through their χ²-divergence. Besides its applications to estimation problems, χ² is easier to compute than some of the other f-divergence measures (e.g., the total variation). Specifically, for product distributions, χ² tensorizes and can be expressed in terms of one-dimensional components that are easier to compute than the KL divergence and the total variation distance. Hence, a combination of bounding other measures with χ² and then analyzing χ² appears in a wide range of inference problems.

4.5.2. Connection between χ 2 and A

By setting μ = 0 , for a given matrix A , from (11), we have the following closed-form expression:
$$\chi^2(A)\ =\ \frac{1}{|h_1(A)|\cdot\big|2\,(h_1(A))^{-1} - I_r\big|^{\frac{1}{2}}}\ -\ 1$$
$$\phantom{\chi^2(A)}\ =\ \prod_{i=1}^{r} g_{\chi_1}(\gamma_i)\ -\ 1,$$
where we have defined
$$g_{\chi_1}(x)\ :=\ \frac{1}{\sqrt{x\,(2 - x)}}.$$
As we show in Appendix C, for  χ 2 ( A ) to exist (i.e., be finite), all the eigenvalues { λ i : i [ r ] } should fall in the interval ( 0 , 2 ) . Subsequently, finding the optimal design for A that optimizes χ 2 ( P A     Q A ) when μ = 0 can be done by replacing g χ 1 in (38) by g χ 2 , which is given by
$$g_{\chi_2}(x)\ :=\ \frac{x}{\sqrt{2x - 1}}.$$
Based on this, and by following a similar line of argument as in the case of the KL divergence, designing an optimal A reduces to identifying a subset of the eigenvalues of Σ and assigning their associated eigenvectors as the rows of matrix A . These observations are formalized in Section 4.6.

4.6. Main Results

In this section, we provide analytical closed-form solutions for designing optimal matrices A for the following f-divergence measures: D_KL, D_SKL, H², and χ². The total variation measure d_TV does not admit a closed form for Gaussian models. In this case, we provide a design for A that optimizes the bounds provided for d_TV in Section 4.4. Due to the structural similarities of the results, we group and treat D_KL, D_SKL, and d_TV in Theorem 3. Similarly, we group and treat H² and χ² in Theorem 4.
Theorem 3
(D_KL, D_SKL, d_TV). For a given function g : R → R, define the permutation:
$$\pi^\star\ :=\ \arg\max_{\pi}\ \sum_{i=1}^{r} g\big(\lambda_{\pi(i)}\big).$$
Then, for D_f(A) ∈ {D_KL(A), D_SKL(A), d_TV(A)} and functions g_f ∈ {g_KL, g_SKL, g_TV}:
  • For maximizing D_f, set g = g_f and select the eigenvalues of A·Σ·Aᵀ as
$$\gamma_i = \lambda_{\pi^\star(i)}, \qquad \text{for } i \in [r].$$
  • Row i ∈ [r] of matrix A is the eigenvector of Σ associated with the eigenvalue γ_i.
Proof. 
See Appendix B.    □
By further leveraging the structure of the functions g_KL, g_SKL, and g_TV, we can simplify the approach for designing the matrix A. Specifically, note that the functions g_KL, g_SKL, and g_TV are all strictly convex functions taking their global minima at x = 1. Based on this, we have the following observations.
Corollary 1
(D_KL, D_SKL, d_TV). For maximizing D_f(A) ∈ {D_KL(A), D_SKL(A), d_TV(A)}, when λ_n ≥ 1, we have γ_i = λ_i for all i ∈ [r], and the rows of A are the eigenvectors of Σ associated with its r largest eigenvalues, i.e., {λ_i : i ∈ [r]}.
Corollary 2
(D_KL, D_SKL, d_TV). For maximizing D_f(A) ∈ {D_KL(A), D_SKL(A), d_TV(A)}, when λ_1 ≤ 1, we have γ_i = λ_{n−r+i} for all i ∈ [r], and the rows of A are the eigenvectors of Σ associated with its r smallest eigenvalues, i.e., {λ_i : i ∈ {n−r+1, …, n}}.
Remark 1.
In order to maximize D_f(A) ∈ {D_KL(A), D_SKL(A), d_TV(A)} when λ_n ≤ 1 ≤ λ_1, finding the best permutation of eigenvalues involves sorting all n eigenvalues λ_i and subsequently performing r comparisons, as illustrated in Algorithm 1. This amounts to O(n·log(n)) time complexity, instead of the O(n·log(r)) time complexity involved in determining the design of A in the cases of Corollaries 1 and 2, which only require finding the r extreme eigenvalues in order to determine π*.
Remark 2.
The optimal design of A often does not involve alignment with the largest eigenvalues of the covariance matrix Σ, which is in contrast to some of the key approaches to linear dimensionality reduction that generally perform the linear mapping along the eigenvectors associated with the largest eigenvalues of the covariance matrix. In particular, when the eigenvalues of Σ are all smaller than 1, A will be designed by choosing the eigenvectors associated with the smallest eigenvalues of Σ in order to preserve the largest separability.
Next, we provide the counterpart results for the H 2 and χ 2 -divergence measures. Their major distinction from the previous three measures is that, for these two, D f ( A ) can be decomposed into a product of individual functions of the eigenvalues { γ i : i [ r ] } . Next, we provide the counterparts of Theorem 3 and Corollaries 1 and 2 for H 2 and χ 2 .
Theorem 4
(H², χ²). For a given function g : R → R, define the permutation:
$$\pi^\star\ :=\ \arg\max_{\pi}\ \prod_{i=1}^{r} g\big(\lambda_{\pi(i)}\big).$$
Then, for D_f(A) ∈ {H²(A), χ²(A), χ²(P_A ‖ Q_A)} and functions g_f ∈ {g_H, g_χ1, g_χ2}:
  • For maximizing D_f, set g = g_f and select the eigenvalues of A·Σ·Aᵀ as
$$\gamma_i = \lambda_{\pi^\star(i)}, \qquad \text{for } i \in [r].$$
  • Row i ∈ [r] of matrix A is the eigenvector of Σ associated with the eigenvalue γ_i.
Proof. 
See Appendix C.    □
Next, note that g H is a strictly convex function taking its global minimum at x = 1 . Furthermore, g χ i for i [ 2 ] are strictly convex over ( 0 , 2 ) and take their global minimum at x = 1 .
Corollary 3
(H², χ²). For maximizing D_f(A) ∈ {H²(A), χ²(A), χ²(P_A ‖ Q_A)}, when λ_n ≥ 1, we have γ_i = λ_i for all i ∈ [r], and the rows of A are the eigenvectors of Σ associated with its r largest eigenvalues, i.e., {λ_i : i ∈ [r]}.
Corollary 4
(H², χ²). For maximizing D_f(A) ∈ {H²(A), χ²(A), χ²(P_A ‖ Q_A)}, when λ_1 ≤ 1, we have γ_i = λ_{n−r+i} for all i ∈ [r], and the rows of A are the eigenvectors of Σ associated with its r smallest eigenvalues, i.e., {λ_i : i ∈ {n−r+1, …, n}}.
Algorithm 1: Optimal Permutation π* When λ_n ≤ 1 ≤ λ_1
1: Sort the eigenvalues of Σ in descending order {λ_k : k ∈ [n]}
2: Initialize i ← n, j ← 1, p_k ← λ_k ∀ k ∈ {i, j}, π* ← ∅
3: while |π*| < r do
4:   if g_f(p_i) > g_f(p_j) then
5:     π* ← π* ∪ {p_i}
6:     i ← i − 1
7:   else
8:     π* ← π* ∪ {p_j}
9:     j ← j + 1
10:  end if
11: end while
12: return π*
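The following is a direct Python transcription of Algorithm 1 (added here for illustration; the generator function g_f is passed in, so the same routine serves g_KL, g_SKL, or g_TV, and the example spectrum is arbitrary).

```python
import numpy as np

def optimal_permutation(eigvals, r, g_f):
    """Algorithm 1: greedily pick r eigenvalues of Sigma from the two extremes of
    the sorted spectrum, always taking the one with the larger g_f value."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending order
    i, j = len(lam) - 1, 0      # pointers to the smallest / largest remaining eigenvalue
    chosen = []
    while len(chosen) < r:
        if g_f(lam[i]) > g_f(lam[j]):
            chosen.append(lam[i]); i -= 1
        else:
            chosen.append(lam[j]); j += 1
    return chosen

# Example with g_SKL(x) = (x + 1/x - 2)/2 and a spectrum straddling 1:
g_skl = lambda x: 0.5 * (x + 1.0 / x - 2.0)
print(optimal_permutation([0.05, 0.3, 0.9, 1.1, 3.0], 3, g_skl))   # [0.05, 0.3, 3.0]
```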
Finally, we remark that, unlike the other measures, total variation does not admit a closed-form, and we used two sets of tractable bounds to analyze this case of total variations. By comparing the design of A based on different bounds, we have the following observation.
Remark 3.
We note that both sets of bounds lead to the same design of A when either λ_1 ≤ 1 or λ_n ≥ 1. Otherwise, each selects a different set of eigenvectors of Σ to construct A, according to the functions
$$g_{\mathrm{H}}(x) = \frac{(x+1)^{2}}{x} \qquad\text{versus}\qquad g_{\mathrm{TV}}(x) = \left(\frac{1}{x} - 1\right)^{2}.$$

5. Zero-Mean Gaussian Models–Simulations

5.1. KL Divergence

In this section, we show the gains of the above analysis for the KL divergence measure D_KL(A) through simulations on a change-point detection problem. We focus on the minimax setting in which the change-point κ is deterministic. The objective is to detect a change in the stochastic process X_t with minimal delay after the change in the probability measure occurs at κ, and we define τ ∈ N as the (stopping) time at which we can form a confident decision. A canonical measure to quantify the decision delay is the conditional average detection delay (CADD) due to Pollak [42]
$$\mathrm{CADD}(\tau)\ :=\ \sup_{\kappa \ge 1}\ \mathbb{E}_\kappa\,\big[\,\tau - \kappa \ \big|\ \tau \ge \kappa\,\big],$$
where E_κ is the expectation with respect to the probability distribution when the change happens at time κ. The objective of this formulation is to optimize the decision delay for the worst-case realization of the change-point κ (that is, the change-point realization that leads to the maximum decision delay), while the constraints on the false alarm rate are satisfied. In this formulation, the worst-case realization is κ = 1, in which case all the data points are generated from the post-change distribution. In the minimax setting, a reasonable measure of false alarms is the mean time to a false alarm, or its reciprocal, the false alarm rate (FAR), defined as
$$\mathrm{FAR}(\tau)\ :=\ \frac{1}{\mathbb{E}_\infty[\tau]},$$
where E_∞ is the expectation with respect to the distribution when a change never occurs, i.e., κ = ∞. A standard approach to balance the trade-off between the decision delay and the false alarm rate involves solving [42]
$$\min_{\tau}\ \mathrm{CADD}(\tau) \qquad \text{s.t.}\quad \mathrm{FAR}(\tau) \le \alpha,$$
where α R + controls the rate of false alarms. For the quickest change-point detection formulation in (48), the popular cumulative sum (CuSum) test generates the optimal solutions, involving computing the following test statistic:
$$W[t]\ :=\ \max_{1 \le k \le t+1}\ \sum_{i=k}^{t} \log\frac{\mathrm{d}Q_A(X_i)}{\mathrm{d}P_A(X_i)}.$$
Computing W [ t ] follows a convenient recursion given by
$$W[t]\ =\ \left[\,W[t-1] + \log\frac{\mathrm{d}Q_A(X_t)}{\mathrm{d}P_A(X_t)}\,\right]^{+},$$
where W [ 0 ] = 0 . The CuSum statistic declares a change at a stopping time τ given by
$$\tau\ :=\ \inf\big\{\,t \ge 1\ :\ W[t] > C\,\big\},$$
where C is chosen such that the constraint on FAR ( τ ) in (48) is satisfied.
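For the Gaussian models used in this section, the log-likelihood ratio in (51) is available in closed form, so the CuSum recursion takes only a few lines; the sketch below is an added illustration (the threshold, dimensions, and covariances are arbitrary placeholders, not the simulation parameters used for the figures).

```python
import numpy as np
from scipy.stats import multivariate_normal

def cusum_stopping_time(stream, P_A, Q_A, C):
    """CuSum recursion W[t] = max(W[t-1] + log dQ_A/dP_A(X_t), 0);
    declare a change at the first t with W[t] > C."""
    W = 0.0
    for t, x in enumerate(stream, start=1):
        W = max(W + Q_A.logpdf(x) - P_A.logpdf(x), 0.0)
        if W > C:
            return t
    return None    # no alarm raised within the stream

# Worst-case delay scenario (kappa = 1): all samples come from the post-change model.
rng = np.random.default_rng(1)
r = 2
P_A = multivariate_normal(np.zeros(r), np.eye(r))
Q_A = multivariate_normal(np.zeros(r), np.diag([3.0, 0.2]))
stream = Q_A.rvs(size=1000, random_state=rng)
print("alarm declared at t =", cusum_stopping_time(stream, P_A, Q_A, C=8.0))
```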
In this setting, we consider two zero-mean Gaussian models with the following pre- and post-linear dimensionality reduction structures:
$$P:\ \mathcal{N}(0, I_n) \quad\text{and}\quad Q:\ \mathcal{N}(0, \Sigma), \qquad P_A:\ \mathcal{N}(0, I_r) \quad\text{and}\quad Q_A:\ \mathcal{N}\big(0,\ h_1(A)\big),$$
where the covariance matrix Σ is generated randomly, and its eigenvalues are sampled from a uniform distribution. In particular, for the original data dimension n, 0.9n of the eigenvalues are sampled as λ_i ∼ U(0.064, 1), and the remaining eigenvalues are sampled as λ_i ∼ U(1, 4.24). We note that this is done since the objective function lies in the same range for eigenvalues within the ranges [0.0649, 1] and [1, 4.24]. In order to consider the worst-case detection delay, we set κ = 1 and generate stochastic observations according to the model described in (52), following the change-point detection model in (19). For every random realization of the covariance matrix Σ, we run the CuSum statistic (50), where we generate A according to the following two schemes:
(1) Largest eigen modes: In this scheme, the linear map A is designed such that its rows are eigenvectors associated with the r largest eigenvalues of Σ .
(2) Optimal design: In this scheme, the linear map A is designed such that its rows are eigenvectors associated with r eigenvalues of Σ that maximize D KL ( A ) according to Theorem 3.
In order to evaluate and compare the performance of the two schemes, we compute the ADD obtained by running a Monte-Carlo simulation over 5000 random realizations of the stochastic process X_t following the change-point detection model in (19), for every random realization of Σ and for each reduced dimension 1 ≤ r ≤ 9. The detection delays obtained are then averaged again over 100 random realizations of the covariance matrix Σ for each reduced dimension r. Figure 1 shows the plot of ADD versus r for multiple initial data dimensions n and for a fixed FAR = 1/5000. Owing to the dependence on D_KL(A) given in (21), the delay associated with the optimal linear mapping in Theorem 3 achieves better performance.

5.2. Symmetric KL Divergence

In this section, we show the gains of the analysis by numerically computing D SKL ( A ) . We follow the pre- and post-linear dimensionality reduction structures given in (52), where the covariance matrix Σ is randomly generated following the setup used in Section 5.1. As plotted in Figure 2, by choosing the design scheme for D SKL ( A ) according to Theorem 3, the optimal design outperforms other schemes.

5.3. Squared Hellinger Distance

We consider a Bayesian hypothesis testing problem with class prior probabilities p_{P_A}, p_{Q_A} and Gaussian class-conditional densities following the linear dimensionality reduction model in (52). Without loss of generality, we assume a 0–1 loss function associated with misclassification for the hypothesis test. In order to quantify the performance of the Bayes decision rule, it is imperative to compute the associated probability of error, also known as the Bayes error, which we denote by P_e. Since, in general, computing P_e for the optimal decision rule for multivariate Gaussian conditional densities is intractable, numerous techniques have been devised to bound P_e. Owing to its simplicity, one of the most commonly employed metrics is the Bhattacharyya coefficient given by
$$\mathrm{BC}(A)\ :=\ \int_{\mathbb{R}^r}\sqrt{\mathrm{d}P_A\cdot \mathrm{d}Q_A}.$$
The metric in (53) facilitates upper bounding the error probability as
$$P_e\ \le\ \sqrt{p_{P_A}\,p_{Q_A}}\cdot \mathrm{BC}(A),$$
which is widely referred to as the Bhattacharyya bound. Relevant to this study is that the squared Hellinger distance is related to the Bhattacharyya coefficient in (53) through
$$H^2(A) = 2 - 2\cdot\mathrm{BC}(A).$$
Hence, maximizing the Hellinger distance H 2 ( A ) results in a tighter bound on P e from (54). To show the performance numerically, we compute the BC ( A ) via (55). For the pre- and post-linear dimensionality reduction structures as given in (52), the covariance matrix Σ is randomly generated following the setup used in Section 5.1. As plotted in Figure 3, by employing the design scheme according to Theorem 4, the optimal design results in a smaller BC ( A ) and, hence, a tighter upper bound on P e in comparison to other schemes.
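For the zero-mean case, the chain from the eigenvalues γ_i of h_1(A) to the Bhattacharyya bound can be written out explicitly by combining (55) with the closed form of H²(A) in Section 4.3.2; the sketch below is an added illustration with arbitrary priors and spectra.

```python
import numpy as np

def bhattacharyya_bound(gammas, p_P=0.5, p_Q=0.5):
    """Bhattacharyya bound P_e <= sqrt(p_P * p_Q) * BC(A) for zero-mean models,
    using BC(A) = 1 - H^2(A)/2 and the closed form of H^2(A) from Section 4.3.2."""
    gammas = np.asarray(gammas, dtype=float)
    # per-eigenvalue factor of BC(A): (4*gamma)^(1/4) / (gamma + 1)^(1/2)
    bc = np.prod((4.0 * gammas) ** 0.25 / np.sqrt(gammas + 1.0))
    return np.sqrt(p_P * p_Q) * bc

# The bound tightens as the spectrum of h_1(A) moves away from 1:
print(bhattacharyya_bound([0.9, 1.1]))   # eigenvalues close to 1 -> loose bound (~0.5)
print(bhattacharyya_bound([0.1, 4.0]))   # eigenvalues far from 1 -> tighter bound
```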

5.4. Total Variation Distance

Consider a binary hypothesis test with Gaussian class-conditional densities following the model in (52) and equal class prior probabilities, i.e., p_{P_A} = p_{Q_A}. We define c_{ij} as the cost associated with deciding in favor of H_i when the true hypothesis is H_j, for 0 ≤ i, j ≤ 1, and denote the densities associated with the measures P_A, Q_A by f_{P_A} and f_{Q_A}, respectively. Without loss of generality, we assume a 0–1 loss function such that c_{ij} = 1 for all i ≠ j and c_{ii} = 0 for all i. The optimal Bayes decision rule that minimizes the error probability is given by
$$\frac{f_{P_A}(x)}{f_{Q_A}(x)}\ \underset{d = H_1}{\overset{d = H_0}{\gtrless}}\ 1.$$
Since the total variation distance cannot be computed in closed form, we numerically compute the error probability P_e under the two bounds (Hellinger-based and FB-based) introduced in Section 4.4.2 to quantify the performance of the design of matrix A for the underlying inference problem. The covariance matrix Σ is randomly generated following the setup used in Section 5.1. As plotted in Figure 4, by optimizing the Hellinger-based bound according to Theorem 4 and the FB-based bound according to Theorem 3, the two design schemes achieve a smaller P_e. We further observe that the FB-based bounds are loose in comparison to the Hellinger-based bounds. Therefore, we choose not to plot the lower bound on P_e for the FB-based bounds in Figure 4.

5.5. χ 2 -Divergence

In this section, we show the gains of the proposed analysis by numerically computing χ²(A), which yields a lower bound on the noise variance var_θ(θ̂) up to a constant. Following the pre- and post-linear dimensionality reduction structures given in (52), the covariance matrix Σ is randomly generated following the setup used in Section 5.1. As shown in Figure 5, constructing the optimal design according to Theorem 4 achieves a tighter lower bound in comparison to the other scheme.

6. General Gaussian Models

In the previous section, we focused on μ = 0. When μ ≠ 0, optimizing each f-divergence measure under the semi-orthogonality constraint does not render closed-form expressions. Nevertheless, to provide some intuition, we describe a numerical approach to the design of A, which might also enjoy some local optimality guarantees. To start, note that the feasible set of solutions M_{n,r} := {A ∈ R^{r×n} : A·Aᵀ = I_r}, induced by the orthogonality constraint in Q, is often referred to as the Stiefel manifold. Therefore, solving Q requires designing algorithms that optimize the objective while preserving the manifold constraints during the iterations.
We employ the method of Lagrange multipliers to formulate the Lagrangian function. By denoting the matrix of Lagrangian multipliers by L R r × r , the Lagrangian function of problem (14) is given by
$$\mathcal{L}(A, L) = D_f(A) + \big\langle L,\ A\cdot A^\top - I_r \big\rangle.$$
From the first order optimality condition, for any local maximizer A of (14), there exists a Lagrange multiplier L such that
$$\nabla_A\, \mathcal{L}(A, L)\,\Big|_{A^\star, L^\star} = 0,$$
where we denote the partial derivative with respect to A by ∇_A. In what follows, we iterate the design of the mapping A using the gradient ascent algorithm in order to find a solution for A. As discussed in the next subsection, this solution is guaranteed to be at least locally optimal.

6.1. Optimizing via Gradient Ascent

We use an iterative gradient ascent-based algorithm to find the local maximizer of D f ( A ) such that A M n r . The gradient ascent update at any given iteration k N is given by
$$A_{k+1} = A_k + \alpha\cdot \nabla_A\, \mathcal{L}(A, L)\,\Big|_{A_k}.$$
Note that, following this update, since the new point A k + 1 in (59) may not satisfy the semi-orthogonality, i.e., A k + 1 M n r , it is imperative to establish a relation between the multipliers L and A k in every iteration k to ensure a constraint-preserving update scheme. In particular, to enforce the semi-orthogonality constraint on A k + 1 , a relationship between the multipliers and the gradients in every iteration k is derived. Following a similar line of analysis for gradient descent in Reference [43], the relationship between multipliers and the gradients is provided in Appendix E. More details on the analysis of the update scheme can be found in Reference [43], and a detailed discussion on the convergence guarantees of classical steepest descent update schemes adapted to semi-orthogonality constraints can be found in Reference [44].
In order to simplify ∇_A L(A, L) and state the relationships, we define Λ := L + Lᵀ and subsequently find a relationship between Λ and A_k in every iteration k. This is obtained by right-multiplying (59) by A_{k+1}ᵀ and solving for the Λ that enforces the semi-orthogonality constraint on A_{k+1}. To simplify the analysis, we take a finite Taylor series expansion of Λ around α = 0 and choose α such that the error in enforcing the constraint remains small while the update remains a good approximation of the gradient of the objective subject to A·Aᵀ = I_r. As derived in Appendix E, by simple algebraic manipulations, it can be shown that the matrices Λ_0, Λ_1, and Λ_2, for which the finite Taylor series expansion Λ ≈ Λ_0 + α·Λ_1 + α²·Λ_2 is a good approximation of the constraint, are given by
$$\Lambda_0\ :=\ -\frac{1}{2}\Big[\nabla_A D_f(A)\cdot A^\top + A\cdot\big(\nabla_A D_f(A)\big)^{\!\top}\Big],$$
$$\Lambda_1\ :=\ -\frac{1}{2}\Big[\nabla_A D_f(A) + \Lambda_0\, A\Big]\cdot\Big[\nabla_A D_f(A) + \Lambda_0\, A\Big]^{\!\top},$$
$$\Lambda_2\ :=\ -\frac{1}{2}\Big[\Lambda_1\cdot A\cdot\big(\nabla_A D_f(A)\big)^{\!\top} + \nabla_A D_f(A)\cdot A^\top\cdot\Lambda_1 + \Lambda_0\cdot\Lambda_1 + \Lambda_1\cdot\Lambda_0\Big].$$
Additionally, we note that, since finding the global maximum is not guaranteed, it is imperative to initialize A 0 close to the estimated maximum. In this regard, we leverage the structure of the objective function for each f-divergence measure as given in Appendix D. In particular, we observe that the objective of each f-divergence measure can be decomposed into two objectives: the first not involving μ (making this objective a convex problem as shown in Section 4), and the second objective a function of μ . Hence, leveraging the structure of the solution from Section 4, we initialize A 0 such that it maximizes the objective in the case of zero-mean Gaussian models. We further note that, while there are more sophisticated orthogonality constraint-preserving algorithms [45], we find that our method adopted from Reference [43] is sufficient for our purpose, as we show next through numerical simulations.
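For intuition, the sketch below implements the overall loop of Section 6.1 in a simplified form: it is an added illustration that replaces the Taylor-corrected multipliers with a QR-based re-orthonormalization (a standard retraction onto the Stiefel manifold) and uses a numerical gradient, so it is a stand-in for, not a reproduction of, the update in (59). The objective shown is D_KL(Q_A ‖ P_A) with a non-zero μ; all numeric values are placeholders.

```python
import numpy as np

def ascend_on_stiefel(obj, A0, step=0.02, iters=300, eps=1e-6):
    """Gradient ascent of obj(A) over semi-orthogonal A (A A^T = I_r), using a
    forward-difference gradient and a QR retraction (illustrative substitute
    for the multiplier-corrected update of Section 6.1)."""
    A = A0.copy()
    for _ in range(iters):
        G = np.zeros_like(A)
        base = obj(A)
        for idx in np.ndindex(A.shape):               # numerical Euclidean gradient
            A_pert = A.copy()
            A_pert[idx] += eps
            G[idx] = (obj(A_pert) - base) / eps
        Q = np.linalg.qr((A + step * G).T)[0]         # retraction back onto the manifold
        A = Q.T
    return A

# Objective: D_KL(Q_A || P_A) for P_A = N(0, I_r) and Q_A = N(A mu, A Sigma A^T).
def d_kl(A, mu, Sigma):
    h1 = A @ Sigma @ A.T
    return 0.5 * (-np.log(np.linalg.det(h1)) - A.shape[0]
                  + np.trace(h1) + (A @ mu) @ (A @ mu))

rng = np.random.default_rng(0)
n, r = 6, 2
mu = 0.4 * np.ones(n)
Sigma = np.diag(rng.uniform(0.2, 1.0, size=n))
A0 = np.linalg.qr(rng.standard_normal((n, r)))[0].T   # feasible initialization
A_opt = ascend_on_stiefel(lambda A: d_kl(A, mu, Sigma), A0)
print(d_kl(A0, mu, Sigma), "->", d_kl(A_opt, mu, Sigma))
```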

6.2. Results and Discussion

The design of A when μ ≠ 0 is not characterized analytically. Therefore, we resort to numerical simulations to show the gains of optimizing the f-divergence measures when μ ≠ 0. In particular, we consider the linear discriminant analysis (LDA) problem, where the goal is to design a mapping A and perform classification in the lower-dimensional space (of dimension r). Without loss of generality, we assume n = 10 and consider Gaussian densities with the following pre- and post-linear dimensionality reduction structures:
$$P:\ \mathcal{N}(0, I_n) \quad\text{and}\quad Q:\ \mathcal{N}(\mu, \Sigma), \qquad P_A:\ \mathcal{N}(0, I_r) \quad\text{and}\quad Q_A:\ \mathcal{N}\big(A\cdot\mu,\ h_1(A)\big),$$
where the covariance matrix Σ is generated randomly, with its eigenvalues sampled from a uniform distribution, {λ_i ∼ U(0, 1)}_{i=1}^{10}. For the model in (63), we consider two kinds of performance metrics that have information-theoretic interpretations: (i) the total probability of error, related to d_TV(A), and (ii) the exponential decay of the error probability, related to D_KL(P_A ‖ Q_A). In what follows, we demonstrate that optimizing the appropriate f-divergence measures between P_A and Q_A leads to better performance than that of the popular Fisher's quadratic discriminant analysis (QDA) classifier [20]. In particular, Fisher's approach sets r = 1 and designs A by solving
$$\arg\max_{A \in \mathbb{R}^{1\times n}}\ \frac{(A\cdot\mu)^{2}}{A\cdot(I_n + \Sigma)\cdot A^\top}.$$
In contrast, we design A such that the f-divergences governing the total probability of error (captured by d_TV(A)) and the exponential decay of the error probability (captured by D_KL(P_A ‖ Q_A)) are maximized. The structure of these objective functions is discussed in Section 6.2.1 under Total probability of error and Type-II error subjected to type-I error constraints. Both our methods and Fisher's method, after projecting the data into a lower dimension, deploy optimal detectors to discern the true model. It is noteworthy that, in both methods, the data in the lower dimension has a Gaussian model, and the conventional QDA classifier [20] is the optimal detector. Hence, we emphasize that our approach aims to design A to maximize the distance between the probability measures after reducing the dimension, i.e., the distance between P_A and Q_A. Since this distance captures the quality of the decisions, our design of A outperforms Fisher's. For each comparison, we consider various values of μ and compare the appropriate performance metrics with those of Fisher's QDA. In all cases, the data is synthetically generated, i.e., sampled from a Gaussian distribution, where we consider 2000 data points associated with each measure P and Q.
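As a point of reference for the baseline in (64), Fisher's criterion is a generalized Rayleigh quotient with a rank-one numerator, so its maximizer has a standard closed form proportional to μᵀ(I_n + Σ)^{-1}; the sketch below (an added illustration, not part of the paper's derivation) computes this direction and normalizes it into a 1 × n semi-orthogonal map.

```python
import numpy as np

def fisher_direction(mu, Sigma):
    """Closed-form maximizer of (A mu)^2 / (A (I_n + Sigma) A^T) over rows A,
    scaled to unit norm so that A A^T = 1 (semi-orthogonal with r = 1)."""
    a = np.linalg.solve(np.eye(len(mu)) + Sigma, mu)     # (I_n + Sigma)^{-1} mu
    return (a / np.linalg.norm(a)).reshape(1, -1)

# Example with the setup of (63):
rng = np.random.default_rng(3)
n = 10
mu = 0.4 * np.ones(n)
Sigma = np.diag(rng.uniform(0.0, 1.0, size=n))
A_fisher = fisher_direction(mu, Sigma)
print(float(A_fisher @ A_fisher.T))                      # ~1.0, i.e., a valid 1 x n map
```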

6.2.1. Schemes for Linear Map

(1) Total Probability of Error: In this scheme, the linear map A is designed such that d TV ( A ) is optimized via gradient ascent iterations until convergence. As discussed in Section 4.4.1, since the total probability of error is the key performance metric that arises while optimizing d TV ( A ) , it is expected that optimizing d TV ( A ) will result in a smaller total error in comparison to other schemes that optimize other objective functions (e.g., Fisher’s QDA). We note that, since there do not exist closed-form expressions for the total variation distance, we maximize bounds on d TV ( A ) instead via the Hellinger bound in (33) as a proxy to minimize the total probability of error. The corresponding gradient expression to optimize H 2 ( A ) (to perform iterative updates as in (59)) is derived in closed-form and is given in Appendix D.
(2) Type-II Error Subjected to Type-I Error Constraints: In this scheme, the linear map A is designed such that D KL ( P A     Q A ) is optimized via gradient ascent iterations until convergence. In order to establish a relation, consider the following binary hypothesis test:
$$H_0:\ X \sim P_A \qquad\text{versus}\qquad H_1:\ X \sim Q_A.$$
When minimizing the probability of type-II error subjected to type-I error constraints, the optimal test guarantees that the probability of type-II error decays exponentially as
$$\lim_{s \to \infty}\ -\frac{\log\big(Q_A(d = H_0)\big)}{s}\ =\ D_{\mathrm{KL}}(P_A \,\|\, Q_A),$$
where we have defined d : X → {H_0, H_1} as the decision rule for the hypothesis test, and s denotes the sample size. As a result, D_KL(P_A ‖ Q_A) appears as the error exponent for the hypothesis test in (65). Hence, it is expected that optimizing D_KL(P_A ‖ Q_A) will result in a smaller type-II error for the same type-I error, when compared with a method that optimizes other objectives (e.g., Fisher's QDA). The corresponding gradient expression for optimizing D_KL(P_A ‖ Q_A) is derived in closed form and is given in Appendix D.
For the sake of comparison and reference, we also consider schemes in which A is designed to optimize the objective D_KL(A), as well as the largest eigen modes (LEM) and smallest eigen modes (SEM) schemes, which carry no specific operational significance in the context of the binary classification problem. In the LEM and SEM schemes, the linear map A is designed such that the rows of A are the eigenvectors associated with the largest and smallest modes of the matrix Σ, respectively. Furthermore, we define 𝟙 as the vector of all ones of appropriate dimension.

6.2.2. Performance Comparison

After learning the linear map A for each scheme described in Section 6.2.1, we perform classification in the lower-dimensional space of dimension r to find the type-I, type-II, and total probability of error for each scheme. Table 1, Table 2, Table 3 and Table 4 tabulate the results for various choices of the mean parameter μ. We have the following important observations: (i) optimizing H²(A) results in a smaller total probability of error in comparison to the total error obtained by optimizing Fisher's objective; it is important to note that this superior performance is observed despite maximizing bounds on d_TV(A) (which is sub-optimal) rather than the distance itself; and (ii) except for the case of μ = 0.8·𝟙, optimizing D_KL(P_A ‖ Q_A) results in a smaller type-II error in comparison to that obtained by optimizing Fisher's objective, indicating a gain over Fisher's objective in (64).
It is important to note that the convergence of the gradient ascent algorithm only guarantees a locally optimal solution. While the reported results are restricted to a maximum separation of μ = 0.8·𝟙, we have performed additional simulations for larger separations between the models. We have the following observations: (i) the solution for the linear map A obtained through gradient ascent becomes highly sensitive to the initialization A_0; specifically, for some random initializations optimizing Fisher's objective outperforms optimizing H²(A), and vice versa for others; and (ii) the gradient ascent solver becomes more prone to getting stuck at local maxima for larger separations between the models. We conjecture that the anomalous result in the case of μ = 0.8·𝟙 when optimizing D_KL(P_A ‖ Q_A) (where optimizing Fisher's objective performs better) is a manifestation of this sensitivity. Furthermore, we note that, since the problem is convex for μ = 0, a deviation from this assumption moves the problem further from being convex, making the solver prone to getting stuck at locally optimal solutions for larger separations between the Gaussian models.

6.2.3. Subspace Representation

In order to gain more intuition about the learned representations, we illustrate the 2-dimensional projections of the original 10-dimensional data obtained after optimizing the corresponding f-divergence measures. For brevity, we only show the plots for D_KL(P_A ‖ Q_A) and H²(A). Figure 6 and Figure 7 plot the two-dimensional projections of the synthetic dataset that optimize D_KL(P_A ‖ Q_A) and H²(A), respectively. As expected, it is observed that the total probability of error is smaller when optimizing H²(A). Figure 8 shows the variation of the objective functions over the gradient ascent iterations. As the iterations grow, the objective functions eventually converge to a locally optimal solution.

7. Conclusions

In this paper, we have considered the problem of discriminant analysis such that separation between the classes is maximized under f-divergence measures. This approach is motivated by dimensionality reduction for inference problems, where we have investigated discriminant analysis under Kullback–Leibler, symmetrized Kullback–Leibler, Hellinger, χ 2 , and total variation measures. We have characterized the optimal design for the linear transformation of the data onto a lower-dimensional subspace for each in the case of zero-mean Gaussian models and adopted numerical algorithms to find the design of the linear transformation in the case of general Gaussian models with non-zero means. We have shown that, in the case of zero-mean Gaussian models, the row space of the mapping matrix lies in the eigenspace of a matrix associated with the covariance matrix of the Gaussian models involved. While each f-divergence measure favors specific eigenvector components, we have shown that all the designs become identical in certain regimes, making the design of the linear mapping independent of the inference problem of interest.

Author Contributions

A.D., S.W. and A.T. contributed equally. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the U. S. National Science Foundation under grants CAREER Award ECCS-1554482 and ECCS-1933107, and RPI-IBM Artificial Intelligence Research Collaboration (AIRC).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
PCA: Principal Component Analysis
MDS: Multidimensional Scaling
SDR: Sufficient Dimension Reduction
DA: Discriminant Analysis
KL: Kullback–Leibler
TV: Total Variation
ADD: Average Detection Delay
FAR: False Alarm Rate
CuSum: Cumulative Sum
BC: Bhattacharyya Coefficient
LEM: Largest Eigen Modes
SEM: Smallest Eigen Modes
LDA: Linear Discriminant Analysis
QDA: Quadratic Discriminant Analysis

Appendix A. Proof of Theorem 1

Consider two pairs of probability measures (P_A, Q_A) and (P_Ā, Q_Ā) associated with the mapping A in space X and Ā in space Y, respectively. Let g : X → Y denote any invertible transformation. Under the invertible map, we have
$$\mathrm{d}Q_{\bar{A}} = \mathrm{d}Q_A\cdot |T|^{-1}, \qquad\text{and}\qquad \mathrm{d}P_{\bar{A}} = \mathrm{d}P_A\cdot |T|^{-1},$$
where | T | denotes the determinant of the Jacobian matrix associated with g. Leveraging (A1), the f-divergence measure D f ( A ¯ ) simplifies as follows.
$$D_f(\bar{A})\ :=\ \mathbb{E}_{P_{\bar{A}}}\!\left[f\!\left(\frac{\mathrm{d}Q_{\bar{A}}}{\mathrm{d}P_{\bar{A}}}\right)\right]$$
$$=\ \int_{\mathcal{Y}} f\!\left(\frac{\mathrm{d}Q_{\bar{A}}}{\mathrm{d}P_{\bar{A}}}\right)\mathrm{d}P_{\bar{A}}(y)$$
$$=\ \int_{\mathcal{X}} f\!\left(\frac{\mathrm{d}Q_A\cdot|T(x)|^{-1}}{\mathrm{d}P_A\cdot|T(x)|^{-1}}\right)\cdot |T(x)|^{-1}\cdot|T(x)|\ \mathrm{d}P_A(x)$$
$$=\ \int_{\mathcal{X}} f\!\left(\frac{\mathrm{d}Q_A}{\mathrm{d}P_A}\right)\mathrm{d}P_A(x)$$
$$=\ D_f(A).$$
Therefore, f-divergence measures are invariant under invertible transformations (both linear and non-linear) ensuring the existence of A ¯ for every A as a special case for linear transformations.

Appendix B. Proof of Theorem 3

We observe that D_KL(A), D_SKL(A), and the objective optimized through the matching bounds on d_TV(A) in Section 4.4.2 can each be decomposed as a summation of strictly convex functions involving g_KL(x), g_SKL(x), and g_TV(x), respectively. Since the summation of strictly convex functions is strictly convex, we conclude that each objective D_f ∈ {D_KL(A), D_SKL(A), d_TV(A)} is strictly convex.
Next, the goal is to choose { γ i } i = 1 r such that D f { D KL ( A ) , D SKL ( A ) , d TV ( A ) } is maximized subjected to spectral constraints given by λ n ( r i ) γ i λ i . In order to choose appropriate γ i ’s, we first note that the global minimizer for functions g f { g KL , g SKL , g TV } appears at x = 1 . By noting that each g f is strictly convex, it can be readily verified that g f ( x ) is monotonically increasing for x > 1 and monotonically decreasing for x < 1 . This will guide selecting { γ i } i = 1 r , as explained next.
In the case of λ n 1 , i.e., when all the eigenvalues are larger than or equal to 1, the objective of maximizing each D f { D KL ( A ) , D SKL ( A ) , d TV ( A ) } boils down to maximizing a monotonically increasing function (considering the domain). This is trivially done by choosing γ i = λ i for i [ r ] , proving Corollary 1. On the other hand, when λ 1 1 , i.e., when all the eigenvalues are smaller than or equal to 1, following the same line of argument, the objective boils down to maximizing each D f { D KL ( A ) , D SKL ( A ) , d TV ( A ) } , where each D f is a monotonically decreasing function (considering the domain). This is trivially done by choosing γ i = λ n r + i for i [ r ] .
When λ_n ≤ 1 ≤ λ_1, the selection is not trivial. Instead, an iterative procedure can be followed: start from the eigenvalues farthest from 1 on both ends of the spectrum and, in every iteration, choose the one that achieves the higher objective. This is repeated until r eigenvalues are chosen; the procedure is also presented as Algorithm 1 in Section 4.6, and a schematic sketch is given at the end of this appendix.
Finally, constructing the optimal matrix A that maximizes D_f for any data covariance matrix Σ amounts to choosing, as the rows of A, the eigenvectors of Σ associated with the eigenvalues selected in each of the aforementioned cases.
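To make the mixed case concrete, the following is a schematic sketch of the two-pointer greedy selection described above (it is not a verbatim reproduction of Algorithm 1). For illustration it uses g_KL(x) = (x − 1 − log x)/2 as the per-eigenvalue objective of the zero-mean case; the exact g_f functions are defined in the main text, and the function names below are ours.

```python
import numpy as np

def g_kl(x):
    # Illustrative per-eigenvalue objective: strictly convex, minimized at x = 1,
    # increasing for x > 1 and decreasing for x < 1 (assumption: zero-mean KL case).
    return 0.5 * (x - 1.0 - np.log(x))

def greedy_select(lam, r, g=g_kl):
    """Pick r eigenvalues from lam by repeatedly comparing the two extremes."""
    lam = np.sort(np.asarray(lam, dtype=float))[::-1]   # descending order
    hi, lo, chosen = 0, len(lam) - 1, []
    for _ in range(r):
        if g(lam[hi]) >= g(lam[lo]):      # largest remaining eigenvalue wins
            chosen.append(lam[hi]); hi += 1
        else:                             # smallest remaining eigenvalue wins
            chosen.append(lam[lo]); lo -= 1
    return np.array(chosen)

lam = [3.0, 1.5, 0.9, 0.4, 0.1]           # hypothetical spectrum with lambda_n <= 1 <= lambda_1
print(greedy_select(lam, r=2))            # picks 0.1 first, then 3.0
```

Because each g_f here is convex with its minimum at x = 1, and the remaining eigenvalues always form a contiguous block in sorted order, the larger of the two extremes is also the largest remaining per-eigenvalue objective; the greedy pass therefore returns the r eigenvalues with the largest values of g_f.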

Appendix C. Proof for Theorem 4

We first find closed-form expressions for χ²(A) and χ²(P_A ‖ Q_A). From the definition, we have
χ²(A) = |I_r|^{1/2} / ( (2π)^{r/2} · |h_1(A)| ) · ∫_{R^r} exp( −(1/2) · Y^T · K_1 · Y ) dY − 1,
where we defined K_1 := 2·[h_1(A)]^{-1} − I_r. We note that K_1 is a real symmetric matrix since h_1(A) is a real symmetric matrix. We denote the eigen-decomposition of K_1 by K_1 = U·Θ·U^T, where Θ is a diagonal matrix with the eigenvalues {θ_i}_{i=1}^{r} as its entries. Based on this decomposition, we have
χ²(A) = 1/( (2π)^{r/2} · |h_1(A)| ) · ∫_{R^r} exp( −(1/2) · Y^T · U·Θ·U^T · Y ) dY − 1
= 1/( (2π)^{r/2} · |h_1(A)| ) · ∫_{R^r} exp( −(1/2) · W^T · Θ · W ) dW − 1
= 1/( (2π)^{r/2} · |h_1(A)| ) · ∏_{i=1}^{r} ∫ exp( −(1/2) · θ_i · w_i² ) dw_i − 1,
where we have defined W := U^T·Y. We note that, in order for χ²(A) to be finite, the eigenvalues {θ_i}_{i=1}^{r} must be strictly positive. Hence, based on the definition of K_1, all the eigenvalues λ_i should fall in the interval (0, 2). Hence, we obtain:
χ²(A) = 1/( (2π)^{r/2} · |h_1(A)| ) · ∏_{i=1}^{r} ∫ exp( −(1/2) · θ_i · w_i² ) dw_i − 1
= 1/( (2π)^{r/2} · |h_1(A)| ) · ∏_{i=1}^{r} √(2π/θ_i) − 1
= 1/( |h_1(A)| · |K_1|^{1/2} ) − 1.   (A13)
Recall that the eigenvalues of h_1(A) are given by {γ_i}_{i=1}^{r} in descending order. Therefore, (A13) simplifies to:
χ²(A) = ∏_{i=1}^{r} 1/√( γ_i·(2 − γ_i) ) − 1 = ∏_{i=1}^{r} g_{χ1}(γ_i) − 1.   (A14)
Hence, from (A14), maximizing χ²(A) is equivalent to choosing the eigenvalues {γ_i}_{i=1}^{r} such that they maximize g_{χ1}(x). Similarly, the closed-form expression for χ²(P_A ‖ Q_A) can be derived as follows:
χ²(P_A ‖ Q_A) = |h_1(A)|^{1/2} / ( (2π)^{r/2} · |I_r| ) · ∫_{R^r} exp( −(1/2) · Y^T · K_2 · Y ) dY − 1,
where we defined K_2 := 2·I_r − [h_1(A)]^{-1}. We note that K_2 is a real symmetric matrix due to h_1(A) being a real symmetric matrix. Hence, following a similar line of argument as in the case of χ²(A), and as a consequence of Theorem 2, we conclude that all the eigenvalues λ_i should fall in the interval (0.5, ∞) to ensure a finite value for χ²(P_A ‖ Q_A). Following this requirement, since the integrands are bounded, we obtain the following closed-form expression:
χ²(P_A ‖ Q_A) = |h_1(A)|^{1/2} · |K_2|^{-1/2} − 1.   (A16)
Recall that the eigenvalues of h_1(A) are given by {γ_i}_{i=1}^{r}; then, (A16) simplifies to
χ²(P_A ‖ Q_A) = ∏_{i=1}^{r} γ_i/√(2·γ_i − 1) − 1 = ∏_{i=1}^{r} g_{χ2}(γ_i) − 1.   (A17)
Hence, from (A17), maximizing χ²(P_A ‖ Q_A) is equivalent to choosing the eigenvalues {γ_i}_{i=1}^{r} such that they maximize g_{χ2}(x).
We observe that H²(A), χ²(A), and χ²(P_A ‖ Q_A) can each be decomposed as the product of r non-negative, identical convex functions involving g_H(x), g_{χ1}(x), and g_{χ2}(x), respectively. Hence, the goal is to choose {γ_i}_{i=1}^{r} such that D_f ∈ {H²(A), χ²(A), χ²(P_A ‖ Q_A)} is maximized subject to the spectral constraints λ_{n−(r−i)} ≤ γ_i ≤ λ_i. In order to choose appropriate γ_i's, we first note that the global minimizer of each g_f ∈ {g_H, g_{χ1}, g_{χ2}} is attained at x = 1. Leveraging this observation, along with the convexity of each g_f, it follows that each g_f(x) is monotonically increasing for x > 1 and monotonically decreasing for x < 1. By the exact same argument as in Appendix B, we obtain Corollaries 3 and 4.
Therefore, similar to Appendix B, constructing the linear map A that maximizes D_f ∈ {H²(A), χ²(A), χ²(P_A ‖ Q_A)} for any data covariance matrix Σ boils down to choosing, as the rows of A, the eigenvectors of Σ associated with the eigenvalues selected in each of the aforementioned cases.
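As a small numerical companion to the closed forms (A14) and (A17) (not part of the original proof), the sketch below evaluates χ²(A) and χ²(P_A ‖ Q_A) directly from the eigenvalues {γ_i} of h_1(A), enforcing the finiteness conditions γ_i ∈ (0, 2) and γ_i > 1/2 derived above; the function names are ours.

```python
import numpy as np

def chi2_QA_PA(gammas):
    """chi^2(A) from the eigenvalues of h_1(A); finite only if every gamma lies in (0, 2)."""
    g = np.asarray(gammas, dtype=float)
    if np.any(g <= 0.0) or np.any(g >= 2.0):
        return np.inf
    return np.prod(1.0 / np.sqrt(g * (2.0 - g))) - 1.0

def chi2_PA_QA(gammas):
    """chi^2(P_A || Q_A) from the eigenvalues of h_1(A); finite only if every gamma exceeds 1/2."""
    g = np.asarray(gammas, dtype=float)
    if np.any(g <= 0.5):
        return np.inf
    return np.prod(g / np.sqrt(2.0 * g - 1.0)) - 1.0

print(chi2_QA_PA([1.0, 1.0]))    # 0.0: identical projected models give zero divergence
print(chi2_QA_PA([1.5, 0.8]))    # finite, since both eigenvalues lie in (0, 2)
print(chi2_PA_QA([1.5, 0.8]))    # finite, since both eigenvalues exceed 1/2
```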

Appendix D. Gradient Expressions for f-Divergence Measures

For clarity in analysis, we define the following functions:
h_2(A) := μ^T · A^T · A · μ,
h_3(A) := μ^T · A^T · [h_1(A)]^{-1} · A · μ.
Based on these definitions, we have the following representations for the divergence measures and their associated gradients:
D_KL(A) = (1/2) · [ log( 1/|h_1(A)| ) − r + Tr( h_1(A) ) + h_2(A) ],
∇_A D_KL(A) = [h_1(A)]^{-1} · ( I_r − [h_1(A)]^{-1} − A·μ·μ^T·A^T·[h_1(A)]^{-1} ) · A·Σ + [h_1(A)]^{-1}·A·μ·μ^T.
D_KL(P_A ‖ Q_A) = (1/2) · [ log|h_1(A)| − r + Tr( [h_1(A)]^{-1} ) + h_3(A) ],
∇_A D_KL(P_A ‖ Q_A) = ( I_r − [h_1(A)]^{-1} ) · A·Σ + A·μ·μ^T.
D_SKL(A) = (1/2) · [ Tr( [h_1(A)]^{-1} + h_1(A) ) + h_2(A) + h_3(A) ] − r,
∇_A D_SKL(A) = ( I_r − [h_1(A)]^{-2} − [h_1(A)]^{-1}·A·μ·μ^T·A^T·[h_1(A)]^{-1} ) · A·Σ + ( I_r + [h_1(A)]^{-1} ) · A·μ·μ^T.
H²(A) = 2 − 2 · |4·h_1(A)|^{1/4} · |h_1(A) + I_r|^{-1/2} · exp( −(1/4)·μ^T·A^T·[h_1(A) + I_r]^{-1}·A·μ ),
∇_A H²(A) / [2 − H²(A)] = −(1/2)·[h_1(A)]^{-1}·A·Σ + [h_1(A) + I_r]^{-1}·( A·[Σ + I_n] − (1/2)·A·μ·μ^T + (1/2)·A·μ·μ^T·A^T·[h_1(A) + I_r]^{-1}·A·[Σ + I_n] ).
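Closed-form expressions of this kind are convenient to sanity-check numerically. The sketch below (ours, not from the paper) evaluates the closed form of D_KL(P_A ‖ Q_A) above for randomly drawn μ, Σ, and an A with orthonormal rows, and computes a central-difference gradient that can be compared entry-wise against the corresponding analytic expression; all names are illustrative.

```python
import numpy as np

def dkl_PQ(A, Sigma, mu):
    """Closed form of D_KL(P_A || Q_A) above, assuming P = N(0, I_n) and Q = N(mu, Sigma)."""
    r = A.shape[0]
    h1 = A @ Sigma @ A.T
    h1_inv = np.linalg.inv(h1)
    Amu = A @ mu
    return 0.5 * (np.log(np.linalg.det(h1)) - r + np.trace(h1_inv) + Amu @ h1_inv @ Amu)

def numerical_grad(fun, A, eps=1e-6):
    """Central-difference approximation of the gradient of fun with respect to A."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (fun(A + E) - fun(A - E)) / (2.0 * eps)
    return G

rng = np.random.default_rng(1)
n, r = 6, 2
B = rng.normal(size=(n, n))
Sigma = B @ B.T + np.eye(n)
mu = rng.normal(size=n)
A = np.linalg.qr(rng.normal(size=(n, r)))[0].T      # rows are orthonormal: A A^T = I_r

G_num = numerical_grad(lambda M: dkl_PQ(M, Sigma, mu), A)
print(G_num)    # compare entry-wise with the analytic gradient expression
```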

Appendix E. Proof for Lagrange Multipliers

Denoting ∇_A L by Δ̃ and ∇_A D_f by Δ, and post-multiplying (59) by (A^{k+1})^T, we have:
A^{k+1}·(A^{k+1})^T = A^k·(A^{k+1})^T + α·Δ̃·(A^{k+1})^T,
I_r = A^k·(A^k + α·Δ̃)^T + α·Δ̃·(A^k + α·Δ̃)^T,
0 = A^k·Δ̃^T + Δ̃·(A^k)^T + α·Δ̃·Δ̃^T.   (A27)
Substituting Δ̃ = Δ + Λ·A^k in (A27) and simplifying the expression, we obtain:
2·Λ + A^k·Δ^T + Δ·(A^k)^T = −α·( Δ·Δ^T + Δ·(A^k)^T·Λ + Λ·A^k·Δ^T + Λ·Λ ).
By noting that Λ is symmetric, taking the Taylor series expansion of Λ around α = 0, and equating the terms in α on both sides, we obtain the relationships in (60)–(62).
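For intuition (and as our sketch, not the paper's implementation), the code below keeps only the zeroth-order term of this expansion, Λ_0 = −(A^k·Δ^T + Δ·(A^k)^T)/2, uses it to form the constrained ascent direction Δ + Λ_0·A^k, and restores A·A^T = I_r with a QR retraction; the higher-order corrections in (60)–(62) are omitted, and the retraction is one common choice rather than necessarily the one used in the paper. All names are illustrative.

```python
import numpy as np

def stiefel_ascent_step(A, grad, alpha=1e-2):
    """One ascent step on {A : A A^T = I_r} using the zeroth-order Lagrange multiplier."""
    Lam0 = -0.5 * (A @ grad.T + grad @ A.T)     # symmetric, as required of Lambda
    direction = grad + Lam0 @ A                 # gradient of the Lagrangian as alpha -> 0
    A_next = A + alpha * direction
    Q, _ = np.linalg.qr(A_next.T)               # re-orthonormalize the rows
    return Q.T

# Usage (illustrative): A = stiefel_ascent_step(A, grad_of_Df_at_A), where grad_of_Df_at_A
# is the analytic gradient from Appendix D or a numerical approximation of it.
```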

Figure 1. Comparison of the average detection delay (ADD) under the optimal design and largest eigen modes schemes for multiple reduced data dimensions r as a function of the original data dimension n, for a fixed false alarm rate (FAR) of 1/5000.
Figure 2. Comparison of the empirical average computed for the optimal design and largest eigen modes schemes for multiple reduced data dimensions r as a function of original data dimension n.
Figure 3. Comparison of the empirical average of the Bhattacharyya coefficient BC ( A ) under optimal design and largest eigen modes schemes for multiple reduced data dimensions r as a function of original data dimension n.
Figure 4. Comparison of the logarithm of the empirical average of P_e under the two bounds on d_TV(A) (Hellinger-based and Frobenius norm (FB)-based) and under the largest eigen modes scheme, for multiple projected data dimensions r as a function of the initial data dimension n.
Figure 5. Comparison of the lower bound on the noise variance given by 1/χ²(A) under the optimal and largest eigen modes schemes for multiple reduced data dimensions r as a function of the original data dimension n.
Figure 6. Two-dimensional projected data obtained by optimizing D_KL(P_A ‖ Q_A).
Figure 7. Two-dimensional projected data obtained by optimizing H 2 ( A ) .
Figure 8. Convergence of the gradient ascent algorithm as a result of optimizing H 2 ( A ) .
Table 1. μ = 0.2 · 𝟙, r = 1.
                | Fisher’s QDA | D_KL(P_A ‖ Q_A) | H²(A)    | D_KL(A)  | SEM      | LEM
P_A (d = H_1)   | 331/2000     | 331/2000        | 331/2000 | 331/2000 | 337/2000 | 915/2000
Q_A (d = H_0)   | 1226/2000    | 63/2000         | 63/2000  | 63/2000  | 64/2000  | 811/2000
Total Error     | 1557/4000    | 394/4000        | 394/4000 | 394/4000 | 401/4000 | 1726/4000
Table 2. μ = 0.4 · 𝟙, r = 1.
                | Fisher’s QDA | D_KL(P_A ‖ Q_A) | H²(A)    | D_KL(A)  | SEM      | LEM
P_A (d = H_1)   | 344/2000     | 344/2000        | 344/2000 | 345/2000 | 347/2000 | 782/2000
Q_A (d = H_0)   | 594/2000     | 63/2000         | 63/2000  | 63/2000  | 64/2000  | 739/2000
Total Error     | 938/4000     | 407/4000        | 407/4000 | 408/4000 | 411/4000 | 1521/4000
Table 3. μ = 0.6 · 𝟙, r = 1.
                | Fisher’s QDA | D_KL(P_A ‖ Q_A) | H²(A)    | D_KL(A)  | SEM      | LEM
P_A (d = H_1)   | 326/2000     | 326/2000        | 335/2000 | 318/2000 | 335/2000 | 638/2000
Q_A (d = H_0)   | 137/2000     | 55/2000         | 108/2000 | 57/2000  | 61/2000  | 669/2000
Total Error     | 463/4000     | 381/4000        | 443/4000 | 375/4000 | 396/4000 | 1307/4000
Table 4. μ = 0.8 · 𝟙, r = 1.
                | Fisher’s QDA | D_KL(P_A ‖ Q_A) | H²(A)    | D_KL(A)  | SEM      | LEM
P_A (d = H_1)   | 264/2000     | 264/2000        | 159/2000 | 255/2000 | 307/2000 | 561/2000
Q_A (d = H_0)   | 25/2000      | 53/2000         | 64/2000  | 55/2000  | 60/2000  | 580/2000
Total Error     | 289/4000     | 317/4000        | 214/4000 | 310/4000 | 367/4000 | 1141/4000
