Article

Survey of Distances between the Most Popular Distributions

Department of Statistics and Data Analysis, The National Research University Higher School of Economics, Moscow 101000, Russia
Analytics 2023, 2(1), 225-245; https://doi.org/10.3390/analytics2010012
Submission received: 6 December 2022 / Revised: 19 January 2023 / Accepted: 13 February 2023 / Published: 1 March 2023

Abstract

We present a number of upper and lower bounds for the total variation distances between the most popular probability distributions. In particular, some estimates of the total variation distances in the cases of multivariate Gaussian distributions, Poisson distributions, binomial distributions, between a binomial and a Poisson distribution, and also in the case of negative binomial distributions are given. Next, estimates of the Lévy–Prokhorov distance in terms of the Wasserstein metrics are discussed, and the Fréchet, Wasserstein and Hellinger distances for multivariate Gaussian distributions are evaluated. Some novel context-sensitive distances are introduced, and a number of bounds mimicking the classical results from information theory are proved.

1. Introduction

Measuring a distance, whether in the sense of a metric or a divergence, between two probability distributions (PDs) is a fundamental endeavor in machine learning and statistics [1]. We encounter it in clustering, density estimation, generative adversarial networks, image recognition and just about any field that takes a statistical approach to data. The most popular case is measuring the distance between multivariate Gaussian PDs, but other examples, such as the Poisson, binomial and negative binomial distributions, frequently appear in applications too. Unfortunately, the available textbooks and reference books do not present them in a systematic way. Here, we make an attempt to fill this gap. To this end, we review the basic facts about metrics for probability measures and provide specific formulae and simplified proofs that cannot easily be found in the literature. Many of these facts may be considered scientific folklore: known to experts but not presented in any regular way in the established sources. A tale that becomes folklore is one that is passed down and whispered around. The second half of the word, lore, comes from Old English lār, i.e., ‘instruction’. The basic reference for the topic is [2], and, in recent years, the theory has achieved substantial progress. A selection of recent publications on stability problems for stochastic models may be found in [3], but not much attention is devoted there to the relationships between the different metrics useful in specific applications. Hopefully, this survey helps to make this treasure more accessible and easier to handle.
The rest of the paper proceeds as follows: In Section 2, we define the total variation, Kolmogorov–Smirnov, Jensen–Shannon and geodesic metrics. Section 3 is devoted to the total variation distance for 1D Gaussian PDs. In Section 4, we survey a variety of different cases: Poisson, binomial, negative binomial, etc. In Section 5, the total variation bounds for multivariate Gaussian PDs are presented, and they are proved in Section 6. In Section 7, estimates of the Lévy–Prokhorov distance in terms of the Wasserstein metrics are presented. The Gaussian case is thoroughly discussed in Section 8. In Section 9, the relatively new topic of distances between measures of different dimensions is briefly discussed. Finally, in Section 10, new context-sensitive metrics are introduced and a number of inequalities mimicking the classical bounds from information theory are proved.

2. The Most Popular Distances

The most interesting metrics on the space of probability distributions are the total variation (TV), Lévy–Prokhorov and Wasserstein distances. We will also discuss the Fréchet, Kolmogorov–Smirnov and Hellinger distances. Let us remind readers that, for probability measures $P, Q$ with densities $p, q$,
$$\mathrm{TV}(P,Q) = \sup_{A \subseteq \mathbb{R}^d} |P(A) - Q(A)| = \frac{1}{2}\int_{\mathbb{R}^d} |p(u) - q(u)|\,du.$$
We need the coupling characterization of the total variation distance. For two distributions, $P$ and $Q$, a pair $(X,Y)$ of random variables (r.v.) defined on the same probability space is called a coupling for $P$ and $Q$ if $X \sim P$ and $Y \sim Q$. Note the following fact: there exists a coupling $(X,Y)$ such that $P(X \ne Y) = \mathrm{TV}(P,Q)$. Therefore, for any measurable function $f$, we have $P(f(X) \ne f(Y)) \le \mathrm{TV}(P,Q)$, with equality if $f$ is invertible.
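For finite distributions, the coupling characterization is easy to verify directly. The following minimal sketch (Python/NumPy; the probability vectors are arbitrary illustrative values) constructs the maximal coupling explicitly and checks that it attains the total variation distance.

```python
import numpy as np

# two illustrative distributions on {0, ..., 4}
p = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
q = np.array([0.3, 0.1, 0.2, 0.15, 0.25])
tv = 0.5 * np.abs(p - q).sum()

# maximal coupling: put the overlap min(p_i, q_i) on the diagonal and
# couple the residual masses (which both sum to TV) arbitrarily off it
overlap = np.minimum(p, q)
rp, rq = p - overlap, q - overlap
joint = np.diag(overlap)
if tv > 0:
    joint += np.outer(rp, rq) / tv

assert np.allclose(joint.sum(axis=1), p)   # correct marginals
assert np.allclose(joint.sum(axis=0), q)
p_neq = joint.sum() - np.trace(joint)      # P(X != Y)
assert abs(p_neq - tv) < 1e-12             # the coupling attains TV
```

The off-diagonal residual coupling is not unique; any joint distribution of the residuals produces a coupling with $P(X \ne Y) = \mathrm{TV}(P,Q)$.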
In the one-dimensional case, the Kolmogorov–Smirnov distance is useful (only for probability measures on $\mathbb{R}$): $\mathrm{Kolm}(P,Q) = \sup_{x\in\mathbb{R}} |P((-\infty,x)) - Q((-\infty,x))| \le \mathrm{TV}(P,Q)$. Suppose $X \sim P$, $Y \sim Q$ are two r.v.'s, and $Y$ has a density w.r.t. the Lebesgue measure bounded by a constant $C$. Then, $\mathrm{Kolm}(P,Q) \le \sqrt{2C\,\mathrm{Wass}_1(P,Q)}$. Here, $\mathrm{Wass}_1(P,Q) = \inf\,[\,E|X-Y| : X \sim P,\ Y \sim Q\,]$.
Let $X_1, X_2$ be random variables with the probability density functions $p, q$, respectively. Define the Kullback–Leibler (KL) divergence
$$\mathrm{KL}(P_{X_1}\|P_{X_2}) = \int p \log\frac{p}{q}.$$
Example 1.
Consider the scale family $\{p_s(x) = \frac{1}{s}\,p(\frac{x}{s}),\ s \in (0,\infty)\}$. Then,
$$\mathrm{KL}(p_{s_1}\|p_{s_2}) = \mathrm{KL}(p_{s_1/s_2}\|p_1) = \mathrm{KL}(p_1\|p_{s_2/s_1}).$$
The total variation distance and the Kullback–Leibler (KL) divergence appear naturally in statistics. Say, in the testing of the binary hypothesis $H_0: X \sim P$ versus $H_1: X \sim Q$, the sum of the errors of both types satisfies
$$\inf_d\,\big[P(d(X)=H_1) + Q(d(X)=H_0)\big] = \int \min[p,q] = 1 - \mathrm{TV}(P,Q),$$
where the infimum over all reasonable decision rules $d: X \to \{H_0,H_1\}$, or, equivalently, over the critical domains $W$, is achieved for $W^* = \{p(x) < q(x)\}$. Moreover, when minimizing the probability of the type-II error subject to a type-I error constraint, the optimal test guarantees that the probability of the type-II error decays exponentially, in view of Sanov's theorem:
$$\lim_{n\to\infty} -\frac{\ln Q(d(X)=H_0)}{n} = \mathrm{KL}(P\|Q),$$
where $n$ is the sample size. In the case of selecting between $M \ge 2$ distributions,
$$\inf_d\,\max_{1\le j\le M} P_j\big(d(X)\ne j\big) \ge 1 - \frac{\frac{1}{M^2}\sum_{j,k=1}^M \mathrm{KL}(P_j\|P_k) + \log 2}{\log(M-1)}.$$
The KL-divergence is not symmetric and does not satisfy the triangle inequality. However, it gives rise to the so-called Jensen–Shannon metric [4]
$$\mathrm{JS}(P,Q) = \frac{1}{2}D(P\|R) + \frac{1}{2}D(Q\|R)$$
with $R = \frac{1}{2}(P+Q)$. It is a lower bound for the total variation distance:
$$0 \le \mathrm{JS}(P,Q) \le \mathrm{TV}(P,Q).$$
The Jensen–Shannon metric is not easy to compute in terms of covariance matrices in the multi-dimensional Gaussian case.
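The bound $\mathrm{JS}(P,Q) \le \mathrm{TV}(P,Q)$ can be spot-checked numerically. The sketch below uses the normalization $\mathrm{JS}(P,Q) = \frac{1}{2}D(P\|R) + \frac{1}{2}D(Q\|R)$ with natural logarithms and random discrete distributions as illustrative test cases.

```python
import numpy as np

def kl(a, b):
    # Kullback-Leibler divergence of two strictly positive probability vectors
    return float(np.sum(a * np.log(a / b)))

rng = np.random.default_rng(1)
for _ in range(100):
    p = rng.dirichlet(np.ones(6))
    q = rng.dirichlet(np.ones(6))
    r = 0.5 * (p + q)
    js = 0.5 * kl(p, r) + 0.5 * kl(q, r)   # Jensen-Shannon divergence
    tv = 0.5 * np.abs(p - q).sum()
    assert 0.0 <= js <= tv + 1e-12         # JS is a lower bound for TV
```

Note that the Jensen–Shannon divergence is always finite (the ratio $p/r$ is at most 2), unlike the KL divergence itself.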
A natural way to develop a computationally effective distance in the Gaussian case is to first define a metric between positively definite matrices. Let $\lambda_1,\dots,\lambda_d$ be the generalized eigenvalues, i.e., the solutions of $\det(\Sigma_1 - \lambda\Sigma_2) = 0$. Define the distance between positively definite matrices by $d(\Sigma_1,\Sigma_2) = \big[\sum_{j=1}^d (\ln\lambda_j)^2\big]^{1/2}$, and a geodesic metric between Gaussian PDs $X_1 \sim N(\mu_1,\Sigma_1)$ and $X_2 \sim N(\mu_2,\Sigma_2)$:
$$d(X_1,X_2) = \big[\delta^T S^{-1}\delta\big]^{1/2} + \Big[\sum_{j=1}^d (\ln\lambda_j)^2\Big]^{1/2}$$
where $\delta = \mu_1-\mu_2$ and $S = \frac{1}{2}\Sigma_1 + \frac{1}{2}\Sigma_2$. Equivalently,
$$d^2(\Sigma_1,\Sigma_2) = \mathrm{tr}\Big(\big(\ln\big(\Sigma_1^{-1/2}\Sigma_2\Sigma_1^{-1/2}\big)\big)^2\Big).$$
Remark 1.
It may be proved that the set of symmetric positively definite matrices $M^+(d,\mathbb{R})$ is a Riemannian manifold, and (8) is the geodesic distance corresponding to the bilinear form $B(X,Y) = 4\,\mathrm{tr}(XY)$ on the tangent space of symmetric matrices $M(d,\mathbb{R})$.
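The equivalence between the generalized-eigenvalue form and the matrix-logarithm form of $d(\Sigma_1,\Sigma_2)$ makes the distance straightforward to implement. The following sketch computes it via the symmetrized product and checks the basic metric properties and the invariance under congruence transformations; the random well-conditioned matrices are purely illustrative.

```python
import numpy as np

def spd_geodesic(S1, S2):
    """Geodesic distance between positively definite matrices:
    sqrt(sum of squared logs of generalized eigenvalues det(S1 - lam S2) = 0)."""
    w2, V2 = np.linalg.eigh(S2)
    S2_inv_half = V2 @ np.diag(w2 ** -0.5) @ V2.T
    M = S2_inv_half @ S1 @ S2_inv_half          # symmetric, same spectrum as S2^{-1} S1
    lam = np.linalg.eigvalsh(M)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3)); S1 = A @ A.T + 3 * np.eye(3)
B = rng.normal(size=(3, 3)); S2 = B @ B.T + 3 * np.eye(3)

d12 = spd_geodesic(S1, S2)
assert abs(spd_geodesic(S1, S1)) < 1e-10            # d(S, S) = 0
assert abs(d12 - spd_geodesic(S2, S1)) < 1e-10      # symmetry
# invariance under congruence S -> C S C^T for invertible C
C = rng.normal(size=(3, 3)) + 3 * np.eye(3)
assert abs(spd_geodesic(C @ S1 @ C.T, C @ S2 @ C.T) - d12) < 1e-8
```

The congruence invariance reflects the fact that generalized eigenvalues of the pair $(\Sigma_1,\Sigma_2)$ are unchanged by a simultaneous change of basis.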

3. Total Variation Distance between 1D Gaussian PDs

Let $\Phi$ and $\varphi$ be the standard normal distribution function and its density. Let $X_i \sim N(\mu_i,\sigma_i^2)$, $i=1,2$. Define $\tau = \tau(X_1,X_2) = \mathrm{TV}(N(\mu_1,\sigma_1^2), N(\mu_2,\sigma_2^2))$. Note that $\tau$ depends on the parameters $\Delta = |\delta|$, with $\delta = \mu_1-\mu_2$, and $\sigma_1^2, \sigma_2^2$.
Proposition 1.
In the case $\sigma_1^2 = \sigma_2^2 = \sigma^2$, the total variation distance is computed exactly: $\tau(X_1,X_2) = 2\Phi\big(\frac{|\mu_1-\mu_2|}{2\sigma}\big) - 1$.
Proof. 
By using a shift, we can assume that $\mu_1 = 0$ and $\mu_2 = \Delta > 0$. Then, the set $A = \{x : p_1(x) > p_2(x)\}$ is specified as
$$A = \big\{e^{-x^2/(2\sigma^2)} > e^{-(x-\Delta)^2/(2\sigma^2)}\big\} = (-\infty, \Delta/2).$$
Hence,
$$\tau(X_1,X_2) = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{\Delta/2}\Big[e^{-x^2/(2\sigma^2)} - e^{-(x-\Delta)^2/(2\sigma^2)}\Big]dx = \Phi(b) - \Phi(-b)$$
where $b = \frac{\Delta}{2\sigma}$. Using the property $\Phi(-b) = 1 - \Phi(b)$ leads to the answer. □
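Proposition 1 is easy to confirm numerically: below, the closed form is compared against a direct quadrature of $\frac{1}{2}\int|p_1-p_2|$ (the parameter values are arbitrary illustrative choices).

```python
import numpy as np
from math import erf, sqrt

def Phi(t):
    # standard normal CDF
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

mu1, mu2, sigma = 0.0, 1.3, 0.8             # illustrative parameters
tau_exact = 2.0 * Phi(abs(mu1 - mu2) / (2.0 * sigma)) - 1.0

# numerical evaluation of (1/2) * int |p1 - p2| dx
x = np.linspace(-10.0, 12.0, 400001)
dx = x[1] - x[0]
p1 = np.exp(-(x - mu1) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
p2 = np.exp(-(x - mu2) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
w = np.abs(p1 - p2)
tau_num = 0.5 * (w.sum() - 0.5 * (w[0] + w[-1])) * dx   # trapezoidal rule
assert abs(tau_exact - tau_num) < 1e-6
```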
Theorem 1.
$$\frac{1}{200}\min\Big[1,\ \max\Big[\frac{|\sigma_1^2-\sigma_2^2|}{\min[\sigma_1^2,\sigma_2^2]},\ \frac{40\,\Delta}{\min[\sigma_1,\sigma_2]}\Big]\Big] \le \tau \le \frac{3\,|\sigma_1^2-\sigma_2^2|}{2\max[\sigma_1^2,\sigma_2^2]} + \frac{\Delta}{2\max[\sigma_1,\sigma_2]}$$
The proof is sketched in Section 6. The upper bound is based on the following.
Proposition 2 (Pinsker’s inequality).
Let $X_1, X_2$ be random variables with the probability density functions $p, q$, and the Kullback–Leibler divergence $\mathrm{KL}(P_{X_1}\|P_{X_2})$. Then, for $\tau(X_1,X_2) = \mathrm{TV}(X_1,X_2)$,
$$\tau(X_1,X_2) \le \min\big[1,\ \sqrt{\mathrm{KL}(P_{X_1}\|P_{X_2})/2}\,\big].$$
Proof of Pinsker’s inequality. 
We need the following bound:
$$|x-1| \le \Big[\Big(\frac{4}{3}+\frac{2x}{3}\Big)\phi(x)\Big]^{1/2}, \qquad \phi(x) := x\ln x - x + 1.$$
If $P$ and $Q$ are singular, then $\mathrm{KL} = \infty$ and Pinsker's inequality holds true. Assume $P$ and $Q$ are absolutely continuous. In view of (7) and the Cauchy–Schwarz inequality,
$$\tau(X,Y) = \frac{1}{2}\int|p-q| = \frac{1}{2}\int q\,\Big|\frac{p}{q}-1\Big|\,\mathbf{1}\{q>0\} \le \frac{1}{2}\Big[\int\Big(\frac{4q}{3}+\frac{2p}{3}\Big)\Big]^{1/2}\Big[\int q\,\phi\Big(\frac{p}{q}\Big)\mathbf{1}\{q>0\}\Big]^{1/2} = \frac{1}{\sqrt{2}}\Big[\int p\ln\frac{p}{q}\,\mathbf{1}\{q>0\}\Big]^{1/2} = \big[\mathrm{KL}(P\|Q)/2\big]^{1/2}.$$
To check (12), define
$$g(x) = (x-1)^2 - \Big(\frac{4}{3}+\frac{2x}{3}\Big)\phi(x).$$
Then, $g(1) = g'(1) = 0$ and $g''(x) = -\frac{4\phi(x)}{3x} \le 0$. Hence,
$$g(x) = g(1) + g'(1)(x-1) + \frac{1}{2}g''(\xi)(x-1)^2 = -\frac{4\phi(\xi)}{6\xi}(x-1)^2 \le 0.\ \square$$
[Mark S. Pinsker was invited to be the Shannon Lecturer at the 1979 IEEE International Symposium on Information Theory but could not obtain permission at that time to travel to the symposium. However, he was officially recognized by the IEEE Information Theory Society as the 1979 Shannon Award recipient.]
For one-dimensional Gaussian distributions,
$$\mathrm{KL}(P_{X_2}\|P_{X_1}) = \frac{1}{2}\Big[\frac{\sigma_2^2}{\sigma_1^2} - 1 + \frac{\Delta^2}{\sigma_1^2} - \ln\frac{\sigma_2^2}{\sigma_1^2}\Big].$$
In the multi-dimensional Gaussian case,
$$\mathrm{KL}(P_{X_2}\|P_{X_1}) = \frac{1}{2}\Big[\mathrm{tr}\big(\Sigma_1^{-1}\Sigma_2 - I\big) + \delta^T\Sigma_1^{-1}\delta - \ln\det\big(\Sigma_2\Sigma_1^{-1}\big)\Big].$$
Next, define the Hellinger distance
$$\eta(X,Y) = \Big[\frac{1}{2}\int\big(\sqrt{p_X(u)} - \sqrt{p_Y(u)}\big)^2\,du\Big]^{1/2}$$
and note that, for one-dimensional Gaussian distributions,
$$\eta(X,Y)^2 = 1 - \sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}}\;e^{-\frac{\Delta^2}{4(\sigma_1^2+\sigma_2^2)}}.$$
For multi-dimensional Gaussian PDs with $\delta = \mu_1-\mu_2$,
$$\eta(X,Y)^2 = 1 - \frac{2^{d/2}\det(\Sigma_1)^{1/4}\det(\Sigma_2)^{1/4}}{\det(\Sigma_1+\Sigma_2)^{1/2}}\exp\Big(-\frac{1}{8}\,\delta^T\Big[\frac{\Sigma_1+\Sigma_2}{2}\Big]^{-1}\delta\Big).$$
In fact, the following chain of inequalities holds:
$$\tau(X,Y)^2 \le 2\,\eta(X,Y)^2 \le \mathrm{KL}(P_X\|P_Y) \le 2\,\chi^2(X,Y)$$
where $\chi^2(P,Q) = \int\frac{(p(x)-q(x))^2}{q(x)}\,dx$. These inequalities are not sharp. For example, the Cauchy–Schwarz inequality immediately implies $\tau(X,Y) \le \frac{1}{2}\sqrt{\chi^2(X,Y)}$. There are also reverse inequalities in some cases.
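The whole chain $\tau^2 \le 2\eta^2 \le \mathrm{KL} \le 2\chi^2$, together with the Cauchy–Schwarz bound $\tau \le \frac{1}{2}\sqrt{\chi^2}$, can be verified numerically on random discrete distributions; the sketch below does so with NumPy (the Dirichlet samples are illustrative only).

```python
import numpy as np

rng = np.random.default_rng(3)
for _ in range(200):
    p = rng.dirichlet(np.ones(8))
    q = rng.dirichlet(np.ones(8))
    tv = 0.5 * np.abs(p - q).sum()
    h2 = 0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()    # eta^2
    kl = float(np.sum(p * np.log(p / q)))
    chi = float(np.sum((p - q) ** 2 / q))
    assert tv ** 2 <= 2 * h2 + 1e-12
    assert 2 * h2 <= kl + 1e-12
    assert kl <= 2 * chi + 1e-12
    assert tv <= 0.5 * np.sqrt(chi) + 1e-12              # Cauchy-Schwarz bound
```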
Proposition 3 (Le Cam’s inequalities). 
The following inequality holds:
$$\eta(X,Y)^2 \le \tau(X,Y) \le \eta(X,Y)\big[2 - \eta(X,Y)^2\big]^{1/2}.$$
Proof of Le Cam’s inequalities. 
From $\tau(X,Y) = \frac{1}{2}\int|p-q| = 1 - \int\min[p,q]$ and $\min[p,q] \le \sqrt{pq}$, it follows that $\tau(X,Y) \ge 1 - \int\sqrt{pq} = \eta^2(X,Y)$. Next, $\int\min[p,q] + \int\max[p,q] = 2$. Therefore, by Cauchy–Schwarz:
$$\Big(\int\sqrt{pq}\Big)^2 = \Big(\int\sqrt{\min[p,q]\max[p,q]}\Big)^2 \le \int\min[p,q]\int\max[p,q] = \int\min[p,q]\Big(2 - \int\min[p,q]\Big).$$
Hence,
$$\big(1-\eta(X,Y)^2\big)^2 \le \big(1-\tau(X,Y)\big)\big(1+\tau(X,Y)\big)\ \Rightarrow\ \tau(X,Y) \le \eta(X,Y)\big[2-\eta(X,Y)^2\big]^{1/2}.\ \square$$
Example 2.
Let $X \sim N(0,\Sigma_1)$, $Y \sim N(0,\Sigma_2)$ be $d$-dimensional Gaussian vectors. Suppose that $\Sigma_2 = (1+\Delta)\Sigma_1$, where $\Delta$ is small enough. Let $r < d$ and let $A$ be an $r \times d$ semi-orthogonal matrix, $AA^T = I_r$. Define $\tau := \tau(AX, AY)$. Then,
$$\frac{1}{16}\,\Delta^2 r \le \tau \le \frac{1}{2^{3/2}}\,\Delta\sqrt{r}.$$
Proof. 
In view of Le Cam's inequalities, it is enough to evaluate $\eta^2$. Note that all $r$ eigenvalues of $A\Sigma_1A^T\big(A\Sigma_2A^T\big)^{-1}$ equal $(1+\Delta)^{-1}$. Thus,
$$\eta^2 = 1 - \frac{4^{r/4}(1+\Delta)^{r/4}}{(2+\Delta)^{r/2}} = \frac{1}{16}\,\Delta^2 r + o(\Delta^2).\ \square$$
[Ernst Hellinger was imprisoned in Dachau but released through the intervention of influential friends, and he emigrated to the US.]

4. Bounds on the Total Variation Distance

This section is devoted to the basic examples and partially based on [5]. However, it includes more proofs and additional details (Figure 1).
Proposition 4 (Distances between exponential distributions). 
(a) Let $X \sim \mathrm{Exp}(\lambda)$, $Y \sim \mathrm{Exp}(\mu)$, $0 < \lambda < \mu < \infty$. Then,
$$\tau(X,Y) = \Big(\frac{\lambda}{\mu}\Big)^{\frac{\lambda}{\mu-\lambda}} - \Big(\frac{\lambda}{\mu}\Big)^{\frac{\mu}{\mu-\lambda}}.$$
(b) Let $X = (X_1,\dots,X_d)$, $Y = (Y_1,\dots,Y_d)$, each with $d$ i.i.d. components $X_i \sim \mathrm{Exp}(\lambda)$, $Y_i \sim \mathrm{Exp}(\mu)$. Then,
$$\tau(X,Y) = \int_{z^*}^\infty\big(\lambda^d e^{-\lambda y} - \mu^d e^{-\mu y}\big)\frac{y^{d-1}}{(d-1)!}\,dy$$
where $z^* = \frac{d}{\mu-\lambda}\ln\frac{\mu}{\lambda}$.
Proof. 
(a) Indeed, the set $A = \{x > 0 : \lambda e^{-\lambda x} > \mu e^{-\mu x}\}$ coincides with the half-axis $(y^*,\infty)$ with $y^* = \frac{1}{\mu-\lambda}\ln\frac{\mu}{\lambda}$. Consequently, $\tau(X,Y) = e^{-\lambda y^*} - e^{-\mu y^*}$. (b) In this case, the set $A = \{x : x_i > 0,\ \sum_{j=1}^d x_j > z^*\}$ with $z^* = \frac{d}{\mu-\lambda}\ln\frac{\mu}{\lambda}$. The sum of $d$ i.i.d. $\mathrm{Exp}(\lambda)$ components has the Gamma density $\lambda^d y^{d-1}e^{-\lambda y}/(d-1)!$, $y > 0$. Therefore, $\tau(X,Y) = \int_A\big[\prod_{j=1}^d \lambda e^{-\lambda x_j} - \prod_{j=1}^d \mu e^{-\mu x_j}\big]dx$ coincides with (24). □
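Part (a) of Proposition 4 is easy to validate against direct quadrature of $\frac{1}{2}\int_0^\infty |\lambda e^{-\lambda x} - \mu e^{-\mu x}|\,dx$; the rates below are arbitrary illustrative values.

```python
import numpy as np

lam, mu = 0.7, 2.1                      # illustrative rates, 0 < lam < mu
tau_exact = (lam / mu) ** (lam / (mu - lam)) - (lam / mu) ** (mu / (mu - lam))

x = np.linspace(0.0, 60.0, 1_200_001)   # the tail beyond 60 is negligible
dx = x[1] - x[0]
w = np.abs(lam * np.exp(-lam * x) - mu * np.exp(-mu * x))
tau_num = 0.5 * (w.sum() - 0.5 * (w[0] + w[-1])) * dx   # trapezoidal rule
assert abs(tau_exact - tau_num) < 1e-6
```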
Proposition 5 (Distances between Poisson distributions). 
Let $X_i \sim \mathrm{Po}(\lambda_i)$, where $0 < \lambda_1 < \lambda_2$. Then,
$$\tau(X_1,X_2) = \int_{\lambda_1}^{\lambda_2} P\big(N(u) = l-1\big)\,du \le \min\Big[\lambda_2-\lambda_1,\ \sqrt{\tfrac{2}{e}}\big(\sqrt{\lambda_2}-\sqrt{\lambda_1}\big)\Big]$$
where $N(u) \sim \mathrm{Po}(u)$ and
$$l = l(\lambda_1,\lambda_2) = \Big\lceil\frac{\lambda_2-\lambda_1}{\ln(\lambda_2/\lambda_1)}\Big\rceil$$
with $\lambda_1 \le l \le \lambda_2$.
Proof. 
Let $N(t) \sim \mathrm{Po}(t)$; then, via iterated integration by parts,
$$P\big(N(t) \le n\big) = \sum_{k=0}^n e^{-t}\frac{t^k}{k!} = \int_t^\infty e^{-u}\frac{u^n}{n!}\,du = \int_t^\infty P\big(N(u) = n\big)\,du.$$
Hence,
$$\mathrm{Kolm}(X_1,X_2) = \tau(X_1,X_2) = P(X_2 \ge l) - P(X_1 \ge l) = P(X_1 \le l-1) - P(X_2 \le l-1) = \int_{\lambda_1}^{\lambda_2} P\big(N(u) = l-1\big)\,du$$
where
$$l = \min\big[k \in \mathbb{Z}_+ : f(k) \ge 1\big] = \Big\lceil\frac{\lambda_2-\lambda_1}{\ln(\lambda_2/\lambda_1)}\Big\rceil$$
and $f(k) = P(N(\lambda_2)=k)\big/P(N(\lambda_1)=k)$. □
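The integral representation of Proposition 5 and the standard upper bounds can be checked numerically; the Poisson parameters below are illustrative, and the integral is evaluated by the midpoint rule.

```python
from math import ceil, exp, lgamma, log, sqrt

def pois_pmf(k, lam):
    return exp(-lam + k * log(lam) - lgamma(k + 1))

lam1, lam2 = 2.0, 3.5
# direct TV (truncated sum; the tail beyond k = 60 is negligible here)
tau = 0.5 * sum(abs(pois_pmf(k, lam1) - pois_pmf(k, lam2)) for k in range(60))

l = ceil((lam2 - lam1) / log(lam2 / lam1))
n = 10000                               # midpoint rule on [lam1, lam2]
du = (lam2 - lam1) / n
integral = sum(pois_pmf(l - 1, lam1 + (i + 0.5) * du) for i in range(n)) * du
assert abs(tau - integral) < 1e-6       # the integral representation
# two standard upper bounds
assert tau <= lam2 - lam1
assert tau <= sqrt(2 / exp(1)) * (sqrt(lam2) - sqrt(lam1)) + 1e-12
```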
Proposition 6 (Distances between binomial distributions). 
Let $X_i \sim \mathrm{Bin}(n,p_i)$, $0 < p_1 < p_2 < 1$. Then,
$$\tau(X_1,X_2) = n\int_{p_1}^{p_2} P\big(S_{n-1}(u) = l-1\big)\,du \le \sqrt{\frac{e}{2}}\,\frac{\psi(p_2-p_1)}{\big(1-\psi(p_2-p_1)\big)^2}$$
where $S_{n-1}(u) \sim \mathrm{Bin}(n-1,u)$ and $\psi(x) = x\sqrt{\frac{n+2}{2\,p_1(1-p_1)}}$. Finally, define
$$l = \Bigg\lceil\frac{-n\ln\Big(1-\frac{p_2-p_1}{1-p_1}\Big)}{\ln\Big(1+\frac{p_2-p_1}{p_1}\Big) - \ln\Big(1-\frac{p_2-p_1}{1-p_1}\Big)}\Bigg\rceil$$
with $np_1 \le l \le np_2$.
Proof. 
Let us prove the following inequality:
$$np \le \frac{-n\ln(1-x/q)}{\ln(1+x/p) - \ln(1-x/q)} \le n(p+x), \qquad 0 < x < q,$$
where $p = p_1$, $p + x = p_2$ and $q = 1-p$. By concavity of the logarithm, given $p \in (0,1)$ and $q = 1-p$,
$$f(x) = p\ln(1+x/p) + q\ln(1-x/q) \le \ln 1 = 0, \qquad 0 < x < q.$$
This gives the bound $np_1 \le l$ as follows:
$$p\ln(1+x/p) + q\ln(1-x/q) \le 0\ \Rightarrow\ np\ln(1+x/p) \le -nq\ln(1-x/q)\ \Rightarrow\ np\big[\ln(1+x/p) - \ln(1-x/q)\big] \le -n\ln(1-x/q)\ \Rightarrow\ np \le \frac{-n\ln(1-x/q)}{\ln(1+x/p) - \ln(1-x/q)}.$$
On the other hand,
$$h(x) = (p+x)\ln(1+x/p) + (q-x)\ln(1-x/q) \ge 0, \qquad 0 \le x < q,$$
as $h(0) = 0$ and $h'(x) = \ln(1+x/p) - \ln(1-x/q) \ge 0$; this implies the bound $l \le np_2$. Indeed:
$$(p+x)\ln(1+x/p) + (q-x)\ln(1-x/q) \ge 0\ \Rightarrow\ n(p+x)\big[\ln(1+x/p) - \ln(1-x/q)\big] \ge -n\ln(1-x/q)\ \Rightarrow\ n(p+x) \ge \frac{-n\ln(1-x/q)}{\ln(1+x/p) - \ln(1-x/q)}.$$
The rest of the solution goes in parallel with that of Proposition 5. Equation (27) is replaced with the following relation: if $S_n(p) \sim \mathrm{Bin}(n,p)$, then
$$P\big(S_n(p) \ge k\big) = n\int_0^p P\big(S_{n-1}(u) = k-1\big)\,du.$$
In fact, iterated integration by parts yields: the RHS of (35)
$$= \frac{n(n-1)\cdots(n-k+1)}{k!}\,p^k(1-p)^{n-k} + \frac{n(n-1)\cdots(n-k)}{(k+1)!}\,p^{k+1}(1-p)^{n-k-1} + \dots + p^n =$$
the LHS of (35). □
Proposition 7 (Distance between binomial and Poisson distributions).
Let $X \sim \mathrm{Bin}(n,p)$ and $Y \sim \mathrm{Po}(np)$, $0 < np < 2-\sqrt{2}$. Then,
$$\tau(X,Y) = np\big[(1-p)^{n-1} - e^{-np}\big].$$
An alternative bound:
$$\mathrm{TV}\Big(\mathrm{Bin}\big(n,\tfrac{\lambda}{n}\big),\ \mathrm{Pois}(\lambda)\Big) \le 1 - \Big(1-\frac{\lambda}{n}\Big)^{1/2}.$$
For the sum of Bernoulli r.v.'s $S_n = \sum_{j=1}^n X_j$ with $P(X_j = 1) = p_j$,
$$\tau(S_n, Y_n) = \frac{1}{2}\sum_{k=0}^\infty\Big|P(S_n = k) - \frac{\lambda_n^k}{k!}e^{-\lambda_n}\Big| < \sum_{j=1}^n p_j^2$$
where $Y_n \sim \mathrm{Po}(\lambda_n)$, $\lambda_n = p_1 + p_2 + \dots + p_n$ (Le Cam). A stronger result: for $X_j \sim \mathrm{Bernoulli}(p_j)$ and $Y_j \sim \mathrm{Po}(\lambda_j = p_j)$, there exists a coupling s.t.
$$\tau(X_j,Y_j) = P(X_j \ne Y_j) = p_j\big(1 - e^{-p_j}\big).$$
The stronger form of (39):
$$\frac{1}{32}\min\Big[1,\frac{1}{\lambda_n}\Big]\sum_{j=1}^n p_j^2\ \le\ \tau(S_n,Y_n)\ \le\ \frac{1-e^{-\lambda_n}}{\lambda_n}\sum_{j=1}^n p_j^2.$$
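Le Cam's bound and its two-sided refinement can be tested exactly for a small Poisson-binomial model, since the law of $S_n$ is computable by convolution; the Bernoulli parameters below are random illustrative values.

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(4)
p = rng.uniform(0.01, 0.3, size=12)     # illustrative Bernoulli parameters p_j
lam = p.sum()                           # lambda_n

# exact law of S_n = sum of independent Bernoulli(p_j) via convolution
dist = np.array([1.0])
for pj in p:
    dist = np.convolve(dist, [1.0 - pj, pj])

K = len(dist) + 40                      # truncation; the Poisson tail beyond is negligible
pois = np.array([exp(-lam) * lam ** k / factorial(k) for k in range(K)])
sn = np.zeros(K)
sn[:len(dist)] = dist

tau = 0.5 * np.abs(sn - pois).sum()     # total variation distance
sum_p2 = float(np.sum(p ** 2))

assert tau < sum_p2                                   # Le Cam's bound
lower = (1.0 / 32) * min(1.0, 1.0 / lam) * sum_p2     # two-sided refinement
upper = (1.0 - exp(-lam)) / lam * sum_p2
assert lower <= tau <= upper
```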
Proposition 8 (Distance between negative binomial distributions).
Let $X_i \sim \mathrm{NegBin}(m, p_i)$, $0 < p_1 < p_2 < 1$. Then,
$$\tau(X_1,X_2) = (m+l-1)\int_{p_1}^{p_2} P\big(S_{m+l-2}(u) = m-1\big)\,du$$
where $S_n(u) \sim \mathrm{Bin}(n,u)$ and
$$l = \Bigg\lceil\frac{m\ln\Big(1+\frac{p_2-p_1}{p_1}\Big)}{-\ln\Big(1-\frac{p_2-p_1}{1-p_1}\Big)}\Bigg\rceil$$
with $m\frac{1-p_2}{p_2} \le l \le m\frac{1-p_1}{p_1}$.

5. Total Variation Distance in the Multi-Dimensional Gaussian Case

Theorem 2.
Let $\tau = \mathrm{TV}(N(\mu_1,\Sigma_1), N(\mu_2,\Sigma_2))$, and let $\Sigma_1, \Sigma_2$ be positively definite. Let $\delta = \mu_1 - \mu_2$ and let $\Pi$ be a $d \times (d-1)$ matrix whose columns form a basis for the subspace orthogonal to $\delta$. Let $\lambda_1,\dots,\lambda_{d-1}$ denote the eigenvalues of the matrix $(\Pi^T\Sigma_1\Pi)^{-1}\Pi^T\Sigma_2\Pi - I_{d-1}$ and $\lambda = \big[\sum_{i=1}^{d-1}\lambda_i^2\big]^{1/2}$. If $\mu_1 \ne \mu_2$, then
$$\frac{1}{200}\min\big[1,\ \varphi(\delta,\Sigma_1,\Sigma_2)\big] \le \tau \le \frac{9}{2}\min\big[1,\ \varphi(\delta,\Sigma_1,\Sigma_2)\big]$$
where
$$\varphi(\delta,\Sigma_1,\Sigma_2) = \max\Big[\frac{|\delta^T(\Sigma_1-\Sigma_2)\delta|}{\delta^T\Sigma_1\delta},\ \frac{\delta^T\delta}{\sqrt{\delta^T\Sigma_1\delta}},\ \lambda\Big].$$
In the case of equal means $\mu_1 = \mu_2$, the bound (43) is simplified:
$$\frac{1}{100}\min[1,\lambda] \le \tau \le \frac{3}{2}\min[1,\lambda].$$
Here, $\lambda = \big[\sum_{j=1}^d \lambda_j^2\big]^{1/2}$, where $\lambda_1,\dots,\lambda_d$ are the eigenvalues of $\Sigma_1^{-1}\Sigma_2 - I_d$ for positively definite $\Sigma_1, \Sigma_2$.
Proof is given in Section 6.
Suppose $r \le d$, and we want to find a low-dimensional projection $A \in \mathbb{R}^{r\times d}$, $AA^T = I_r$, of the multidimensional data $X \sim N(\mu_1,\Sigma_1)$ and $Y \sim N(\mu_2,\Sigma_2)$ such that $\mathrm{TV}(AX, AY) \to \max$. The problem may be reduced to the case $\mu_1 = \mu_2 = 0$, $\Sigma_1 = I_d$, $\Sigma_2 = \Sigma$, cf. [6]. In view of (44), it is natural to maximize
$$\min\Big[1,\ \sum_{i=1}^r g(\gamma_i)\Big]$$
where $g(x) = \big(1 - \frac{1}{x}\big)^2$ and $\gamma_i$ are the eigenvalues of $A\Sigma A^T$. Consider all permutations $\pi$ of these eigenvalues. Let
$$\pi^* = \arg\max_\pi \sum_{i=1}^r g\big(\lambda_{\pi(i)}\big), \qquad \gamma_i = \lambda_{\pi^*(i)},\quad i = 1,\dots,r.$$
Then, rows of matrix A should be selected as the normalized eigenvectors of Σ associated with the eigenvalues γ i .
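The selection rule can be sketched as a brute-force search over $r$-subsets of eigenvalues. As an illustration, the code below uses the additive contribution $g(x) = \frac{1}{2}(x + \frac{1}{x} - 2)$ of the symmetric KL divergence between $N(0,1)$ and $N(0,x)$ (the random covariance matrix and sizes are assumptions for the example).

```python
import numpy as np
from itertools import combinations

def g(x):
    # symmetric KL between N(0,1) and N(0,x): KL(P||Q) + KL(Q||P) = (x + 1/x - 2)/2
    return 0.5 * (x + 1.0 / x - 2.0)

rng = np.random.default_rng(5)
d, r = 6, 3
B = rng.normal(size=(d, d))
Sigma = B @ B.T + 0.1 * np.eye(d)
lam, vecs = np.linalg.eigh(Sigma)

# exhaustive search over all r-subsets of eigenvalues (feasible for small d)
best = max(combinations(range(d), r),
           key=lambda idx: sum(g(lam[i]) for i in idx))
A = vecs[:, list(best)].T                # rows = chosen normalized eigenvectors
assert np.allclose(A @ A.T, np.eye(r))   # A is semi-orthogonal
gammas = np.linalg.eigvalsh(A @ Sigma @ A.T)
assert np.allclose(np.sort(gammas), np.sort(lam[list(best)]))
```

Choosing rows of $A$ as eigenvectors of $\Sigma$ makes $A\Sigma A^T$ diagonal with the selected eigenvalues, which is exactly the situation the Poincaré separation theorem below shows to be extremal.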
Remark 2.
For zero-mean Gaussian models, this procedure may be repeated mutatis mutandis for any of the so-called f-divergences $D_f(P\|Q) := E_P\,f\big(\frac{dQ}{dP}\big)$, where $f$ is a convex function such that $f(1) = 0$, cf. [6]. The most interesting examples are:
(1) KL-divergence: $f(t) = t\log t$ and $g(x) = \frac{1}{2}(x - \log x - 1)$;
(2) Symmetric KL-divergence: $f(t) = (t-1)\log t$ and $g(x) = \frac{1}{2}\big(x + \frac{1}{x} - 2\big)$;
(3) The total variation distance: $f(t) = \frac{1}{2}|t-1|$ and $g(x) = \big(1 - \frac{1}{x}\big)^2$;
(4) The square of the Hellinger distance: $f(t) = (\sqrt{t}-1)^2$ and $g(x) = 1 - \big(\frac{2\sqrt{x}}{x+1}\big)^{1/2}$;
(5) $\chi^2$ divergence: $f(t) = (t-1)^2$ and $g(x) = \frac{1}{\sqrt{x(2-x)}} - 1$.
For the optimization procedure in (47), the following result is very useful.
Theorem 3 (Poincaré Separation Theorem). 
Let $\Sigma$ be a real symmetric $d \times d$ matrix and $A$ a semi-orthogonal $r \times d$ matrix. The eigenvalues of $\Sigma$ (sorted in descending order) and the eigenvalues of $A\Sigma A^T$, denoted by $\{\gamma_i,\ i = 1,\dots,r\}$ (sorted in descending order), satisfy
$$\lambda_{d-(r-i)} \le \gamma_i \le \lambda_i, \qquad i = 1,\dots,r.$$
Proposition 9.
Let $X, Y$ be two Gaussian PDs with the same covariance matrix: $X \sim N(\mu_1,\Sigma)$, $Y \sim N(\mu_2,\Sigma)$. Suppose that the matrix $\Sigma$ is non-singular. Then,
$$\tau(X,Y) = 2\Phi\big(||\Sigma^{-1/2}(\mu_1-\mu_2)||/2\big) - 1.$$
Proof. 
Here, the set $A := \{x \in \mathbb{R}^d : p(x|\mu_1,\Sigma) > p(x|\mu_2,\Sigma)\}$ is a half-space. Indeed,
$$p(x|\mu_1,\Sigma) > p(x|\mu_2,\Sigma) \iff 2x^T\Sigma^{-1}(\mu_2-\mu_1) < \mu_2^T\Sigma^{-1}\mu_2 - \mu_1^T\Sigma^{-1}\mu_1.$$
After the change of variables $x \to x + \mu_1$, we need to evaluate the expression
$$I := \frac{1}{(2\pi)^{d/2}\det(\Sigma)^{1/2}}\int_{\mathbb{R}^d}\mathbf{1}\Big\{x^T\Sigma^{-1}\delta < \frac{1}{2}||\Sigma^{-1/2}\delta||^2\Big\}\Big[e^{-x^T\Sigma^{-1}x/2} - e^{-(x-\delta)^T\Sigma^{-1}(x-\delta)/2}\Big]dx.$$
Take an orthogonal $d \times d$ matrix $O$ such that $O\Sigma^{-1/2}\delta = ||\Sigma^{-1/2}\delta||\,e_1$ and change the variables $x = \Sigma^{1/2}O^Tu$. Then,
$$x^T\Sigma^{-1}\delta = ||\Sigma^{-1/2}\delta||\,u_1, \quad x^T\Sigma^{-1}x = u^Tu, \quad (x-\delta)^T\Sigma^{-1}(x-\delta) = u^Tu + ||\Sigma^{-1/2}\delta||^2 - 2||\Sigma^{-1/2}\delta||\,u_1.$$
Thus,
$$I = \frac{1}{(2\pi)^{d/2}}\int_{\mathbb{R}^{d-1}} e^{-v^Tv/2}\,dv\int_{-\infty}^{||\Sigma^{-1/2}\delta||/2}\Big[e^{-u_1^2/2} - e^{-(u_1 - ||\Sigma^{-1/2}\delta||)^2/2}\Big]du_1 = \Phi(b) - \Phi(-b)$$
where $b = ||\Sigma^{-1/2}\delta||/2$. □
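Proposition 9 lends itself to a deterministic numerical check in dimension two: the closed form is compared against a grid quadrature of $\frac{1}{2}\int|p_1 - p_2|$ (the covariance matrix and means are illustrative values).

```python
import numpy as np
from math import erf, sqrt

def Phi(t):
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])   # illustrative common covariance
mu1 = np.array([0.0, 0.0])
mu2 = np.array([1.0, -0.5])

# closed form: tau = 2 Phi(||Sigma^{-1/2}(mu1 - mu2)|| / 2) - 1
vals, vecs = np.linalg.eigh(Sigma)
S_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
b = np.linalg.norm(S_inv_half @ (mu1 - mu2)) / 2.0
tau_exact = 2.0 * Phi(b) - 1.0

# direct numerical evaluation of (1/2) * int |p1 - p2| on a grid
Sinv = np.linalg.inv(Sigma)
const = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))
xs = np.linspace(-8.0, 9.0, 1201)
ys = np.linspace(-7.0, 7.0, 1201)
X, Y = np.meshgrid(xs, ys, indexing="ij")
pts = np.stack([X.ravel(), Y.ravel()], axis=1)

def dens(mu):
    z = pts - mu
    return const * np.exp(-0.5 * np.einsum("ij,jk,ik->i", z, Sinv, z))

cell = (xs[1] - xs[0]) * (ys[1] - ys[0])
tau_num = 0.5 * np.abs(dens(mu1) - dens(mu2)).sum() * cell
assert abs(tau_exact - tau_num) < 1e-3
```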

6. Proofs for the Multi-Dimensional Gaussian Case

Let $X_i \sim N(\mu_i,\Sigma_i)$, $i = 1,2$. W.l.o.g., we may assume that $\Sigma_1, \Sigma_2$ are positively definite; the general case follows from the identity
$$\mathrm{TV}\big(N(0,\Sigma_1), N(0,\Sigma_2)\big) = \mathrm{TV}\big(N(0,\Pi^T\Sigma_1\Pi), N(0,\Pi^T\Sigma_2\Pi)\big)$$
where $\Pi$ is a $d \times r$ matrix whose columns form an orthogonal basis for $\mathrm{range}(\Sigma_1 + \Sigma_2)$. Denote $u = (\mu_1+\mu_2)/2$, $\delta = \mu_1-\mu_2$, and decompose $w \in \mathbb{R}^d$ as
$$w = u + f_1(w)\,\delta + f_2(w), \qquad f_2(w)^T\delta = 0.$$
Then,
$$\max\big[\mathrm{TV}(f_1(X_1), f_1(X_2)),\ \mathrm{TV}(f_2(X_1), f_2(X_2))\big] \le \mathrm{TV}(X_1,X_2) \le \mathrm{TV}(f_1(X_1), f_1(X_2)) + \mathrm{TV}(f_2(X_1), f_2(X_2)).$$
All the components are Gaussian, and
$$f_1(X_1) \sim N\Big(\frac{1}{2}, \frac{\delta^T\Sigma_1\delta}{(\delta^T\delta)^2}\Big), \quad f_1(X_2) \sim N\Big(-\frac{1}{2}, \frac{\delta^T\Sigma_2\delta}{(\delta^T\delta)^2}\Big), \quad f_2(X_1) \sim N(0, P\Sigma_1P), \quad f_2(X_2) \sim N(0, P\Sigma_2P), \quad P = I_d - \frac{\delta\delta^T}{\delta^T\delta}.$$
We claim that
$$\frac{1}{200}\min\Big[1,\ \max\Big[\frac{|\delta^T(\Sigma_1-\Sigma_2)\delta|}{2\,\delta^T\Sigma_1\delta},\ \frac{40\,\delta^T\delta}{\sqrt{\delta^T\Sigma_1\delta}}\Big]\Big] \le \mathrm{TV}(f_1(X_1), f_1(X_2)) \le \frac{3\,|\delta^T(\Sigma_1-\Sigma_2)\delta|}{2\,\delta^T\Sigma_1\delta} + \frac{\delta^T\delta}{2\sqrt{\delta^T\Sigma_1\delta}},$$
$$\frac{1}{100}\min[1,\lambda] \le \mathrm{TV}(f_2(X_1), f_2(X_2)) \le \frac{3}{2}\,\lambda$$
where $\lambda = \big[\sum_{j=1}^d \lambda_j^2\big]^{1/2}$ and the $\lambda_j$ are the eigenvalues of $\Sigma_1^{-1}\Sigma_2 - I_d$.
Proof of upper bound. 
It follows from Pinsker's inequality. Let $d = 1$ and $\sigma_2 \ge \sigma_1$. Then, for $x = \sigma_2^2/\sigma_1^2$, we have $x - 1 - \ln x \le (x-1)^2$ and, by Pinsker's inequality,
$$\mathrm{TV}\big(N(\mu_1,\sigma_1^2), N(\mu_2,\sigma_2^2)\big) \le \frac{1}{2}\Big[\frac{\sigma_2^2}{\sigma_1^2} - 1 - \ln\frac{\sigma_2^2}{\sigma_1^2} + \frac{\Delta^2}{\sigma_1^2}\Big]^{1/2} \le \frac{1}{2}\Big[\frac{\sigma_2^2}{\sigma_1^2} - 1 - \ln\frac{\sigma_2^2}{\sigma_1^2}\Big]^{1/2} + \frac{1}{2}\frac{\Delta}{\sigma_1} \le \frac{1}{2}\frac{|\sigma_2^2-\sigma_1^2|}{\sigma_1^2} + \frac{1}{2}\frac{\Delta}{\sigma_1}.$$
For $d > 1$, it is enough to obtain the upper bound in the case $\mu_1 = \mu_2 = 0$. Again, Pinsker's inequality implies: if $\lambda_i > -\frac{2}{3}$ for all $i$,
$$4\,\mathrm{TV}\big(N(0,\Sigma_1), N(0,\Sigma_2)\big)^2 \le 2\,\mathrm{KL} = \sum_{i=1}^d\big[\lambda_i - \ln(1+\lambda_i)\big] \le \sum_{i=1}^d \lambda_i^2 = \lambda^2.$$
Sketch of proof for lower bound, cf. [7]. 
In the 1D case with $X_i \sim N(\mu_i,\sigma_i^2)$ ($\mu_1 \le \mu_2$),
$$\mathrm{TV}\big(N(\mu_1,\sigma_1^2), N(\mu_2,\sigma_2^2)\big) \ge P(X_2 \ge \mu_2) - P(X_1 \ge \mu_2) = \frac{1}{2} - \Big[\frac{1}{2} - P\big(X_1 \in (\mu_1,\mu_2)\big)\Big] = P\big(X_1 \in (\mu_1,\mu_2)\big) \ge \frac{1}{5}\min\Big[1, \frac{\Delta}{\sigma_1}\Big].$$
Next,
$$\mathrm{TV}\big(N(\mu_1,\sigma_1^2), N(\mu_2,\sigma_2^2)\big) \ge \frac{1}{2}\,\mathrm{TV}\big(N(0,\sigma_1^2), N(0,\sigma_2^2)\big).$$
Indeed, assume w.l.o.g. $\mu_1 \le \mu_2$, $\sigma_1 \le \sigma_2$. Then, there exists $c = c(\sigma_1,\sigma_2)$ such that
$$\mathrm{TV}\big(N(0,\sigma_1^2), N(0,\sigma_2^2)\big) = P\big(N(0,\sigma_2^2) \notin [-c,c]\big) - P\big(N(0,\sigma_1^2) \notin [-c,c]\big).$$
Hence,
$$\mathrm{TV}\big(N(\mu_1,\sigma_1^2), N(\mu_2,\sigma_2^2)\big) \ge P\big(N(\mu_2,\sigma_2^2) > c+\mu_1\big) - P\big(N(\mu_1,\sigma_1^2) > c+\mu_1\big) \ge \frac{1}{2}\,\mathrm{TV}\big(N(0,\sigma_1^2), N(0,\sigma_2^2)\big).$$
Thus, it is enough to study the case $\mu_1 = \mu_2 = 0$. Let $C = \mathrm{diag}(1+\lambda_i)$. Then,
$$\mathrm{TV}\big(N(0,\Sigma_1), N(0,\Sigma_2)\big) = \mathrm{TV}\big(N(0,C^{-1}), N(0,I_d)\big).$$
In the case when there exists $i$ with $|\lambda_i| > 0.1$,
$$\mathrm{TV}\big(N(0,C^{-1}), N(0,I_d)\big) \ge \mathrm{TV}\big(N(0,(1+\lambda_i)^{-1}), N(0,1)\big) = \mathrm{TV}\big(N(0,1), N(0,1+\lambda_i)\big) \ge P\big(N(0,1) \in [-1,1]\big) - P\big(N(0,1.1) \in [-1,1]\big) > 0.68 - 0.66 > 0.01.$$
Finally, in the case when $|\lambda_i| \le 0.1$ for all $i$, the result follows from the lower bound
$$\mathrm{TV}\big(N(0,C^{-1}), N(0,I_d)\big) \ge \frac{\lambda}{6} - \frac{\lambda^2}{8} - \frac{1}{2}\big(e^{\lambda^2} - 1\big).$$
The bound (63) is $> \frac{\lambda}{100}$ if $\lambda < 0.17$, and $\mathrm{TV} > 0.01$ if $\lambda \ge 0.17$ and $|\lambda_i| < 0.1$ for all $i$. We refer to [7] for the proofs of these facts. □

7. Estimation of Lévy–Prokhorov Distance

Let $P_i$, $i = 1,2$, be probability distributions on a metric space $W$ with metric $r$. Define the Lévy–Prokhorov distance $\rho_{LP}(P_1,P_2)$ between $P_1, P_2$ as the infimum of numbers $\epsilon > 0$ such that, for any closed set $C \subseteq W$,
$$P_1(C) \le P_2(C^\epsilon) + \epsilon, \qquad P_2(C) \le P_1(C^\epsilon) + \epsilon,$$
where $C^\epsilon$ stands for the $\epsilon$-neighborhood of $C$ in the metric $r$. It can be checked that $\rho_{LP}(P_1,P_2) \le \tau(P_1,P_2)$, i.e., the total variation distance. Equivalently,
$$\rho_{LP}(P_1,P_2) = \inf_{\bar P \in \mathcal{P}(P_1,P_2)}\inf\big[\epsilon > 0 : \bar P\big(r(X_1,X_2) > \epsilon\big) < \epsilon\big]$$
where $\mathcal{P}(P_1,P_2)$ is the set of all joint distributions $\bar P$ on $W \times W$ with marginals $P_i$.
Next, define the Wasserstein distance $W_p^r(P_1,P_2)$ between $P_1, P_2$ by
$$W_p^r(P_1,P_2) = \inf_{\bar P \in \mathcal{P}(P_1,P_2)}\big[E_{\bar P}\,r(X_1,X_2)^p\big]^{1/p}.$$
In the case of a Euclidean space with $r(x_1,x_2) = ||x_1-x_2||$, the index $r$ is omitted. The total variation, Wasserstein and Kolmogorov–Smirnov distances defined above are stronger than weak convergence (i.e., convergence in distribution, which is weak* convergence on the space of probability measures, seen as a dual space): if any of these metrics go to zero as $n \to \infty$, then we have weak convergence, but the converse is not true. Weak convergence itself is metrizable (e.g., by the Lévy–Prokhorov metric).
Theorem 4 (Dobrushin’s bound).
$$\rho_{LP}(P_1,P_2) \le \big[W_1^r(P_1,P_2)\big]^{1/2}.$$
Proof. 
Suppose that there exists a closed set $C$ for which at least one of the inequalities (64) fails, say $P_1(C) \ge \epsilon + P_2(C^\epsilon)$. Then, for any joint distribution $\bar P$ with marginals $P_1$ and $P_2$,
$$E_{\bar P}\,r(X_1,X_2) \ge E_{\bar P}\,\mathbf{1}\big(r(X_1,X_2) \ge \epsilon\big)\,r(X_1,X_2) \ge \epsilon\,\bar P\big(r(X_1,X_2) \ge \epsilon\big) \ge \epsilon\,\bar P\big(X_1 \in C,\ X_2 \in W\setminus C^\epsilon\big) \ge \epsilon\big[\bar P(X_1 \in C) - \bar P(X_1 \in C,\ X_2 \in C^\epsilon)\big] \ge \epsilon\big[\bar P(X_1 \in C) - \bar P(X_2 \in C^\epsilon)\big] = \epsilon\big[P_1(C) - P_2(C^\epsilon)\big] \ge \epsilon^2.$$
This leads to (67), as claimed. □
The Lévy–Prokhorov distance is quite tricky to compute, whereas the Wasserstein distance can be found explicitly in a number of cases. Say, in the 1D case $W = \mathbb{R}$, we have
Theorem 5.
For d = 1 ,
$$W_1(P_1,P_2) = \int_{\mathbb{R}} |F_1(x) - F_2(x)|\,dx.$$
Proof. 
First, check the upper bound $W_1(P_1,P_2) \le \int_{\mathbb{R}} |F_1(x) - F_2(x)|\,dx$. Consider $\xi \sim U[0,1]$, $X_i = F_i^{-1}(\xi)$, $i = 1,2$. Then, in view of the Fubini theorem,
$$E[|X_1 - X_2|] = \int_0^1 |F_1^{-1}(y) - F_2^{-1}(y)|\,dy = \int_{\mathbb{R}} |F_1(x) - F_2(x)|\,dx.$$
For the proof of the inverse inequality, see [8]. □
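For empirical measures with equal sample sizes, the quantile coupling and the CDF-area formula of Theorem 5 agree exactly, which gives a convenient numerical illustration (the two samples below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.sort(rng.exponential(1.0, size=1000))    # sample from P1
y = np.sort(rng.normal(2.0, 1.0, size=1000))    # sample from P2

# quantile form: W1 = int_0^1 |F1^{-1}(t) - F2^{-1}(t)| dt
# (for empirical measures of equal size: the mean gap between order statistics)
w1_quantile = np.abs(x - y).mean()

# CDF form: W1 = int |F1 - F2| dx, with empirical step CDFs
grid = np.sort(np.concatenate([x, y]))
F1 = np.searchsorted(x, grid, side="right") / len(x)
F2 = np.searchsorted(y, grid, side="right") / len(y)
w1_cdf = np.sum(np.abs(F1[:-1] - F2[:-1]) * np.diff(grid))

assert abs(w1_quantile - w1_cdf) < 1e-8
```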
Proposition 10.
For $d = 1$ and $p > 1$,
$$W_p(P_1,P_2)^p = p(p-1)\int dy\int_y^\infty \max\big[F_2(y)-F_1(x),\,0\big](x-y)^{p-2}\,dx + p(p-1)\int dx\int_x^\infty \max\big[F_1(x)-F_2(y),\,0\big](y-x)^{p-2}\,dy.$$
Proof. 
It follows from the identity
$$E[|X-Y|^p] = p(p-1)\int dy\int_y^\infty \big[F_2(y) - F(x,y)\big](x-y)^{p-2}\,dx + p(p-1)\int dx\int_x^\infty \big[F_1(x) - F(x,y)\big](y-x)^{p-2}\,dy,$$
where $F(x,y)$ is the joint distribution function of $(X,Y)$. The minimum is achieved for $\bar F(x,y) = \min[F_1(x), F_2(y)]$. For an alternative expression, see [9]:
$$W_p(P_1,P_2)^p = \int_0^1 |F_1^{-1}(t) - F_2^{-1}(t)|^p\,dt.\ \square$$
Proposition 11.
Let $(X,Y) \in \mathbb{R}^{2d}$ be jointly Gaussian random variables (RVs) with $E[X] = \mu^X$, $E[Y] = \mu^Y$. Then, the Frechet-1 distance equals
$$\rho_{F1}(X,Y) := E\sum_{j=1}^d |X_j - Y_j| = \sum_{j=1}^d\Big\{\big(\mu_j^X-\mu_j^Y\big)\Big[1 - 2\Phi\Big(-\frac{\mu_j^X-\mu_j^Y}{\hat\sigma_j}\Big)\Big] + 2\hat\sigma_j\,\varphi\Big(\frac{\mu_j^X-\mu_j^Y}{\hat\sigma_j}\Big)\Big\},$$
where $\hat\sigma_j = \big[(\sigma_j^X)^2 + (\sigma_j^Y)^2 - 2\,\mathrm{Cov}(X_j,Y_j)\big]^{1/2}$, and $\varphi$ and $\Phi$ are the PDF and CDF of the standard Gaussian RV. Note that, in the case $\mu^X = \mu^Y$, the first term in (74) vanishes, and the second term gives
$$\rho_{F1}(X,Y) = \sqrt{\frac{2}{\pi}}\sum_{j=1}^d \hat\sigma_j.$$
We also present expressions for the Frechet-3 and Frechet-4 distances. Writing $\Delta_j = \mu_j^X - \mu_j^Y$,
$$\rho_{F3}(X,Y) = \Big[E\sum_{j=1}^d |X_j-Y_j|^3\Big]^{1/3} = \Bigg(\sum_{j=1}^d\Big\{\Delta_j^3\Big[1 - 2\Phi\Big(-\frac{\Delta_j}{\hat\sigma_j}\Big)\Big] + 6\Delta_j^2\hat\sigma_j\,\varphi\Big(\frac{\Delta_j}{\hat\sigma_j}\Big) + 3\hat\sigma_j^2\Delta_j\Big[1 - 2\Phi\Big(-\frac{\Delta_j}{\hat\sigma_j}\Big) - \frac{2\Delta_j}{\hat\sigma_j}\varphi\Big(\frac{\Delta_j}{\hat\sigma_j}\Big)\Big] + 2\hat\sigma_j^3\,\varphi\Big(\frac{\Delta_j}{\hat\sigma_j}\Big)\Big[\Big(\frac{\Delta_j}{\hat\sigma_j}\Big)^2 + 2\Big]\Big\}\Bigg)^{1/3},$$
$$\rho_{F4}(X,Y) = \Big[E\sum_{j=1}^d |X_j-Y_j|^4\Big]^{1/4} = \Big[\sum_{j=1}^d\big(\Delta_j^4 + 6\Delta_j^2\hat\sigma_j^2 + 3\hat\sigma_j^4\big)\Big]^{1/4}.$$
All of these expressions are minimized when the $\mathrm{Cov}(X_j,Y_j)$, $j = 1,\dots,d$, are maximal. However, this fact does not immediately lead to explicit expressions for the Wasserstein metrics. The problem here is that the joint covariance matrix $\Sigma_{X,Y}$ should be positively definite. Thus, the straightforward choice $\mathrm{Corr}(X_j,Y_j) = 1$ is not always possible; see Theorem 6 below and [10].
[Maurice René Fréchet (1878–1973), a French mathematician, worked in topology, functional analysis, probability theory and statistics. He was the first to introduce the concept of a metric space (1906) and to prove the representation theorem in $L^2$ (1907). However, in both cases, the credit was given to other people: Hausdorff and Riesz. Some sources claim that he discovered the Cramér–Rao inequality before anybody else, but such a claim is impossible to verify since the lecture notes of his class appear to have been lost. Fréchet worked in several places in France before moving to Paris in 1928. In 1941, he succeeded Borel in the Chair of Calculus of Probabilities and Mathematical Physics at the Sorbonne. In 1956, he was elected to the French Academy of Sciences, at the age of 78, which was rather unusual. He influenced and mentored a number of young mathematicians, notably Fortet and Loève. He was an enthusiast of Esperanto; some of his papers were published in this language.]

8. Wasserstein Distance in the Gaussian Case

In the Gaussian case, it is convenient to use the following extension of Dobrushin's bound to $p \ge 1$ (applied below with $p = 2$):
$$\rho_{LP}(P_1,P_2) \le \big[W_p(P_1,P_2)\big]^{p/(p+1)}, \qquad p \ge 1.$$
Theorem 6.
Let $X_i \sim N(\mu_i,\Sigma_i^2)$, $i = 1,2$, be $d$-dimensional Gaussian RVs. For simplicity, assume that both matrices $\Sigma_1^2$ and $\Sigma_2^2$ are non-singular (in the general case, the statement holds with $\Sigma_1^{-1}$ understood as the Moore–Penrose inverse). Then, the $L^2$ Wasserstein distance $W_2(X_1,X_2) = W_2(N(\mu_1,\Sigma_1^2), N(\mu_2,\Sigma_2^2))$ equals
$$W_2(X_1,X_2) = \Big[||\mu_1-\mu_2||^2 + \mathrm{tr}(\Sigma_1^2) + \mathrm{tr}(\Sigma_2^2) - 2\,\mathrm{tr}\big[(\Sigma_1\Sigma_2^2\Sigma_1)^{1/2}\big]\Big]^{1/2}$$
where $(\Sigma_1\Sigma_2^2\Sigma_1)^{1/2}$ stands for the positively definite matrix square root. The value (78) is achieved when $X_2 = \mu_2 + A(X_1-\mu_1)$, where $A = \Sigma_1^{-1}(\Sigma_1\Sigma_2^2\Sigma_1)^{1/2}\Sigma_1^{-1}$.
Corollary 1.
Let $\mu_1 = \mu_2 = 0$. Then, for $d = 1$, $W_2(X_1,X_2) = |\sigma_1-\sigma_2|$. For $d = 2$,
$$W_2(X_1,X_2) = \Big\{\mathrm{tr}(\Sigma_1^2) + \mathrm{tr}(\Sigma_2^2) - 2\big[\mathrm{tr}(\Sigma_1^2\Sigma_2^2) + 2\det(\Sigma_1\Sigma_2)\big]^{1/2}\Big\}^{1/2}.$$
Note that the expression in (79) vanishes when $\Sigma_1^2 = \Sigma_2^2$.
Example 3.
(a) Let $X \sim N(0,\Sigma_X^2)$, $Y \sim N(0,\Sigma_Y^2)$, where $\Sigma_X^2 = \sigma_X^2 I_d$ and $\Sigma_Y^2 = \sigma_Y^2 I_d$. Then, $W_2(X,Y) = \sqrt{d}\,|\sigma_X-\sigma_Y|$.
(b) Let $d = 2$, $X \sim N(0,\Sigma_X^2)$, $Y \sim N(0,\Sigma_Y^2)$, where $\Sigma_X^2 = \sigma_X^2 I_2$, $\Sigma_Y^2 = \sigma_Y^2\begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}$ and $\rho \in (-1,1)$. Then,
$$W_2(X,Y) = 2^{1/2}\Big[\sigma_X^2 + \sigma_Y^2 - \sigma_X\sigma_Y\big(2 + 2(1-\rho^2)^{1/2}\big)^{1/2}\Big]^{1/2}.$$
(c) Let $d = 2$, $X \sim N(0,\Sigma_X^2)$, $Y \sim N(0,\Sigma_Y^2)$, where $\Sigma_X^2 = \sigma_X^2\begin{pmatrix}1 & \rho_1\\ \rho_1 & 1\end{pmatrix}$, $\Sigma_Y^2 = \sigma_Y^2\begin{pmatrix}1 & \rho_2\\ \rho_2 & 1\end{pmatrix}$ and $\rho_1, \rho_2 \in (-1,1)$. Then,
$$W_2(X,Y) = 2^{1/2}\Big[\sigma_X^2 + \sigma_Y^2 - \sigma_X\sigma_Y\big(2 + 2\rho_1\rho_2 + 2(1-\rho_1^2)^{1/2}(1-\rho_2^2)^{1/2}\big)^{1/2}\Big]^{1/2}.$$
Note that, in the case $\rho_1 = \rho_2$, $W_2(X,Y) = \sqrt{2}\,|\sigma_X-\sigma_Y|$, as in (a).
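The general formula of Theorem 6 and the closed forms of Example 3 can be cross-checked numerically; the matrix square roots below are computed via symmetric eigendecomposition, and the parameters are illustrative.

```python
import numpy as np

def sqrtm_psd(M):
    # square root of a symmetric positive semi-definite matrix
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def w2_gauss(mu1, S1sq, mu2, S2sq):
    """W2 between Gaussians with covariances S1sq, S2sq (the Sigma_i^2 of the text)."""
    S1 = sqrtm_psd(S1sq)
    cross = sqrtm_psd(S1 @ S2sq @ S1)
    val = np.sum((mu1 - mu2) ** 2) + np.trace(S1sq) + np.trace(S2sq) - 2 * np.trace(cross)
    return float(np.sqrt(max(val, 0.0)))

sx, sy, r1, r2 = 1.5, 0.7, 0.3, -0.5     # illustrative parameters
Sx = sx ** 2 * np.array([[1.0, r1], [r1, 1.0]])
Sy = sy ** 2 * np.array([[1.0, r2], [r2, 1.0]])
zero = np.zeros(2)
w2 = w2_gauss(zero, Sx, zero, Sy)

# closed form of Example 3(c)
inner = 2 + 2 * r1 * r2 + 2 * np.sqrt((1 - r1 ** 2) * (1 - r2 ** 2))
w2_closed = 2 ** 0.5 * np.sqrt(sx ** 2 + sy ** 2 - sx * sy * np.sqrt(inner))
assert abs(w2 - w2_closed) < 1e-8

# isotropic case of Example 3(a): W2 = sqrt(d) |sx - sy|
w2_iso = w2_gauss(np.zeros(3), sx ** 2 * np.eye(3), np.zeros(3), sy ** 2 * np.eye(3))
assert abs(w2_iso - np.sqrt(3) * abs(sx - sy)) < 1e-8
```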
Proof. 
First, reduce to the case $\mu_1 = \mu_2 = 0$ by using the identity $W_2^2(X_1,X_2) = ||\mu_1-\mu_2||^2 + W_2^2(\xi_1,\xi_2)$ with $\xi_i = X_i - \mu_i$. Note that the infimum in (19) is always attained on Gaussian measures, as $W_2(X_1,X_2)$ is expressed in terms of the covariance matrix $\Sigma^2 = \Sigma_{X,Y}^2$ only (cf. (81) below). Let us write the covariance matrix in the block form
$$\Sigma^2 = \begin{pmatrix}\Sigma_1^2 & K\\ K^T & \Sigma_2^2\end{pmatrix} = \begin{pmatrix}\Sigma_1 & 0\\ K^T\Sigma_1^{-1} & I\end{pmatrix}\begin{pmatrix}I & 0\\ 0 & S\end{pmatrix}\begin{pmatrix}\Sigma_1 & \Sigma_1^{-1}K\\ 0 & I\end{pmatrix}$$
where $S = \Sigma_2^2 - K^T\Sigma_1^{-2}K$ is the so-called Schur complement. The problem is reduced to finding the matrix $K$ in (80) that minimizes the expression
$$\int_{\mathbb{R}^d\times\mathbb{R}^d} ||x-y||^2\,dP_{X,Y}(x,y) = \mathrm{tr}(\Sigma_1^2) + \mathrm{tr}(\Sigma_2^2) - 2\,\mathrm{tr}(K)$$
subject to a constraint that the matrix Σ 2 in (80) is positively definite. The goal is to check that the minimum (81) is achieved when the Shur’s complement S in (80) equals 0. Consider the fiber σ 1 ( S ) , i.e., the set of all matrices K such that σ ( K ) : = Σ Y 2 K T ( Σ X 2 ) 1 K = S . It is enough to check that the maximum value of tr ( K ) on this fiber equals
max F σ 1 ( S ) tr ( K ) = tr ( Σ Y ( Σ X 2 S ) Σ Y ) 1 / 2 .
Since the matrix S is positively defined, it is easy to check that the fiber S = 0 should be selected. In order to establish (82), represent the positively definite matrix Σ Y 2 S in the form Σ Y 2 S = U D r 2 U T , where the diagonal matrix D r 2 = diag ( λ 1 2 , , λ r 2 , 0 , , 0 ) and λ i > 0 . Next, U = ( U r | U d r ) is the orthogonal matrix of the corresponding eigenvectors. We obtain the following r × r identity:
( Σ X 1 K U r D r 1 ) T ( Σ X 1 K U r D r 1 ) = I r .
It means that Σ X 1 K U r D r 1 = O r , an ‘orthogonal’ d × r matrix, with O r T O r = I r , and K = Σ X O r D r U r T . The matrix O r parametrises the fiber σ 1 ( S ) . As a result, we have an optimization problem
tr ( O T M ) max , M = Σ X U r D r
in a matrix-valued argument O r , subject to the constraint O r T O r = I r . A straightforward computation gives the answer tr [ ( M T M ) 1 / 2 ] , which is equivalent to (82). Technical details can be found in [11,12]. □
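As a numerical sanity check on part (b), one can compare the closed form with the trace formula $W_2^2 = \operatorname{tr}(\Sigma_X^2) + \operatorname{tr}(\Sigma_Y^2) - 2\operatorname{tr}[(\Sigma_X \Sigma_Y^2 \Sigma_X)^{1/2}]$. The following sketch assumes NumPy; the helper names `sqrtm_psd` and `w2_gaussian` are ours, not from the paper.

```python
import numpy as np

def sqrtm_psd(M):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

def w2_gaussian(S1sq, S2sq):
    """W2 between N(0, S1sq) and N(0, S2sq) via the trace formula."""
    S1 = sqrtm_psd(S1sq)
    cross = sqrtm_psd(S1 @ S2sq @ S1)
    return np.sqrt(max(np.trace(S1sq) + np.trace(S2sq) - 2 * np.trace(cross), 0.0))

sx, sy, rho = 1.3, 0.7, 0.4
SX = sx**2 * np.eye(2)
SY = sy**2 * np.array([[1, rho], [rho, 1]])
closed = np.sqrt(2) * np.sqrt(sx**2 + sy**2 - sx * sy * np.sqrt(2 + 2 * np.sqrt(1 - rho**2)))
print(abs(w2_gaussian(SX, SY) - closed))  # agreement up to rounding
```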
Remark 3.
For general zero-mean RVs $X, Y \in \mathbb{R}^d$ with the covariance matrices $\Sigma_i^2$, $i = 1, 2$, the following inequality holds [13]:
$$\operatorname{tr}(\Sigma_1^2) + \operatorname{tr}(\Sigma_2^2) - 2 \operatorname{tr}\big[(\Sigma_1 \Sigma_2^2 \Sigma_1)^{1/2}\big] \le \mathbb{E}\big[\|X - Y\|^2\big] \le \operatorname{tr}(\Sigma_1^2) + \operatorname{tr}(\Sigma_2^2) + 2 \operatorname{tr}\big[(\Sigma_1 \Sigma_2^2 \Sigma_1)^{1/2}\big].$$
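The two-sided bound of Remark 3 can be checked on an explicit Gaussian coupling, for which $\mathbb{E}\|X - Y\|^2 = \operatorname{tr}(\Sigma_1^2) + \operatorname{tr}(\Sigma_2^2) - 2\operatorname{tr}(K)$, with $K$ the cross-covariance block. A NumPy sketch (the helper `sqrtm_psd` and the randomly generated covariance are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sqrtm_psd(M):
    """Matrix square root of a symmetric PSD matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

d = 3
# Build a random positive-definite joint covariance and read off its blocks.
A = rng.normal(size=(2 * d, 2 * d))
Sigma = A @ A.T + 0.1 * np.eye(2 * d)
S1sq, S2sq, K = Sigma[:d, :d], Sigma[d:, d:], Sigma[:d, d:]

E = np.trace(S1sq) + np.trace(S2sq) - 2 * np.trace(K)  # E||X - Y||^2 for this coupling
S1 = sqrtm_psd(S1sq)
t = np.trace(sqrtm_psd(S1 @ S2sq @ S1))
lower = np.trace(S1sq) + np.trace(S2sq) - 2 * t
upper = np.trace(S1sq) + np.trace(S2sq) + 2 * t
print(lower <= E <= upper)
```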

9. Distance between Distributions of Different Dimensions

For $m \le d$, define the set of matrices with orthonormal rows
$$O(m, d) = \{ V \in \mathbb{R}^{m \times d} : V V^T = I_m \}$$
and the set of affine maps $\varphi_{V,b} : \mathbb{R}^d \to \mathbb{R}^m$ such that $\varphi_{V,b}(x) = V x + b$.
Definition 1.
For any measures $\mu \in M(\mathbb{R}^m)$ and $\nu \in M(\mathbb{R}^d)$, the embeddings of $\mu$ into $\mathbb{R}^d$ are the set of $d$-dimensional measures $\Phi^+(\mu, d) := \{ \alpha \in M(\mathbb{R}^d) : \varphi_{V,b}(\alpha) = \mu$ for some $V \in O(m, d)$, $b \in \mathbb{R}^m \}$, and the projections of $\nu$ onto $\mathbb{R}^m$ are the set of $m$-dimensional measures $\Phi^-(\nu, m) := \{ \beta \in M(\mathbb{R}^m) : \beta = \varphi_{V,b}(\nu)$ for some $V \in O(m, d)$, $b \in \mathbb{R}^m \}$.
Given a metric $\kappa$ between measures of the same dimension, define the projection distance $d^-(\mu, \nu) := \inf_{\beta \in \Phi^-(\nu, m)} \kappa(\mu, \beta)$ and the embedding distance $d^+(\mu, \nu) := \inf_{\alpha \in \Phi^+(\mu, d)} \kappa(\alpha, \nu)$. It may be proved [14] that $d^+(\mu, \nu) = d^-(\mu, \nu)$; denote the common value by $\hat{d}(\mu, \nu)$.
Example 4.
Let us compute the Wasserstein distance between one-dimensional $X \sim N(\mu_1, \sigma^2)$ and $d$-dimensional $Y \sim N(\mu_2, \Sigma)$. Denote by $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ the eigenvalues of $\Sigma$. Then,
$$\hat{W}_2(X, Y) = \begin{cases} \sigma - \sqrt{\lambda_1} & \text{if } \sigma > \sqrt{\lambda_1}, \\ 0 & \text{if } \sqrt{\lambda_d} \le \sigma \le \sqrt{\lambda_1}, \\ \sqrt{\lambda_d} - \sigma & \text{if } \sigma < \sqrt{\lambda_d}. \end{cases} \tag{87}$$
Indeed, in view of Theorem 6, write
$$\big(\hat{W}_2(X, Y)\big)^2 = \min_{\|x\|_2 = 1,\, b \in \mathbb{R}} \Big[ \|\mu_1 - x^T \mu_2 - b\|_2^2 + \sigma^2 + x^T \Sigma x - 2\sigma \sqrt{x^T \Sigma x} \Big] = \min_{\|x\|_2 = 1} \big( \sigma - \sqrt{x^T \Sigma x} \big)^2,$$
and (87) follows, since $x^T \Sigma x$ ranges over $[\lambda_d, \lambda_1]$ as $x$ runs over the unit sphere.
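A quick numerical check of (87): the piecewise formula should agree with a direct minimization of $|\sigma - \sqrt{x^T \Sigma x}|$ over unit directions. A NumPy sketch (function names and the random covariance are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
B = rng.normal(size=(d, d))
Sigma = B @ B.T                                  # covariance of the d-dimensional Gaussian
lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]   # lam[0] = lambda_1 >= ... >= lam[-1] = lambda_d

def w2hat(sigma):
    """Piecewise formula (87): distance from sigma to [sqrt(lambda_d), sqrt(lambda_1)]."""
    lo, hi = np.sqrt(lam[-1]), np.sqrt(lam[0])
    return max(sigma - hi, 0.0) + max(lo - sigma, 0.0)

def w2hat_search(sigma, n_dirs=20000):
    """Direct minimization of |sigma - sqrt(x' Sigma x)| over unit directions
    (random directions plus the eigenvectors of Sigma, where the extremes are attained)."""
    X = rng.normal(size=(n_dirs, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    X = np.vstack([X, np.linalg.eigh(Sigma)[1].T])
    return np.abs(sigma - np.sqrt(np.einsum('ij,jk,ik->i', X, Sigma, X))).min()

for sigma in [0.5 * np.sqrt(lam[-1]), np.sqrt(lam[1]), 2 * np.sqrt(lam[0])]:
    assert abs(w2hat(sigma) - w2hat_search(sigma)) < 1e-9
print("piecewise formula (87) confirmed")
```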
Example 5 (Wasserstein-2 distance between a Dirac measure on $\mathbb{R}^m$ and a discrete measure on $\mathbb{R}^d$).
Let $y \in \mathbb{R}^m$ and let $\mu_1 \in M(\mathbb{R}^m)$ be the Dirac measure with $\mu_1(y) = 1$, i.e., all mass centered at $y$. Let $x_1, \ldots, x_k \in \mathbb{R}^d$ be distinct points, $p_1, \ldots, p_k \ge 0$, $p_1 + \cdots + p_k = 1$, and let $\mu_2 \in M(\mathbb{R}^d)$ be the discrete measure of point masses with $\mu_2(x_i) = p_i$, $i = 1, \ldots, k$. We seek the Wasserstein distance $\hat{W}_2(\mu_1, \mu_2)$ in closed form. Suppose $m \le d$; then,
$$\big(\hat{W}_2(\mu_1, \mu_2)\big)^2 = \inf_{V \in O(m,d)} \inf_{b \in \mathbb{R}^m} \sum_{i=1}^k p_i \|V x_i + b - y\|_2^2 = \inf_{V \in O(m,d)} \sum_{i=1}^k p_i \Big\| V x_i - \sum_{j=1}^k p_j V x_j \Big\|_2^2 = \inf_{V \in O(m,d)} \operatorname{tr}(V C V^T),$$
noting that the second infimum is attained at $b = y - \sum_{i=1}^k p_i V x_i$ and defining $C$ in the last infimum to be
$$C := \sum_{i=1}^k p_i \Big( x_i - \sum_{j=1}^k p_j x_j \Big) \Big( x_i - \sum_{j=1}^k p_j x_j \Big)^T \in \mathbb{R}^{d \times d}.$$
Let the eigenvalue decomposition of the symmetric positive semidefinite matrix $C$ be $C = Q \Lambda Q^T$ with $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_d)$, $\lambda_1 \ge \cdots \ge \lambda_d \ge 0$. Then,
$$\inf_{V \in O(m,d)} \operatorname{tr}(V C V^T) = \sum_{i=0}^{m-1} \lambda_{d-i},$$
and the infimum is attained when $V \in O(m, d)$ has row vectors given by the last $m$ columns of $Q \in O(d)$.
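The eigenvalue answer in Example 5 is easy to verify numerically: the sum of the $m$ smallest eigenvalues of $C$ is attained by the stated $V$ and is never beaten by random matrices with orthonormal rows. A NumPy sketch (all data randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, k = 2, 5, 7
x = rng.normal(size=(k, d))
p = rng.random(k); p /= p.sum()

mean = p @ x                                    # sum_i p_i x_i
C = (x - mean).T @ (p[:, None] * (x - mean))    # sum_i p_i (x_i - mean)(x_i - mean)^T

lam, Q = np.linalg.eigh(C)                      # ascending eigenvalues
w2hat_sq = lam[:m].sum()                        # sum of the m smallest eigenvalues

# Optimal V: rows are the eigenvectors of the m smallest eigenvalues.
V = Q[:, :m].T
assert np.allclose(V @ V.T, np.eye(m))
assert np.isclose(np.trace(V @ C @ V.T), w2hat_sq)

# A random V with orthonormal rows never does better.
for _ in range(100):
    Vr = np.linalg.qr(rng.normal(size=(d, m)))[0].T
    assert np.trace(Vr @ C @ Vr.T) >= w2hat_sq - 1e-9
print("infimum equals the sum of the m smallest eigenvalues")
```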
Note that the geodesic distance (7) and (8) between Gaussian PDs (or corresponding covariance matrices) is equivalent to the formula for the Fisher information metric for the multivariate normal model [15]. Indeed, the multivariate normal model is a differentiable manifold, equipped with the Fisher information as a Riemannian metric; this may be used in statistical inference.
Example 6.
Consider i.i.d. random variables $Z_1, \ldots, Z_n$ that are bivariate normally distributed with diagonal covariance matrices, i.e., we focus on the manifold $M_{diag} = \{ N(\mu, \Lambda) : \mu \in \mathbb{R}^2, \Lambda \text{ diagonal} \}$. In this manifold, consider the submodel $M_{diag}^* = \{ N(\mu, \sigma^2 I) : \mu \in \mathbb{R}^2, \sigma^2 \in \mathbb{R}_+ \}$ corresponding to the hypothesis $H_0 : \sigma_1^2 = \sigma_2^2$. First, consider the standard statistical estimates $\bar{Z}$ for the mean and $s_1^2, s_2^2$ for the variances. If $\bar{\sigma}^2$ denotes the geodesic estimate of the common variance, the squared distance between the initial estimate and the geodesic estimate under the hypothesis $H_0$ is given by
$$\frac{n}{2} \left[ \left( \ln \frac{\bar{\sigma}^2}{s_1^2} \right)^2 + \left( \ln \frac{\bar{\sigma}^2}{s_2^2} \right)^2 \right],$$
which is minimized by $\bar{\sigma}^2 = s_1 s_2$. Hence, instead of the arithmetic mean of the initial variance estimates, we use as an estimate the geometric mean of these quantities.
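The claim that the geometric mean minimizes the squared geodesic distance can be confirmed by a brute-force search; a sketch with hypothetical values $s_1^2 = 0.8$, $s_2^2 = 2.5$:

```python
import numpy as np

s1sq, s2sq = 0.8, 2.5  # hypothetical initial variance estimates

def loss(vbar):
    """Squared geodesic distance (up to the factor n/2)."""
    return np.log(vbar / s1sq)**2 + np.log(vbar / s2sq)**2

grid = np.linspace(0.1, 5.0, 200001)
best = grid[np.argmin(loss(grid))]
print(best, np.sqrt(s1sq * s2sq))  # both ~ 1.4142, the geometric mean of the variances
```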
Finally, we present a distance between symmetric positive definite matrices of different dimensions. Let $m \le d$, let $A$ be $m \times m$, and let $B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$ be $d \times d$; here, $B_{11}$ is an $m \times m$ block. Then, the distance is defined as follows:
$$\hat{d}(A, B) := \left\{ \sum_{j=1}^m \max\big[ 0, \ln \lambda_j(A^{-1} B_{11}) \big]^2 \right\}^{1/2}. \tag{93}$$
In order to estimate the distance (93), after a simultaneous diagonalization of the matrices $A$ and $B$, the following classical result is useful:
Theorem 7 (Cauchy interlacing inequalities).
Let $B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$ be a $d \times d$ symmetric positive definite matrix with eigenvalues $\lambda_1(B) \ge \cdots \ge \lambda_d(B)$ and an $m \times m$ block $B_{11}$. Then,
$$\lambda_j(B) \ge \lambda_j(B_{11}) \ge \lambda_{j+d-m}(B), \quad j = 1, \ldots, m.$$
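Both the distance (93) and the interlacing inequalities are straightforward to check numerically on random positive definite matrices; a sketch (the matrices $A$, $B$ below are randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 6, 3
G = rng.normal(size=(d, d))
B = G @ G.T + 0.1 * np.eye(d)  # positive definite
B11 = B[:m, :m]

lamB = np.sort(np.linalg.eigvalsh(B))[::-1]      # lambda_1(B) >= ... >= lambda_d(B)
lam11 = np.sort(np.linalg.eigvalsh(B11))[::-1]

# Cauchy interlacing: lambda_j(B) >= lambda_j(B11) >= lambda_{j+d-m}(B)
for j in range(m):
    assert lamB[j] >= lam11[j] - 1e-9
    assert lam11[j] >= lamB[j + d - m] - 1e-9

# The distance (93) for a random m x m positive definite A
H = rng.normal(size=(m, m))
A = H @ H.T + 0.1 * np.eye(m)
mu = np.linalg.eigvals(np.linalg.solve(A, B11)).real  # eigenvalues of A^{-1} B11 (real, positive)
dist = np.sqrt(np.sum(np.maximum(0.0, np.log(mu))**2))
print(dist)
```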

10. Context-Sensitive Probability Metrics

The weighted entropy and other weighted probabilistic quantities have generated a substantial literature (see [16,17] and the references therein). The purpose is to introduce a disparity between outcomes of the same probability: in the case of the standard entropy, such outcomes contribute the same amount of information/uncertainty, which is appropriate in context-free situations. However, imagine two equally rare medical conditions, occurring with probability $p \ll 1$, one of which carries a major health risk while the other is just a peculiarity. Formally, they provide the same amount of information, $-\log p$, but the value of this information can be very different. The applications of the weighted entropy to clinical trials are under active development (see [18] and the literature cited therein). In addition, the contribution to the distance (say, from a fixed distribution $Q$) related to these outcomes is the same in any conventional sense. The weighted metrics, through their weight functions, are supposed to fulfill the task of graduating samples, at least to a certain extent.
Let a weight function, or graduation, $\varphi > 0$ on the phase space $\mathcal{X}$ be given. Define the total weighted variation (TWV) distance
$$\tau_\varphi(P_1, P_2) = \frac{1}{2} \left[ \sup_A \left( \int_A \varphi \, dP_1 - \int_A \varphi \, dP_2 \right) + \sup_A \left( \int_A \varphi \, dP_2 - \int_A \varphi \, dP_1 \right) \right].$$
Similarly, define the weighted Hellinger distance. Let $p_1, p_2$ be the densities of $P_1, P_2$ w.r.t. a measure $\nu$. Then,
$$\eta_\varphi(P_1, P_2) := \left[ \frac{1}{2} \int \varphi \, \big( \sqrt{p_1} - \sqrt{p_2} \big)^2 \, d\nu \right]^{1/2}. \tag{98}$$
Lemma 1.
Let $p_1, p_2$ be the densities of $P_1, P_2$ w.r.t. a measure $\nu$. Then, $\tau_\varphi(P_1, P_2)$ is a distance and
$$\tau_\varphi(P_1, P_2) = \frac{1}{2} \int \varphi \, |p_1 - p_2| \, d\nu. \tag{97}$$
Proof. 
The triangle inequality and other properties of a distance follow immediately. Next,
$$\int_{p_1 > p_2} \varphi \, (p_1 - p_2) \, d\nu = \frac{1}{2} \int (\varphi p_1 - \varphi p_2) \, d\nu + \frac{1}{2} \int \varphi \, |p_1 - p_2| \, d\nu,$$
$$\int_{p_2 > p_1} \varphi \, (p_2 - p_1) \, d\nu = \frac{1}{2} \int (\varphi p_2 - \varphi p_1) \, d\nu + \frac{1}{2} \int \varphi \, |p_1 - p_2| \, d\nu.$$
Summing up these equalities implies (97). □
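On a finite phase space, the sup-based definition of $\tau_\varphi$ and the half-$L_1$ formula (97) can be compared by enumerating all events $A$; a sketch with randomly generated discrete densities and weights:

```python
import numpy as np
from itertools import chain, combinations

rng = np.random.default_rng(4)
n = 6
p1 = rng.random(n); p1 /= p1.sum()
p2 = rng.random(n); p2 /= p2.sum()
phi = rng.random(n) + 0.5  # positive weight function

# Sup-based TWV definition: half the sum of the two sups over all events A.
subsets = list(chain.from_iterable(combinations(range(n), r) for r in range(n + 1)))
s1 = max(sum(phi[i] * (p1[i] - p2[i]) for i in A) for A in subsets)
s2 = max(sum(phi[i] * (p2[i] - p1[i]) for i in A) for A in subsets)
tau_sup = 0.5 * (s1 + s2)

# Formula (97)
tau_l1 = 0.5 * np.sum(phi * np.abs(p1 - p2))
assert np.isclose(tau_sup, tau_l1)
print(tau_l1)
```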
Let $\int \varphi \, p_1 \, d\nu \ge \int \varphi \, p_2 \, d\nu$. Then, by the weighted Gibbs inequality [16], $\operatorname{KL}_\varphi(P_1 \| P_2) := \int \varphi \, p_1 \log \frac{p_1}{p_2} \, d\nu \ge 0$.
Theorem 8 (Weighted Pinsker’s inequality).
$$\frac{1}{2} \int \varphi \, |p_1 - p_2| \, d\nu \le \left[ \frac{1}{2} \operatorname{KL}_\varphi(P_1 \| P_2) \int \varphi \, p_1 \, d\nu \right]^{1/2}.$$
Proof. 
Define the function $G(x) = x \log x - x + 1$. The following bound holds, cf. (12):
$$G(x) = x \log x - x + 1 \ge \frac{3}{2} \, \frac{(x-1)^2}{x+2}, \quad x > 0.$$
Now, by the Cauchy–Schwarz inequality,
$$\left[ \int \varphi \, p_2 \left| \frac{p_1}{p_2} - 1 \right| d\nu \right]^2 \le \int 3 \varphi \, \frac{(p_1/p_2 - 1)^2}{p_1/p_2 + 2} \, p_2 \, d\nu \cdot \int \varphi \, \frac{p_1/p_2 + 2}{3} \, p_2 \, d\nu \le 2 \int \varphi \, G\!\left( \frac{p_1}{p_2} \right) p_2 \, d\nu \cdot \int \varphi \, p_1 \, d\nu \le 2 \operatorname{KL}_\varphi(P_1 \| P_2) \int \varphi \, p_1 \, d\nu,$$
and the result follows after dividing by $4$ and taking the square root. □
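Theorem 8 can be tested on random discrete distributions; the sketch below swaps $p_1$ and $p_2$ if needed so that the assumption $\int \varphi p_1 \, d\nu \ge \int \varphi p_2 \, d\nu$ holds:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
p1 = rng.random(n); p1 /= p1.sum()
p2 = rng.random(n); p2 /= p2.sum()
phi = rng.random(n) + 0.5

# Ensure the assumption of the weighted Gibbs inequality.
if (phi * p1).sum() < (phi * p2).sum():
    p1, p2 = p2, p1

lhs = 0.5 * np.sum(phi * np.abs(p1 - p2))
kl_phi = np.sum(phi * p1 * np.log(p1 / p2))
rhs = np.sqrt(0.5 * kl_phi * np.sum(phi * p1))
print(lhs <= rhs)
```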
Theorem 9 (Weighted Le Cam’s inequality).
$$\tau_\varphi(P_1, P_2) \ge \eta_\varphi(P_1, P_2)^2.$$
Proof. 
In view of the inequality
$$\frac{1}{2} |p_1 - p_2| = \frac{1}{2} p_1 + \frac{1}{2} p_2 - \min[p_1, p_2] \ge \frac{1}{2} p_1 + \frac{1}{2} p_2 - \sqrt{p_1 p_2},$$
one obtains
$$\tau_\varphi(P_1, P_2) \ge \frac{1}{2} \int \varphi \, p_1 \, d\nu + \frac{1}{2} \int \varphi \, p_2 \, d\nu - \int \varphi \sqrt{p_1 p_2} \, d\nu = \eta_\varphi(P_1, P_2)^2. \; \square$$
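Theorem 9 admits the same kind of finite-space check: $\tau_\varphi$ should dominate $\eta_\varphi^2$. A sketch with randomly generated discrete data:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 8
p1 = rng.random(n); p1 /= p1.sum()
p2 = rng.random(n); p2 /= p2.sum()
phi = rng.random(n) + 0.5

tau = 0.5 * np.sum(phi * np.abs(p1 - p2))
eta_sq = 0.5 * np.sum(phi * (np.sqrt(p1) - np.sqrt(p2))**2)
print(tau >= eta_sq)
```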
Next, we relate the TWV distance to the sum of the weighted errors of both types in statistical estimation. Let $C$ be the critical domain for testing the hypothesis $H_1 : P_1$ versus the alternative $H_2 : P_2$. Define by $\alpha_\varphi = \int_C \varphi \, p_1 \, d\nu$ and $\beta_\varphi = \int_{\mathcal{X} \setminus C} \varphi \, p_2 \, d\nu$ the weighted error probabilities of types I and II.
Lemma 2.
Let $d = d_C$ be the decision rule with the critical domain $C$. Then,
$$\inf_d \, [\alpha_\varphi + \beta_\varphi] = \frac{1}{2} \left[ \int \varphi \, dP_1 + \int \varphi \, dP_2 \right] - \tau_\varphi(P_1, P_2).$$
Proof. 
Denote $C^* = \{ x : p_2(x) > p_1(x) \}$. Then, the result follows from the equality
$$\int_C \varphi \, dP_1 + \int_{\mathcal{X} \setminus C} \varphi \, dP_2 = \frac{1}{2} \left[ \int \varphi \, dP_1 + \int \varphi \, dP_2 \right] + \frac{1}{2} \int \varphi \, |p_1 - p_2| \, \big[ \mathbf{1}(x \in C \,\triangle\, C^*) - \mathbf{1}(x \in C \,\triangle\, (\mathcal{X} \setminus C^*)) \big] \, d\nu,$$
whose right-hand side is minimized at $C = C^*$. □
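On a finite space, the infimum in Lemma 2 can be computed exactly by enumerating all critical domains $C$, and it should match the right-hand side. A sketch (all data randomly generated):

```python
import numpy as np
from itertools import chain, combinations

rng = np.random.default_rng(7)
n = 6
p1 = rng.random(n); p1 /= p1.sum()
p2 = rng.random(n); p2 /= p2.sum()
phi = rng.random(n) + 0.5

# Exhaustive minimization of alpha_phi + beta_phi over all critical domains C.
subsets = chain.from_iterable(combinations(range(n), r) for r in range(n + 1))
best = min(sum(phi[i] * p1[i] for i in C)
           + sum(phi[i] * p2[i] for i in set(range(n)) - set(C))
           for C in subsets)

tau = 0.5 * np.sum(phi * np.abs(p1 - p2))
rhs = 0.5 * ((phi * p1).sum() + (phi * p2).sum()) - tau
assert np.isclose(best, rhs)
print(best)
```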
Theorem 10 (Weighted Fano’s inequality).
Let $P_1, \ldots, P_M$, $M \ge 2$, be probability distributions such that $P_j \ll P_k$ for all $j, k$. Then,
$$\inf_d \max_{1 \le j \le M} \int \varphi(x) \, \mathbf{1}(d(x) \ne j) \, dP_j(x) \ge \frac{\log M}{\log(M-1)} \cdot \frac{1}{M} \sum_{j=1}^M \int \varphi \, p_j \, d\nu - \frac{1}{\log(M-1)} \left[ \frac{1}{M^2} \sum_{j,k=1}^M \operatorname{KL}_\varphi(P_j \| P_k) + \frac{\log 2}{M} \sum_{j=1}^M \int \varphi \, p_j \, d\nu \right], \tag{106}$$
where the infimum is taken over all tests with values in { 1 , , M } .
Proof. 
Let $Z \in \{1, \ldots, M\}$ be a random variable such that $\mathbb{P}(Z = i) = \frac{1}{M}$, and let $X \sim P_Z$. Note that $P_Z$ is a mixture distribution, so that, for any measure $\nu$ such that $P_Z \ll \nu$, we have $\frac{dP_Z}{d\nu} = \frac{1}{M} \sum_{k=1}^M \frac{dP_k}{d\nu}$, and so
$$\mathbb{P}(Z = j \,|\, X = x) = dP_j(x) \left( \sum_{k=1}^M dP_k(x) \right)^{-1}.$$
It implies, by Jensen's inequality applied to the convex function $-\log x$,
$$\int \varphi(x) \sum_{j=1}^M \mathbb{P}(Z = j \,|\, X = x) \log \mathbb{P}(Z = j \,|\, X = x) \, dP_X(x) \le \frac{1}{M^2} \sum_{j,k=1}^M \int \varphi \log \frac{dP_j}{dP_k} \, dP_j - \frac{\log M}{M} \sum_{j=1}^M \int \varphi \, p_j \, d\nu = \frac{1}{M^2} \sum_{j,k=1}^M \operatorname{KL}_\varphi(P_j \| P_k) - \frac{\log M}{M} \sum_{j=1}^M \int \varphi \, p_j \, d\nu. \tag{107}$$
On the other hand, denote $q_j = \mathbb{P}(Z = j \,|\, X) / \mathbb{P}(Z \ne d(X) \,|\, X)$ and $h(x) = x \log x + (1 - x) \log(1 - x)$. Note that $h(x) \ge -\log 2$ and, by Jensen's inequality, $\sum_{j \ne d(X)} q_j \log q_j \ge -\log(M - 1)$. The following inequality holds:
$$\sum_{j=1}^M \mathbb{P}(Z = j \,|\, X) \log \mathbb{P}(Z = j \,|\, X) = \big( 1 - \mathbb{P}(Z \ne d(X) \,|\, X) \big) \log \big( 1 - \mathbb{P}(Z \ne d(X) \,|\, X) \big) + \sum_{j \ne d(X)} \mathbb{P}(Z = j \,|\, X) \log \mathbb{P}(Z = j \,|\, X) = h\big( \mathbb{P}(Z \ne d(X) \,|\, X) \big) + \mathbb{P}(Z \ne d(X) \,|\, X) \sum_{j \ne d(X)} q_j \log q_j \ge -\log 2 - \log(M - 1) \, \mathbb{P}(Z \ne d(X) \,|\, X). \tag{108}$$
Integration of (108) yields
$$\int \varphi(x) \sum_{j=1}^M \mathbb{P}(Z = j \,|\, X = x) \log \mathbb{P}(Z = j \,|\, X = x) \, dP_X(x) \ge -\frac{\log 2}{M} \sum_{j=1}^M \int \varphi \, p_j \, d\nu - \log(M - 1) \max_{1 \le j \le M} \int \varphi(x) \, \mathbf{1}(d(x) \ne j) \, dP_j. \tag{109}$$
Combining (107) and (109) proves (106). □
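For small $M$ and a small finite phase space, the infimum over all tests in Theorem 10 can be computed by exhaustive enumeration and compared against the lower bound (106). A sketch (the bound may be crude for a given random example, but it must hold):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(8)
n, M = 4, 3
P = rng.random((M, n)); P /= P.sum(axis=1, keepdims=True)
phi = rng.random(n) + 0.5

# Left-hand side: minimize the worst weighted error over all M^n tests d: X -> {1,...,M}.
lhs = min(max(sum(phi[x] * P[j, x] for x in range(n) if d[x] != j) for j in range(M))
          for d in product(range(M), repeat=n))

# Right-hand side: the bound (106).
kl = sum(np.sum(phi * P[j] * np.log(P[j] / P[k])) for j in range(M) for k in range(M))
wbar = np.mean([np.sum(phi * P[j]) for j in range(M)])
rhs = (np.log(M) * wbar - kl / M**2 - np.log(2) * wbar) / np.log(M - 1)
print(lhs >= rhs)
```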

11. Conclusions

The contribution of the current paper is summarized in Table 1 below. Objects 1–8 belong to the treasures of probability theory and statistics, and we present a number of examples and additional facts that are not easy to find in the literature. Objects 9–10, as well as the distances between distributions of different dimensions, appeared quite recently; they are not yet fully studied and are rarely used in applied research. Finally, objects 11–12 have been recently introduced by the author and his collaborators; this is a field of current and future research.

Funding

This research is supported by the grant 23-21-00052 of RSF and the HSE University Basic Research Program.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Suhov, Y.; Kelbert, M. Probability and Statistics by Example: Volume I. Basic Probability and Statistics, 2nd extended ed.; Cambridge University Press: Cambridge, UK, 2014; 457p.
  2. Rachev, S.T. Probability Metrics and the Stability of Stochastic Models; Wiley: New York, NY, USA, 1991.
  3. Zeifman, A.; Korolev, V.; Sipin, A. (Eds.) Stability Problems for Stochastic Models: Theory and Applications; MDPI: Basel, Switzerland, 2020.
  4. Endres, D.M.; Schindelin, J.E. A new metric for probability distributions. IEEE Trans. Inf. Theory 2003, 49, 1858–1860.
  5. Kelbert, M.; Suhov, Y. What scientific folklore knows about the distances between the most popular distributions. Izv. Sarat. Univ. (N.S.) Ser. Mat. Mekh. Inform. 2022, 22, 233–240.
  6. Dwivedi, A.; Wang, S.; Tajer, A. Discriminant Analysis under f-Divergence Measures. Entropy 2022, 24, 188.
  7. Devroye, L.; Mehrabian, A.; Reddad, T. The total variation distance between high-dimensional Gaussians. arXiv 2020, arXiv:1810.08693v5.
  8. Vallander, S.S. Calculation of the Wasserstein distance between probability distributions on the line. Theory Probab. Appl. 1973, 18, 784–786.
  9. Rachev, S.T. The Monge–Kantorovich mass transference problem and its stochastic applications. Theory Probab. Appl. 1985, 29, 647–676.
  10. Gelbrich, M. On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces. Math. Nachrichten 1990, 147, 185–203.
  11. Givens, C.R.; Shortt, R.M. A class of Wasserstein metrics for probability distributions. Mich. Math. J. 1984, 31, 231–240.
  12. Olkin, I.; Pukelsheim, F. The distance between two random vectors with given dispersion matrices. Linear Algebra Appl. 1982, 48, 257–263.
  13. Dowson, D.C.; Landau, B.V. The Fréchet distance between multivariate Normal distributions. J. Multivar. Anal. 1982, 12, 450–456.
  14. Cai, Y.; Lim, L.-H. Distances between probability distributions of different dimensions. IEEE Trans. Inf. Theory 2022, 68, 4020–4031.
  15. Skovgaard, L.T. A Riemannian geometry of the multivariate normal model. Scand. J. Stat. 1984, 11, 211–223.
  16. Stuhl, I.; Suhov, Y.; Yasaei Sekeh, S.; Kelbert, M. Basic inequalities for weighted entropies. Aequ. Math. 2016, 90, 817–848.
  17. Stuhl, I.; Kelbert, M.; Suhov, Y.; Yasaei Sekeh, S. Weighted Gaussian entropy and determinant inequalities. Aequ. Math. 2022, 96, 85–114.
  18. Kasianova, K.; Kelbert, M.; Mozgunov, P. Response-adaptive randomization for multi-arm clinical trials using context-dependent information measures. Comput. Stat. Data Anal. 2021, 158, 107187.
Figure 1. Exact TV distance and the upper bounds for (a) TV(Bin(20, 1/2), Bin(20, 1/2 + a)) and (b) TV(Pois(1), Pois(1 + a)). (a) Note that the upper bound becomes useless for $p_2 - p_1 \ge 0.07$; (b) blue and orange curves show the exact TV distance: the blue curve works for $1 \le \lambda_2/\lambda_1 \le 2$ and the orange curve for $2 \le \lambda_2/\lambda_1 \le 4$. Note that the linear upper bound (red curve) is not relevant and the square-root upper bound (green curve) becomes useless for $\lambda_2/\lambda_1 \ge 4$.
Table 1. The main metrics and divergences.

Number | Name                 | Reference | Comment
1      | Kullback–Leibler     | (2)       | Divergence but not a distance
2      | Total variation (TV) | (1)       | Bounded by Pinsker's inequality
3      | Kolmogorov–Smirnov   | p. 2      | Specific for 1D case
4      | Hellinger            | (16)      | Bounded by Le Cam's inequality
5      | Lévy–Prohorov        | (1)       | Metrization of the weak convergence
6      | Fréchet              | (8), (80) | Requires the joint distribution
7      | Wasserstein          | (69)      | Marginal distributions only
8      | χ²                   | p. 5      | Divergence but not a distance
9      | Jensen–Shannon       | (6)       | Constructed from Kullback–Leibler
10     | Geodesic             | (8)       | Specific for Gaussian case
11     | Weighted TV          | (97)      | Context sensitive
12     | Weighted Hellinger   | (98)      | Context sensitive

