1. Introduction
Comparing probability measures is a fundamental problem in statistics and machine learning. A classical way to compare probability measures is the Kullback–Leibler divergence. Let $M$ be a measurable space and let $\mu$ and $\nu$ be probability measures on $M$; then, the Kullback–Leibler divergence is defined as:
$$KL(\mu \,\|\, \nu) = \int_M \log \frac{d\mu}{d\nu} \, d\mu.$$
The Wasserstein distance [1], also known as the earth mover's distance [2], is another way of comparing probability measures. It is a metric on the space of probability measures derived from the theory of mass transportation between two probability measures. Informally, optimal transport theory considers an optimal transport plan between two probability measures under a cost function, and the Wasserstein distance is defined by the minimum total transport cost. A significant difference between the Wasserstein distance and the Kullback–Leibler divergence is that the former can reflect the metric structure of the underlying space, whereas the latter cannot. The $p$-Wasserstein distance can be written as:
$$\mathcal{W}_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{M \times M} d(x, y)^p \, d\pi(x, y) \right)^{1/p},$$
where $d$ is a distance function on a measurable metric space $M$ and $\Pi(\mu, \nu)$ denotes the set of probability measures on $M \times M$ whose marginal measures correspond to $\mu$ and $\nu$.
In recent years, the application of optimal transport and the Wasserstein distance has been studied in many fields such as statistics, machine learning, and image processing. For example, Reference [3] generated interpolations of various three-dimensional (3D) objects using the Wasserstein barycenter. In the field of word embedding in natural language processing, Reference [4] embedded each word as an elliptical distribution and applied the Wasserstein distance between the elliptical distributions. There are many studies on the applications of optimal transport to deep learning, including [5,6,7]. Moreover, Reference [8] analyzed the denoising autoencoder [9] with gradient flow in the Wasserstein space.
In applications of the Wasserstein distance, it is often considered in a discrete setting where $\mu$ and $\nu$ are discrete probability measures. Then, obtaining the Wasserstein distance between $\mu$ and $\nu$ can be formulated as a linear programming problem. In general, however, it is computationally intensive to solve such linear programs and obtain the optimal coupling of two probability measures. For such a situation, a novel numerical method, entropy regularization, was proposed by [10]:
$$\min_{P \in \Pi(\mu, \nu)} \sum_{i, j} C_{ij} P_{ij} - \lambda H(P), \qquad H(P) = -\sum_{i, j} P_{ij} \log P_{ij}.$$
This is a relaxed formulation of the original optimal transport for a cost function (the matrix $C$ above), in which the negative Shannon entropy $-H(P)$ is used as a regularizer. For a small $\lambda$, the regularized cost can approximate the $p$-th power of the Wasserstein distance between two discrete probability measures, and it can be computed efficiently by using Sinkhorn's algorithm [11].
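To make this concrete, the following is a minimal NumPy sketch of the discrete entropy-regularized problem and the Sinkhorn iterations; the function name, stopping rule, and grid are our own illustrative choices and are not taken from [10,11].

```python
import numpy as np

def sinkhorn(a, b, C, reg, n_iters=1000, tol=1e-9):
    """Entropy-regularized OT between discrete probability vectors a and b.

    C   : cost matrix, C[i, j] = cost of moving mass from i to j
    reg : regularization parameter (lambda above)
    Returns the optimal coupling P and the transport cost <C, P>.
    """
    K = np.exp(-C / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        u_prev = u
        v = b / (K.T @ u)             # scale columns to match marginal b
        u = a / (K @ v)               # scale rows to match marginal a
        if np.max(np.abs(u - u_prev)) < tol:
            break
    P = u[:, None] * K * v[None, :]   # optimal coupling
    return P, np.sum(P * C)

# Example: two discretized 1D normal distributions on a common grid
x = np.linspace(-5, 5, 200)
a = np.exp(-0.5 * x**2);       a /= a.sum()        # N(0, 1)
b = np.exp(-0.5 * (x - 1)**2); b /= b.sum()        # N(1, 1)
C = (x[:, None] - x[None, :]) ** 2                 # squared Euclidean cost
P, cost = sinkhorn(a, b, C, reg=0.1)
print(cost)   # close to the squared 2-Wasserstein distance, here about 1
```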
More recently, many studies have been published on improving the computational efficiency. According to [12], the most computationally efficient algorithm at this moment for solving the linear program for the Wasserstein distance is the Lee–Sidford linear solver [13], which runs in $\widetilde{O}(n^{2.5})$. Reference [14] proved that a complexity bound for the Sinkhorn algorithm is $\widetilde{O}(n^2 / \varepsilon^2)$, where $\varepsilon$ is the desired absolute performance guarantee. After [10] appeared, various algorithms have been proposed. For example, Reference [15] adopted stochastic optimization schemes for solving the optimal transport. The Greenkhorn algorithm [16] is a greedy variant of the Sinkhorn algorithm, and Reference [12] proposed its acceleration. Many other approaches, such as adapting a variety of standard optimization algorithms to approximate the optimal transport problem, can be found in [12,17,18,19]. Several specialized Newton-type algorithms [20,21] achieve the complexity bound $\widetilde{O}(n^2 / \varepsilon)$ [22,23], which are the best ones in terms of computational complexity at the present moment.
Moreover, entropy-regularized optimal transport has another advantage. Because of the differentiability of the entropy-regularized optimal transport cost and the simple structure of Sinkhorn's algorithm, we can easily compute the gradient of the entropy-regularized optimal transport cost and optimize the parameters of a parametrized probability distribution by using numerical differentiation or automatic differentiation. Then, we can define a differentiable loss function that can be applied to various supervised learning methods [24]. Entropy-regularized optimal transport can be used to approximate not only the Wasserstein distance, but also its optimal coupling as a mapping function. Reference [25] adopted the optimal coupling of the entropy-regularized optimal transport as a mapping function from one domain to another.
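As a toy illustration of such a differentiable loss, the sketch below fits the mean of a discretized one-dimensional normal model to a target distribution by gradient descent on the Sinkhorn transport cost, using numerical differentiation; the model, grid, and step sizes are our own choices and are only meant to show the mechanism.

```python
import numpy as np

def sinkhorn_cost(a, b, C, reg, n_iters=300):
    # Transport cost <C, P> of the entropy-regularized optimal coupling
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return np.sum(P * C)

# Fixed grid and target: a discretized N(2, 1)
x = np.linspace(-6, 6, 200)
C = (x[:, None] - x[None, :]) ** 2
target = np.exp(-0.5 * (x - 2.0) ** 2); target /= target.sum()

def model(theta):
    # Discretized N(theta, 1) on the same grid
    p = np.exp(-0.5 * (x - theta) ** 2)
    return p / p.sum()

theta, lr, h = -1.0, 0.3, 1e-4        # initial value, step size, finite-difference step
loss = lambda t: sinkhorn_cost(model(t), target, C, reg=0.1)
for step in range(50):
    grad = (loss(theta + h) - loss(theta - h)) / (2 * h)   # numerical gradient
    theta -= lr * grad
print(theta)   # approaches the target mean 2.0
```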
Despite the empirical success of entropy-regularized optimal transport, its theoretical aspects are less understood. Reference [26] studied the expected Wasserstein distance between a probability measure and its empirical version. Similarly, Reference [27] showed the consistency of the entropy-regularized optimal transport cost between two empirical distributions. Reference [28] showed that minimizing the entropy-regularized optimal transport cost between empirical distributions is equivalent to a type of maximum likelihood estimation. Reference [29] considered Wasserstein generative adversarial networks with an entropy regularization. Reference [30] constructed information geometry from the convexity of the entropy-regularized optimal transport cost.
Our intrinsic motivation for this study is to produce an analytical solution of the entropy-regularized optimal transport problem between continuous probability measures, so that we can gain insight into the effects of entropy regularization in a theoretical as well as an experimental way. In our study, we generalized the Wasserstein distance between two multivariate normal distributions by entropy regularization. We derived the explicit form of the entropy-regularized optimal transport cost and its optimal coupling, which can be used to analyze the effect of entropy regularization directly. In general, the nonregularized Wasserstein distance between two probability measures and its optimal coupling cannot be expressed in a closed form; however, Reference [31] proved an explicit formula for multivariate normal distributions. Theorem 1 is a generalized form of the result of [31]: we obtain an explicit form of the entropy-regularized optimal transport between two multivariate normal distributions. Furthermore, by adopting the Tsallis entropy [32] as the entropy regularization instead of the Shannon entropy, our theorem can be generalized to multivariate q-normal distributions.
Some readers may find it strange to study the entropy-regularized optimal transport for multivariate normal distributions, for which the exact (nonregularized) optimal transport has been obtained explicitly. However, we think it is worth studying from several perspectives:
- Normal distributions are the simplest and best-studied probability distributions, and thus, it is useful to examine the regularization theoretically in order to infer results for other distributions. In particular, we partly answer the questions "How much do entropy constraints affect the results?" and "What does it mean to constrain by the entropy?" for the simplest cases. Furthermore, as a first step in constructing a theory for more general probability distributions, in Section 4, we propose a generalization to multivariate q-normal distributions.
- Because normal distributions are the limit distributions in asymptotic theories based on the central limit theorem, studying normal distributions is necessary for the asymptotic theory of regularized Wasserstein distances and of estimators computed from them. Moreover, it was proposed to use the entropy-regularized Wasserstein distance to compute a lower bound of the generalization error for a variational autoencoder [29]. The study of the asymptotic behavior of such bounds is one of the expected applications of our results.
- Though this has not yet been proven theoretically, we suspect that entropy regularization is beneficial not only for computational reasons, such as enabling the use of the Sinkhorn algorithm, but also in the sense of efficiency in statistical inference. Such a phenomenon can be found in some existing studies, including [33]. This statistical efficiency is confirmed by some experiments in Section 6.
The remainder of this paper is organized as follows. First, we review some definitions of optimal transport and entropy regularization in Section 2. Then, in Section 3, we provide an explicit form of the entropy-regularized optimal transport cost and its optimal coupling between two multivariate normal distributions. We extend this result to q-normal distributions under Tsallis entropy regularization in Section 4. In Section 5, we obtain, in Theorem 3, the entropy-regularized Kantorovich estimator for probability measures on $\mathbb{R}^n$ with a finite second moment that are absolutely continuous with respect to the Lebesgue measure. We emphasize that Theorem 3 is not limited to the case of multivariate normal distributions, but can handle a wider range of probability measures. We analyze experimentally how entropy regularization affects the optimal result in the corresponding sections.
We note that after publishing the preprint version of this paper, we found closely related results [34,35] reported within half a year. Janati et al. [34] proved the same result as Theorem 1 based on solving the fixed-point equation behind Sinkhorn's algorithm. Their results include the unbalanced optimal transport between unbalanced multivariate normal distributions. They also studied the convexity and differentiability of the objective function of the entropy-regularized optimal transport. In [35], the same closed form as Theorem 1 was proven by ingeniously using the Schrödinger system. Although there are some overlaps, our paper has significant novelty in the following respects. Our proof is more direct than theirs and can be extended directly to the proof for the Tsallis entropy-regularized optimal transport between multivariate q-normal distributions provided in Section 4. Furthermore, Corollaries 1 and 2 are novel and important results for evaluating how much the entropy regularization affects the estimation results, if at all. We also obtain the entropy-regularized Kantorovich estimator in Theorem 3.
2. Preliminary
In this section, we review some definitions of optimal transport and entropy-regularized optimal transport. These definitions follow [1,36]. Throughout, we use a tuple $(M, \mathfrak{M})$ for a set $M$ together with a $\sigma$-algebra $\mathfrak{M}$ on $M$, and $\mathcal{P}(X)$ for the set of all probability measures on a measurable space $X$.
Definition 1 (Pushforward measure). Given measurable spaces $(M_1, \mathfrak{M}_1)$ and $(M_2, \mathfrak{M}_2)$, a measure $\mu$ on $M_1$, and a measurable mapping $\varphi: M_1 \to M_2$, the pushforward measure of $\mu$ by $\varphi$ is defined by $\varphi_{\#}\mu(A) = \mu(\varphi^{-1}(A))$ for every $A \in \mathfrak{M}_2$.
Definition 2 (Optimal transport map). Consider a measurable space $(M, \mathfrak{M})$, and let $c: M \times M \to \mathbb{R}_{\geq 0}$ denote a cost function. Given $\mu, \nu \in \mathcal{P}(M)$, we call $\varphi$ the optimal transport map if $\varphi$ realizes the infimum of:
$$\inf_{\varphi:\, \varphi_{\#}\mu = \nu} \int_M c(x, \varphi(x)) \, d\mu(x).$$
This problem was originally formalized by [37]. However, the optimal transport map does not always exist. Then, Kantorovich considered a relaxation of this problem in [38].
Definition 3 (Coupling). Given $\mu, \nu \in \mathcal{P}(M)$, a coupling of $\mu$ and $\nu$ is a probability measure $\pi$ on $M \times M$ that satisfies $\pi(A \times M) = \mu(A)$ and $\pi(M \times B) = \nu(B)$ for all $A, B \in \mathfrak{M}$.
Definition 4 (Kantorovich problem). The Kantorovich problem is defined as finding a coupling $\pi$ of $\mu$ and $\nu$ that realizes the infimum of:
$$\inf_{\pi} \int_{M \times M} c(x, y) \, d\pi(x, y).$$
Hereafter, let $\Pi(\mu, \nu)$ be the set of all couplings of $\mu$ and $\nu$. When we adopt a distance function as the cost function, we can define the Wasserstein distance.
Definition 5 (Wasserstein distance). Given $p \geq 1$, a measurable metric space $(M, d)$, and $\mu, \nu \in \mathcal{P}(M)$ with a finite $p$-th moment, the $p$-Wasserstein distance between $\mu$ and $\nu$ is defined as:
$$\mathcal{W}_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{M \times M} d(x, y)^p \, d\pi(x, y) \right)^{1/p}.$$
Now, we review the definition of entropy-regularized optimal transport on $\mathbb{R}^n$.
Definition 6 (Entropy-regularized optimal transport). Let $\mu, \nu \in \mathcal{P}(\mathbb{R}^n)$, $\lambda > 0$, and let $p(x, y)$ be the density function of a coupling $\pi$ of $\mu$ and $\nu$, whose reference measure is the Lebesgue measure. We define the entropy-regularized optimal transport cost as:
$$\mathcal{W}_{\lambda}(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^n \times \mathbb{R}^n} c(x, y) \, d\pi(x, y) - \lambda H(\pi),$$
where $H(\pi)$ denotes the Shannon entropy of a probability measure:
$$H(\pi) = -\int_{\mathbb{R}^n \times \mathbb{R}^n} p(x, y) \log p(x, y) \, dx \, dy.$$
There is another variation of entropy-regularized optimal transport defined by the relative entropy instead of the Shannon entropy:
$$\mathcal{W}_{\lambda}^{KL}(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^n \times \mathbb{R}^n} c(x, y) \, d\pi(x, y) + \lambda \, KL(\pi \,\|\, \mu \otimes \nu).$$
This is definable even when $\Pi(\mu, \nu)$ includes a coupling that is not absolutely continuous with respect to the Lebesgue measure. We note that when both $\mu$ and $\nu$ are absolutely continuous, the infimum is attained by the same $\pi$ for both versions, and the difference between the two costs depends only on $\mu$ and $\nu$. In the following part of the paper, we assume the absolute continuity of $\mu$, $\nu$, and $\pi$ with respect to the Lebesgue measure so that the entropy regularization is well defined.
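For discrete couplings, the relation between the two regularizers reduces to the identity $KL(\pi \,\|\, \mu \otimes \nu) = -H(\pi) + H(\mu) + H(\nu)$, so the two objectives differ by a term depending only on $\mu$ and $\nu$; the short numerical check below (with our own helper names) illustrates this.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Random discrete coupling pi with marginals mu and nu
pi = rng.random((4, 5)); pi /= pi.sum()
mu, nu = pi.sum(axis=1), pi.sum(axis=0)

kl_to_product = np.sum(pi * np.log(pi / np.outer(mu, nu)))

# KL(pi || mu x nu) = -H(pi) + H(mu) + H(nu)
print(kl_to_product, -entropy(pi) + entropy(mu) + entropy(nu))   # equal
```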
3. Entropy-Regularized Optimal Transport between Multivariate Normal Distributions
In this section, we provide a rigorous solution of entropy-regularized optimal transport between two multivariate normal distributions. Throughout this section, we adopt the squared Euclidean distance $c(x, y) = \|x - y\|^2$ as the cost function. To prove our theorem, we start by expressing the expected transport cost $\mathbb{E}\|X - Y\|^2$ using mean vectors and covariance matrices. The following lemma is a known result; see, for example, [31].
Lemma 1. Let $X \sim P$ and $Y \sim Q$ be two random variables on $\mathbb{R}^n$ with means $\mu_1, \mu_2$ and covariance matrices $\Sigma_1, \Sigma_2$, respectively. If $\pi$ is a coupling of P and Q with cross-covariance matrix $K = \mathbb{E}[(X - \mu_1)(Y - \mu_2)^{\top}]$, we have:
$$\mathbb{E}_{\pi} \|X - Y\|^2 = \|\mu_1 - \mu_2\|^2 + \mathrm{tr}(\Sigma_1) + \mathrm{tr}(\Sigma_2) - 2\, \mathrm{tr}(K).$$
Proof. Without loss of generality, we can assume X and Y are centralized, because:
$$\mathbb{E}_{\pi} \|X - Y\|^2 = \mathbb{E}_{\pi} \|(X - \mu_1) - (Y - \mu_2)\|^2 + \|\mu_1 - \mu_2\|^2.$$
By adding $\|\mu_1 - \mu_2\|^2$ to the centralized case, we obtain (12). □
Lemma 1 shows that the transport cost $\mathbb{E}_{\pi}\|X - Y\|^2$ can be parameterized by the covariance matrices of the coupling. Because $\Sigma_1$ and $\Sigma_2$ are fixed, the infinite-dimensional optimization over the coupling $\pi$ reduces to a finite-dimensional optimization over the cross-covariance matrix $K$.
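As a sanity check of Lemma 1, the following Monte Carlo sketch draws from a jointly normal coupling with a chosen cross-covariance K and compares the sample mean of $\|X - Y\|^2$ with the closed-form expression; the specific matrices are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
m1, m2 = np.array([0.0, 0.0]), np.array([2.0, -1.0])
S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[1.0, -0.3], [-0.3, 1.5]])
K = np.array([[0.4, 0.1], [0.2, 0.3]])     # cross-covariance of the coupling

# Joint covariance of the coupling (X, Y); K must keep it positive definite
joint_cov = np.block([[S1, K], [K.T, S2]])
assert np.all(np.linalg.eigvalsh(joint_cov) > 0)

Z = rng.multivariate_normal(np.concatenate([m1, m2]), joint_cov, size=200_000)
X, Y = Z[:, :2], Z[:, 2:]

empirical = np.mean(np.sum((X - Y) ** 2, axis=1))
formula = (np.sum((m1 - m2) ** 2) + np.trace(S1) + np.trace(S2)
           - 2 * np.trace(K))
print(empirical, formula)   # the two values agree up to Monte Carlo error
```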
We prepare the following lemma to prove Theorem 1.
Lemma 2. Under a fixed mean and covariance matrix, the probability measure that maximizes the entropy is a multivariate normal distribution.
Lemma 2 is a particular case of the principle of maximum entropy [39], and the proof can be found in [40], Theorem 3.1.
Theorem 1. Let be two multivariate normal distributions. The optimal coupling π of P and Q of the entropy-regularized optimal transport: Furthermore, can be written as: and the relative entropy version can be written as: We note that we use the regularization parameter in for the sake of simplicity.
Proof. Although the first half of the proof can be derived directly from Lemma 2, we provide a proof of this theorem by Lagrange calculus, which will be used later for the extension to
q-normal distributions. Now, we define an optimization problem that is equivalent to the entropy-regularized optimal transport as follows:
Here,
and
are probability density functions of
P and
Q, respectively. Let
,
be Lagrange multipliers that correspond to the above two constraints. The Lagrangian function of (
20) is defined as:
Taking the functional derivative of (
22) with respect to
, we obtain:
By the fundamental lemma of the calculus of variations, we have:
Here,
are determined from the constraints (
21). We can assume that
is a
-variate normal distribution, because for a fixed covariance matrix
,
takes the infimum when the coupling
is a multivariate normal distribution by Lemma 2. Therefore, we can express
by using
and a covariance matrix
as:
According to block matrix inversion formula [
41],
holds, where
is positive definite. Then, comparing the term
between (24) and (28), we obtain
and:
Here,
holds, because
A is a symmetric matrix, and thus, we obtain:
Completing the square of the above equation, we obtain:
Let
Q be an orthogonal matrix; then, (
31) can be solved as:
We rearrange the above equation as follows:
Because the left terms and
are all symmetric positive definite, we can conclude that
Q is the identity matrix by the uniqueness of the polar decomposition. Finally, we obtain:
We obtain (
18) by the direct calculation of
using Lemma 1 with this
. □
The following corollary helps us to understand the properties of .
Corollary 1. Let be the eigenvalues of ; then, monotonically decreases with λ for any .
Proof. Because has the same eigenvalues as , if we let be the eigenvalues of , , which is a monotonically decreasing function of the regularization parameter . □
By the proof, for large , we can prove by diagonalization and . Thus, , and each element of converges to zero as .
We show how entropy regularization behaves in two simple experiments. In Figure 1, we calculate the entropy-regularized optimal transport cost in both the original version and the relative entropy version. We separate the entropy-regularized optimal transport cost into the transport cost term and the regularization term and display both of them.
It is reasonable that, as $\lambda \to 0$, the optimal coupling converges to the original optimal coupling of the nonregularized optimal transport and that, as $\lambda \to \infty$, the cross-covariance of the optimal coupling converges to 0; this is a special case of Corollary 1. The larger $\lambda$ becomes, the less correlated the optimal coupling is. We visualize this behavior by computing the optimal couplings of two one-dimensional normal distributions in Figure 2.
The left panel shows the original version. The transport cost is always positive, and the entropy regularization term can take both signs in general; the sign of the total cost therefore depends on their balance. We note that the transport cost as a function of $\lambda$ is bounded, whereas the entropy regularization term is not. The boundedness of the optimal cost is deduced from (1) and Corollary 1, and the unboundedness of the entropy regularization term is due to the regularization parameter multiplied by the entropy. The right panel shows the relative entropy version. It always takes a non-negative value. Furthermore, because the total cost is bounded by the value for the independent joint distribution (which is always a feasible coupling), both the transport cost and the relative entropy regularization term are also bounded. Nevertheless, the larger the regularization parameter $\lambda$, the greater the influence of entropy regularization on the total cost.
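The decorrelation effect described above can also be reproduced numerically without the closed form: discretize two one-dimensional normal distributions, compute the regularized coupling by Sinkhorn iterations for several values of the regularization parameter, and measure the correlation of the coupling. The following sketch (with our own discretization choices) does this.

```python
import numpy as np

def sinkhorn_plan(a, b, C, reg, n_iters=2000):
    # Optimal coupling of the entropy-regularized problem between a and b
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

x = np.linspace(-5, 5, 200)
a = np.exp(-0.5 * x**2);            a /= a.sum()   # N(0, 1)
b = np.exp(-0.5 * (x - 1.0) ** 2);  b /= b.sum()   # N(1, 1)
C = (x[:, None] - x[None, :]) ** 2

for reg in [0.05, 0.5, 5.0]:
    P = sinkhorn_plan(a, b, C, reg)
    mx, my = P.sum(axis=1) @ x, P.sum(axis=0) @ x
    cov = np.sum(P * np.outer(x - mx, x - my))
    sx = np.sqrt(np.sum(P.sum(axis=1) * (x - mx) ** 2))
    sy = np.sqrt(np.sum(P.sum(axis=0) * (x - my) ** 2))
    print(reg, cov / (sx * sy))   # correlation of the coupling decreases as reg grows
```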
It is known that a specific Riemannian metric can be defined on the space of multivariate normal distributions, which induces the Wasserstein distance [42]. To understand the effect of entropy regularization, we illustrate how entropy regularization deforms this geometric structure in Figure 3. Here, we generate 100 two-variate normal distributions with varying covariance matrices. To visualize the geometric structure of these two-variate normal distributions, we compute the relative entropy-regularized optimal transport cost between each pair of two-variate normal distributions. Then, we apply multidimensional scaling [43] to embed them into a plane (see Figure 3). We can see entropy regularization deforming the geometric structure of the space of multivariate normal distributions. The deformation for distributions close to the isotropic normal distribution is more sensitive to the change in $\lambda$.
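A simplified version of this visualization pipeline can be sketched as follows: compute pairwise relative-entropy-regularized transport costs by Sinkhorn iterations (here between discretized one-dimensional normal distributions rather than the two-variate family used in Figure 3) and embed the resulting dissimilarity matrix by multidimensional scaling; all modeling choices below are ours.

```python
import numpy as np
from sklearn.manifold import MDS

def sinkhorn_plan(a, b, C, reg, n_iters=1000):
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def reg_ot_cost(a, b, C, reg):
    # Transport cost plus reg * KL(P || a x b): the relative entropy version
    P = sinkhorn_plan(a, b, C, reg)
    mask = P > 0
    kl = np.sum(P[mask] * np.log(P[mask] / np.outer(a, b)[mask]))
    return np.sum(P * C) + reg * kl

x = np.linspace(-8, 8, 150)
C = (x[:, None] - x[None, :]) ** 2
# A small family of discretized normal distributions N(m, s^2)
params = [(m, s) for m in (-2.0, 0.0, 2.0) for s in (0.5, 1.0, 2.0)]
dists = []
for m, s in params:
    p = np.exp(-0.5 * ((x - m) / s) ** 2)
    dists.append(p / p.sum())

n, reg = len(dists), 1.0
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = reg_ot_cost(dists[i], dists[j], C, reg)

embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(D)
print(embedding)   # 2D coordinates reflecting the regularized OT geometry
```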
The following corollary states that if we allow orthogonal transformations of two multivariate normal distributions with fixed covariance matrices, then the minimum and maximum of the entropy-regularized optimal transport cost are attained when the two covariance matrices are diagonalizable by the same orthogonal matrix or, equivalently, when the ellipsoidal contours of the two density functions are aligned along the same orthogonal axes.
Corollary 2. With the same settings as in Theorem 1, fix , , , and all eigenvalues of . When is diagonalized as , where is the diagonal matrix of the eigenvalues of in descending order and Γ is an orthogonal matrix,
- (i)
is minimized by and
- (ii)
is maximized by ,
where and are the diagonal matrices of the eigenvalues of in descending and ascending order, respectively. Therefore, neither the minimizer nor the maximizer depends on the choice of λ.
Proof. Because
,
,
, and all eigenvalues of
are fixed,
where
are the eigenvalues of
and:
Note that
is a concave function, because:
Let
and
be the eigenvalues of
and
, respectively. By Exercise 6.5.3 of [
44] or Theorem 6.13 and Corollary 6.14 of [
45],
Here, for
such that
and
,
means:
and
is said to be majorized by
. Because
is concave,
where
represents weak supermajorization, i.e.,
means:
(see Theorem 5.A.1 of [
46], for example). Therefore,
As in Case (i) (or (ii)), the eigenvalues of correspond to the eigenvalues of (or , respectively), the corollary follows. □
Note that a special case of Corollary 2 for the ordinary Wasserstein metric ($\lambda = 0$) has been studied in the context of fidelity and the Bures distance in quantum information theory; see Lemma 3 of [47]. Their proof is not directly applicable to our generalized result; thus, we used another approach to prove it.
4. Extension to Tsallis Entropy Regularization
In this section, we consider a generalization of entropy-regularized optimal transport. We now focus on the Tsallis entropy [32], which is a generalization of the Shannon entropy and appears in nonequilibrium statistical mechanics. We show that the optimal coupling of the Tsallis entropy-regularized optimal transport between two q-normal distributions is also a q-normal distribution. We start by recalling the definitions of the q-exponential function and the q-logarithmic function based on [32].
Definition 7. Let q be a real parameter with $q \neq 1$, and let $x > 0$. The q-logarithmic function is defined as:
$$\ln_q(x) = \frac{x^{1-q} - 1}{1 - q},$$
and the q-exponential function is defined as:
$$\exp_q(x) = \left[ 1 + (1 - q)x \right]_{+}^{\frac{1}{1-q}}.$$
Definition 8. Let $q < 1$ or $1 < q < \frac{n+2}{n}$; an n-variate q-normal distribution is defined by two parameters, $\mu \in \mathbb{R}^n$ and a positive definite matrix Σ, and its density function is:
$$p(x) = C_q \exp_q\!\left( -(x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right),$$
where $C_q$ is a normalizing constant. μ and Σ are called the location vector and scale matrix, respectively.
In the following, we write the multivariate q-normal distribution as $N_q(\mu, \Sigma)$. We note that the properties of the q-normal distribution change in accordance with q. The q-normal distribution has an unbounded support for $q > 1$ and a bounded support for $q < 1$. The second moment exists for $q < \frac{n+4}{n+2}$, and the covariance becomes a constant multiple of Σ. We remark that each n-variate q-normal distribution with $q > 1$ is equivalent, up to a rescaling of the scale matrix, to an n-variate t-distribution with $\nu = \frac{2}{q-1} - n$ degrees of freedom, and the n-variate normal distribution is recovered in the limit $q \to 1$.
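The q-logarithmic and q-exponential functions are straightforward to implement; the sketch below uses the standard Tsallis convention, which we assume matches Definition 7, and checks that the two functions are inverse to each other on positive arguments and recover the ordinary logarithm near q = 1.

```python
import numpy as np

def log_q(x, q):
    """q-logarithm: reduces to log(x) as q -> 1."""
    x = np.asarray(x, dtype=float)
    if q == 1.0:
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-exponential: reduces to exp(x) as q -> 1; inverse of log_q on its range."""
    x = np.asarray(x, dtype=float)
    if q == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

x = np.linspace(0.1, 3.0, 5)
for q in [0.5, 0.99, 1.5]:
    assert np.allclose(exp_q(log_q(x, q), q), x)          # inverse on x > 0
print(np.allclose(log_q(x, 0.999), np.log(x), atol=1e-2))  # close to log near q = 1
```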
Definition 9. Let p be a probability density function. The Tsallis entropy is defined as:
$$T_q(p) = \frac{1}{q - 1}\left( 1 - \int p(x)^q \, dx \right).$$
Then, the Tsallis entropy-regularized optimal transport is defined as:
$$\inf_{\pi \in \Pi(\mu, \nu)} \int \|x - y\|^2 \, d\pi(x, y) - \lambda T_q(\pi).$$
The following lemma is a generalization of the maximum entropy principle for the Shannon entropy shown in Section 2 of [48].
Lemma 3. Let P be a centered n-dimensional probability measure with a fixed covariance matrix Σ; then, the maximizer of the Renyi α-entropy:
$$H_{\alpha}(p) = \frac{1}{1 - \alpha} \log \int p(x)^{\alpha} \, dx$$
under the covariance constraint is a multivariate q-normal distribution. We note that the maximizers of the Renyi α-entropy and the Tsallis entropy with the same index coincide; thus, the above lemma also holds for the Tsallis entropy. This is mentioned, for example, in Section 9 of [49].
To prove Theorem 2, we use the following property of multivariate t-distributions, which is summarized in Chapter 1 of [50].
Lemma 4. Let X be a random vector following an n-variate t-distribution with degree of freedom ν. Consider a partition of X, the mean vector μ, and the scale matrix Σ such as:
$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$
Then, $X_1$ follows a p-variate t-distribution with degree of freedom ν, mean vector $\mu_1$, and scale matrix $\Sigma_{11}$, where p is the dimension of $X_1$. Recalling the correspondence between the parameter q of the multivariate q-normal distribution and the degree of freedom ν of the multivariate t-distribution, we can obtain the following corollary.
Corollary 3. Let X be a random vector following an n-variate q-normal distribution for . Consider a partition of the mean vector μ and scale matrix Σ in the same way as in (54). Then, follows a p-variate -normal distribution with mean vector and scale matrix , where p is the dimension of . Theorem 2. Let be n-variate q-normal distributions for and ; consider the Tsallis entropy-regularized optimal transport: Then, there exists a unique such that the optimal coupling π of the entropy-regularized optimal transport is expressed as:where: Proof. The proof proceeds in a similar way as in Theorem 1. Let
and
be the Lagrangian multipliers. Then, the Lagrangian function
of (
52) is defined as:
and the extremum of the Tsallis entropy-regularized optimal transport is obtained by the functional derivative with respect to
,
Here,
and
are quadratic polynomials by Lemma 3. To separate the normalizing constant, we introduce a constant
, and
can be written as:
with quadratic functions
and
.
Let
. Then, by the same argument as in the proof of Theorem 1 and using Corollary 3, we obtain the scale matrix of
as:
where:
Let
and
;
can be written as:
The constant
c is determined by:
We will show that the above equation has a unique solution. Let
be the eigenvalues of
;
can be expressed as
. We consider:
Because
,
is a monotonic decreasing function, and
,
, (
64) has a unique positive solution, and
is determined uniquely. □
5. Entropy-Regularized Kantorovich Estimator
Many estimators are defined by minimizing a divergence or distance $D(P, Q)$ between probability measures, that is, $\hat{Q} = \operatorname{arg\,min}_{Q} D(P, Q)$ for a fixed P. When D is the Kullback–Leibler divergence, the estimator corresponds to the maximum likelihood estimator. When D is the Wasserstein distance, the resulting estimator is called the minimum Kantorovich estimator, according to [36]. In this section, we consider a probability measure Q that minimizes the entropy-regularized optimal transport cost $\mathcal{W}_{\lambda}(P, Q)$ for a fixed P over $\mathcal{P}_2(\mathbb{R}^n)$, the set of all probability measures on $\mathbb{R}^n$ with a finite second moment that are absolutely continuous with respect to the Lebesgue measure. In other words, we define the entropy-regularized Kantorovich estimator as:
$$\hat{Q} = \operatorname*{arg\,min}_{Q \in \mathcal{P}_2(\mathbb{R}^n)} \mathcal{W}_{\lambda}(P, Q).$$
The entropy-regularized Kantorovich estimator for discrete probability measures was studied in [33], Theorem 2. We obtain the entropy-regularized Kantorovich estimator for continuous probability measures in the following theorem:
Theorem 3. For a fixed $P \in \mathcal{P}_2(\mathbb{R}^n)$, the estimator $\hat{Q} = \operatorname*{arg\,min}_{Q \in \mathcal{P}_2(\mathbb{R}^n)} \mathcal{W}_{\lambda}(P, Q)$ exists, and its density function can be written as:
$$\hat{q}(x) = \left( p \star g_{\lambda} \right)(x),$$
where p is a density function of P, $g_{\lambda}$ is the density of the isotropic normal distribution $N(0, \tfrac{\lambda}{2} I_n)$, and ⋆ denotes the convolution operator. We use the dual problem of the entropy-regularized optimal transport to prove Theorem 3 (for details, see Proposition 2.1 of [15] or Section 3 of [51]).
Lemma 5. The dual problem of entropy-regularized optimal transport can be written as: Moreover, holds.
Now, we prove Theorem 3.
Proof. Let
be the minimizer of
. Applying Lemma 5, there exist
and
such that:
Now,
is the minimum value of
, such that the variation
is always zero. Then,
holds, and the optimal coupling of
can be written as:
Moreover, we can obtain a closed-form of
as follows from the equation
:
Then, by calculating the marginal distribution of
with respect to
x, we can obtain:
Therefore, we conclude that a probability measure
Q that minimizes
is expressed as (
75). □
It should be noted that when P in Theorem 3 is a multivariate normal distribution, $\hat{Q}$ and P are simultaneously diagonalizable as a direct consequence of the theorem. This is consistent with the result of Corollary 2(i) for minimization when all eigenvalues are fixed.
We can see that the entropy-regularized Kantorovich estimator is the measure P convolved with an isotropic multivariate normal distribution whose variance is scaled by the regularization parameter λ. This is similar to the idea of prior distributions in the context of Bayesian inference. Applying Theorem 3, the entropy-regularized Kantorovich estimator for a multivariate normal distribution $N(\mu, \Sigma)$ is $N(\mu, \Sigma + \tfrac{\lambda}{2} I_n)$.
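Because convolution with an isotropic normal distribution is the same as adding independent isotropic Gaussian noise, the statement can be checked on samples; in the sketch below, the noise variance λ/2 is an assumption tied to our parametrization of the regularizer in Definition 6.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 0.8                                   # regularization parameter (lambda)
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.5, 0.4], [0.4, 0.7]])

# Convolving P = N(mu, Sigma) with N(0, (lam/2) I) is the same as adding
# independent isotropic Gaussian noise to samples from P
# (the lam/2 scale is an assumption; see the lead-in above).
X = rng.multivariate_normal(mu, Sigma, size=300_000)
noise = rng.normal(scale=np.sqrt(lam / 2), size=X.shape)
Y = X + noise

print(np.cov(Y.T))                 # approximately Sigma + (lam/2) * I
print(Sigma + (lam / 2) * np.eye(2))
```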