Article

On Representations of Divergence Measures and Related Quantities in Exponential Families

Institute of Statistics, RWTH Aachen University, 52056 Aachen, Germany
* Author to whom correspondence should be addressed.
Entropy 2021, 23(6), 726; https://doi.org/10.3390/e23060726
Submission received: 12 May 2021 / Revised: 3 June 2021 / Accepted: 5 June 2021 / Published: 8 June 2021
(This article belongs to the Special Issue Measures of Information)

Abstract

Within exponential families, which may consist of multi-parameter and multivariate distributions, a variety of divergence measures, such as the Kullback–Leibler divergence, the Cressie–Read divergence, the Rényi divergence, and the Hellinger metric, can be explicitly expressed in terms of the respective cumulant function and mean value function. Moreover, the same applies to related entropy and affinity measures. We compile representations scattered in the literature and present a unified approach to the derivation in exponential families. As a statistical application, we highlight their use in the construction of confidence regions in a multi-sample setup.

1. Introduction

There is a broad literature on divergence and distance measures for probability distributions, e.g., on the Kullback–Leibler divergence, the Cressie–Read divergence, the Rényi divergence, and Phi divergences as a general family, as well as on associated measures of entropy and affinity. For definitions and details, we refer to [1]. These measures have been extensively used in statistical inference. Excellent monographs on this topic were provided by Liese and Vajda [2], Vajda [3], Pardo [1], and Liese and Miescke [4].
Within an exponential family as defined in Section 2, which may consist of multi-parameter and multivariate distributions, several divergence measures and related quantities are seen to have nice explicit representations in terms of the respective cumulant function and mean value function. These representations are contained in different sources. Our focus is on a unifying presentation of main quantities, while not aiming at an exhaustive account. As an application, we derive confidence regions for the parameters of exponential distributions based on different divergences in a simple multi-sample setup.
For the use of the aforementioned measures of divergence, entropy, and affinity, we refer to the textbooks [1,2,3,4] and, by way of example, to [5,6,7,8,9,10] for statistical applications, including the construction of test procedures as well as methods based on dual representations of divergences, and to [11] for a classification problem.

2. Exponential Families

Let Θ be a parameter set, μ be a σ-finite measure on the measurable space (X, B), and P = { P_ϑ : ϑ ∈ Θ } be an exponential family (EF) of distributions on (X, B) with μ-density
\[
f_\vartheta(x) \,=\, C(\vartheta)\,\exp\!\Big(\sum_{j=1}^{k} Z_j(\vartheta)\, T_j(x)\Big)\, h(x), \qquad x \in X, \tag{1}
\]
of P_ϑ for ϑ ∈ Θ, where C, Z_1, …, Z_k : Θ → ℝ are real-valued functions on Θ and h, T_1, …, T_k : (X, B) → (ℝ¹, B¹) are real-valued Borel-measurable functions with h ≥ 0. Usually, μ is either the counting measure on the power set of X (for a family of discrete distributions) or the Lebesgue measure on the Borel sets of X (in the continuous case). Without loss of generality and for a simple notation, we assume that h > 0 (the set { x ∈ X : h(x) = 0 } is a null set for all P ∈ P). Let ν denote the σ-finite measure with μ-density h.
We assume that representation (1) is minimal in the sense that the number k of summands in the exponent cannot be reduced. This property is equivalent to Z_1, …, Z_k being affinely independent mappings and T_1, …, T_k being ν-affinely independent mappings; see, e.g., [12] (Cor. 8.1). Here, ν-affine independence means affine independence on the complement of every null set of ν.
To obtain simple formulas for divergence measures in the following section, it is convenient to use the natural parameter space
\[
\Xi \,=\, \Big\{\, \zeta \in \mathbb{R}^k \,:\, \int e^{\zeta^t T}\, h \,\mathrm{d}\mu \,<\, \infty \,\Big\}
\]
and the (minimal) canonical representation { P_ζ : ζ ∈ Z(Θ) } of P with μ-density
\[
f_\zeta(x) \,=\, C(\zeta)\, e^{\zeta^t T(x)}\, h(x), \qquad x \in X, \tag{2}
\]
of P_ζ and normalizing constant C(ζ) for ζ = (ζ_1, …, ζ_k)ᵗ ∈ Z(Θ) ⊂ Ξ, where Z = (Z_1, …, Z_k)ᵗ denotes the (column) vector of the mappings Z_1, …, Z_k and T = (T_1, …, T_k)ᵗ denotes the (column) vector of the statistics T_1, …, T_k. For simplicity, we assume that P is regular, i.e., we have that Z(Θ) = Ξ (P is full) and that Ξ is open; see [13]. In particular, this guarantees that T is minimal sufficient and complete for P; see, e.g., [14] (pp. 25–27).
The cumulant function
\[
\kappa(\zeta) \,=\, -\ln\big(C(\zeta)\big), \qquad \zeta \in \Xi,
\]
associated with P is strictly convex and infinitely often differentiable on the convex set Ξ; see [13] (Theorem 1.13 and Theorem 2.2). It is well-known that the Hessian matrix of κ at ζ coincides with the covariance matrix of T under P_ζ and that it is also equal to the Fisher information matrix I(ζ) at ζ. Moreover, by introducing the mean value function
\[
\pi(\zeta) \,=\, E_\zeta[T], \qquad \zeta \in \Xi, \tag{3}
\]
we have the useful relation
\[
\pi \,=\, \nabla\kappa, \tag{4}
\]
where ∇κ denotes the gradient of κ; see [13] (Cor. 2.3). π is a bijective mapping from Ξ onto the interior of the convex support of ν^T, i.e., of the closed convex hull of the support of ν^T; see [13] (p. 2 and Theorem 3.6).
Finally, note that representation (2) can be rewritten as
\[
f_\zeta(x) \,=\, e^{\,\zeta^t T(x) - \kappa(\zeta)}\, h(x), \qquad x \in X, \tag{5}
\]
for ζ ∈ Ξ.
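As a simple illustration (a standard one-parameter example added here for concreteness; it reappears in the application of Section 5), consider the exponential distributions with density α e^{−αx}, x > 0, for α > 0, and let μ be the Lebesgue measure on (0, ∞). With natural parameter ζ = −α ∈ Ξ = (−∞, 0), statistic T(x) = x, and h ≡ 1, representation (5) reads
\[
f_\zeta(x) \,=\, e^{\,\zeta x - \kappa(\zeta)}, \qquad x > 0, \qquad \text{with} \quad \kappa(\zeta) \,=\, -\ln(-\zeta) \quad \text{and} \quad \pi(\zeta) \,=\, \kappa'(\zeta) \,=\, -\frac{1}{\zeta} \,=\, \frac{1}{\alpha},
\]
in accordance with Formulas (3)–(5).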

3. Divergence Measures

Divergence measures may be applied, for instance, to quantify the "disparity" of a distribution from some reference distribution or to measure, in a certain sense, the "distance" between two distributions within some family. If the distributions in the family are dominated by a σ-finite measure, various divergence measures have been introduced by means of the corresponding densities. In parametric statistical inference, they serve to construct statistical tests or confidence regions for underlying parameters; see, e.g., [1].
Definition 1.
Let F be a set of distributions on (X, B). A mapping D : F × F → ℝ is called a divergence (or divergence measure) if:
(i)
D(P, Q) ≥ 0 for all P, Q ∈ F and D(P, Q) = 0 ⟺ P = Q (positive definiteness).
If additionally
(ii)
D(P, Q) = D(Q, P) for all P, Q ∈ F (symmetry) is valid, D is called a distance (or distance measure or semi-metric). If, moreover, D satisfies
(iii)
D(P_1, P_2) ≤ D(P_1, Q) + D(Q, P_2) for all P_1, P_2, Q ∈ F (triangle inequality), D is said to be a metric.
Some important examples are the Kullback–Leibler divergence (KL-divergence):
\[
D_{KL}(P_1, P_2) \,=\, \int f_1 \ln\!\Big(\frac{f_1}{f_2}\Big)\, \mathrm{d}\mu,
\]
the Jeffrey distance:
\[
D_J(P_1, P_2) \,=\, D_{KL}(P_1, P_2) + D_{KL}(P_2, P_1)
\]
as a symmetrized version, the Rényi divergence:
\[
D_R^q(P_1, P_2) \,=\, \frac{1}{q(q-1)} \ln\!\Big( \int f_1^{\,q} f_2^{\,1-q}\, \mathrm{d}\mu \Big), \qquad q \in \mathbb{R} \setminus \{0, 1\}, \tag{6}
\]
along with the related Bhattacharyya distance D_B(P_1, P_2) = D_R^{1/2}(P_1, P_2)/4, the Cressie–Read divergence (CR-divergence):
\[
D_{CR}^q(P_1, P_2) \,=\, \frac{1}{q(q-1)} \int f_1 \Big[ \Big(\frac{f_1}{f_2}\Big)^{q-1} - 1 \Big]\, \mathrm{d}\mu, \qquad q \in \mathbb{R} \setminus \{0, 1\}, \tag{7}
\]
which is the same as the Chernoff α-divergence up to a parameter transformation, the related Matusita distance D_M(P_1, P_2) = D_{CR}^{1/2}(P_1, P_2)/2, and the Hellinger metric:
\[
D_H(P_1, P_2) \,=\, \Big( \int \big(\sqrt{f_1} - \sqrt{f_2}\,\big)^2\, \mathrm{d}\mu \Big)^{1/2} \tag{8}
\]
for distributions P_1, P_2 ∈ F with μ-densities f_1, f_2, provided that the integrals are well-defined and finite.
D_KL, D_R^q, and D_CR^q for q ∈ ℝ \ {0, 1} are divergences, and D_J, D_R^{1/2}, D_B, D_CR^{1/2}, and D_M (= D_H²), since they moreover satisfy symmetry, are distances on F × F. D_H is known to be a metric on F × F.
In parametric models, it is convenient to use the parameters as arguments and briefly write, e.g.,
\[
D_{KL}(\vartheta_1, \vartheta_2) \ \text{for} \ D_{KL}(P_{\vartheta_1}, P_{\vartheta_2}), \qquad \vartheta_1, \vartheta_2 \in \Theta,
\]
if the parameter ϑ ∈ Θ is identifiable, i.e., if the mapping ϑ ↦ P_ϑ is one-to-one on Θ. This property is met for the EF P in Section 2 with minimal canonical representation (5); see, e.g., [13] (Theorem 1.13(iv)).
It is known from different sources in the literature that the EF structure admits simple formulas for the above divergence measures in terms of the corresponding cumulant function and/or mean value function. For the KL-divergence, we refer to [15] (Cor. 3.2) and [13] (pp. 174–178), and for the Jeffrey distance also to [16].
Theorem 1.
Let P be as in Section 2 with minimal canonical representation (5). Then, for ζ, η ∈ Ξ, we have
\[
D_{KL}(\zeta, \eta) \,=\, \kappa(\eta) - \kappa(\zeta) + (\zeta - \eta)^t\, \pi(\zeta)
\quad \text{and} \quad
D_J(\zeta, \eta) \,=\, (\zeta - \eta)^t \big( \pi(\zeta) - \pi(\eta) \big). \tag{9}
\]
Proof. 
By using Formulas (3) and (5), we obtain for ζ, η ∈ Ξ that
\[
D_{KL}(\zeta, \eta) \,=\, \int \big( \ln f_\zeta - \ln f_\eta \big)\, f_\zeta \,\mathrm{d}\mu
\,=\, \int \big[ (\zeta - \eta)^t T - \kappa(\zeta) + \kappa(\eta) \big]\, f_\zeta \,\mathrm{d}\mu
\,=\, \kappa(\eta) - \kappa(\zeta) + (\zeta - \eta)^t\, \pi(\zeta).
\]
From this, the representation of D_J is obvious. □
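To make Theorem 1 concrete, the following sketch (an illustration added here, not taken from the article) compares the cumulant-function representation of D_KL with a direct numerical evaluation of the defining integral for two exponential distributions; kappa and pi encode κ(ζ) = −ln(−ζ) and π(ζ) = −1/ζ from the example in Section 2, and all function names are illustrative.

```python
import numpy as np
from scipy.integrate import quad

# Numerical check of Theorem 1 (sketch) for the one-parameter EF of exponential
# distributions: natural parameter zeta = -alpha, kappa(zeta) = -log(-zeta), pi(zeta) = -1/zeta.

kappa = lambda z: -np.log(-z)
pi = lambda z: -1.0 / z

def kl_via_theorem1(alpha, beta):
    zeta, eta = -alpha, -beta
    return kappa(eta) - kappa(zeta) + (zeta - eta) * pi(zeta)

def kl_by_integration(alpha, beta):
    f1 = lambda x: alpha * np.exp(-alpha * x)
    f2 = lambda x: beta * np.exp(-beta * x)
    integrand = lambda x: f1(x) * np.log(f1(x) / f2(x))
    value, _ = quad(integrand, 0, np.inf)
    return value

alpha, beta = 2.0, 5.0
print(kl_via_theorem1(alpha, beta))    # beta/alpha - log(beta/alpha) - 1 ~ 0.5837
print(kl_by_integration(alpha, beta))  # agrees up to numerical integration error
```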
As a consequence of Theorem 1, D_KL and D_J are infinitely often differentiable on Ξ × Ξ, and the derivatives are easily obtained by making use of the EF properties. For example, by using Formula (4), we find ∇D_KL(ζ, ·) = π(·) − π(ζ) and that the Hessian matrix of D_KL(ζ, ·) at η is the Fisher information matrix I(η), where ζ ∈ Ξ is considered to be fixed.
Moreover, we obtain from Theorem 1 that the reverse KL-divergence (ζ, η) ↦ D_KL(η, ζ) for ζ, η ∈ Ξ is nothing but the Bregman divergence associated with the cumulant function κ; see, e.g., [1,11,17]. As an obvious consequence of Theorem 1, other symmetrizations of the KL-divergence may be expressed in terms of κ and π as well, such as the so-called resistor-average distance (cf. [18])
\[
D_{RA}(\zeta, \eta) \,=\, 2 \Big[ \frac{1}{D_{KL}(\zeta, \eta)} + \frac{1}{D_{KL}(\eta, \zeta)} \Big]^{-1}
\,=\, \frac{2\, D_{KL}(\zeta, \eta)\, D_{KL}(\eta, \zeta)}{D_J(\zeta, \eta)}, \qquad \zeta, \eta \in \Xi,\; \zeta \neq \eta, \tag{10}
\]
with D_RA(ζ, ζ) = 0, ζ ∈ Ξ, or the distance
\[
D_{GA}(\zeta, \eta) \,=\, \big[ D_{KL}(\zeta, \eta)\, D_{KL}(\eta, \zeta) \big]^{1/2}, \qquad \zeta, \eta \in \Xi, \tag{11}
\]
obtained by taking the harmonic and the geometric mean, respectively, of the KL-divergence and the reverse KL-divergence; see [19].
Remark 1.
Formula (9) can be used to derive the test statistic
\[
\Lambda(x) \,=\, -2 \ln\!\Bigg( \frac{\sup_{\zeta \in \Xi_0} f_\zeta(x)}{\sup_{\zeta \in \Xi} f_\zeta(x)} \Bigg), \qquad x \in X,
\]
of the likelihood-ratio test for the test problem
\[
H_0 : \zeta \in \Xi_0 \qquad \text{against} \qquad H_1 : \zeta \in \Xi \setminus \Xi_0,
\]
where Ξ_0 ⊂ Ξ. If the maximum likelihood estimators (MLEs) ζ̂ = ζ̂(x) and ζ̂_0 = ζ̂_0(x) of ζ in Ξ and Ξ_0 (based on x) both exist, we have
\[
\Lambda \,=\, 2 \big[ \ln(f_{\hat{\zeta}}) - \ln(f_{\hat{\zeta}_0}) \big]
\,=\, 2 \big[ \kappa(\hat{\zeta}_0) - \kappa(\hat{\zeta}) + (\hat{\zeta} - \hat{\zeta}_0)^t\, T \big]
\,=\, 2\, D_{KL}(\hat{\zeta}, \hat{\zeta}_0)
\]
by using that the unrestricted MLE fulfils π(ζ̂) = T; see, e.g., [12] (p. 190) and [13] (Theorem 5.5). In particular, when testing a simple null hypothesis with Ξ_0 = {η} for some fixed η ∈ Ξ, we have Λ = 2 D_KL(ζ̂, η).
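For the simple null hypothesis Ξ_0 = {η}, the identity Λ = 2 D_KL(ζ̂, η) can be evaluated directly. The following sketch (a hypothetical illustration, not part of the article) does so for n i.i.d. exponential observations with null value α_0, where the joint density is of form (5) with T(x) = Σ_j x_j, κ(ζ) = −n ln(−ζ), and the unrestricted MLE is α̂ = 1/x̄.

```python
import numpy as np

# Sketch of Remark 1 for n i.i.d. exponential observations and H_0: alpha = alpha0:
# Lambda = 2 * D_KL(zeta_hat, eta) with zeta_hat = -alpha_hat, eta = -alpha0.

def lr_statistic_exponential(x, alpha0):
    x = np.asarray(x, dtype=float)
    n = x.size
    alpha_hat = 1.0 / x.mean()                 # pi(zeta_hat) = -n/zeta_hat = sum(x)
    kappa = lambda z: -n * np.log(-z)
    pi = lambda z: -n / z
    zeta_hat, eta = -alpha_hat, -alpha0
    d_kl = kappa(eta) - kappa(zeta_hat) + (zeta_hat - eta) * pi(zeta_hat)
    return 2.0 * d_kl

rng = np.random.default_rng(1)
sample = rng.exponential(scale=1.0 / 0.004, size=20)   # data generated with alpha = 0.004
print(lr_statistic_exponential(sample, alpha0=0.005))  # compare with chi-square(1) quantiles
```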
Convenient representations within EFs of the divergences in Formulas (6)–(8) can also be found in the literature; we refer to [2] (Prop. 2.22) for D_R^q, D_H, and D_M, to [20] for D_B, and to [9] for D_R^q. The formulas may all be obtained by computing the quantity
\[
A_q(P_1, P_2) \,=\, \int f_1^{\,q} f_2^{\,1-q}\, \mathrm{d}\mu, \qquad q \in \mathbb{R} \setminus \{0, 1\}. \tag{12}
\]
For q ∈ (0, 1), we have the following identity (cf. [21]).
Lemma 1.
Let P be as in Section 2 with minimal canonical representation (5). Then, for ζ, η ∈ Ξ and q ∈ (0, 1), we have:
\[
A_q(\zeta, \eta) \,=\, \exp\!\Big( \kappa\big(q\zeta + (1-q)\eta\big) - \big[ q\,\kappa(\zeta) + (1-q)\,\kappa(\eta) \big] \Big).
\]
Proof. 
Let ζ, η ∈ Ξ and q ∈ (0, 1). Then,
\[
A_q(\zeta, \eta) \,=\, \int (f_\zeta)^q (f_\eta)^{1-q}\, \mathrm{d}\mu
\,=\, \int \exp\!\Big( \big(q\zeta + (1-q)\eta\big)^t T - \big[ q\,\kappa(\zeta) + (1-q)\,\kappa(\eta) \big] \Big)\, h\, \mathrm{d}\mu
\,=\, \exp\!\Big( \kappa\big(q\zeta + (1-q)\eta\big) - \big[ q\,\kappa(\zeta) + (1-q)\,\kappa(\eta) \big] \Big),
\]
where the convexity of Ξ ensures that κ(qζ + (1−q)η) is defined. □
Remark 2.
For arbitrary divergence measures, several transformations and skewed versions as well as symmetrization methods, such as the Jensen–Shannon symmetrization, are studied in [19]. Applied to the KL-divergence, the skew Jensen–Shannon divergence is introduced as
\[
D_{JS}^q(P_1, P_2) \,=\, q\, D_{KL}\big(P_1,\; q P_1 + (1-q) P_2\big) + (1-q)\, D_{KL}\big(P_2,\; q P_1 + (1-q) P_2\big)
\]
for P_1, P_2 ∈ P and q ∈ (0, 1), which includes the Jensen–Shannon distance for q = 1/2 (the distance (D_JS^{1/2})^{1/2} even forms a metric). Note that, for ζ, η ∈ Ξ, the density q f_ζ + (1−q) f_η of the mixture q P_ζ + (1−q) P_η does not belong to P, in general, such that the identity in Theorem 1 for the KL-divergence is not applicable here.
However, from the proof of Lemma 1, it is obvious that
\[
\frac{1}{A_q(\zeta, \eta)}\, f_\zeta^{\,q}\, f_\eta^{\,1-q} \,=\, f_{q\zeta + (1-q)\eta}, \qquad \zeta, \eta \in \Xi,\; q \in (0, 1),
\]
i.e., the EF P is closed when forming normalized weighted geometric means of the densities. This finding is utilized in [19] to introduce another version of the skew Jensen–Shannon divergence based on the KL-divergence, where the weighted arithmetic mean of the densities is replaced by the normalized weighted geometric mean. The skew geometric Jensen–Shannon divergence thus obtained is given by
\[
D_{GJS}^q(\zeta, \eta) \,=\, q\, D_{KL}\big(\zeta,\; q\zeta + (1-q)\eta\big) + (1-q)\, D_{KL}\big(\eta,\; q\zeta + (1-q)\eta\big), \qquad \zeta, \eta \in \Xi,
\]
for q ∈ (0, 1). By using Theorem 1, we find
\[
\begin{aligned}
D_{GJS}^q(\zeta, \eta)
&= q \Big[ \kappa\big(q\zeta + (1-q)\eta\big) - \kappa(\zeta) + (1-q)(\zeta - \eta)^t \pi(\zeta) \Big]
+ (1-q) \Big[ \kappa\big(q\zeta + (1-q)\eta\big) - \kappa(\eta) + q(\eta - \zeta)^t \pi(\eta) \Big] \\
&= \kappa\big(q\zeta + (1-q)\eta\big) - \big[ q\,\kappa(\zeta) + (1-q)\,\kappa(\eta) \big] + q(1-q)\,(\zeta - \eta)^t \big( \pi(\zeta) - \pi(\eta) \big) \\
&= \ln\!\big(A_q(\zeta, \eta)\big) + q(1-q)\, D_J(\zeta, \eta),
\end{aligned}
\]
for ζ, η ∈ Ξ and q ∈ (0, 1).
In particular, setting q = 1 / 2 gives the geometric Jensen–Shannon distance:
\[
D_{GJS}(\zeta, \eta) \,=\, \kappa\Big(\frac{\zeta + \eta}{2}\Big) - \frac{\kappa(\zeta) + \kappa(\eta)}{2} + \frac{(\zeta - \eta)^t \big( \pi(\zeta) - \pi(\eta) \big)}{4}, \qquad \zeta, \eta \in \Xi. \tag{13}
\]
For more details and properties as well as related divergence measures, we refer to [19,22].
Formulas for D_R^q, D_CR^q, and D_H are readily deduced from Lemma 1.
Theorem 2.
Let P be as in Section 2 with minimal canonical representation (5). Then, for ζ, η ∈ Ξ and q ∈ (0, 1), we have
\[
\begin{aligned}
D_R^q(\zeta, \eta) &= \frac{1}{q(q-1)} \Big( \kappa\big(q\zeta + (1-q)\eta\big) - \big[ q\,\kappa(\zeta) + (1-q)\,\kappa(\eta) \big] \Big), \\
D_{CR}^q(\zeta, \eta) &= \frac{1}{q(q-1)} \Big( \exp\!\Big( \kappa\big(q\zeta + (1-q)\eta\big) - \big[ q\,\kappa(\zeta) + (1-q)\,\kappa(\eta) \big] \Big) - 1 \Big), \quad \text{and} \\
D_H(\zeta, \eta) &= \Big( 2 - 2 \exp\!\Big( \kappa\Big(\frac{\zeta + \eta}{2}\Big) - \frac{\kappa(\zeta) + \kappa(\eta)}{2} \Big) \Big)^{1/2}.
\end{aligned}
\]
Proof. 
Since
\[
D_R^q \,=\, \frac{\ln(A_q)}{q(q-1)}, \qquad D_{CR}^q \,=\, \frac{A_q - 1}{q(q-1)}, \qquad \text{and} \qquad D_H \,=\, \big( 2 - 2 A_{1/2} \big)^{1/2},
\]
the assertions are directly obtained from Lemma 1. □
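As a sketch of how Theorem 2 is used in computations (an added illustration for the exponential family of the earlier examples; function names are illustrative), the divergences D_R^q, D_CR^q, and D_H of two exponential distributions are obtained from the cumulant function alone:

```python
import numpy as np

# Theorem 2 for Exp(alpha) vs. Exp(beta): everything follows from kappa(zeta) = -log(-zeta).

kappa = lambda z: -np.log(-z)

def renyi_cr_hellinger(alpha, beta, q):
    zeta, eta = -alpha, -beta
    log_a_q = kappa(q * zeta + (1 - q) * eta) - (q * kappa(zeta) + (1 - q) * kappa(eta))
    d_renyi = log_a_q / (q * (q - 1))
    d_cr = (np.exp(log_a_q) - 1) / (q * (q - 1))
    a_half = np.exp(kappa((zeta + eta) / 2) - (kappa(zeta) + kappa(eta)) / 2)
    d_hellinger = np.sqrt(2 - 2 * a_half)
    return d_renyi, d_cr, d_hellinger

print(renyi_cr_hellinger(alpha=2.0, beta=5.0, q=0.3))
# For exponentials, A_{1/2} has the closed form 2*sqrt(alpha*beta)/(alpha+beta), here 2*sqrt(10)/7.
```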
It is well-known that
\[
\lim_{q \to 1} D_R^q(P_1, P_2) \,=\, D_{KL}(P_1, P_2) \qquad \text{and} \qquad \lim_{q \to 0} D_R^q(P_1, P_2) \,=\, D_{KL}(P_2, P_1),
\]
such that Formula (9) results from the representation of the Rényi divergence in Theorem 2 by sending q to 1.
The Sharma–Mittal divergence (see [1]) is closely related to the Rényi divergence as well and, by Theorem 2, a representation in EFs is available.
Moreover, representations within EFs can also be derived for so-called local divergences, e.g., for the Cressie–Read local divergence, which results from the CR-divergence by multiplying the integrand by some kernel density function; see [23].
Remark 3.
Inspecting the proof of Theorem 2, D_R^q and D_CR^q are seen to be strictly decreasing functions of A_q for q ∈ (0, 1); for q = 1/2, this is also true for D_H. From an inferential point of view, this finding yields that, for fixed q ∈ (0, 1), test statistics and pivot statistics based on these divergence measures will lead to the same test and confidence region, respectively. This is not the case within some divergence families such as D_R^q, q ∈ (0, 1), where different values of q correspond to different tests and confidence regions, in general.
A more general form of the Hellinger metric is given by
\[
D_{H,m}(P_1, P_2) \,=\, \Big( \int \big| f_1^{1/m} - f_2^{1/m} \big|^m \, \mathrm{d}\mu \Big)^{1/m}
\]
for m ∈ ℕ, where D_{H,2} = D_H; see Formula (8). For m ∈ 2ℕ, i.e., if m is even, the binomial theorem then yields
\[
\big[ D_{H,m}(P_1, P_2) \big]^m \,=\, \int \big( f_1^{1/m} - f_2^{1/m} \big)^m \mathrm{d}\mu
\,=\, \sum_{k=0}^{m} (-1)^k \binom{m}{k} \int f_1^{\,k/m} f_2^{\,(m-k)/m}\, \mathrm{d}\mu
\,=\, \sum_{k=0}^{m} (-1)^k \binom{m}{k}\, A_{k/m}(P_1, P_2),
\]
and inserting for A_{k/m}, k = 1, …, m − 1, according to Lemma 1 along with A_0 ≡ 1 ≡ A_1 gives a formula for D_{H,m} in terms of the cumulant function of the EF P in Section 2. This representation is stated in [16].
Note that the representation for A_q in Lemma 1 (and thus the formulas for D_R^q and D_CR^q in Theorem 2) remain valid for ζ, η ∈ Ξ and q ∈ ℝ \ [0, 1] as long as qζ + (1−q)η ∈ Ξ. This can be used, e.g., to find formulas for D_CR^2 and D_CR^{−1}, which coincide with the Pearson χ²-divergence
\[
D_{\chi^2}(\zeta, \eta) \,=\, \frac{1}{2} \int \frac{(f_\zeta - f_\eta)^2}{f_\eta}\, \mathrm{d}\mu
\,=\, \frac{1}{2} \big[ A_2(\zeta, \eta) - 1 \big]
\,=\, \frac{1}{2} \Big[ \exp\!\big( \kappa(2\zeta - \eta) - 2\kappa(\zeta) + \kappa(\eta) \big) - 1 \Big]
\]
for ζ, η ∈ Ξ with 2ζ − η ∈ Ξ and the reverse Pearson χ²-divergence (or Neyman χ²-divergence) (ζ, η) ↦ D_{χ²}(η, ζ) for ζ, η ∈ Ξ with 2η − ζ ∈ Ξ, respectively. Here, the restrictions on the parameters are obsolete if Ξ = ℝ^k for some k ∈ ℕ, which is the case for the EF of Poisson distributions and for any EF of discrete distributions with finite support such as binomial or multinomial distributions (with n ∈ ℕ fixed). Moreover, quantities similar to A_q, such as ∫ f_ζ (f_η)^γ dμ for γ > 0, arise in the so-called γ-divergence, for which some representations can also be obtained; see [24] (Section 4).
Remark 4.
If the assumption that the EF P is regular is weakened to P being steep, Lemma 1 and Theorem 2 remain true; moreover, the formulas in Theorem 1 are valid for ζ lying in the interior of Ξ. Steep EFs are full EFs in which the boundary points of Ξ that belong to Ξ satisfy a certain steepness property. A prominent example is provided by the full EF of inverse normal distributions. For details, see, e.g., [13].
The quantity A_q in Formula (12) is the special case with n = 2 distributions (and weights q, 1 − q) of the weighted Matusita affinity
\[
\rho_{w_1, \ldots, w_n}(P_1, \ldots, P_n) \,=\, \int \prod_{i=1}^{n} f_i^{\,w_i}\, \mathrm{d}\mu \tag{14}
\]
for distributions P_1, …, P_n with μ-densities f_1, …, f_n, weights w_1, …, w_n > 0 satisfying Σ_{i=1}^n w_i = 1, and n ≥ 2; see [4] (p. 49) and [6]. ρ_{w_1,…,w_n}, in turn, is a generalization of the Matusita affinity
\[
\rho_n(P_1, \ldots, P_n) \,=\, \int \prod_{i=1}^{n} f_i^{\,1/n}\, \mathrm{d}\mu
\]
introduced in [25,26]. Along the lines of the proof of Lemma 1, we find the representation
\[
\rho_{w_1, \ldots, w_n}\big(\zeta^{(1)}, \ldots, \zeta^{(n)}\big) \,=\, \exp\!\Big( \kappa\Big( \sum_{i=1}^{n} w_i\, \zeta^{(i)} \Big) - \sum_{i=1}^{n} w_i\, \kappa\big(\zeta^{(i)}\big) \Big), \qquad \zeta^{(1)}, \ldots, \zeta^{(n)} \in \Xi,
\]
for the EF P in Section 2; cf. [27]. In [4], the quantity in Formula (14) is termed the Hellinger transform, and a representation within EFs is stated in Example 1.88.
ρ_{w_1,…,w_n} can be used, for instance, as the basis of a homogeneity test (with null hypothesis H_0 : ζ^{(1)} = ⋯ = ζ^{(n)}) or in discriminant problems.
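As a brief illustration of this representation (an added sketch; exponential distributions are assumed, as in Section 5, and the function name is illustrative), the weighted affinity is obtained from cumulant-function evaluations only:

```python
import numpy as np

# Weighted Matusita affinity (Hellinger transform) of Exp(alpha_1), ..., Exp(alpha_n)
# via rho_w = exp( kappa(sum_i w_i * zeta_i) - sum_i w_i * kappa(zeta_i) ), kappa(z) = -log(-z).

kappa = lambda z: -np.log(-z)

def weighted_affinity(alphas, weights):
    zetas = -np.asarray(alphas, dtype=float)
    w = np.asarray(weights, dtype=float)        # positive weights summing to one
    return np.exp(kappa(np.sum(w * zetas)) - np.sum(w * kappa(zetas)))

# equal weights give the Matusita affinity rho_n; values close to 1 indicate homogeneity
print(weighted_affinity([2.0, 2.2, 1.9], weights=[1/3, 1/3, 1/3]))
```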
For a representation of an extension of the Jeffrey distance to more than two distributions in an EF, the so-called Toussaint divergence, along with statistical applications, we refer to [8].

4. Entropy Measures

The literature on entropy measures, their applications, and their relations to divergence measures is broad. We focus on some selected results and state several simple representations of entropy measures within EFs.
Let the EF in Section 2 be given with h ≡ 1, which is the case, e.g., for the one-parameter EFs of geometric distributions and exponential distributions as well as for the two-parameter EF of univariate normal distributions. Formula (5) then yields that
\[
\int f_\zeta^{\,r}\, \mathrm{d}\mu \,=\, \int e^{\,r \zeta^t T - r \kappa(\zeta)}\, \mathrm{d}\mu \,=\, e^{\,\kappa(r\zeta) - r\kappa(\zeta)} \,=\, J_r(\zeta), \ \text{say},
\]
for r > 0 and ζ ∈ Ξ with rζ ∈ Ξ. Note that the latter condition is not that restrictive, since the natural parameter space of a regular EF is usually a Cartesian product of the form A_1 × ⋯ × A_k with A_i ∈ { ℝ, (−∞, 0), (0, ∞) } for 1 ≤ i ≤ k.
The Taneja entropy is then obtained as
\[
\begin{aligned}
H_T(\zeta) \,&=\, -2^{\,r-1} \int f_\zeta^{\,r} \ln f_\zeta \, \mathrm{d}\mu
\,=\, -2^{\,r-1} \Big[ \int \zeta^t T\, e^{\,r\zeta^t T - r\kappa(\zeta)}\, \mathrm{d}\mu - \kappa(\zeta)\, J_r(\zeta) \Big] \\
&=\, -2^{\,r-1} J_r(\zeta) \Big[ \int \zeta^t T\, f_{r\zeta}\, \mathrm{d}\mu - \kappa(\zeta) \Big]
\,=\, -2^{\,r-1}\, e^{\,\kappa(r\zeta) - r\kappa(\zeta)} \big[ \zeta^t \pi(r\zeta) - \kappa(\zeta) \big]
\end{aligned}
\]
for r > 0 and ζ ∈ Ξ with rζ ∈ Ξ, which includes the Shannon entropy
\[
H_S(\zeta) \,=\, -\int f_\zeta \ln f_\zeta \, \mathrm{d}\mu \,=\, \kappa(\zeta) - \zeta^t \pi(\zeta), \qquad \zeta \in \Xi,
\]
by setting r = 1; see [7,28].
Several other important entropy measures are functions of J_r and therefore admit respective representations in terms of the cumulant function of the EF. Two examples are provided by the Rényi entropy and the Havrda–Charvát entropy (or Tsallis entropy), which are given by
\[
\begin{aligned}
H_R^r(\zeta) \,&=\, \frac{1}{1-r} \ln J_r(\zeta) \,=\, \frac{\kappa(r\zeta) - r\kappa(\zeta)}{1-r}, \qquad r > 0,\; r \neq 1, \qquad \text{and} \\
H_{HC}^r(\zeta) \,&=\, \frac{1}{1-r} \big[ J_r(\zeta) - 1 \big] \,=\, \frac{1}{1-r} \Big[ e^{\,\kappa(r\zeta) - r\kappa(\zeta)} - 1 \Big], \qquad r > 0,\; r \neq 1,
\end{aligned}
\]
for ζ ∈ Ξ with rζ ∈ Ξ; for the definitions, see, e.g., [1]. More generally, the Sharma–Mittal entropy is seen to be
\[
H_{SM}^{r,s}(\zeta) \,=\, \frac{1}{1-s} \Big[ J_r(\zeta)^{\frac{1-s}{1-r}} - 1 \Big] \,=\, \frac{1}{1-s} \Big[ e^{\,(\kappa(r\zeta) - r\kappa(\zeta))\frac{1-s}{1-r}} - 1 \Big], \qquad r > 0,\; r \neq 1,\; s \in \mathbb{R},\; s \neq 1,
\]
for ζ ∈ Ξ with rζ ∈ Ξ, which yields the representation for H_S as r = s → 1, for H_R^r as s → 1, and for H_HC^r as s → r; see [29].
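For instance (an added sketch, again for the one-parameter exponential family with h ≡ 1; function names are illustrative), the Shannon and Rényi entropies follow from κ and J_r alone and can be checked against their known closed forms:

```python
import numpy as np

# Entropies of Exp(alpha) from the cumulant function: zeta = -alpha, kappa(zeta) = -log(-zeta),
# J_r(zeta) = exp(kappa(r*zeta) - r*kappa(zeta)).

kappa = lambda z: -np.log(-z)
pi = lambda z: -1.0 / z

def shannon_entropy(alpha):
    zeta = -alpha
    return kappa(zeta) - zeta * pi(zeta)             # = 1 - log(alpha)

def renyi_entropy(alpha, r):
    zeta = -alpha
    log_j_r = kappa(r * zeta) - r * kappa(zeta)
    return log_j_r / (1 - r)

alpha = 0.004
print(shannon_entropy(alpha), 1 - np.log(alpha))               # identical
print(renyi_entropy(alpha, r=2.0), np.log(2) - np.log(alpha))  # closed form log(r)/(r-1) - log(alpha)
```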
If the assumption h ≡ 1 is not met, the calculation of the entropies becomes more involved. The Shannon entropy, for instance, is then given by
\[
H_S(\zeta) \,=\, \kappa(\zeta) - \zeta^t \pi(\zeta) - E_\zeta[\ln(h)], \qquad \zeta \in \Xi,
\]
where the additional term E_ζ[ln(h)], being the mean of ln(h) under P_ζ, will in general also depend on ζ; see, e.g., [17]. Since
\[
\int f_\zeta^{\,r}\, \mathrm{d}\mu \,=\, e^{\,\kappa(r\zeta) - r\kappa(\zeta)}\, E_{r\zeta}\big[ h^{\,r-1} \big]
\]
for r > 0 and ζ ∈ Ξ with rζ ∈ Ξ (cf. [29]), more complicated expressions result for the other entropies, requiring the computation of respective moments of h. Of course, we arrive at the same expressions as in the case h ≡ 1 if the entropies are introduced with respect to the dominating measure ν, which, in general, is neither a counting measure nor a Lebesgue measure; see Section 2. However, in contrast to divergence measures, entropies usually depend on the dominating measure, such that the resulting entropy values of the distributions will be different.
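As an example with h ≢ 1 (a standard case, not spelled out in the article), consider the Poisson family with natural parameter ζ = ln(λ), statistic T(x) = x, cumulant function κ(ζ) = e^ζ = λ, and h(x) = 1/x!. The formula above then gives
\[
H_S(\zeta) \,=\, \kappa(\zeta) - \zeta\,\pi(\zeta) - E_\zeta[\ln(h)] \,=\, \lambda - \lambda \ln(\lambda) + E_\lambda[\ln(X!)],
\]
where the expectation E_λ[ln(X!)] has no simple closed form and is typically evaluated numerically.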
Representations of Rényi and Shannon entropies for various multivariate distributions including several EFs can be found in [30].

5. Application

As mentioned above, applications of divergence measures in statistical inference have been discussed extensively; see the references in the Introduction. As an example, we make use of the representations of the symmetric divergences (distances) in Section 3 to construct confidence regions that differ from the standard rectangles for exponential parameters in a multi-sample situation.
Let n_1, …, n_k ∈ ℕ and let X_{ij}, 1 ≤ i ≤ k, 1 ≤ j ≤ n_i, be independent random variables, where X_{i1}, …, X_{i n_i} follow an exponential distribution with (unknown) mean 1/α_i for 1 ≤ i ≤ k. The overall joint distribution P_α, say, has the density function
\[
f_\alpha(x) \,=\, e^{\,-\alpha^t T(x) - \kappa(\alpha)}, \tag{15}
\]
with the k-dimensional statistic
\[
T(x) \,=\, \big( x_{1\cdot}, \ldots, x_{k\cdot} \big)^t, \qquad \text{where} \quad x_{i\cdot} \,=\, \sum_{j=1}^{n_i} x_{ij}, \quad 1 \leq i \leq k,
\]
for x = (x_{11}, …, x_{1 n_1}, …, x_{k1}, …, x_{k n_k}) ∈ (0, ∞)^n, the cumulant function
\[
\kappa(\alpha) \,=\, -\sum_{i=1}^{k} n_i \ln(\alpha_i), \qquad \alpha = (\alpha_1, \ldots, \alpha_k)^t \in (0, \infty)^k,
\]
and n = Σ_{i=1}^k n_i. It is easily verified that P = { P_α : α ∈ (0, ∞)^k } forms a regular EF with minimal canonical representation (15) (natural parameter −α). The corresponding mean value function is given by
\[
\pi(\alpha) \,=\, \Big( \frac{n_1}{\alpha_1}, \ldots, \frac{n_k}{\alpha_k} \Big)^t, \qquad \alpha = (\alpha_1, \ldots, \alpha_k)^t \in (0, \infty)^k.
\]
To construct confidence regions for α based on the Jeffrey distance D_J, the resistor-average distance D_RA, the distance D_GA, the Hellinger metric D_H, and the geometric Jensen–Shannon distance D_GJS, we first compute the KL-divergence D_KL and the affinity A_{1/2}. Note that, by Remark 3, constructing a confidence region based on D_H is equivalent to constructing a confidence region based on either A_{1/2}, D_R^{1/2}, or D_CR^{1/2}.
For α = (α_1, …, α_k)ᵗ, β = (β_1, …, β_k)ᵗ ∈ (0, ∞)^k, we obtain from Theorem 1 that
\[
D_{KL}(\alpha, \beta) \,=\, -\sum_{i=1}^{k} n_i \ln(\beta_i) + \sum_{i=1}^{k} n_i \ln(\alpha_i) - \sum_{i=1}^{k} \frac{n_i}{\alpha_i} (\alpha_i - \beta_i)
\,=\, \sum_{i=1}^{k} n_i \Big[ \frac{\beta_i}{\alpha_i} - \ln\!\Big(\frac{\beta_i}{\alpha_i}\Big) - 1 \Big],
\]
such that
\[
D_J(\alpha, \beta) \,=\, D_{KL}(\alpha, \beta) + D_{KL}(\beta, \alpha) \,=\, \sum_{i=1}^{k} n_i \Big[ \frac{\alpha_i}{\beta_i} + \frac{\beta_i}{\alpha_i} - 2 \Big].
\]
D_RA and D_GA are then computed by inserting D_KL and D_J into Formulas (10) and (11). Applying Lemma 1 yields
\[
A_{1/2}(\alpha, \beta) \,=\, \prod_{i=1}^{k} \Big( \frac{\alpha_i + \beta_i}{2} \Big)^{\!-n_i} \prod_{i=1}^{k} \alpha_i^{\,n_i/2} \prod_{i=1}^{k} \beta_i^{\,n_i/2}
\,=\, \prod_{i=1}^{k} \Bigg[ \frac{1}{2}\Big( \sqrt{\tfrac{\alpha_i}{\beta_i}} + \sqrt{\tfrac{\beta_i}{\alpha_i}} \Big) \Bigg]^{-n_i},
\]
which gives D_H(α, β) = [2 − 2 A_{1/2}(α, β)]^{1/2} upon insertion and, by using Formula (13), also leads to
\[
D_{GJS}(\alpha, \beta) \,=\, \ln\!\big(A_{1/2}(\alpha, \beta)\big) + \frac{D_J(\alpha, \beta)}{4}
\,=\, \frac{1}{4} \sum_{i=1}^{k} n_i \Bigg[ \frac{\alpha_i}{\beta_i} + \frac{\beta_i}{\alpha_i} - 4 \ln\!\Big( \frac{1}{2}\Big( \sqrt{\tfrac{\alpha_i}{\beta_i}} + \sqrt{\tfrac{\beta_i}{\alpha_i}} \Big) \Big) - 2 \Bigg].
\]
The MLE α̂ = (α̂_1, …, α̂_k)ᵗ of α based on X = (X_{11}, …, X_{1 n_1}, …, X_{k1}, …, X_{k n_k}) is given by
\[
\hat{\alpha} \,=\, \Big( \frac{n_1}{X_{1\cdot}}, \ldots, \frac{n_k}{X_{k\cdot}} \Big)^t,
\]
where α̂_1, …, α̂_k are independent. By inserting, the random distances D_J(α̂, α), D_RA(α̂, α), D_GA(α̂, α), D_H(α̂, α), and D_GJS(α̂, α) turn out to depend on X only through the vector (α_1/α̂_1, …, α_k/α̂_k)ᵗ of component-wise ratios, where α_i/α̂_i has a gamma distribution with shape parameter n_i, scale parameter 1/n_i, and mean 1 for 1 ≤ i ≤ k. Since these ratios are moreover independent, the above random distances form pivot statistics with distributions free of α.
Now, confidence regions for α with confidence level p ∈ (0, 1) are given by
\[
C_\bullet \,=\, \big\{\, \alpha \in (0, \infty)^k \,:\, D_\bullet(\hat{\alpha}, \alpha) \,\leq\, c_\bullet(p) \,\big\},
\]
where c_•(p) denotes the p-quantile of D_•(α̂, α) for • = J, RA, GA, H, GJS, numerical values of which can readily be obtained via Monte Carlo simulation by sampling from gamma distributions, as sketched below.
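A minimal Monte Carlo sketch for such a quantile (an added illustration for the Jeffrey distance; function names are illustrative) draws the independent gamma-distributed ratios and evaluates the pivot:

```python
import numpy as np

# Monte Carlo approximation of c_J(p): under P_alpha, r_i = alpha_i / alpha_hat_i are independent
# Gamma(n_i, scale 1/n_i), and D_J(alpha_hat, alpha) = sum_i n_i * (r_i + 1/r_i - 2) is a pivot.

def jeffrey_quantile(n_sizes, p=0.90, reps=100_000, seed=0):
    rng = np.random.default_rng(seed)
    n_sizes = np.asarray(n_sizes, dtype=float)
    ratios = rng.gamma(shape=n_sizes, scale=1.0 / n_sizes, size=(reps, n_sizes.size))
    d_j = np.sum(n_sizes * (ratios + 1.0 / ratios - 2.0), axis=1)
    return np.quantile(d_j, p)

c_j = jeffrey_quantile(n_sizes=[10, 15], p=0.90)
print(c_j)   # C_J collects all alpha with D_J(alpha_hat, alpha) <= c_j
```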
Confidence regions for the mean vector m = (1/α_1, …, 1/α_k)ᵗ with confidence level p ∈ (0, 1) are then given by
\[
\tilde{C}_\bullet \,=\, \Big\{\, \Big( \frac{1}{\alpha_1}, \ldots, \frac{1}{\alpha_k} \Big)^t \in (0, \infty)^k \,:\, (\alpha_1, \ldots, \alpha_k)^t \in C_\bullet \,\Big\}
\]
for • = J, RA, GA, H, GJS.
In Figure 1 and Figure 2, realizations of C̃_J, C̃_RA, C̃_GA, C̃_H, and C̃_GJS are depicted for the two-sample case (k = 2) and some sample sizes n_1, n_2 and values of α̂ = (α̂_1, α̂_2)ᵗ, where the confidence level is chosen as p = 90%. Additionally, realizations of the standard confidence region
\[
R \,=\, \Big[ \frac{2 n_1}{\hat{\alpha}_1\, \chi^2_{1-q}(2 n_1)},\; \frac{2 n_1}{\hat{\alpha}_1\, \chi^2_{q}(2 n_1)} \Big] \times \Big[ \frac{2 n_2}{\hat{\alpha}_2\, \chi^2_{1-q}(2 n_2)},\; \frac{2 n_2}{\hat{\alpha}_2\, \chi^2_{q}(2 n_2)} \Big]
\]
with a confidence level of 90% for m = (m_1, m_2)ᵗ are shown in the figures, where q = (1 − 0.9)/2 and χ²_γ(v) denotes the γ-quantile of the chi-square distribution with v degrees of freedom.
It is found that, over the sample sizes and realizations of α̂ considered, the confidence regions C̃_J, C̃_RA, C̃_GA, C̃_H, and C̃_GJS are similarly shaped but do not coincide, as the plots for different sample sizes show. In terms of (observed) area, all divergence-based confidence regions perform considerably better than the standard rectangle. This finding, however, depends on the parameter of interest, which here is the vector of exponential means; for the divergence-based confidence regions and the standard rectangle for α itself, the contrary assertion is true. Although the divergence-based confidence regions have a smaller area than the standard rectangle, this is not at the cost of large projection lengths with respect to the m_1- and m_2-axes, which serve as further characteristics for comparing confidence regions. Monte Carlo simulations may moreover be applied to compute the expected area and projection lengths as well as the coverage probabilities of false parameters for a more rigorous comparison of the performance of the confidence regions, which is beyond the scope of this article.

Author Contributions

Conceptualization, S.B. and U.K.; writing—original draft preparation, S.B.; writing—review and editing, U.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

CR    Cressie–Read
EF    exponential family
KL    Kullback–Leibler
MLE   maximum likelihood estimator

References

1. Pardo, L. Statistical Inference Based on Divergence Measures; Chapman & Hall/CRC: Boca Raton, FL, USA, 2006.
2. Liese, F.; Vajda, I. Convex Statistical Distances; Teubner: Leipzig, Germany, 1987.
3. Vajda, I. Theory of Statistical Inference and Information; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1989.
4. Liese, F.; Miescke, K.J. Statistical Decision Theory: Estimation, Testing, and Selection; Springer: New York, NY, USA, 2008.
5. Broniatowski, M.; Keziou, A. Parametric estimation and tests through divergences and the duality technique. J. Multivar. Anal. 2009, 100, 16–36.
6. Katzur, A.; Kamps, U. Homogeneity testing via weighted affinity in multiparameter exponential families. Stat. Methodol. 2016, 32, 77–90.
7. Menendez, M.L. Shannon's entropy in exponential families: Statistical applications. Appl. Math. Lett. 2000, 13, 37–42.
8. Menéndez, M.; Salicrú, M.; Morales, D.; Pardo, L. Divergence measures between populations: Applications in the exponential family. Commun. Statist. Theory Methods 1997, 26, 1099–1117.
9. Morales, D.; Pardo, L.; Pardo, M.C.; Vajda, I. Rényi statistics for testing composite hypotheses in general exponential models. Statistics 2004, 38, 133–147.
10. Toma, A.; Broniatowski, M. Dual divergence estimators and tests: Robustness results. J. Multivar. Anal. 2011, 102, 20–36.
11. Katzur, A.; Kamps, U. Classification into Kullback–Leibler balls in exponential families. J. Multivar. Anal. 2016, 150, 75–90.
12. Barndorff-Nielsen, O. Information and Exponential Families in Statistical Theory; Wiley: Chichester, UK, 2014.
13. Brown, L.D. Fundamentals of Statistical Exponential Families; Institute of Mathematical Statistics: Hayward, CA, USA, 1986.
14. Pfanzagl, J. Parametric Statistical Theory; de Gruyter: Berlin, Germany, 1994.
15. Kullback, S. Information Theory and Statistics; Wiley: New York, NY, USA, 1959.
16. Huzurbazar, V.S. Exact forms of some invariants for distributions admitting sufficient statistics. Biometrika 1955, 42, 533–537.
17. Nielsen, F.; Nock, R. Entropies and cross-entropies of exponential families. In Proceedings of the 2010 IEEE 17th International Conference on Image Processing, Hong Kong, China, 26–29 September 2010; pp. 3621–3624.
18. Johnson, D.; Sinanovic, S. Symmetrizing the Kullback–Leibler distance. IEEE Trans. Inf. Theory 2001. Available online: https://hdl.handle.net/1911/19969 (accessed on 5 June 2021).
19. Nielsen, F. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy 2019, 21, 485.
20. Kailath, T. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. Commun. Technol. 1967, 15, 52–60.
21. Vuong, Q.N.; Bedbur, S.; Kamps, U. Distances between models of generalized order statistics. J. Multivar. Anal. 2013, 118, 24–36.
22. Nielsen, F. On a generalization of the Jensen–Shannon divergence and the Jensen–Shannon centroid. Entropy 2020, 22, 221.
23. Avlogiaris, G.; Micheas, A.; Zografos, K. On local divergences between two probability measures. Metrika 2016, 79, 303–333.
24. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081.
25. Matusita, K. Decision rules based on the distance, for problems of fit, two samples, and estimation. Ann. Math. Statist. 1955, 26, 631–640.
26. Matusita, K. On the notion of affinity of several distributions and some of its applications. Ann. Inst. Statist. Math. 1967, 19, 181–192.
27. Garren, S.T. Asymptotic distribution of estimated affinity between multiparameter exponential families. Ann. Inst. Statist. Math. 2000, 52, 426–437.
28. Beitollahi, A.; Azhdari, P. Exponential family and Taneja's entropy. Appl. Math. Sci. 2010, 41, 2013–2019.
29. Nielsen, F.; Nock, R. A closed-form expression for the Sharma–Mittal entropy of exponential families. J. Phys. A Math. Theor. 2012, 45, 032003.
30. Zografos, K.; Nadarajah, S. Expressions for Rényi and Shannon entropies for multivariate distributions. Statist. Probab. Lett. 2005, 71, 71–84.
Figure 1. Illustration of the confidence regions C̃_J (solid light grey line), C̃_RA (solid dark grey line), C̃_GA (solid black line), C̃_H (dashed black line), C̃_GJS (dotted black line), and R (rectangle) for the mean vector m = (m_1, m_2)ᵗ with level 90% and sample sizes n_1, n_2, based on a realization α̂ = (0.0045, 0.0055)ᵗ, respectively m̂ = (222.2, 181.8)ᵗ, of the MLE (circle).
Figure 2. Illustration of the confidence regions C̃_J (solid light grey line), C̃_RA (solid dark grey line), C̃_GA (solid black line), C̃_H (dashed black line), C̃_GJS (dotted black line), and R (rectangle) for the mean vector m = (m_1, m_2)ᵗ with level 90% and sample sizes n_1, n_2, based on a realization α̂ = (0.003, 0.007)ᵗ, respectively m̂ = (333.3, 142.9)ᵗ, of the MLE (circle).
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
