Article

The Cauchy Distribution in Information Theory

Independent Researcher, Princeton, NJ 08540, USA
Entropy 2023, 25(2), 346; https://doi.org/10.3390/e25020346
Submission received: 20 September 2022 / Revised: 5 November 2022 / Accepted: 2 February 2023 / Published: 13 February 2023
(This article belongs to the Special Issue Applications of Information Theory in Statistics)

Abstract

The Gaussian law reigns supreme in the information theory of analog random variables. This paper showcases a number of information theoretic results which find elegant counterparts for Cauchy distributions. New concepts such as that of equivalent pairs of probability measures and the strength of real-valued random variables are introduced here and shown to be of particular relevance to Cauchy distributions.

1. Introduction

Since the inception of information theory [1], the Gaussian distribution has emerged as the paramount example of a continuous random variable leading to closed-form expressions for information measures and extremality properties possessing great pedagogical value. In addition, the role of the Gaussian distribution as a ubiquitous model for analog information sources and for additive thermal noise has elevated the corresponding formulas for rate–distortion functions and capacity–cost functions to iconic status in information theory. Beyond discrete random variables, by and large, information theory textbooks confine their coverage and examples to Gaussian random variables.
The exponential distribution has also been shown [2] to lead to closed-form formulas for various information measures such as differential entropy, mutual information and relative entropy, as well as rate–distortion functions for Markov processes and the capacity of continuous-time timing channels with memory such as the exponential-server queue [3].
Despite its lack of moments, the Cauchy distribution also leads to pedagogically attractive closed-form expressions for various information measures. In addition to showcasing those, we introduce an attribute, which we refer to as the strength of a real-valued random variable, under which the Cauchy distribution is shown to possess optimality properties. Along with the stability of the Cauchy law, those properties result in various counterparts to the celebrated fundamental limits for memoryless Gaussian sources and channels.
To enhance readability and ease of reference, the rest of this work is organized in 120 items grouped into 17 sections, plus an appendix.
Section 2 presents the family of Cauchy random variables and their basic properties as well as multivariate generalizations, and the Rider univariate density which includes the Cauchy density as a special case and finds various information theoretic applications.
Section 3 gives closed-form expressions for the differential entropies of the univariate and multivariate densities covered in Section 2.
Introduced previously for unrelated purposes, the Shannon and η -transforms reviewed in Section 4 prove useful to derive several information theoretic results for Cauchy and related laws.
Applicable to any real-valued random variable and inspired by information theory, the central notion of strength is introduced in Section 5 along with its major properties. In particular, it is shown that convergence in strength is an intermediate criterion between convergence in probability and convergence in L q , q > 0 , and that differential entropy is continuous with respect to the addition of independent vanishing strength noise.
Section 6 shows that, for any ρ > 0, the maximal differential entropy density satisfying E[log(1 + |Z|^ρ)] ≤ θ can be obtained in closed form, but its shape (not just its scale) depends on the value of θ. In particular, the Cauchy density is the solution only if ρ = 2 and θ = log 4. In contrast, we show that, among all random variables with a given strength, the centered Cauchy density has maximal differential entropy, regardless of the value of the constraint. This result suggests the definition of the entropy strength of Z as the strength of a Cauchy random variable whose differential entropy is the same as that of Z. Modulo a factor, entropy power is the square of the entropy strength. Section 6 also gives a maximal differential entropy characterization of the standard spherical Cauchy multivariate density.
Information theoretic terminology for the logarithm of the Radon–Nikodym derivative, as well as its distribution, the relative information spectrum is given in Section 7. The relative information spectrum for Cauchy distributions is found and shown to depend on their location and scale through a single scalar. This is a rare property, not satisfied by most common families such as Gaussian, exponential, Laplace, etc. Section 8 introduces the notion of equivalent pairs of probability measures, which plays an important role not only in information theory but in statistical inference. Distinguishing P 1 from Q 1 has the same fundamental limits as distinguishing P 2 from Q 2 if ( P 1 , Q 1 ) and ( P 2 , Q 2 ) are equivalent pairs. Section 9 studies the interplay between f-divergences and equivalent pairs. A simple formula for the f-divergence between Cauchy distributions results from the explicit expression for the relative information spectrum found in Section 7. These results are then used to easily derive a host of explicit expressions for χ 2 -divergence, relative entropy, total variation distance, Hellinger divergence and Rényi divergence in Section 10, Section 11, Section 12, Section 13 and Section 14, respectively.
In addition to the Fisher information matrix of the Cauchy family, Section 15 finds a counterpart of de Bruijn’s identity [4] for convolutions with scaled Cauchy random variables, instead of convolutions with scaled Gaussian random variables as in the conventional setting.
Section 16 is devoted to mutual information. The mutual information between a Cauchy random variable and its noisy version contaminated by additive independent Cauchy noise exhibits a pleasing counterpart (modulo a factor of two) with the Gaussian case, in which the signal-to-noise ratio is now given by the ratio of strengths rather than variances. With Cauchy noise, Cauchy inputs maximize mutual information under an output strength constraint. The elementary fact that an output variance constraint translates directly into an input variance constraint does not carry over to input and output strengths, and indeed we identify non-Cauchy inputs that may achieve higher mutual information than a Cauchy input with the same strength. Section 16 also considers the dual setting in which the input is Cauchy, but the additive noise need not be. Lower bounds on the mutual information, attained by Cauchy noise, are offered. However, as the bounds do not depend exclusively on the noise strength, they do not rule out the possibility that a non-Cauchy noise with identical strength may be least favorable. If distortion is measured by strength, the rate–distortion function of a Cauchy memoryless source is shown to admit (modulo a factor of two) the same rate–distortion function as the memoryless Gaussian source with mean–square distortion, replacing the source variance by its strength. Theorem 17 gives a very general continuity result for mutual information that encompasses previous such results. While convergence in probability to zero of the input to an additive-noise transformation does not imply vanishing input-output mutual information, convergence in strength does under very general conditions on the noise distribution.
Some concluding observations about generalizations and open problems are collected in Section 17, including a generalization of the notion of strength.
The definite integrals used in the main body are collected and justified in Appendix A.

2. The Cauchy Distribution and Generalizations

In probability theory, the Cauchy (also known as Lorentz and as Breit–Wigner) distribution is the prime example of a real-valued random variable none of whose moments of order one or higher exists, and as such it is not encompassed by either the law of large numbers or the central limit theorem.
  • A real-valued random variable V is said to be standard Cauchy if its probability density function is
    $$f_V(x) = \frac{1}{\pi}\,\frac{1}{x^2+1}, \qquad x \in \mathbb{R}.$$
    Furthermore, X is said to be Cauchy if there exist λ ≠ 0 and μ ∈ ℝ such that X = λV + μ, in which case
    $$f_X(x) = \frac{|\lambda|}{\pi}\,\frac{1}{(x-\mu)^2+\lambda^2}, \qquad x \in \mathbb{R},$$
    where μ and |λ| are referred to as the location (or median) and scale, respectively, of the Cauchy distribution. If μ = 0, (2) is said to be centered Cauchy.
  • Since E[max{0, V}] = E[max{0, −V}] = ∞, the mean of a Cauchy random variable does not exist. Furthermore, E[|V|^q] = ∞ for q ≥ 1, and the moment generating function of V does not exist (except, trivially, at 0). The characteristic function of the standard Cauchy random variable is
    $$\mathbb{E}\big[e^{\mathrm{i}\omega V}\big] = e^{-|\omega|}, \qquad \omega \in \mathbb{R}.$$
  • Using (3), we can verify that a Cauchy random variable has the curious property that adding an independent copy to it has the same effect, statistically speaking, as adding an identical copy. In addition to the Gaussian and Lévy distributions, the Cauchy distribution is stable: a linear combination of independent copies remains in the family, and is infinitely divisible: it can be expressed as an n-fold convolution for any n. It follows from (3) that if { V 1 , V 2 , } are independent, standard Cauchy, and a is a deterministic sequence with finite 1 -norm a 1 , then i = 1 a i V i has the same distribution as a 1 V . In particular, the time average of independent identically distributed Cauchy random variables has the same distribution as any of the random variables. The families { λ V , λ I } and { V + μ , μ I } , with I any interval of the real line, are some of the simplest parametrized random variables that are not an exponential family.
  • If Θ is uniformly distributed on [−π/2, π/2], then tan Θ is standard Cauchy. This follows since, in view of (1) and (A1), the standard Cauchy cumulative distribution function is
    $$F_V(x) = \frac{1}{2} + \frac{1}{\pi}\arctan(x), \qquad x \in \mathbb{R}.$$
    Therefore, V has unit semi-interquartile length. The functional inverse of (4) is the standard Cauchy quantile function given by
    $$Q_V(t) = \tan\Big(\pi\Big(t-\frac{1}{2}\Big)\Big), \qquad t \in (0,1).$$
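    As a quick numerical illustration of Items 3 and 4 (not part of the original paper), the following Python sketch samples the standard Cauchy distribution through the quantile function (5) and checks the CDF (4), as well as the fact that the time average of i.i.d. standard Cauchy variables is again standard Cauchy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Item 4: V = tan(pi*(U - 1/2)) with U uniform on (0,1) is standard Cauchy.
u = rng.uniform(size=1_000_000)
v = np.tan(np.pi * (u - 0.5))

# Empirical CDF at a few points vs. F_V(x) = 1/2 + arctan(x)/pi.
for x in (-2.0, 0.0, 1.0, 5.0):
    print(f"x = {x:+.1f}:  empirical {np.mean(v <= x):.4f}   theory {0.5 + np.arctan(x)/np.pi:.4f}")

# Item 3: the time average of i.i.d. standard Cauchy samples is again standard
# Cauchy, so its semi-interquartile range stays near 1 instead of shrinking.
avg = v.reshape(1000, 1000).mean(axis=1)
q25, q75 = np.quantile(avg, [0.25, 0.75])
print("semi-interquartile range of the time average:", (q75 - q25) / 2)
```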
  • If X_1 and X_2 are standard Gaussian with correlation coefficient ρ ∈ (−1, 1), then X_1/X_2 is Cauchy with scale √(1−ρ²) and location ρ. This implies that the reciprocal of a standard Cauchy random variable is also standard Cauchy.
  • Taking the cue from the Gaussian case, we say that a random vector is multivariate Cauchy if any linear combination of its components has a Cauchy distribution. Necessary and sufficient conditions for a characteristic function to be that of a multivariate Cauchy were shown by Ferguson [5]. Unfortunately, no general expression is known for the corresponding probability density function. This accounts for the fact that one aspect, in which the Cauchy distribution does not quite reach the wealth of information theoretic results attainable with the Gaussian distribution, is in the study of multivariate models of dependent random variables. Nevertheless, special cases of multivariate Cauchy distribution do admit some interesting information theoretic results as we will see below. The standard spherical multivariate Cauchy probability density function on R n is (e.g., [6])
    $$f_{V^n}(x) = \frac{\Gamma\big(\frac{n+1}{2}\big)}{\pi^{\frac{n+1}{2}}}\,\big(1+\|x\|^2\big)^{-\frac{n+1}{2}},$$
    where Γ(·) is the Gamma function. Therefore, V^n = (V_1, …, V_n) are exchangeable random variables. If X_0, X_1, …, X_n are independent standard normal, then the vector $X_0^{-1}X^n$ has the density in (6). With the aid of (A10), we can verify that any subset of k ∈ {1, …, n−1} components of V^n is distributed according to V^k. In particular, the marginals of (6) are given by (1). Generalizing (3), the characteristic function of (6) is
    $$\mathbb{E}\big[e^{\mathrm{i}\,t^{\top}V^n}\big] = e^{-\|t\|}, \qquad t \in \mathbb{R}^n.$$
  • In parallel to Item 1, we may generalize (6) by dropping the restriction that it be centered at the origin and allowing ellipsoidal deformation, i.e., letting Z n = Λ 1 2 V n + μ with μ R n and a positive definite n × n matrix Λ . Therefore,
    f Z n ( x ) = Γ n + 1 2 π n + 1 2 det 1 2 ( Λ ) 1 + ( x μ ) Λ 1 ( x μ ) n + 1 2 .
    While ρ Z n is a Cauchy random variable for ρ R n { 0 } , (8) fails to encompass every multivariate Cauchy distribution—in particular, the important case of independent Cauchy random variables. Another reason the usefulness of the model in (8) is limited is that it is not closed under independent additions: if V n and V ¯ n are independent, each distributed according to (6); then, Λ 1 2 V n + Λ ¯ 1 2 V ¯ n , while multivariate Cauchy, does not have a density of the type in (8) unless Λ = α Λ ¯ for some α > 0 .
  • Another generalization of the (univariate) Cauchy distribution, which comes into play in our analysis, was introduced by Rider in 1958 [7]. With ρ > 0 and βρ > 1,
    $$f_{V_{\beta,\rho}}(x) = \frac{\kappa_{\beta,\rho}}{\big(1+|x|^\rho\big)^{\beta}}, \qquad x \in \mathbb{R},$$
    $$\kappa_{\beta,\rho} = \frac{\rho\,\Gamma(\beta)}{2\,\Gamma\big(\frac{1}{\rho}\big)\,\Gamma\big(\beta-\frac{1}{\rho}\big)}.$$
    In addition to the (β, ρ) parametrization in (9), we may introduce scale and location parameters by means of λV_{β,ρ} + μ, just as we did in the Cauchy case (β, ρ) = (1, 2). Another notable special case is $\sqrt{\nu}\,V_{\frac{\nu+1}{2},\,2}$, which is the centered Student-t random variable, itself equivalent to a Pearson type VII distribution.

3. Differential Entropy

9.
The differential entropy of a Cauchy random variable is
$$h(\lambda V + \mu) = \log|\lambda| + h(V),$$
$$h(V) = -\int_{-\infty}^{\infty} f_V(t)\,\log f_V(t)\,\mathrm{d}t = \log(4\pi),$$
using (A3). Throughout this paper, unless the logarithm base is explicitly shown, it can be chosen by the reader as long as it is the same on both sides of the equation. For natural logarithms, the information measure unit is the nat.
10.
An alternative, sometimes advantageous, expression for the differential entropy of a real-valued random variable is feasible if its cumulative distribution function F_X is continuous and strictly monotonic. Then, the quantile function is its functional inverse, i.e., F_X(Q_X(t)) = t for all t ∈ (0, 1), which implies that Q̇_X(t) f_X(Q_X(t)) = 1 for all t ∈ (0, 1). Moreover, since X and Q_X(U), with U uniformly distributed on [0, 1], have identical distributions, we obtain
$$h(X) = \mathbb{E}\big[-\log f_X(X)\big] = \mathbb{E}\big[-\log f_X(Q_X(U))\big] = \int_0^1 \log \dot{Q}_X(t)\,\mathrm{d}t.$$
Since (4) is indeed continuous and strictly monotonic, we can verify that we recover (12) by means of (5), (13) and (A2).
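As a sanity check (ours, not the paper's), the quantile-based formula (13) can also be evaluated numerically for the standard Cauchy, where Q̇_V(t) = π/cos²(π(t − 1/2)), and compared with (12):

```python
import numpy as np
from scipy.integrate import quad

# h(V) = ∫_0^1 log Qdot_V(t) dt with Qdot_V(t) = pi / cos(pi*(t - 1/2))**2.
# The integrand has mild logarithmic singularities at t = 0 and t = 1.
integrand = lambda t: np.log(np.pi / np.cos(np.pi * (t - 0.5))**2)
value, _ = quad(integrand, 0.0, 1.0, limit=200)
print("quadrature of (13):", value, "   log(4*pi):", np.log(4 * np.pi))
```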
11.
Despite not having finite moments, an independent identically distributed sequence of Cauchy random variables { Z i } is information stable in the sense that
$$-\frac{1}{n}\sum_{i=1}^{n}\log f_Z(Z_i) \to h(Z), \qquad \text{a.s.},$$
because of the strong law of large numbers.
12.
With V n distributed according to the standard spherical multivariate Cauchy density in (6), it is shown in [8] that
$$\mathbb{E}\big[\log_e\big(1+\|V^n\|^2\big)\big] = \psi\Big(\frac{n+1}{2}\Big) + \log_e 4 + \gamma,$$
where γ is the Euler–Mascheroni constant and ψ(·) is the digamma function. Therefore, the differential entropy of (6) is, in nats, (see also [9])
$$h(V^n) = \frac{n+1}{2}\,\mathbb{E}\big[\log_e\big(1+\|V^n\|^2\big)\big] + \frac{n+1}{2}\log_e\pi - \log_e\Gamma\Big(\frac{n+1}{2}\Big)$$
$$= \frac{n+1}{2}\Big(\log_e(4\pi) + \gamma + \psi\Big(\frac{n+1}{2}\Big)\Big) - \log_e\Gamma\Big(\frac{n+1}{2}\Big),$$
whose growth is essentially linear with n: the conditional differential entropy
h(V_{n+1} | V^n) = h(V^{n+1}) − h(V^n) is monotonically decreasing with
$$h(V_2 | V_1) = \frac{3}{2}\Big(\gamma + \psi\Big(\frac{3}{2}\Big)\Big) + \log_e 4 = 2.306\ldots$$
$$\lim_{n\to\infty} h(V_{n+1} | V^n) = \frac{1}{2}\big(1 + \gamma + \log_e(4\pi)\big) = 2.054\ldots$$
13.
By the scaling law of differential entropy and its invariance to location, we obtain
$$h\big(\Lambda^{\frac{1}{2}}V^n + \mu\big) = h(V^n) + \tfrac{1}{2}\log\big|\det(\Lambda)\big|.$$
14.
Invoking (A6), we obtain a closed-form formula for the differential entropy, in nats, of the generalized Cauchy distribution (9) as
$$h(V_{\beta,\rho}) = -\log_e \kappa_{\beta,\rho} + \beta\,\mathbb{E}\big[\log_e\big(1+|V_{\beta,\rho}|^\rho\big)\big]$$
$$= -\log_e \kappa_{\beta,\rho} + \beta\,\psi(\beta) - \beta\,\psi\Big(\beta-\frac{1}{\rho}\Big),$$
with κ β , ρ defined in (10).
15.
The Rényi differential entropy of order α ( 0 , 1 ) ( 1 , ) of an absolutely continuous random variable with probability density function f X is
$$h_\alpha(X) = \frac{1}{1-\alpha}\,\log\int_{-\infty}^{\infty} f_X^\alpha(t)\,\mathrm{d}t.$$
For Cauchy random variables, we obtain, with the aid of (A12),
$$h_\alpha(\lambda V + \mu) = \log|\lambda| + h_\alpha(V),$$
$$h_\alpha(V) = \frac{1-2\alpha}{2\,(1-\alpha)}\,\log\pi + \frac{1}{1-\alpha}\,\log\frac{\Gamma\big(\alpha-\frac{1}{2}\big)}{\Gamma(\alpha)}, \qquad \alpha > \frac{1}{2},$$
which is infinite for α ∈ (0, 1/2], converges to log(4π) (cf. (12)) as α → 1, and to log π, the logarithm of the reciprocal of the mode height, as α → ∞.
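A minimal numerical check of the Rényi differential entropy expressions above, for a few values of α (an illustration, not part of the paper):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def renyi_numeric(alpha):
    # Renyi entropy of order alpha by direct integration of f_V^alpha, in nats.
    integral, _ = quad(lambda t: (1.0 / (np.pi * (1.0 + t**2)))**alpha, -np.inf, np.inf)
    return np.log(integral) / (1.0 - alpha)

def renyi_closed(alpha):
    # Closed form above, valid for alpha > 1/2.
    return (1 - 2*alpha) / (2*(1 - alpha)) * np.log(np.pi) \
           + (gammaln(alpha - 0.5) - gammaln(alpha)) / (1 - alpha)

for alpha in (0.75, 2.0, 5.0):
    print(alpha, renyi_numeric(alpha), renyi_closed(alpha))
```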
16.
Invoking (A13), the Rényi differential entropy of order α 1 β ρ , 1 ( 1 , ) of the generalized Cauchy distribution (9) is
h α ( V β , ρ ) = α 1 α log κ β , ρ + 1 1 α log 2 Γ β α 1 ρ Γ 1 ρ ρ Γ ( β α ) .

4. The Shannon- and η -Transforms

In this section, we recall the definitions of two notions introduced in [10] for the unrelated purpose of expressing the asymptotic singular value distribution of large random matrices.
17.
The Shannon transform of a nonnegative random variable X is the function $\mathcal{V}_X : [0,\infty) \to [0,\infty)$, defined by
$$\mathcal{V}_X(\theta) = \mathbb{E}\big[\log_e(1+\theta X)\big].$$
Unless $\mathcal{V}_X(\theta) = \infty$ for all θ > 0 (e.g., if X has the log-Cauchy density $\frac{1}{\pi x}\,\frac{1}{1+\log_e^2 x}$, x > 0), or $\mathcal{V}_X(\theta) = 0$ for all θ ≥ 0 (which occurs if X = 0 a.s.), the Shannon transform is a strictly concave continuous function from $\mathcal{V}_X(0) = 0$, which grows without bound as θ → ∞.
18.
If V is standard Cauchy, then (A4) results in
$$\mathcal{V}_{V^2}(\theta^2) = 2\log_e\big(1+|\theta|\big),$$
and the handy relationship
$$\mathbb{E}\big[\log\big(\beta^2+\lambda^2 V^2\big)\big] = 2\log\big(|\beta|+|\lambda|\big).$$
19.
For the distribution in (9) with ( β , ρ ) = ( 2 , 2 ) , (A7) results in
$$\mathcal{V}_{V_{2,2}^2}(\theta^2) = 2\log_e\big(1+|\theta|\big) - \frac{2|\theta|}{1+|\theta|}.$$
20.
The η-transform η_X : [0, ∞) → (0, 1] of a non-negative random variable is defined as the function
$$\eta_X(\theta) = \mathbb{E}\Big[\frac{1}{1+\theta X}\Big] = 1 - \theta\,\dot{\mathcal{V}}_X(\theta),$$
which is intimately related to the Cauchy–Stieltjes transform [11]. For example,
$$\eta_{V^2}(\theta^2) = \frac{1}{1+|\theta|},$$
$$\eta_{V_{2,2}^2}(\theta^2) = \frac{1+2|\theta|}{(1+|\theta|)^2}.$$
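The Shannon- and η-transform formulas (28) and (32) are easy to confirm by Monte Carlo; a minimal sketch (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
v = np.tan(np.pi * (rng.uniform(size=2_000_000) - 0.5))   # standard Cauchy samples

for theta in (0.5, 1.0, 3.0):
    shannon = np.log1p(theta**2 * v**2).mean()             # V_{V^2}(theta^2), Eq. (28)
    eta = (1.0 / (1.0 + theta**2 * v**2)).mean()           # eta_{V^2}(theta^2), Eq. (32)
    print(f"theta={theta}:  Shannon {shannon:.4f} vs {2*np.log1p(theta):.4f}"
          f" | eta {eta:.4f} vs {1/(1+theta):.4f}")
```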

5. Strength

The purpose of this section is to introduce an attribute which is particularly useful to compare random variables that do not have finite moments.
21.
The strength ς(Z) ∈ [0, +∞] of a real-valued random variable Z is defined as
$$\varsigma(Z) = \inf\Big\{\varsigma > 0 : \mathbb{E}\Big[\log\Big(1+\frac{Z^2}{\varsigma^2}\Big)\Big] \le \log 4\Big\}.$$
It follows that the only random variable with zero strength is Z = 0, almost surely. If the inequality in (34) is not satisfied for any ς > 0, then ς(Z) = ∞. Otherwise, ς(Z) is the unique positive solution ς > 0 to
$$\mathbb{E}\Big[\log\Big(1+\frac{Z^2}{\varsigma^2}\Big)\Big] = \log 4.$$
If ς(Z) ≤ ς, then (35) holds with ≤.
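In practice, the strength of a distribution given through samples can be estimated by solving (35) with a scalar root-finder. The sketch below (an illustration based on the definition above, not code from the paper) recovers ς(V) = 1 for the standard Cauchy, |k|/√3 for a constant (Item 24), and the Gaussian value quoted in Item 28.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)

def strength(samples, lo=1e-6, hi=1e6):
    """Monte Carlo estimate of the strength: the root of E[log(1 + Z^2/s^2)] = log 4."""
    g = lambda s: np.log1p(samples**2 / s**2).mean() - np.log(4.0)
    return brentq(g, lo, hi)

n = 2_000_000
print("standard Cauchy  :", strength(np.tan(np.pi * (rng.uniform(size=n) - 0.5))))  # ~ 1
print("constant Z = 1   :", strength(np.ones(n)))                                   # ~ 1/sqrt(3) = 0.577
print("standard Gaussian:", strength(rng.standard_normal(n))**2)                    # squared: ~ 0.171 (Item 28)
```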
22.
The set of probability measures whose strength is upper bounded by a given finite nonnegative constant,
$$\mathcal{A}_\varsigma = \big\{P_Z : \varsigma(Z) \le \varsigma\big\},$$
is convex: The set A_0 is a singleton as seen in Item 21, while, for 0 < ς < ∞, we can express (36) as
$$\mathcal{A}_\varsigma = \Big\{P_Z : \mathbb{E}\Big[\log\Big(1+\frac{Z^2}{\varsigma^2}\Big)\Big] \le \log 4\Big\}.$$
Therefore, if P_{Z_0} ∈ A_ς and P_{Z_1} ∈ A_ς, we must have α P_{Z_1} + (1−α) P_{Z_0} ∈ A_ς.
23.
The peculiar constant in the definition of strength is chosen so that if V is standard Cauchy, then its strength is ς ( V ) = 1 because, in view of (29),
$$\mathbb{E}\big[\log\big(1+V^2\big)\big] = \log 4.$$
24.
If Z = k ∈ ℝ, a.s., then its strength is
$$\varsigma(Z) = \frac{|k|}{\sqrt{3}}.$$
25.
The left side of (35) is the Shannon transform of Z² evaluated at $\varsigma^{-2}$, which is continuous in $\varsigma^{-2}$. If ς(Z) ∈ (0, ∞), then (35) can be written as
$$\varsigma^2(Z) = \frac{1}{\mathcal{V}_{Z^2}^{-1}(\log_e 4)},$$
where, on the right side, we have denoted the functional inverse of the Shannon transform. Clearly, the square root of the right side of (40) cannot be expressed as the expectation with respect to Z of any b : ℝ → ℝ that does not depend on P_Z. Nevertheless, thanks to (37), (36) can be expressed as
$$\mathcal{A}_\varsigma = \big\{P_Z : \mathbb{E}\big[b_{\varsigma^2}(Z)\big] \le 1\big\}, \qquad\text{with}\quad b_{\varsigma^2}(x) = \log_4\Big(1+\frac{x^2}{\varsigma^2}\Big).$$
26.
Theorem 1.
The strength of a real-valued random variable satisfies the following properties:
(a)
$$\varsigma(\lambda Z) = |\lambda|\,\varsigma(Z).$$
(b)
$$\varsigma^2(Z) \le \tfrac{1}{3}\,\mathbb{E}[Z^2],$$
with equality if and only if |Z| is deterministic.
(c)
If 0 < q < 2 and $\|Z\|_q = \mathbb{E}^{\frac{1}{q}}[|Z|^q] < \infty$, then
$$\varsigma(Z) \le \kappa_q^{\frac{1}{q}}\,\|Z\|_q, \qquad\text{with}\quad \kappa_q = \max_{x>0}\,\frac{\log_4(1+x^2)}{x^q}.$$
(d)
If V is standard Cauchy, independent of X, then ς(X + V) is the solution to
$$\mathcal{V}_{X^2}\big((\varsigma+1)^{-2}\big) = 2\log\frac{2}{1+\varsigma^{-1}},$$
if it exists; otherwise, ς(X + V) = ∞. Moreover, ≤ holds in (45) if ς(X + V) ≤ ς.
(e)
$$2\log\big(2\min\{1,\varsigma(Z)\}\big) \le \mathbb{E}\big[\log\big(1+Z^2\big)\big] \le 2\log\big(2\max\{1,\varsigma(Z)\}\big).$$
(f)
If 0 < ς(Z) < ∞, then
$$h(Z) = \log\big(4\pi\,\varsigma(Z)\big) - D\big(Z\,\|\,\varsigma(Z)V\big),$$
where V is standard Cauchy, and D(X‖Y) stands for the relative entropy with reference probability measure P_Y and dominated measure P_X.
(g)
$$h(Z) < \infty \;\Longleftarrow\; \varsigma(Z) < \infty \;\Longleftrightarrow\; \mathbb{E}\big[\log\big(1+Z^2\big)\big] < \infty.$$
(h)
If V is standard Cauchy, then
$$\varsigma(Z) < \infty \ \text{and}\ h(Z) \in \mathbb{R} \;\Longleftrightarrow\; D(Z\|\lambda V) < \infty \ \text{for all}\ \lambda > 0.$$
(i)
The finiteness of strength is sufficient for the finiteness of the entropy of the integer part of the random variable, i.e.,
$$H(\lfloor Z \rfloor) = \infty \;\Longrightarrow\; \varsigma(Z) = \infty.$$
(j)
If Z_n → Z in L_q for any q ∈ (0, 1], then ς(Z_n) → ς(Z).
(k)
$$Z_n \to 0 \ \text{i.p.} \;\Longleftarrow\; \mathbb{E}\big[\log\big(1+Z_n^2\big)\big] \to 0 \;\Longleftrightarrow\; \varsigma(Z_n) \to 0.$$
(l)
If ς(X_n) → 0, then ς(Z + X_n) → ς(Z).
(m)
If ς(X_n) → 0, ς(Z) < ∞, and Z is independent of X_n, then h(Z + X_n) → h(Z).
Proof. 
For the first three properties, it is clear that they are satisfied if ς ( Z ) = 0 , i.e.,  Z = 0 almost surely.
(a)
If ς 2 ( 0 , ) is the solution to (35), then λ 2 ς 2 is a solution to (35) with λ Z taking the role of Z. If (35) has no solution, neither does its version in which λ Z takes the role of Z.
(b)
Jensen’s inequality applied to the left side of (35) results in 3 ς 2 E [ Z 2 ] . The strict concavity of log ( 1 + t ) implies that equality holds if and only if Z 2 is deterministic. If (35) has no solution, the same reasoning implies that E [ Z 2 ] = .
(c)
First, it is easy to check that, for q ( 0 , 2 ) , the function f q : ( 0 , ) ( 0 , ) given by f q ( t ) = t q log 4 ( 1 + t 2 ) attains its maximum κ q at a unique point. Assume ς ( Z ) ( 0 , ) . Since κ q t q log 4 ( 1 + t 2 ) for all t > 0 , letting t = | Z | / ς ( Z ) and taking expectations, (35) (choosing 4 as the logarithm base) results in
κ q ς q ( Z ) E | Z | q 1 ,
which is the same as (44). If ς ( Z ) = , then = E [ log ( 1 + Z 2 ) ] κ q E [ | Z | q ] .
(d)
Invoking (A4) with α² = ς² + x² and |sin β| = ς/√(x² + ς²), we obtain
$$\mathbb{E}\Big[\log\Big(1+\frac{(x+V)^2}{\varsigma^2}\Big)\Big] = \log\frac{(1+\varsigma)^2+x^2}{\varsigma^2}$$
$$= \log\Big(1+\frac{x^2}{(1+\varsigma)^2}\Big) - 2\log\frac{\varsigma}{\varsigma+1}.$$
Substituting x by X and averaging over X, the result follows from the definition of strength.
(e)
The result holds trivially if either ς(Z) = 0 or ς(Z) = ∞. Otherwise, we simply rewrite (35) as
$$2\log\big(2\,\varsigma(Z)\big) = \mathbb{E}\big[\log\big(\varsigma^2(Z)+Z^2\big)\big],$$
and upper/lower bound the right side by E log 1 + Z 2 .
(f)
$$D\big(Z\,\|\,\varsigma(Z)V\big) = -h(Z) + \log\big(\varsigma(Z)\,\pi\big) + \mathbb{E}\Big[\log\Big(1+\frac{Z^2}{\varsigma^2(Z)}\Big)\Big]$$
$$= \log\big(4\pi\,\varsigma(Z)\big) - h(Z),$$
where (55) and (56) follow from (2) and (35), respectively.
(g)
  • If ς ( Z ) < , then E log 1 + Z 2 < and h ( Z ) < follow from (46) and (47), respectively.
  • If E log ( 1 + Z 2 ) < , the dominated convergence theorem implies
    lim ς E log 1 + Z ς 2 = 0 .
    Excluding the case Z = 0 a.s. for which both E log ( 1 + Z 2 ) and ς ( Z ) are zero, we have
    lim ς 0 E log 1 + Z ς 2 = lim ς 0 V Z 2 1 ς 2 = .
    Since (35) is continuous in ς , it must have a finite solution in view of (57) and (58).
(h)
It is sufficient to assume λ = 1 for the condition on the right of (49) because the condition on the left holds if and only if it holds for α Z , for any α > 0 and D ( α Z α V ) = D ( Z V ) . If h ( Z ) < , then
D ( Z V ) = h ( Z ) + log π + E log 1 + Z 2 ,
which is finite unless either h ( Z ) = or E [ log ( 1 + Z 2 ) ] = . This establishes ⟹ in view of (48). To establish ⟸, it is enough to show that
D ( Z V ) < E log 1 + Z 2 < ,
in view of (48) and the fact that, according to (59), h ( Z ) > if both D ( Z V ) and E log 1 + Z 2 are finite. To show (60), we invoke the following variational representation of relative entropy (first noted by Kullback [12] for absolutely continuous random variables): If P Z P V , then
D ( Z V ) = max Q : Q P V E log d Q d P V ( Z ) ,
attained only at Q = P Z . Let Q be the absolutely continuous random variable with probability density function
q ( x ) = log e 2 4 | x | log e 2 | x | 1 { | x | 2 } + 1 8 1 { | x | < 2 } .
Then,
> D ( Z V ) > E log q ( Z ) f V ( Z ) = E 1 { | Z | 2 } log π log e 2 4 + log 1 + Z 2 log | Z | log e 2 | Z |
+ E 1 { | Z | < 2 } log π 8 + log 1 + Z 2
> 1 5 E 1 { Z 2 } log 1 + Z 2 + log 5 π 8
1 5 E log 1 + Z 2 + 4 5 log 5 log 8 π ,
where (65) holds since
4 5 log ( 1 + x 2 ) log ( π log e 2 ) + 2 log log e | x | + log | x | , | x | > 2 .
(i)
ς ( Z ) < E log ( 1 + Z 2 ) <
E log ( 1 + | Z | ) <
H ( Z ) < ,
where (68)–(70) follow from (48), log ( 1 + x 2 ) 2 log ( 1 + | x | ) , and p. 3743 in [13], respectively.
(j)
If ς ( Z ) = 0 , then Z = 0 a.e., and the result follows from (44). For all ( x , z ) R 2 ,
log e 1 + ( x + z ) 2 1 + z 2 log e 1 + 1 2 ( x 2 + | x | 4 + x 2 )
2 q | x | q ,
where (71) follows by maximizing the left side over z R . Denote the difference between the right side and the left side of (72) by f q ( x ) , an even function which satisfies f q ( 0 ) = 0 , and
f ˙ q ( x ) = 2 x q 1 2 4 + x 2 > 0 , x > 0 , 0 < q 1 .
Therefore, (72) follows. Assuming 0 < ς ( Z ) < , we have
E log ( 1 + Z n 2 ) E log ( 1 + Z 2 ) E log ( 1 + Z n 2 ) log ( 1 + Z 2 )
2 q E | Z n Z | q log e .
Now, because of the scaling property in (42), we may assume without loss of generality that ς ( Z ) = 1 . Thus, (74) and (75) result in
E log ( 1 + Z n 2 ) log 4 2 q E | Z n Z | q log e ,
which requires that ς ( Z n ) 1 , since, by assumption, the right side vanishes. Assume now that ς ( Z ) = , and therefore, E log ( 1 + Z 2 ) = . Inequality (75) remains valid in this case, implying that, as soon as the right side is finite (which it must be for all sufficiently large n), E log ( 1 + Z n 2 ) = , and therefore, ς ( Z n ) = in view of (48).
(k)
1st ⟸
     For any ϵ > 0 , Markov’s inequality results in
P [ | Z n | > ϵ ] = P log 1 + Z n 2 > log 1 + ϵ 2 E log 1 + Z n 2 log 1 + ϵ 2 .
First, we show that, for any α > 0 , we have
E log 1 + Z n 2 0 E log 1 + α Z n 2 0 .
The case 0 < α < 1 is trivial. The case α > 1 follows because E log 1 + Z n 2 0 implies
E log 1 + α Z n 2 = E log 1 + ( α 1 ) Z n 2 ,
where ≥ is obvious, and ≤ holds because
log 1 + α t 2 = log 1 + t 2 + log 1 + ( α 1 ) t 2 1 + t 2
log 1 + t 2 + log 1 + ( α 1 ) t 2 .
If ς ( Z n ) = infinitely often, so is E log 1 + Z n 2 in view of (48). Assume that lim sup ς ( Z n ) = ς ( 0 , ] , and ς ( Z n ) is finite for all sufficiently large. Then, there is a subsequence such that ς ( Z n i ) , and
log 4 = E log 1 + Z n i ς ( Z n i ) 2 E log 1 + Z n i λ 2 ,
for all sufficiently large i and λ < ς . Consequently, (78) implies that E log 1 + Z n 2 ¬ 0 .
2nd ⟸
   Suppose that E log 1 + Z n 2 0 . Therefore, there is a subsequence along which E log 1 + Z n i 2 > η > 0 . If η log 4 , then ς ( Z n i ) > 1 along the subsequence. Because of the continuity of the Shannon transform and the fact that it grows without bound as its argument goes to infinity (Item 25), if 0 < η < log 4 , we can find 1 < α < such that E log 1 + α Z n i 2 > log 4 , which implies ς ( Z n i ) > α 1 / 2 . Therefore, ς ( Z n ) 0 as we wanted to show.
(l)
We start by showing that
E log 1 + X n 2 0 E f ( X n ) 0 ,
where we have denoted the right side of (71) with arbitrary logarithm base by f ( x ) . Since f ˙ ( x ) = 2 log e 4 + x 2 , it is easy to verify that
0 f ( x ) log ( 1 + x 2 ) log 4 3 , x R ,
where the lower and upper bounds are attained uniquely at x = 0 and | x | = 1 2 , respectively. The lower bound results in ⟸ in (83). To show ⟹, decompose, for arbitrary ϵ > 0 ,
E f ( X n ) = E f ( X n ) 1 { | X n | < ϵ } + E f ( X n ) 1 { | X n | ϵ }
f ( ϵ ) + E f ( X n ) 1 { | X n | ϵ }
f ( ϵ ) + A ϵ E log 1 + X n 2 1 { | X n | ϵ }
f ( ϵ ) + A ϵ ϵ 3 ,
where
A ϵ = 1 + log 4 3 log ( 1 + ϵ 2 ) ,
(87) holds from the upper bound in (84), and the fact that (89) is decreasing in ϵ , and (88) holds for all sufficiently large n if E log 1 + X n 2 0 . Since the right side of (88) goes to 0 as ϵ 0 , (83) is established. Assume 0 < ς ( Z ) < . From the linearity property (42), we have ς ( Z + X n ) = ς ( Z ) · ς ( Z ¯ + X ¯ n ) with Z ¯ = ς 1 ( Z ) Z and X ¯ n = ς 1 ( Z ) X n which satisfies ς ( X ¯ n ) 0 . Therefore, we may restrict attention to ς ( Z ) = 1 without loss of generality. Following (71) and (74), and abbreviating Z n = Z + X n , we obtain
E log ( 1 + Z n 2 ) log 4 E log ( 1 + Z n 2 ) log ( 1 + Z 2 )
E f ( X n ) .
Thus, the desired result follows in view of (50) and (83). To handle the case ς ( Z ) = , we use the same reasoning as in the proof of (i) since (83) remains valid in that case.
(m)
If ς ( Z ) = 0 , then Z = 0 a.s., h ( Z ) = and h ( X n ) in view of Part (f). Assume henceforth that ς ( Z ) > 0 . Since h ( Z + X n ) h ( Z ) , it suffices to show
lim sup n h ( X n + Z ) h ( Z ) .
Under the assumptions, Part (l) guarantees that
ς ( X n + Z ) ς ( Z ) .
If V is a standard Cauchy random variable, then ς ( Z + X n ) V ς ( Z ) V in distribution as the characteristic function converges: e ς ( Z + X n ) | t | e ς ( Z ) | t | for all t. Analogously, according to Part (k), Z + X n D Z since X n 0 in probability. Since the strength of X n + Z is finite for all sufficiently large n, we may invoke (47) to express, for those n,
h ( X n + Z ) h ( Z ) = log ς ( Z + X n ) ς ( Z ) D ( Z + X n ς ( Z + X n ) V ) + D ( Z ς ( Z ) V ) .
The lower semicontinuity of relative entropy under weak convergence (which, in turn, is a corollary to the Donsker–Varadhan [14,15] variational representation of relative entropy) results in
lim inf n D ( Z + X n ς ( Z + X n ) V ) D ( Z ς ( Z ) V ) ,
because Z + X n D Z and ς ( Z + X n ) V D ς ( Z ) V . Therefore, (92) follows from (94) and (95).
   □
27.
In view of (42) and Item 23, ς ( λ V ) = | λ | if V is standard Cauchy. Furthermore, if X 1 and X 2 are centered independent Cauchy random variables, then their sum is centered Cauchy with
ς ( X 1 + X 2 ) = ς ( X 1 ) + ς ( X 2 ) .
More generally, it follows from Theorem 1-(d) that, if X_1 is centered Cauchy, and (96) holds for X_2 = αX and all α ∈ ℝ, then X must be centered Cauchy. Invoking (52), we obtain
$$\varsigma(\lambda V + \mu) = \frac{|\lambda|}{3} + \frac{1}{3}\sqrt{4\lambda^2 + 3\mu^2},$$
which is also valid for λ = 0 as we saw in Item 24.
28.
If X is standard Gaussian, then ς 2 ( X ) = 0.171085 , and ς 2 ( σ X ) = σ 2 ς 2 ( X ) . Therefore, if X 1 and X 2 are zero-mean independent Gaussian random variables, then
ς 2 ( X 1 + X 2 ) = ς 2 ( X 1 ) + ς 2 ( X 2 ) .
Thus, in this case, ς ( X 1 + X 2 ) < ς ( X 1 ) + ς ( X 2 ) .
29.
It follows from Theorem 1-(d) that, with X independent of standard Cauchy V, we obtain ς ( X + V ) > ς ( X ) + ς ( V ) whenever X is such that
$$\mathcal{V}_{X^2}\Big(\big(2+\varsigma(X)\big)^{-2}\Big) > 2\log_e\frac{2+2\,\varsigma(X)}{2+\varsigma(X)}.$$
An example is the heavy-tailed probability density function
$$f_X(x) = \frac{1}{\pi}\,\frac{\log_4(1+x^2)}{1+x^2},$$
for which 7.0158 = ς ( X + V ) > ς ( X ) + ς ( V ) = 6.8457 .
30.
Using (A8), we can verify that, if X is zero-mean uniform with variance σ 2 , then
$$\varsigma^2(X) = \frac{3}{c^2}\,\sigma^2 = 0.221618\ldots\,\sigma^2,$$
where c is the solution to $\log_e(1+c^2) + \frac{2}{c}\arctan(c) = 2 + \log_e 4$.
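The constant in Item 30 follows from a one-line root-finding computation; a minimal check (assuming the transcendental equation as stated above):

```python
import numpy as np
from scipy.optimize import brentq

# Item 30: solve log(1 + c^2) + (2/c) arctan(c) = 2 + log 4 and evaluate 3/c^2.
g = lambda c: np.log1p(c**2) + (2.0 / c) * np.arctan(c) - 2.0 - np.log(4.0)
c = brentq(g, 0.1, 100.0)
print("c =", c, "   3/c^2 =", 3.0 / c**2)   # expect 3/c^2 = 0.221618...
```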
31.
We say that Z n 0 in strength if ς ( Z n ) 0 . Parts (j) and (k) of Theorem 1 show that this convergence criterion is intermediate between the traditional in probability and L q criteria. It is not equivalent to either one: If
$$Z_n = \begin{cases} 0, & \text{with probability } 1-\frac{1}{n};\\ 2^n, & \text{with probability } \frac{1}{n},\end{cases}$$
then ς(Z_n) → 1, while Z_n → 0 in probability. If, instead, Z_n = (3/2)^n with probability 1/n, then Z_n → 0 in strength, but not in L_q for any q > 0.
32.
The assumption in Theorem 1-(m) that X n 0 in strength cannot be weakened to convergence in probability. Suppose that X n is absolutely continuous with probability density function
$$f_{X_n}(t) = \begin{cases} n-1, & t \in \big[0, \tfrac{1}{n}\big];\\[0.5ex] 0, & t \in (-\infty,0) \cup \big(\tfrac{1}{n}, 2\big);\\[0.5ex] \dfrac{\log_e 2}{n\, t \log_e^2 t}, & t \in [2,\infty).\end{cases}$$
We have X_n → 0 in probability since, regardless of how small ϵ > 0, P[X_n > ϵ] = 1/n for all n ≥ 1/ϵ. Furthermore,
$$h(X_n + Z) \ge h(X_n) = \infty,$$
because (103) is the mixture of a uniform and an infinite differential entropy probability density function, and differential entropy is concave. We conclude that h(X_n + Z) ↛ h(Z), since h(Z) < ∞.
33.
The following result on the continuity of differential entropy is shown in [16]: if X and Z are independent, E[|Z|] < ∞ and E[|X|] < ∞, then
lim ϵ 0 h ( ϵ X + Z ) = h ( Z ) .
This result is weaker than Theorem 1-(m) because finite first absolute moment implies finite strength as we saw in (44), and ϵ X 0 in L 1 if ϵ 0 , and therefore, it vanishes in strength too.
34.
If Z and V are centered and standard Cauchy, respectively, then min_λ D(Z‖λV) is achieved by λ = ς(Z). Otherwise, in general, this does not hold. Since $D(Z\|\lambda V) = \mathcal{V}_{Z^2}(\lambda^{-2}) - h(Z) + \log_e(\pi\lambda)$, the minimum is attained at the solution to
$$\eta_{Z^2}\Big(\frac{1}{\lambda^2}\Big) = \frac{1}{2},$$
where we have used the η-transform in (31). If Z = V_{2,2}, recalling (32), (106) results in λ = √2 − 1, while ς(V_{2,2}) = 0.302…
35.
Using (28) and the concavity of log ( 1 + x ) , we can verify that
ς ( X α ) α ς ( X 1 ) + ( 1 α ) ς ( X 0 ) , X α α P X 1 + ( 1 α ) P X 0 ,
if X 0 and X 1 are centered Cauchy, or, more generally, if X 0 = λ 0 X , X 1 = λ 1 X and V X 2 ( θ 2 ) is concave on θ . Not only is this property not satisfied if X = 1 but (107) need not hold in that case, as we can verify numerically for α = 0.1 , λ 1 = 1 and λ 0 > 20 .

6. Maximization of Differential Entropy

36.
Among random variables with a given second moment (resp. first absolute moment), differential entropy is maximized by the zero-mean Gaussian (resp. Laplace) distribution. More generally, among random variables with a given p-absolute moment μ , differential entropy is maximized by the parameter-p Subbotin (or generalized normal) distribution with p-absolute moment μ [17]
$$f_X(x) = \frac{p^{1-\frac{1}{p}}}{2\,\Gamma\big(\frac{1}{p}\big)\,\mu^{\frac{1}{p}}}\; e^{-\frac{|x|^p}{p\mu}}, \qquad x \in \mathbb{R}.$$
Among nonnegative random variables with a given mean, differential entropy is maximized by the exponential distribution. In those well-known solutions, the cost function is an affine function of the negative logarithm of the maximal differential entropy probability density function. Is there a cost function such that, among all random variables with a given expected cost, the Cauchy distribution is the maximal differential entropy solution? To answer this question, we adopt a more general viewpoint. Consider the following result, whose special case ρ = 2 was solved in [18] using convex optimization:
Theorem 2.
Fix ρ > 0 and θ > 0 .
$$\max_{Z:\ \mathbb{E}[\log_e(1+|Z|^\rho)]\,\le\,\theta} h(Z) = h(V_{\beta,\rho}),$$
where V_{β,ρ} is defined in (9), the right side of (109) is given in (22), and β > 1/ρ is the solution to
$$\theta = \psi(\beta) - \psi\Big(\beta - \frac{1}{\rho}\Big).$$
Therefore, the standard Cauchy distribution is the maximal differential entropy distribution provided that ρ = 2 and θ = log e 4 .
Proof. 
(a)
For every ρ > 0 and θ > 0 , there is a unique β > ρ 1 that satisfies (110) because the function of β on the right side is strictly monotonically decreasing, grows without bound as β 1 ρ , and goes to zero as β .
(b)
For any Z which satisfies E[log_e(1 + |Z|^ρ)] ≤ θ, its relative entropy, in nats, with respect to V_{β,ρ} is
$$D(Z\|V_{\beta,\rho}) = -h(Z) - \log_e \kappa_{\beta,\rho} + \beta\,\mathbb{E}\big[\log_e\big(1+|Z|^\rho\big)\big]$$
$$\le -h(Z) - \log_e \kappa_{\beta,\rho} + \beta\theta$$
$$= -h(Z) - \log_e \kappa_{\beta,\rho} + \beta\,\psi(\beta) - \beta\,\psi\Big(\beta-\frac{1}{\rho}\Big)$$
$$= h(V_{\beta,\rho}) - h(Z),$$
where (113) and (114) follow from (110) and (22), respectively. Since relative entropy is nonnegative, and zero only if both measures are identical, not only does (109) hold but any random variable other than V_{β,ρ} achieves strictly lower differential entropy.
   □
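Given ρ and θ, the β called for by Theorem 2 is the root of (110), which is straightforward to compute numerically; a minimal sketch (illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma

def beta_star(rho, theta):
    """Solve psi(beta) - psi(beta - 1/rho) = theta for beta > 1/rho (Eq. (110))."""
    g = lambda b: digamma(b) - digamma(b - 1.0 / rho) - theta
    return brentq(g, 1.0 / rho + 1e-9, 1e6)

# With rho = 2 and theta = log 4 the solution is beta = 1, i.e., the Cauchy density.
print(beta_star(2.0, np.log(4.0)))
```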
37.
An unfortunate consequence stemming from Theorem 2 is that, while we were able to find out a cost function such that the Cauchy distribution is the maximal differential entropy distribution under an average cost constraint, this holds only for a specific value of the constraint. This behavior is quite different from the classical cases discussed in Item 36 for which the solution is, modulo scale, the same regardless of the value of the cost constraint. As we see next, this deficiency is overcome by the notion of strength introduced in Section 5.
38.
Theorem 3.
Strength constraint.The differential entropy of a real-valued random variable with strength ς ( Z ) is upper bounded by
$$h(Z) \le \log\big(4\pi\,\varsigma(Z)\big).$$
If 0 < ς(Z) < ∞, equality holds if and only if Z has a centered Cauchy density, i.e., Z = λV for some λ > 0.
Proof. 
(a)
If Z is not an absolutely continuous random variable, or more generally, h(Z) = −∞, such as in the case ς(Z) = 0 in which Z = 0 with probability one, then (115) is trivially satisfied.
(b)
If 0 < ς(Z) < ∞ and h(Z) > −∞, then we invoke (47) to conclude that not only does (115) hold, but it is satisfied with equality if and only if Z = ς(Z)V.
   □
39.
The entropy power of a random variable Z is the variance of a Gaussian random variable whose differential entropy is h ( Z ) , i.e.,
$$N(Z) = \frac{1}{2\pi e}\,\exp\big(2h(Z)\big).$$
While the power of a Cauchy random variable is infinite, its entropy power is given by
$$N(\lambda V + \mu) = \frac{1}{2\pi e}\,\exp\big(2h(\lambda V + \mu)\big) = \frac{8\pi\lambda^2}{e}.$$
In the same spirit as the definition of entropy power, Theorem 3 suggests the definition of N C ( Z ) , the entropy strength of Z, as the strength of a centered Cauchy random variable whose differential entropy is h ( Z ) , i.e., h ( Z ) = h N C ( Z ) V . Therefore,
$$N_{\mathrm{C}}(Z) = \frac{1}{4\pi}\,\exp\big(h(Z)\big)$$
$$= \varsigma(Z)\,\exp\big(-D\big(Z\,\|\,\varsigma(Z)V\big)\big)$$
$$\le \varsigma(Z),$$
where (119) follows from (56), and (120) holds with equality if and only if Z is centered Cauchy. Note that, for all (α, μ) ∈ ℝ²,
N C ( α Z + μ ) = | α | N C ( Z ) .
Comparing (116) and (118), we see that entropy power is simply a scaled version of the entropy strength squared,
$$N(Z) = \frac{8\pi}{e}\,N_{\mathrm{C}}^2(Z).$$
The entropy power inequality (e.g., [19,20]) states that, if X 1 and X 2 are independent real-valued random variables, then
N ( X 1 + X 2 ) N ( X 1 ) + N ( X 2 ) ,
regardless of whether they have moments. According to (122), we may rewrite the entropy power inequality (123) replacing each entropy power by the corresponding squared entropy strength. Therefore, the squared entropy strength of the sum of independent random variables satisfies
N C 2 ( X 1 + X 2 ) N C 2 ( X 1 ) + N C 2 ( X 2 ) .
It is well-known that equality holds in (123), and hence (124), if and only if both random variables are Gaussian. Indeed, if X_1 and X_2 are centered Cauchy with respective strengths ς_1 > 0 and ς_2 > 0, then (124) becomes (ς_1 + ς_2)² > ς_1² + ς_2².
40.
Theorem 3 implies that any random variable with infinite differential entropy has infinite strength. There are indeed random variables with finite differential entropy and infinite strength. For example, let Z ∈ [2, ∞) be an absolutely continuous random variable with probability density function
$$f_Z(t) = \begin{cases} \dfrac{0.473991\ldots}{\log_e^2 n}, & t \in \big[n,\, n+\tfrac{1}{n}\big],\ n \in \{2,3,\ldots\};\\[1ex] 0, & \text{elsewhere}.\end{cases}$$
Then, h(Z) = 1.99258… nats, while the entropy of the quantized version as well as the strength satisfy H(⌊Z⌋) = ∞ = ς(Z).
41.
With the same approach, we may generalize Theorem 3 to encompass the full slew of the generalized Cauchy distributions in (9). To that end, fix ρ > 0 and define the ( ρ , θ ) -strength of a random variable as
$$\varsigma_{\rho,\theta}(Z) = \inf\Big\{\varsigma > 0 : \mathbb{E}\Big[\log_e\Big(1 + \Big|\frac{Z}{\varsigma}\Big|^\rho\Big)\Big] \le \theta\Big\}.$$
Therefore, ς_{ρ,θ}(Z) = ς(Z) for (ρ, θ) = (2, log_e 4), and if (β, ρ, θ) satisfy (110), then ς_{ρ,θ}(V_{β,ρ}) = 1. As in Item 25, if ς_{ρ,θ}(Z) ∈ (0, ∞), we have
$$\varsigma_{\rho,\theta}^{\rho}(Z) = \frac{1}{\mathcal{V}_{|Z|^\rho}^{-1}(\theta)}.$$
42.
Theorem 4.
Generalized strength constraint. Fix ρ > 0 and θ > 0 . The differential entropy of a real-valued random variable with ( ρ , θ ) -strength ς ρ , θ ( Z ) is upper bounded by
$$h(Z) \le \log \varsigma_{\rho,\theta}(Z) + h(V_{\beta,\rho}),$$
where β is given by the solution to (110), V_{β,ρ} has the generalized Cauchy density (9), and h(V_{β,ρ}) is given in (21). If ς_{ρ,θ}(Z) < ∞, equality holds if and only if Z is a constant times V_{β,ρ}.
Proof. 
As with Theorem 3, in the proof, we may assume 0 < ς ρ , θ ( Z ) < to avoid trivialities. Then,
E log e 1 + Z ς ρ , θ ( Z ) ρ = θ ,
and, in nats,
D Z ς ρ , θ ( Z ) V β , ρ = h ( Z ) log e κ β , ρ ς ρ , θ ( Z ) + β E log e 1 + Z ς ρ , θ ( Z ) ρ
= h ( Z ) log e κ β , ρ ς ρ , θ ( Z ) + β θ
= h ( Z ) log e κ β , ρ ς ρ , θ ( Z ) + β ψ ( β ) β ψ β 1 ρ
= h ( Z ) + log e ς ρ , θ ( Z ) + h ( V β , ρ ) ,
where (130), (131), (132), and (133) follow from (9), (129), (110), and (22), respectively.    □
43.
In the multivariate case, we may find a simple upper bound on differential entropy based on the strength of the norm of the random vector.
Theorem 5.
The differential entropy of a random vector Z n is upper bounded by
h ( Z n ) n log ς ( Z n ) + n + 1 2 log ( 4 π ) log Γ n + 1 2 .
Proof. 
As in the proof of Theorem 3, we may assume that 0 < ς ( Z n ) < . As usual, V n denotes the standard spherical multivariate Cauchy density in (6). Since for α 0 , f α V n ( x n ) = | α | n f V n ( α 1 x n ) , we have
D ( Z n ς ( Z n ) V n ) = h ( Z n ) E log f ς ( Z n ) V n ( Z n )
= h ( Z n ) + n log ς ( Z n ) log Γ n + 1 2 π n + 1 2 + n + 1 2 E log 1 + Z n 2 ς 2 ( Z n )
= h ( Z n ) + n log ς ( Z n ) log Γ n + 1 2 + n + 1 2 log ( 4 π ) ,
where (136) and (137) follow from (6) and the definition of strength, respectively.    □
For n = 1 , Theorem 5 becomes the bound in (115). For n = 2 , 3 , , the right side of (15) is greater than log e 4 , and, therefore, ς ( Z n ) > 1 . Consequently, in the multivariate case, there is no Z n such that (134) is tight.
44.
To obtain a full generalization of Theorem 3 in the multivariate case, it is advisable to define the strength of a random n-vector as
ς ( Z n ) = inf ς > 0 : E log f V n ς 1 Z n h V n
= ς 2 , θ n ( Z n )
for θ n = ψ n + 1 2 + γ + log e 4 . To verify (139), note (15)–(17). Notice that ς ( λ V n ) = | λ | and for n = 1 , (138) is equal to (34). The following result provides a maximal differential entropy characterization of the standard spherical multivariate Cauchy density.
Theorem 6.
Let V n have the standard multivariate Cauchy density (6), Then,
h ( Z n ) n log ς ( Z n ) + h ( V n ) ,
where h ( V n ) is given in (17). If 0 < ς ( Z n ) < , equality holds in (140) if and only if Z n = λ V n for some λ 0 .
Proof. 
Assume 0 < ς ( Z n ) < . Then,
D Z n ς ( Z n ) V n = h ( Z n ) + n log ς ( Z n ) E log f V n ς 1 ( Z n ) Z n
= h ( Z n ) + n log ς ( Z n ) + h ( V n )
in view of (138). Hence, the difference between right and left sides of (140) is equal to zero if and only if Z n = λ V n for some λ 0 ; otherwise, it is positive.    □

7. Relative Information

45.
For probability measures P and Q on the same measurable space (A, F), such that P ≪ Q, the logarithm of their Radon–Nikodym derivative is the relative information denoted by
$$\imath_{P\|Q}(x) = \log\frac{\mathrm{d}P}{\mathrm{d}Q}(x).$$
46.
As usual, we may employ the notation ı X Y ( x ) to denote ı P X P Y ( x ) . The distributions of the random variables ı X Y ( X ) and ı X Y ( Y ) are referred to as relative information spectra (e.g., [21]). It can be shown that there is a one-to-one correspondence between the cumulative distributions of ı X Y ( X ) and ı X Y ( Y ) . For example, if they are absolutely continuous random variables with respective probability density functions f X Y and f ¯ X Y , then
$$f_{X\|Y}(\alpha) = \exp(\alpha)\,\bar{f}_{X\|Y}(\alpha), \qquad \alpha \in \mathbb{R}.$$
Obviously, the distributions of ı_{X‖Y}(X) and dP_X/dP_Y(X) determine each other. One caveat is that relative information may take the value −∞. It can be shown that
$$\mathbb{P}\big[\imath_{X\|Y}(X) = -\infty\big] = 0,$$
$$\mathbb{P}\big[\imath_{X\|Y}(Y) = -\infty\big] = 1 - \mathbb{E}\big[\exp\big(-\imath_{X\|Y}(X)\big)\big].$$
47.
The information spectra determine all measures of the distance between the respective probability measures of interest (e.g., [22,23]), including f-divergences and Rényi divergences. For example, the relative entropy (or Kullback–Leibler divergence) of the dominated measure P with respect to the reference measure Q is the average of the relative information when the argument is distributed according to P, i.e., D(X‖Y) = E[ı_{X‖Y}(X)]. If P is not dominated by Q, then D(P‖Q) = ∞.
48.
The information spectra also determine the fundamental trade-off in hypothesis testing. Let α ν ( P 1 , P 0 ) denote the minimal probability of deciding P 0 when P 1 is true subject to the constraint that the probability of deciding P 1 when P 0 is true is no larger than ν . A consequence of the Neyman–Pearson lemma is
α ν ( P 1 , P 0 ) = min γ R P ı P 1 P 0 ( Y 1 ) γ exp ( γ ) ν P ı P 1 P 0 ( Y 0 ) > γ ,
where Y 0 P 0 and Y 1 P 1 .
49.
Cauchy distributions are absolutely continuous with respect to each other and, in view of (2),
$$\imath_{\lambda_1 V+\mu_1\,\|\,\lambda_0 V+\mu_0}(x) = \log\frac{|\lambda_1|}{|\lambda_0|} + \log\frac{(x-\mu_0)^2+\lambda_0^2}{(x-\mu_1)^2+\lambda_1^2}.$$
50.
The following result, proved in Item 58, shows that the relative information spectrum corresponding to Cauchy distributions with respective scale/locations ( λ 1 , μ 1 ) and ( λ 0 , μ 0 ) depends on the four parameters through the single scalar
$$\zeta(\lambda_1,\mu_1,\lambda_0,\mu_0) = \frac{\lambda_1^2+\lambda_0^2+(\mu_1-\mu_0)^2}{2\,|\lambda_0\lambda_1|} \ge 1,$$
where equality holds if and only if ( λ 1 , μ 1 ) = ( λ 0 , μ 0 ) .
Theorem 7.
Suppose that λ_1 λ_0 ≠ 0, and V is standard Cauchy. Denote
$$Z = \frac{\mathrm{d}P_{\lambda_1 V+\mu_1}}{\mathrm{d}P_{\lambda_0 V+\mu_0}}\big(\lambda_1 V+\mu_1\big).$$
Then,
(a) 
$$\mathbb{E}[Z] = \zeta(\lambda_1,\mu_1,\lambda_0,\mu_0),$$
(b) 
Z has the same distribution as the random variable
$$\zeta + \sqrt{\zeta^2-1}\,\cos\Theta,$$
where Θ is uniformly distributed on [−π, π] and ζ = ζ(λ_1, μ_1, λ_0, μ_0). Therefore, the probability density function of Z is
$$f_Z(z) = \frac{1}{\pi}\,\frac{1}{\sqrt{\zeta^2-1-(z-\zeta)^2}},$$
on the interval $0 < \zeta - \sqrt{\zeta^2-1} < z < \zeta + \sqrt{\zeta^2-1}$.
51.
The indefinite integral (e.g., see 2.261 in [24])
$$\int \frac{\mathrm{d}x}{\sqrt{2\zeta x - x^2 - 1}} = \arcsin\frac{x-\zeta}{\sqrt{\zeta^2-1}}$$
results, with X_i = λ_i V + μ_i, i = 0, 1, in
$$\mathbb{P}\big[\imath_{X_1\|X_0}(X_1) \le \log t\big] = \begin{cases} 1, & \zeta+\sqrt{\zeta^2-1} \le t;\\[0.7ex] \dfrac{1}{2} + \dfrac{1}{\pi}\arcsin\dfrac{t-\zeta}{\sqrt{\zeta^2-1}}, & \zeta-\sqrt{\zeta^2-1} < t < \zeta+\sqrt{\zeta^2-1};\\[0.7ex] 0, & 0 < t \le \zeta-\sqrt{\zeta^2-1}.\end{cases}$$
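Theorem 7 is easy to probe by simulation: sample X_1 = λ_1 V + μ_1, evaluate the Radon–Nikodym derivative in closed form, and compare with ζ and with the support predicted by the density above. A minimal sketch (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
lam1, mu1, lam0, mu0 = 1.0, 0.0, 2.0, 3.0
zeta = (lam1**2 + lam0**2 + (mu1 - mu0)**2) / (2 * abs(lam0 * lam1))

v = np.tan(np.pi * (rng.uniform(size=1_000_000) - 0.5))       # standard Cauchy
x1 = lam1 * v + mu1
# Z = dP_{lam1 V + mu1} / dP_{lam0 V + mu0} evaluated at X1 (ratio of the two Cauchy densities).
z = (abs(lam1) / abs(lam0)) * ((x1 - mu0)**2 + lam0**2) / ((x1 - mu1)**2 + lam1**2)

print("E[Z] ≈", z.mean(), "   zeta =", zeta)                   # Theorem 7(a)
print("support ≈ [", z.min(), ",", z.max(), "]   vs  [",
      zeta - np.sqrt(zeta**2 - 1), ",", zeta + np.sqrt(zeta**2 - 1), "]")
```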
52.
For future use, note that the endpoints of the support of (153) are their respective reciprocals. Furthermore,
$$f_Z\Big(\frac{1}{z}\Big) = z\,f_Z(z),$$
which implies
$$f_{1/Z}(z) = \frac{1}{z}\,f_Z(z).$$

8. Equivalent Pairs of Probability Measures

53.
Suppose that P_1 and Q_1 are probability measures on (A_1, F_1) such that P_1 ≪ Q_1, and P_2 and Q_2 are probability measures on (A_2, F_2) such that P_2 ≪ Q_2. We say that (P_1, Q_1) and (P_2, Q_2) are equivalent pairs, and write (P_1, Q_1) ≡ (P_2, Q_2), if the cumulative distribution functions of ı_{P_1‖Q_1}(X_1) and ı_{P_2‖Q_2}(X_2) are identical with X_1 ∼ P_1 and X_2 ∼ P_2. Naturally, ≡ is an equivalence relationship. Because of the one-to-one correspondence indicated in Item 46, the definition of equivalent pairs does not change if we require equality of the information spectra under the reference measure, i.e., that ı_{P_1‖Q_1}(Y_1) and ı_{P_2‖Q_2}(Y_2) be equally distributed with Y_1 ∼ Q_1 and Y_2 ∼ Q_2. Obviously, the requirement that the information spectra coincide is the same as requiring that the distributions of dP_1/dQ_1(Y_1) and dP_2/dQ_2(Y_2) are equal. As in Item 46, we also employ the notation (X_1, Y_1) ≡ (X_2, Y_2) to indicate (P_1, Q_1) ≡ (P_2, Q_2) if X_1 ∼ P_1, X_2 ∼ P_2, Y_1 ∼ Q_1, and Y_2 ∼ Q_2.
54.
Suppose that the output probability measures of a certain (random or deterministic) transformation are Q 0 and Q 1 when the input is distributed according to P 0 and P 1 , respectively. If ( P 0 , P 1 ) ( Q 0 , Q 1 ) , then the transformation is a sufficient statistic for deciding between P 0 and P 1 (i.e., the case of a binary parameter).
55.
If (A, F) is a measurable space on which the probability measures P_{X_1} ≪ P_{X_2} are defined, and ϕ : A → B is an (F, G)-measurable injective function, then P_{ϕ(X_1)} ≪ P_{ϕ(X_2)} are probability measures on (B, G) and
$$\imath_{X_1\|X_2}(x) = \imath_{\phi(X_1)\|\phi(X_2)}\big(\phi(x)\big).$$
Consequently, (X_1, X_2) ≡ (ϕ(X_1), ϕ(X_2)).
56.
The most important special case of Item 55 is an affine transformation of an arbitrary real-valued random variable X, which enables the reduction of four-parameter problems into two-parameter problems: for all (λ_2, μ_1, μ_2) ∈ ℝ³ and λ_1 ≠ 0,
$$(\lambda_1 X + \mu_1,\ \lambda_2 X + \mu_2) \equiv (X,\ \lambda X + \mu),$$
with
$$\lambda = \frac{\lambda_2}{\lambda_1} \qquad\text{and}\qquad \mu = \frac{\mu_2-\mu_1}{\lambda_1},$$
by choosing the affine function ϕ(x) = (x − μ_1)/λ_1.
57.
Theorem 8.
If X^n ∈ ℝ^n is an even random vector, i.e., P_{X^n} = P_{−X^n}, then
$$(X^n+\mu_1,\ X^n+\mu_2) \equiv (X^n+\mu_3,\ X^n+\mu_4),$$
whenever |μ_1 − μ_2| = |μ_3 − μ_4|.
Proof. 
(a)
If μ 1 μ 2 = μ 3 μ 4 , then (161) holds even if X n is not even because the function x μ is injective, in particular, with μ = μ 3 μ 1 = μ 4 μ 2 .
(b)
If μ 1 μ 2 = μ 4 μ 3 , then
( X n + μ 1 , X n + μ 2 ) ( X n , X n + μ 2 μ 1 )
( X n , X n + μ 3 μ 4 )
( X n + μ 3 μ 4 , X n )
( X n + μ 3 μ 4 , X n )
( X n + μ 3 , X n + μ 4 ) ,
where (162) and (166) follow from Part (a), (164) follows because x + μ 3 μ 4 is injective, and (165) holds because X n is even.
   □
58.
We now proceed to prove Theorem 7.
Proof. 
Since λV and −λV have identical distributions, we may assume for convenience that λ_1 > 0 and λ_0 > 0. Furthermore, capitalizing on Item 56, we may assume λ_1 = 1, μ_1 = 0, λ_0 = λ, and μ_0 = μ, and then recover the general result letting λ = λ_0/λ_1 and μ = (μ_0 − μ_1)/λ_1. Invoking (A9) and (A10), we have
$$\mathbb{E}\Big[\frac{\mathrm{d}P_V}{\mathrm{d}P_{\lambda V+\mu}}(V)\Big] = \frac{1}{\lambda}\,\mathbb{E}\Big[\frac{(V-\mu)^2+\lambda^2}{V^2+1}\Big]$$
$$= \frac{1}{\pi\lambda}\int_{-\infty}^{\infty}\frac{(t-\mu)^2+\lambda^2}{(t^2+1)^2}\,\mathrm{d}t$$
$$= \frac{\lambda^2+\mu^2+1}{2\lambda},$$
and we can verify that we recover (151) through the aforementioned substitution. Once we have obtained the expectation of Z = dP_V/dP_{λV+μ}(V), we proceed to determine its distribution. Denoting the right side of (169) by ζ, we have
$$Z - \mathbb{E}[Z] = \frac{1}{\lambda}\,\frac{\lambda^2+(V-\mu)^2}{1+V^2} - \zeta$$
$$= \frac{1}{2\lambda}\,\frac{(1-\lambda^2-\mu^2)(V^2-1) - 4\mu V}{1+V^2}$$
$$= \frac{1}{2\lambda}\Big((1-\lambda^2-\mu^2)\big(\sin^2\Theta - \cos^2\Theta\big) - 4\mu\sin\Theta\cos\Theta\Big)$$
$$= \frac{1}{2\lambda}\Big((\lambda^2+\mu^2-1)\cos 2\Theta - 2\mu\sin 2\Theta\Big)$$
$$= \frac{1}{2\lambda}\sqrt{(\lambda^2+\mu^2-1)^2 + 4\mu^2}\,\cos\big(2\Theta + \phi_{\lambda,\mu}\big)$$
$$= \sqrt{\zeta^2-1}\,\cos\big(2\Theta + \phi_{\lambda,\mu}\big),$$
where Θ is uniformly distributed on [−π, π]. We have substituted V = tan Θ (see Item 4) in (172), and invoked elementary trigonometric identities in (173) and (174). Since the phase in (175) does not affect it, the distribution of Z is indeed as claimed in (152), and (153) follows because the probability density function of cos Θ is
$$f_{\cos\Theta}(t) = \frac{1}{\pi}\,\frac{1}{\sqrt{1-t^2}}, \qquad |t| < 1.$$
   □
59.
In general, it need not hold that (X, Y) ≡ (Y, X)—for example, if X and Y are zero-mean Gaussian with different variances. However, the class of scalar Cauchy distributions does satisfy this property since the result of Theorem 7 is invariant to swapping λ_1 ↔ λ_0 and μ_1 ↔ μ_0. More generally, Theorem 7 implies that, if λ_1 λ_0 γ_1 γ_0 ≠ 0, then
$$(\lambda_1 V+\mu_1,\ \lambda_0 V+\mu_0) \equiv (\gamma_1 V+\nu_1,\ \gamma_0 V+\nu_0) \iff \frac{\lambda_1^2+\lambda_0^2+(\mu_1-\mu_0)^2}{|\lambda_0\lambda_1|} = \frac{\gamma_1^2+\gamma_0^2+(\nu_1-\nu_0)^2}{|\gamma_0\gamma_1|}.$$
Curiously, (177) implies that ( V , V + 1 ) ( V , 2 V + 1 ) .
60.
For location–dilation families of random variables, we saw in Item 56 how to reduce a four-parameter problem into a two-parameter problem since ( λ 1 V + μ 1 , λ 0 V + μ 0 ) ( V , λ V + μ ) with the appropriate substitution. In the Cauchy case, Theorem 7 reveals that, in fact, we can go one step further and turn it into a one-parameter problem. We have two basic ways of doing this:
(a)
(λ_1 V + μ_1, λ_0 V + μ_0) ≡ (V, V + μ) with μ² = 2ζ − 2.
(b)
(λ_1 V + μ_1, λ_0 V + μ_0) ≡ (V, λV) with either
$$\lambda = \zeta - \sqrt{\zeta^2-1} < 1, \qquad\text{or}\qquad \lambda = \zeta + \sqrt{\zeta^2-1} > 1,$$
which are the solutions to ζ = (λ² + 1)/(2λ).

9. f -Divergences

This section studies the interplay of f-divergences and equivalent pairs of measures.
61.
If P ≪ Q and f : [0, ∞) → ℝ is convex and right-continuous at 0, the f-divergence is defined as
$$D_f(P\|Q) = \mathbb{E}\Big[f\Big(\frac{\mathrm{d}P}{\mathrm{d}Q}(Y)\Big)\Big], \qquad Y \sim Q.$$
62.
The most important property of f-divergence is the data processing inequality
$$D_f(P_X\|Q_X) \ge D_f(P_Y\|Q_Y),$$
where P_Y and Q_Y are the responses of a (random or deterministic) transformation to P_X and Q_X, respectively. If f is strictly convex at 1 and D_f(P_X‖Q_X) < ∞, then (P_X, Q_X) ≡ (P_Y, Q_Y) is necessary and sufficient for equality in (180).
63.
If (P, Q) ≡ (Q, P), then $D_f(P\|Q) = D_{\bar f}(P\|Q)$ with the transform $\bar f(t) = t\,f\big(\tfrac{1}{t}\big)$, which satisfies $\bar{\bar f} = f$.
64.
Theorem 9.
If P_1 ≪ Q_1 and P_2 ≪ Q_2, then
$$(P_1,Q_1) \equiv (P_2,Q_2) \iff D_f(P_1\|Q_1) = D_f(P_2\|Q_2), \quad \forall f,$$
where ∀f stands for all convex right-continuous f : [0, ∞) → ℝ.
Proof. 
As mentioned in Item 53, ( P 1 , Q 1 ) ( P 2 , Q 2 ) is equivalent to d P 1 d Q 1 ( Y 1 ) and d P 2 d Q 2 ( Y 2 ) having identical distributions with Y 1 Q 1 and Y 2 Q 2 .
According to (179), D f ( P Q ) is determined by the distribution of the random variable d P d Q ( Y ) , Y Q .
For t R , the function f t ( x ) = e t x , x 0 , is convex and right-continuous at 0, and D f t ( P Q ) is the moment generating function, evaluated at t, of the random variable d P d Q ( Y ) , Y Q . Therefore, D f t ( P 1 Q 1 ) = D f t ( P 2 Q 2 ) for all t implies that ( P 1 , Q 1 ) ( P 2 , Q 2 ) .
   □
65.
Since P Q is not necessary in order to define (finite) D f ( P Q ) , it is possible to enlarge the scope of Theorem 9 by defining ( P 1 , Q 1 ) ( P 2 , Q 2 ) dropping the restriction that P 1 Q 1 and P 2 Q 2 . For that purpose, let μ 1 and μ 2 be σ -finite measures on ( A 1 , F 1 ) and ( A 2 , F 2 ) , respectively, and denote p i = d P i d μ i , q i = d Q i d μ i , i = 1 , 2 . Then, we say ( P 1 , Q 1 ) ( P 2 , Q 2 ) if
(a)
when restricted to [ 0 , 1 ] , the random variables p 1 ( Y 1 ) q 1 ( Y 1 ) and p 2 ( Y 2 ) q 2 ( Y 2 ) have identical distributions with Y 1 Q 1 and Y 2 Q 2 ;
(b)
when restricted to [ 0 , 1 ] , the random variables q 1 ( X 1 ) p 1 ( X 1 ) and q 2 ( X 2 ) p 2 ( X 2 ) have identical distributions with X 1 P 1 and X 2 P 2 .
Note that those conditions imply that
(c)
Q 1 ( { ω A 1 : p 1 ( ω ) = q 1 ( ω ) } ) = Q 2 ( { ω A 2 : p 2 ( ω ) = q 2 ( ω ) } ) ;
(d)
Q 1 ( { ω A 1 : p 1 ( ω ) = 0 } ) = Q 2 ( { ω A 2 : p 2 ( ω ) = 0 } ) ;
(e)
P 1 ( { ω A 1 : q 1 ( ω ) = 0 } ) = P 2 ( { ω A 2 : q 2 ( ω ) = 0 } ) .
For example, if P 1 Q 1 and P 2 Q 2 , then ( P 1 , Q 1 ) ( P 2 , Q 2 ) . To show the generalized version of Theorem 9, it is convenient to use the symmetrized form
D f ( P Q ) = 0 p < q q f p q d μ + 0 q < p p f q p d μ + f ( 1 ) Q [ p = q ] .
66.
Suppose that there is a class C of probability measures on a given measurable space with the property that there exists a convex function g : (0, ∞) → ℝ (right-continuous at 0) such that, if (P_1, Q_1) ∈ C² and (P_2, Q_2) ∈ C², then
$$D_g(P_1\|Q_1) = D_g(P_2\|Q_2) \implies (P_1,Q_1) \equiv (P_2,Q_2).$$
In such case, Theorem 9 indicates that C² can be partitioned into equivalence classes such that, within every equivalence class, the value of D_f(P‖Q) is constant, though naturally dependent on f. Throughout C², the value of D_g(P‖Q) determines the value of D_f(P‖Q), i.e., we can express D_f(P‖Q) = ϑ_{f,g}(D_g(P‖Q)), where ϑ_{f,g} is a non-decreasing function. Consider the following examples:
(a)
Let C be the class of real-valued Gaussian probability measures with given variance σ 2 > 0 . Then,
$$D\big(\mathcal{N}(\mu_1,\sigma^2)\,\big\|\,\mathcal{N}(\mu_2,\sigma^2)\big) = \frac{(\mu_1-\mu_2)^2}{2\sigma^2}\,\log e.$$
Since Theorem 8 implies that ( N μ 1 , σ 2 , N μ 2 , σ 2 ) ( N μ 3 , σ 2 , N μ 4 , σ 2 ) as long as ( μ 1 μ 2 ) 2 = ( μ 3 μ 4 ) 2 , (184) indicates that (183) is satisfied with g ( t ) given by the right-continuous extension of t log t . Therefore, we can conclude that, regardless of f, D f N μ 1 , σ 2 N μ 2 , σ 2 depends on ( μ 1 , μ 2 , σ 2 ) only through ( μ 1 μ 2 ) 2 / σ 2 .
(b)
Let C be the collection of all Cauchy random variables. Theorem 7 reveals that (183) is also satisfied if g(x) = x² because, if X ∼ P and Y ∼ Q, then
$$\mathbb{E}\Big[\frac{\mathrm{d}P}{\mathrm{d}Q}(X)\Big] = \mathbb{E}\Big[\Big(\frac{\mathrm{d}P}{\mathrm{d}Q}(Y)\Big)^{2}\Big].$$
67.
An immediate consequence of Theorems 7 and 9 is that, for any valid f, the f-divergence between Cauchy densities is symmetric,
D f ( λ 1 V + μ 1 λ 0 V + μ 0 ) = D f ( λ 0 V + μ 0 λ 1 V + μ 1 ) .
This property does not generalize to the multivariate case. While, in view of Theorem 8,
$$\big(\Lambda^{\frac12}V^n+\mu_1,\ \Lambda^{\frac12}V^n+\mu_2\big) \equiv \big(\Lambda^{\frac12}V^n+\mu_2,\ \Lambda^{\frac12}V^n+\mu_1\big),$$
in general, $\big(\Lambda^{\frac12}V^n, V^n\big) \not\equiv \big(V^n, \Lambda^{\frac12}V^n\big)$ since the corresponding relative entropies do not coincide as shown in [8].
68.
It follows from Item 66 and Theorem 7 that any f-divergence between Cauchy probability measures D f ( λ 1 V + μ 1 λ 0 V + μ 0 ) is a monotonically increasing function of ζ ( λ 1 , μ 1 , λ 0 , μ 0 ) given by (149). The following result shows how to obtain that function from f.
Theorem 10.
With f Z given in (153),
$$D_f(\lambda_1 V+\mu_1\,\|\,\lambda_0 V+\mu_0) = \int_{\zeta-\sqrt{\zeta^2-1}}^{\zeta+\sqrt{\zeta^2-1}} f\Big(\frac{1}{z}\Big)\,f_Z(z)\,\mathrm{d}z$$
$$= \mathbb{E}\Big[f\Big(\big(\zeta + \sqrt{\zeta^2-1}\,\cos\Theta\big)^{-1}\Big)\Big]$$
$$= \int_{\zeta-\sqrt{\zeta^2-1}}^{\zeta+\sqrt{\zeta^2-1}} \frac{1}{z}\,f(z)\,f_Z(z)\,\mathrm{d}z,$$
where Θ is uniformly distributed on [0, π] in (189).
Proof. 
In view of (179) and the definition of Z in Theorem 7,
D f ( λ 1 V + μ 1 λ 0 V + μ 0 ) = E f 1 Z ,
thereby justifying (188) and (189) since we saw in Theorem 7 that Z has the distribution of ζ + √ ( ζ 2 − 1 ) cos Θ with Θ uniformly distributed on [ 0 , π ] . Item 52 results in (190). Alternatively, we can rely on Item 63 and substitute f ( t ) by t f ( 1 / t ) on the right side of (188).    □
69.
Suppose now that we have two sequences of Cauchy measures with respective parameters ( λ 1 ( n ) , μ 1 ( n ) ) and ( λ 0 ( n ) , μ 0 ( n ) ) such that ζ ( λ 1 ( n ) , μ 1 ( n ) , λ 0 ( n ) , μ 0 ( n ) ) 1 . Then, Theorem 10 indicates that
D f λ 1 ( n ) V + μ 1 ( n ) λ 0 ( n ) V + μ 0 ( n ) f ( 1 ) .
The most common f-divergences are such that f ( 1 ) = 0 since in that case D f ( P Q ) ≥ 0 . In addition, adding the function α t − α to f ( t ) does not change the value of D f ( P Q ) and, with appropriately chosen α , we can turn f ( t ) into canonical form in which not only f ( 1 ) = 0 but f ( t ) ≥ 0 . In the special case in which the second measure is fixed, Theorem 9 in [25] shows that, if ess sup d P n d Q ( Y ) → 1 with Y ∼ Q , then
lim n D f ( P n Q ) D g ( P n Q ) = lim t 1 f ( t ) g ( t ) ,
provided the limit on the right side exists; otherwise, the left side lies between the left and right limits at 1. In the Cauchy case, we can allow the second probability to depend on n and sharpen that result by means of Theorem 10. In particular, it can be shown that
lim n D f λ 1 ( n ) V + μ 1 ( n ) λ 0 ( n ) V + μ 0 ( n ) D g λ 1 ( n ) V + μ 1 ( n ) λ 0 ( n ) V + μ 0 ( n ) = f ˙ ( 0 ) + f ˙ ( 0 + ) g ˙ ( 0 ) + g ˙ ( 0 + )
provided the right side is not 0 0 .

10. χ 2 -Divergence

70.
With either f ( x ) = ( x 1 ) 2 or f ( x ) = x 2 1 , f-divergence is the χ 2 -divergence,
χ 2 ( P Q ) = E d P d Q ( X ) 1 , X P .
71.
If P and Q are Cauchy distributions, then (149), (151) and (195) result in
χ 2 ( λ 1 V + μ 1 λ 0 V + μ 0 ) = ζ ( λ 1 , μ 1 , λ 0 , μ 0 ) 1
= ( | λ 0 | | λ 1 | ) 2 + ( μ 1 μ 0 ) 2 2 | λ 0 λ 1 | ,
a formula obtained in Appendix D of [26] using complex analysis and the Cauchy integral formula. In addition, invoking complex analysis and the maximal group invariant results in [27,28], ref. [26] shows that any f-divergence between Cauchy distributions can be expressed as a function of their χ 2 divergence, although [26] left open how to obtain that function, which is given by Theorem 10 substituting ζ = 1 + χ 2 .

11. Relative Entropy

72.
The relative entropy between Cauchy distributions is given by
D ( λ 1 V + μ 1 λ 0 V + μ 0 ) = log ( | λ 0 | + | λ 1 | ) 2 + ( μ 1 μ 0 ) 2 4 | λ 0 λ 1 | ,
where λ 1 λ 0 ≠ 0 . The special case λ 1 = λ 0 of (198) was found in Example 4 of [29]. The next four items give different simple justifications for (198). An alternative proof was recently given in Appendix C of [26] using complex analysis, holomorphisms and the Cauchy integral formula. Yet another, much more involved, proof is reported in [30]. See also Remark 19 in [26] for another route invoking the Lévy–Khintchine formula and the Frullani integral.
73.
Since for absolutely continuous random variables D ( X Y ) = h ( X ) E [ log f Y ( X ) ] ,
D V λ V + μ = h ( V ) + log π | λ | + E log λ 2 + ( V μ ) 2
= log ( 4 | λ | ) + log ( 1 + | λ | ) 2 + μ 2 ,
where (200) follows from (12) and (A4) with α 2 = λ 2 + μ 2 and cos β = μ | α | .
Now, substituting λ = λ 0 λ 1 and μ = μ 0 μ 1 λ 1 , we obtain (198) since, according to Item 56, ( V , λ V + μ ) ( λ 1 V + μ 1 , λ 0 V + μ 0 ) .
74.
From the formula found in Example 4 of [29] and the fact that, according to (197), χ 2 = μ 2 2 λ 2 when λ 1 = λ 0 = λ , we obtain
D ( λ V + μ λ V ) = log 1 + μ 2 4 λ 2 = log 1 + 1 2 χ 2 .
Moreover, as argued in Item 60, (201) is also valid for the relative entropy between Cauchy distributions with λ 1 λ 0 as long as χ 2 is given in (197). Indeed, we can verify that the right side of (201) becomes (198) with said substitution.
75.
By the definition of relative entropy, and Theorem 7,
D ( λ 1 V + μ 1 λ 0 V + μ 0 ) = E log Z
= 1 2 π 0 2 π log ζ + ζ 2 1 cos θ d θ
= log 1 + ζ 2 ,
where (204) follows from (A14). Then, (198) results by plugging into (204) the value of ζ in (149).
76.
Evaluating (190) with f ( t ) = t log t results in (202).
77.
If V is standard Cauchy, independent of Cauchy V 1 and V 0 , then (198) results in
D ( λ V + ϵ V 1 λ V + ϵ V 0 ) = ϵ 2 4 λ 2 ( λ 1 λ 0 ) 2 + ( μ 1 μ 0 ) 2 log e + o ( ϵ 2 ) ,
where V 1 = λ 1 V ′ + μ 1 and V 0 = λ 0 V ′ + μ 0 , and V ′ is an independent (or exact) copy of V. In contrast, the corresponding result in the Gaussian case in which X, X 1 , X 0 are independent Gaussian with means μ , μ 1 , μ 0 and variances σ 2 , σ 1 2 , σ 0 2 , respectively, is
D ( X + ϵ X 1 X + ϵ X 0 ) = ϵ 2 2 σ 2 ( μ 1 μ 0 ) 2 log e + o ( ϵ 2 ) .
In fact, it is shown in Lemma 1 of [31] that (206) holds even if X 1 and X 0 are not Gaussian but have finite variances. It is likely that (205) holds even if V 1 and V 0 are not Cauchy, but have finite strengths.
78.
An important information theoretic result due to Csiszár [32] is that if Q 1 Q 2 and P is such that
E ı Q 1 Q 2 ( X ) = D ( Q 1 Q 2 ) , X P ,
then the following Pythagorean identity holds
D ( P Q 2 ) = D ( P Q 1 ) + D ( Q 1 Q 2 ) .
Among other applications, this result leads to elegant proofs of minimum relative entropy results. For example, the closest Gaussian to a given P with a finite second moment has the same first and second moments as P. If we let Q 1 and Q 2 be centered Cauchy with strengths λ 1 and λ 2 , respectively, then the orthogonality condition (207) becomes, with the aid of (148) and (198),
V X 2 λ 2 1 V X 2 λ 1 1 = 2 log e 1 + λ 1 λ 2 2 log e 2 .
If, in addition, P is centered Cauchy, we can use (28) to verify that (209) holds only in the trivial cases in which either λ 1 = λ 2 or P = Q 1 . For non-Cauchy P, (208) may indeed be satisfied with λ 1 λ 2 . For example, using (30), if X = V 2 , 2 , then (209), and therefore (208), holds with ( λ 1 , λ 2 ) = ( 2 , 0.35459 ) .
79.
Mutually absolutely continuous random variables may be such that
D ( X Z ) < ∞ = D ( Z X ) .
An easy example is that of Gaussian X and Cauchy Z, or, if we let X be Cauchy, (210) holds with Z having the very heavy-tailed density function in (62).
80.
While relative entropy is lower semi-continuous, it is not continuous. For example, using the Cauchy distribution, we can show that relative entropy is not stable against small contamination of a Gaussian random variable: if X is Gaussian independent of V, then no matter how small λ > 0 ,
D ( λ | V | + X ∥ X ) = ∞ .

12. Total Variation Distance

81.
With f ( x ) = | x 1 | , f-divergence becomes the total variation distance (with range [0,2]). Moreover, we have the following representation:
Theorem 11.
If P Q and ( P , Q ) ( Q , P ) , then
1 2 | P Q | = 2 P [ Z > 1 ] − P [ Z ≠ 1 ] ,
with Z = d P d Q ( X ) , X P .
Proof. 
1 2 | P Q | = max A F P ( A ) Q ( A )
= P ω : d P d Q ( ω ) > 1 Q ω : d P d Q ( ω ) > 1
= P ω : d P d Q ( ω ) > 1 P ω : d Q d P ( ω ) > 1
= P [ Z > 1 ] P [ Z < 1 ]
where (215) and (216) follow from ( P , Q ) ( Q , P ) and P Q , respectively.    □
82.
Example 15 of [33] shows that the total variation distance between centered Cauchy distributions is
P λ 1 V P λ 0 V = 4 π arctan | | λ 1 | | λ 0 | | 2 | λ 0 λ 1 |
= 4 π arctan 1 2 χ 2 ( P λ 1 V P λ 0 V )
in view of (197). Since any f-divergence between Cauchy distributions depends on the parameters only through the corresponding χ 2 -divergence, (217)–(218) imply the general formula
P λ 1 V + μ 1 P λ 0 V + μ 0 = 4 π arctan 1 2 χ 2 ( P λ 1 V + μ 1 P λ 0 V + μ 0 ) .
Alternatively, applying Theorem 11 to the case of Cauchy random variables, note that, in this case, Z is an absolutely continuous random variable with density function (153). Therefore, P [ Z ≠ 1 ] = 1, and
P [ Z > 1 ] = 1 π 1 ζ + ζ 2 1 1 2 z ζ z 2 1 d z
= 1 2 + 1 π arctan 1 2 χ 2 ,
where (221) follows from (154) and the identity arcsin δ 1 + δ = arctan δ specialized to δ = 1 2 χ 2 = 1 2 ( ζ 1 ) . Though more laborious (see [26]), (219) can also be verified by direct integration.

13. Hellinger Divergence

83.
The Hellinger divergence, H α ( P Q ) , of order α ∈ ( 0 , 1 ) ∪ ( 1 , ∞ ) , is the f α -divergence with
f α ( t ) = t α 1 α 1 .
Notable special cases are
H 2 ( P Q ) = χ 2 ( P Q ) ,
lim α 1 H α ( P Q ) = D ( P Q ) ,
H 1 2 ( P Q ) = 2 ℋ 2 ( P Q ) ,
where ℋ 2 ( P Q ) denotes the squared Hellinger distance.
84.
For Cauchy random variables, Theorem 10 yields
H α ( λ 1 V + μ 1 λ 0 V + μ 0 ) = 1 α 1 E [ Z α ] 1
= P α 1 ( ζ ) 1 α 1 ,
where ζ is as given in (149), we have used (A15), and P α ( · ) denotes the Legendre function of the first kind, which satisfies P α = P − α − 1 (see 8.2.1 in [34]).

14. Rényi Divergence

85.
For absolutely continuous probability measures P and Q, with corresponding probability density functions p and q, the Rényi divergence of order α ∈ [ 0 , 1 ) ∪ ( 1 , ∞ ) is [35]
D α ( P Q ) = 1 α 1 log p α ( t ) q 1 α ( t ) d t .
Note that, if ( P 1 , Q 1 ) ( P 2 , Q 2 ) , then D α ( P 1 Q 1 ) = D α ( P 2 Q 2 ) . Moreover, although Rényi divergence of order α is not an f-divergence, it is in one-to-one correspondence with the Hellinger divergence of order α :
D α ( P Q ) = 1 α 1 log 1 + ( α 1 ) H α ( P Q ) .
86.
An extensive table of order- α Rényi divergences for various continuous random variables can be found in [36]. An addition to that list for Cauchy random variables can be obtained plugging (227) into (229):
D α ( λ 1 V + μ 1 λ 0 V + μ 0 ) = log P α 1 ( ζ ) α 1
= 1 α 1 log P α 1 λ 1 2 + λ 0 2 + ( μ 1 μ 0 ) 2 2 | λ 0 λ 1 | ,
for α ∈ ( 0 , 1 ) ∪ ( 1 , ∞ ) .
87.
Suppose that λ ( 0 , 1 ) . Then, (A16) yields
D 1 2 ( V λ V ) = 2 log 2 λ π K 1 λ 2 ,
where K ( · ) stands for the complete elliptic integral of the first kind in (A18). As indicated in Item 60, to obtain D 1 2 ( λ 1 V + μ 1 λ 0 V + μ 0 ) , we just need to substitute λ by ζ − √ ( ζ 2 − 1 ) in (232), with ζ given by (149).
88.
Notice that, specializing (86) to ( α , μ 0 , μ 1 , λ 0 , λ 1 ) = ( 1 2 , 0 , 0 , λ , 1 ) , (232) results in the identity
P 1 2 1 2 λ + λ 2 = 2 λ π K 1 λ 2 , λ ( 0 , 1 ) .
Writing the complete elliptic integral of the first kind and the Legendre function of the first kind as special cases of the Gauss hypergeometric function, González [37] noticed the simpler identity (see also 8.13.8 in [34])
P 1 2 λ = 2 π K 1 λ 2 , λ ( 0 , 1 ) .
We can view (233) and (234) as complementary of each other since they constrain the argument of the Legendre function to belong to ( 1 , ) and ( 0 , 1 ) , respectively.
89.
Since P 1 ( z ) = z , particularizing (230), we obtain
D 2 ( λ 1 V + μ 1 λ 0 V + μ 0 ) = log ζ = log λ 1 2 + λ 0 2 + ( μ 1 μ 0 ) 2 2 | λ 0 λ 1 | .
90.
Since P 2 ( z ) = 1 2 ( 3 z 2 1 ) , for Cauchy random variables, we obtain
D 3 ( P Q ) = 1 2 log 1 + 3 χ 2 ( P Q ) + 3 2 χ 4 ( P Q ) .
91.
For Cauchy random variables, the Rényi divergence for integer order 4 or higher can be obtained through (235), (236) and the recursion (dropping ( P Q ) for typographical convenience)
( n + 1 ) exp ( n + 1 ) D n + 2 = ( 2 n + 1 ) ζ exp n D n + 1 n exp ( n 1 ) D n ,
which follows from (230) and the recursion of the Legendre polynomials
( n + 1 ) P n + 1 ( z ) = ( 2 n + 1 ) z P n ( z ) n P n 1 ( z ) ,
which, in fact, also holds for non-integer n (see 8.5.3 in [34]).
92.
The Chernoff information
C ( P Q ) = sup λ ( 0 , 1 ) ( 1 λ ) D λ ( P Q )
satisfies C ( P Q ) = C ( Q P ) regardless of ( P , Q ) . If, as in the case of Cauchy measures, ( P , Q ) ( Q , P ) , then Chernoff information is equal to the Bhattacharyya distance:
C ( P Q ) = 1 2 D 1 2 ( P Q ) = − log ∫ √ ( p ( t ) q ( t ) ) d t = − log ( 1 − ℋ 2 ( P Q ) ) ,
where ℋ 2 ( P Q ) is the squared Hellinger distance, which is the f-divergence with f ( t ) = 1 2 ( 1 − √ t ) 2 . Together with Item 87, (240) gives the Chernoff information for Cauchy distributions. While it involves the complete elliptic integral function, its simplicity should be contrasted with the formidable expression for Gaussian distributions, recently derived in [38]. The reason (240) holds is that the supremum in (239) is achieved at λ = 1 2 . To see this, note that
f ( λ ) = ( 1 λ ) D λ ( P Q ) = λ D 1 λ ( Q P )
= λ D 1 λ ( P Q )
= f ( 1 λ ) ,
where (241) reflects the skew-symmetry of Rényi divergence, and (242) holds because ( P , Q ) ( Q , P ) . Since f ( λ ) : λ [ 0 , 1 ] is concave and its own mirror image, it is maximized at λ = 1 2 .

15. Fisher’s Information

93.
The score function of the standard Cauchy density (1) is
ρ V ( x ) = ∇ log e f V ( x ) = − ∇ log e ( 1 + x 2 ) = − 2 x 1 + x 2 .
Then, ρ V ( V ) is a zero-mean random variable with second moment equal to Fisher’s information
J ( V ) = E ρ V 2 ( V ) = 1 π 4 t 2 ( 1 + t 2 ) 3 d t = 1 2 ,
where we have used (A11). Since Fisher’s information is invariant to location and scales as J ( X ) = α 2 J ( α X ) , we obtain
J ( λ V + μ ) = 1 2 λ 2 .
Together with (117), the product of entropy power and Fisher information is 4 π / e , thereby abiding by Stam’s inequality [4], 1 ≤ N ( X ) J ( X ) .
94.
Introduced in [39], Fisher’s information of a density function (245) quantifies its similarity with a slightly shifted version of itself. A more general notion is the Fisher information matrix of a random transformation P Y | X : R k Y satisfying the regularity condition
D ( P Y | X = α P Y | X = θ ) = o ( α θ ) .
Then, the Fisher information matrix of P Y | X at θ has coefficients
J i j ( θ , P Y | X ) = E α i ı P Y | X = α P Y | X = θ ( Y θ ) α j ı P Y | X = α P Y | X = θ ( Y θ ) | α θ ,
and satisfies (with relative entropy in nats)
D ( P Y | X = α P Y | X = θ ) = 1 2 ( α θ ) J ( θ , P Y | X ) ( α θ ) + o ( α θ 2 ) .
For the Cauchy family, the parametrization vector has two components, location and strength, namely, θ = ( μ , λ ) . The regularity condition (247) is satisfied in view of (205), and we can use the closed-form expression in (205) to obtain
J 11 ( θ , P Y | X ) = J 22 ( θ , P Y | X ) = 1 2 λ 2 ,
J 12 ( θ , P Y | X ) = J 21 ( θ , P Y | X ) = 0 .
95.
The relative Fisher information is defined as
J ( P Q ) = E [ ( ∇ ı P Q ( X ) ) 2 ] , X ∼ P .
Although the purpose of this definition is to avoid some of the pitfalls of the classical definition of Fisher’s information, not only do equivalent pairs fail to have the same relative Fisher information but, unlike relative entropy or f-divergence, relative Fisher information is not transparent to injective transformations. For example, J ( X Y ) = λ 2 J ( λ X λ Y ) . Centered Cauchy random variables illustrate this fact since
J ( V λ V ) = ( 4 + λ ) ( λ 1 ) 2 2 λ ( 1 + λ ) 2 and J ( λ V V ) = ( 4 λ + 1 ) ( λ 1 ) 2 2 λ 2 ( 1 + λ ) 2 .
96.
de Bruijn’s identity [4] states that, if N N 0 , 1 is independent of X, then, in nats,
d d t h X + t N = 1 2 J X + t N , t > 0 .
As well as serving as the key component in the original proofs of the entropy power inequality, the differential equation in (254) provides a concrete link between Shannon theory and its prehistory. As we show in Theorem 12, it turns out that there is a Cauchy counterpart of de Bruijn’s identity (254). Before stating the result, we introduce the following notation for a parametrized random variable Y t (to be specified later):
∇ log e f Y t ( y ) = ( ∂ / ∂ y ) log e f Y t ( y ) = f Y t − 1 ( y ) ( ∂ / ∂ y ) f Y t ( y ) ,
∇ 2 log e f Y t ( y ) = ( ∂ / ∂ t ) log e f Y t ( y ) = f Y t − 1 ( y ) ( ∂ / ∂ t ) f Y t ( y ) ,
J ( Y t ) = E [ ( ∇ log e f Y t ( Y t ) ) 2 ] ,
K ( Y t ) = E [ ( ∇ 2 log e f Y t ( Y t ) ) 2 ] ,
i.e., J ( Y t ) and K ( Y t ) are the Fisher information with respect to location and with respect to dilation, respectively (corresponding to the coefficients J 11 and J 22 of the Fisher information matrix when θ = ( μ , λ ) as in Item 94). The key to (254) is that Y t = X + t N , N N 0 , 1 satisfies the partial differential equation
2 y 2 f Y t ( y ) = t f Y t ( y ) .
Theorem 12.
Suppose that X is independent of standard Cauchy V. Then, in nats,
d 2 d t 2 h X + t V = J X + t V K X + t V , t > 0 .
Proof. 
Equation (259) does not hold in the current case in which Y t = X + t V , and
f Y t ( y ) = t π E 1 t 2 + ( X y ) 2 .
However, some algebra (the differentiation/integration swaps can be justified invoking the bounded convergence theorem) indicates that the convolution with the Cauchy density satisfies the Laplace partial differential equation
2 y 2 f Y t ( y ) = 2 t 2 f Y t ( y ) = 2 t π E 3 ( X y ) 2 t 2 ( t 2 + ( X y ) 2 ) 3 .
The derivative of the differential entropy of Y t is, in nats,
d d t h Y t = t f Y t ( y ) d y log e f Y t ( y ) t f Y t ( y ) d y
= t f Y t ( y ) d y log e f Y t ( y ) t f Y t ( y ) d y .
Taking another derivative, the left side of (260) becomes
d 2 d t 2 h Y t = 2 t 2 f Y t ( y ) log e f Y t ( y ) d y t f Y t ( y ) t log e f Y t ( y ) d y
= 2 y 2 f Y t ( y ) log e f Y t ( y ) d y f Y t 1 ( y ) t f Y t ( y ) 2 d y
= 2 y 2 f Y t ( y ) log e f Y t ( y ) d y K ( Y t )
= J ( Y t ) K ( Y t ) ,
where
  • (265) ⟸ the first term on the right side of (264) is zero;
  • (266) ⟸ (262);
  • (267) ⟸ (258);
  • (268) ⟸ integration by parts, exactly as in [4] (or p. 673 of [19]).
       □
97.
Theorem 12 reveals that the increasing function f X ( t ) = h X + t V is concave (which does not follow from the concavity of differential entropy functional of the density). In contrast, it was shown by Costa [40] that the entropy power N X + t N , with N N 0 , 1 is concave in t.

16. Mutual Information

98.
Most of this section is devoted to an additive noise model. We begin with the simplest case in which X C is centered Cauchy independent of W C , also centered Cauchy with ς ( W C ) > 0 . Then, (11) yields
I ( X C ; X C + W C ) = h ( X C + W C ) h ( W C )
= log 4 π ( ς ( X C ) + ς ( W C ) ) log 4 π ς ( W C )
= log 1 + ς ( X C ) ς ( W C ) ,
thereby establishing a pleasing parallelism with Shannon’s formula [1] for the mutual information between a Gaussian random variable and its sum with an independent Gaussian random variable. Aside from a factor of 1 2 , in the Cauchy case, the role of the variance is taken by the strength. Incidentally, as shown in [2], if N is standard exponential on ( 0 , ) , an independent X on [ 0 , ) can be found so that X + N is exponential, in which case the formula (271) also applies because the ratio of strengths of exponentials is equal to the ratio of their means. More generally, if input and noise are independent non-centered Cauchy, their locations do not affect the mutual information, but they do affect their strengths, so, in that case, (271) holds provided that the strengths are evaluated for the centered versions of the Cauchy random variables.
99.
It is instructive, as well as useful in the sequel, to obtain (271) through a more circuitous route. Since Y C = X C + W C is centered Cauchy with strength ς ( Y C ) = ς ( X C ) + ς ( W C ) , the information density (e.g., [41]) is defined as
ı X C ; Y C ( x ; y ) = log d P X C Y C d ( P X C × P Y C ) ( x , y )
= log f Y C | X C ( y | x ) f Y C ( y )
= log ς ( Y C ) ς ( W C ) + log 1 + y 2 ς 2 ( Y C ) log 1 + ( y x ) 2 ς 2 ( W C ) .
Averaging with respect to ( X C , Y C ) = ( X C , X C + W C ) , we obtain
I ( X C ; Y C ) = E ı X C ; Y C ( X C ; Y C )
= log ς ( Y C ) ς ( W C ) + log 4 log 4 = log 1 + ς ( X C ) ς ( W C ) .
100.
If the strengths of output Y = X + N and independent noise N are finite and their differential entropies are not , we can obtain a general representation of the mutual information without requiring that either input or noise be Cauchy. Invoking (56) and I ( X ; X + N ) = h ( X + N ) h ( N ) , we have
I ( X ; Y ) = log N C ( Y ) N C ( N )
= log ς ( Y ) ς ( N ) + D ( N ς ( N ) V ) D ( Y ς ( Y ) V ) ,
since, as we saw in (49), the finiteness of the strengths guarantees the finiteness of the relative entropies in (278). We can readily verify the alternative representation in which strength is replaced by standard deviation, and the standard Cauchy V is replaced by standard normal W:
I ( X ; Y ) = 1 2 log N ( Y ) N ( N )
= log σ ( Y ) σ ( N ) + D ( N σ ( N ) W ) D ( Y σ ( Y ) W ) .
A byproduct of (278) is the upper bound
I ( X ; Y ) log ς ( Y ) N C ( N )
= log ς ( Y ) ς ( N ) + D ( N ς ( N ) V ) ,
where (281) follows from N C ( Y ) ς ( Y ) , and (282) follows by dropping the last term on the right side of (278). Note that (281) is the counterpart of the upper bound given by Shannon [1] in which the standard deviation of Y takes the place of the strength in the numerator, and the square root of the noise entropy power takes the place of the entropy strength in the denominator. Shannon gave his bound three years before Kullback and Leibler introduced relative entropy in [42]. The counterpart of (282) with analogous substitutions of strengths by standard deviations was given by Pinsker [43], and by Ihara [44] for continuous-time processes.
101.
We proceed to investigate the maximal mutual information between the (possibly non-Cauchy) input and its additive Cauchy-noise contaminated version.
Theorem 13.
Maximal mutual information: output strength constraint. For any η ς ( W C ) > 0 ,
max X : ς ( X + W C ) η I ( X ; X + W C ) = log η ς ( W C ) ,
where W C is centered Cauchy independent of X. The maximum in (283) is attained uniquely by the centered Cauchy distribution with strength η ς ( W C ) .
Proof. 
For centered Cauchy noise, the upper bound in (282) simplifies to
I ( X ; X + W C ) log ς ( X + W C ) ς ( W C ) ,
which shows ≤ in (283). If the input is centered Cauchy X C with strength η ς ( W C ) , then ς ( X C + W C ) = η , and I ( X C ; X C + W C ) is equal to the right side in view of (271).
   □
102.
In the information theory literature, the maximization of mutual information over the input distribution is usually carried out under a constraint on the average cost E [ b ( X ) ] for some real-valued function b . Before we investigate whether the optimization in (283) can be cast into that conventional paradigm, it is instructive to realize that the maximization of mutual information in the case of input-independent additive Gaussian noise can be viewed as one in which we allow any input such that the output variance is constrained, and because the output variance is the sum of input and noise variances that the familiar optimization over variance constrained inputs obtains. Likewise, in the case of additive exponential noise and random variables taking nonnegative values, if we constrain the output mean, automatically we are constraining the input mean. In contrast, the output strength is not equal to the sum of Cauchy noise strength and the input strength, unless the input is Cauchy. Indeed, as we saw in Theorem 1-(d), the output strength depends not only on the input strength but on the shape of its probability density function. Since the noise is Cauchy, (45) yields
ς ( X + W C ) η ς 2 , θ ( X ) ς ( W C ) + η , with θ = 2 log 2 η η + ς ( W C )
E log ς ( W C ) + η + X 2 2 log 2 η ,
which is the same input constraint found in [45] (see also Lemma 6 in [46] and Section V in [47]) in which η affects not only the allowed expected cost but the definition of the cost function itself. If X is centered Cauchy with strength η ς ( W C ) , then (286) is satisfied with equality, in keeping with the fact that that input achieves the maximum in (283). Any alternative input with the same strength that produces output strength lower than or equal to η can only result in lower mutual information. However, as we saw in Item 29, we can indeed find input distributions with strength η ς ( W C ) that can produce output strength higher than η . Can any of those input distributions achieve I ( X ; Y ) > log η ς ( W C ) ? The answer is affirmative. If we let X = V β , 2 , defined in (9), we can verify numerically that, for β [ 0.8 , 1 ) ,
I ( X ; X + V ) > log ς ( X ) + 1 .
We conclude that, at least for θ / ς ( W C ) ∈ ( 1 , ς ( V 0.8 , 2 ) ) = ( 1 , 3.126 ) , the capacity–input–strength function satisfies
C ( θ ) = max X : ς ( X ) θ I ( X ; X + W C ) > log 1 + θ ς ( W C ) .
103.
Although not always acknowledged, the key step in the maximization of mutual information over the input distribution for a given random transformation is to identify the optimal output distribution. The results in Items 101 and 102 point out that it is mathematically more natural to impose constraints on the attributes of the observed noisy signal than on the transmitted noiseless signal. In the usual framework of power constraints, both formulations are equivalent as an increase in the gain of the receiver antenna (or a decrease in the front-end amplifier thermal noise) of κ dB has the same effect as an increase of κ dB in the gain of the transmitter antenna (or increase in the output power of the transmitted amplifier). When, as in the case of strength, both formulations lead to different solutions, it is worthwhile to recognize that what we usually view as transmitter/encoder constraints also involve receiver features.
104.
Consider a multiaccess channel Y i = X 1 i + X 2 i + W i , where W i is a sequence of strength ς ( W ) independent centered Cauchy random variables. While the capacity region is unknown if we place individual cost or strength constraints on the transmitters, it is easily solvable if we impose an output strength constraint. In that case, the capacity region is the triangle
C η = ( R 1 , R 2 ) [ 0 , ) 2 : R 1 + R 2 log η ς ( W ) ,
where η > ς ( W ) is the output strength constraint. To see this, note (a) the corner points are achievable thanks to Theorem 13; (b) if the transmitters are synchronous, a time-sharing strategy with Cauchy distributed inputs satisfies the output strength constraint in view of (107); (c) replacing the independent encoders by a single encoder which encodes both messages would not be able to achieve higher rate sum. It is also possible to achieve (289) using the successive decoding strategy invented by Cover [48] and Wyner [49] for the Gaussian multiple-access channel: fix α ( 0 , 1 ) ; to achieve R 1 = α log η ς ( W ) and R 2 = ( 1 α ) log η ς ( W ) , we let the transmitters use random coding with sequences of independent Cauchy random variables with respective strengths
ς 1 = η ς α ( W ) η 1 α > 0 ,
ς 2 = ς α ( W ) η 1 α ς ( W ) > 0 ,
which abide by the output strength constraint since ς 1 + ς 2 + ς ( W ) = η , and
R 1 = log 1 + ς 1 ς 2 + ς ( W ) ,
R 2 = log 1 + ς 2 ς ( W ) ,
a rate-pair which is achievable by successive decoding by using a single-user decoder for user 1, which treats the codeword transmitted by user 2 as noise; upon decoding the message of user 1, it is re-encoded and subtracted from the received signal, thereby presenting a single-user decoder for user 2 with a signal devoid of any trace of user 1 (with high probability).
105.
The capacity per unit energy of the additive Cauchy-noise channel Y i = X i + λ V i , where { V i } is an independent sequence of standard Cauchy random variables, was shown in [29] to be equal to ( 4 λ 2 ) 1 log e , even though the capacity-cost function of such a channel is unknown. A corollary to Theorem 13 is that the capacity per unit output strength of the same channel is
C O = 1 λ max η λ λ η log η λ = log e λ e .
By only considering Cauchy distributed inputs, the capacity per unit input strength is lower bounded by
C I max γ > 0 1 γ log 1 + γ λ = log e λ
but is otherwise unknown as it is not encompassed by the formula in [29].
106.
We turn to the scenario, dual to that in Theorem 13, in which the input is Cauchy but the noise need not be. As Shannon showed in [1], if the input is Gaussian, among all noise distributions with given second moment, independent Gaussian noise is the least favorable. Shannon showed that fact applying the entropy power inequality to the numerator on the right side of (279), and then further weakened the resulting lower bound by replacing the noise entropy power in the denominator by its variance. Taking a cue from this simple approach, we apply the entropy strength inequality (124) to (277) to obtain
I ( X C ; X C + W ) = 1 2 log N C 2 ( Y ) N C 2 ( W )
1 2 log N C 2 ( X C ) + N C 2 ( W ) N C 2 ( W )
= 1 2 log 1 + ς 2 ( X C ) N C 2 ( W )
1 2 log 1 + ς 2 ( X C ) ς C 2 ( W ) ,
where (299) follows from N C 2 ( W ) ς C 2 ( W ) . Unfortunately, unlike the case of Gaussian input, this route falls short of showing that Cauchy noise of a given strength is least favorable because the right side of (299) is strictly smaller than the Cauchy-input Cauchy-noise mutual information in (271). Evidently, while the entropy power inequality is tight for Gaussian random variables, it is not for Cauchy random variables as we observed in Item 39. For this approach to succeed showing that, under a strength constraint, the least favorable noise is centered Cauchy we would need that, if W is independent of standard Cauchy V, then N C ( V + W ) N C ( W ) 1 . (See Item 119c-(a).)
107.
As in Item 102, the counterpart in the Cauchy-input case is more challenging due to the fact that, unlike variance, the output strength need not be equal to the sum of input and noise strength. The next two results give lower bounds which, although achieved by Cauchy noise, do not just depend on the noise distribution through its strength.
Theorem 14.
If X C is centered Cauchy, independent of W with 0 < ς ( W ) < ∞ , denote Y = X C + W . Then,
I ( X C ; X C + W ) log ς ( Y ) ς ( W ) log ς ( W ) ς ( Y ) ς ( X C ) ,
with equality if W is centered Cauchy.
Proof. 
Let us abbreviate ς = ς ( Y ) ς ( X C ) . Consider the following chain:
D ( Y ς ( Y ) V ) D ( W ς ( W ) V ) = D X C + W X C + ς V D ( W ς ( W ) V )
D W ς V D ( W ς ( W ) V )
= log ς ( W ) ς + E log ς 2 + W 2 ς 2 ( W ) + W 2
log ς ( W ) ς ,
where
  • (301) ⟸ X C is centered Cauchy;
  • (302) ⟸ relative entropy data processing theorem applied to a random transformation that consists of the addition of independent “noise” X C ;
  • (303) ⟸ both relative entropies are finite since ς ( W ) < ∞ ;
  • (304) ⟸ the elementary observation
    log ς 2 + t 2 ς 2 ( W ) + t 2 0 , ς < ς ( W ) ; 2 log ς ς ( W ) , ς ς ( W ) .
The desired bound (300) now follows in view of (278). It holds with equality in W being centered Cauchy as, in that case, ς ( Y ) = ς ( X C ) + ς ( W C ) .    □
Although the lower bound in Theorem 14 is achieved by a centered Cauchy, it does not rule out the existence of W such that ς ( W ) = ς ( W C ) and I ( X C ; X C + W ) < I ( X C ; X C + W C ) .
108.
For the following lower bound, it is advisable to assume for notational simplicity and without loss of generality that ς ( X C ) = 1 . To remove that restriction, we may simply replace W by ς ( X C ) W .
Theorem 15.
Let V be standard Cauchy independent of W. Then,
I ( V ; V + W ) log 1 + 1 λ ( W ) ,
where λ ( W ) is the solution to
E log ( 2 + λ ) 2 + W 2 λ 2 + W 2 = 2 log 1 + 1 λ .
Equality holds in (306) if W is a centered Cauchy random variable, in which case, λ ( W ) = ς ( W ) .
Proof. 
It can be shown that, if P X Y = P X P Y | X = P Y P X | Y and Q Y | X is an auxiliary random transformation such that P X Q Y | X = Q Y Q X | Y where Q Y is the response of Q Y | X to P X , then
I ( X ; Y ) = D ( P X | Y Q X | Y | P Y ) + E [ ı X ; Y ¯ ( X ; Y ) ] ,
where ( X , Y ) ∼ P X P Y | X and the information density ı X ; Y ¯ corresponds to the joint probability measure P X Q Y | X . We can particularize this decomposition of mutual information to the case where P X = P V , P Y | X = x = P W + x , Q Y | X = x = P W C + x , where W C is centered Cauchy with strength λ > 0 . Then, P X Q Y | X is the joint distribution of V and V + W C , and
ı X ; Y ¯ ( x ; y ) = log λ 1 + λ − log ( λ 2 + ( y − x ) 2 ) + log ( ( 1 + λ ) 2 + y 2 ) .
Taking expectation with respect to ( x , y ) = ( V , V + t ) , and invoking (52), we obtain
E [ ı X ; Y ¯ ( V ; V + t ) ] = log λ 1 + λ + E [ log ( 1 + λ ) 2 + ( V + t ) 2 λ 2 + t 2 ]
= log λ 1 + λ + log ( 2 + λ ) 2 + t 2 λ 2 + t 2 .
Finally, taking expectation with respect to t = W , we obtain
E [ ı X ; Y ¯ ( V ; V + W ) ] = E [ log ( 2 + λ ) 2 + W 2 λ 2 + W 2 ] − log ( 1 + 1 λ ) .
If λ = λ ( W ) , namely, the solution to (307), then (306) follows as a result of (308). If W = ς ( W ) V , then the solution to (307) is λ ( W ) = ς ( W ) , and the equality in (306) can be seen by specializing (271) to ( ς ( X C ) , ς ( W C ) ) = ( 1 , ς ( W ) ) .    □
109.
As we just saw, if W is centered Cauchy, then the solution to (307) satisfies λ ( W ) = ς ( W ) . On the other hand, we have
0.302 . . = ς ( V 2 , 2 ) < λ ( V 2 , 2 ) = 0.349
4.961 . . . = λ ( W ) < ς ( W ) = 5.845
if W has the probability density function in (100).
110.
As the proof indicates, at the expense of additional computation, we may sharpen the lower bound in Theorem 15 to show
I ( V ; V + W ) max λ > 0 E log ( 2 + λ ) 2 + W 2 λ 2 + W 2 log 1 + 1 λ ,
which is attained at the solution to
λ 2 + λ η W 2 1 ( 2 + λ ) 2 η W 2 1 λ 2 + 1 2 λ + 2 = 0 .
111.
Theorem 16.
The rate–distortion function of a memoryless source whose distribution is centered Cauchy with strength ς ( X ) such that the time-average of the distortion strength is upper bounded by D is given by
R ( D ) = log ς ( X ) D , 0 < D < ς ( X ) ; 0 , D ≥ ς ( X ) .
Proof. 
If D ≥ ς ( X ) , reproducing the source by ( 0 , … , 0 ) results in a time-average of the distortion strength equal to 1 n ∑ i = 1 n ς ( X i ) = ς ( X ) . Therefore, R ( D ) = 0 . If 0 < D < ς ( X ) , we proceed to determine the minimal I ( X ; X ^ ) among all P X ^ | X such that ς ( X − X ^ ) ≤ D . For any such random transformation,
I ( X ; X ^ ) = h ( X ) h ( X | X ^ )
= h ( X ) h ( X X ^ | X ^ )
h ( X ) h ( X X ^ )
= log 4 π ς ( X ) h ( X X ^ )
log 4 π ς ( X ) log 4 π ς ( X X ^ )
log ς ( X ) D ,
where (320) holds because conditioning cannot increase differential entropy, and (322) follows from Theorem 3 applied to Z = X X ^ . The fact that there is an allowable P X ^ | X that achieves the lower bound with equality is best seen by letting X = X ^ + Z , where Z and X ^ are independent centered Cauchy random variables with ς ( Z ) = D and ς ( X ^ ) = ς ( X ) D . Then, P X ^ | X P X = P X | X ^ P X ^ is such that the X marginal is indeed centered Cauchy with strength ς ( X ) , and ς ( X X ^ ) = D . Recalling (271),
I ( X ^ ; X ) = log 1 + ς ( X ) D ς ( Z ) = log ς ( X ) D ,
and the lower bound in (323) can indeed be satisfied with equality. We are not finished yet since we need to justify that the rate–distortion function is indeed
R ( D ) = min P X ^ | X : ς ( X X ^ ) D I ( X ; X ^ ) ,
which does not follow from the conventional memoryless lossy compression theorem with average distortion because, although the distortion measure is separable, it is not the average of a function with respect to the joint probability measure P X X ^ . This departure from the conventional setting does not impact the direct part of the theorem (i.e., ≤ in (325)), but it does affect the converse and in particular the proof of the fact that the n-version of the right side of (325) single-letterizes. To that end, it is sufficient to show that the function of D on the right side of (325) is convex (e.g., see pp. 316–317 in [19]). In the conventional setting, this follows from the convexity of the mutual information in the random transformation since, with a distortion function d ( · , · ) , we have
E [ d ( X , X ^ α ) ] = α E [ d ( X , X ^ 1 ) ] + ( 1 α ) E [ d ( X , X ^ 0 ) ] ,
where ( X , X ^ 1 ) P X P X ^ | X 1 , ( X , X ^ 0 ) P X P X ^ | X 0 , and ( X , X ^ α ) α P X P X ^ | X 1 + ( 1 α ) P X P X ^ | X 0 . Unfortunately, as we saw in Item 35, strength is not convex on the probability measure so, in general, we cannot claim that
ς ( X X ^ α ) α ς ( X X ^ 1 ) + ( 1 α ) ς ( X X ^ 0 ) .
The way out of this quandary is to realize that (327) is only needed for those P X ^ | X 0 and P X ^ | X 1 that attain the minimum on the right side of (325) for different distortion bounds D 0 and D 1 . As we saw earlier in this proof, those optimal random transformations are such that X X ^ 0 and X X ^ 1 are centered Cauchy. Fortuitously, as we noted in (107), (327) does indeed hold when we restrict attention to mixtures of centered Cauchy distributions.    □
Theorem 16 gives another example in which the Shannon lower bound to the rate–distortion function is tight. In addition to Gaussian sources with mean–square distortion, other examples can be found in [50]. Another interesting aspect of the lossy compression of memoryless Cauchy sources under strength distortion measure is that it is optimally successively refinable in the sense of [51,52]. As in the Gaussian case, this is a simple consequence of the stability of the Cauchy distribution and the fact that the strength of the sum of independent Cauchy random variables is equal to the sum of their respective strengths (Item 27).
112.
The continuity of mutual information can be shown under the following sufficient conditions
Theorem 17.
Suppose that X n is a sequence of real-valued random variables that vanishes in strength, Z is independent of X n , h ( Z ) > − ∞ and 0 < ς ( Z ) < ∞ . Then,
lim n I ( X n ; X n + Z ) = 0 .
Proof. 
Under the assumptions, h ( Z ) ∈ R . Therefore, I ( X n ; X n + Z ) = h ( X n + Z ) − h ( Z ) , and (328) follows from Theorem 1-(m).    □
113.
The assumption h ( Z ) > − ∞ is not superfluous for the validity of Theorem 17 even though it was not needed in Theorem 1-(m). Suppose that Z is integer valued, and X n = ( n L ) − 1 ∈ ( 0 , 1 / 2 ) where L ∈ { 2 , 3 , … } has probability mass function
P L ( k ) = 0.986551 … / ( k log 2 2 k ) , k = 2 , 3 , … ,
Then, I ( X n ; X n + Z ) = H ( X n ) = H ( L ) = ∞ , while E [ | X n | ] = 0.328289 … / n , and therefore, ς ( X n ) → 0 .
114.
In the case in which V n and W n are standard spherical multivariate Cauchy random variables with densities in (6), it follows from (7) that λ X V n + λ W W n has the same distribution as ( | λ X | + | λ W | ) V n . Therefore,
I V n ; λ X V n + λ W W n = h λ X V n + λ W W n h λ W W n
= n log 1 + | λ X | | λ W | ,
where we have used the scaling law h ( α X n ) = n log | α | + h ( X n ) . There is no possibility of a Cauchy-counterpart of the celebrated log-determinant formula for additive Gaussian vectors (e.g., Theorem 9.2.1 in [41]) because, as pointed out in Item 7, Λ 1 2 V n + Λ ¯ 1 2 W n is not distributed according to the ellipsoidal density in (8) unless Λ and Λ ¯ are proportional, in which case the setup reverts to that in (330).
115.
To conclude this section, we leave aside additive noise models and consider the mutual information between a partition of the components of the standard spherical multivariate Cauchy density (6). If I ∩ J = ∅ , then (17) yields
I { V i , i I } ; { V j , j J } = h | I | + h | J | h | I | + | J | ,
where h n stands for the right side of (17). For example, if i j , then, in nats,
I ( V i ; V j ) = 2 h ( V 1 ) h ( V 1 , V 2 )
= 2 log e ( 4 π ) 3 2 log e ( 4 π ) + γ + ψ ( 3 2 ) log e Γ ( 3 2 )
= log e ( 8 π ) − 3 = 0.22417
More generally, the dependence index among the n random variables in the standard spherical multivariate Cauchy density is (see also [9,53]), in nats,
D ( P V n P V 1 × × P V n ) = n h ( V 1 ) h ( V n )
= n 1 2 log e ( 4 π ) + log e Γ n + 1 2 n + 1 2 γ + ψ n + 1 2
= n 2 log e ( 8 π ) + k = 1 n 2 log e ( 2 k 1 ) n + 1 2 k 1 , n even ; n 1 2 log e ( 4 π ) + k = 1 n 1 2 log e k n + 1 2 k , n odd .
116.
The shared information of n random variables is a generalization of mutual information introduced in [54] for deriving the fundamental limit of interactive data exchange among agents who have access to the individual components and establish a dialog to ensure that all of them find out the value of the random vector. The shared information of X n is defined as
S ( X n ) = min Π 1 | Π | − 1 D P X n ∥ ∏ ℓ = 1 | Π | P X ( I ℓ ) ,
where X ( J ) = { X i , i ∈ J } , with J ⊆ I = { 1 , … , n } , and the minimum is over all partitions of I :
Π = { I ℓ , ℓ = 1 , … , | Π | } , with ∪ ℓ = 1 | Π | I ℓ = I , I ℓ ∩ I j = ∅ , ℓ ≠ j ,
such that | Π | > 1 . If we divide (338) by n 1 , we obtain the shared information of n random variables distributed according to the standard spherical multivariate Cauchy model. This is a consequence of the following result, which is of independent interest.
Theorem 18.
If X n are exchangeable random variables, any subset of which has finite differential entropy, then for any partition Π of { 1 , … , n } ,
1 | Π | − 1 D P X n ∥ ∏ ℓ = 1 | Π | P X ( I ℓ ) ≥ 1 n − 1 D ( P X n ∥ P X 1 × ⋯ × P X n ) .
Proof. 
Fix any partition Π with | Π | = L ∈ { 2 , … , n − 1 } chunks. Denote by n ℓ the number of chunks in Π with cardinality ℓ ∈ { 1 , … , n − 1 } . Therefore,
∑ ℓ = 1 n − 1 n ℓ = L , and ∑ ℓ = 1 n − 1 ℓ n ℓ = n .
By exchangeability, any chunk of cardinality k has the same differential entropy, which we denote by h k . Then,
D P X n ∥ ∏ ℓ = 1 | Π | P X ( I ℓ ) = − h n + ∑ ℓ = 1 n − 1 n ℓ h ℓ ,
and the difference of the left minus the right sides of (340) multiplied by ( n − 1 ) ( L − 1 ) is readily seen to equal
− ( n − 1 ) h n + ( n − 1 ) ∑ ℓ = 1 n − 1 n ℓ h ℓ + ( L − 1 ) h n − ( L − 1 ) n h 1
= [ ( n − 1 ) n 1 − n ( L − 1 ) ] h 1 + ( L − n ) h n + ( n − 1 ) ∑ ℓ = 2 n − 1 n ℓ h ℓ
≥ [ ( n − 1 ) n 1 − n ( L − 1 ) + ∑ ℓ = 2 n − 1 ( n − ℓ ) n ℓ ] h 1 + [ L − n + ∑ ℓ = 2 n − 1 ( ℓ − 1 ) n ℓ ] h n
= 0
where
  • (344) ⟸ for all ℓ ∈ { 2 , … , n − 1 } ,
    h ℓ ≥ [ ( ℓ − 1 ) h n + ( n − ℓ ) h 1 ] / ( n − 1 ) ,
    since h 1 , … , h n is a concave sequence, i.e., 2 h k ≥ h k−1 + h k+1 , as a result of the sub-modularity of differential entropy.
  • (345) ⟸ (341).
       □
Naturally, the same proof applies to n discrete exchangeable random variables with finite joint entropy.

17. Outlook

117.
We have seen that a number of key information theoretic properties pertaining to the Gaussian law are also satisfied in the Cauchy case. Conceptually, those extensions shed light on the underlying reason the conventional Gaussian results hold. Naturally, we would like to explore how far beyond the Cauchy law those results can be expanded. As far as the maximization of differential entropy is concerned, the essential step is to redefine strength, tailoring it to the desired law: Fix a reference random variable W with probability density function f W and finite differential entropy h ( W ) ∈ R , and define the W-strength of a real-valued random variable Z as
ς W ( Z ) = inf { ς > 0 : E [ log f W ( Z / ς ) ] ≥ − h ( W ) } .
For example,
(a)
For α > 0 , ς W ( α W ) = α ;
(b)
if W is standard normal, then ς W 2 ( Z ) = E [ Z 2 ] ;
(c)
if V is standard Cauchy, then ς V ( Z ) = ς ( Z ) ;
(d)
if W is standard exponential, then ς W ( Z ) = E [ Z ] if Z ≥ 0 a.s., otherwise, ς W ( Z ) = ∞ ;
(e)
if W is standard ( μ = 1 ) Subbotin (108) with p > 0 , then, ς W p ( Z ) = E [ | Z | p ] ;
(f)
if W has the Rider distribution in (9), then ς W ( Z ) = ς ρ , θ ( Z ) defined in (126) for θ chosen as in (110);
(g)
if W is uniformly distributed on [ − 1 , 1 ] , ς W ( Z ) = ess sup | Z | ;
(h)
if W is standard Rayleigh, then ς W ( Z ) = inf { ς > 0 : E [ Z 2 / ς 2 − log e ( Z 2 / ( 2 ς 2 ) ) ] ≤ 2 + γ } if Z ≥ 0 a.s., otherwise, ς W ( Z ) = ∞ .
The pivotal Theorems 3 and 4 admit the following generalization.
Theorem 19.
Suppose h ( W ) ∈ R and ς > 0 . Then,
max Z : ς W ( Z ) ≤ ς h ( Z ) = h ( W ) + log ς .
Proof. 
Fix any Z in the feasible set. For any σ ≥ ς W ( Z ) such that E [ log f W ( Z / σ ) ] ≥ − h ( W ) , we have
0 ≤ D ( σ − 1 Z ∥ W ) = − h ( Z ) + log σ − E [ log f W ( Z / σ ) ]
≤ − h ( Z ) + log σ + h ( W ) .
Therefore, h ( Z ) ≤ h ( W ) + log ς W ( Z ) , by definition of ς W ( Z ) , thereby establishing ≤ in (348). Equality holds since ς W ( ς W ) = ς .    □
A corollary to Theorem 19 is a very general form of the Shannon lower bound for the rate–distortion function of a memoryless source Z such that the distortion is constrained to have W-strength not higher than D, namely,
R ( D ) ≥ h ( Z ) − h ( W ) − log D .
Theorem 19 finds an immediate extension to the multivariate case
max Z n : ς W n ( Z n ) ≤ ς h ( Z n ) = h ( W n ) + n log ς ,
where, for W n with h ( W n ) ∈ R , we have defined
ς W n ( Z n ) = inf { ς > 0 : E [ log f W n ( ς − 1 Z n ) ] ≥ − h ( W n ) } .
For example, if W n is zero-mean multivariate Gaussian with positive definite covariance Σ , then ς W n 2 ( Z n ) = 1 n E Z n Σ 1 Z n .
118.
One aspect in which we have shown that Cauchy distributions lend themselves to simplification unavailable in the Gaussian case is the single-parametrization of their likelihood ratio, which paves the way for a slew of closed-form expressions for f-divergences and Rényi divergences. It would be interesting to identify other multiparameter (even just scale/location) families of distributions that enjoy the same property. To that end, it is natural, though by no means hopeful, to study various generalizations of the Cauchy distribution such as the Student-t random variable, or more generally, the Rider distribution in (9). The information theoretic study of general stable distributions is hampered by the fact that they are characterized by their characteristic functions (e.g., p. 164 in [55]), which so far, have not lent themselves to the determination of relative entropy or even differential entropy.
119.
Although we cannot expect that the cornucopia of information theoretic results in the Gaussian case can be extended to other domains, we have been able to show that a number of those results do find counterparts in the Cauchy case. Nevertheless, much remains to be explored. To name a few,
(a)
The concavity of the entropy-strength N C ( X + t V ) —a counterpart of Costa’s entropy power inequality [40] would guarantee the least favorability of Cauchy noise among all strength-constrained noises as well as the entropy strength inequality
N C ( X + t V ) N C ( t V ) + N C ( X ) .
(b)
Information theoretic analyses quantifying the approach to normality in the central limit theorem are well-known (e.g., [56,57,58]). It would be interesting to explore the decrease in the relative entropy (relative to the Cauchy law) of independent sums distributed according to a law in the domain of attraction of the Cauchy distribution [55].
(c)
Since de Bruijn’s identity is one of the ancestors of the i-mmse formula of [59], and we now have a counterpart of de Bruijn’s identity for convolutions with scaled Cauchy, it is natural to wonder if there may be some sort of integral representation of the mutual information between a random variable and its noisy version contaminated by additive Cauchy noise. In this respect, note that counterparts for the i-mmse formula for models other than additive Gaussian noise have been found in [60,61,62].
(d)
Mutual information is robust against the addition of small non-Gaussian contamination in the sense that its effects are the same as if it were Gaussian [63]. The proof methods rely on Taylor series expansions that require the existence of moments. Any Cauchy counterparts (recall Item 77) would require substantially different methods.
(e)
Pinsker [41] showed that Gaussian processes are information stable imposing only very mild assumptions. The key is that, modulo a factor, the variance of the information density is upper bounded by its mean, the mutual information. Does the spherical multivariate Cauchy distribution enjoy similar properties?
120.
Although not surveyed here, there are indeed a number of results in the engineering literature advocating Cauchy models in certain heavy-tailed infinite-variance scenarios (see, e.g., [45] and the references therein). At the end, either we abide by the information theoretic maxim that “there is nothing more practical than a beautiful formula”, or we pay heed to Poisson, who after pointing out in [64] that Laplace’s proof of the central limit theorem broke down for what we now refer to as the Cauchy law, remarked that “Mais nous ne tiendrons pas compte de ce cas particulier, qu’il nous suffira d’avoir remarqué à cause de sa singularité, et qui ne se rencontre sans doute pas dans la pratique”.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Definite Integrals

0 x 1 1 + t 2 d t = arctan ( x ) ,
1 2 1 2 log cos ( π t ) d t = log 1 2 ,
log ( 1 + t 2 ) 1 + t 2 d t = π log 4 ,
log ( α 2 2 α t cos β + t 2 ) 1 + t 2 d t = π log ( 1 + α 2 + 2 α | sin β | ) ,
log ( 1 + t 2 ) 1 + ( ξ t κ ) 2 d t = π ξ log κ 2 + ξ + 1 2 2 log ξ , ξ > 0 ,
κ β , ρ log e ( 1 + | t | ρ ) ( 1 + | t | ρ ) β d t = ψ ( β ) ψ β 1 ρ , β ρ > 1 ,
log e 1 + θ 2 t 2 1 + t 2 2 = π log e ( 1 + | θ | ) | θ | 1 + | θ | ,
α α log e ( t 2 + ς 2 ) d t = 4 ς arctan α ς 4 α + 2 α log e α 2 + ς 2 ,
t 2 ( 1 + t 2 ) 2 d t = π 2 ,
1 ( 1 + t 2 ) 2 d t = π 2 ,
t 2 ( 1 + t 2 ) 3 d t = π 8 ,
1 ( β 2 + t 2 ) ν d t = π β 1 2 ν Γ ν 1 2 Γ ν , ν > 1 2 ,
0 1 ( 1 + t ρ ) ν d t = Γ ν 1 ρ Γ 1 + 1 ρ Γ ν , ν > 1 ρ > 0 ,
0 π log α + β cos θ d θ = π log α 2 + 1 2 α 2 β 2 , α | β | > 0 ,
0 π log β + β 2 1 cos θ α d θ = π P α ( β ) , β > 0 ,
0 d t 1 + t 2 β 2 + t 2 = K 1 β 2 , β ( 0 , 1 ) ,
where
  • (A2) is a special case of 4.384.21 in [24];
  • (A3) is a special case of (A4);
  • (A4) is 4.296.2 in [24];
  • (A5) follows from (A4) by change of variable;
  • (A6), with κ β , ρ defined in (10) and ψ ( · ) denoting the digamma function, follows from 4.256 in [24] by change of variable x = ( 1 + t p ) 1 2 n and n = m p ;
  • (A7) is a special case of 4.295.25 in [24];
  • (A8) follows from 2.733.1 in [24];
  • (A9)–(A10) follow from 3.252.6 in [24];
  • (A11) can be obtained by integration by parts and (A10);
  • (A12), with Γ ( · ) denoting the gamma function, is a special case of 3.251.11 in [24];
  • (A13) can be obtained from 3.251.11 in [24] by change of variable;
  • (A14) is 4.224.9 in [24];
  • (A15) is 8.822.1 in [24] with P α ( x ) the Legendre function of the first kind, which is a solution to
    d d x 1 x 2 d u ( x ) d x + α ( α + 1 ) u ( x ) = 0 ;
  • (A16) is a special case of 3.152.1 in [24] with the complete elliptic integral of the first kind defined as 8.112.1 in [24], namely,
    K ( k ) = 0 π 2 d α 1 k 2 sin 2 α , | k | < 1 .
    Note that Mathematica defines the complete elliptic integral function EllipticK such that
    K ( k ) = EllipticK k 2 1 k 2 1 k 2 , | k | < 1 .

References

  1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656. [Google Scholar] [CrossRef]
  2. Verdú, S. The exponential distribution in information theory. Probl. Inf. Transm. 1996, 32, 86–95. [Google Scholar]
  3. Anantharam, V.; Verdú, S. Bits through queues. IEEE Trans. Inf. Theory 1996, 42, 4–18. [Google Scholar] [CrossRef]
  4. Stam, A. Some inequalities satisfied by the quantities of information of Fisher and Shannon. Inf. Control. 1959, 2, 101–112. [Google Scholar] [CrossRef]
  5. Ferguson, T.S. A representation of the symmetric bivariate Cauchy distribution. Ann. Math. Stat. 1962, 33, 1256–1266. [Google Scholar] [CrossRef]
  6. Fang, K.T.; Kotz, S.; Ng, K.W. Symmetric Multivariate and Related Distributions; CRC Press: Boca Raton, FL, USA, 2018. [Google Scholar]
  7. Rider, P.R. Generalized Cauchy distributions. Ann. Inst. Stat. Math. 1958, 9, 215–223. [Google Scholar] [CrossRef]
  8. Bouhlel, N.; Rousseau, D. A generic formula and some special cases for the Kullback–Leibler divergence between central multivariate Cauchy distributions. Entropy 2022, 24, 838. [Google Scholar] [CrossRef]
  9. Abe, S.; Rajagopal, A.K. Information theoretic approach to statistical properties of multivariate Cauchy-Lorentz distributions. J. Phys. A Math. Gen. 2001, 34, 8727–8731. [Google Scholar] [CrossRef]
  10. Tulino, A.M.; Verdú, S. Random matrix theory and wireless communications. Found. Trends Commun. Inf. Theory 2004, 1, 1–182. [Google Scholar] [CrossRef]
  11. Widder, D.V. The Stieltjes transform. Trans. Am. Math. Soc. 1938, 43, 7–60. [Google Scholar] [CrossRef]
  12. Kullback, S. Information Theory and Statistics; Dover: New York, NY, USA, 1968; Originally published in 1959 by JohnWiley. [Google Scholar]
  13. Wu, Y.; Verdú, S. Rényi information dimension: Fundamental limits of almost lossless analog compression. IEEE Trans. Inf. Theory 2010, 56, 3721–3747. [Google Scholar] [CrossRef]
  14. Donsker, M.D.; Varadhan, S.R.S. Asymptotic evaluation of certain Markov process expectations for large time, I. Commun. Pure Appl. Math. 1975, 28, 1–47. [Google Scholar] [CrossRef]
  15. Donsker, M.D.; Varadhan, S.R.S. Asymptotic evaluation of certain Markov process expectations for large time, III. Commun. Pure Appl. Math. 1977, 29, 369–461. [Google Scholar] [CrossRef]
  16. Lapidoth, A.; Moser, S.M. Capacity bounds via duality with applications to multiple-antenna systems on flat-fading channels. IEEE Trans. Inf. Theory 2003, 49, 2426–2467. [Google Scholar] [CrossRef]
  17. Subbotin, M.T. On the law of frequency of error. Mat. Sb. 1923, 31, 296–301. [Google Scholar]
  18. Kapur, J.N. Maximum-Entropy Models in Science and Engineering; Wiley-Eastern: New Delhi, India, 1989. [Google Scholar]
  19. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley: New York, NY, USA, 2006. [Google Scholar]
  20. Dembo, A.; Cover, T.M.; Thomas, J.A. Information theoretic inequalities. IEEE Trans. Inf. Theory 1991, 37, 1501–1518. [Google Scholar] [CrossRef]
  21. Han, T.S. Information Spectrum Methods in Information Theory; Springer: Heidelberg, Germany, 2003. [Google Scholar]
  22. Vajda, I. Theory of Statistical Inference and Information; Kluwer: Dordrecht, The Netherlands, 1989. [Google Scholar]
  23. Deza, E.; Deza, M.M. Dictionary of Distances; Elsevier: Amsterdam, The Netherlands, 2006. [Google Scholar]
  24. Gradshteyn, I.S.; Ryzhik, I.M. Table of Integrals, Series, and Products, 7th ed.; Academic Press: Burlington, MA, USA, 2007. [Google Scholar]
  25. Sason, I.; Verdú, S. f-divergence inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006. [Google Scholar] [CrossRef]
  26. Nielsen, F.; Okamura, K. On f-divergences between Cauchy distributions. In Proceedings of the International Conference on Geometric Science of Information, Paris, France, 21–23 July 2021; pp. 799–807. [Google Scholar]
  27. Eaton, M.L. Group Invariance Applications in Statistics. In Proceedings of the Regional Conference Series in Probability and Statistics; Institute of Mathematical Statistics: Hayward, CA, USA, 1989; Volume 1. [Google Scholar]
  28. McCullagh, P. On the distribution of the Cauchy maximum-likelihood estimator. Proc. R. Soc. London. Ser. A Math. Phys. Sci. 1993, 440, 475–479. [Google Scholar]
  29. Verdú, S. On channel capacity per unit cost. IEEE Trans. Inf. Theory 1990, 36, 1019–1030. [Google Scholar] [CrossRef]
  30. Chyzak, F.; Nielsen, F. A closed-form formula for the Kullback–Leibler divergence between Cauchy distributions. arXiv 2019, arXiv:1905.10965. [Google Scholar]
  31. Verdú, S. Mismatched estimation and relative entropy. IEEE Trans. Inf. Theory 2010, 56, 3712–3720. [Google Scholar] [CrossRef]
  32. Csiszár, I. I-Divergence geometry of probability distributions and minimization problems. Ann. Probab. 1975, 3, 146–158. [Google Scholar] [CrossRef]
  33. Sason, I.; Verdú, S. Bounds among f-divergences. arXiv 2015, arXiv:1508.00335. [Google Scholar]
  34. Abramowitz, M.; Stegun, I.A. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables; US Government Printing Office: Washington, DC, USA, 1964; Volume 55. [Google Scholar]
  35. Rényi, A. On measures of information and entropy. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability; Neyman, J., Ed.; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561. [Google Scholar]
  36. Gil, M.; Alajaji, F.; Linder, T. Rényi divergence measures for commonly used univariate continuous distributions. Inf. Sci. 2013, 249, 124–131. [Google Scholar] [CrossRef]
  37. González, M. Elliptic integrals in terms of Legendre polynomials. Glasg. Math. J. 1954, 2, 97–99. [Google Scholar] [CrossRef] [Green Version]
  38. Nielsen, F. Revisiting Chernoff information with likelihood ratio exponential families. Entropy 2022, 24, 1400. [Google Scholar] [CrossRef]
  39. Fisher, R.A. Theory of statistical estimation. Math. Proc. Camb. Math. Soc. 1925, 22, 700–725. [Google Scholar] [CrossRef]
40. Costa, M.H.M. A new entropy power inequality. IEEE Trans. Inf. Theory 1985, 31, 751–760.
41. Pinsker, M.S. Information and Information Stability of Random Variables and Processes; Holden-Day: San Francisco, CA, USA, 1964; originally published in Russian in 1960.
42. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
43. Pinsker, M.S. Calculation of the rate of message generation by a stationary random process and the capacity of a stationary channel. Dokl. Akad. Nauk 1956, 111, 753–766.
44. Ihara, S. On the capacity of channels with additive non-Gaussian noise. Inf. Control 1978, 37, 34–39.
45. Fahs, J.; Abou-Faycal, I.C. A Cauchy input achieves the capacity of a Cauchy channel under a logarithmic constraint. In Proceedings of the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014; pp. 3077–3081.
46. Rioul, O.; Magossi, J.C. On Shannon’s formula and Hartley’s rule: Beyond the mathematical coincidence. Entropy 2014, 16, 4892–4910.
47. Dytso, A.; Egan, M.; Perlaza, S.; Poor, H.; Shamai, S. Optimal inputs for some classes of degraded wiretap channels. In Proceedings of the 2018 IEEE Information Theory Workshop, Guangzhou, China, 25–29 November 2018; pp. 1–7.
48. Cover, T.M. Some advances in broadcast channels. In Advances in Communication Systems; Viterbi, A.J., Ed.; Academic Press: New York, NY, USA, 1975; Volume 4, pp. 229–260.
49. Wyner, A.D. Recent results in the Shannon theory. IEEE Trans. Inf. Theory 1974, 20, 2–9.
50. Berger, T. Rate Distortion Theory; Prentice-Hall: Englewood Cliffs, NJ, USA, 1971.
51. Koshelev, V.N. Estimation of mean error for a discrete successive approximation scheme. Probl. Inf. Transm. 1981, 17, 20–33.
52. Equitz, W.H.R.; Cover, T.M. Successive refinement of information. IEEE Trans. Inf. Theory 1991, 37, 269–274.
53. Kotz, S.; Nadarajah, S. Multivariate t-Distributions and Their Applications; Cambridge University Press: Cambridge, UK, 2004.
54. Csiszár, I.; Narayan, P. The secret key capacity of multiple terminals. IEEE Trans. Inf. Theory 2004, 50, 3047–3061.
55. Kolmogorov, A.N.; Gnedenko, B.V. Limit Distributions for Sums of Independent Random Variables; Addison-Wesley: Reading, MA, USA, 1954.
56. Barron, A.R. Entropy and the central limit theorem. Ann. Probab. 1986, 14, 336–342.
57. Artstein, S.; Ball, K.; Barthe, F.; Naor, A. Solution of Shannon’s problem on the monotonicity of entropy. J. Am. Math. Soc. 2004, 17, 975–982.
58. Tulino, A.M.; Verdú, S. Monotonic decrease of the non-Gaussianness of the sum of independent random variables: A simple proof. IEEE Trans. Inf. Theory 2006, 52, 4295–4297.
59. Guo, D.; Shamai, S.; Verdú, S. Mutual information and minimum mean-square error in Gaussian channels. IEEE Trans. Inf. Theory 2005, 51, 1261–1282.
60. Guo, D.; Shamai, S.; Verdú, S. Mutual information and conditional mean estimation in Poisson channels. IEEE Trans. Inf. Theory 2008, 54, 1837–1849.
61. Jiao, J.; Venkat, K.; Weissman, T. Relations between information and estimation in discrete-time Lévy channels. IEEE Trans. Inf. Theory 2017, 63, 3579–3594.
62. Arras, B.; Swan, Y. IT formulae for gamma target: Mutual information and relative entropy. IEEE Trans. Inf. Theory 2018, 64, 1083–1091.
63. Pinsker, M.S.; Prelov, V.; Verdú, S. Sensitivity of channel capacity. IEEE Trans. Inf. Theory 1995, 41, 1877–1888.
64. Poisson, S.D. Sur la probabilité des résultats moyens des observations. In Connaisance des Tems, ou des Mouvemens Célestes a l’usage des Astronomes, et des Navigateurs, pour l’an 1827; Bureau des longitudes: Paris, France, 1824; pp. 273–302.