Article

An Axiomatic Characterization of Mutual Information

James Fullwood
School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
Entropy 2023, 25(4), 663; https://doi.org/10.3390/e25040663
Submission received: 12 March 2023 / Revised: 10 April 2023 / Accepted: 12 April 2023 / Published: 15 April 2023
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

We characterize mutual information as the unique map on ordered pairs of discrete random variables satisfying a set of axioms similar to those of Faddeev’s characterization of the Shannon entropy. There is a new axiom in our characterization, however, which has no analog for Shannon entropy, based on the notion of a Markov triangle, which may be thought of as a composition of communication channels for which conditional entropy acts functorially. Our proofs are coordinate-free in the sense that no logarithms appear in our calculations.

1. Introduction

Axiomatic characterizations of information measures go back to the seminal work of Shannon [1], providing conceptual insights into their meaning as well as justification for the analytic formulae involved in their definitions. Various characterizations of Shannon entropy, relative entropy, Rényi and Tsallis entropies, von Neumann and Segal entropies, quantum relative entropy, as well as other generalized information measures have appeared in the literature [2,3,4,5,6,7,8,9], and a review of such enterprises in the classical (i.e., non-quantum) setting appears in the survey of Csiszár [10]. More recently, functorial characterizations of information measures from a categorical viewpoint have appeared in the works of Baez, Fritz, and Leinster [11,12], as well as our work with Parzygnat [13], who has proven a functorial characterization of the von Neumann entropy [14]. An axiomatic approach to entropy in the theory of biodiversity is the subject of the recent book [15] by Leinster.
In spite of the breadth of the aforementioned results, the mutual information of a pair of random variables seems to be missing from the story. While an operational characterization of mutual information in the context of algorithmic information theory appears in [16], to the best of our knowledge, an axiomatic characterization in the vein of those surveyed by Csiszár in [10] is absent in the literature. It is then the goal of the present work to introduce mutual information into the axiomatic framework.
Our main result is Theorem 1, where we prove that the mutual information $I(X,Y)$ of an ordered pair of random variables is the unique function (up to an arbitrary multiplicative factor) on pairs of random variables satisfying the following axioms:
  • Continuity: If $(X_n, Y_n) \to (X, Y)$, then $I(X,Y) = \lim_{n\to\infty} I(X_n, Y_n)$.
  • Strong Additivity: Given a random variable $X:\Omega\to\mathcal{X}$ with probability mass function $p:\mathcal{X}\to[0,1]$, and a collection of pairs of random variables $(Y_x, Z_x)$ indexed by $\mathcal{X}$, then
    $I\big(\sum_{x\in\mathcal{X}} p(x)(Y_x, Z_x)\big) = I(X,X) + \sum_{x\in\mathcal{X}} p(x)\, I(Y_x, Z_x).$
  • Symmetry: $I(X,Y) = I(Y,X)$ for every pair of random variables $(X,Y)$.
  • Invariance Under Pullbacks: If $\pi:\Omega'\to\Omega$ is a measure-preserving function, then for every pair of random variables $(X,Y)$ with common domain $\Omega$,
    $I(X,Y) = I(X\circ\pi,\, Y\circ\pi).$
  • Weak Functoriality: For every Markov triangle $(X,Y,Z)$,
    $I(X,Z) = I(X,Y) + I(Y,Z) - I(Y,Y).$
  • Vacuity: If $C$ is a constant random variable, then $I(X,C) = 0$.
The fact that mutual information satisfies Axioms 1, 3, and 6 is well known to anybody familiar with mutual information. As we work at the level of random variables as opposed to simple probability distributions (which we do for wider applicability of our results), Axiom 4 is a reflection of the fact that mutual information only depends on probabilities. For Axiom 2, we define a convex structure on pairs of random variables in such a way that the strong additivity of Shannon entropy is generalized to our context. Axiom 5 is defined in terms of the notion of a Markov triangle, a concept we define based on the notion of a “coalescable” composition of communication channels which was introduced in [13]. Intuitively, a Markov triangle may be thought of as a composition of noisy channels over which the associated conditional entropy is additive. Moreover, the axioms are sharp in the sense that if any of the axioms is removed, then mutual information may fail to be characterized. In particular, the joint entropy $H(X,Y)$ satisfies Axioms 1–5, while the conditional entropy $H(Y|X)$ satisfies all the axioms except the symmetry Axiom 3 (note that since $H(X|X) = 0$, Axiom 2 in the case of conditional entropy becomes convex linearity).
In the spirit of the axiomatic approach, we note that logarithms are absent from all calculations in this paper.

2. Mutual Information

Let $(\Omega, \Sigma, \mu)$ be a probability space, where $\Omega$ is thought of as the set of all possible outcomes of a data generating process or experiment, $\Sigma$ is a $\sigma$-algebra of measurable subsets of $\Omega$, and $\mu$ is a probability measure.
Definition 1.
A finite random variable is a surjective function $X:\Omega\to\mathcal{X}$ such that $\mathcal{X}$ is a finite set and $X^{-1}(x)\in\Sigma$ for all $x\in\mathcal{X}$. In such a case, the set $\mathcal{X}$ is often referred to as the support, or alphabet, associated with $X$. The probability mass function of $X$ is the function $p:\mathcal{X}\to[0,1]$ given by
$p(x) = \mu\big(X^{-1}(x)\big),$
and the Shannon entropy of $X$ is the non-negative real number $H(X)$ given by
$H(X) = -\sum_{x\in\mathcal{X}} p(x)\log p(x).$
The collection of all finite random variables on $\Omega$ will be denoted $\mathrm{FRV}(\Omega)$.
Definition 2.
Let $(X,Y)\in\mathrm{FRV}(\Omega)\times\mathrm{FRV}(\Omega)$ be an ordered pair of random variables with supports $\mathcal{X}$ and $\mathcal{Y}$ respectively.
  • The joint distribution function of $(X,Y)$ is the function $\vartheta:\mathcal{X}\times\mathcal{Y}\to[0,1]$ given by
    $\vartheta(x,y) = \mu\big(X^{-1}(x)\cap Y^{-1}(y)\big).$
  • The joint entropy of $(X,Y)$ is the non-negative real number given by
    $H(X,Y) = -\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} \vartheta(x,y)\log\vartheta(x,y).$
  • The mutual information of $(X,Y)$ is the real number $I(X,Y)$ given by
    $I(X,Y) = H(X) + H(Y) - H(X,Y).$
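As a quick numerical illustration of Definition 2, the following minimal sketch computes $H(X)$, $H(Y)$, $H(X,Y)$, and $I(X,Y)$ from a small hypothetical joint distribution $\vartheta$ (the numbers are arbitrary and chosen only for this example).

import numpy as np

# Hypothetical joint distribution theta(x, y); rows index the support of X,
# columns the support of Y.
theta = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

def entropy(p):
    # Shannon entropy of a probability array, with the convention 0*log(0) = 0.
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_X = theta.sum(axis=1)              # marginal distribution of X
p_Y = theta.sum(axis=0)              # marginal distribution of Y
H_X, H_Y = entropy(p_X), entropy(p_Y)
H_XY = entropy(theta)                # joint entropy H(X, Y)
I_XY = H_X + H_Y - H_XY              # mutual information per Definition 2
print(H_X, H_Y, H_XY, I_XY)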
Remark 1.
With every pair of random variables $(X,Y)$ one may associate a probability transition matrix $p(y|x)$ given by
$p(y|x) = \dfrac{\vartheta(x,y)}{p(x)},$
where $p:\mathcal{X}\to[0,1]$ is the probability mass function of $X$. As such, one may view $(X,Y)$ as a noisy channel $\mathcal{X}\to\mathcal{Y}$ together with the prior distribution $p$ on its set of inputs.
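For concreteness, here is a minimal sketch of the transition matrix of Remark 1, obtained by normalizing each row of a hypothetical joint distribution $\vartheta$ by the corresponding marginal probability of $X$.

import numpy as np

# Hypothetical joint distribution theta(x, y) of a pair (X, Y).
theta = np.array([[0.3, 0.1],
                  [0.2, 0.4]])
p_X = theta.sum(axis=1)                        # probability mass function of X
channel = theta / p_X[:, None]                 # row x is the distribution p(.|x) on Y
assert np.allclose(channel.sum(axis=1), 1.0)   # each row is a probability distribution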
We now list some well-known properties of mutual information which will be useful for our purposes (see, e.g., [17] for proofs).
Proposition 1.
Mutual information satisfies the following properties.
i.
$I(X,Y) \geq 0$ for all $(X,Y)\in\mathrm{FRV}(\Omega)\times\mathrm{FRV}(\Omega)$.
ii.
$I(X,Y) = I(Y,X)$ for all $(X,Y)\in\mathrm{FRV}(\Omega)\times\mathrm{FRV}(\Omega)$.
iii.
$I(X,X) = H(X)$ for all $X\in\mathrm{FRV}(\Omega)$.
iv.
$I(X,C) = 0$ for every constant random variable $C\in\mathrm{FRV}(\Omega)$.
Definition 3.
The canonical product on $\mathrm{FRV}(\Omega)$ is the map $P:\mathrm{FRV}(\Omega)\times\mathrm{FRV}(\Omega)\to\mathrm{FRV}(\Omega)$ given by $P(X,Y)(\omega) = \big(X(\omega), Y(\omega)\big)\in\mathcal{X}\times\mathcal{Y}$ for all $\omega\in\Omega$.
Proposition 2.
Let $(X,Y)\in\mathrm{FRV}(\Omega)\times\mathrm{FRV}(\Omega)$. Then the following statements hold.
i.
The probability mass function of $P(X,Y)$ is the joint distribution function $\vartheta(x,y)$. In particular, $H(X,Y) = H\big(P(X,Y)\big)$.
ii.
$I\big(X, P(X,Y)\big) = H(X)$.
Proof. 
i.
Let $\nu:\mathcal{X}\times\mathcal{Y}\to[0,1]$ denote the probability mass function of $P(X,Y)$. Then for all $(x,y)\in\mathcal{X}\times\mathcal{Y}$ we have $P(X,Y)^{-1}(x,y) = X^{-1}(x)\cap Y^{-1}(y)$, thus
$\nu(x,y) = \mu\big(P(X,Y)^{-1}(x,y)\big) = \mu\big(X^{-1}(x)\cap Y^{-1}(y)\big) = \vartheta(x,y),$
as desired.
ii.
The statement follows from the fact that $H\big(X, P(X,Y)\big) = H(X,Y)$. □

3. Convexity

We now generalize the notion of a convex combination of probability distributions to the setting of pairs of random variables, which will be used to extend the notion of strong additivity for Shannon entropy to mutual information.
Notation 1.
We use the notation $\mathcal{X}\sqcup\mathcal{Y}$ to denote the disjoint union of the sets $\mathcal{X}$ and $\mathcal{Y}$.
Definition 4.
Let $\mathcal{X}$ be a finite set, and let $p:\mathcal{X}\to[0,1]$ be a probability distribution on $\mathcal{X}$. Then $\sum_{x\in\mathcal{X}} p(x)\,(\Omega,\Sigma,\mu)$ is the probability space associated with the triple $(\mathcal{X}\times\Omega,\ \mathcal{X}\times\Sigma,\ p\times\mu)$. Now suppose $Y_x\in\mathrm{FRV}(\Omega)$ is a collection of random variables indexed by $\mathcal{X}$, and let $q_x:\mathcal{Y}_x\to[0,1]$ denote the probability mass function of $Y_x$. The p-weighted convex sum $\sum_{x\in\mathcal{X}} p(x)Y_x \in \mathrm{FRV}(\mathcal{X}\times\Omega)$ is the random variable given by
$\Big(\sum_{x\in\mathcal{X}} p(x)Y_x\Big)(\tilde{x},\omega) = Y_{\tilde{x}}(\omega). \qquad (1)$
It then follows that the probability mass function of $\sum_{x\in\mathcal{X}} p(x)Y_x$ is a function of the form $r:\bigsqcup_{x\in\mathcal{X}}\mathcal{Y}_x\to[0,1]$, and using the fact that $\bigsqcup_{x\in\mathcal{X}}\mathcal{Y}_x$ is canonically isomorphic to the set
$\{(x,y)\mid x\in\mathcal{X}\ \text{and}\ y\in\mathcal{Y}_x\},$
it follows that $r$ is then given by $r(x,y) = p(x)q_x(y)$.
A reformulation of the strong additivity property of Shannon entropy in terms of the convex structure just introduced for random variables is given by the following proposition.
Proposition 3.
Let $\mathcal{X}$ be a finite set, let $p:\mathcal{X}\to[0,1]$ be a probability distribution on $\mathcal{X}$, and suppose $Y_x\in\mathrm{FRV}(\Omega)$ is a collection of random variables indexed by $\mathcal{X}$. Then
$H\Big(\sum_{x\in\mathcal{X}} p(x)Y_x\Big) = H(p) + \sum_{x\in\mathcal{X}} p(x)H(Y_x), \qquad (2)$
where $H(p)$ is the Shannon entropy of the probability distribution $p$.
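As a numerical sanity check of Proposition 3, the sketch below (with hypothetical weights and distributions chosen only for illustration) builds the distribution $r(x,y) = p(x)q_x(y)$ of a convex sum explicitly and compares both sides of (2).

import numpy as np

def H(p):
    # Shannon entropy of a probability array, with the convention 0*log(0) = 0.
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical weight distribution p and one distribution q_x per index x;
# the alphabets of the q_x may have different sizes.
p = np.array([0.4, 0.6])
q = [np.array([0.5, 0.5]), np.array([0.2, 0.3, 0.5])]

# Distribution of the convex sum on the disjoint union of the alphabets.
r = np.concatenate([p[x] * q[x] for x in range(len(p))])

lhs = H(r)                                              # H(sum_x p(x) Y_x)
rhs = H(p) + sum(p[x] * H(q[x]) for x in range(len(p)))
assert np.isclose(lhs, rhs)                             # Proposition 3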
Proposition 4.
Let $\mathcal{X}$ be a finite set, let $p:\mathcal{X}\to[0,1]$ be a probability distribution on $\mathcal{X}$, and suppose $(Y_x, Z_x)\in\mathrm{FRV}(\Omega)\times\mathrm{FRV}(\Omega)$ is a collection of pairs of random variables indexed by $\mathcal{X}$. Then
$\sum_{x\in\mathcal{X}} p(x)P(Y_x,Z_x) = P\Big(\sum_{x\in\mathcal{X}} p(x)Y_x,\ \sum_{x\in\mathcal{X}} p(x)Z_x\Big). \qquad (3)$
Proof. 
Let $(\tilde{x},\omega)\in\mathcal{X}\times\Omega$. Then
$\Big(\sum_{x\in\mathcal{X}} p(x)P(Y_x,Z_x)\Big)(\tilde{x},\omega) \overset{(1)}{=} P(Y_{\tilde{x}}, Z_{\tilde{x}})(\omega) = \big(Y_{\tilde{x}}(\omega), Z_{\tilde{x}}(\omega)\big) \overset{(1)}{=} \Big(\big(\textstyle\sum_{x\in\mathcal{X}} p(x)Y_x\big)(\tilde{x},\omega),\ \big(\textstyle\sum_{x\in\mathcal{X}} p(x)Z_x\big)(\tilde{x},\omega)\Big) = P\Big(\sum_{x\in\mathcal{X}} p(x)Y_x,\ \sum_{x\in\mathcal{X}} p(x)Z_x\Big)(\tilde{x},\omega),$
thus Equation (3) holds. □
In light of Proposition 4, we make the following definition.
Definition 5.
Let $\mathcal{X}$ be a finite set, let $p:\mathcal{X}\to[0,1]$ be a probability distribution on $\mathcal{X}$, and suppose $(Y_x, Z_x)\in\mathrm{FRV}(\Omega)\times\mathrm{FRV}(\Omega)$ is a collection of pairs of random variables indexed by $\mathcal{X}$. The p-weighted convex sum $\sum_{x\in\mathcal{X}} p(x)(Y_x,Z_x)\in\mathrm{FRV}(\mathcal{X}\times\Omega)\times\mathrm{FRV}(\mathcal{X}\times\Omega)$ is defined to be the ordered pair $\big(\sum_{x\in\mathcal{X}} p(x)Y_x,\ \sum_{x\in\mathcal{X}} p(x)Z_x\big)$.
Proposition 5
(Strong Additivity of Mutual Information). Let $\mathcal{X}$ be a finite set, let $p:\mathcal{X}\to[0,1]$ be a probability distribution on $\mathcal{X}$, and suppose $(Y_x, Z_x)\in\mathrm{FRV}(\Omega)\times\mathrm{FRV}(\Omega)$ is a collection of pairs of random variables indexed by $\mathcal{X}$. Then
$I\Big(\sum_{x\in\mathcal{X}} p(x)(Y_x,Z_x)\Big) = H(p) + \sum_{x\in\mathcal{X}} p(x)I(Y_x,Z_x),$
where $H(p)$ is the Shannon entropy of the probability distribution $p$.
Proof. 
Indeed,
$I\Big(\sum_{x\in\mathcal{X}} p(x)(Y_x,Z_x)\Big) = I\Big(\sum_{x\in\mathcal{X}} p(x)Y_x,\ \sum_{x\in\mathcal{X}} p(x)Z_x\Big) = H\Big(\sum_{x\in\mathcal{X}} p(x)Y_x\Big) + H\Big(\sum_{x\in\mathcal{X}} p(x)Z_x\Big) - H\Big(\sum_{x\in\mathcal{X}} p(x)Y_x,\ \sum_{x\in\mathcal{X}} p(x)Z_x\Big) \overset{(3)}{=} H\Big(\sum_{x\in\mathcal{X}} p(x)Y_x\Big) + H\Big(\sum_{x\in\mathcal{X}} p(x)Z_x\Big) - H\Big(\sum_{x\in\mathcal{X}} p(x)P(Y_x,Z_x)\Big) \overset{(2)}{=} 2H(p) + \sum_{x\in\mathcal{X}} p(x)\big(H(Y_x) + H(Z_x)\big) - \Big(H(p) + \sum_{x\in\mathcal{X}} p(x)H(Y_x,Z_x)\Big) = H(p) + \sum_{x\in\mathcal{X}} p(x)\big(H(Y_x) + H(Z_x) - H(Y_x,Z_x)\big) = H(p) + \sum_{x\in\mathcal{X}} p(x)I(Y_x,Z_x),$
as desired. □
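The identity in Proposition 5 can also be checked numerically. The sketch below (hypothetical weights and joint distributions, chosen only for illustration) uses the fact that the convex sum pair $\sum_x p(x)(Y_x,Z_x)$ has a block-diagonal joint distribution with blocks $p(x)\vartheta_x$, where $\vartheta_x$ denotes the joint distribution of $(Y_x,Z_x)$.

import numpy as np
from scipy.linalg import block_diag

def H(p):
    # Shannon entropy of a probability array, with the convention 0*log(0) = 0.
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def I(joint):
    # Mutual information of a pair from its joint distribution matrix.
    return H(joint.sum(axis=1)) + H(joint.sum(axis=0)) - H(joint)

# Hypothetical weights p and joint distributions theta_x of the pairs (Y_x, Z_x).
p = np.array([0.4, 0.6])
thetas = [np.array([[0.3, 0.2], [0.1, 0.4]]),
          np.array([[0.25, 0.25], [0.25, 0.25]])]

# Joint distribution of the convex sum pair: one block p(x) * theta_x per x
# (off-diagonal blocks vanish since both components carry the same index x).
joint = block_diag(*[p[x] * thetas[x] for x in range(len(p))])

lhs = I(joint)
rhs = H(p) + sum(p[x] * I(thetas[x]) for x in range(len(p)))
assert np.isclose(lhs, rhs)      # Proposition 5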

4. Continuity

Definition 6.
Let $X_n\in\mathrm{FRV}(\Omega)$ be a sequence of random variables, and let $p_n:\mathcal{X}_n\to[0,1]$ be the associated sequence of probability mass functions. Then $X_n$ is said to weakly converge (or converge in distribution) to the random variable $X\in\mathrm{FRV}(\Omega)$ with probability mass function $p:\mathcal{X}\to[0,1]$ if the following conditions hold.
i.
There exists an $N\in\mathbb{N}$ for which $\mathcal{X}_n = \mathcal{X}$ for all $n\geq N$.
ii.
For all $x\in\mathcal{X}$ we have $\lim_{n\to\infty} p_n(x) = p(x)$, i.e., $p_n\to p$ pointwise.
In such a case, we write $X_n\to X$. If $(X_n,Y_n)\in\mathrm{FRV}(\Omega)\times\mathrm{FRV}(\Omega)$ is a sequence of pairs of random variables, then $(X_n,Y_n)$ is said to weakly converge to $(X,Y)\in\mathrm{FRV}(\Omega)\times\mathrm{FRV}(\Omega)$ if $P(X_n,Y_n)\to P(X,Y)$.
Proposition 6.
Shannon entropy is continuous, i.e., if $X_n\to X$, then
$H(X) = \lim_{n\to\infty} H(X_n).$
Proof. 
This result is standard, see, e.g., [3] or [11]. □
Proposition 7.
Mutual information is continuous, i.e., if $(X_n,Y_n)\to(X,Y)$, then
$I(X,Y) = \lim_{n\to\infty} I(X_n,Y_n).$
Proof. 
Suppose $(X_n,Y_n)\to(X,Y)$, so that $X_n\to X$, $Y_n\to Y$ and $H(X_n,Y_n)\to H(X,Y)$. We then have
$I(X,Y) = H(X) + H(Y) - H(X,Y) = \lim_{n\to\infty}\big(H(X_n) + H(Y_n) - H(X_n,Y_n)\big) = \lim_{n\to\infty} I(X_n,Y_n),$
as desired. □

5. Markov Triangles

In this section, we define the notion of a Markov triangle, a concept based on the notion of a “coalescable” composition of communication channels which was introduced in [13]. Such a notion will be crucial for our characterization of mutual information.
Definition 7.
Let $X\in\mathrm{FRV}(\Omega)$ be a random variable with probability mass function $p:\mathcal{X}\to[0,1]$, and let $x\in\mathcal{X}$. Then for any random variable $Y\in\mathrm{FRV}(\Omega)$, the conditional distribution function of $Y$ given $X=x$ is the function $q_x:\mathcal{Y}\to[0,1]$ given by
$q_x(y) = \begin{cases} \vartheta(x,y)/p(x) & \text{if } p(x)\neq 0 \\ 0 & \text{otherwise.} \end{cases}$
From here on, the value $q_x(y)$ will be denoted $q(y|x)$. The conditional entropy of $Y$ given $X$ is the non-negative real number $H(Y|X)$ given by
$H(Y|X) = \sum_{x\in\mathcal{X}} p(x)H(q_x),$
where $H(q_x)$ is the Shannon entropy of the distribution $q_x$ on $\mathcal{Y}$.
Proposition 8.
Let $(X,Y)$ be a pair of random variables. Then
$I(X,Y) = I(Y,Y) - H(Y|X). \qquad (4)$
Proof. 
Since $I(Y,Y) = H(Y)$, the statement follows from the well-known fact that $I(X,Y) = H(Y) - H(Y|X)$, the proof of which may be found in any information theory text (e.g., [17]). □
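As a small numerical check of Proposition 8 (with a hypothetical joint distribution chosen only for illustration), one can compute $H(Y|X)$ directly from Definition 7 and compare $I(X,Y)$ with $H(Y) - H(Y|X)$.

import numpy as np

def H(p):
    # Shannon entropy of a probability array, with the convention 0*log(0) = 0.
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint distribution theta(x, y) of a pair (X, Y).
theta = np.array([[0.3, 0.1],
                  [0.2, 0.4]])
p_X, p_Y = theta.sum(axis=1), theta.sum(axis=0)

# H(Y|X) = sum_x p(x) H(q_x), where q_x(y) = theta(x, y) / p(x).
H_Y_given_X = sum(p_X[x] * H(theta[x] / p_X[x]) for x in range(len(p_X)))

I_XY = H(p_X) + H(p_Y) - H(theta)                # Definition 2
assert np.isclose(I_XY, H(p_Y) - H_Y_given_X)    # Proposition 8, with I(Y,Y) = H(Y)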
Definition 8.
Let $(X,Y,Z)$ be a triple of random variables with supports $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$ respectively, and let $q(y|x)$, $p(z|y)$ and $r(z|x)$ denote the associated conditional distribution functions. Then $(X,Y,Z)$ is said to form a Markov triangle if there exists a function $h:\mathcal{Z}\times\mathcal{X}\to\mathcal{Y}$ such that for all $(z,x)\in\mathcal{Z}\times\mathcal{X}$ we have
$r(z|x) = p\big(z\,|\,h(z,x)\big)\,q\big(h(z,x)\,|\,x\big).$
In such a case, $h$ is said to be a mediator function for the triple $(X,Y,Z)$.
Remark 2.
A Markov triangle $(X,Y,Z)$ with supports $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{Z}$ may be thought of as a composition of noisy channels $\mathcal{X}\xrightarrow{f}\mathcal{Y}\xrightarrow{g}\mathcal{Z}$ such that if $z\in\mathcal{Z}$ is the output of the channel $g\circ f$, and one is given the information that the associated input was $x\in\mathcal{X}$, then the output at the intermediary stage $\mathcal{Y}$ was necessarily $y = h(z,x)$ (where $h$ is the associated mediator function). In other words, if $P$ is a symbol for general probabilities, then
$P(x,z\,|\,y) = P(x\,|\,y)\,P(z\,|\,y),$
thus the Markov triangle condition says that $X$ and $Z$ are conditionally independent given $Y$. As compositions of deterministic channels always satisfy this property, Markov triangles are a generalization of compositions of deterministic channels. While Markov triangles play a crucial role in our characterization of mutual information and also the characterizations of conditional entropy and information loss in [13], their broader significance in the study of information measures has yet to be determined.
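To make the deterministic case concrete, the following sketch (with arbitrarily chosen toy alphabets and functions) verifies the mediator condition of Definition 8 for a composition of two deterministic channels, where the mediator is simply $h(z,x) = f(x)$.

import numpy as np

# Toy deterministic channels f : X -> Y and g : Y -> Z, chosen for illustration.
X_alph, Y_alph, Z_alph = [0, 1, 2], [0, 1], [0, 1]
f = lambda x: x % 2
g = lambda y: 1 - y

# Conditional distributions as matrices: q(y|x), p(z|y), and r(z|x) for g o f.
q = np.array([[1.0 if f(x) == y else 0.0 for y in Y_alph] for x in X_alph])
p = np.array([[1.0 if g(y) == z else 0.0 for z in Z_alph] for y in Y_alph])
r = q @ p

h = lambda z, x: f(x)        # candidate mediator function
for x in X_alph:
    for z in Z_alph:
        # Markov triangle condition: r(z|x) = p(z | h(z,x)) * q(h(z,x) | x).
        assert np.isclose(r[x, z], p[h(z, x), z] * q[x, h(z, x)])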
Proposition 9.
Suppose $(X,Y,Z)$ is a Markov triangle. Then
$I(X,Z) = I(X,Y) + I(Y,Z) - I(Y,Y).$
In particular, $I(X,Z)\leq I(X,Y) + I(Y,Z)$.
Before giving a proof of Proposition 9, we first need the following lemma.
Lemma 1.
Suppose $(X,Y,Z)$ is a Markov triangle. Then
$H(Z|X) = H(Z|Y) + H(Y|X). \qquad (5)$
Proof. 
The statement is simply a reformulation of Theorem 2 in [13]. □
Proof of Proposition 9.
Suppose $(X,Y,Z)$ is a Markov triangle. Then
$I(X,Z) \overset{(4)}{=} I(Z,Z) - H(Z|X) \overset{(5)}{=} I(Z,Z) - \big(H(Z|Y) + H(Y|X)\big) = \big(I(Y,Y) - H(Y|X)\big) + \big(I(Z,Z) - H(Z|Y)\big) - I(Y,Y) \overset{(4)}{=} I(X,Y) + I(Y,Z) - I(Y,Y),$
as desired. □
Proposition 10.
Let $X, Y\in\mathrm{FRV}(\Omega)$ be random variables with probability mass functions $p:\mathcal{X}\to[0,1]$ and $q:\mathcal{Y}\to[0,1]$ respectively. Then the following statements hold.
i.
The triple $\big(X, P(X,Y), Y\big)$ is a Markov triangle.
ii.
If $f:\mathcal{X}\to\mathcal{X}'$ is a bijection, then the triple $(X, f\circ X, Y)$ is a Markov triangle.
iii.
If $g:\mathcal{Y}\to\mathcal{Y}'$ is a bijection, then the triple $(X, Y, g\circ Y)$ is a Markov triangle.
Proof. 
i.
Let $r(y|x)$ be the conditional distribution associated with $(X,Y)$, let $p\big(y\,|\,(\tilde{x},\tilde{y})\big)$ be the conditional distribution associated with $\big(P(X,Y), Y\big)$, and let $q\big((\tilde{x},\tilde{y})\,|\,x\big)$ be the conditional distribution function associated with $\big(X, P(X,Y)\big)$. Then for all $y\in\mathcal{Y}$ and $x\in\mathcal{X}$ we have
$r(y|x) = \sum_{(\tilde{x},\tilde{y})\in\mathcal{X}\times\mathcal{Y}} p\big(y\,|\,(\tilde{x},\tilde{y})\big)\,q\big((\tilde{x},\tilde{y})\,|\,x\big) = p\big(y\,|\,(x,y)\big)\,q\big((x,y)\,|\,x\big),$
where the second equality comes from the fact that $p\big(y\,|\,(\tilde{x},\tilde{y})\big) = 0$ unless $\tilde{y} = y$ and $q\big((\tilde{x},\tilde{y})\,|\,x\big) = 0$ unless $\tilde{x} = x$. It then follows that the function $h:\mathcal{Y}\times\mathcal{X}\to\mathcal{X}\times\mathcal{Y}$ given by $h(y,x) = (x,y)$ is a mediator function for $\big(X, P(X,Y), Y\big)$, thus $\big(X, P(X,Y), Y\big)$ is a Markov triangle.
ii.
Let $r(y|x)$ be the conditional distribution associated with $(X,Y)$, let $p(y\,|\,x')$ be the conditional distribution associated with $(f\circ X, Y)$, and let $q(x'\,|\,x)$ be the conditional distribution function associated with $(X, f\circ X)$. Then for all $y\in\mathcal{Y}$ and $x\in\mathcal{X}$ we have
$r(y|x) = \sum_{x'\in\mathcal{X}'} p(y\,|\,x')\,q(x'\,|\,x) = p\big(y\,|\,f(x)\big)\,q\big(f(x)\,|\,x\big),$
where the second equality comes from the fact that $q(x'\,|\,x) = 0$ unless $x' = f(x)$. It then follows that the function $h:\mathcal{Y}\times\mathcal{X}\to\mathcal{X}'$ given by $h(y,x) = f(x)$ is a mediator function for $(X, f\circ X, Y)$, thus $(X, f\circ X, Y)$ is a Markov triangle.
iii.
Let $r(y'\,|\,x)$ be the conditional distribution associated with $(X, g\circ Y)$, let $p(y'\,|\,y)$ be the conditional distribution associated with $(Y, g\circ Y)$, and let $q(y\,|\,x)$ be the conditional distribution associated with $(X,Y)$. Then for all $y'\in\mathcal{Y}'$ and $x\in\mathcal{X}$ we have
$r(y'\,|\,x) = \sum_{y\in\mathcal{Y}} p(y'\,|\,y)\,q(y\,|\,x) = p\big(y'\,|\,g^{-1}(y')\big)\,q\big(g^{-1}(y')\,|\,x\big),$
where the second equality comes from the fact that $p(y'\,|\,y) = 0$ unless $y = g^{-1}(y')$. It then follows that the function $h:\mathcal{Y}'\times\mathcal{X}\to\mathcal{Y}$ given by $h(y',x) = g^{-1}(y')$ is a mediator function for $(X, Y, g\circ Y)$, thus $(X, Y, g\circ Y)$ is a Markov triangle. □
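As a numerical illustration of Propositions 9 and 10 (again with a hypothetical joint distribution chosen only for the example), one can compute each mutual information in the weak functoriality identity for the Markov triangle $(X, P(X,Y), Y)$ directly from the relevant joint distributions.

import numpy as np

def H(p):
    # Shannon entropy of a probability array, with the convention 0*log(0) = 0.
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def I(joint):
    # Mutual information of a pair from its joint distribution matrix.
    return H(joint.sum(axis=1)) + H(joint.sum(axis=0)) - H(joint)

# Hypothetical joint distribution theta of a pair (X, Y).
theta = np.array([[0.3, 0.1],
                  [0.2, 0.4]])
nx, ny = theta.shape

# Joint distribution of (X, P(X,Y)): mass theta(x,y) at row x, column (x,y);
# joint distribution of (P(X,Y), Y): mass theta(x,y) at row (x,y), column y.
J_X_P = np.zeros((nx, nx * ny))
J_P_Y = np.zeros((nx * ny, ny))
for x in range(nx):
    for y in range(ny):
        J_X_P[x, x * ny + y] = theta[x, y]
        J_P_Y[x * ny + y, y] = theta[x, y]

# Proposition 9 on the Markov triangle (X, P(X,Y), Y), using I(P(X,Y), P(X,Y)) = H(X,Y):
lhs = I(theta)                         # I(X, Y)
rhs = I(J_X_P) + I(J_P_Y) - H(theta)
assert np.isclose(lhs, rhs)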

6. Characterization Theorem

We now state and prove our characterization theorem for mutual information.
Definition 9.
Let $(\Omega,\Sigma,\mu)$ and $(\Omega',\Sigma',\mu')$ be probability spaces. A map $\pi:\Omega'\to\Omega$ is said to be measure-preserving if for all $\sigma\in\Sigma$ we have $\pi^{-1}(\sigma)\in\Sigma'$ and
$\mu'\big(\pi^{-1}(\sigma)\big) = \mu(\sigma).$
Definition 10.
Let F be a map that sends pairs of random variables to the real numbers.
  • F is said to be continuous if
    $F(X,Y) = \lim_{n\to\infty} F(X_n,Y_n) \qquad (6)$
    whenever $(X_n,Y_n)\to(X,Y)$.
  • F is said to be strongly additive if given a random variable $X$ with probability mass function $p:\mathcal{X}\to[0,1]$, and a collection of pairs of random variables $(Y_x,Z_x)$ indexed by $\mathcal{X}$, then
    $F\Big(\sum_{x\in\mathcal{X}} p(x)(Y_x,Z_x)\Big) = F(X,X) + \sum_{x\in\mathcal{X}} p(x)F(Y_x,Z_x). \qquad (7)$
  • F is said to be symmetric if $F(X,Y) = F(Y,X)$ for every pair of random variables $(X,Y)$.
  • F is said to be invariant under pullbacks if for every pair of random variables $(X,Y)\in\mathrm{FRV}(\Omega)\times\mathrm{FRV}(\Omega)$ and every measure-preserving map $\pi:\Omega'\to\Omega$ we have
    $F(X,Y) = F(X\circ\pi,\, Y\circ\pi). \qquad (8)$
  • F is said to be weakly functorial if for every Markov triangle $(X,Y,Z)$ we have
    $F(X,Z) = F(X,Y) + F(Y,Z) - F(Y,Y). \qquad (9)$
Remark 3.
The terminology “weakly functorial” comes from viewing (9) from a category-theoretic perspective. In particular, with a pair of random variables $(X,Y)$ one may associate a noisy channel $\mathcal{X}\xrightarrow{f}\mathcal{Y}$ where $\mathcal{X} = \mathrm{Supp}(X)$ and $\mathcal{Y} = \mathrm{Supp}(Y)$, so that a Markov triangle $(X,Y,Z)$ then corresponds to a composition $\mathcal{X}\xrightarrow{f}\mathcal{Y}\xrightarrow{g}\mathcal{Z}$ with $\mathcal{Z} = \mathrm{Supp}(Z)$. If $\mathbf{FinPS}$ denotes the category of noisy channels and $B\mathbb{R}$ denotes the category with one object whose morphisms are the real numbers (with composition corresponding to addition), then a map $F:\mathbf{FinPS}\to B\mathbb{R}$ is a functor if
$F(g\circ f) = F(g) + F(f). \qquad (10)$
Rewriting (10) in terms of the pairs of random variables with which the morphisms $f$, $g$, and $g\circ f$ are associated, the functoriality condition (10) reads
$F(X,Z) = F(X,Y) + F(Y,Z),$
thus the condition $F(X,Z) = F(X,Y) + F(Y,Z) - F(Y,Y)$ is a weaker form of functoriality. For more on information measures from a category-theoretic perspective, see [11,12,13,14].
Theorem 1
(Axiomatic Characterization of Mutual Information). Let F be a map that sends pairs of random variables to the non-negative real numbers, and suppose F satisfies the following conditions.
1.
F is continuous.
2.
F is strongly additive.
3.
F is symmetric.
4.
F is weakly functorial.
5.
F is invariant under pullbacks.
6.
$F(X,C) = 0$ for every constant random variable $C$.
Then F is a non-negative multiple of mutual information. Conversely, mutual information satisfies conditions 1–6.
Before giving a proof, we first need several lemmas. The first lemma states that a map $F$ on pairs of random variables which is continuous and invariant under pullbacks depends only on the underlying probability mass functions of the random variables.
Lemma 2.
Let $F$ be a map from pairs of random variables to the real numbers which is continuous and invariant under pullbacks, and suppose $(X,Y)\in\mathrm{FRV}(\Omega)\times\mathrm{FRV}(\Omega)$ and $(X',Y')\in\mathrm{FRV}(\Omega')\times\mathrm{FRV}(\Omega')$ are such that the associated joint distribution functions $\vartheta:\mathcal{X}\times\mathcal{Y}\to[0,1]$ and $\vartheta':\mathcal{X}\times\mathcal{Y}\to[0,1]$ are equal. Then $F(X,Y) = F(X',Y')$.
Proof. 
Let $\pi:\Omega\times\Omega'\to\Omega$ and $\pi':\Omega\times\Omega'\to\Omega'$ be the natural projections. Since both of the natural projections are measure-preserving, we have $F(X,Y) = F(X\circ\pi,\, Y\circ\pi)$ and $F(X',Y') = F(X'\circ\pi',\, Y'\circ\pi')$, and moreover, from the assumption that $\vartheta = \vartheta'$ it follows that the joint distribution functions associated with $(X\circ\pi,\, Y\circ\pi)$ and $(X'\circ\pi',\, Y'\circ\pi')$ are equal. It then follows that if $(X_n,Y_n)$ is the constant sequence given by $X_n = X'\circ\pi'$ and $Y_n = Y'\circ\pi'$ for all $n\in\mathbb{N}$, then $(X_n,Y_n)\to(X\circ\pi,\, Y\circ\pi)$ (since $P(X_n,Y_n)\to P(X\circ\pi,\, Y\circ\pi)$). We then have
$F(X,Y) \overset{(8)}{=} F(X\circ\pi,\, Y\circ\pi) \overset{(6)}{=} \lim_{n\to\infty} F(X_n,Y_n) = F(X'\circ\pi',\, Y'\circ\pi') \overset{(8)}{=} F(X',Y'),$
as desired. □
Lemma 3.
Let $X$ be a random variable with probability mass function $p:\mathcal{X}\to[0,1]$, let $f:\mathcal{X}\to\mathcal{Y}$ be a bijection, and suppose $C$ is a constant random variable. Then the following statements hold.
i.
The triples $(X, f\circ X, C)$ and $(f\circ X, X, C)$ are both Markov triangles.
ii.
Let $F$ be a map that sends pairs of random variables to real numbers, and suppose $F$ is symmetric and weakly functorial. Then
$F(X,C) - F(f\circ X, C) + F(f\circ X, f\circ X) = F(f\circ X, C) - F(X,C) + F(X,X). \qquad (11)$
Proof. 
i.
The statement follows from item ii of Proposition 10.
ii.
By item i, the triples $(X, f\circ X, C)$ and $(f\circ X, X, C)$ are both Markov triangles, thus the weak functoriality of $F$ yields
$F(X,C) = F(X, f\circ X) + F(f\circ X, C) - F(f\circ X, f\circ X), \qquad (12)$
and
$F(f\circ X, C) = F(f\circ X, X) + F(X,C) - F(X,X). \qquad (13)$
And since $F$ is symmetric, $F(X, f\circ X) = F(f\circ X, X)$, thus Equations (12) and (13) imply Equation (11), as desired. □
The next lemma is Baez, Fritz, and Leinster’s reformulation of Faddeev’s characterization of Shannon entropy [3], which they use in their characterization of the information loss associated with a deterministic mapping [11]. This lemma will allow us to relate $F(X,X)$ to the Shannon entropy $H(X)$.
Lemma 4.
Let S be a map that sends finite probability distributions to the non-negative real numbers, and suppose S satisfies the following conditions.
i.
$S$ is continuous, i.e., if $p_n:\mathcal{X}\to[0,1]$ is a convergent sequence of probability distributions on a finite set $\mathcal{X}$ (i.e., if $\lim_{n\to\infty} p_n(x)$ exists for all $x\in\mathcal{X}$), then
$S\big(\lim_{n\to\infty} p_n\big) = \lim_{n\to\infty} S(p_n).$
ii.
$S(1) = 0$ for the distribution $1:\{\ast\}\to[0,1]$.
iii.
If $q:\mathcal{Y}\to[0,1]$ is a probability distribution on a finite set $\mathcal{Y}$ and $f:\mathcal{X}\to\mathcal{Y}$ is a bijection, then $S(q) = S(q\circ f)$.
iv.
If $p:\mathcal{X}\to[0,1]$ is a probability distribution on a finite set $\mathcal{X}$, and $q_x:\mathcal{Y}_x\to[0,1]$ is a collection of finite probability distributions indexed by $\mathcal{X}$, then
$S\Big(\sum_{x\in\mathcal{X}} p(x)q_x\Big) = S(p) + \sum_{x\in\mathcal{X}} p(x)S(q_x),$
where $\sum_{x\in\mathcal{X}} p(x)q_x : \bigsqcup_{x\in\mathcal{X}}\mathcal{Y}_x\to[0,1]$ is the finite distribution given by $\big(\sum_{x\in\mathcal{X}} p(x)q_x\big)(\tilde{x}, y_{\tilde{x}}) = p(\tilde{x})q_{\tilde{x}}(y_{\tilde{x}})$.
Then S is a non-negative multiple of Shannon entropy.
Lemma 5.
Let F be a map that sends pairs of random variables to the non-negative real numbers satisfying conditions 1–6 of Theorem 1, and let E be the map on random variables given by
$E(X) = F(X,X).$
Then E is a non-negative multiple of Shannon entropy.
Proof. 
Let $\phi$ be the map that takes a random variable to its probability mass function, let $\sigma$ be a section of $\phi$ (so that $\phi\circ\sigma$ is the identity), and let $S = E\circ\sigma$. Since $F$ is invariant under pullbacks (condition 5 of Theorem 1), Lemma 2 holds, thus the map $S$ is independent of the choice of a section $\sigma$ of $\phi$, and as such, it follows that $E = S\circ\phi$. We now show that $S$ satisfies items i–iv of Lemma 4, which then implies $E(X)$ is a non-negative multiple of the Shannon entropy $H(X)$.
i.
Let $p_n:\mathcal{X}\to[0,1]$ be a sequence of probability distributions on a finite set $\mathcal{X}$, and suppose $\lim_{n\to\infty} p_n = p$. It then follows that $X_n = \sigma(p_n)$ weakly converges to $X = \sigma(p)$, thus
$S\big(\lim_{n\to\infty} p_n\big) = S(p) = (E\circ\sigma)(p) = E(X) = F(X,X) = \lim_{n\to\infty} F(X_n,X_n) = \lim_{n\to\infty} E(X_n) = \lim_{n\to\infty} E(\sigma(p_n)) = \lim_{n\to\infty} S(p_n),$
where the fifth equality follows from the continuity assumption on $F$ (condition 1 of Theorem 1).
ii.
Let $1:\{\ast\}\to[0,1]$ be a point mass distribution, so that $\sigma(1) = C$ with $C$ a constant random variable. Then $S(1) = E(\sigma(1)) = E(C) = F(C,C) = 0$, where the last equality follows from condition 6 of Theorem 1, i.e., that $F(X,C) = 0$ for every constant random variable $C$.
iii.
Let $X$ be a random variable with probability mass function $p:\mathcal{X}\to[0,1]$, and suppose $f:\mathcal{X}\to\mathcal{Y}$ is a bijection. Since $F$ is symmetric and weakly functorial (conditions 3 and 4 of Theorem 1), the hypotheses of item ii of Lemma 3 are satisfied, so that Equation (11) holds, i.e., for any constant random variable $C$ we have
$F(X,C) - F(f\circ X, C) + F(f\circ X, f\circ X) = F(f\circ X, C) - F(X,C) + F(X,X).$
And since $F(X,C) = F(f\circ X, C) = 0$ by condition 6 of Theorem 1, it follows that $F(X,X) = F(f\circ X, f\circ X)$. Now let $q:\mathcal{Y}\to[0,1]$ be the probability mass function of $f\circ X$, so that $q = p\circ f^{-1}$. We then have
$S(p) = E(X) = F(X,X) = F(f\circ X, f\circ X) = E(f\circ X) = S(q) = S(p\circ f^{-1}),$
thus $S$ satisfies item iii of Faddeev’s Theorem.
iv.
Let $X$ be a random variable with probability mass function $p:\mathcal{X}\to[0,1]$, let $Y_x$ be a collection of random variables indexed by $\mathcal{X}$, and let $q_x:\mathcal{Y}_x\to[0,1]$ be the associated probability mass functions for all $x\in\mathcal{X}$. Then $\sum_{x\in\mathcal{X}} p(x)Y_x$ has probability mass function $\sum_{x\in\mathcal{X}} p(x)q_x$, thus
$S\Big(\sum_{x\in\mathcal{X}} p(x)q_x\Big) = E\Big(\sum_{x\in\mathcal{X}} p(x)Y_x\Big) = F\Big(\sum_{x\in\mathcal{X}} p(x)Y_x,\ \sum_{x\in\mathcal{X}} p(x)Y_x\Big) \overset{(3)}{=} F\Big(\sum_{x\in\mathcal{X}} p(x)(Y_x,Y_x)\Big) = F(X,X) + \sum_{x\in\mathcal{X}} p(x)F(Y_x,Y_x) = E(X) + \sum_{x\in\mathcal{X}} p(x)E(Y_x) = S(p) + \sum_{x\in\mathcal{X}} p(x)S(q_x),$
where the fourth equality follows from the strong additivity of $F$, i.e., condition 2 of Theorem 1. It then follows that $S$ satisfies item iv of Faddeev’s Theorem, as desired. □
The next lemma is the analog of property iii of Lemma 4 for information measures on pairs of random variables.
Lemma 6.
Let $X, Y\in\mathrm{FRV}(\Omega)$ be random variables with probability mass functions $p:\mathcal{X}\to[0,1]$ and $q:\mathcal{Y}\to[0,1]$ respectively, and suppose $F$ is a map from pairs of random variables to the real numbers which is symmetric, weakly functorial, and such that $F(X,C) = 0$ for every constant random variable $C$. If $f:\mathcal{X}\to\mathcal{X}'$ and $g:\mathcal{Y}\to\mathcal{Y}'$ are bijections, then
$F(X,Y) = F(f\circ X, g\circ Y). \qquad (14)$
Proof. 
Since $f$ is a bijection, $(X, f\circ X, X)$ is a Markov triangle by item ii of Proposition 10, thus
$F(X,X) = F(X, f\circ X) + F(f\circ X, X) - F(f\circ X, f\circ X). \qquad (15)$
From the proof of Lemma 5 it follows that if $F$ is weakly functorial, symmetric and $F(X,C) = 0$ for every constant random variable $C$, then $F(f\circ X, f\circ X) = F(X,X)$. Moreover, by the symmetry of $F$ we have $F(X, f\circ X) = F(f\circ X, X)$, thus Equation (15) implies $F(X,X) = F(X, f\circ X)$.
Now consider the triples $(f\circ X, X, g\circ Y)$ and $(X, Y, g\circ Y)$, which are both Markov triangles by items ii and iii of Proposition 10. The weak functoriality assumption on $F$ then yields
$F(f\circ X, g\circ Y) = F(f\circ X, X) + F(X, g\circ Y) - F(X,X) = F(f\circ X, X) + F(X,Y) + F(Y, g\circ Y) - F(Y,Y) - F(X,X),$
and since $F(f\circ X, X) = F(X,X)$ and $F(Y, g\circ Y) = F(Y,Y)$, it follows that $F(X,Y) = F(f\circ X, g\circ Y)$, as desired. □
The next lemma, together with the fact that $(X, P(X,Y), Y)$ is a Markov triangle (by Proposition 10), is the crux of the proof, as we will soon see.
Lemma 7.
Let $F$ be a map from pairs of random variables to the real numbers satisfying conditions 1–6 of Theorem 1, and let $(X,Y)$ be a pair of random variables. Then $F\big(X, P(X,Y)\big) = F(X,X)$ and $F\big(P(X,Y), Y\big) = F(Y,Y)$.
Proof. 
Let $p:\mathcal{X}\to[0,1]$ and $q:\mathcal{Y}\to[0,1]$ be the probability mass functions of $X$ and $Y$ respectively, and for all $x\in\mathcal{X}$, let $Y_x$ be a random variable with probability mass function $q_x:\mathcal{Y}\to[0,1]$ given by $q_x(y) = q(y|x)$, so that $q_x$ is the conditional distribution of $Y$ given $X=x$. By pulling back to larger sample spaces if necessary, we can assume without loss of generality that each $Y_x\in\mathrm{FRV}(\Omega)$ for some fixed $\Omega$. We also let $C_x\in\mathrm{FRV}(\Omega)$ be the constant random variable supported on $\{x\}$ for all $x\in\mathcal{X}$, we let $f:\bigsqcup_{x\in\mathcal{X}}\{x\}\to\mathcal{X}$ and $g:\bigsqcup_{x\in\mathcal{X}}\mathcal{Y}\to\mathcal{X}\times\mathcal{Y}$ be the canonical bijections, and we let $\pi:\mathcal{X}\times\Omega\to\Omega$ be the natural projection. It then follows that $f\circ\sum_{x\in\mathcal{X}} p(x)C_x$ and $X\circ\pi$ both have probability mass function $p:\mathcal{X}\to[0,1]$, and also, that $g\circ\sum_{x\in\mathcal{X}} p(x)Y_x$ and $P(X,Y)\circ\pi$ both have probability mass function equal to the joint distribution function $\vartheta:\mathcal{X}\times\mathcal{Y}\to[0,1]$ associated with $(X,Y)$, thus Lemma 2 yields
$F\big(X\circ\pi,\, P(X,Y)\circ\pi\big) = F\Big(f\circ\sum_{x\in\mathcal{X}} p(x)C_x,\ g\circ\sum_{x\in\mathcal{X}} p(x)Y_x\Big). \qquad (16)$
We then have
$F\big(X, P(X,Y)\big) \overset{(8)}{=} F\big(X\circ\pi,\, P(X,Y)\circ\pi\big) \overset{(16)}{=} F\Big(f\circ\sum_{x\in\mathcal{X}} p(x)C_x,\ g\circ\sum_{x\in\mathcal{X}} p(x)Y_x\Big) \overset{(14)}{=} F\Big(\sum_{x\in\mathcal{X}} p(x)C_x,\ \sum_{x\in\mathcal{X}} p(x)Y_x\Big) \overset{(3)}{=} F\Big(\sum_{x\in\mathcal{X}} p(x)(C_x, Y_x)\Big) \overset{(7)}{=} F(X,X) + \sum_{x\in\mathcal{X}} p(x)F(C_x, Y_x) = F(X,X),$
where the last equality follows from the fact that $F(C,X) = 0$ for every constant random variable $C$ (since $F$ is symmetric and $F(X,C) = 0$ for every constant random variable $C$).
As for $F\big(P(X,Y), Y\big)$, first note that $F\big(Y, P(Y,X)\big) = F(Y,Y)$ by what we have just proved. We then have
$F\big(P(X,Y), Y\big) = F\big(Y, P(X,Y)\big) = F\big(Y, P(Y,X)\big) = F(Y,Y),$
where the first and second equalities follow from symmetry and invariance under pullbacks. □
Proof of Theorem 1.
Suppose $F$ is a map from pairs of random variables to the non-negative real numbers satisfying conditions 1–6 of Theorem 1. According to Lemma 5, there exists a constant $c\geq 0$ such that $F(X,X) = cH(X)$ for all random variables $X$. Now let $(X,Y)$ be an arbitrary pair of random variables. According to Proposition 10, the triple $\big(X, P(X,Y), Y\big)$ is a Markov triangle, thus
$F(X,Y) \overset{(9)}{=} F\big(X, P(X,Y)\big) + F\big(P(X,Y), Y\big) - F\big(P(X,Y), P(X,Y)\big) = F(X,X) + F(Y,Y) - cH\big(P(X,Y)\big) = cH(X) + cH(Y) - cH(X,Y) = cI(X,Y),$
where the second equality follows from Lemma 7 and Lemma 5, and the third equality follows from Lemma 5 and item i of Proposition 2, thus $F$ is a non-negative multiple of mutual information.
Conversely, mutual information satisfies condition 1 by Proposition 7, condition 2 by Proposition 5, condition 3 by item ii of Proposition 1, condition 4 by Proposition 9, condition 5 by the fact that mutual information only depends on probabilities, and condition 6 by item iv of Proposition 1. □

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We thank Arthur J. Parzygnat for many useful discussions.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656.
  2. Aczél, J.; Daróczy, Z. On Measures of Information and Their Characterizations; Mathematics in Science and Engineering; Academic Press: New York, NY, USA; London, UK, 1975; Volume 115.
  3. Faddeev, D.K. On the concept of entropy of a finite probabilistic scheme. Uspehi Mat. Nauk (N.S.) 1956, 11, 227–231.
  4. Furuichi, S. On uniqueness theorems for Tsallis entropy and Tsallis relative entropy. IEEE Trans. Inform. Theory 2005, 51, 3638–3645.
  5. Rényi, A. On measures of entropy and information. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561.
  6. Petz, D. Characterization of the relative entropy of states of matrix algebras. Acta Math. Hung. 1992, 59, 449–455.
  7. Ohya, M.; Petz, D. Quantum Entropy and Its Use; Texts and Monographs in Physics; Springer: Berlin/Heidelberg, Germany, 1993.
  8. Leinster, T. A short characterization of relative entropy. J. Math. Phys. 2019, 60, 023302.
  9. Ebanks, B.R.; Kannappan, P.; Sahoo, P.K.; Sander, W. Characterizations of sum form information measures on open domains. Aequationes Math. 1997, 54, 1–30.
  10. Csiszár, I. Axiomatic characterizations of information measures. Entropy 2008, 10, 261–273.
  11. Baez, J.C.; Fritz, T.; Leinster, T. A characterization of entropy in terms of information loss. Entropy 2011, 13, 1945–1957.
  12. Baez, J.C.; Fritz, T. A Bayesian characterization of relative entropy. Theory Appl. Categ. 2014, 29, 422–457.
  13. Fullwood, J.; Parzygnat, A.J. The information loss of a stochastic map. Entropy 2021, 23, 1021.
  14. Parzygnat, A.J. A functorial characterization of von Neumann entropy. Cah. Topol. Géom. Différ. Catég. 2022, 63, 89–128.
  15. Leinster, T. Entropy and Diversity: The Axiomatic Approach; Cambridge University Press: Cambridge, UK, 2021.
  16. Romashchenko, A.; Zimand, M. An operational characterization of mutual information in algorithmic information theory. J. ACM 2019, 66, 42.
  17. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley Series in Telecommunications and Signal Processing; Wiley: Hoboken, NJ, USA, 2006.