Article

The Design of Global Correlation Quantifiers and Continuous Notions of Statistical Sufficiency

by Nicholas Carrara 1,* and Kevin Vanslette 2,*
1 Department of Physics, University at Albany, Albany, NY 12222, USA
2 Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
* Authors to whom correspondence should be addressed.
Entropy 2020, 22(3), 357; https://doi.org/10.3390/e22030357
Submission received: 2 February 2020 / Revised: 17 March 2020 / Accepted: 17 March 2020 / Published: 19 March 2020

Abstract: Using first principles from inference, we design a set of functionals for the purposes of ranking joint probability distributions with respect to their correlations. Starting with a general functional, we impose its desired behavior through the Principle of Constant Correlations (PCC), which constrains the correlation functional to behave in a consistent way under statistically independent inferential transformations. The PCC guides us in choosing the appropriate design criteria for constructing the desired functionals. Since the derivations depend on a choice of partitioning the variable space into n disjoint subspaces, the general functional we design is the n-partite information (NPI), of which the total correlation and mutual information are special cases. Thus, these functionals are found to be uniquely capable of determining whether a certain class of inferential transformations, ρ → ρ′, preserve, destroy or create correlations. This provides conceptual clarity by ruling out other possible global correlation quantifiers. Finally, the derivation and results allow us to quantify non-binary notions of statistical sufficiency. Our results express what percentage of the correlations are preserved under a given inferential transformation or variable mapping.

1. Introduction

The goal of this paper is to quantify the notion of global correlations as it pertains to inductive inference. This is achieved by designing a set of functionals from first principles to rank entire probability distributions ρ according to their correlations. Because correlations are relationships defined between different subspaces of propositions (variables), the ranking of any distribution ρ , and hence the type of correlation functional one arrives at, depends on the particular choice of “split” or partitioning of the variable space. Each choice of “split” produces a unique functional for quantifying global correlations, which we call the n-partite information (NPI).
The term correlation may be defined colloquially as being a relation between two or more “things”. While we have a sense of what correlations are, how do we quantify this notion more precisely? If correlations have to do with “things” in the real world, are correlations themselves “real?” Can correlations be “physical?” One is forced to address similar questions in the context of designing the relative entropy as a tool for updating probability distributions in the presence of new information (e.g., “What is information?”) [1]. In the context of inference, correlations are broadly defined as being statistical relationships between propositions. In this paper we adopt the view that whatever correlations may be, their effect is to influence our beliefs about the natural world. Thus, they are interpreted as the information which constitutes statistical dependency. With this identification, the natural setting for the discussion becomes inductive inference.
When one has incomplete information, the tools one must use for reasoning objectively are probabilities [1,2]. The relationships between different propositions x and y are quantified by a joint probability density, p ( x , y ) = p ( x | y ) p ( y ) = p ( x ) p ( y | x ) , where the conditional distribution p ( y | x ) quantifies what one should believe about y given information about x, and vice-versa for p ( x | y ) . Intuitively, correlations should have something to do with these conditional dependencies.
In this paper, we seek to quantify a global amount of correlation for an entire probability distribution. That is, we desire a scalar functional I [ ρ ] for the purpose of ranking distributions ρ according to their correlations. Such functionals are not unique since many examples, e.g., covariance, correlation coefficient [3], distance correlation [4], mutual information [5], total correlation [6], maximal-information coefficient [7], etc., measure correlations in different ways. What we desire is a principled approach to designing a family of measures I [ ρ ] according to specific design criteria [8,9,10].
The idea of designing a functional for ranking probability distributions was first discussed in Skilling [9]. In his paper, Skilling designs the relative entropy as a tool for ranking posterior distributions, ρ , with respect to a prior, φ , in the presence of new information that comes in the form of constraints (15) (see Section 2.1.3 for details). The ability of the relative entropy to provide a ranking of posterior distributions allows one to choose the posterior that is closest to the prior while still incorporating the new information that is provided by the constraints. Thus, one can choose to update the prior in the most minimalist way possible. This feature is part of the overall objectivity that is incorporated into the design of relative entropy and in later versions is stated as the guiding principle [11,12,13].
Like relative entropy, we desire a method for ranking joint distributions with respect to their correlations. Whatever value our desired quantifier I[ρ] gives for a particular distribution ρ, we expect that if we change ρ through some generic transformation (∗), ρ → ρ′ = ρ + δρ, then our quantifier also changes, I[ρ] → I[ρ′] = I[ρ] + δI, and that this change of I[ρ] reflects the change in the correlations, i.e., if ρ changes in a way that increases the correlations, then I[ρ] should also increase. Thus, our quantifier should be an increasing functional of the correlations, i.e., it should provide a ranking of ρ's.
The type of correlation functional I[ρ] one arrives at depends on a choice of the splits within the proposition space X, and thus the functional we seek is I[ρ] ≡ I[ρ, X]. For example, if one has a proposition space X = X_1 × ⋯ × X_N, consisting of N variables, then one must specify which correlations the functional I[ρ, X] should quantify. Do we wish to quantify how the variable X_1 is correlated with the other N − 1 variables? Or do we want to study the correlations between all of the variables? In our design derivation, each of these questions represents one of the extremal cases of the family of quantifiers I[ρ, X], the former being a bi-partite correlation (or mutual information) functional and the latter being a total correlation functional.
In the main design derivation we will focus on the case of total correlation, which is designed to quantify the correlations between every variable subspace X_i in a set of variables X = X_1 × ⋯ × X_N. We suggest a set of design criteria (DC) for the purpose of designing such a tool. These DC are guided by the Principle of Constant Correlations (PCC), which states that "the amount of correlations in ρ should not change unless required by the transformation, (ρ, X) → (ρ′, X′)." This implies our design derivation requires us to study equivalence classes of [ρ] within statistical manifolds Δ under the various transformations of distributions ρ that are typically performed in inference tasks. We will find, according to our design criteria, that the global quantifier of correlations we desire in this special case is equivalent to the total correlation [6].
Once one arrives at the TC as the solution to the design problem in this article, one can then derive special cases such as the mutual information [5] or, as we will call them, any n-partite information (NPI), which measures the correlations shared between generic n-partitions of the proposition space. The NPI and the mutual information (or bi-partite information) can be derived using the same principles as the TC except with one modification, as we will discuss in Section 5.
The special case of NPI when n = 2 is the bipartite (or mutual) information, which quantifies the amount of correlations present between two subsets of some proposition space X. Mutual information (MI) as a measure of correlation has a long history, beginning with Shannon's seminal work on communication theory [14] in which he first defines it. While Shannon provided arguments for the functional form of his entropy [14], he did not provide a derivation of MI. Despite this, there has still been no principled approach to the design of MI or of the total correlation TC. Recently however, there has been an interest in characterizing entropy through a category theoretic approach (see the works of Baez et al. [15]). The approach by Baez et al. shows that a particular class of functors from the category FinStat, which is a finite set equipped with a probability distribution, are scalar multiples of the entropy [15]. The papers by Baudot et al. [16,17,18] also take a category theoretical approach; however, their results are more focused on the topological properties of information theoretic quantities. Both Baez et al. and Baudot et al. discuss various information theoretic measures such as the relative entropy, mutual information, total correlation, and others.
The idea of designing a tool for the purpose of inference and information theory is not new. Beginning in [2], Cox showed that probabilities are the functions that are designed to quantify "reasonable expectation" [19], which Jaynes [20] and Caticha [10] have since improved upon as "degrees of rational belief". Inspired by the method of maximum entropy [20,21,22], there have been many improvements on the derivation of entropy as a tool designed for the purpose of updating probability distributions in the decades since Shannon [14]. Most notably they are by Shore and Johnson [8], Skilling [9], Caticha [11], and Vanslette [12,13]. The entropy functionals in [11,12,13] are designed to follow the Principle of Minimal Updating (PMU), which states, for the purpose of enforcing objectivity, that "a probability distribution should only be updated to the extent required by the new information." In these articles, information is defined operationally (∗) as that which induces the updating of the probability distributions, φ → ρ.
An important consequence of deriving the various NPI as tools for ranking is their immediate application to the notion of statistical sufficiency. Sufficiency is a concept that dates back to Fisher, and some would argue Laplace [23], both of whom were interested in finding statistics that contained all relevant information about a sample. Such statistics are called sufficient; however, this notion is only a binary label, so it does not quantify an amount of sufficiency. Using the result of our design derivation, we can propose a new definition of sufficiency in terms of a normalized NPI. Such a quantity gives a sense of how close a set of functions is to being sufficient statistics. This topic will be discussed in Section 6.
In Section 2 we will lay out some mathematical preliminaries and discuss the general transformations in statistical manifolds we are interested in. Then in Section 3, we will state and discuss the design criteria used to derive the functional form of TC and the NPI in general. In Section 4 we will complete the proof of the results from Section 3. In Section 5 we discuss the n-partite (NPI) special cases of TC of which the bipartite case is the mutual information, which is discussed in Section 5.2. In Section 6 we will discuss sufficiency and its relation to the Neyman-Pearson lemma [24]. It should be noted that throughout this article we will be using a probabilistic framework in which x ∈ X denotes propositions of a probability distribution rather than a statistical framework in which x denotes random numbers.

2. Mathematical Preliminaries

The arena of any inference task consists of two ingredients, the first of which is the subject matter, or what is often called the universe of discourse. This refers to the actual propositions that one is interested in making inferences about. Propositions tend to come in two classes, either discrete or continuous. Discrete proposition spaces will be denoted by calligraphic uppercase Latin letters, X, and the individual propositions will be lowercase Latin letters x_i ∈ X indexed by some variable i = {1, …, |X|}, where |X| is the number of distinct propositions in X. In this paper we will mostly work in the context of continuous propositions whose spaces will be denoted by bold faced uppercase Latin letters, X, and whose elements will simply be lowercase Latin letters with no indices, x ∈ X. Continuous proposition spaces have a much richer structure than discrete spaces (due to the existence of various differentiable structures, the ability to integrate, etc.) and help to generalize concepts such as relative entropy and information geometry [10,25,26] (Common examples of discrete proposition spaces are the results of a coin flip or a toss of a die, while an example of a continuous proposition space is the position of a particle [27].).
The second ingredient that one needs to define for general inference tasks is the space of models, or the space of probability distributions which one wishes to assign to the underlying proposition space. These spaces can often be given the structure of a manifold, which in the literature is called a statistical manifold [10]. A statistical manifold Δ is a manifold in which each point ρ ∈ Δ is an entire probability distribution, i.e., Δ is a space of maps from subsets of X to the interval [0, 1], ρ : P(X) → [0, 1]. The notation P(X) denotes the power set of X, which is the set of all subsets of X, and has cardinality equal to |P(X)| = 2^|X|.
In the simplest cases, when the underlying propositions are discrete, the manifold is finite dimensional. A common example that is used in the literature is the three-sided die, whose distribution is determined by three probability values ρ = {p_1, p_2, p_3}. Due to positivity, p_i ≥ 0, and the normalization constraint, ∑_i p_i = 1, the point ρ lives in the 2-simplex. Likewise, a generic discrete statistical manifold with n possible states is an (n − 1)-simplex. In the continuum limit, which is often the case explored in physics, the statistical manifold becomes infinite dimensional and is defined as (Throughout the rest of the paper, we use the Greek ρ to represent a generic distribution in Δ, and we use the Latin p(x) to refer to an individual density.),
Δ = { p(x) | p(x) ≥ 0, ∫ dx p(x) = 1 }.
When the statistical manifold is parameterized by the densities p ( x ) , the zeroes always lie on the boundary of the simplex. In this representation the statistical manifolds have a trivial topology; they are all simply connected. Without loss of generality, we assume that the statistical manifolds we are interested in can be represented as (1), so that Δ is simply connected and does not contain any holes. The space Δ in this representation is also smooth.
The symbol ρ defines what we call a state of knowledge about the underlying propositions X. It is, in essence, the quantification of our degrees of belief about each of the possible propositions x ∈ X [2]. The correlations present in any distribution ρ necessarily depend on the conditional relationships between various propositions. For instance, consider the binary case of just two proposition spaces X and Y, so that the joint distribution factors,
p ( x , y ) = p ( x ) p ( y | x ) = p ( y ) p ( x | y ) .
The correlations present in p ( x , y ) will necessarily depend on the form of p ( x | y ) and p ( y | x ) since the conditional relationships tell us how one variable is statistically dependent on the other. As we will see, the correlations defined in Equation (2) are quantified by the mutual information. For situations of many variables however, the global correlations are defined by the total correlation, which we will design first. All other measures which break up the joint space into conditional distributions (including (2)) are special cases of the total correlation.
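As a concrete illustration of the product rule in Equation (2), the following Python sketch (our addition, using an arbitrary 2 × 2 joint table rather than anything from the article) checks both factorizations numerically.

```python
import numpy as np

p_xy = np.array([[0.30, 0.10],           # rows index x, columns index y
                 [0.15, 0.45]])

p_x = p_xy.sum(axis=1)                   # p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)                   # p(y) = sum_x p(x, y)
p_y_given_x = p_xy / p_x[:, None]        # p(y|x) = p(x, y) / p(x)
p_x_given_y = p_xy / p_y[None, :]        # p(x|y) = p(x, y) / p(y)

# Both factorizations of Equation (2) reproduce the joint distribution.
assert np.allclose(p_xy, p_x[:, None] * p_y_given_x)
assert np.allclose(p_xy, p_y[None, :] * p_x_given_y)
```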

2.1. Some Classes of Inferential Transformations

There are four main types of transformations we will consider that one can enact on a state of knowledge, ρ → ρ′. They are: coordinate transformations, entropic updating (This of course includes Bayes rule as a special case [28,29]), marginalization, and products. This set of transformations is not necessarily exhaustive, but is sufficient for our discussion in this paper. We will indicate whether or not each of these types of transformations can presumably cause changes to the amount of global correlations by evaluating the response of the statistical manifold under these transformations. Our inability to describe how much the amount of correlation changes under these transformations motivates the design of such an objective global quantifier.
The types of transformations we will explore can be identified either with maps from a particular statistical manifold to itself, Δ → Δ (type I), to a subset of the original manifold, Δ → Δ′ ⊂ Δ (type II), or from one statistical manifold to another, Δ → Δ′ (type III and IV).

2.1.1. Type I: Coordinate Transformations

Type I transformations are coordinate transformations. A coordinate transformation f : X → X′ is a special type of transformation of the proposition space X that respects certain properties. It is essentially a continuous version of a reparameterization (A reparameterization is an isomorphism between discrete proposition spaces, g : X → Y, which identifies for each proposition x_i ∈ X a unique proposition y_i ∈ Y so that the map g is a bijection.). For one, each proposition x ∈ X must be identified with one and only one proposition x′ ∈ X′ and vice versa. This means that coordinate transformations must be bijections on proposition space. The reason for this is simply by design, i.e., we would like to study the transformations that leave the proposition space invariant. A general transformation of type I on Δ which takes X to X′ = f(X) is met with the following transformation of the densities,
p(x) →(I) p′(x′) where p′(x′) dx′ = p(x) dx.
Like we already mentioned, the coordinate transforming function f : X → X′ must be a bijection in order for (3) to hold, i.e., the map f⁻¹ : X′ → X is such that f ∘ f⁻¹ = id_X′ and f⁻¹ ∘ f = id_X. While the densities p(x) and p′(x′) are not necessarily equal, the probabilities defined in (3) must be (according to the rules of probability theory, see the Appendix A). This indicates that ρ →(I) ρ′ = ρ is in the same location in the statistical manifold. That is, the global state of knowledge has not changed—what has changed is the way in which the local information in ρ has been expressed, which must be invertible in general.
While one could impose that the transformations f be diffeomorphisms (i.e., smooth maps between X and X′), it is not necessary that we restrict f in this way. Without loss of generality, we only assume that the bijections f ∈ C⁰(X) are continuous. For discussions involving diffeomorphism invariance and statistical manifolds see the works of Amari [25], Ay et al. [30] and Bauer et al. [31].
For a coordinate transformation (3) involving two variables, x ∈ X and y ∈ Y, we also have that type I transformations give,
p(x, y) →(I) p′(x′, y′) where p′(x′, y′) dx′ dy′ = p(x, y) dx dy.
A few general properties of these type I transformations are as follows: First, the density p(x, y) is expressed in terms of the density p′(x′, y′),
p(x, y) = p′(x′, y′) γ(x, y),
where γ(x, y) is the determinant of the Jacobian [10] that defines the transformation,
γ(x, y) = |det J(x, y)|, where J(x, y) = [ ∂x′/∂x  ∂x′/∂y ]
                                         [ ∂y′/∂x  ∂y′/∂y ].
For a finite number of variables x = (x_1, …, x_N), the general type I transformations p(x_1, …, x_N) →(I) p′(x′_1, …, x′_N) are written,
p(x_1, …, x_N) ∏_{i=1}^N dx_i = p′(x′_1, …, x′_N) ∏_{i=1}^N dx′_i,
and the Jacobian becomes,
J(x_1, …, x_N) = [ ∂x′_1/∂x_1  ⋯  ∂x′_1/∂x_N ]
                 [      ⋮       ⋱       ⋮     ]
                 [ ∂x′_N/∂x_1  ⋯  ∂x′_N/∂x_N ].
One can also express the density p′(x′) in terms of the original density p(x) by using the inverse transform,
p′(x′) = p(f⁻¹(x′)) γ(x′) = p(x) γ(x′),
where γ(x′) denotes the Jacobian determinant of the inverse transformation f⁻¹.
In general, since coordinate transformations preserve the probabilities associated to a joint proposition space, they also preserve several structures derived from them. One of these is the Fisher-Rao (information) metric [25,31,32], which was proved by Čencov [26] to be the unique metric on statistical manifolds that represents the fact that the points ρ ∈ Δ are probability distributions and not structureless [10] (For a summary of various derivations of the information metric, see [10] Section 7.4).
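To make the type I rule p′(x′) dx′ = p(x) dx of Equation (3) concrete, here is a small numerical sketch (ours; the Gaussian density and the map x′ = x³ are arbitrary illustrative choices, not taken from the article) showing that probabilities of corresponding regions are preserved.

```python
# Sketch of a type I coordinate transformation x -> x' = f(x) = x^3 on R.
# The transformed density is p'(x') = p(f^{-1}(x')) / |dx'/dx|, so that
# p'(x') dx' = p(x) dx and probabilities of corresponding regions agree.
import numpy as np

def p(x):                              # standard normal density (arbitrary choice)
    return np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)

f = lambda x: x**3                     # a bijection of the real line
f_inv = lambda xp: np.cbrt(xp)

def p_prime(xp):
    x = f_inv(xp)
    jac = np.abs(3.0 * x**2)           # |dx'/dx| evaluated at x = f^{-1}(x')
    return p(x) / jac

# P(0.5 < x < 2) versus P(f(0.5) < x' < f(2)), by simple Riemann sums.
x = np.linspace(0.5, 2.0, 200001)
xp = np.linspace(f(0.5), f(2.0), 200001)
print(np.sum(p(x)) * (x[1] - x[0]))           # ~0.2858
print(np.sum(p_prime(xp)) * (xp[1] - xp[0]))  # same value up to discretization
```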

2.1.2. Split Invariant Coordinate Transformations

Consider a class of coordinate transformations that result in a diagonal Jacobian matrix, i.e.,
γ(x_1, …, x_N) = ∏_{i=1}^N ∂x′_i/∂x_i.
These transformations act within each of the variable spaces independently, and hence they are guaranteed to preserve the definition of the split between any n-partitions of the propositions, and because they are coordinate transformations, they are invertible and do not change our state of knowledge, (ρ, X) →(Ia) (ρ′, X′) = (ρ, X). We call such special types of transformations (10) split invariant coordinate transformations and denote them as type Ia. From (10), it is obvious that the marginal distributions of ρ are preserved under split invariant coordinate transformations,
p′(x′_i) dx′_i = p(x_i) dx_i.
If one allows generic coordinate transformations of the joint space, then the marginal distributions may depend on variables outside of their original split. Thus, if one redefines the split after a coordinate transformation to new variables X → X′, the original problem statement changes as to what variables we are considering correlations between and thus Equation (11) no longer holds. This is apparent in the case of two variables (x, y), where x′ = f_x(x, y), since,
dx′ = df_x = (∂f_x/∂x) dx + (∂f_x/∂y) dy,
which depends on y. In the situation where x and y are independent, redefining the split after the coordinate transformation (12) breaks the original independence since the distribution that originally factors, p(x, y) = p(x) p(y), would be made to have conditional dependence in the new coordinates, i.e., if x′ = f_x(x, y) and y′ = f_y(x, y), then,
p(x, y) = p(x) p(y) →(I) p′(x′, y′) = p′(x′) p′(y′|x′).
So, even though the above transformation satisfies (3), this type of transformation may change the correlations in ρ by allowing for the potential redefinition of the split X → X′. Hence, when designing our functional, we identify split invariant coordinate transformations as those which preserve correlations. These restricted coordinate transformations help isolate a single functional form for our global correlation quantifier.
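A discrete analogue (ours; the 2 × 2 joint table is arbitrary) of this distinction: relabeling each variable within its own split leaves the correlations untouched, while a bijection of the joint space that mixes the variables generally does not.

```python
import numpy as np

def mutual_info(p_xy):
    # discrete mutual information: sum p(x,y) log[ p(x,y) / (p(x) p(y)) ]
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask])).sum())

p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

# Type Ia analogue: relabel x and y independently (here, reverse each label set).
split_invariant = p_xy[::-1, :][:, ::-1]
print(mutual_info(p_xy), mutual_info(split_invariant))   # identical values

# A bijection of the joint space that mixes the variables (swap two cells across rows).
mixed = p_xy.copy()
mixed[0, 1], mixed[1, 1] = mixed[1, 1], mixed[0, 1]
print(mutual_info(mixed))                                 # generally a different value
```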

2.1.3. Type II: Entropic Updating

Type II transformations are those induced by updating [10], φ → ρ, in which one maximizes the relative entropy,
S[ρ, φ] = − ∫ dx p(x) log [ p(x) / q(x) ],
subject to constraints and relative to the prior, q(x). Constraints often come in the form of expectation values [10,20,21,22],
⟨ f(x) ⟩ = ∫ dx p(x) f(x) = κ.
A special case of these transformations is Bayes' rule [28,29],
p(x) →(II) p′(x) where p′(x) = p(x|θ) = p(x) p(θ|x) / p(θ).
In (14) and throughout the rest of the paper we will use log base e (natural log) for all logarithms, although the results are perfectly well defined for any base (the quantities S[ρ, φ] and I[ρ, X] will simply differ by an overall scale factor when using different bases). Maximizing (14) with respect to constraints such as (15) induces a jump in the statistical manifold. Type II transformations, while well defined, are not necessarily continuous, since in general one can map nearby points to disjoint subsets in Δ. Type II transformations will also cause ρ →(II) ρ′ ≠ ρ in general as it jumps within the statistical manifold. This means, because different ρ's may have different correlations, that type II transformations can either increase, decrease, or leave the correlations invariant.
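As a sketch of a type II transformation (ours, not from the article; the uniform prior over die faces and the mean constraint ⟨x⟩ = 4.5 are illustrative choices), maximizing (14) subject to a single expectation constraint of the form (15) yields an exponential-family update whose Lagrange multiplier can be found numerically:

```python
# Entropic updating of a uniform prior q(x) over die faces x = 1,...,6,
# subject to the constraint <x> = 4.5. The maximizer of the relative entropy
# has the form p(x) proportional to q(x) exp(-beta * x); beta is found by bisection.
import numpy as np

x = np.arange(1, 7)
q = np.full(6, 1.0 / 6.0)
target_mean = 4.5

def mean_for(beta):
    w = q * np.exp(-beta * x)
    return (w / w.sum()) @ x

lo, hi = -5.0, 5.0                      # bracket for the multiplier
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if mean_for(mid) > target_mean:     # mean decreases as beta increases
        lo = mid
    else:
        hi = mid

beta = 0.5 * (lo + hi)
p = q * np.exp(-beta * x); p /= p.sum()
print(beta, p.round(4), p @ x)          # updated distribution with mean ~4.5
```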

2.1.4. Type III: Marginalization

Type III transformations are induced by marginalization,
p(x, y) →(III) p(x) = ∫ dy p(x, y),
which is effectively a quotienting of the statistical manifold, Δ(x) = Δ(x, y)/y, i.e., for any point p(x), we equivocate all values of p(y|x). Since the distribution ρ changes under type III transformations, ρ →(III) ρ′, the amount of correlations can change.

2.1.5. Type IV: Products

Type IV transformations are created by products,
p(x) →(IV) p(x, y) = p(x) p(y|x),
which are a kind of inverse transformation of type III, i.e., the set of propositions X becomes the product X × Y . There are many different situations that can arise from this type, a most trivial one being an embedding,
p(x) →(IVa) p(x, y) = p(x) δ(y − f(x)),
which can be useful in many applications. The function δ(·) in the above equation is the Dirac delta function [33] which has the following properties,
δ(x) = { ∞ if x = 0, 0 otherwise }, and ∫ dx δ(x) = 1.
We will denote such a transformation as type IVa. Another trivial example of type IV is,
p(x) →(IVb) p(x, y) = p(x) p(y),
which we will call type IVb. Like type II, generic transformations of type IV can potentially create correlations, since again we are changing the underlying distribution.
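The following sketch (ours; the numbers are arbitrary) contrasts the two product types and closes the loop with a type III marginalization: a type IVb product carries no correlations, a generic type IV product does, and marginalizing y away recovers the original p(x).

```python
import numpy as np

def mutual_info(p_xy):
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask])).sum())

p_x = np.array([0.4, 0.6])
p_y = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.9, 0.1],      # rows index x, columns index y
                        [0.2, 0.8]])

p_IVb = p_x[:, None] * p_y[None, :]      # type IVb: p(x) p(y)
p_IV  = p_x[:, None] * p_y_given_x       # type IV:  p(x) p(y|x)

print(mutual_info(p_IVb))                # 0: no correlations are created
print(mutual_info(p_IV))                 # > 0: correlations are created
print(p_IV.sum(axis=1))                  # type III: marginalizing y returns p(x)
```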

2.2. Remarks on Inferential Transformations

There are many practical applications in inference which make use of the above transformations by combining them in a particular order. For example, in machine learning and dimensionality reduction, the task is often to find a low-dimensional representation of some proposition space X, which is done by combining types I, III and IVa in the order, ρ →(IVa) ρ′ →(I) ρ″ →(III) ρ‴. Neural networks are a prime example of this sequence of transformations [34]. Another example of IV, I, III transformations are convolutions of probability distributions, which take two proposition spaces and combine them into a new one [5].
In Appendix C we discuss how our resulting design functionals behave under the aforementioned transformations.

3. Designing a Global Correlation Quantifier

In this section we seek to achieve our design goal for the special case of the total correlation,
Design Goal: Given a space of N variables X = X_1 × ⋯ × X_N and a statistical manifold Δ, we seek to design a functional I[ρ, X] which ranks distributions ρ ∈ Δ according to their total amount of correlations.
Unlike deriving a functional, designing a functional is done through the process of eliminative induction. Derivations are simply a means of showing consistency with a proposed solution whereas design is much deeper. In designing a functional, the solution is not assumed but rather achieved by specifying design criteria that restrict the functional form in a way that leads to a unique or optimal solution. One can then interpret the solution in terms of the original design goal. Thus, by looking at the “nail”, we design a “hammer”, and conclude that hammers are designed to knock in and remove nails. We will show that there are several paths to the solution of our design criteria, the proof of which is in Section 4.
Our design goal requires that I[ρ, X] be scalar valued such that we can rank the distributions ρ according to their correlations. Considering a continuous space X = X_1 × ⋯ × X_N of N variables, the functional form of I[ρ, X] is,
I[ρ, X] = I[ p(x_1, …, x_N); p(x′_1, …, x′_N); … ; X ],
which depends on each of the possible probability values for every x ∈ X (In Watanabe's paper [6], the notation for the total correlation between a set of variables λ is written as C_tot(λ) = ∑_i S(λ_i) − S(λ), where S(λ_i) is the Shannon entropy of the subspace λ_i ⊂ λ. For a proof of Watanabe's theorem see Appendix B).
Given the types of transformations that may be enacted on ρ , we state the main guiding principle we will use to meet our design goal,
Principle of Constant Correlations (PCC): The amount of correlations in (ρ, X) should not change unless required by the transformation, (ρ, X) → (ρ′, X′).
While simple, the PCC is incredibly constraining. By stating when one should not change the correlations, i.e., I[ρ, X] → I[ρ′, X′] = I[ρ, X], it is operationally unique (i.e., that you do not do it) rather than stating how one is required to change them, I[ρ, X] → I[ρ′, X′] ≠ I[ρ, X], of which there are infinitely many choices. The PCC therefore imposes an element of objectivity into I[ρ, X]. If we are able to complete our design goal, then we will be able to uniquely quantify how transformations of type I-IV affect the amount of correlations in ρ.
The discussion of type I transformations indicates that split invariant coordinate transformations do not change (ρ, X). This is because we want to not only maintain the relationship among the joint distribution (3), but also the relationships among the marginal spaces,
p′(x′_i) dx′_i = p(x_i) dx_i.
Only then are the relationships between the n-partitions guaranteed to remain fixed and hence the distribution ρ remains in the same location in the statistical manifold. When a coordinate transformation of this type is made, because it does not change ( ρ , X ) , we are not explicitly required to change I [ ρ , X ] , so by the PCC we impose that it does not.
The PCC together with the design goal implies that,
Corollary 1
(Split Coordinate Invariance). The coordinate systems within a particular split are no more informative about the amount of correlations than any other coordinate system for a given ρ.
This expression is somewhat analogous to the statement that “coordinates carry no information”, which is usually stated as a design criterion for relative entropy [8,9,11] (This appears as axiom two in Shore and Johnson’s derivation of relative entropy [8], which is stated on page 27 as “II. Invariance: The choice of coordinate system should not matter.” In Skilling’s approach [9], which was mainly concerned with image analysis, axiom two on page 177 is justified with the statement “We expect the same answer when we solve the same problem in two different coordinate systems, in that the reconstructed images in the two systems should be related by the coordinate transformation.” Finally, in Caticha’s approach [11], the axiom of coordinate invariance is simply stated on page 4 as “Criterion 2: Coordinate invariance. The system of coordinates carries no information.”).
To specify the functional form of I [ ρ , X ] further, we will appeal to special cases in which it is apparent that the PCC should be imposed [9]. The first involves local, subdomain, transformations of ρ . If a subdomain of X is transformed then one may be required to change its amount of correlations by some specified amount. Through the PCC however, there is no explicit requirement to change the amount of correlations outside of this domain, hence we impose that those correlations outside are not changed. The second special case involves transformations of an independent subsystem. If a transformation is made on an independent subsystem then again by the PCC, because there is no explicit reason to change the amount of correlations in the other subsystem, we impose that they are not changed. We denote these two types of transformation independences as our two design criteria (DC).
Surprisingly, the PCC and the DC are enough to find a general form for I[ρ, X] (up to an irrelevant scale constant). As we previously stated, the first design criterion concerns local changes in the probability distribution ρ.
Design Criterion 1
(Locality). Local transformations of ρ contribute locally to the total amount of correlations.
The term locality has been invoked to mean many different things in different fields (e.g., physics, statistics, etc.). In this paper, as well as in [8,9,11,12,13], the term local refers to transformations which are constrained to act only within a particular subdomain D ⊆ X, i.e., the transformations of the probabilities are local to D and do not affect probabilities outside of this domain. Essentially, if new information does not require us to change the correlations in a particular subdomain D ⊆ X, then we do not change the probabilities over that subdomain. While simple, this criterion is incredibly constraining and leads (22) to the functional form,
I[ρ, X] →(DC1) ∫ dx F( p(x_1, …, x_N), x_1, …, x_N ),
where F is some undetermined function of the probabilities and possibly the coordinates. We have used dx = dx_1 ⋯ dx_N to denote the measure for brevity. To constrain F further, we first use the corollary of split coordinate invariance (1) among the subspaces X_i ⊂ X and then apply special cases of particular coordinate transformations. This leads to the following functional form,
I[ρ, X] →(PCC) ∫ dx p(x_1, …, x_N) Φ( p(x_1, …, x_N) / ∏_{i=1}^N p(x_i) ),
which demonstrates that the integrand is independent of the actual coordinates themselves. Like coordinate invariance, the axiom DC1 also appears in the design derivations of relative entropy [8,9,11,12,13] (In Shore and Johnson’s approach to relative entropy [8], axiom four is analogous to our locality criteria, which states on page 27 “IV. Subset Independence: It should not matter whether one treats an independent subset of system states in terms of a separate conditional density or in terms of the full system density.” In Skilling’s approach [9] locality appears as axiom one which, like Shore and Johnson’s axioms, is called Subset Independence and is justified with the following statement on page 175, “Information about one domain should not affect the reconstruction in a different domain, provided there is no constraint directly linking the domains.” In Caticha [11] the axiom is also called Locality and is written on page four as “Criterion 1: Locality. Local information has local effects.” Finally, in Vanslette’s work [12,13], the subset independence criteria is stated on page three as follows, “Subdomain Independence: When information is received about one set of propositions, it should not effect or change the state of knowledge (probability distribution) of the other propositions (else information was also received about them too).”).
This leaves the function Φ to be determined, which can be done by imposing an additional design criteria.
Design Criterion 2
(Subsystem Independence). Transformations of ρ in one independent subsystem can only change the amount of correlations in that subsystem.
The consequence of DC2 concerns independence among subspaces of X. Given two subsystems (X_1 × X_2) × (X_3 × X_4) ≡ X_12 × X_34 = X which are independent, the joint distribution factors,
p(x) = p(x_1, x_2) p(x_3, x_4) ⟹ ρ = ρ_12 ρ_34.
We will see that this leads to the global correlations being additive over each subsystem,
I [ ρ , X ] = I [ ρ 12 , X 12 ] + I [ ρ 34 , X 34 ] .
Like locality (DC1), the design criteria concerning subsystem independence appears in all four approaches to relative entropy [8,9,11,12,13] (In Shore and Johnson’s approach [8], axiom three concerns subsystem independence and is stated on page 27 as “III. System Independence: It should not matter whether one accounts for independent information about independent systems separately in terms of different densities or together in terms of a joint density.” In Skillings approach [9], the axiom concerning subsystem independence is given by axiom three on page 179 and provides the following comment on page 180 about its consequences “This is the crucial axiom, which reduces S to the entropic form. The basic point is that when we seek an uncorrelated image from marginal data in two (or more) dimensions, we need to multiply the marginal distributions. On the other hand, the variational equation tells us to add constraints through their Lagrange multipliers. Hence the gradient δ S / δ f must be the logarithm.” In Caticha’s design derivation [11], axiom three concerns subsystem independence and is written on page 5 as “Criterion 3: Independence. When systems are known to be independent it should not matter whether they are treated separately or jointly.” Finally, in Vanslette [12,13] on page 3 we have “Subsystem Independence: When two systems are a priori believed to be independent and we only receive information about one, then the state of knowledge of the other system remains unchanged.”); however, due to the difference in the design goal here, we end up imposing DC2 closer to that of the work of [12,13] as we do not explicitly have the Lagrange multiplier structure in our design space.
Imposing DC2 leads to the final functional form of I [ ρ , X ] ,
I[ρ, X] →(DC2) ∫ dx p(x_1, …, x_N) log [ p(x_1, …, x_N) / ∏_{i=1}^N p(x_i) ],
with p ( x i ) being the split dependent marginals. This functional is what is typically referred to as the total correlation (The concept of total correlation TC was first introduced in Watanabe [6] as a generalization to Shannon’s definition of mutual information. There are many practical applications of TC in the literature [35,36,37,38].) and is the unique result obtained from imposing the PCC and the corresponding design criteria.
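For discrete variables, Equation (28) can be evaluated directly. The sketch below (ours; the arrays are arbitrary test distributions) computes the total correlation of a joint array and confirms that it vanishes for a product of marginals.

```python
import numpy as np

def total_correlation(p):
    """Sum over x of p(x) * log[ p(x) / prod_i p(x_i) ] for a normalized joint array."""
    N = p.ndim
    prod_marginals = np.ones_like(p)
    for i in range(N):
        axes = tuple(k for k in range(N) if k != i)
        prod_marginals = prod_marginals * p.sum(axis=axes, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / prod_marginals[mask])).sum())

rng = np.random.default_rng(0)
p = rng.random((2, 3, 2)); p /= p.sum()          # a generic correlated joint
print(total_correlation(p))                      # >= 0

q = np.array([0.4, 0.6])[:, None, None] \
    * np.array([0.2, 0.3, 0.5])[None, :, None] \
    * np.array([0.7, 0.3])[None, None, :]        # independent: TC = 0
print(total_correlation(q))                      # ~0 up to floating point
```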
As was mentioned throughout, these results are usually implemented as design criteria for relative entropy as well. Shore and Johnson’s approach [8] presents four axioms, of which III and IV are subsystem and subset independence. Subset independence in their framework corresponds to Equation (24) and to the Locality axiom of Caticha [11]. It also appears as an axiom in the approaches by Skilling [9] and Vanslette [12,13]. Subsystem independence is given by axiom three in Caticha’s work [11], axiom two in Vanslette’s [12,13] and axiom three in Skilling’s [9]. While coordinate invariance was invoked in the approaches by Skilling, Shore and Johnson and Caticha, it was later found to be unnecessary in the work by Vanslette [12,13] who only required two axioms. Likewise, we find that it is an obvious consequence of the PCC and does not need to be stated as a separate axiom in our derivation of the total correlation.
The work by Csiszár [39] provides a nice summary of the various axioms used by many authors (including Azcél [40], Shore and Johnson [8] and Jaynes [21]) in their definitions of information theoretic measures (A list is given on page 3 of [39] which includes the following for conditions on an entropy function H(P); (1) Positivity (H(P) ≥ 0), (2) Expansibility ("expansion" of P by a new component equal to 0 does not change H(P), i.e., embedding in a space in which the probabilities of the new propositions are zero), (3) Symmetry (H(P) is invariant under permutation of the probabilities), (4) Continuity (H(P) is a continuous function of P), (5) Additivity (H(P × Q) = H(P) + H(Q)), (6) Subadditivity (H(X, Y) ≤ H(X) + H(Y)), (7) Strong additivity (H(X, Y) = H(X) + H(Y|X)), (8) Recursivity (H(p_1, …, p_n) = H(p_1 + p_2, p_3, …, p_n) + (p_1 + p_2) H(p_1/(p_1 + p_2), p_2/(p_1 + p_2))) and (9) Sum property (H(P) = ∑_{i=1}^n g(p_i) for some function g).). One could associate the design criteria in this work to some of the common axioms enumerated in [39], although some of them will appear as consequences of imposing a specific design criterion, rather than as an ansatz. For example, the strong additivity condition (see Appendix C.1.3 and Appendix C.1.4) is the result of imposing DC1 and DC2. Likewise, the condition of positivity (i.e., I[ρ, X] ≥ 0) and convexity occurs as a consequence of the design goal, split coordinate invariance (SCI) and both of the design criteria. Continuity of I[ρ, X] with respect to ρ is imposed through the design goal, and symmetry is a consequence of DC1. In summary, Design Goal → continuity, DC1 → symmetry, (DC1 + DC2) → strong additivity, (Design Goal + SCI + DC1 + DC2) → positivity + convexity. As was shown by Shannon [14] and others [39,40], various combinations of these axioms, as well as the ones mentioned in footnote 11, are enough to characterize entropic measures.
One could argue that we could have merely imposed these axioms at the beginning to achieve the functional I [ ρ , X ] , rather than through the PCC and the corresponding design criteria. The point of this article however, is to design the correlation functionals by using principles of inference, rather than imposing conditions on the functional directly (This point was also discussed in the conclusion section of Shore and Johnson [8] see page 33.). In this way, the resulting functionals are consequences of employing the inference framework, rather than postulated arbitrarily.
One will recognize that the functional form of (28) and the corresponding n-partite informations (88) have the form of a relative entropy. Indeed, if one identifies the product marginal ∏_{i=1}^N p(x_i) as a prior distribution as in (14), then it may be possible to find constraints (15) which update the product marginal to the desired joint distribution p(x). One can then interpret the constraints as the generators of the correlations. We leave the exploration of this topic to a future publication.

4. Proof of the Main Result

We will prove the results summarized in the previous section. Let a proposition of interest be represented by x_i ∈ X—an N dimensional coordinate x_i = (x_i^1, …, x_i^N) that lives somewhere in the discrete and fixed proposition space X = {x_1, …, x_i, …, x_|X|}, with |X| being the cardinality of X (i.e., the number of possible combinations). The joint probability distribution at this generic location is P(x_i) ≡ P(x_i^1, …, x_i^N) and the entire distribution ρ is the set of joint probabilities defined over the space X, i.e., ρ ≡ {P(x_1), …, P(x_|X|)} ∈ Δ.

4.1. Locality-DC1

We begin by imposing DC1 on I[ρ, X]. Consider changes in ρ induced by some transformation (∗), where the change to the state of knowledge is,
ρ → ρ′ = ρ + δρ,
for some arbitrary change δρ in Δ that is required by some new information. This implies that the global correlation function must also change according to (22),
I[ρ, X] → I[ρ′, X] = I[ρ, X] + δI,
where δI is the change to I[ρ, X] induced by (29). To impose DC1, consider that the new information requires us to change the distribution in one subdomain D ⊆ X, ρ → ρ′ = ρ + δρ_D, that may change the correlations, while leaving the probabilities in the complement domain fixed, δρ_D̄ = 0. (The subdomain D and its complement D̄ obey the relations D ∩ D̄ = ∅ and D ∪ D̄ = X.) Let the subset of the propositions in D be relabeled as {x_1, …, x_d, …, x_|D|} ⊆ {x_1, …, x_i, …, x_|X|}. Then the variations in I[ρ, X] with respect to the changes of ρ in the subdomain D are,
I[ρ, X] → I[ρ + δρ_D, X] ≈ I[ρ, X] + ∑_{d ∈ D} ( ∂I[ρ, X] / ∂P(x_d) ) δP(x_d),
for small changes δP(x_d). In general the derivatives in (31) are functions of the probabilities,
∂I[ρ, X] / ∂P(x_d) = f_d( P(x_1), …, P(x_|X|), x_d ),
which could potentially depend on the entire distribution ρ as well as the point x_d ∈ X. We impose DC1 by constraining (32) to only depend on the probabilities within the subdomain D since the variation (32) should not cause changes to the amount of correlations in the complement D̄, i.e.,
∂I[ρ, X] / ∂P(x_d) →(DC1) f_d( P(x_1), …, P(x_d), …, P(x_|D|), x_d ).
This condition must also hold for arbitrary choices of subdomains D, thus by further imposing DC1 in the most restrictive case of local changes (D = {x_d}),
∂I[ρ, X] / ∂P(x_d) →(DC1) f_d( P(x_d), x_d ),
guarantees that it will hold in the general case. In this most restrictive case of local changes, the functional I[ρ, X] has vanishing mixed derivatives,
∂²I[ρ, X] / ( ∂P(x_i) ∂P(x_j) ) = 0, ∀ i ≠ j.
Integrating (34) leads to,
I[ρ, X] = ∑_{i=1}^{|X|} F_i( P(x_i), x_i ) + const.,
where the {F_i} are undetermined functions of the probabilities and the coordinates. As this functional is designed for ranking, nothing prevents us from setting the irrelevant constant to zero, which we do. Extending to the continuum, we find Equation (24),
I[ρ, X] = ∫ dx F( p(x), x ),
where for brevity we have also condensed the notation for the continuous N dimensional variables x = {x_1, …, x_N}. It should be noted that F(p(x), x) has the capacity to express a large variety of potential measures of correlation including Pearson's [3] and Szekely's [4] correlation coefficients. Our new objective is to use eliminative induction until only a unique functional form for F remains.

4.1.1. Split Coordinate Invariance–PCC

The PCC and the corollary (1) state that I[ρ, X], and thus F(p(x), x), should be independent of transformations that keep (ρ, X) → (ρ′, X′) = (ρ, X) fixed. As discussed, split invariant coordinate transformations (10) satisfy this property. We will further restrict the functional I[ρ, X] so that it obeys these types of transformations.
We can always rewrite the expression (37) by introducing densities m(x) and p(x) so that,
I[ρ, X] = ∫ dx p(x) (1/p(x)) F( [p(x)/m(x)] m(x), x ).
Then, instead of dealing with the function F directly, we can instead deal with a new definition Φ,
I[ρ, X] = ∫ dx p(x) Φ( p(x)/m(x), p(x), m(x), x ),
where Φ is defined as,
Φ( p(x)/m(x), p(x), m(x), x ) ≝ (1/p(x)) F( [p(x)/m(x)] m(x), x ).
Now we further restrict the functional form of Φ by appealing to the PCC. Consider the functional I[ρ, X] under a split invariant coordinate transformation,
(x_1, …, x_N) → (x′_1, …, x′_N) ⟹ m′(x′) dx′ = m(x) dx, and p′(x′) dx′ = p(x) dx,
which amounts to sending Φ to,
Φ( p(x)/m(x), p(x), m(x), x ) = Φ( p′(x′)/m′(x′), γ(x) p′(x′), γ(x) m′(x′), x′ ),
where γ(x) = ∏_{i=1}^N γ(x_i) is the Jacobian for the transformation from (x_1, …, x_N) to (f_1(x_1), …, f_N(x_N)). Consider the special case in which the Jacobian γ(x) = 1. Then due to the PCC we must have,
Φ( p(x)/m(x), p(x), m(x), x ) = Φ( p′(x′)/m′(x′), p′(x′), m′(x′), x′ ).
However this would suggest that I[ρ, X] →(Ia) I[ρ′, X′] ≠ I[ρ, X] since correlations could be changed under the influence of the new variables x′ ∈ X′. Thus in order to maintain the global correlations the function Φ must be independent of the coordinates,
Φ →(PCC) Φ( p(x)/m(x), p(x), m(x) ).
To constrain the form of Φ further, we can again appeal to split coordinate invariance but now with arbitrary Jacobian γ(x) ≠ 1, which causes Φ to transform as,
Φ( p(x)/m(x), p(x), m(x) ) = Φ( p′(x′)/m′(x′), γ(x) p′(x′), γ(x) m′(x′) ).
But this must hold for arbitrary split invariant coordinate transformations, including those for which the Jacobian factor γ(x) ≠ 1. Hence, the function Φ must also be independent of the second and third argument,
Φ →(PCC) Φ( p(x)/m(x) ).
We then have that the split coordinate invariance suggested by the PCC together with DC1 gives,
I[ρ, X] = ∫ dx p(x) Φ( p(x)/m(x) ).
This is similar to the steps found in the relative entropy derivation [8,11], but differs from the steps in [12,13].

4.1.2. I_min–Design Goal and PCC

Split coordinate invariance, as realized in Equation (47), provides an even stronger restriction on I[ρ, X] which we can find by appealing to a special case. Since all distributions with the same correlations should have the same value of I[ρ, X] by the Design Goal and PCC, then all independent joint distributions φ will also have the same value, which by design takes a unique minimum value,
p(x) = ∏_{i=1}^N p(x_i) ⟹ I[φ, X] = I_min.
Requiring that independent joint distributions φ return a unique minimum I[φ, X] = I_min is similar to imposing a positivity condition on I[ρ, X] [39]. We will find however, that positivity only arises once DC2 has been taken into account. Here, I_min could be any value, so long as when one introduces correlations, φ → ρ, the value of I[ρ, X] always increases from I_min. This condition could also be imposed as a general convexity property of I[ρ, X], however this is already required by the design goal and does not require an additional axiom.
Inserting (48) into (47) we find,
I[ρ, X] = ∫ dx ∏_{i=1}^N p(x_i) Φ( ∏_{i=1}^N p(x_i) / m(x) ) = I_min.
But this expression must be independent of the underlying distribution p(x) = ∏_{i=1}^N p(x_i), since all independent distributions, regardless of the joint space X, must give the same value I_min. Thus we conclude that the density m(x) must be the product marginal m(x) = ∏_{i=1}^N p(x_i),
p(x_i) = ∫ dx̄_i p(x), where dx̄_i = ∏_{k ≠ i} dx_k,
so it is guaranteed that,
I_min = ∫ dx p(x) Φ(1) = Φ(1) = const.
Thus, by design, expression (47) becomes (25),
I[ρ, X] →(PCC) ∫ dx p(x) Φ( p(x) / ∏_{i=1}^N p(x_i) ).

4.2. Subsystem Independence–DC2

In the following subsections we will consider two approaches for imposing subsystem independence via the PCC and DC2. Both lead to identical functional expressions for I [ ρ , X ] . The analytic approach assumes the functional form of Φ may be expressed as a Taylor series. The algebraic approach reaches the same conclusion without this assumption.

4.2.1. Analytical Approach

Let us assume that the function Φ is analytic, so that it can be Taylor expanded. Since the argument p(x)/∏_{i=1}^N p(x_i) is defined over [0, ∞), we can consider the expansion over some open set of [0, ∞) for any particular value p_0(x)/∏_{i=1}^N p_0(x_i) as,
Φ( p(x)/∏_{i=1}^N p(x_i) ) = ∑_{n=0}^∞ Φ̃_n [ p(x)/∏_{i=1}^N p(x_i) − p_0(x)/∏_{i=1}^N p_0(x_i) ]^n,
where Φ̃_n are real coefficients. For p(x)/∏_{i=1}^N p(x_i) in the neighborhood of p_0(x)/∏_{i=1}^N p_0(x_i), the series (53) converges to Φ( p(x)/∏_{i=1}^N p(x_i) ). The Taylor expansion of Φ( p(x)/∏_{i=1}^N p(x_i) ) about p(x) when its propositions are nearly independent, i.e., p(x) ≈ ∏_{i=1}^N p(x_i), is
Φ = ∑_{n=0}^∞ (1/n!) Φ^(n)( p(x)/∏_{i=1}^N p(x_i) ) [ p(x)/∏_{i=1}^N p(x_i) − 1 ]^n,
where the upper index (n) denotes the nth-derivative,
Φ^(n)( p(x)/∏_{i=1}^N p(x_i) ) = ( δ/δp(x) )^(n) Φ( p(x)/∏_{i=1}^N p(x_i) ) |_{p(x) = ∏_{i=1}^N p(x_i)}.
The 0th term is Φ^(0)( p(x)/∏_{i=1}^N p(x_i) ) = Φ[1] = Φ_min by definition of the design goal, which leaves,
Φ^+ = ∑_{n=1}^∞ (1/n!) Φ^(n)( p(x)/∏_{i=1}^N p(x_i) ) [ p(x)/∏_{i=1}^N p(x_i) − 1 ]^n,
where the + in Φ^+ refers to the terms with n > 0.
Consider the independent subsystem special case in which p(x) is factorizable into p(x) = p(x_1, x_2) p(x_3, x_4), for all x ∈ X. We can represent Φ^+ with an analogous two-dimensional Taylor expansion in p(x_1, x_2) and p(x_3, x_4), which is,
Φ^+ = ∑_{n_1=1}^∞ (1/n_1!) Φ^(n_1)( p(x_1, x_2)/(p(x_1) p(x_2)) ) [ p(x_1, x_2)/(p(x_1) p(x_2)) − 1 ]^{n_1}
    + ∑_{n_2=1}^∞ (1/n_2!) Φ^(n_2)( p(x_3, x_4)/(p(x_3) p(x_4)) ) [ p(x_3, x_4)/(p(x_3) p(x_4)) − 1 ]^{n_2}
    + ∑_{n_1=1}^∞ ∑_{n_2=1}^∞ (1/(n_1! n_2!)) Φ^(n_1, n_2)( p(x)/∏_{i=1}^N p(x_i) ) [ p(x_1, x_2)/(p(x_1) p(x_2)) − 1 ]^{n_1} [ p(x_3, x_4)/(p(x_3) p(x_4)) − 1 ]^{n_2},
where the mixed derivative term is,
Φ^(n_1, n_2)( p(x)/∏_{i=1}^N p(x_i) ) = ( δ/δp(x_1, x_2) )^(n_1) ( δ/δp(x_3, x_4) )^(n_2) Φ( p(x)/∏_{i=1}^N p(x_i) ) |_{p(x) = ∏_{i=1}^N p(x_i)}.
Since transformations of one independent subsystem, ρ_12 → ρ′_12 or ρ_34 → ρ′_34, must leave the other invariant by the PCC and subsystem independence, then DC2 requires that the mixed derivatives should necessarily be set to zero, Φ^(n_1, n_2)( p(x_1, x_2) p(x_3, x_4)/∏_{i=1}^N p(x_i) ) = 0. This gives a functional equation for Φ^+,
Φ^+ = ∑_{n_1=1}^∞ (1/n_1!) Φ^(n_1)( p(x_1, x_2)/(p(x_1) p(x_2)) ) [ p(x_1, x_2)/(p(x_1) p(x_2)) − 1 ]^{n_1} + ∑_{n_2=1}^∞ (1/n_2!) Φ^(n_2)( p(x_3, x_4)/(p(x_3) p(x_4)) ) [ p(x_3, x_4)/(p(x_3) p(x_4)) − 1 ]^{n_2} = Φ_1^+ + Φ_2^+,
where Φ_1^+ corresponds to the terms involving X_1 and X_2 and Φ_2^+ corresponds to the terms involving X_3 and X_4. Including Φ_min = Φ(1) from the n_1 = 0 and n_2 = 0 cases we have in total that,
Φ = 2Φ_min + Φ_1^+ + Φ_2^+.
To determine the solution of this equation we can appeal to the special case in which both subsystems are independent, p(x_1, x_2) = p(x_1) p(x_2) and p(x_3, x_4) = p(x_3) p(x_4), which amounts to,
Φ = Φ_min = 2Φ_min,
which means that either Φ_min = 0 or Φ_min = ±∞; however, the latter two solutions are ruled out by the design goal, since setting the minimum to +∞ makes no sense, and setting it to −∞ does not allow for ranking as it implies Φ = −∞ for all finite values of Φ^+, which would violate the Design Goal. Further, Φ_min = ±∞ would imply that the minimum would not be a well defined constant number Φ, which violates (51). Thus, by eliminative induction and following our design method, it follows that Φ_min = Φ(1) must equal 0.
The general equation for Φ having two independent subsystems ρ = ρ_12 ρ_34 is,
Φ[ρ_12 ρ_34] = Φ_1^+[ρ_12] + Φ_2^+[ρ_34],
or with the arguments,
Φ( p(x_1, x_2) p(x_3, x_4) / (p(x_1) p(x_2) p(x_3) p(x_4)) ) = Φ_1^+( p(x_1, x_2)/(p(x_1) p(x_2)) ) + Φ_2^+( p(x_3, x_4)/(p(x_3) p(x_4)) ).
If subsystem ρ_34 = ρ_3 ρ_4 is itself independent, it implies
Φ( p(x_1, x_2)/(p(x_1) p(x_2)) · 1 ) = Φ_1^+( p(x_1, x_2)/(p(x_1) p(x_2)) ),
but due to commutativity, this is also,
Φ( 1 · p(x_1, x_2)/(p(x_1) p(x_2)) ) = Φ_2^+( p(x_1, x_2)/(p(x_1) p(x_2)) ).
This implies the functional form of Φ does not have dependence on the particular subsystem, Φ = Φ_1^+ = Φ_2^+ in general. This gives the following functional equation for Φ,
Φ( p(x_1, x_2) p(x_3, x_4) / (p(x_1) p(x_2) p(x_3) p(x_4)) ) = Φ( p(x_1, x_2)/(p(x_1) p(x_2)) ) + Φ( p(x_3, x_4)/(p(x_3) p(x_4)) ).
The solution to this functional equation is the log,
Φ [ z ] = A log ( z ) ,
where A is an arbitrary constant. Setting A = 1, so that the global correlation functional increases with the amount of correlations, we obtain,
I[ρ, X] = ∫ dx p(x) log [ p(x) / ∏_{i=1}^N p(x_i) ],
which is (28).
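For completeness, a standard argument (our addition; it assumes Φ is differentiable, which holds in the analytic setting above) for why the logarithm solves the functional equation Φ(uv) = Φ(u) + Φ(v): differentiating with respect to u and setting u = 1 gives

```latex
\[
  v\,\Phi'(uv)\Big|_{u=1} = \Phi'(u)\Big|_{u=1}
  \;\Longrightarrow\;
  v\,\Phi'(v) = \Phi'(1) \equiv A
  \;\Longrightarrow\;
  \Phi(v) = A \log v + C ,
\]
```

and the condition Φ(1) = Φ_min = 0 established above fixes C = 0, reproducing Φ[z] = A log(z).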
The result in (66) could be imposed as an additivity condition on the functional I [ ρ , X ] = I [ ρ 12 , X 12 ] + I [ ρ 34 , X 34 ] [39]. In general however, the correlation functional I [ ρ , X ] obeys the stricter strong additivity condition, which we have no reason a priori to impose. Here, the strong additivity condition is instead an end result, realized as a consequence of imposing the PCC through the various design criteria.
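A quick numerical check (ours; the distributions are random test arrays, and the helper repeats the sketch following Equation (28) for self-containedness) of this additivity over independent subsystems:

```python
import numpy as np

def total_correlation(p):
    N = p.ndim
    prod_marginals = np.ones_like(p)
    for i in range(N):
        axes = tuple(k for k in range(N) if k != i)
        prod_marginals = prod_marginals * p.sum(axis=axes, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / prod_marginals[mask])).sum())

rng = np.random.default_rng(1)
rho_12 = rng.random((2, 3)); rho_12 /= rho_12.sum()   # correlated pair (x1, x2)
rho_34 = rng.random((4, 2)); rho_34 /= rho_34.sum()   # correlated pair (x3, x4)

# Joint over (x1, x2, x3, x4) with the two pairs independent of each other.
rho = rho_12[:, :, None, None] * rho_34[None, None, :, :]

print(total_correlation(rho))
print(total_correlation(rho_12) + total_correlation(rho_34))  # agrees to floating point
```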

4.2.2. Algebraic Approach

Here we present an alternative algebraic approach to imposing DC2. Consider the case in which subsystem two is independent, ρ_34 → φ_34, i.e., p(x_3, x_4) = p(x_3) p(x_4) (The notation φ in place of the usual ρ for a distribution is meant to represent independence among its subsystems.), and ρ = ρ_12 φ_34. This special case is,
I[ρ_12 φ_34, X] = ∫ dx p(x_1, x_2) p(x_3, x_4) Φ( p(x_1, x_2) p(x_3, x_4) / (p(x_1) p(x_2) p(x_3) p(x_4)) ) = ∫ dx_1 dx_2 p(x_1, x_2) Φ( p(x_1, x_2)/(p(x_1) p(x_2)) ) = I[ρ_12, X_12],
which holds for all product forms of φ_34 that have no correlations and for all possible transformations of ρ_12 → ρ′_12.
Alternatively, we could have considered the situation in which subsystem one is independent, ρ_12 → φ_12, i.e., p(x_1, x_2) = p(x_1) p(x_2). Analogously, this case implies,
I[φ_12 ρ_34, X] = I[ρ_34, X_34],
which holds for all product forms of φ_12 that have no correlations and for all possible transformations of ρ_34 → ρ′_34.
The consequence of these considerations is that, in principle, we have isolated the amount of correlations of either system. Imposing DC2 is requiring that the amount of correlations in either subsystem cannot be affected by changes in correlations in the other. This implies that for general ρ = ρ 1 ρ 2 ,
I [ ρ 1 ρ 2 , X ] = G [ I [ ρ 1 , X 1 ] , I [ ρ 2 , X 2 ] ] .
Consider a variation of ρ 1 where ρ 2 is held fixed, which induces a change in I [ ρ 1 ρ 2 , X ] ,
δ I [ ρ 1 ρ 2 , X ] | ρ 2 = ( δ I [ ρ 1 ρ 2 , X ] / δ I [ ρ 1 , X 1 ] ) ( δ I [ ρ 1 , X 1 ] / δ ρ 1 ) δ ρ 1 .
Now consider a variation of ρ 1 at any other value of the second subsystem, ρ 2 → ρ 2 ′ . This is,
δ I [ ρ 1 ρ 2 ′ , X ] | ρ 2 ′ = ( δ I [ ρ 1 ρ 2 ′ , X ] / δ I [ ρ 1 , X 1 ] ) ( δ I [ ρ 1 , X 1 ] / δ ρ 1 ) δ ρ 1 .
It follows from DC2 that transformations in one independent subsystem should not change the amount of correlations in another independent subsystem due to the PCC. However, for the same δ ρ 1 , the current functional form (72) allows for δ I [ ρ 1 ρ 2 , X ] / δ ρ 1 at one value of ρ 2 to differ from δ I [ ρ 1 ρ 2 ′ , X ] / δ ρ 1 at another, which implies that the amount of correlations induced by the change δ ρ 1 depends on the value of ρ 2 . Imposing DC2 is therefore enforcing that functionally the amount of change in the correlations satisfies
δ I [ ρ 1 ρ 2 , X ] / δ I [ ρ 1 , X 1 ] = δ I [ ρ 1 ρ 2 ′ , X ] / δ I [ ρ 1 , X 1 ] ,
for any value of ρ 2 ′ , i.e., the variations must be independent too. This similarly goes for variations with respect to ρ 2 where ρ 1 is kept fixed, which implies that (71) must be linear since,
δ 2 I [ ρ 1 ρ 2 , X ] / ( δ I [ ρ 1 , X 1 ] δ I [ ρ 2 , X 2 ] ) = 0 .
The general solution to this differential equation is,
I [ ρ 1 ρ 2 , X ] = a I [ ρ 1 , X 1 ] + b I [ ρ 2 , X 2 ] + c .
We now seek the constants a , b , c . Commutativity, I [ ρ 1 ρ 2 , X ] = I [ ρ 2 ρ 1 , X ] , implies that a = b ,
I [ ρ 1 ρ 2 , X ] = a ( I [ ρ 1 , X 1 ] + I [ ρ 2 , X 2 ] ) + c .
Because I m i n = Φ [ 1 ] = Φ [ 1 N ] , for N independent subsystems we find,
I [ φ 1 ⋯ φ N , X ] = I min = N a I min + ( N − 1 ) c ,
and therefore the constant c must satisfy,
c = ( 1 − N a ) I min / ( N − 1 ) .
Because a , c , and I min are all constants, they should not depend on the number of independent subsystems N . Thus, for another distribution ρ which contains M ≠ N independent subsystems,
c = ( 1 − N a ) I min / ( N − 1 ) = ( 1 − M a ) I min / ( M − 1 ) ,
which implies N = M , which cannot be realized by definition. This implies the only solution is I min = c = 0 , which is in agreement with the analytic approach. One then uses (69),
I [ ρ 1 φ 2 , X ] = a I [ ρ 1 , X 1 ] = I [ ρ 1 , X 1 ] ,
and finds a = 1 . This gives a functional equation for Φ ,
Φ [ ρ 12 ρ 34 ] = Φ [ ρ 12 ] + Φ [ ρ 34 ] .
At this point the solution follows from Equation (67) so that I [ ρ ] is (28),
I [ ρ , X ] = ∫ d x p ( x ) log [ p ( x ) / ∏_{i=1}^{N} p ( x i ) ] .

5. The n-Partite Special Cases

In the previous sections of the article, we designed an expression that quantifies the global correlations present within an entire probability distribution and found this quantity to be identical to the total correlation (TC). Now we would like to discuss partial cases of the above in which one does not consider the information shared by the entire set of variables X , but only information shared across particular subsets of variables in X . These types of special cases of TC measure the n-partite correlations present for a given distribution ρ . We call such functionals an n-partite information, or NPI.
Given a set of N variables in proposition space, X = X 1 × ⋯ × X N , an n-partite subset of X consists of n ≤ N subspaces { X ( k ) } n ≡ { X ( 1 ) , … , X ( k ) , … , X ( n ) } which have the following collectively exhaustive and mutually exclusive properties,
X ( 1 ) × ⋯ × X ( n ) = X and X ( k ) ∩ X ( j ) = ∅ , k ≠ j .
The special case of (83) for any n-partite splitting will be called the n-partite information and will be denoted by I [ ρ , X ( 1 ) ; ; X ( n ) ] with ( n 1 ) semi-colons separating the partitions. The largest number n that one can form for any variable set X is simply the number of variables present in X and for this largest set the n-partite information coincides with the total correlation,
I [ ρ , X 1 ; ; X N ] = def I [ ρ , X ] .
Each of the n-partite informations can be derived in a manner similar to the total correlation, except where the density m ( x ) in step (52) is replaced with the appropriate independent density associated to the n-partite system, i.e.,
m ( x ) → ( n -partite ) ∏_{k=1}^{n} p ( x ( k ) ) , x ( k ) ∈ X ( k ) .
Thus, the split invariant coordinate transformation (10) becomes one in which each of the partitions in variable space gives an overall block diagonal Jacobian, (In the simplest case, for N dimensions and n = 2 partitions, the Jacobian matrix is block diagonal in the partitions J ( X ( 1 ) , X ( 2 ) ) = J ( X ( 1 ) ) J ( X ( 2 ) ) , which we use to define the split invariant coordinate transformations in the bipartite (or mutual) information case.)
γ ( X ( 1 ) , … , X ( n ) ) = ∏_{k=1}^{n} γ ( X ( k ) ) .
We then derive what we call the n-partite information (NPI),
I [ ρ , X ( 1 ) ; … ; X ( n ) ] = ∫ d x p ( x ) log [ p ( x ) / ∏_{k=1}^{n} p ( x ( k ) ) ] .
The combinatorial number of possible partitions of the spaces for n ≤ N splits is given by the combinatorics of Stirling numbers of the second kind [41]. A Stirling number of the second kind S ( N , n ) (often denoted as { N n } ) gives the number of ways to partition a set of N elements into n non-empty subsets. The definition in terms of binomial coefficients is given by,
{ N n } = ( 1 / n ! ) ∑_{i=0}^{n} ( − 1 )^{i} ( n i ) ( n − i )^{N} .
Thus, the number of unique n-partite informations one can form from a set of N variables is equal to { N n } .
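For a quick sanity check of this counting (an illustrative snippet of ours, not from the paper), the formula can be evaluated directly; for example, N = 4 variables admit seven distinct bipartite splittings.

```python
# Illustrative evaluation of the Stirling number of the second kind,
# S(N, n) = (1/n!) * sum_{i=0}^{n} (-1)^i C(n, i) (n - i)^N,
# which counts the distinct n-partite informations for N variables.
from math import comb, factorial

def stirling2(N, n):
    return sum((-1) ** i * comb(n, i) * (n - i) ** N for i in range(n + 1)) // factorial(n)

print(stirling2(4, 2))   # 7 bipartite informations for four variables
print(stirling2(4, 3))   # 6 tripartite informations
```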
Using (A28) from the appendix, for any n-partite information I [ ρ , X ( 1 ) ; ; X ( k ) ; ; X ( n ) ] , where n > 2 , we have the chain rule,
I [ ρ , X ( 1 ) ; … ; X ( k ) ; … ; X ( n ) ] = ∑_{k=2}^{n} I [ ρ , X ( 1 ) × ⋯ × X ( k − 1 ) ; X ( k ) ] ,
where I [ ρ , X ( 1 ) × ⋯ × X ( k − 1 ) ; X ( k ) ] is the mutual information between the subspace X ( 1 ) × ⋯ × X ( k − 1 ) and the subspace X ( k ) .
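The chain rule is straightforward to verify numerically; the following sketch (ours, with an arbitrary randomly generated distribution) checks the three-variable case, where the tripartite information decomposes as I [ ρ , X 1 ; X 2 ; X 3 ] = I [ ρ , X 1 ; X 2 ] + I [ ρ , X 1 × X 2 ; X 3 ] .

```python
# Illustrative check (not from the paper) of the chain rule relating the
# tripartite information to a sum of mutual informations.
import numpy as np
from itertools import product

def npi(p, partition):
    """n-partite information of a joint array p for a partition of its axes,
    e.g. partition = [(0, 1), (2,)] for I[rho, X1 x X2 ; X3]."""
    all_axes = set(range(p.ndim))
    blocks = [p.sum(axis=tuple(all_axes - set(block))) for block in partition]
    val = 0.0
    for idx in product(*(range(s) for s in p.shape)):
        if p[idx] > 0:
            m = np.prod([blocks[b][tuple(idx[a] for a in block)]
                         for b, block in enumerate(partition)])
            val += p[idx] * np.log(p[idx] / m)
    return val

rng = np.random.default_rng(1)
p = rng.random((2, 2, 2)); p /= p.sum()       # generic correlated joint

lhs     = npi(p, [(0,), (1,), (2,)])          # I[rho, X1; X2; X3]
mi_12   = npi(p.sum(axis=2), [(0,), (1,)])    # I[rho, X1; X2]
mi_12_3 = npi(p, [(0, 1), (2,)])              # I[rho, X1 x X2; X3]
print(lhs, mi_12 + mi_12_3)                   # the two numbers agree
```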

5.1. Remarks on the Upper-Bound of TC

The TC provides an upper bound for any choice of n-partite information, i.e., any n-partite information in which n < N necessarily satisfies,
I [ ρ , X ( 1 ) ; … ; X ( n ) ] ≤ I [ ρ , X ] .
This can be shown by using the decomposition of the TC into continuous Shannon entropies which was discussed in [6],
I [ ρ , X ] = ∑_{i=1}^{N} S [ ρ i , X i ] − S [ ρ , X ] ,
where the continuous Shannon entropy (While it is true that the continuous Shannon entropy is not coordinate invariant, the particular combinations used in this paper are, due to the TC and n-partite information being relative entropies themselves.) S [ ρ , X ] is,
S [ ρ , X ] = − ∫ d x p ( x ) log p ( x ) .
Likewise for any n-partition we have the decomposition,
I [ ρ , X ( 1 ) ; … ; X ( n ) ] = ∑_{k=1}^{n} S [ ρ ( k ) , X ( k ) ] − S [ ρ , X ] .
Since we in general have the inequality [5] for entropy,
S [ ρ i j , X i × X j ] ≤ S [ ρ i , X i ] + S [ ρ j , X j ] ,
then we also have, for the k th partition of a set of N variables, that the N k exhaustive internal partitions (i.e., ∑_{k=1}^{n} N k = N ) of X ( k ) = X ( k ( 1 ) ) × ⋯ × X ( k ( N k ) ) satisfy,
S [ ρ ( k ) , X ( k ) ] ≤ ∑_{i=1}^{N k} S [ ρ ( k ( i ) ) , X ( k ( i ) ) ] .
Using (96) in (94), we then have that for any n-partite information,
I [ ρ , X ( 1 ) ; … ; X ( n ) ] = ∑_{k=1}^{n} S [ ρ ( k ) , X ( k ) ] − S [ ρ , X ] ≤ ∑_{k=1}^{n} ∑_{i=1}^{N k} S [ ρ ( k ( i ) ) , X ( k ( i ) ) ] − S [ ρ , X ] = I [ ρ , X ] .
Thus, the total correlation is always greater than or equal to the correlations between any n-partite splitting of X . (Upper bounds for the discrete case were discussed in [42].)

5.2. The Bipartite (Mutual) Information

Perhaps the most studied special case of the NPI is the mutual information (MI), which is the smallest possible n-partition one can form. As was discussed in the introduction, it is useful in inference tasks and was the first member of the general class of n-partite informations to be defined and exploited [14].
To analyze the mutual information, consider first relabeling the total space as Z = X 1 × ⋯ × X N to match the common notation in the MI literature. The bipartite information considers only two subspaces, X ⊂ Z and Y ⊂ Z , rather than all of them. These two subspaces define a bipartite split in the proposition space such that X ∩ Y = ∅ and X × Y = Z . This results in turning the product marginal into,
m ( x ) → ( MI ) m ( x , y ) = p ( x ) p ( y ) ,
where x X and y Y . Finally, we arrive at the functional that we will label by its split as,
I [ ρ , X ; Y ] = ∫ d x d y p ( x , y ) log [ p ( x , y ) / ( p ( x ) p ( y ) ) ] ,
which is the mutual information. Since the marginal space is split into two distinct subspaces, the mutual information only quantifies the correlations between the two subspaces and not between all the variables as is the case with the total correlation for a given split. Whenever the total space Z is two-dimensional, the total correlation and the mutual information coincide.
One can derive the mutual information by using the same steps as in the total correlation derivation above, except replacing the independence condition in (48) with the bipartite marginal in (98). The goal is the same as (Section 3) except that the MI ranks the distributions ρ according to the correlations between two subspaces of propositions, rather than within the entire proposition space.
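As a minimal numerical illustration (ours, not from the paper; the joint distribution is an arbitrary assumption), the bipartite functional can be evaluated directly for a two-variable joint, in which case it coincides with the total correlation.

```python
# Illustrative two-variable example: for a bipartite split of a
# two-dimensional space the mutual information and the total correlation
# are the same functional, I[rho, X; Y] = I[rho, X].
import numpy as np

p = np.array([[0.30, 0.20],
              [0.05, 0.45]])                 # joint p(x, y)
px, py = p.sum(axis=1), p.sum(axis=0)

mi = sum(p[i, j] * np.log(p[i, j] / (px[i] * py[j]))
         for i in range(2) for j in range(2) if p[i, j] > 0)
print(mi)   # equals the total correlation of this two-variable distribution
```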

5.3. The Discrete Total Correlation

One may derive a discrete total correlation and discrete NPI by starting from Equation (36),
I [ ρ , X ] = ∑_{i=1}^{| X |} F i ( P ( x i ) ) ,
and then following the same arguments without taking the continuous limit after DC1 was imposed.
The inferential transformations explored in Section 2.1 are somewhat different for discrete distributions. Coordinate transformations are replaced by general reparameterizations (an example of such a discrete reparameterization (or discrete coordinate transformation) is intuitively used in coin-flipping experiments: the outcome of coin flips may be parameterized with (−1, 1) or equally with (0, 1) to represent tails versus heads outcomes, respectively), in which one defines a bijection between sets,
f : X → X ′ , x i ↦ f ( x i ) = x i ′ .
Like with general coordinate transformations (3), if f ( x i ) is a bijection then we equate the probabilities,
P ( x i ) = P ( f ( x i ) ) = P ( x i ′ ) .
Since x i is simply a label for a proposition, we enforce that the probabilities associated to it are independent of the choice of label. One can define discrete split coordinate invariant transformations analogous to the continuous case. Using discrete split invariant coordinate transformations, the index i is removed from F i above, which is analogous to the removal of the x coordinate dependence in the continuous case. The functional equation for the log is found and solved analogously by imposing DC2. The discrete TC is then found and the discrete NPI may be argued.
The other transformations in Section 2.1 remain the same, except for replacing integrals by sums. In the above subsections, the continuous relative entropy is replaced by the discrete version,
S [ P , Q ] = − ∑_{x i} P ( x i ) log [ P ( x i ) / Q ( x i ) ] .
The discrete MI is extremely useful for dealing with problems in communication theory, such as noisy-channel communication and Rate-Distortion theory [5]. It is also reasonable to consider situations where one has combinations of discrete and continuous variables. One example is the binary category case [34].

6. Sufficiency

There is a large literature on the topic of sufficiency [5,30,43] which dates back to work originally done by Fisher [44]. Some have argued that the idea dates back to even Laplace [23], a hundred years before Fisher. What both were trying to do ultimately was determine whether one could find simpler statistics that contain all of the required information to make equal inferences about some parameter.
Let p ( x , θ ) = p ( x ) p ( θ | x ) = p ( θ ) p ( x | θ ) be a joint distribution over some variables X and some parameters we wish to infer Θ . Consider then a function y = f ( x ) , and also the joint density,
p ( x , y , θ ) = p ( x , y ) p ( θ | x , y ) = p ( x ) p ( θ | x , y ) δ ( y − f ( x ) ) .
If y is a sufficient statistic for x with respect to θ , then the above equation becomes,
p ( x , y , θ ) = p ( x ) p ( θ | y ) δ ( y − f ( x ) ) ,
and the conditional probability p ( θ | x , y ) = p ( θ | y ) does not depend on x because y is sufficient. Fisher’s factorization theorem states that a sufficient statistic for θ will give the following relation,
p ( x | θ ) = f ( y | θ ) g ( x ) ,
where f and g are functions that are not necessarily probabilities; i.e., they are not normalized with respect to their arguments. However, since the left-hand side is certainly normalized with respect to x , the right-hand side must be as well. To see this, we can rewrite Equation (105) in terms of the distributions,
p ( x , y , θ ) = p ( x , θ ) p ( y | x , θ ) = p ( θ ) p ( x | θ ) δ ( y − f ( x ) ) .
Equating Equations (105) and (107) we find,
p ( x ) p ( θ | y ) = p ( θ ) p ( x | θ ) ,
which gives the general result,
p ( x | θ ) = [ p ( θ | y ) / p ( θ ) ] p ( x ) = [ p ( y | θ ) / p ( y ) ] p ( x ) .
We can then identify g ( x ) = p ( x ) , which only depends on x , and f ( y | θ ) = p ( y | θ ) / p ( y ) , which is the ratio of two probabilities and hence not normalized with respect to y .
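As a concrete illustration (a standard textbook example of ours, not taken from the text above): for n exchangeable Bernoulli trials x = ( x 1 , … , x n ) with bias θ and statistic y = f ( x ) = ∑ i x i , one has p ( x | θ ) = θ^y ( 1 − θ )^{ n − y } and p ( x ) = ∫ d θ p ( θ ) θ^y ( 1 − θ )^{ n − y } , while p ( y | θ ) / p ( y ) = θ^y ( 1 − θ )^{ n − y } / ∫ d θ ′ p ( θ ′ ) θ ′^y ( 1 − θ ′ )^{ n − y } . Multiplying the last two expressions indeed reproduces p ( x | θ ) , so the factorization (106) with g ( x ) = p ( x ) and f ( y | θ ) = p ( y | θ ) / p ( y ) is satisfied and y is sufficient for θ .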

6.1. A New Definition of Sufficiency

While the notion of a sufficient statistic is useful, how can we quantify the sufficiency of a statistic which is not completely sufficient but only partially? What about for the case of generic n-partite systems? The n-partite information can provide an answer. We first begin with the bi-partite case.
Our design derivation shows that the MI is uniquely designed for quantifying the global correlations in the bi-partite case. Because the correlations between two variables indicate how informative one variable is toward the inference of the other, a change in the MI indicates a change in one's ability to make such inferences over the entire spaces of both variables. Thus, we can use this interpretation toward quantifying statistical sufficiency in terms of the change in the amount of correlations in a global sense.
Consider an arbitrary continuous function f : X → f ( X ) , which we call a statistic of X . We define the sufficiency of the statistic f ( X ) with respect to another space Θ for a bi-partite system as simply the ratio of mutual informations,
suff Θ [ f ( X ) ] = def I [ ρ , f ( X ) ; Θ ] / I [ ρ , X ; Θ ] ,
which is always bounded by 0 ≤ suff Θ [ f ( X ) ] ≤ 1 due to the data processing inequality in Appendix C.3, and ρ is the distribution defined over the joint space f ( X ) × Θ . In this problem space, it is assumed that there exist correlations between ( X , Θ ) , i.e., I [ ρ , X ; Θ ] > 0 , at least before the statistic is checked for sufficiency. Statistics for which suff Θ [ f ( X ) ] = 1 are called sufficient and correspond to the definition given by Fisher. We can see this by appealing to the special case p ( x , f ( x ) , θ ) = p ( x ) p ( θ | x ) δ ( y − f ( x ) ) for some statistic y = f ( x ) . It is true that,
p ( y ) = ∫ d x p ( x , y ) = ∫ d x p ( x ) δ ( y − f ( x ) ) ,
so that when p ( θ | y ) = p ( θ | x ) ,
I [ ρ , f ( X ) ; Θ ] = ∫ d y d θ p ( y ) p ( θ | y ) log [ p ( θ | y ) / p ( θ ) ] = ∫ d x d y d θ p ( x ) δ ( y − f ( x ) ) p ( θ | x ) log [ p ( θ | x ) / p ( θ ) ] = I [ ρ , X ; Θ ] ,
which is the criterion for y to be a sufficient statistic. With this definition of sufficiency (110) we have a way of evaluating maps f ( X ) which attempt to preserve correlations between X and Y . These procedures are ubiquitous in machine learning [34], manifold learning and other inference tasks, although the explicit connection with statistical sufficiency has yet to be realized.
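The following toy computation (our own sketch; the joint distribution, the candidate statistics and the helper names are illustrative assumptions) evaluates the sufficiency ratio (110) for a discrete joint p ( x , θ ) : merging only states that share the same posterior p ( θ | x ) gives a ratio of one, while a coarser merge gives a ratio below one.

```python
# Illustrative toy computation of suff_Theta[f(X)] = I[rho, f(X); Theta] / I[rho, X; Theta].
import numpy as np

def mutual_information(p):
    """MI of a 2-d joint array p(a, b)."""
    pa, pb = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / (pa * pb)[mask])).sum())

def pushforward(p, f, n_out):
    """Joint of (f(x), theta) obtained by summing p(x, theta) over preimages."""
    q = np.zeros((n_out, p.shape[1]))
    for x in range(p.shape[0]):
        q[f[x]] += p[x]
    return q

# p(x, theta): x in {0,1,2,3}, theta in {0,1}; x = 0,1 share one posterior
# p(theta|x) and x = 2,3 share another, so merging within groups loses nothing.
p_theta_given_x = np.array([[0.9, 0.1], [0.9, 0.1], [0.2, 0.8], [0.2, 0.8]])
p_x = np.array([0.1, 0.3, 0.4, 0.2])
p = p_theta_given_x * p_x[:, None]

f_sufficient = [0, 0, 1, 1]   # merges only states with equal p(theta|x)
f_lossy      = [0, 1, 1, 1]   # merges states with different posteriors

I_full = mutual_information(p)
print(mutual_information(pushforward(p, f_sufficient, 2)) / I_full)  # ~ 1.0
print(mutual_information(pushforward(p, f_lossy, 2)) / I_full)       # < 1
```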

6.2. The n-Partite Sufficiency

Like mutual information, the n-partite information can provide insights into sufficiency. Let us first begin by stating a theorem.
Theorem 1
(n-partite information inequality). Let X = X 1 × ⋯ × X N be a collection of N subspaces and let { X ( k ) } n be an n-partite system of X . Then, for any function f : X ( k ) → X ( k ) ′ which acts on one of the n-partite subspaces, we have the following inequality,
I [ ρ , X ( 1 ) ; … ; X ( k ) ; … ; X ( n ) ] ≥ I [ ρ , X ( 1 ) ; … ; f ( X ( k ) ) ; … ; X ( n ) ] .
The proof of the above theorem can be written in analogy to the data processing inequality derivation in Appendix C.3.
Proof. 
Let { X ( k ) } n be an n-partite system for a collection of N variables, X = X 1 × × X N . The joint distribution can be written,
p ( x 1 , … , x N ) = p ( x 1 ) p ( x 2 | x 1 ) ⋯ p ( x N | x 1 , … , x N − 1 ) ,
and the n-partite marginal m ( x ) is,
m ( x ) = ∏_{k=1}^{n} p ( x ( k ) ) ,
where x ( k ) X ( k ) is the kth collection in { X ( k ) } n . Thus the n-partite information is written,
I [ ρ , X ( 1 ) ; … ; X ( n ) ] = ∫ d x p ( x ) log [ p ( x ) / ∏_{k=1}^{n} p ( x ( k ) ) ] .
Consider now that we define the function,
f x ( k ) : X ( k ) → X ( k ) ′ , x ( k ) ↦ f x ( k ) ( x ( k ) ) = x ( k ) ′ ,
which takes the collection of variables X ( k ) to some other collection X ( k ) ′ . Thus, we have the joint distribution,
p ( x ( k ) , x ( k ) ′ ) = p ( x ( k ) ) p ( x ( k ) ′ | x ( k ) ) = p ( x ( k ) ) δ ( x ( k ) ′ − f x ( k ) ( x ( k ) ) ) .
Now, consider the case in which the variables X ( k ) ′ are combined with the original variables X ( k ) in their respective partition as a Cartesian product, X ( k ) → X ( k ) × X ( k ) ′ , so that the n-partite information becomes,
I [ ρ , X ( 1 ) ; … ; X ( k ) × X ( k ) ′ ; … ; X ( n ) ] = I [ ρ , X ( 1 ) ; … ; X ( k ) ; … ; X ( n ) ] + I ˜ [ ρ , X ( 1 ) ; … ; X ( k ) ′ ; … ; X ( n ) | X ( k ) ] ,
where the second term on the right hand side is,
I ˜ [ ρ , X ( 1 ) ; … ; X ( n ) | X ( k ) ] = ∫ d x p ( x ) log [ p ( x ( k ) ′ | x ( 1 ) , … , x ( k ) , … , x ( n ) ) / p ( x ( k ) ′ | x ( k ) ) ] = 0 ,
since p ( x ( k ) ′ | x ( 1 ) , … , x ( k ) , … , x ( n ) ) = p ( x ( k ) ′ | x ( k ) ) = δ ( x ( k ) ′ − f x ( k ) ( x ( k ) ) ) . Consider however that we break up the n-partite information in (119) by first removing X ( k ) instead of X ( k ) ′ ,
I [ ρ , X ( 1 ) ; … ; X ( k ) × X ( k ) ′ ; … ; X ( n ) ] = I [ ρ , X ( 1 ) ; … ; X ( k ) ′ ; … ; X ( n ) ] + I ˜ [ ρ , X ( 1 ) ; … ; X ( k ) ; … ; X ( n ) | X ( k ) ′ ] ,
where again the second term is,
I ˜ [ ρ , X ( 1 ) ; … ; X ( n ) | X ( k ) ′ ] = ∫ d x p ( x ) log [ p ( x ( k ) | x ( 1 ) , … , x ( k ) ′ , … , x ( n ) ) / p ( x ( k ) | x ( k ) ′ ) ] = I [ ρ , X ( k ) ; X ˜ | X ( k ) ′ ] ≥ 0 ,
and where X ˜ = X \ X ( k ) . Thus we have that,
I [ ρ , X ( 1 ) ; … ; X ( k ) × X ( k ) ′ ; … ; X ( n ) ] = I [ ρ , X ( 1 ) ; … ; X ( k ) ; … ; X ( n ) ] = I [ ρ , X ( 1 ) ; … ; X ( k ) ′ ; … ; X ( n ) ] + I [ ρ , X ( k ) ; X ˜ | X ( k ) ′ ] ,
and thus we have that, for a general function f x ( k ) of the k th partition, the inequality,
I [ ρ , X ( 1 ) ; … ; X ( k ) ; … ; X ( n ) ] ≥ I [ ρ , X ( 1 ) ; … ; f ( X ( k ) ) ; … ; X ( n ) ] ,
which proves the Theorem 1. □
The above theorem provides a ranking of transformation functions ( f x ( 1 ) , f x ( 2 ) , , f x ( n ) ) of the n-partitions,
I [ ρ , X ( 1 ) ; … ; X ( n ) ] ≥ I [ ρ , f ( X ( 1 ) ) ; … ; X ( n ) ] ≥ I [ ρ , f ( X ( 1 ) ) ; f ( X ( 2 ) ) ; … ; X ( n ) ] ≥ ⋯ ≥ I [ ρ , f ( X ( 1 ) ) ; f ( X ( 2 ) ) ; … ; f ( X ( n ) ) ] .
The action of any set of functions { f } can only ever decrease the amount of n-partite correlations, much as in the data processing inequality. We define the sufficiency of a set of functions { f } analogously to (110). Consider that we generate a set of functions for m of the partitions, leaving n − m partitions alone. Then the sufficiency of the set of functions { f } which act on the subspaces { X ( k ) } m with respect to the remaining n − m partitions is given by,
suff X ˜ ( { f } ) = def I [ ρ , f ( X ( 1 ) ) ; … ; f ( X ( m ) ) ; X ( m + 1 ) ; … ; X ( n ) ] / I [ ρ , X ( 1 ) ; … ; X ( n ) ] ,
where X ˜ = { X ( k ) } n \ { X ( k ) } m . Like the bi-partite sufficiency, the n-partite sufficiency is bounded between zero and one due to the n-partite information inequality.

6.3. The n-Partite Joint Processing Inequality

Using the results from Equations (97) and (113), we can express a general result which we call the n-partite joint processing inequality. While the n-partite inequality concerns functions which act individually within the n-partite spaces, we can generalize this notion to functions which act over the entire variable space, i.e., functions which jointly process partitions.
Theorem 2
(n-partite joint processing inequality). Let X = X 1 × ⋯ × X N be a collection of N subspaces and let { X ( k ) } n be an n-partite system of X . Then, for any function f : X ( k ) × X ( ℓ ) → X ′ which acts on two of the n-partite subspaces, we have the following inequality,
I [ ρ , X ( 1 ) ; … ; X ( k ) ; X ( ℓ ) ; … ; X ( n ) ] ≥ I [ ρ , X ( 1 ) ; … ; f ( X ( k ) , X ( ℓ ) ) ; … ; X ( n ) ] .
Proof. 
Consider the case of three partitions, X = X ( 1 ) × X ( 2 ) × X ( 3 ) . Then, using a result from Appendix C.1.3, Equation (A28), it is true that,
I [ ρ , X ( 1 ) ; X ( 2 ) ; X ( 3 ) ] ≥ I [ ρ , X ( 1 ) × X ( 2 ) ; X ( 3 ) ] .
Thus, together with the n-partite information inequality (113), any function f x ( 1 , 2 ) which combines two partitions necessarily satisfies,
I [ ρ , X ( 1 ) ; X ( 2 ) ; X ( 3 ) ] ≥ I [ ρ , f x ( 1 , 2 ) ( X ( 1 ) , X ( 2 ) ) ; X ( 3 ) ] .
 □
Thus, an analogous continuous definition of the sufficiency follows for n-partitions of m joint statistic functions. Consider a function f x ( 1 , , m ) which combines the first m partitions of an n-partite system,
f x ( 1 , … , m ) : X ( 1 ) × ⋯ × X ( m ) → X ′ , ( x ( 1 ) , … , x ( m ) ) ↦ f x ( 1 , … , m ) ( x ( 1 ) , … , x ( m ) ) = x ′ .
Then we define the sufficiency of the map f x ( 1 , … , m ) with respect to the remaining n − m partitions as the ratio,
suff X ˜ ( f x ( 1 , … , m ) ) = def I [ ρ , X ′ ; X ( m + 1 ) ; … ; X ( n ) ] / I [ ρ , X ( 1 ) ; … ; X ( m ) ; X ( m + 1 ) ; … ; X ( n ) ] .
There are many possible combinations of maps of the form (130) and (117) that may analogously be expressed with a continuous notion of sufficiency; however, for the application of successive functions, one will always find a nested set of inequalities of the form (129) and (125).

6.4. The Likelihood Ratio

Here we will associate the invariance of MI to invariance of type I and type II errors. Consider a binary decision problem in which we have some set of discriminating variables X , following a mixture of two distributions (e.g., signal and background) labeled by a parameter θ = { s , b } . The inference problem can then be cast in terms of the joint distribution p ( x , θ ) = p ( x ) p ( θ | x ) . According to the Neyman-Pearson lemma [24], the likelihood ratio,
Φ ( x ) = L ( s | x ) / L ( b | x ) = p ( x | s ) / p ( x | b ) ,
gives a sufficient statistic for the significance level,
α ( x ) = P ( Φ ( x ) ≥ η | b ) ,
where b = H 0 is typically associated to the null hypothesis. This means that the likelihood ratio (132) will allow us to determine if the data X satisfies the significance level in (133). Given Bayes’ theorem, the likelihood ratio is equivalent to,
Φ ( x ) = p ( x | s ) / p ( x | b ) = [ p ( b ) / p ( s ) ] [ p ( s | x ) / p ( b | x ) ] = [ p ( b ) / p ( s ) ] Π ( x ) ,
which is the posterior ratio and is just as good of a statistic as the likelihood ratio, since p ( b ) / p ( s ) is a constant for all x X . If we then construct a sufficient statistic y = f ( x ) for X , such that,
Y = f ( X ) ⟹ I [ ρ , f ( X ) ; θ ] = I [ ρ , X ; θ ] ,
then the posterior ratios, and hence the likelihood ratios, are equivalent,
Π ( f ( x ) ) = p ( s | f ( x ) ) / p ( b | f ( x ) ) = p ( s | x ) / p ( b | x ) = Π ( x ) ,
and hence the significance levels are also invariant,
α ( f ( x ) ) = P ( Φ ( f ( x ) ) ≥ η | b ) = P ( Φ ( x ) ≥ η | b ) = α ( x ) ,
and therefore the type I and type II errors will also be invariant. Thus we can use the MI as a tool for finding the type I and type II errors for some unknown probability distribution by first constructing a sufficient statistic using some technique (typically a ML technique), and then finding the type I and type II errors on the simpler distribution. Then, due to (136), the errors associated to the simpler distribution will be equivalent to the errors of the original unknown one.
Apart from its invariance, we can also show another consequence of MI under arbitrary transformations f ( X ) for binary decision problems. Imagine that we successfully construct a sufficient statistic for X . Then, it is a fact that the likelihood ratios Φ ( x ) and Φ ( f ( x ) ) will be equivalent for all x X . Consider that we adjust the probability of one value of p ( θ | f ( x i ) ) by shifting the relative weight of signal and background for that particular value f ( x i ) ,
p ( s | f ( x i ) ) → p ′ ( s | f ( x i ) ) = p ( s | f ( x i ) ) + δ p , p ( b | f ( x i ) ) → p ′ ( b | f ( x i ) ) = p ( b | f ( x i ) ) − δ p ,
where δ p is some small change, so that the particular value of
Π ′ ( f ( x i ) ) = p ′ ( s | f ( x i ) ) / p ′ ( b | f ( x i ) ) = [ p ( s | f ( x i ) ) + δ p ] / [ p ( b | f ( x i ) ) − δ p ] ≠ Π ( f ( x i ) ) ,
which is not equal to the value given by the sufficient statistic. Whether the value Π ′ ( f ( x i ) ) is larger or smaller than Π ( f ( x i ) ) , in either case the number of type I or type II errors will increase for the distribution in which the sufficient value p ( θ | f ( x i ) ) is replaced by p ′ ( θ | f ( x i ) ) . Therefore, for any distribution given by the joint space X × Θ , the MI determines the type I and type II errors for any statistic on the data X .

7. Discussion of Alternative Measures of Correlation

The design derivation in this paper puts the various NPI functionals on an equal foundational footing as the relative entropy. This begs the question as to whether other similar information theoretic quantities can be designed along similar lines. Some quantities that come to mind are α-mutual information [45], multivariate-mutual information [46,47], directed information [48], transfer entropy [49] and causation entropy [50,51,52].
The α -mutual information [45] belongs to the family of functionals that fall under the name Rényi entropies [53] and their close cousin Tsallis entropy [54,55]. Tsallis proposed his entropy functional as an attempted generalization for applications of statistical mechanics, however the probability distributions that it produces can be generated from the standard MaxEnt procedure [10] and do not require a new thermodynamics (for some discussion on this topic see [10] page 114). Likewise, Rényi's family of entropies attempts to generalize the relative entropy for generic inference tasks, which inadvertently relaxes some of the design criteria concerning independence. Essentially, Rényi introduces a set of parameterized entropies S η [ p , q ] , with parameter η , which leads to the weakening of the independent subsystem additivity criterion. Imposing that these functionals then obey subsystem independence immediately constrains η = 0 or η = 1 , and reduces them back to the standard relative entropy, i.e., S η = 0 [ p , q ] → S [ p , q ] and S η = 1 [ p , q ] → S [ q , p ] . Without a strict understanding of what it means for subsystems to be independent, one cannot conduct reasonable science. Thus, such "generalized" measures of correlation (such as the α -mutual information [45]) which abandon subsystem independence cannot be trusted.
Defining multivariate-mutual information (MMI) is an attempt to generalize the standard MI to a case where several sets of propositions are compared with each other, however this is different than the total correlation which we designed in this paper. For example, given three sets of propositions X , Y and Z , the multivariate-mutual information (not to be confused with the multi-information [56] which is another name for total correlation) is,
M M I [ ρ , X ; Y ; Z ] = def I [ ρ , X ; Y ] − I [ ρ , X ; Y | Z ] .
One difficulty with this expression is that it can be negative, as was shown by Hu [47]. Thus, defining a minimum MMI is not possible in general, which suggests that a design derivation of MMI requires a different interpretation. Despite these difficulties, there have been several recent successes in the study of MMI including Bell [57] and Baudot et al. [16,17] who studied MMI in the context of algebraic topology.
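The possible negativity is easy to exhibit with the standard XOR construction (coded below as our own illustration, not taken from the paper): with X and Y independent fair bits and Z = X XOR Y , one has I [ X ; Y ] = 0 while I [ X ; Y | Z ] = log 2 , so the MMI above is negative.

```python
# Illustrative check that MMI = I[X;Y] - I[X;Y|Z] can be negative (XOR example).
import numpy as np

# joint p(x, y, z) with z = x XOR y and x, y independent fair coin flips
p = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        p[x, y, x ^ y] = 0.25

def mi_xy(p):                       # I(X;Y) from the (x, y) marginal
    pxy = p.sum(axis=2)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    return sum(pxy[i, j] * np.log(pxy[i, j] / (px[i] * py[j]))
               for i in range(2) for j in range(2) if pxy[i, j] > 0)

def mi_xy_given_z(p):               # I(X;Y|Z)
    pz = p.sum(axis=(0, 1))
    return sum(pz[z] * mi_xy(p[:, :, z:z + 1] / pz[z]) for z in range(2))

print(mi_xy(p), mi_xy_given_z(p), mi_xy(p) - mi_xy_given_z(p))  # 0, log2, -log2
```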
Another extension of mutual information is transfer entropy, which was first introduced by Schreiber [49] and is a special case of directed information [48]. Transfer entropy is a conditional mutual information which attempts to quantify the amount of "information" that flows between two time-dependent random processes. Given a set of propositions which are dynamical, such that X = X ( t ) and Y = Y ( t ) , so that at time t i the propositions take the form X ( t i ) = def X t i and Y ( t i ) = def Y t i , the transfer entropy (TE) from X ( t ) to Y ( t ) at time t i is defined as the conditional mutual information,
T X → Y = def I [ ρ , Y t i ; X t j < i | Y t j < i ] .
The notation t j < i refers to all times t j before t i . Thus, the TE is meant to quantify the influence of a variable X on predicting the state Y t i when one already knows the history of Y , i.e., it quantifies the amount of independent correlations provided by X . Given that TE is a conditional mutual information, it does not require a design derivation independent of the MI. It can be justified on the basis of the discussion around (A32). Likewise the more general directed information is also a conditional MI and hence can be justified in the same way.
Finally, the definition of causation entropy [50,51,52] can also be expressed as a conditional mutual information. Causation entropy (CE) attempts to quantify time-dependent correlations between nodes in a connected graph and hence generalize the notion of transfer entropy. Given a set of nodes X , Y and Z , the causation entropy between two subsets conditioned on a third is given by,
C X → Y | Z = def I [ ρ , Y t i ; X t j < i | Z t j < i ] .
The above definition reduces to the transfer entropy whenever the set of variables Z = Y . As was shown by Sun et al. [58], the causation entropy (CE) allows one to more appropriately quantify the causal relationships within connected graphs, unlike the transfer entropy which is somewhat limited. Since CE is a conditional mutual information, it does not require an independent design derivation. As with transfer entropy and directed information, the interpretation of CE can be justified on the basis of (A32).

8. Conclusions

Using a design derivation, we showed that the TC is the functional designed to rank the global amount of correlations in a system between all of its variables. We relied heavily on the PCC, which, while quite simple, is restrictive enough to isolate the functional form of I [ ρ , X ] using eliminative induction. We enforced the PCC using two different methods as an additional measure of rigor (analytically through Taylor expanding and algebraically through the functional Equation (71)). The fact that both approaches lead to the same functional shows that the design criteria are highly constraining, and the total correlation is the unique solution. We generalized our solution to the n-partite information; this global correlation quantifier can express the TC and MI as special cases.
Using our design derivation we were able to quantify the amount of global correlations and analyze the effect of inferential transformations in a new light. Because the correlations between variables indicate how informative a set of variables are toward the inference of the others, a change in the global amount indicates a change in one's ability to make such inferences globally. Thus, we can use NPI to quantify statistical sufficiency in terms of the change in the amount of correlations over the entire joint variable spaces. This leads to a rigorous quantification of continuous statistical sufficiency that attains its upper bound of one when the Fisher sufficiency condition is satisfied.

Author Contributions

Conceptualization, N.C. and K.V.; Writing–original draft, N.C. and K.V.; Writing–review and editing, N.C. and K.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

We would like to thank Ariel Caticha for his invaluable support and guidance on this project. We would also like to thank Jesse Ernst, Selman Ipek and Kevin Knuth for many insightful and inspirational conversations. The authors also thank the University at Albany and the Massachusetts Institute of Technology for their support in the publication of this article.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Coordinate Invariance

Consider a continuous space of propositions X ⊆ R and an associated statistical manifold Δ defined by (1). A point in the statistical manifold is labeled ρ ∈ Δ , which is a map,
ρ : P ( X ) → [ 0 , 1 ] , x ↦ p ( x ) .
Consider now the cumulative distribution of X for some value of x ˜ ∈ X ,
P ( X ≤ x ˜ ) = ∫_{x ≤ x ˜} d x p ( x ) .
One can always recover the density p ( x ) by differentiating (A2) with respect to x,
p ( x ) = ∂/∂ x P ( X ≤ x ˜ ) = ∂/∂ x ∫_{x ≤ x ˜} d x p ( x ) .
If we now consider a smooth bijection f : X → Y , so that all the first derivatives exist, then it is true that f ( x ) will always be an increasing or decreasing function of x . It is true then that if x ≤ x ˜ , then f ( x ) ≤ f ( x ˜ ) . It will then be true that the cumulative distributions will be equal,
P ( X ≤ x ˜ ) = P ( f ( X ) ≤ f ( x ˜ ) ) = P ( Y ≤ y ˜ ) .
Therefore, the cumulative distribution over Y up to some value y ˜ ∈ Y is,
P ( Y ≤ y ˜ ) = ∫_{y ≤ y ˜} d y p ( y ) = ∫_{x ≤ x ˜} d x p ( x ) ,
so that the density p ( x ) from (A3) becomes,
p ( x ) = ∂/∂ x P ( Y ≤ y ˜ ) = ∂/∂ x ∫_{y ≤ y ˜} d y p ( y ) = ( ∂ y / ∂ x ) p ( y ) ,
which shows that the densities transform as,
p ( x ) d x = p ( y ) d y .
For a generic n-dimensional space of propositions X ⊆ R^n , the density (A3) becomes,
p ( x ) = ∂^n / ( ∂ x 1 ⋯ ∂ x n ) ∫_{x ≤ x ˜} d x p ( x ) = ∂^n / ( ∂ x 1 ⋯ ∂ x n ) ∫_{y ≤ y ˜} d y p ( y ) ,
where we suppressed the notation in the measure d^n x = d x and d^n y = d y . Then, the transformation (A6) becomes,
p ( x ) = p ( y ) γ ( y ) = p ( y ) | ∂ y / ∂ x | ,
where | ∂ y / ∂ x | is the Jacobian of the transformation between X and Y .

Appendix B. Watanabe’s Theorem

A theorem presented by Watanabe [6] concerns the grouping property of entropy and its relation to total correlation. The theorem was stated in [6] on page 70.
Theorem A1
(Watanabe (1960)). The set of all variables in consideration is divided into subsets, and each subset is again subdivided into sub-subsets, et cetera, until finally the entire set is branched into individual variables. Then, the sum of all correlations, each of which is defined with respect to a branching point, is independent of the way in which this branching procedure is made and is equal to the total correlation.
Essentially, one can choose any sequence of n-partite splittings of a set of N variables X and still arrive at the same total correlation by iterating the splitting until all the variables have been individually split and adding the n-partite informations at each split. This can be proved easily using the grouping property of entropy.
Proof. 
To prove the theorem, let ℓ denote the level of the splitting sequence of the proposition space X , so that X^ℓ ⊆ P ( X ) denotes the collection of all variables at the ℓ th level. Let n^ℓ denote the number of subsets in the ℓ th split and let X_k^ℓ ⊆ X^ℓ denote the k th subset of the subset X^ℓ . Also let n_k^ℓ denote the number of subsets within the k th subset of the ℓ th split, so that ∑_k n_k^ℓ = n^{ℓ+1} . We then have that,
×_{k=1}^{n^ℓ} X_k^ℓ = X , and X_k^ℓ ∩ X_j^ℓ = ∅ , k ≠ j .
Each subset X_k^ℓ contains within it another subset of variables at the split level ( ℓ + 1 ) , so that we identify,
X_k^{( ℓ + 1 )} = X_{ki}^{ℓ} , where ×_{i=1}^{n_k^ℓ} X_{ki}^{ℓ} = X_k^{ℓ} .
The combination of the three indices ( ℓ , k , i ) defines a unique subset at level ( ℓ + 1 ) . We can write the n-partite information of the k th subset at level ℓ using (94) as,
I [ ρ , X_k^ℓ ] = def ∑_{i=1}^{n_k^ℓ} S [ ρ , X_{ki}^ℓ ] − S [ ρ , X_k^ℓ ] .
Continuing to the next level of the split ( ℓ + 1 ) , we can also write down the n-partite information for each of the n_k^ℓ subspaces as,
I [ ρ , X_k^{ℓ+1} ] = ∑_{i=1}^{n_k^{ℓ+1}} S [ ρ , X_{ki}^{ℓ+1} ] − S [ ρ , X_k^{ℓ+1} ] .
The sum of correlations at level ℓ is given by,
I [ ρ , X^ℓ ] = def ∑_{k=1}^{n^ℓ} I [ ρ , X_k^ℓ ] = ∑_{k=1}^{n^ℓ} ( ∑_{i=1}^{n_k^ℓ} S [ ρ , X_{ki}^ℓ ] − S [ ρ , X_k^ℓ ] ) ,
so that using (A11) the sum of the correlations at level ℓ and level ( ℓ + 1 ) becomes,
I [ ρ , X^{ℓ+1} ] + I [ ρ , X^ℓ ] = ∑_{k=1}^{n^{ℓ+1}} I [ ρ , X_k^{ℓ+1} ] + ∑_{k=1}^{n^ℓ} I [ ρ , X_k^ℓ ] = ∑_{k=1}^{n^{ℓ+1}} ∑_{i=1}^{n_k^{ℓ+1}} S [ ρ , X_{ki}^{ℓ+1} ] − ∑_{k=1}^{n^ℓ} S [ ρ , X_k^ℓ ] ,
where we used from (A11) and (A10) that,
∑_{k=1}^{n^ℓ} ∑_{i=1}^{n_k^ℓ} S [ ρ , X_{ki}^ℓ ] = ∑_{k=1}^{n^{ℓ+1}} S [ ρ , X_k^{ℓ+1} ] .
Summing over all ℓ up to some n , beginning with X^0 = def X , the expression in (A15) generalizes to,
∑_{ℓ=0}^{n} I [ ρ , X^ℓ ] = ∑_{k=1}^{n^n} ∑_{i=1}^{n_k^n} S [ ρ , X_{ki}^n ] − S [ ρ , X ] .
If one continues splitting until each subset X_{ki}^N = X_i contains only a single variable from X , then the above Equation becomes,
∑_{ℓ=0}^{N} I [ ρ , X^ℓ ] = ∑_{i=1}^{N} S [ ρ , X_i ] − S [ ρ , X ] = I [ ρ ] ,
which is the total correlation. Given that the subsets at each split level are disjoint (A11), then the above equation is independent of the choice of splitting. This proves the Theorem A1. □
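The theorem can also be checked numerically; the sketch below (ours, with an arbitrary randomly generated distribution) compares two different branching sequences for three variables and recovers the same total correlation from the summed branch correlations.

```python
# Illustrative numerical check of Theorem A1: splitting X1 x X2 x X3 first as
# {X1},{X2 x X3} and then {X2},{X3}, or first as {X1 x X2},{X3} and then
# {X1},{X2}, gives the same sum of branch correlations, namely the TC.
import numpy as np
from itertools import product

def npi(p, partition):
    """n-partite information of a joint array p for a partition of its axes."""
    all_axes = set(range(p.ndim))
    blocks = [p.sum(axis=tuple(all_axes - set(block))) for block in partition]
    val = 0.0
    for idx in product(*(range(s) for s in p.shape)):
        if p[idx] > 0:
            m = np.prod([blocks[b][tuple(idx[a] for a in block)]
                         for b, block in enumerate(partition)])
            val += p[idx] * np.log(p[idx] / m)
    return val

rng = np.random.default_rng(3)
p = rng.random((2, 2, 2)); p /= p.sum()

tc = npi(p, [(0,), (1,), (2,)])
branching_1 = npi(p, [(0,), (1, 2)]) + npi(p.sum(axis=0), [(0,), (1,)])
branching_2 = npi(p, [(0, 1), (2,)]) + npi(p.sum(axis=2), [(0,), (1,)])
print(tc, branching_1, branching_2)   # all three agree
```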

Appendix C. Consequences of the Derivation

Here we will analyze some basic properties of total correlation and mutual information, as well as show their consistency with the design criteria.

Appendix C.1. Inferential Transformations (Again)

Due to the results of our design derivation, we can quantify how the amount of correlations changes with the inferential transformations from Section 2.1 and obtain a better understanding of them. We will also discuss some of the useful properties of the mutual information in Section 6.2 and Section 6.3.

Appendix C.1.1. Type I: Coordinate Transformations (Again)

Consider type I transformations, which are coordinate transformations. Under type I, the density changes from p ( x 1 , … , x N ) to p ′ ( x 1 ′ , … , x N ′ ) so that the probabilities remain equal,
p ( x 1 , … , x N ) d x = p ′ ( x 1 ′ , … , x N ′ ) d x ′ .
The individual density p ( x 1 , … , x N ) transforms as,
p ( x 1 , … , x N ) = p ′ ( x 1 ′ , … , x N ′ ) γ ( x 1 ′ , … , x N ′ ) .
If the transformation (A19) is split coordinate invariant, then we have that the new marginals must obey,
p ( x i ) d x i = p ′ ( x i ′ ) d x i ′ .
It is important to note that the joint density p ′ ( x ′ ) and the marginals p ′ ( x i ′ ) are necessarily different functions than p ( x ) and p ( x i ) and are not simply the same functions defined over the transformed space, which would be written as p ( x ′ ) and p ( x i ′ ) . Due to (41), it is true that,
I [ ρ ′ , X ′ ] = ∫ d x ′ p ′ ( x 1 ′ , … , x N ′ ) log [ p ′ ( x 1 ′ , … , x N ′ ) / ∏_{i=1}^{N} p ′ ( x i ′ ) ] = ∫ d x p ( x 1 , … , x N ) log [ p ( x 1 , … , x N ) / ∏_{i=1}^{N} p ( x i ) ] = I [ ρ , X ] ,
which is coordinate invariant since the Jacobian factors γ ( x ) cancel in the logarithm.

Appendix C.1.2. Type II: Entropic Updating (Again)

We can determine how the amount of correlation changes under type II transformations. One can use the relative entropy (14) to update a joint prior distribution q ( x 1 , , x N ) to a posterior distribution p ( x 1 , , x N ) when information comes in the form of constraints,
⟨ f ( x 1 , … , x N ) ⟩ = ∫ d x p ( x 1 , … , x N ) f ( x 1 , … , x N ) = κ ,
where f ( x 1 , , x N ) is a generic function of the variables and κ is an arbitrary constant. Maximizing (14) with respect to the constraint (A23) leads to,
p ( x 1 , … , x N ) = [ q ( x 1 , … , x N ) / Z ] exp ( β f ( x 1 , … , x N ) ) ,
where Z = ∫ d x q ( x 1 , … , x N ) exp ( β f ( x 1 , … , x N ) ) and β is a Lagrange multiplier. The marginals of the updated distribution p ( x 1 , … , x N ) are,
p ( x i ) = [ q ( x i ) / Z ] ∫ d x ¯ i q ( x ¯ i | x i ) e^{ β f ( x 1 , … , x N ) } ,
where d x ¯ i = ∏_{k ≠ i} d x k and x ¯ i = X \ X i (i.e., the total space X without the variable X i ). The total correlation of the prior is,
I [ ρ , X ] = ∫ d x q ( x 1 , … , x N ) log [ q ( x 1 , … , x N ) / ∏_{i=1}^{N} q ( x i ) ] ,
while the total correlation of the posterior is,
I [ ρ , X ] = ∫ d x p ( x 1 , … , x N ) log [ p ( x 1 , … , x N ) / ∏_{i=1}^{N} p ( x i ) ] = ∫ d x [ q ( x 1 , … , x N ) / Z ] exp ( β f ( x 1 , … , x N ) ) ( log [ q ( x 1 , … , x N ) / ∏_{i=1}^{N} q ( x i ) ] + β f ( x 1 , … , x N ) − log [ Z ∏_{i=1}^{N} p ( x i ) / ∏_{i=1}^{N} q ( x i ) ] ) .
Transformations of type II retain some of the correlations from the prior in the posterior. The amount of correlations may increase or decrease depending on f ( x 1 , … , x N ) . Note that even if f ( x 1 , … , x N ) = f ( x i ) , the amount of correlations can still change because, although q ( x k | x ¯ k ) remains fixed, q ( x i ) → p ( x i ) becomes redistributed in a way that may or may not concentrate on highly correlated regions.
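This observation is easy to reproduce numerically; the following sketch (our own toy example with an arbitrary prior and constraint, not from the paper) tilts a correlated prior q ( x 1 , x 2 ) by a constraint that involves only x 1 and shows that the total correlation nevertheless changes with the multiplier β .

```python
# Illustrative example: updating a correlated prior q(x1, x2) with a
# constraint involving only x1 (exponential tilting by exp(beta * x1))
# still changes the amount of correlation.
import numpy as np

def total_correlation(p):
    p1, p2 = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / (p1 * p2)[mask])).sum())

q = np.array([[0.40, 0.10],
              [0.10, 0.40]])                  # correlated prior

for beta in [0.0, 1.0, 3.0]:
    weights = np.exp(beta * np.array([0.0, 1.0]))[:, None]   # exp(beta * x1)
    p = q * weights
    p /= p.sum()                              # posterior q * exp(beta*f) / Z
    print(beta, total_correlation(p))         # TC varies with beta
```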

Appendix C.1.3. Type III: Marginalization (Again)

For type III transformations, we can determine the difference in the amount of correlations when we marginalize over a set of variables. Consider the simple case where X = X 1 × X 2 × X 3 consists of three variables. The amount of correlations among these variables can be written using the grouping property,
I [ ρ , X ] = ∫ d x 1 d x 2 p ( x 1 , x 2 ) log [ p ( x 1 , x 2 ) / ( p ( x 1 ) p ( x 2 ) ) ] + ∫ d x 1 d x 2 d x 3 p ( x 1 , x 2 ) p ( x 3 | x 1 , x 2 ) log [ p ( x 3 | x 1 , x 2 ) / p ( x 3 ) ] , i.e., I [ ρ , X 1 ; X 2 ; X 3 ] = I [ ρ , X 1 ; X 2 ] + I [ ρ , ( X 1 × X 2 ) ; X 3 ] ,
where each term on the right hand side is a mutual information. The first term on the right hand side is the mutual information between the variables X 1 and X 2 while the second is the mutual information between the joint space ( X 1 × X 2 ) and the space X 3 , which quantifies the correlations between the third variable and the rest of the proposition space. Essentially, (A28) shows that we can count correlations in pairs of subspaces provided we do not overcount, which is the reason why the joint space ( X 1 × X 2 ) appears in the second term.
Thus, if one marginalizes over X 3 so that the joint space becomes X 1 × X 2 , then the global correlations lost are given by the difference from (A28),
Δ I = I [ ρ , X 1 ; X 2 ; X 3 ] − I [ ρ , X 1 ; X 2 ] = I [ ρ , ( X 1 × X 2 ) ; X 3 ] .
We can break down mutual informations further by using the grouping property again, which leads to the second term in the above relation becoming,
I [ ρ , ( X 1 × X 2 ) ; X 3 ] = ∫ d x 1 d x 3 p ( x 1 , x 3 ) log [ p ( x 1 , x 3 ) / ( p ( x 1 ) p ( x 3 ) ) ] + ∫ d x 1 d x 2 d x 3 p ( x 1 ) p ( x 2 , x 3 | x 1 ) log [ p ( x 2 , x 3 | x 1 ) / ( p ( x 2 | x 1 ) p ( x 3 | x 1 ) ) ] = I [ ρ , X 1 ; X 3 ] + I [ ρ , X 2 ; X 3 | X 1 ] ,
where the quantity,
I [ ρ , X 2 ; X 3 | X 1 ] = ∫ d x 1 p ( x 1 ) ∫ d x 2 d x 3 p ( x 2 , x 3 | x 1 ) log [ p ( x 2 , x 3 | x 1 ) / ( p ( x 2 | x 1 ) p ( x 3 | x 1 ) ) ] ,
is typically called the conditional mutual information (CMI) [5]. In general we have the chain rule,
I [ ρ , X 1 , … , X n ; Y ] = ∑_{i=1}^{n} I [ ρ , X i ; Y | X i − 1 , … , X 1 ] .
Thus if one begins with the full space p ( x , y ) = p ( x 1 , x 2 , y ) and marginalizes over X 2 ,
p ( x 1 , y ) = ∫ d x 2 p ( x 1 , x 2 , y ) ,
then the global correlations lost are given by the CMI,
Δ I = I [ ρ , X ; Y ] − I [ ρ , X 1 ; Y ] = I [ ρ , X 2 ; Y | X 1 ] .
Such a marginalization leaves the correlations invariant whenever the CMI is zero, i.e., whenever p ( x 2 , y | x 1 ) = p ( x 2 | x 1 ) p ( y | x 1 ) , so that X 2 and Y are conditionally independent given X 1 .
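A small numerical check (ours, not from the paper; the joint distribution is an arbitrary assumption) of the identity above: the correlations lost by marginalizing over X 2 equal the conditional mutual information I [ ρ , X 2 ; Y | X 1 ] .

```python
# Illustrative check that I[rho, X1 x X2; Y] - I[rho, X1; Y] = I[rho, X2; Y | X1].
import numpy as np
from itertools import product

def npi(p, partition):
    """n-partite information of a joint array p for a partition of its axes."""
    all_axes = set(range(p.ndim))
    blocks = [p.sum(axis=tuple(all_axes - set(block))) for block in partition]
    val = 0.0
    for idx in product(*(range(s) for s in p.shape)):
        if p[idx] > 0:
            m = np.prod([blocks[b][tuple(idx[a] for a in block)]
                         for b, block in enumerate(partition)])
            val += p[idx] * np.log(p[idx] / m)
    return val

def cond_mi(p):
    """I(A;B|C) for a joint array p(a, b, c)."""
    pc = p.sum(axis=(0, 1))
    return sum(pc[c] * npi(p[:, :, c] / pc[c], [(0,), (1,)])
               for c in range(p.shape[2]) if pc[c] > 0)

rng = np.random.default_rng(5)
p = rng.random((2, 2, 2)); p /= p.sum()      # joint p(x1, x2, y)

lhs = npi(p, [(0, 1), (2,)]) - npi(p.sum(axis=1), [(0,), (1,)])
rhs = cond_mi(np.transpose(p, (1, 2, 0)))    # I(X2; Y | X1)
print(lhs, rhs)                              # the two quantities agree
```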

Appendix C.1.4. Type IV: Products (Again)

We find that transformations of type IV give the same expression. Consider the reverse of (A33) in which,
p ( x 1 , y ) = p ( x 1 ) p ( y | x 1 ) → ( IV ) p ( x 1 , x 2 , y ) = p ( x 1 , x 2 ) p ( y | x 1 , x 2 ) .
Then the change in mutual information is given by the same expression as (A28), so that the gain in global correlations is simply the value of the CMI.

Appendix C.2. Type IVa and IVb: Redundancy and Noise

Using the special case of mutual information, we can better analyze and interpret the two special cases of type IV that we introduced in Section 2.1. We can show how the amount of bipartite correlations change with the addition or subtraction of variables exhibiting redundancy or noise, which are the special cases IVa and IVb, respectively.
Consider the two subspaces X and Y . If the subspace X = X 1 × × X N is a collection of N variables, then the joint distribution can be written
p ( x , y ) = p ( x 1 , , x N , y ) = p ( x 1 , , x N ) p ( y | x 1 , , x N ) .
If the conditional probability p ( y | x 1 , … , x N ) is independent of X i but the distribution p ( x i | x 1 , … , x i − 1 , x i + 1 , … , x N ) is not, then we say that X i is redundant. This is equivalent to the condition in (19) whenever x i = f ( X \ X i ) . In this case, the amount of correlations on the full set of variables I [ ρ ] and the set without X i , ( X ˜ = X \ X i ), are equivalent,
p ( x , y ) = p ( x 1 , … , x i − 1 , x i + 1 , … , x N ) × p ( y | x 1 , … , x i − 1 , x i + 1 , … , x N ) × δ ( x i − f ( x 1 , … , x i − 1 , x i + 1 , … , x N ) ) ⟹ I [ ρ , X ] = I [ ρ ˜ , X ˜ ] .
Hence, the bipartite correlations in ( X i , Y ) are redundant. While the condition that X \ X i leads to the same mutual information as X can be satisfied by (A37), it is not necessary that X i = f ( X ˜ ) . In an extreme case, we could have that X i is independent of both X and Y ,
p ( x , y ) = p ( x i ) p ( x 1 , … , x i − 1 , x i + 1 , … , x N ) × p ( y | x 1 , … , x i − 1 , x i + 1 , … , x N ) .
In this case we say that the variable X i is noise, meaning that it adds dimensionality to the space ( X × Y ) without adding correlations. In the redundant case, the variable X i does not add dimension to the manifold ( X × Y ) .
In practice, each set of variables ( X , Y ) will contain some amount of redundancy and some amount of noise. We could always perform a coordinate transformation that takes X → X ′ and Y → Y ′ where,
X ′ = X red × X noise × X corr , Y ′ = Y red × Y noise × Y corr ,
where X red , X noise are redundant and noisy variables respectively and X corr are the parts left over that contain the relevant correlations. Then the joint distribution becomes,
p ( x , y ) = p ( x noise ) p ( y noise ) p ( x corr ) p ( y corr | x corr ) × δ ( x red − f ( x corr ) ) δ ( y red − g ( y corr ) ) .
Thus we have that,
I [ ρ , X ; Y ] = I [ ρ , X red × X noise × X corr ; Y red × Y noise × Y corr ] = I [ ρ , X corr ; Y corr ] .
These types of transformations can be exploited by algorithms to reduce the dimension of the space ( X × Y ) to simplify inferences. This is precisely what machine learning algorithms are designed to do [34]. One particular effort to use mutual information directly in this way is the Information Bottleneck Method [59].

Appendix C.3. The Data Processing Inequality

The data processing inequality is often demonstrated as a consequence of the definition of MI. Consider the following Markov chain,
Θ → X → Y , p ( θ , x , y ) = p ( θ ) p ( x | θ ) p ( y | x ) = p ( θ ) p ( x | θ ) δ ( y − f ( x ) ) .
We can always consider the MI between the joint space ( X × Y ) and Θ , I [ ρ , X × Y ; Θ ] , which decomposes according to the chain rule (A32) as,
I [ ρ , X × Y ; Θ ] = I [ ρ , X ; Θ ] + I [ ρ , Y ; Θ | X ] = I [ ρ , Y ; Θ ] + I [ ρ , X ; Θ | Y ] .
The conditional MI I [ ρ , Y ; Θ | X ] is however,
I [ ρ , Y ; Θ | X ] = ∫ d x p ( x ) ∫ d y d θ p ( y , θ | x ) log [ p ( θ | y , x ) p ( y | x ) / ( p ( θ | x ) p ( y | x ) ) ] ,
which is zero since p ( θ | y , x ) = p ( θ | x ) . Since MI is positive, we then have that for the Markov chain (A42),
I [ ρ , X ; Θ ] ≥ I [ ρ , Y ; Θ ] ,
which we can now interpret as expressing a loss of correlations. Equality is achieved only when I [ ρ , X ; Θ | Y ] is also zero, i.e., when Y is a sufficient statistic for X . We will discuss the idea of sufficient statistics in Section 6.
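The inequality is easy to verify numerically; the sketch below (our own toy example with an arbitrary likelihood and a deliberately lossy statistic) constructs a Markov chain Θ → X → Y with Y = f ( X ) and compares the two mutual informations.

```python
# Illustrative check of the data processing inequality for Theta -> X -> Y.
import numpy as np

def mutual_information(p):
    pa, pb = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / (pa * pb)[mask])).sum())

# p(theta, x) with theta in {0,1} and x in {0,1,2,3}
p_x_given_theta = np.array([[0.70, 0.20, 0.05, 0.05],
                            [0.05, 0.05, 0.20, 0.70]])
p_theta = np.array([0.5, 0.5])
p_tx = p_x_given_theta * p_theta[:, None]

f = [0, 0, 1, 1]                              # lossy deterministic map y = f(x)
p_ty = np.zeros((2, 2))
for x, y in enumerate(f):
    p_ty[:, y] += p_tx[:, x]

print(mutual_information(p_tx), mutual_information(p_ty))  # I(X;Theta) >= I(Y;Theta)
```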

Appendix C.4. Upper and Lower Bounds for Mutual Information

The upper bounds for mutual information can be found by using the case of complete correlation, y = f ( x ) ,
I [ ρ , X ; Y ] = ∫ d x d y p ( x ) δ ( y − f ( x ) ) log [ p ( x | y ) / p ( x ) ] = − ∫ d x p ( x ) log [ p ( x ) / p ( x | f ( x ) ) ] = S [ p ( x ) , p ( x | f ( x ) ) ] ,
which is the relative entropy of p ( x ) with respect to p ( x | f ( x ) ) . If f ( x ) is a coordinate transformation, i.e., is a bijection, then Equation (A46) becomes,
I [ ρ , X ; Y ] = ∫ d x p ( x ) log [ δ ( x − f^{ − 1 } ∘ f ( x ) ) / p ( x ) ] = ∞ ,
since δ ( x − f^{ − 1 } ∘ f ( x ) ) = δ ( 0 ) = ∞ . Hence, the MI is unbounded from above in the continuous case. In the discrete case [42] we find,
I [ ρ , X ; Y ] = ∑_{x i , y j} P ( x i ) δ y x log [ P ( x i | y j ) / P ( x i ) ] = − ∑_{x i} P ( x i ) log P ( x i ) = H [ ρ , X ] ,
where H [ ρ , X ] is the Shannon entropy and the notation x ∈ X refers to a sample drawn from the ambient space X . Since (A48) does not depend on the functional form of f , the upper bound is simply the Shannon entropy of one of the two variables. This can be seen by expanding the discrete MI as a sum of Shannon entropies,
I [ ρ , X ; Y ] = H [ ρ , X ] + H [ ρ , Y ] − H [ ρ , X , Y ] ,
which in the case of complete correlation ( y = f ( x ) ), the joint Shannon entropy becomes,
H [ ρ , X ; Y ] = − ∑_{x i , y j} P ( x i ) δ y x log ( P ( x i ) δ y x ) = − ∑_{x i} P ( x i ) log P ( x i ) = H [ ρ , X ] ,
and so the upper bound is,
I max [ ρ , X ; Y ] = max { H [ ρ , X ] , H [ ρ , Y ] } .
If y = f ( x ) is a bijection, then the entropies H [ ρ , X ] = H [ ρ , Y ] since Y is just a reparameterization of X and hence the probabilities P ( x ) = P ( y ) .
Since the mutual information is unbounded from above, then so too is the total correlation due to the additivity in the chain rule (A28).

References

  1. Caticha, A. Towards an Informational Pragmatic Realism. arXiv 2014, arXiv:1412.5644. [Google Scholar] [CrossRef]
  2. Cox, R.T. The Algebra of Probable Inference; The Johns Hopkins University Press: Baltimore, MD, USA, 1961. [Google Scholar]
  3. Pearson, K. Notes on regression and inheritance in the case of two parents. Proc. R. Soc. Lond. 1895, 58, 240–242. [Google Scholar]
  4. Székely, G.J.; Rizzo, M.L.; Bakirov, N.K. Measuring and testing independence by correlation of distances. Ann. Stat. 2007, 35, 2769–2794. [Google Scholar] [CrossRef]
  5. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley-Interscience: Hoboken, NJ, USA, 2006. [Google Scholar]
  6. Watanabe, S. Information theoretical analysis of multivariate correlation. IBM J. Res. Dev. 1960, 4, 66–82. [Google Scholar] [CrossRef]
  7. Reshef, D.N.; Reshef, Y.A.; Finucane, H.K.; Grossman, S.R.; McVean, G.; Turnbaugh, P.J.; Lander, E.S.; Mitzenmacher, M.; Sabeti, P.C. Detecting Novel Associations in Large Datasets. Science 2011, 334, 1518–1524. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  8. Shore, J.E.; Johnson, R.W. Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy. IEEE Trans. Inf. Theory 1980, 26, 26–37. [Google Scholar] [CrossRef] [Green Version]
  9. Skilling, J. The Axioms of Maximum Entropy. In Maximum-Entropy and Bayesian Methods in Science and Engineering; Erickson, G.J., Smith, C.R., Eds.; Kluwer: Dordrecht, The Netherlands, 1988. [Google Scholar]
  10. Caticha, A. Entropic Inference and the Foundations of Physics; (Monograph Commissioned by the 11th Brazilian Meeting on Bayesian Statistics–EBEB-2012). Available online: http://www.albany.edu/physics/ACaticha-EIFP-book.pdf (accessed on 16 March 2020).
  11. Caticha, A. Entropic Inference. AIP Conf. Proc. 2011, 1350, 20–29. [Google Scholar]
  12. Vanslette, K. Entropic Updating of Probabilities and Density Matrices. Entropy 2017, 19, 664. [Google Scholar] [CrossRef] [Green Version]
  13. Vanslette, K. The Inferential Design of Entropy and its Application to Quantum Measurements. arXiv 2018, arXiv:1804.09142. [Google Scholar]
  14. Shannon, C.E.; Weaver, W. The Mathematical Theory of Communication; University of Illinois Press: Champaign, IL, USA, 1999. [Google Scholar]
  15. Baez, J.C.; Fritz, T.A. Bayesian characterization of relative entropy. Theory Appl. Categ. 2014, 29, 422–456. [Google Scholar]
  16. Baudot, P.; Bennequin, D. The homological nature of entropy. Entropy 2015, 17, 1–66. [Google Scholar] [CrossRef]
  17. Baudot, P.; Tapia, M.; Bennequin, D.; Goaillard, J. Topological Information Data Analysis. Entropy 2019, 21, 869. [Google Scholar] [CrossRef] [Green Version]
  18. Baudot, P. The Poincaré-Shannon Machine: Statistical Physics and Machine Learning Aspects of Information Cohomology. Entropy 2019, 21, 881. [Google Scholar] [CrossRef] [Green Version]
  19. Cox, R. Probability, Frequency and Reasonable Expectation. Am. J. Phys. 1946, 14, 1–13. [Google Scholar] [CrossRef]
  20. Jaynes, E.T. Probability Theory: The Logic of Science; Bretthorst, L., Ed.; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  21. Jaynes, E.T. Information Theory and Statistical Mechanics. Phys. Rev. 1957, 106, 620. [Google Scholar] [CrossRef]
  22. Jaynes, E.T. Prior Probabilities. IEEE Trans. Syst. Sci. Cybern. 1968, SSC-4, 227. [Google Scholar] [CrossRef]
  23. Stigler, S. Studies in the History of Probability and Statistics. XXXII: Laplace, Fisher, and the discovery of the concept of sufficiency. Biometrika 1973, 60, 439–445. [Google Scholar] [CrossRef]
  24. Neyman, J.; Pearson, E. IX. On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. Lond. A 1933, 231, 289–337. [Google Scholar]
  25. Amari, S. Differential Geometrical Theory of Statistics. In Differential Geometry in Statistical Inference; Lecture Note Monograph Series; Institute of Mathematical Statistics: California, CA, USA, 1987; Volume 10. [Google Scholar]
  26. Cencov, N.N. Statistical Decision Rules and Optimal Inference; American Mathematical Soc.: Washington, DC, USA, 2000. [Google Scholar]
  27. Caticha, A. The Entropic Dynamics Approach to Quantum Mechanics. Entropy 2019, 21, 943. [Google Scholar] [CrossRef] [Green Version]
  28. Caticha, A.; Giffin, A. Updating Probabilities. AIP Conf. Proc. 2006, 872, 31–42. [Google Scholar]
  29. Giffin, A.; Caticha, A. Updating Probabilities with Data and Moments. AIP Conf. Proc. 2007, 954, 74–84. [Google Scholar]
  30. Ay, N.; Jost, J.L.H.; Schwachhofer, L. Information geometry and sufficient statistics. Probab. Theory Relat. Fields 2015, 162, 327–364. [Google Scholar] [CrossRef] [Green Version]
  31. Bauer, M.; Bruveris, M.; Michor, P.W. Uniqueness of the Fisher-Rao metric on the space of smooth densities. Bull. Lond. Math. Soc. 2016, 48, 499–506. [Google Scholar] [CrossRef] [Green Version]
  32. Lê, H. The uniqueness of the Fisher metric as information metric. Ann. Inst. Stat. Math. 2017, 69, 879–896. [Google Scholar] [CrossRef] [Green Version]
  33. Dirac, P. The Principles of Quantum Mechanics; Oxford at the Clarendon Press: Oxford, UK, 1930. [Google Scholar]
  34. Carrara, N.; Ernst, J.A. On the Upper Limit of Separability. arXiv 2017, arXiv:1708.09449. [Google Scholar]
  35. Ver Steeg, G. Unsupervised Learning via Total Correlation Explanation. arXiv 2017, arXiv:1706.08984. [Google Scholar]
  36. Ver Steeg, G.; Galstyan, A. The Information Sieve. arXiv 2015, arXiv:1507.02284. [Google Scholar]
  37. Gao, S.; Brekelmans, R.; Ver Steeg, G.; Galstyan, A. Auto-Encoding Total Correlation Explanation. arXiv 2018, arXiv:1802.05822. [Google Scholar]
  38. Ver Steeg, G.; Galstyan, A. Discovering Structure in High-Dimensional Data Through Correlation Explanation. In Advances in Neural Information Processing Systems 27; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Dutchess County, NY, USA, 2014; pp. 577–585. [Google Scholar]
  39. Csiszár, I. Axiomatic Characterizations of Information Measures. Entropy 2008, 10, 261–273. [Google Scholar] [CrossRef] [Green Version]
  40. Aczél, J.; Daróczy, Z. On Measures of Information and Their Characterizations; Mathematics in Science and Engineering; Academic Press: Cambridge, MA, USA, 1975; Volume 115. [Google Scholar]
  41. Graham, R.L.; Knuth, D.E.; Patashnik, O. Concrete Mathematics; Addison–Wesley: Reading, MA, USA, 1988. [Google Scholar]
  42. Merkh, T.; Montufar, G. Factorized Mutual Information Maximization. arXiv 2019, arXiv:1906.05460. [Google Scholar]
  43. Kullback, S. Information Theory and Statistics; John Wiley and Sons: Hoboken, NJ, USA, 1959. [Google Scholar]
  44. Fisher, R. On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. A 1922, 222, 309–368. [Google Scholar]
  45. Verdú, S. α-mutual information. In Proceedings of the 2015 Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 1–6 February 2015; pp. 1–6. [Google Scholar]
  46. McGill, W. Multivariate information transmission. Psychometrika 1954, 19, 97–116. [Google Scholar] [CrossRef]
  47. Hu, K. On the Amount of Information. Theory Probab. Appl. 1962, 7, 439–447. [Google Scholar]
  48. Massey, J. Causality, Feedback and Directed Information. In Proceedings of the 1990 International Symposium on Information Theory and Its Applications, Waikiki, HI, USA, 27–30 November 1990. [Google Scholar]
  49. Schreiber, T. Measuring information transfer. Phys. Rev. Lett. 2000, 85, 461–464. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  50. Sun, J.; Bollt, E.M. Causation entropy identifies indirect influences, dominance of neighbors and anticipatory couplings. Phys. D Nonlinear Phenom. 2014, 267, 49–57. [Google Scholar] [CrossRef] [Green Version]
  51. Sun, J.; Cafaro, C.; Bollt, E.M. Identifying the Coupling Structure in Complex Systems through the Optimal Causation Entropy Principle. Entropy 2014, 16, 3416–3433. [Google Scholar] [CrossRef] [Green Version]
  52. Cafaro, C.; Lord, W.M.; Sun, J.; Bollt, E.M. Causation entropy from symbolic representations of dynamical systems. CHAOS 2015, 25, 043106. [Google Scholar] [CrossRef] [Green Version]
  53. Renyi, A. On measures of entropy and information. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–3 July 1961; Volume 1, p. 547. [Google Scholar]
  54. Tsallis, C. Possible Generalization of Boltzmann-Gibbs Statistics. J. Stat. Phys. 1988, 52, 479. [Google Scholar] [CrossRef]
  55. Tsallis, C. The non-additive entropy Sq and its applications in physics and elsewhere; some remarks. Entropy 2011, 13, 1765–1804. [Google Scholar] [CrossRef]
  56. Studený, M.; Vejnarová, J. The Multiinformation Function as a Tool for Measuring Stochastic Dependence. In Learning in Graphical Models; Jordan, M.I., Ed.; Springer: Dordrecht, The Netherlands, 1998; pp. 261–297. [Google Scholar] [CrossRef]
  57. Bell, A. The Co-Information Lattice. In Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003), Nara, Japan, 1–4 April 2003. [Google Scholar]
  58. Sun, J.; Taylor, D.; Bollt, E.M. Causal Network Inference by Optimal Causation Entropy. arXiv 2014, arXiv:1401.7574. [Google Scholar] [CrossRef] [Green Version]
  59. Tishby, N.; Pereira, F.C.; Bialek, W. The Information Bottleneck Method. arXiv 2000, arXiv:physics/0004057. [Google Scholar]
