Article

Sufficient Dimension Reduction: An Information-Theoretic Viewpoint

Debashis Ghosh
Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO 80045, USA
Entropy 2022, 24(2), 167; https://doi.org/10.3390/e24020167
Submission received: 3 December 2021 / Revised: 27 December 2021 / Accepted: 18 January 2022 / Published: 22 January 2022
(This article belongs to the Special Issue Applications of Information Theory in Statistics)

Abstract

There has been considerable interest in sufficient dimension reduction (SDR) methodologies, as well as their nonlinear extensions, in the statistics literature. The SDR methodology has previously been motivated by several considerations: (a) finding data-driven subspaces that capture the essential facets of regression relationships; (b) analyzing data in a ‘model-free’ manner. In this article, we develop an approach to interpreting SDR techniques using information theory. Such a framework leads to a more assumption-lean understanding of what SDR methods do and also allows for some connections to results in the information theory literature.

1. Introduction

In statistical modeling, a key challenge is to determine appropriate transformations of the data that reduce its dimension while capturing the essential information in the regression relationship between a set of covariates and a response variable. To this end, a field of statistics termed sufficient dimension reduction (SDR) has sought to develop methodology with this goal in mind. Broadly speaking, sufficient dimension reduction represents a class of ‘model-free’ methodologies that seek to find directions in the data that capture the essential information in this regression relationship. An excellent recent monograph on the topic can be found in [1].
Historically, the basis for sufficient dimension reduction methods was the observation by authors such as Brillinger [2] and Li and Duan [3] that regression parameters estimated by ordinary least squares were consistent, up to a constant, for their population counterparts in a generalized single-index model. This result required the assumption that the covariates are elliptically symmetric, which is reframed in the current sufficient dimension reduction literature as the so-called linearity assumption. More recent formulations of sufficient dimension reduction have postulated the existence of a central subspace; the goal of sufficient dimension reduction methods is then to estimate the basis vectors of the central subspace. There now exists a wide variety of techniques available for estimation in sufficient dimension reduction; we provide a review of such methods in Section 2.1.
In this article, we propose a new interpretation for sufficient dimension reduction based on conditional independence assumptions. Using graphical models, we are able to connect sufficient dimension reduction methods to information bottleneck theory [4]. The information bottleneck methodology was pioneered by the late Naftali Tishby and seeks to develop a ‘short code for X that preserves the maximum information about Y’ [4]. Information bottleneck typically formulates an optimization problem that seeks to find a compressed representation that minimizes information loss while imposing a penalty related to the expected distortion of the compression. The compression is the ‘bottleneck’ in the term ‘information bottleneck’. The optimization is solved using calculus of variations and leads to a set of self-consistent equations for finding optimal codes that are related to proposals by Blahut [5] and Arimoto [6]. The information bottleneck approach has been applied to a variety of problems in machine learning, such as document clustering [7,8], multivariate density estimation [9] and deep learning [10,11].
The interpretation developed in this paper allows us to demonstrate the following:
(a).
We can view sufficient dimension reduction as a means of preserving information that is relevant to a response variable. It can be interpreted as performing the information bottleneck in two directions.
(b).
Conversely, we will see that the information bottleneck is performing sufficient dimension reduction in a certain sense.
(c).
By moving to mutual information, we can relax some of the distributional assumptions needed for sufficient dimension reduction in a manner different from that in [12,13,14,15,16]. This direction is a departure from the viewpoint that SDR serves as a means to estimate a target parameter, typically the span of basis vectors of the central subspace.
(d).
In the case of Gaussian variables, we can develop a method for identifying ‘phase transitions’ in the structural dimension of central subspaces by expanding the work of [17] to handle sufficient dimension reduction procedures.
While many of the information-theoretic results are well-known to the information theory community, their embedding and merging with the literature on sufficient dimension reduction will be novel to statisticians.
The statistical work closest in nature to ours is that of Wang et al. [18]. They leverage the Hellinger integral of order two [19], which is related to the Kullback–Leibler divergence, an important quantity in information theory. Wang et al. [18] define subspace-based information measures using the Hellinger integral and demonstrate that a central subspace preserves information on this scale. For its estimation, they use a nonparametric regression approach that bears a resemblance to the minimum average variance estimation approach of Xia et al. [12]. The idea of using the Kullback–Leibler divergence for the optimization and estimation of the central subspace and other measures of association was pursued by Yin and co-authors in a series of papers [20,21,22,23]. We also note the work of Cook and Ni [24], who use a minimum discrepancy approach for finding the central subspace, and the work of Yao et al. [25], who developed a sufficient reduction procedure using the Fisher information metric, which can also be shown to be connected to the Kullback–Leibler divergence. Finally, a device we use in the paper is graphical models, and we note recent work by [26].
The outline of this paper is as follows. We review the literature on sufficient dimension reduction, as well as pointing out some limitations, in Section 2. Section 3 seeks to develop connections between sufficient dimension reduction and information theory using graphical models. We focus on the Gaussian information bottleneck [17] and its relationship to SDR in Section 4. We illustrate the methodology with application to a dataset in Section 5. Section 6 concludes with some discussion.

2. Background and Preliminaries

2.1. Data Structures and Review of Dimension Reduction Methods

Much of the material presented here is expounded upon in the monograph by Li [1]. Let the data be represented as (Y_i, Z_i), i = 1, …, n, a random sample from the joint distribution of (Y, Z), where Y denotes the response of interest and Z is a p-dimensional vector of covariates. Suppose we formulate the following regression model for Y given Z:
E(Y | Z) = g(β_1^⊤ Z, β_2^⊤ Z, …, β_k^⊤ Z, u),  (1)
where β_j (j = 1, …, k) are p-dimensional vectors of unknown regression coefficients, u is an error term, and g is an unspecified monotonic link function. Because of the presence of the parametric components involving β_j, as well as the nonparametric specification of the link function, model (1) is semiparametric. Note that when k = 1, model (1) reduces to a single-index model [27]. In addition, model (1) can accommodate heteroskedasticity in the error term if the error variance depends on the β_j^⊤ Z.
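As a concrete illustration, the following sketch simulates data from a single-index instance of model (1) with k = 1; the link function, coefficient vector and error scale are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 6

# Hypothetical coefficient vector and monotone link function (model (1) with k = 1).
beta = np.array([1.0, -0.5, 0.0, 0.0, 2.0, 0.0])
g = np.tanh

Z = rng.normal(size=(n, p))      # covariates
u = 0.1 * rng.normal(size=n)     # error term
Y = g(Z @ beta) + u              # Y depends on Z only through beta' Z
```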
The starting point of dimension reduction methods is the conditional independence of Y and Z given E(Y | Z). We define two random variables, A and B, to be conditionally independent given C if
P(A | B, C) = P(A | C).
We will use the notation A ⊥ B | C to represent conditional independence. An implication of model (1) being true is that there exists a p × k matrix B such that
Y ⊥ Z | B^⊤ Z.  (2)
Another way of stating (2) is that the projection B^⊤ Z provides a sufficient data reduction and contains the essential information about the relationship between Z and Y. More generally, we can define a projection operator P_B to be the symmetric and idempotent operator onto the subspace spanned by the columns of B. Then, (2) can be re-expressed as
Y ⊥ Z | P_B Z.  (3)
If (3) holds for B, then it also holds for any matrix B̃ whose columns span the same subspace as those of B. Let S(B) be the subspace generated by the columns of B, and let S_{Y|Z} denote the intersection of all subspaces S(B) for which (3) holds; if S_{Y|Z} itself satisfies (3), then we will refer to S_{Y|Z} as the central subspace [28]. We will assume throughout that the central subspace exists [28,29,30]. In the classical presentation of sufficient dimension reduction methodology, the parameter of interest has been defined to be the span of S_{Y|Z}. In other words, if v_1, …, v_K denote the basis vectors for S_{Y|Z}, then
S_{Y|Z} ≡ span(v_1, …, v_K)
is the target of sufficient dimension reduction procedures. Thus, there is an estimand that is often targeted by sufficient dimension reduction procedures.
We assume, without loss of generality, that Z has mean vector zero and covariance matrix equal to the identity matrix. One key assumption necessary for the implementation of one class of sufficient dimension reduction procedures is that the distribution of Z, conditional on P_B Z, satisfies a conditional linearity in the mean, i.e.,
E(Z | P_B Z) = P_B Z.  (4)
Assumption (4) pertains to the marginal distribution of Z and states that the conditional mean of Z, given its projection onto the subspace spanned by B, is that projection itself. One class of distributions that satisfies the linearity condition is the family of elliptically symmetric distributions, which includes the multivariate normal distribution and scale mixtures of multivariate normal distributions.
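As a quick numerical check of assumption (4), the sketch below draws multivariate normal predictors (so that the linearity condition holds exactly) and verifies that the least-squares regression of Z on P_B Z has a coefficient matrix close to P_B; the dimensions and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 100_000, 5, 2

Z = rng.normal(size=(n, p))               # standardized Gaussian predictors (Sigma_Z = I)
B = rng.normal(size=(p, k))
P_B = B @ np.linalg.solve(B.T @ B, B.T)   # projection onto the span of B
PZ = Z @ P_B                              # P_B Z for each observation (P_B is symmetric)

# Least-squares regression of Z on P_B Z; with Sigma_Z = I the minimum-norm
# population coefficient matrix equals P_B itself.
coef, *_ = np.linalg.lstsq(PZ, Z, rcond=None)
print(np.abs(coef - P_B).max())           # small, up to Monte Carlo error
```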
As mentioned in the Introduction, there are many algorithms available for estimating the basis vectors of the central subspace. We describe the implementation of sliced inverse regression proposed by Li [28].
(a).
‘Slice’ the response variable Y into J slices, denoted as Y_1, …, Y_J;
(b).
Standardize the predictor observations as
Z̃_i = Σ̂^{-1/2} (Z_i - μ̂), (i = 1, …, n),
where μ̂ and Σ̂ are the sample mean vector and sample covariance matrix of Z_1, …, Z_n;
(c).
Calculate sample mean estimates within slices: Z̄_j = n_j^{-1} ∑_{i=1}^{n} I(Y_i ∈ Y_j) Z̃_i, where n_j = ∑_{i=1}^{n} I(Y_i ∈ Y_j), j = 1, …, J;
(d).
Estimate the population covariance matrix of Z given Y by
Θ̂ = ∑_{j=1}^{J} (n_j / n) Z̄_j Z̄_j^⊤;
(e).
Compute the eigenvectors of Θ̂ corresponding to its largest eigenvalues; these are the estimates of the basis vectors for the central subspace.
This algorithm is termed ‘inverse regression’ because, effectively, information on the ‘backwards regression’ E(Z | Y) is being estimated here rather than the ‘forward regression’ E(Y | Z). Li [28] argues that this approach circumvents the usual issue of the curse of dimensionality. Other advantages of the sliced inverse regression algorithm are that it avoids multivariate nonparametric smoothing and is quite easy to fit.
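A minimal sketch of steps (a)–(e) in Python is given below; the number of slices and the number of directions retained are illustrative choices, and the code is a direct transcription of the algorithm rather than a production implementation.

```python
import numpy as np

def sir_directions(Z, Y, n_slices=10, n_dirs=2):
    """Sliced inverse regression, following steps (a)-(e)."""
    n, p = Z.shape
    # (b) standardize the predictors
    mu = Z.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z_tilde = (Z - mu) @ Sigma_inv_sqrt
    # (a) slice the response into groups of roughly equal size
    slices = np.array_split(np.argsort(Y), n_slices)
    # (c)-(d) slice means and the weighted between-slice covariance matrix
    Theta = np.zeros((p, p))
    for idx in slices:
        zbar = Z_tilde[idx].mean(axis=0)
        Theta += (len(idx) / n) * np.outer(zbar, zbar)
    # (e) leading eigenvectors of Theta, mapped back to the original scale
    w, v = np.linalg.eigh(Theta)
    top = v[:, np.argsort(w)[::-1][:n_dirs]]
    return Sigma_inv_sqrt @ top

# Example with data simulated from the single-index model sketched earlier
rng = np.random.default_rng(2)
Z = rng.normal(size=(1000, 6))
beta = np.array([1.0, -0.5, 0.0, 0.0, 2.0, 0.0])
Y = np.tanh(Z @ beta) + 0.1 * rng.normal(size=1000)
print(sir_directions(Z, Y, n_dirs=1).ravel())   # proportional to beta up to sign and scale
```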
The validity of the sliced inverse regression algorithm for estimating the central subspace relies on the linearity assumption. There has been much work on developing alternative estimation procedures that seek to relax the linearity assumption. For example, Xia et al. [12] propose the minimum average variance estimation procedure, which relies on a combination of nonparametric smoothing with weighted least squares. Since it involves nonparametric regression, its convergence depends on an appropriate rate of convergence for the bandwidth in conjunction with the sample size converging to infinity. Cook and Ni [24] proposed a minimum discrepancy method in which sufficient dimension reduction is characterized using an objective function approach. This leads to an alternating least squares algorithm for the estimation of the central subspace.
Many of the sufficient dimension-reduction methods can be viewed as solving the following eigenvalue/eigenvector problem:
A b_j = λ_j Σ_Z b_j,  (5)
for j = 1, …, k, where Σ_Z denotes the covariance matrix of Z and (λ_j, b_j) denote the eigenvalue/eigenvector pairs. We say that the matrix pair (A, Σ_Z) is a generalized eigenvalue solution (GES) if it satisfies (5). This is discussed at length in the monograph by Li [1]. Note that typically, the solutions to (5) are returned in the order
λ_1 ≥ λ_2 ≥ … ≥ λ_k,
with b_j^⊤ Σ_Z b_j = 1 for j = 1, …, k. The choice of A in (5) depends on the particular sufficient dimension reduction algorithm that is used. For example, in sliced inverse regression, A would represent the covariance matrix of the slice means. For principal Hessian directions [31], A in (5) is taken to be a response-weighted covariance matrix of Z.
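Computationally, (5) is a generalized symmetric eigenproblem that standard linear algebra routines solve directly. The sketch below uses an arbitrary positive semidefinite placeholder for the kernel A, since the choice of A is method-specific.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
p = 5
Sigma_Z = np.cov(rng.normal(size=(500, p)), rowvar=False)

M = rng.normal(size=(p, p))
A = M @ M.T                     # placeholder kernel; method-specific in practice

# Solve A b = lambda Sigma_Z b.  eigh returns eigenvalues in ascending order and
# eigenvectors normalized so that b_j' Sigma_Z b_j = 1.
lam, B = eigh(A, Sigma_Z)
lam, B = lam[::-1], B[:, ::-1]  # reorder to the descending convention used in (5)
```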
The matrix formulation in (5) allows for immediate generalizations to nonlinear versions of sufficient dimension reduction. This can be done by replacing A in (5) with a so-called ‘kernelized’ matrix computed using inner products of covariates mapped to higher-dimensional spaces. Such methods are related to the procedures in Wu et al. [32], Fukumizu et al. [14] and Lee et al. [16].

2.2. Limitations of Sufficient Dimension Reduction

As mentioned above, one of the key assumptions in applying sufficient dimension reduction methodology is termed the linearity condition. A sufficient condition for this to hold is that the predictor variables of interest follow an elliptically contoured distribution; distributions that satisfy elliptical symmetry include the multivariate normal distribution and the multivariate t-distribution. One of the main criticisms leveled against sufficient dimension reduction methods is that this assumption will not be satisfied in practice. For example, if the covariates are discrete, the linearity condition will be violated. Many authors invoke the theoretical results of Hall and Li [33], which suggest that, in an asymptotic framework, the linearity condition will hold approximately. An alternative approach has been to develop generalizations of the sufficient dimension reduction methodology that relax the linearity condition. Such approaches can be found in proposals such as Chiaromonte et al. [34], Fukumizu et al. [13,14], Li et al. [15] and Lee et al. [16].
The other issue with sufficient dimension reduction methods involves the identification of the basis of S_{Y|Z}, whose elements are referred to as the directions of the central subspace. These vectors are not estimable in the situation where components of Z are discrete. Such variables arise routinely in biomedical, sociological and demographic studies (e.g., race/gender), and this limitation makes the use of sufficient dimension reduction methods challenging. In an important work, Chiaromonte et al. [34] developed an approach to sufficient dimension reduction with categorical predictors. The idea is to perform the sliced averaging of the continuous covariates within each of the levels defined by the combination of the categorical variables. Then, the level-specific covariance matrices are pooled, and the directions are estimated using a spectral decomposition, similar to the description of sliced inverse regression in Section 2.1.
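A minimal sketch of this pooling strategy is given below: the continuous predictors are standardized, sliced within each level of a categorical variable W, and the level-specific slice-mean covariance matrices are pooled before the spectral decomposition. The level weights and slice counts are illustrative choices and not necessarily the exact estimator of [34].

```python
import numpy as np

def slice_cov(Z_std, Y, n_slices=5):
    """Weighted between-slice covariance of standardized predictors."""
    n, p = Z_std.shape
    Theta = np.zeros((p, p))
    for idx in np.array_split(np.argsort(Y), n_slices):
        zbar = Z_std[idx].mean(axis=0)
        Theta += (len(idx) / n) * np.outer(zbar, zbar)
    return Theta

def partial_sir(Z, Y, W, n_dirs=1, n_slices=5):
    """Slice within each level of the categorical variable W, then pool."""
    n, p = Z.shape
    evals, evecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z_std = (Z - Z.mean(axis=0)) @ Sigma_inv_sqrt
    pooled = np.zeros((p, p))
    for level in np.unique(W):
        m = (W == level)
        pooled += (m.sum() / n) * slice_cov(Z_std[m], Y[m], n_slices)
    w, v = np.linalg.eigh(pooled)
    return Sigma_inv_sqrt @ v[:, np.argsort(w)[::-1][:n_dirs]]
```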

3. Graphical Models, Connections and Information Theoretic Results

To link sufficient dimension reduction methods to the information bottleneck, we will now introduce some concepts from graph theory and graphical models [26,35]. A graph G = (V, E) consists of a set of vertices V and a collection of edges E. Here, V ≡ {v_1, …, v_m} denotes the collection of m vertices, and the edges in E are two-element subsets of V that connect pairs of vertices. To simplify the discussion, we will assume that there are no edges from a vertex to itself, i.e., no self-loops. Graphs whose edges may have more than two elements are referred to as hypergraphs [36] and will not be considered further in the paper. The vertices represent random variables, and the edges are used to specify dependencies between the random variables. There are two types of edges we will consider here between vertices v_1 and v_2. A directed edge is denoted by v_1 → v_2 and implies that v_1 affects v_2 and not vice versa. An undirected edge is denoted by v_1 − v_2 and is equivalent to having both v_1 → v_2 and v_2 → v_1; thus, for undirected edges, v_1 and v_2 simultaneously affect each other. We define the parents of a vertex v by
pa(v) = {u ∈ V : u → v is a directed edge in E}.
It is a well-known fact that for an acyclic directed graph [26,35], one can factorize the joint distribution of random variables defined on the graph G as
f(x_v : v ∈ V) = ∏_{v ∈ V} p(x_v | x_s, s ∈ pa(v)).
The final graphical model concept we will need is that of d-separation [35]. If G is a directed graph in which X, Y and Z are disjoint sets of vertices, then X and Y are d-separated by Z in G if and only if every path from a vertex in X to a vertex in Y is intercepted by a vertex in Z.
We can see that assumption (3) corresponds to the following graphs:
Z → P_B Z → Y,  Z ← P_B Z ← Y,  Z ← P_B Z → Y,  Z − P_B Z − Y  (7)
This follows from using the definition of undirected graphs and conditional independence. Similarly, the information bottleneck approach works with the graph
Z → T → Y  (8)
The comparison of (7) and (8) offers the following insights. First, the central subspace performs d-separation of Z and Y. Similarly, the role of T in the information bottleneck framework is to intercept paths between Z and Y. This leads us to the following result, which will be new to statisticians:
Proposition 1.
The central subspace can also serve as an information bottleneck.
Proof. 
The proof of the proposition follows by observing that the graph in (8) is a subgraph of the graphs in (7). □
Remark 1.
Returning to the work of Wang et al. [18], the graphical representation in (7) shows sufficient dimension reduction as integrating the forward and backward regressions: the graphs in (7) are precisely the forward and backward regressions that Wang and colleagues speak of. They can also be viewed as ‘forward’ and ‘backward’ information bottlenecks. Thus, we see that sufficient dimension reduction attempts to simultaneously perform a forward and reverse information bottleneck, while information bottlenecks themselves operate in the forward direction.
Based on the proposition, we observe that the central subspace in sufficient dimension reduction plays a role akin to that of the information bottleneck. Using the viewpoint of information theory, we can interpret the goal of sufficient dimension reduction as one of information compression. This allows the use of these methods even in situations where the central subspace is not estimable.
To make the idea concrete, we will be interpreting P_B Z as a random variable in the rest of the section. We will further assume that Z and Y are discrete random variables that are potentially multivariate. The entropy of Z is defined by
H(Z) = -∑_{z ∈ 𝒵} p(z) log p(z),  (9)
where 𝒵 is the range of Z and p(z) denotes the probability mass function. Similarly, we can define the mutual information of Z and Y as
I(Z; Y) = ∑_{z,y} p(z, y) log [ p(z, y) / {p(z) p(y)} ],
which extends upon (9) in a natural way. Mutual information measures the dependence between two random variables. It has the following properties: (a) it is symmetric in Z and Y; (b) it is nonnegative; (c) it is equal to zero if and only if Z and Y are independent.
A comprehensive overview of entropy and mutual information can be found in Cover and Thomas [37]. To keep the discussion self-contained, we now provide a summary of many basic properties of entropy and mutual information. Further details can be found in Chapter 2 of Cover and Thomas [37].
Property 1.
1. I(Z; Y) = H(Z) - H(Z | Y), where H(Z | Y) = -∑_{z,y} p(z, y) log p(z | y).
2. I(Z; Y) = H(Y) - H(Y | Z).
3. I(Z; Y) = I(Y; Z).
4. I(Z; Y) = H(Z) + H(Y) - H(Z, Y).
5. H(Z, Y) = H(Z) + H(Y | Z).
We note that the last property is typically referred to as the chain rule for entropy and can be extended to more than two random variables.
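These identities are easy to verify numerically for any discrete joint distribution; the sketch below uses an arbitrary 3 × 4 joint probability table, chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
p_zy = rng.random((3, 4))
p_zy /= p_zy.sum()                        # arbitrary joint pmf p(z, y)
p_z, p_y = p_zy.sum(axis=1), p_zy.sum(axis=0)

def H(p):
    """Entropy of a pmf supplied as an array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

I_zy = np.sum(p_zy * np.log(p_zy / np.outer(p_z, p_y)))   # mutual information

H_zy = H(p_zy.ravel())                    # joint entropy H(Z, Y)
H_z_given_y = H_zy - H(p_y)               # chain rule: H(Z, Y) = H(Y) + H(Z | Y)
print(np.isclose(I_zy, H(p_z) - H_z_given_y))             # Property 1.1
print(np.isclose(I_zy, H(p_z) + H(p_y) - H_zy))           # Property 1.4
```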
For the graphs considered in (7) and (8), we need to consider conditional versions of mutual information. This is given to us by Equation (2.61) of Cover and Thomas [37].
Definition 1.
The conditional mutual information of Z and Y given W is defined as
I(Z; Y | W) = ∑_{z,y,w} p(z, y, w) log [ p(z, y | w) / {p(z | w) p(y | w)} ].
Finally, we will need one more definition from Chapter 2.8 of Cover and Thomas [37].
Definition 2
(p. 34 of [37]). Z, W and Y form a Markov chain, denoted Z → W → Y, if the conditional distribution of Y given (W, Z) depends only on W.
We assume a reversible Markov chain, so that Z → W → Y and Y → W → Z are treated as equivalent. This reversibility allows us to conceptually drop the directionality in the DAGs, which brings them in line with the conditional independence assumptions outlined in Section 2.1. We have the following celebrated result from information theory, the data-processing inequality (p. 34 of [37]):
Theorem 1.
If Z → W → Y, then I(W; Z) ≥ I(Y; Z).
The data-processing inequality holds with equality if and only if I(Z; W | Y) = 0, i.e., if and only if Z and W are conditionally independent given Y. We can now take these results and apply them to the graphs for sufficient dimension reduction.
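Before doing so, we note that Theorem 1 is easy to check numerically; the sketch below builds a discrete Markov chain Z → W → Y from arbitrary transition matrices and compares the two mutual informations.

```python
import numpy as np

rng = np.random.default_rng(5)

def mutual_info(p_ab):
    pa, pb = p_ab.sum(axis=1), p_ab.sum(axis=0)
    mask = p_ab > 0
    return np.sum(p_ab[mask] * np.log(p_ab[mask] / np.outer(pa, pb)[mask]))

# Markov chain Z -> W -> Y: p(z, w, y) = p(z) p(w|z) p(y|w)
p_z = rng.dirichlet(np.ones(4))
p_w_given_z = rng.dirichlet(np.ones(3), size=4)      # rows indexed by z
p_y_given_w = rng.dirichlet(np.ones(5), size=3)      # rows indexed by w

p_zw = p_z[:, None] * p_w_given_z                    # joint of (Z, W)
p_zy = p_zw @ p_y_given_w                            # joint of (Z, Y)

# Data-processing inequality: I(W; Z) >= I(Y; Z)
print(mutual_info(p_zw), ">=", mutual_info(p_zy))
```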
Corollary 1.
Assumption (3) is equivalent to I(Y; P_B Z) = I(Y; Z). This corollary also relates to Theorem 1 of Wang et al. [18] and formalizes Proposition 1.
Remark 2.
Note that we have rephrased the central subspace as a random variable that attempts to minimize an information-based criterion. Thus, we get away from the traditional viewpoint where we view the goal of sufficient dimension reduction as targeting the span of the central subspace. Doing so provides another justification for the use of sufficient dimension reduction. This is in the spirit of ‘assumption-lean inference’ [38] in which the goal is to have available statistical methods that can be useful even when a true model or parameter does not exist.
Remark 3.
The mutual information is intimately related to the Kullback–Leibler divergence of two probability measures. A nice overview of how Kullback–Leibler divergences are related to information theoretic quantities can be found in [19]. We will explore the link between sufficient dimension reduction methods and Kullback–Leibler divergences in future work.

4. The Case of Gaussian Variables

In most problems involving the information bottleneck, one can use the Blahut–Arimoto algorithm [5,6], which is an iterative algorithm that involves repeated projection operations. In this section, we study a noniterative information bottleneck algorithm by Chechik et al. [17]. They deal with the situation of Z and Y having a joint Gaussian distribution and show that one can use an eigenvector/eigenvalue decomposition of certain matrices to achieve an information bottleneck. We then show how to relate this to several sufficient dimension reduction procedures.
Chechik et al. [17] considered the situation of ( Z , Y ) having a joint Gaussian or multivariate normal distribution. Without loss of generality, we will assume a mean of zero throughout the section. The goal of the Gaussian information bottleneck is to find a mapping from Z to T, such that the information content of Z is sufficiently compressed while at the same time maintaining its association with Y. Formally, the Gaussian information bottleneck involves the minimization of
L ≡ I(Z; T) - β I(T; Y)
over matrices A and Σ_e, where
T = AZ + e,  e ∼ MVN(0, Σ_e),  (11)
and MVN(0, Σ) denotes a multivariate normal distribution with mean vector zero and covariance matrix Σ. Note that in (11), we assume that e is independent of Z. Because of the linearity of T in Z in (11), T will have a multivariate Gaussian distribution with mean vector zero and covariance matrix A Σ_Z A^⊤ + Σ_e. Chechik et al. [17] prove the following theorem.
Theorem 2
(Theorem 3.1 of [17]). The optimal solution to the Gaussian information bottleneck problem (11) for a given β is given by Σ_e^{opt} = I and
A = [0^⊤; …; 0^⊤]                        if 0 ≤ β ≤ β_1^*,
    [γ_1 v_1^⊤; 0^⊤; …; 0^⊤]             if β_1^* < β ≤ β_2^*,
    [γ_1 v_1^⊤; γ_2 v_2^⊤; 0^⊤; …; 0^⊤]  if β_2^* < β ≤ β_3^*,
    and so on,
where γ_1, …, γ_n are functions of the eigenvalues α_1, …, α_n of Σ_{Z|Y} Σ_Z^{-1} (with v_1, …, v_n the corresponding left eigenvectors), defined as
γ_i = √[ {β(1 - α_i) - 1} / (α_i r_i) ],  r_i = v_i^⊤ Σ_Z v_i,
and the critical values are
β_i^* = (1 - α_i)^{-1}, i = 1, …, n.
Thus, the theorem demonstrates the tradeoff between compression and its associated cost. If β is smaller than β_1^*, then the cost is too high, and the optimal solution is the zero matrix. Otherwise, we see that we can start to identify subspaces, associated with the eigenvectors of Σ_{Z|Y} Σ_Z^{-1}, for larger values of β. We also see a transition in terms of the dimensions of the subspaces spanned by the v_i as β increases, with discrete jump points at the β_i^*. We refer to the result of Theorem 2 as the Gaussian information bottleneck theorem.
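The solution in Theorem 2 is straightforward to compute numerically. The sketch below returns the nonzero rows γ_i v_i^⊤ of the optimal A for a given β, assuming strictly positive eigenvalues α_i and omitting the padding rows of zeros; the example covariance matrix is arbitrary.

```python
import numpy as np
from scipy.linalg import eigh

def gaussian_ib_A(Sigma_Z, Sigma_ZY, Sigma_Y, beta):
    """Nonzero rows gamma_i * v_i' of the optimal A in Theorem 2."""
    Sigma_Z_given_Y = Sigma_Z - Sigma_ZY @ np.linalg.solve(Sigma_Y, Sigma_ZY.T)
    # Solve Sigma_{Z|Y} v = alpha Sigma_Z v; eigenvalues come back in ascending order.
    alpha, V = eigh(Sigma_Z_given_Y, Sigma_Z)
    rows = []
    for a, v in zip(alpha, V.T):
        if a >= 1.0 or beta <= 1.0 / (1.0 - a):    # beyond the critical value beta_i^*
            break
        r = v @ Sigma_Z @ v                        # equals 1 under eigh's normalization
        gamma = np.sqrt((beta * (1.0 - a) - 1.0) / (a * r))
        rows.append(gamma * v)
    return np.array(rows)

# Example: a random joint covariance for (Z, Y) with dim(Z) = 4, dim(Y) = 2
rng = np.random.default_rng(6)
M = rng.normal(size=(6, 6))
S = M @ M.T
print(gaussian_ib_A(S[:4, :4], S[:4, 4:], S[4:, 4:], beta=25.0))
```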
Note that the theorem also involves solving the eigenvalue/eigenvector decomposition via the equation
Σ_{Z|Y} v = λ Σ_Z v.
Comparing the structure of this equation to (5), we see that (Σ_{Z|Y}, Σ_Z) can be viewed as a GES. The difference between the typical GES solution and the result of Theorem 2 is the order in which the eigenvalues appear: for a GES they are taken in descending order, while in Theorem 2 they are taken in ascending order. Using the link of Theorem 2 to generalized eigenvalue solutions, we can extend the result of [17] to several sufficient dimension reduction methodologies, as the following result demonstrates.
Proposition 2.
The result of the Gaussian information bottleneck theorem holds
(a). 
for sliced inverse regression, with (Cov{E(Z | Y)}, Σ_Z) as a GES;
(b). 
for partial inverse regression, with (Σ_{ZF} Σ_{FF}^{-1} Σ_{FZ}, Σ_Z) as a GES, where, for a known transformation F(Y) of Y,
Σ_{FF} = Var{F(Y)},  Σ_{FZ} = Cov{F(Y), Z};
(c). 
for sliced average variance estimation [12], with (E[{Σ_Z - Var(Z | Y)}^2], Σ_Z) as a GES;
(d). 
for principal Hessian directions [31], with (Σ_{ZZY}, Σ_Z) as a GES, where
Σ_{ZZY} = E(Z Z^⊤ Y).
Proof. 
All of these results follow by defining the GES equivalences as found in Li [1]. □
The proposition affords us new insights into how to view the information compression/basis calculation for several existing sufficient dimension reduction procedures from the information bottleneck viewpoint.
We note that if we sort the eigenvectors in descending order of their eigenvalues, the problem of selecting the index at which to stop is precisely that of selecting the dimension of the central subspace. This is an important problem for which there have been several approaches in the literature. Ye and Weiss [39] proposed an approach to the selection using the nonparametric bootstrap. Recently, Luo and Li [40] proposed the ladle approach, which uses the bootstrap but combines information from both the eigenvalues and the eigenvectors of the central subspace to determine its dimension. Another recent innovation by Luo and Li [41] was to augment the predictor matrix with noise variables, which is in the spirit of the recent, popular ‘knockoffs’ approach in statistics [42]. One sees that the problem of order determination for the central subspace is dual to the Gaussian information bottleneck theorem: increasing the dimension of the central subspace works against the goal of compressing information.

5. Numerical Illustration

The example in this paper comes from a randomized trial of opioid-dependent participants. Opioid addiction involving both heroin and diverted prescription opioid use represents a major public health epidemic in the United States [43]. Currently, two treatments that are effective for opioid addiction are agonist therapy with either buprenorphine (BUP) or methadone (MET). The study by Saxon et al. [44] was designed to determine if there were differences between BUP and MET with respect to liver function in subjects being treated for opioid dependence. Subjects who met the study inclusion criteria were randomized to BUP or MET; there was a total of n = 832 subjects in the analysis. Here, we will focus on the change in weight from baseline to week 12 as the dependent variable. Predictor variables include weight at baseline, treatment, gender and ethnicity. Assuming that the central subspace is of dimension one and using sliced inverse regression [28], we estimate the basis to be (0.38, -0.75, 0.53, 0.03). Thus, we would estimate the first direction to be
0.38 Tx - 0.75 Gender + 0.53 Ethn + 0.03 BaseWt.  (12)
Note that in the classical framework of sufficient dimension reduction, the interpretation of this estimate is problematic. This is due to the fact that treatment, gender and ethnicity are binary variables, which means that, viewed as an estimand, the central subspace formally does not exist. Having said this, the framework in this paper would view (12) as the linear combination of the variables that achieves maximum information compression in the predictors while simultaneously minimizing information loss between the covariates and the outcome variable. Note that this interpretation does not require the existence of a central subspace.
Naik and Tsai [45] proposed the use of partial least squares (PLS) as a means of sufficient dimension reduction in the situation where the dimension of the central subspace equals one. For these data, we would estimate a combination analogous to (12) based on partial least squares by
0.0001 Tx + 0.0003 Gender - 0.0001 Ethn - 0.0268 BaseWt.  (13)
Comparing the magnitudes of (12) and (13), we see that SIR estimates larger relative weights, with the exception of weight at baseline. Again, we can interpret the PLS estimate as the linear combination of the variables that achieves maximum information compression in the covariates while simultaneously minimizing information loss regarding their association with the outcome variable. A GitHub repository illustrating these analyses can be found at http://github.com/GhoshLab/ITSDR/.
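For completeness, the sketch below indicates how the two leading directions could be computed; the column ordering (Tx, Gender, Ethn, BaseWt), the response definition, and the use of the first weight vector from scikit-learn's PLSRegression as the partial least squares direction are assumptions made for illustration only, and the analyses reported here are the ones in the GitHub repository above.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def first_sir_direction(Z, Y, n_slices=5):
    """First sliced inverse regression direction (see the sketch in Section 2.1)."""
    n, p = Z.shape
    evals, evecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    S_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z_std = (Z - Z.mean(axis=0)) @ S_inv_sqrt
    Theta = np.zeros((p, p))
    for idx in np.array_split(np.argsort(Y), n_slices):
        zbar = Z_std[idx].mean(axis=0)
        Theta += (len(idx) / n) * np.outer(zbar, zbar)
    w, v = np.linalg.eigh(Theta)
    return S_inv_sqrt @ v[:, np.argmax(w)]

def first_pls_direction(Z, Y):
    """First partial least squares weight vector (single component)."""
    return PLSRegression(n_components=1).fit(Z, Y).x_weights_[:, 0]

# Usage, assuming Z holds the columns (Tx, Gender, Ethn, BaseWt) and Y is the
# 12-week change in weight:
#   b_sir = first_sir_direction(Z, Y)
#   b_pls = first_pls_direction(Z, Y)
```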

6. Discussion

In this article, we have attempted to reinterpret the sufficient dimension reduction methodology in the statistical literature using connections to information theory. This link, and in particular to that of the theory of information bottleneck [4], allows for some new insights and interpretations to occur:
  • We can avoid framing the goal of SDR as estimating a parameter, namely the basis of the central subspace, and view it instead as a means for information compression while simultaneously preserving association with an outcome variable. This information-theoretic view can allow one to relax distributional assumptions in a way that is different from the σ-field approach described in [16].
  • By recognizing that the Gaussian information bottleneck theorem (Theorem 3.1 of [17]) is identical to solving a generalized eigenvalue problem, we can extend the results of [17] to a variety of sufficient dimension reduction methods. There, we see that the goals of information compression and central subspace dimension estimation are dual to each other.
Our hope is that this initial exploration of information theory with sufficient dimension reduction will allow for the adaptation and extension of information-theoretic concepts into the SDR literature. We envision connections to, and the development of, methodologies for SDR in time series [46] and online [47,48] settings. This is currently under investigation.

Funding

This research was funded by National Science Foundation, Grant No. DMS 1914739 and National Cancer Institute, Grant No. R01 CA129102.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Acknowledgments

The author would like to acknowledge support from NIH R01 CA129102 and NSF DMS 1914739. He would like to thank Thao Vu, Elin Shaddox and Debmalya Nandy, who provided useful feedback on the paper. In addition, he would like to acknowledge the two referees, whose feedback improved the manuscript.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Li, B. Sufficient Dimension Reduction: Methods and Applications with R; CRC Press: Boca Raton, FL, USA, 2018. [Google Scholar]
  2. Brillinger, D.R. A generalized linear model with “Gaussian” regressor variables. In Selected Works of David Brillinger; Springer: Berlin/Heidelberg, Germany, 2012; pp. 589–606. [Google Scholar]
  3. Li, K.C.; Duan, N. Regression analysis under link violation. Ann. Stat. 1989, 17, 1009–1052. [Google Scholar] [CrossRef]
  4. Tishby, N.; Pereira, F.; Bialek, W.; Hajek, B.; Sreenivas, R. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, Florham Park, NJ, USA, 30 September 1999; Available online: https://www.bibsonomy.org/bibtex/15bd5efbf394791da00b09839b9a5757 (accessed on 11 December 2021).
  5. Blahut, R. Computation of channel capacity and rate-distortion functions. IEEE Trans. Inf. Theory 1972, 18, 460–473. [Google Scholar] [CrossRef] [Green Version]
  6. Arimoto, S. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans. Inf. Theory 1972, 18, 14–20. [Google Scholar] [CrossRef] [Green Version]
  7. Slonim, N.; Tishby, N. Agglomerative Information Bottleneck; ACM: New York, NY, USA, 1999; Volume 4. [Google Scholar]
  8. Slonim, N.; Tishby, N. Document clustering using word clusters via the information bottleneck method. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, 24–28 July 2000; pp. 208–215. [Google Scholar]
  9. Slonim, N.; Friedman, N.; Tishby, N. Multivariate information bottleneck. Neural Comput. 2006, 18, 1739–1789. [Google Scholar] [CrossRef]
  10. Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015; pp. 1–5. [Google Scholar]
  11. Saxe, A.M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B.D.; Cox, D.D. On the information bottleneck theory of deep learning. J. Stat. Mech. Theory Exp. 2019, 2019, 124020. [Google Scholar] [CrossRef]
  12. Xia, Y.; Tong, H.; Li, W.K.; Zhu, L.X. An adaptive estimation of dimension reduction space. J. R. Stat. Soc. Ser. B 2002, 64, 299–346. [Google Scholar] [CrossRef]
  13. Fukumizu, K.; Bach, F.R.; Jordan, M.I. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res. 2004, 5, 73–99. [Google Scholar]
  14. Fukumizu, K.; Bach, F.R.; Jordan, M.I. Kernel dimension reduction in regression. Ann. Stat. 2009, 37, 1871–1905. [Google Scholar] [CrossRef] [Green Version]
  15. Li, B.; Artemiou, A.; Li, L. Principal support vector machines for linear and nonlinear sufficient dimension reduction. Ann. Stat. 2011, 39, 3182–3210. [Google Scholar] [CrossRef] [Green Version]
  16. Lee, K.Y.; Li, B.; Chiaromonte, F. A general theory for nonlinear sufficient dimension reduction: Formulation and estimation. Ann. Stat. 2013, 41, 221–249. [Google Scholar] [CrossRef]
  17. Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information Bottleneck for Gaussian Variables. J. Mach. Learn. Res. 2005, 6, 165–188. [Google Scholar]
  18. Wang, Q.; Yin, X.; Critchley, F. Dimension reduction based on the Hellinger integral. Biometrika 2015, 102, 95–106. [Google Scholar] [CrossRef]
  19. Liese, F.; Vajda, I. On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory 2006, 52, 4394–4412. [Google Scholar] [CrossRef]
  20. Yin, X. Canonical correlation analysis based on information theory. J. Multivar. Anal. 2004, 91, 161–176. [Google Scholar] [CrossRef] [Green Version]
  21. Iaci, R.; Yin, X.; Sriram, T.; Klingenberg, C.P. An informational measure of association and dimension reduction for multiple sets and groups with applications in morphometric analysis. J. Am. Stat. Assoc. 2008, 103, 1166–1176. [Google Scholar] [CrossRef]
  22. Yin, X.; Sriram, T. Common canonical variates for independent groups using information theory. Stat. Sin. 2008, 18, 335–353. [Google Scholar]
  23. Xue, Y.; Wang, Q.; Yin, X. A unified approach to sufficient dimension reduction. J. Stat. Plan. Inference 2018, 197, 168–179. [Google Scholar] [CrossRef]
  24. Cook, R.D.; Ni, L. Sufficient dimension reduction via inverse regression: A minimum discrepancy approach. J. Am. Stat. Assoc. 2005, 100, 410–428. [Google Scholar] [CrossRef]
  25. Yao, W.; Nandy, D.; Lindsay, B.G.; Chiaromonte, F. Covariate information matrix for sufficient dimension reduction. J. Am. Stat. Assoc. 2019, 114, 1752–1764. [Google Scholar] [CrossRef]
  26. Lauritzen, S.L. Graphical Models; Clarendon Press: Oxford, UK, 1996; Volume 17. [Google Scholar]
  27. Ichimura, H. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. J. Econom. 1993, 58, 71–120. [Google Scholar] [CrossRef] [Green Version]
  28. Li, K.C. Sliced inverse regression for dimension reduction. J. Am. Stat. Assoc. 1991, 86, 316–327. [Google Scholar] [CrossRef]
  29. Cook, R.D. Regression Graphics: Ideas for Studying Regressions through Graphics; John Wiley & Sons: Hoboken, NJ, USA, 2009; Volume 482. [Google Scholar]
  30. Yin, X.; Li, B.; Cook, R.D. Successive direction extraction for estimating the central subspace in a multiple-index regression. J. Multivar. Anal. 2008, 99, 1733–1757. [Google Scholar] [CrossRef] [Green Version]
  31. Li, K.C. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein’s lemma. J. Am. Stat. Assoc. 1992, 87, 1025–1039. [Google Scholar] [CrossRef]
  32. Wu, Q.; Liang, F.; Mukherjee, S. Kernel sliced inverse regression: Regularization and consistency. In Abstract and Applied Analysis; Hindawi: London, UK, 2013; Volume 2013. [Google Scholar]
  33. Hall, P.; Li, K.C. On almost linearity of low dimensional projections from high dimensional data. Ann. Stat. 1993, 21, 867–889. [Google Scholar] [CrossRef]
  34. Chiaromonte, F.; Cook, R.D.; Li, B. Sufficient dimension reduction in regressions with categorical predictors. Ann. Stat. 2002, 30, 475–497. [Google Scholar] [CrossRef]
  35. Pearl, J. Causality; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
  36. Berge, C. Hypergraphs: Combinatorics of Finite Sets; Elsevier: Amsterdam, The Netherlands, 1984; Volume 45. [Google Scholar]
  37. Cover, T.M.; Thomas, J. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2006. [Google Scholar]
  38. Berk, R.; Buja, A.; Brown, L.; George, E.; Kuchibhotla, A.K.; Su, W.; Zhao, L. Assumption lean regression. Am. Stat. 2019, 75, 76–84. [Google Scholar] [CrossRef] [Green Version]
  39. Ye, Z.; Weiss, R.E. Using the bootstrap to select one of a new class of dimension reduction methods. J. Am. Stat. Assoc. 2003, 98, 968–979. [Google Scholar] [CrossRef]
  40. Luo, W.; Li, B. Combining eigenvalues and variation of eigenvectors for order determination. Biometrika 2016, 103, 875–887. [Google Scholar] [CrossRef]
  41. Luo, W.; Li, B. On order determination by predictor augmentation. Biometrika 2021, 108, 557–574. [Google Scholar] [CrossRef]
  42. Barber, R.F.; Candès, E.J. Controlling the false discovery rate via knockoffs. Ann. Stat. 2015, 43, 2055–2085. [Google Scholar] [CrossRef] [Green Version]
  43. Substance Abuse and Mental Health Services Administration. Opioid Treatment Program (OTP) Guidance; Substance Abuse and Mental Health Services Administration: Rockville, MD, USA, 2020.
  44. Saxon, A.J.; Ling, W.; Hillhouse, M.; Thomas, C.; Hasson, A.; Ang, A.; Doraimani, G.; Tasissa, G.; Lokhnygina, Y.; Leimberger, J.; et al. Buprenorphine/naloxone and methadone effects on laboratory indices of liver health: A randomized trial. Drug Alcohol Depend. 2013, 128, 71–76. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  45. Naik, P.; Tsai, C.L. Partial least squares estimator for single-index models. J. R. Stat. Soc. Ser. B 2000, 62, 763–771. [Google Scholar] [CrossRef]
  46. Li, K.C.; Shedden, K. Identification of shared components in large ensembles of time series using dimension reduction. J. Am. Stat. Assoc. 2002, 97, 759–765. [Google Scholar] [CrossRef]
  47. Cai, Z.; Li, R.; Zhu, L. Online Sufficient Dimension Reduction Through Sliced Inverse Regression. J. Mach. Learn. Res. 2020, 21, 1–25. [Google Scholar]
  48. Artemiou, A.; Dong, Y.; Shin, S.J. Real-time sufficient dimension reduction through principal least squares support vector machines. Pattern Recognit. 2021, 112, 107768. [Google Scholar] [CrossRef]