Article

Non-Linear Canonical Correlation Analysis Using Alpha-Beta Divergence

Abhijit Mandal and Andrzej Cichocki
Laboratory for Advanced Brain Signal Processing, RIKEN Brain Science Institute, 2-1 Hirosawa, Wako, 351-0198 Saitama, Japan
* Author to whom correspondence should be addressed.
Entropy 2013, 15(7), 2788-2804; https://doi.org/10.3390/e15072788
Submission received: 14 June 2013 / Revised: 12 July 2013 / Accepted: 15 July 2013 / Published: 18 July 2013

Abstract

We propose a generalized method of canonical correlation analysis using the Alpha-Beta divergence, called AB-canonical analysis (ABCA). From observations of two random variables, $\mathbf{x} \in \mathbb{R}^P$ and $\mathbf{y} \in \mathbb{R}^Q$, ABCA finds directions, $\mathbf{w}_x \in \mathbb{R}^P$ and $\mathbf{w}_y \in \mathbb{R}^Q$, such that the AB-divergence between the joint distribution of $(\mathbf{w}_x^T \mathbf{x}, \mathbf{w}_y^T \mathbf{y})$ and the product of their marginal distributions is maximized. The number of significant non-zero canonical coefficients is determined by a sequential permutation test. The advantage of our method over standard canonical correlation analysis (CCA) is that it can reconstruct the hidden non-linear relationship between $\mathbf{w}_x^T \mathbf{x}$ and $\mathbf{w}_y^T \mathbf{y}$, and it is robust against outliers. We extend ABCA to the case where data are observed as tensors. We further generalize the method by imposing sparseness constraints. An extensive simulation study is performed to justify our approach.

1. Introduction

In statistics and data analysis, we are often interested in finding the relationship between two sets of multi-dimensional random variables, $\mathbf{x} \in \mathbb{R}^P$ and $\mathbf{y} \in \mathbb{R}^Q$. Canonical correlation analysis (CCA) focuses on the correlation between a linear combination of the variables in one set and a linear combination of the variables in the other set. The idea is to first determine linear combinations of $\mathbf{x}$ and $\mathbf{y}$, called canonical variables, such that the correlation between them is the highest possible among all such linear combinations.
Based on the observed random sample, the aim of standard CCA is to find the linear relationship between $\mathbf{x}$ and $\mathbf{y}$. Therefore, the method fails if the relationship is non-linear. Another disadvantage of standard CCA is that it is very sensitive to outliers, as it is based on the correlation coefficient. In this paper, we generalize the concept of CCA so that it can extract the non-linear relationship between two sets of variables and, at the same time, is robust against outliers. We assume that there exists a hidden relationship of the following type:
$$\mathbf{w}_y^T \mathbf{y} = \psi(\mathbf{w}_x^T \mathbf{x}) + \epsilon, \qquad (1)$$
where ψ is an unknown smooth function and ϵ is the random error. Our aim is to recover the vectors, $\mathbf{w}_x \in \mathbb{R}^P$ and $\mathbf{w}_y \in \mathbb{R}^Q$, from observed values of $\mathbf{x}$ and $\mathbf{y}$. Yin (2004) [1] developed a technique to solve this problem based on an information-theoretic approach (see, also, Yin et al., 2008 [2]; Iaci et al., 2010 [3]). Recently, Iaci and Sriram (2013) [4] applied this method using the beta-divergence and the power divergence. Wang et al. (2012) [5] used the Bregman divergence to perform CCA. We explore this problem in detail and extend the method by using the Alpha-Beta divergence (or AB-divergence) (Cichocki et al., 2011 [6]), which is a generalized measure of divergence. Moreover, the earlier methods are limited to the case where $\mathbf{x}$ and $\mathbf{y}$ are random vectors; we extend them to tensor (multiway array) valued random variables.
Kernel CCA (Lai and Fyfe, 2000 [7]; Shawe-Taylor and Cristianini, 2004 [8]) deals with the non-linear relationship between two sets of random variables, but the setting of the problem is different from ours. Kernel CCA first transforms the data to a higher (or infinite) dimensional non-linear space, called the reproducing kernel Hilbert space, and then assumes that there exists a linear relationship between the variables in the transformed space. In kernel CCA, it is not possible to recover the non-linear relationship, whereas in our case, we can recover the unknown function, ψ, in Equation (1) by further analysis (see Breiman and Friedman, 1985 [9]). However, in this paper, our main interest is to recover $\mathbf{w}_x$ and $\mathbf{w}_y$, which satisfy Equation (1).
The rest of the paper is organized as follows. In Section 2 and Section 3, we discuss the basic formulations of CCA and AB-divergence, respectively. The new method, AB-canonical analysis (ABCA), is proposed in Section 4. In Section 5, we describe the algorithm of ABCA. The sequential permutation test is proposed to determine the number of significant canonical variable pairs in Section 6. In Section 7, we generalize ABCA when data sets are observed as tensors. The sparsity constraint is introduced in Section 8. Numerical illustrations of the performance of this method are presented in Section 9. Section 10 has some concluding remarks.

2. Canonical Correlation Analysis

Suppose we have N pairs of observations from two sets of random variables, $\mathbf{x}$ and $\mathbf{y}$: $\{\mathbf{x}(n) \in \mathbb{R}^P, \mathbf{y}(n) \in \mathbb{R}^Q;\ n = 1, 2, \ldots, N\}$. In CCA, we look for linear combinations of $\mathbf{x}$ and $\mathbf{y}$ that have maximum correlation with each other (Hotelling, 1936 [10]). Formally, classical CCA computes two projection vectors, $\mathbf{w}_x \in \mathbb{R}^P$ and $\mathbf{w}_y \in \mathbb{R}^Q$, such that the correlation coefficient:
$$\rho = \frac{\mathbf{w}_x^T \Sigma_{xy} \mathbf{w}_y}{\sqrt{\mathbf{w}_x^T \Sigma_x \mathbf{w}_x}\,\sqrt{\mathbf{w}_y^T \Sigma_y \mathbf{w}_y}}$$
is maximized, where $\Sigma_{xy}$ is the covariance matrix between $\mathbf{x}$ and $\mathbf{y}$, and $\Sigma_x$ and $\Sigma_y$ are the dispersion matrices of $\mathbf{x}$ and $\mathbf{y}$, respectively. Since ρ is invariant to the scaling of the vectors $\mathbf{w}_x$ and $\mathbf{w}_y$, CCA can be formulated equivalently as the following constrained optimization problem:
$$\max_{\mathbf{w}_x, \mathbf{w}_y} \mathbf{w}_x^T \Sigma_{xy} \mathbf{w}_y, \quad \text{subject to} \quad \mathbf{w}_x^T \Sigma_x \mathbf{w}_x = \mathbf{w}_y^T \Sigma_y \mathbf{w}_y = 1.$$
We denote the optimum values of $(\mathbf{w}_x, \mathbf{w}_y)$ as $({}_1\mathbf{w}_x, {}_1\mathbf{w}_y)$. We refer to $u_1 = {}_1\mathbf{w}_x^T \mathbf{x}$ and $v_1 = {}_1\mathbf{w}_y^T \mathbf{y}$ as the first pair of canonical variables.
Next, we determine a new pair of linear combinations, say $u_2$ and $v_2$, which has the highest correlation subject to $u_2$ being uncorrelated with $u_1$ and $v_2$ being uncorrelated with $v_1$ (the construction actually ensures that $u_1$ and $v_2$ are uncorrelated as well, as are $u_2$ and $v_1$). Therefore, at the i-th step, the canonical vectors are obtained as:
$${}_i\mathbf{w}_x, {}_i\mathbf{w}_y = \arg\max_{\mathbf{w}_x, \mathbf{w}_y} \mathbf{w}_x^T \Sigma_{xy} \mathbf{w}_y$$
subject to:
$${}_i\mathbf{w}_x^T \Sigma_x\, {}_i\mathbf{w}_x = {}_i\mathbf{w}_y^T \Sigma_y\, {}_i\mathbf{w}_y = 1,$$
$${}_j\mathbf{w}_x^T \Sigma_x\, {}_i\mathbf{w}_x = {}_j\mathbf{w}_y^T \Sigma_y\, {}_i\mathbf{w}_y = 0,$$
for all $j = 1, 2, \ldots, i-1$ and $i \leq \min\{P, Q\}$. The process continues until subsequent pairs of linear combinations no longer produce a significant correlation.
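For reference, the classical procedure above can be computed in closed form from sample covariance matrices via a standard whitening-plus-SVD reduction. The following Python sketch is our own illustration (function and variable names are ours, not from the paper):

```python
import numpy as np

def classical_cca(X, Y):
    """Classical CCA: canonical correlations and projection vectors.

    X : (N, P) data matrix, Y : (N, Q) data matrix (rows = observations).
    Returns rho (canonical correlations, descending) and matrices
    Wx, Wy whose i-th columns are the i-th canonical vectors.
    """
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    N = X.shape[0]
    Sx = Xc.T @ Xc / (N - 1)
    Sy = Yc.T @ Yc / (N - 1)
    Sxy = Xc.T @ Yc / (N - 1)

    def inv_sqrt(S):
        # Symmetric inverse square root via eigendecomposition.
        d, U = np.linalg.eigh(S)
        return U @ np.diag(1.0 / np.sqrt(d)) @ U.T

    Sx_is, Sy_is = inv_sqrt(Sx), inv_sqrt(Sy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations rho_1 >= rho_2 >= ...
    U, rho, Vt = np.linalg.svd(Sx_is @ Sxy @ Sy_is)
    Wx = Sx_is @ U        # columns satisfy wx' Sx wx = 1
    Wy = Sy_is @ Vt.T     # columns satisfy wy' Sy wy = 1
    return rho, Wx, Wy
```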

3. AB-Divergence

Consider two density functions, f and g, with respect to the Lebesgue measure. The AB-divergence (Cichocki et al., 2011 [6]) between f and g is denoted by $D_{\alpha,\beta}(f\|g)$ and is defined as:
$$D_{\alpha,\beta}(f\|g) = -\frac{1}{\alpha\beta}\int_x \left( f^\alpha(x)\, g^\beta(x) - \frac{\alpha}{\alpha+\beta} f^{\alpha+\beta}(x) - \frac{\beta}{\alpha+\beta} g^{\alpha+\beta}(x) \right) dx,$$
where $\alpha, \beta, \alpha+\beta \neq 0$. The singularities at certain values of the parameters are avoided by taking continuous limits with respect to the parameters. Thus, the AB-divergence is expressed in a more explicit form as:
$$D_{\alpha,\beta}(f\|g) = \int_x d_{\alpha,\beta}(f, g)\, dx,$$
where:
$$d_{\alpha,\beta}(f, g) = \begin{cases} -\dfrac{1}{\alpha\beta}\left( f^\alpha g^\beta - \dfrac{\alpha}{\alpha+\beta} f^{\alpha+\beta} - \dfrac{\beta}{\alpha+\beta} g^{\alpha+\beta} \right) & \text{if } \alpha, \beta, \alpha+\beta \neq 0 \\[2mm] \dfrac{1}{\alpha^2}\left( f^\alpha \ln\dfrac{f^\alpha}{g^\alpha} - f^\alpha + g^\alpha \right) & \text{if } \alpha \neq 0,\ \beta = 0 \\[2mm] \dfrac{1}{\alpha^2}\left( \ln\dfrac{g^\alpha}{f^\alpha} + \left(\dfrac{g}{f}\right)^{-\alpha} - 1 \right) & \text{if } \alpha = -\beta \neq 0 \\[2mm] \dfrac{1}{\beta^2}\left( g^\beta \ln\dfrac{g^\beta}{f^\beta} - g^\beta + f^\beta \right) & \text{if } \alpha = 0,\ \beta \neq 0 \\[2mm] \dfrac{1}{2}\left( \ln f - \ln g \right)^2 & \text{if } \alpha = \beta = 0. \end{cases}$$
The class of AB-divergences contains several important divergences: for suitable choices of the parameters α and β, we can recover them (Amari, 2007 [11]; Minami and Eguchi, 2002 [12]). For example, when $\alpha + \beta = 1$, the AB-divergence reduces to the Alpha-divergence (Amari, 2007 [11]; Cichocki et al., 2011 [6]). On the other hand, when $\alpha = 1$, it becomes the Beta-divergence (Basu et al., 1998 [13]; Cichocki et al., 2006 [14]; Kompass, 2007 [15]; Minami and Eguchi, 2002 [12]; Févotte et al., 2009 [16]). The AB-divergence becomes the standard Kullback-Leibler divergence for $\alpha = 1$ and $\beta = 0$. The Itakura-Saito divergence and the Hellinger distance also belong to the class of AB-divergences (Cichocki et al., 2006 [14]; Févotte et al., 2009 [16]).
One important property of the divergence is that $D_{\alpha,\beta}(f\|g)$ is non-negative for all f and g and is equal to zero if and only if $f = g$ almost everywhere (Cichocki et al., 2011 [6]). Let us take f to be the joint density of two random variables, $\mathbf{x}$ and $\mathbf{y}$, and g to be the product of their marginal densities. Then, $D_{\alpha,\beta}(f\|g) = 0$ if and only if $\mathbf{x}$ and $\mathbf{y}$ are independent. We will use this property of the AB-divergence to find the canonical variables.
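For concreteness, the piecewise integrand $d_{\alpha,\beta}(f, g)$ can be evaluated numerically as follows. This is a minimal sketch of ours (not code from the paper) that dispatches to the appropriate limiting form:

```python
import numpy as np

def d_alpha_beta(f, g, alpha, beta, eps=1e-12):
    """Pointwise AB-divergence integrand d_{alpha,beta}(f, g).

    f, g : arrays of (estimated) density values, assumed positive.
    Implements the five cases above, including the continuous limits
    at alpha = 0, beta = 0 and alpha = -beta.
    """
    f = np.maximum(f, eps)   # guard against log/division by zero
    g = np.maximum(g, eps)
    a, b = alpha, beta
    if a != 0 and b != 0 and a + b != 0:
        return -(f**a * g**b - a/(a+b) * f**(a+b) - b/(a+b) * g**(a+b)) / (a*b)
    if a != 0 and b == 0:    # Beta-divergence limit (KL for a = 1)
        return (f**a * np.log(f**a / g**a) - f**a + g**a) / a**2
    if a != 0 and a == -b:   # alpha = -beta limit
        return (np.log(g**a / f**a) + (g/f)**(-a) - 1.0) / a**2
    if a == 0 and b != 0:    # alpha = 0 limit
        return (g**b * np.log(g**b / f**b) - g**b + f**b) / b**2
    return 0.5 * (np.log(f) - np.log(g))**2   # alpha = beta = 0
```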

4. AB-Canonical Analysis

Let us denote the joint density of two random variables by $f(\cdot, \cdot)$ and a marginal density by $f(\cdot)$. We define the AB-divergence between the joint distribution of $(\mathbf{w}_x^T \mathbf{x}, \mathbf{w}_y^T \mathbf{y})$ and the product of their marginal distributions as:
$$D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y) = D_{\alpha,\beta}\left( f(\mathbf{w}_x^T \mathbf{x}, \mathbf{w}_y^T \mathbf{y})\, \|\, f(\mathbf{w}_x^T \mathbf{x})\, f(\mathbf{w}_y^T \mathbf{y}) \right).$$
From the property of the AB-divergence, we know that $D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y) = 0$ if and only if $\mathbf{w}_x^T \mathbf{x}$ and $\mathbf{w}_y^T \mathbf{y}$ are statistically independent. Here, our aim is to find directions $\mathbf{w}_x$ and $\mathbf{w}_y$, such that $\mathbf{w}_x^T \mathbf{x}$ and $\mathbf{w}_y^T \mathbf{y}$ are as dependent as possible. Therefore, we find $\mathbf{w}_x$ and $\mathbf{w}_y$ from the optimization problem:
$$\max_{\mathbf{w}_x, \mathbf{w}_y} D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y), \quad \text{subject to} \quad \mathbf{w}_x^T \mathbf{w}_x = \mathbf{w}_y^T \mathbf{w}_y = 1.$$
We denote the first set of AB-canonical vectors as $({}_1\mathbf{w}_x, {}_1\mathbf{w}_y)$. The i-th set of canonical vectors is obtained as:
$${}_i\mathbf{w}_x, {}_i\mathbf{w}_y = \arg\max_{\mathbf{w}_x, \mathbf{w}_y} D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y),$$
subject to:
$${}_i\mathbf{w}_x^T\, {}_i\mathbf{w}_x = {}_i\mathbf{w}_y^T\, {}_i\mathbf{w}_y = 1,$$
$${}_j\mathbf{w}_x^T\, {}_i\mathbf{w}_x = {}_j\mathbf{w}_y^T\, {}_i\mathbf{w}_y = 0,$$
for all $j = 1, 2, \ldots, i-1$ and $i \leq \min\{P, Q\}$. As in CCA, we continue until subsequent pairs of canonical variables no longer produce a significant dependence.
We note that $D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y) = 0$ implies that $\mathbf{w}_x^T \mathbf{x}$ and $\mathbf{w}_y^T \mathbf{y}$ are statistically independent, regardless of the distributions of $\mathbf{x}$ and $\mathbf{y}$. On the other hand, in standard CCA, a zero canonical correlation implies that $\mathbf{x}$ and $\mathbf{y}$ are uncorrelated, but in general, they may not be independent. However, if $\mathbf{x}$ and $\mathbf{y}$ follow a joint normal distribution, then uncorrelatedness does imply independence. The concept of statistical dependence is more general and flexible than the concept of correlation: if $\mathbf{x}$ and $\mathbf{y}$ are independent, then they are also uncorrelated, but not vice versa.

5. ABCA Algorithm

Suppose we have N pairs of observations from two sets of random variables, $\mathbf{x}$ and $\mathbf{y}$: $\{\mathbf{x}(n) \in \mathbb{R}^P, \mathbf{y}(n) \in \mathbb{R}^Q;\ n = 1, 2, \ldots, N\}$. We calculate $D_{\alpha,\beta}^{(N)}(\mathbf{w}_x, \mathbf{w}_y)$, the sample version of $D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y)$, using kernel density estimates (Yin, 2004 [1]). Therefore,
$$D_{\alpha,\beta}^{(N)}(\mathbf{w}_x, \mathbf{w}_y) = D_{\alpha,\beta}\left( f_N(\mathbf{w}_x^T \mathbf{x}, \mathbf{w}_y^T \mathbf{y})\, \|\, f_N(\mathbf{w}_x^T \mathbf{x})\, f_N(\mathbf{w}_y^T \mathbf{y}) \right),$$
where:
$$f_N(u) = \frac{1}{Nh} \sum_{n=1}^{N} K\left( \frac{u - u_n}{h} \right), \quad u \in \mathbb{R},$$
and:
$$f_N(u, v) = \frac{1}{N h_1 h_2} \sum_{n=1}^{N} K_2\left( \frac{u - u_n}{h_1}, \frac{v - v_n}{h_2} \right), \quad (u, v) \in \mathbb{R}^2.$$
Here, $h$, $h_1$ and $h_2$ are suitably chosen bandwidths, and $K(\cdot)$ and $K_2(\cdot, \cdot)$ are univariate and bivariate kernels, respectively. For simplicity, we take the product kernel (Scott, 1992 [17]), i.e.:
$$f_N(u, v) = \frac{1}{N h_1 h_2} \sum_{n=1}^{N} K\left( \frac{u - u_n}{h_1} \right) K\left( \frac{v - v_n}{h_2} \right), \quad (u, v) \in \mathbb{R}^2.$$
For the kernel density estimates to converge to the corresponding underlying densities, the bandwidth parameters must tend to zero as the sample size increases. We follow the method described in Silverman (1986) [18], taking $h = 1.06\, s\, N^{-1/5}$ and $h_j = s_j N^{-1/6}$, $j = 1, 2$, where $s$, $s_1$ and $s_2$ are the corresponding standard deviations. Moreover, this choice of the bandwidth parameters satisfies the condition of Theorem 1, stated later in this section. Here, we use the Gaussian kernel. A robust kernel may be used to make the procedure robust against outliers (Kim and Scott, 2012 [19]), but we prefer to choose suitable tuning parameters, α and β, to achieve robustness.
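A plug-in estimate of the sample divergence can then be assembled from Gaussian kernel density estimates with the bandwidths above. The sketch below is our own construction (not the paper's implementation): it evaluates the estimated joint and marginal densities on a rectangular grid, applies the integrand d_alpha_beta from the Section 3 sketch (assumed to be in scope), and approximates the integral by a Riemann sum:

```python
import numpy as np

def gauss_kde_1d(u, grid, h):
    """Univariate Gaussian KDE evaluated at the points in 'grid'."""
    z = (grid[:, None] - u[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(u) * h * np.sqrt(2 * np.pi))

def ab_objective(wx, wy, X, Y, alpha, beta, n_grid=64):
    """Sample AB-divergence between the joint density and the product
    of marginals of the projections (wx'x, wy'y)."""
    u, v = X @ wx, Y @ wy
    N = len(u)
    h1 = u.std() * N**(-1/6)            # bivariate bandwidths
    h2 = v.std() * N**(-1/6)
    hu = 1.06 * u.std() * N**(-1/5)     # univariate bandwidths
    hv = 1.06 * v.std() * N**(-1/5)

    gu = np.linspace(u.min() - 3*hu, u.max() + 3*hu, n_grid)
    gv = np.linspace(v.min() - 3*hv, v.max() + 3*hv, n_grid)
    fu = gauss_kde_1d(u, gu, hu)        # estimated marginal of u
    fv = gauss_kde_1d(v, gv, hv)        # estimated marginal of v

    # Product Gaussian kernel for the joint density on the mesh.
    Ku = np.exp(-0.5 * ((gu[:, None] - u[None, :]) / h1)**2) / (h1 * np.sqrt(2*np.pi))
    Kv = np.exp(-0.5 * ((gv[:, None] - v[None, :]) / h2)**2) / (h2 * np.sqrt(2*np.pi))
    f_joint = Ku @ Kv.T / N             # shape (n_grid, n_grid)

    d = d_alpha_beta(f_joint, np.outer(fu, fv), alpha, beta)
    du, dv = gu[1] - gu[0], gv[1] - gv[0]
    return d.sum() * du * dv            # 2-D Riemann sum of the integral
```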
The AB-canonical vectors obtained by maximizing the sample divergence above are consistent in the sense that they converge to the true canonical vectors for large sample sizes. The following theorem ensures this result; the proof proceeds along the same lines as Proposition 3 of Yin (2004) [1] or Theorem 1 of Iaci and Sriram (2013) [4].
Theorem 1: Assume that both the univariate and bivariate density functions, $f(\cdot)$ and $f(\cdot, \cdot)$, are continuous. Suppose that the kernel, K, is a function of bounded variation, and the sequence of bandwidth parameters, $h_n$, used in the k-dimensional density estimation satisfies the bound:
$$\sum_{n=1}^{\infty} e^{-\gamma n h_n^{2k}} < \infty, \quad \text{for all } \gamma > 0,$$
where $k = 1, 2$. Let us denote $(\hat{\mathbf{w}}_x, \hat{\mathbf{w}}_y) = \arg\max D_{\alpha,\beta}^{(N)}(\mathbf{w}_x, \mathbf{w}_y)$ and $(\mathbf{w}_x, \mathbf{w}_y) = \arg\max D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y)$, where $(\alpha, \beta) \in \mathbb{R}^2$. Then, $(\hat{\mathbf{w}}_x, \hat{\mathbf{w}}_y) \to (\mathbf{w}_x, \mathbf{w}_y)$ almost surely as $N \to \infty$.
It should be mentioned here that the optimization problem of Section 4 is non-linear, and the algorithm may get stuck at a local maximum. Therefore, it is often necessary to repeat the algorithm several times with different initial values to obtain the appropriate solution. We use the interior-point algorithm (see Byrd et al., 1999 [20]; Byrd et al., 2000 [21]) to estimate the canonical vectors, $\mathbf{w}_x$ and $\mathbf{w}_y$. A MATLAB program for ABCA can be found in [22].
The value of $D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y)$ is always non-negative, but there is no fixed upper bound valid for all values of α and β. Therefore, it is difficult to interpret the result directly from the values of the AB-divergence. In standard CCA, by contrast, a canonical coefficient close to one signifies strong dependence. We therefore calculate the maximal correlation (Breiman and Friedman, 1985 [9]) as a measure of dependency. The maximal correlation coefficient between $\mathbf{w}_x^T \mathbf{x}$ and $\mathbf{w}_y^T \mathbf{y}$ is denoted by $\rho^*$ and is defined as:
$$\rho^* = \max_{\psi} \operatorname{Corr}\left( \mathbf{w}_y^T \mathbf{y}, \psi(\mathbf{w}_x^T \mathbf{x}) \right).$$
We call $\rho^*$ the AB-canonical coefficient. It is the maximum possible correlation between $\mathbf{w}_y^T \mathbf{y}$ and any function of $\mathbf{w}_x^T \mathbf{x}$, and its value lies in [0, 1]. We compute $\rho^*$ using the alternating conditional expectation (ACE) algorithm (Breiman and Friedman, 1985 [9]).
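Since only $\mathbf{w}_x^T \mathbf{x}$ is transformed in the definition above, the maximizing ψ is the conditional mean $E[\mathbf{w}_y^T \mathbf{y} \mid \mathbf{w}_x^T \mathbf{x}]$. The following sketch estimates $\rho^*$ with a kernel smoother in place of the full ACE iteration of [9]; it is a simplification of ours, not the paper's implementation:

```python
import numpy as np

def ab_canonical_coefficient(u, v):
    """Estimate rho* = max_psi Corr(v, psi(u)).

    The maximizer is psi(u) = E[v | u]; we estimate it by a
    Nadaraya-Watson kernel smoother with a Silverman-type bandwidth.
    u, v : 1-D arrays of the projected samples wx'x and wy'y.
    """
    h = 1.06 * u.std() * len(u) ** (-1 / 5)
    w = np.exp(-0.5 * ((u[:, None] - u[None, :]) / h) ** 2)
    psi = (w @ v) / w.sum(axis=1)      # smoothed E[v | u] at each u_n
    return np.corrcoef(v, psi)[0, 1]   # lies in [0, 1] up to sampling noise
```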

6. Sequential Permutation Test

One advantage of ABCA is that a zero AB-canonical coefficient implies that the corresponding AB-canonical variables are independent, regardless of the distributions of $\mathbf{x}$ and $\mathbf{y}$. Therefore, a non-parametric sequential permutation test can be applied to determine the number of significant AB-canonical variables (Yin, 2004 [1]; Efron and Tibshirani, 1993 [23]; Davison and Hinkley, 1997 [24]). By contrast, the test of significance for standard CCA is very complicated, and it typically relies on a normality assumption (Yin, 2004 [1]).
Let $({}_i\mathbf{w}_x, {}_i\mathbf{w}_y)$ be the i-th pair of AB-canonical vectors. We want to test the following hypothesis:
$${}_iH_0: D_{\alpha,\beta}({}_i\mathbf{w}_x, {}_i\mathbf{w}_y) = 0, \quad \text{vs.} \quad {}_iH_1: D_{\alpha,\beta}({}_i\mathbf{w}_x, {}_i\mathbf{w}_y) > 0.$$
Under ${}_iH_0$, the two canonical variables, ${}_i\mathbf{w}_x^T \mathbf{x}$ and ${}_i\mathbf{w}_y^T \mathbf{y}$, are independent. First, we fix the previously found AB-canonical vectors, $({}_j\mathbf{w}_x, {}_j\mathbf{w}_y)$, $j = 1, 2, \ldots, i-1$. Then, we take a random permutation of the N observations of $\mathbf{x}$, say $\mathbf{x}^*$, and perform ABCA with $\mathbf{x}^*$ and $\mathbf{y}$ using the algorithm described in Section 5. Let us denote the corresponding AB-divergence measure by $D_{\alpha,\beta}^*$.
We repeat this procedure a sufficient number of times (say, R times) and calculate $D_{\alpha,\beta}^{*}(r)$, the corresponding AB-divergence measure for the r-th permutation, $r = 1, 2, \ldots, R$. Let $D_\gamma$ be the $(1-\gamma)$-th percentile of $D_{\alpha,\beta}^{*}(r)$, $r = 1, 2, \ldots, R$, where γ is the level of significance of the test. Then, we reject the null hypothesis, ${}_iH_0$, if:
$$D_{\alpha,\beta}^{(N)}({}_i\mathbf{w}_x, {}_i\mathbf{w}_y) > D_\gamma,$$
where $D_{\alpha,\beta}^{(N)}({}_i\mathbf{w}_x, {}_i\mathbf{w}_y)$ is the value actually observed for $D_{\alpha,\beta}({}_i\mathbf{w}_x, {}_i\mathbf{w}_y)$ without permuting the data. If ${}_iH_0$ is rejected, we proceed to the next step and compute another AB-canonical variable pair.
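One step of the sequential test can be sketched as follows. Here abca_fit stands in for the optimizer of Section 5 and, for brevity, the orthogonality constraints against previously found pairs are omitted; both the names and the simplifications are ours:

```python
import numpy as np

def permutation_test(X, Y, alpha, beta, D_obs, abca_fit,
                     R=200, gamma=0.05, rng=None):
    """Permutation test for iH0: D_{alpha,beta} = 0.

    D_obs    : observed divergence for the current canonical pair.
    abca_fit : callable (X, Y, alpha, beta) -> maximized sample
               divergence, i.e. the optimizer of Section 5.
    Rejects iH0 when D_obs exceeds the (1 - gamma) percentile of the
    divergences obtained after permuting the rows of X.
    """
    rng = rng or np.random.default_rng()
    D_perm = np.empty(R)
    for r in range(R):
        X_star = X[rng.permutation(len(X))]   # break the x-y pairing
        D_perm[r] = abca_fit(X_star, Y, alpha, beta)
    D_gamma = np.quantile(D_perm, 1.0 - gamma)
    return D_obs > D_gamma, D_gamma
```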

7. Extension to Tensor

In this section, we extend ABCA to the case of tensor data. In many applications, the data structures contain higher-order modes, such as subjects, groups, trials, classes and conditions, together with the intrinsic dimensions of space, time and frequency. Many studies in neuroscience involve recording data over time for multiple subjects (people or animals) under different conditions, leading to experimental data structures conveniently represented by multiway tensors. We generalize ABCA to extract meaningful components from this type of high-dimensional tensor data.
Tensors are denoted by underlined capital boldface letters, e.g., $\underline{\mathbf{Y}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_Q}$. The order of a tensor is the number of modes, also known as ways or dimensions (e.g., frequency, subjects, trials, classes, groups and conditions). Throughout this section, we use the basic tensor operations proposed in the literature (Kolda and Bader, 2009 [25]; Cichocki et al., 2009 [26]). Specifically, the mode-n multiplication of a tensor, $\underline{\mathbf{Y}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_Q}$, by a vector, $\mathbf{a} \in \mathbb{R}^{I_n}$, is denoted by:
$$\underline{\mathbf{Y}} \bar{\times}_n \mathbf{a} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_Q},$$
where the $(i_1, i_2, \ldots, i_{n-1}, i_{n+1}, \ldots, i_Q)$-th element is given by:
$$\sum_{i_n=1}^{I_n} y_{i_1, i_2, \ldots, i_Q}\, a_{i_n}.$$
The successive mode-n multiplication of a tensor, $\underline{\mathbf{Y}} \in \mathbb{R}^{I \times J \times K}$, by vectors, $\mathbf{a} \in \mathbb{R}^I$, $\mathbf{b} \in \mathbb{R}^J$ and $\mathbf{c} \in \mathbb{R}^K$, can be expressed as:
$$\underline{\mathbf{Y}} \bar{\times}_1 \mathbf{a}\, \bar{\times}_2 \mathbf{b}\, \bar{\times}_3 \mathbf{c} = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} y_{ijk}\, a_i b_j c_k.$$
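In NumPy, the mode-n product with a vector is a single tensordot contraction. The following sketch (ours, not from the paper) reproduces the three-way example above and checks it against an explicit einsum:

```python
import numpy as np

def mode_n_vec(T, a, n):
    """Mode-n multiplication of tensor T by vector a: contracts
    mode n away, so the result has one fewer dimension."""
    return np.tensordot(T, a, axes=([n], [0]))

# Three-way example: Y x1 a x2 b x3 c reduces to a scalar.
I, J, K = 4, 3, 2
Y = np.random.randn(I, J, K)
a, b, c = np.random.randn(I), np.random.randn(J), np.random.randn(K)

# Contract mode 1 first; the remaining modes shift down by one.
s = mode_n_vec(mode_n_vec(mode_n_vec(Y, a, 0), b, 0), c, 0)
assert np.isclose(s, np.einsum('ijk,i,j,k->', Y, a, b, c))
```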
Suppose we have two sets of data from the tensor-valued random variables, $\underline{\mathbf{X}}$ and $\underline{\mathbf{Y}}$: $\{\underline{\mathbf{X}}(n) \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_P}, \underline{\mathbf{Y}}(n) \in \mathbb{R}^{K_1 \times K_2 \times \cdots \times K_Q};\ n = 1, 2, \ldots, N\}$, where N is the sample size. In tensor ABCA, our aim is to find $\mathbf{w}_x^{(1)} \in \mathbb{R}^{I_1}, \mathbf{w}_x^{(2)} \in \mathbb{R}^{I_2}, \ldots, \mathbf{w}_x^{(P)} \in \mathbb{R}^{I_P}$ and $\mathbf{w}_y^{(1)} \in \mathbb{R}^{K_1}, \mathbf{w}_y^{(2)} \in \mathbb{R}^{K_2}, \ldots, \mathbf{w}_y^{(Q)} \in \mathbb{R}^{K_Q}$, such that the AB-divergence between the joint distribution of the canonical variables:
$$u_1 = \underline{\mathbf{X}} \bar{\times}_1 \mathbf{w}_x^{(1)} \bar{\times}_2 \mathbf{w}_x^{(2)} \cdots \bar{\times}_P \mathbf{w}_x^{(P)},$$
$$v_1 = \underline{\mathbf{Y}} \bar{\times}_1 \mathbf{w}_y^{(1)} \bar{\times}_2 \mathbf{w}_y^{(2)} \cdots \bar{\times}_Q \mathbf{w}_y^{(Q)},$$
and the product of their marginal distributions is maximized. We define:
$$D_{\alpha,\beta}(\mathbf{w}_x^{(1)}, \ldots, \mathbf{w}_x^{(P)}, \mathbf{w}_y^{(1)}, \ldots, \mathbf{w}_y^{(Q)}) = D_{\alpha,\beta}\left( f(u_1, v_1)\, \|\, f(u_1)\, f(v_1) \right).$$
Here, we find $\mathbf{w}_x^{(1)}, \ldots, \mathbf{w}_x^{(P)}$ and $\mathbf{w}_y^{(1)}, \ldots, \mathbf{w}_y^{(Q)}$ from the optimization problem:
$$\max_{\mathbf{w}_x^{(1)}, \ldots, \mathbf{w}_x^{(P)}, \mathbf{w}_y^{(1)}, \ldots, \mathbf{w}_y^{(Q)}} D_{\alpha,\beta}(\mathbf{w}_x^{(1)}, \ldots, \mathbf{w}_x^{(P)}, \mathbf{w}_y^{(1)}, \ldots, \mathbf{w}_y^{(Q)}),$$
subject to:
$$\mathbf{w}_x^{(p)T} \mathbf{w}_x^{(p)} = \mathbf{w}_y^{(q)T} \mathbf{w}_y^{(q)} = 1,$$
for $p = 1, 2, \ldots, P$ and $q = 1, 2, \ldots, Q$.
We denote the first set of AB-canonical vectors as $({}_1\mathbf{w}_x^{(1)}, \ldots, {}_1\mathbf{w}_x^{(P)}, {}_1\mathbf{w}_y^{(1)}, \ldots, {}_1\mathbf{w}_y^{(Q)})$. The i-th set of AB-canonical vectors, $({}_i\mathbf{w}_x^{(1)}, \ldots, {}_i\mathbf{w}_x^{(P)}, {}_i\mathbf{w}_y^{(1)}, \ldots, {}_i\mathbf{w}_y^{(Q)})$, is obtained as:
$$\arg\max_{\mathbf{w}_x^{(1)}, \ldots, \mathbf{w}_x^{(P)}, \mathbf{w}_y^{(1)}, \ldots, \mathbf{w}_y^{(Q)}} D_{\alpha,\beta}\left( \mathbf{w}_x^{(1)}, \ldots, \mathbf{w}_x^{(P)}, \mathbf{w}_y^{(1)}, \ldots, \mathbf{w}_y^{(Q)} \right),$$
subject to:
$${}_i\mathbf{w}_x^{(p)T}\, {}_i\mathbf{w}_x^{(p)} = {}_i\mathbf{w}_y^{(q)T}\, {}_i\mathbf{w}_y^{(q)} = 1,$$
$${}_j\mathbf{w}_x^{(p)T}\, {}_i\mathbf{w}_x^{(p)} = {}_j\mathbf{w}_y^{(q)T}\, {}_i\mathbf{w}_y^{(q)} = 0,$$
for all $j = 1, 2, \ldots, i-1$.

8. Sparseness Constraints

Standard CCA has some disadvantages, especially for large-scale and noisy problems. In general, the canonical variables are linear combinations of all the components of $\mathbf{x}$ (or $\mathbf{y}$). This means the canonical variables are dense (not sparse), which often makes the physical interpretation of CCA difficult. In many applications (in genetics, image analysis, etc.), the coordinate axes have a physical interpretation (each axis may correspond to a specific feature), so a sparse canonical variable is more meaningful than a dense one. Recently, several modifications of CCA have been proposed that impose sparseness conditions on the canonical variables; the corresponding method is called sparse canonical correlation analysis (SCCA); see Torres et al. (2007) [27]. The main idea in SCCA is to force the canonical variables to be sparse; however, the sparsity profile should be adjustable or well controlled via some parameters in order to discover specific features in the observed data. In a similar way, we propose sparse AB-canonical analysis.
For sparse AB-canonical analysis, we impose suitable sparsity constraints on the canonical vectors (Witten et al., 2009 [28]; Witten, 2010 [29]). Here, the optimization problem reduces to:
$$(\mathbf{w}_x, \mathbf{w}_y) = \arg\max_{\mathbf{w}_x, \mathbf{w}_y} \left\{ D_{\alpha,\beta}(\mathbf{w}_x, \mathbf{w}_y) - \lambda_1 P_1(\mathbf{w}_x) - \lambda_2 P_2(\mathbf{w}_y) \right\},$$
subject to:
$$\mathbf{w}_x^T \mathbf{w}_x = 1, \quad \mathbf{w}_y^T \mathbf{w}_y = 1,$$
where $P_1$ and $P_2$ are penalty functions and $\lambda_1, \lambda_2$ are suitably chosen tuning parameters. Some frequently used penalty functions are:
$$P(\mathbf{w}) = \|\mathbf{w}\|_1 = \sum_i |w_i|, \quad \text{(LASSO)}$$
$$P(\mathbf{w}) = \|\mathbf{w}\|_0 = \sum_i \operatorname{sign}(|w_i|), \quad \text{(cardinality penalty)}$$
$$P(\mathbf{w}) = \sum_i |w_i| + \lambda \sum_i |w_i - w_{i-1}|. \quad \text{(fused LASSO)}$$
Here also, we use the interior-point algorithm to estimate the canonical vectors. A MATLAB code can be obtained simply by changing the objective function of the standard ABCA in [22]. However, if we use the cardinality penalty, then the program needs to be modified slightly, so that the algorithm searches for a solution in a lower-dimensional subspace. For tensor AB-canonical analysis, the sparseness constraints can be imposed in a similar way (see Allen, 2012 [30]).
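For reference, the penalties and the penalized objective can be written down directly. This is a schematic sketch of ours (the callable D stands in for the sample AB-divergence of Section 5; all names are ours):

```python
import numpy as np

def lasso_penalty(w):
    """P(w) = ||w||_1."""
    return np.abs(w).sum()

def cardinality_penalty(w, eps=1e-8):
    """P(w) = ||w||_0, the number of non-zero coefficients
    (counted with a small numerical tolerance)."""
    return int((np.abs(w) > eps).sum())

def fused_lasso_penalty(w, lam):
    """P(w) = sum_i |w_i| + lam * sum_i |w_i - w_{i-1}|."""
    return np.abs(w).sum() + lam * np.abs(np.diff(w)).sum()

def sparse_objective(wx, wy, D, lam1, lam2):
    """Penalized objective of sparse ABCA:
    D(wx, wy) - lam1 * P1(wx) - lam2 * P2(wy), here with LASSO penalties."""
    return D(wx, wy) - lam1 * lasso_penalty(wx) - lam2 * lasso_penalty(wy)
```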

9. Simulation Results

The validity and performance of the proposed ABCA are evaluated on simulated data. In the following examples, we generated $\{\mathbf{x}(n), \mathbf{y}(n);\ n = 1, 2, \ldots, N\}$ such that they satisfy a relationship of the form of Equation (1). Note that, for example, the following types of relations are included in the model:
$$b_1 y_1 + b_2 y_2 = (a_0 + a_1 x_1 + a_2 x_2)^2 + \epsilon,$$
$$b_1 y_1 + b_2 y_2 = \sin(a_0 + a_1 x_1 + a_2 x_2) + \epsilon,$$
$$b_1 y_1 + b_2 y_2 = (a_0 + a_1 x_1 + a_2 x_2)^2 + \sin(a_0 + a_1 x_1 + a_2 x_2) + \epsilon,$$
where $\mathbf{x} = (x_1, x_2, x_3)^T$, $\mathbf{y} = (y_1, y_2)^T$, and $b_1, b_2$ and $a_0, a_1, a_2$ are unknown constants. Here, ϵ is the random error. However, if $a_2 \neq 0$, then the following models are not included in Equation (1):
$$b_1 y_1 + b_2 y_2 = (a_0 + a_1 x_1)^2 + a_2 x_2 + \epsilon,$$
$$b_1 y_1 + b_2 y_2 = \sin(a_0 + a_1 x_1) + a_2 x_2 + \epsilon,$$
$$b_1 y_1 + b_2 y_2 = (a_0 + a_1 x_1)^2 + \sin(a_0 + a_1 x_1 + a_2 x_2) + \epsilon.$$
In the first example, we generated data such that there exists a non-linear relationship between $\mathbf{x}$ and $\mathbf{y}$. We will see that ABCA successfully extracts the hidden relationship, whereas standard CCA fails. In the next example, we demonstrate the robustness property of ABCA and compare it with standard CCA. Finally, we give an example in which the data sets are tensors.
Figure 1. (a) and (b): The scatter plots of the latent variables. (c) and (d): The scatter plots of the first two AB-canonical variable pairs. It is clearly seen that the non-linear relationship is reconstructed.

9.1. Extraction of Non-linear Relationship

Example 1: The dimensions of $\mathbf{x}$ and $\mathbf{y}$ are taken as six and four, respectively; so, $\mathbf{x} = (x_1, x_2, \ldots, x_6)^T$ and $\mathbf{y} = (y_1, y_2, y_3, y_4)^T$. $\mathbf{x}$ is the explanatory variable, whose components are generated from independent N(0, 1) random variables. $\mathbf{y}$ is the dependent variable, based on the following latent variables:
$$y_1^* = \sin(3\, \mathbf{a}_1^T \mathbf{x}) + \epsilon_1,$$
$$y_2^* = (\mathbf{a}_2^T \mathbf{x})^3 - \mathbf{a}_2^T \mathbf{x} + \epsilon_2,$$
where $\epsilon_1$ and $\epsilon_2$ are random errors with $\epsilon_i \sim 0.05\, N(0, 1)$, $i = 1, 2$. The coefficient vectors, $\mathbf{a}_1$ and $\mathbf{a}_2$, are generated from independent uniform$(-1/2, 1/2)$ random variables and then orthogonalized, so that $\mathbf{a}_1^T \mathbf{a}_2 = 0$. The relationship between $\mathbf{y}$ and the latent variables, $\mathbf{y}^* = (y_1^*, y_2^*)^T$, is assumed to be the linear combination:
$$y_1 = \mathbf{c}_1^T \mathbf{y}^*, \quad y_2 = \mathbf{c}_2^T \mathbf{y}^*,$$
and $y_3$ and $y_4$ are independent N(0, 1) random variables. The elements of the matrix, $C = (\mathbf{c}_1, \mathbf{c}_2)$, are generated from independent uniform$(-1/2, 1/2)$ random variables, and their rows are then orthogonalized, so that the columns of $C^{-1}$ become orthogonal. We generate a sample of size 100 from $\mathbf{x}$ and $\mathbf{y}$.
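The data-generating mechanism of this example can be reproduced as follows. This is our sketch under the stated assumptions; we orthogonalize with QR, which additionally normalizes the coefficient vectors (one of several valid choices, since the scale of the canonical vectors is not identifiable anyway), and for brevity we skip the row orthogonalization of C:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, Q = 100, 6, 4

X = rng.standard_normal((N, P))

# Orthogonal coefficient vectors a1, a2 with entries from U(-1/2, 1/2);
# QR orthogonalizes (and normalizes) the two columns.
A, _ = np.linalg.qr(rng.uniform(-0.5, 0.5, size=(P, 2)))
a1, a2 = A[:, 0], A[:, 1]                  # a1' a2 = 0

# Latent variables with small Gaussian noise.
y1_star = np.sin(3 * X @ a1) + 0.05 * rng.standard_normal(N)
y2_star = (X @ a2) ** 3 - X @ a2 + 0.05 * rng.standard_normal(N)
y_star = np.column_stack([y1_star, y2_star])

# Mix the latent variables into the first two coordinates of y.
C = rng.uniform(-0.5, 0.5, size=(2, 2))
Y = rng.standard_normal((N, Q))            # y3, y4 are pure noise
Y[:, :2] = y_star @ C                      # y1 = c1' y*, y2 = c2' y*
```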
The scatter plots of the latent variables are given in (a) and (b) of Figure 1. We perform ABCA on this data set with divergence parameters $\alpha = 0.5$ and $\beta = 0.5$. The first two AB-canonical variable pairs are plotted in (c) and (d) of Figure 1. The values of the first two AB-canonical coefficients are 0.9616 and 0.9301. It is evident that ABCA extracts the latent variables quite accurately. We note that the scale and the sign of the canonical vectors cannot be recovered by ABCA. The standard CCA fails to extract them, due to the non-linear relationship with the latent variables. The first two standard canonical variable pairs are plotted in (a) and (b) of Figure 2. The values of the first two canonical coefficients are 0.5704 and 0.3559.
Figure 2. Scatter plots for the first two standard canonical variable pairs. Here, canonical correlation analysis (CCA) fails to reconstruct the non-linear relationship.
Figure 3. (a) Simulated data with outliers inside the red circle. (b) Scatter plot for the AB-canonical variable pair.
Figure 4. (a) Scatter plot for the standard canonical variable pair. (b) Scatter plot for the canonical variable pair using the approach of Yin (2004) [1].

9.2. Robustness Property

Example 2: In this example, we examine the robustness property of ABCA. To compare it with standard CCA, we generated data such that $\mathbf{x}$ and $\mathbf{y}$ have a linear relationship, and then a few outliers were inserted. The dimensions of $\mathbf{x}$ and $\mathbf{y}$ are taken as five and three, respectively. All the components of $\mathbf{x}$ are generated from independent N(0, 1) random variables. For simplicity, we take the relationship between $\mathbf{x}$ and $\mathbf{y}$ as:
$$y_1 = 1 + x_1 + \epsilon,$$
where ϵ is the random error with $\epsilon \sim 0.05\, N(0, 1)$. Here, $x_1$ and $y_1$ are the first components of $\mathbf{x}$ and $\mathbf{y}$, respectively. The other components of $\mathbf{y}$ are generated from independent N(0, 1) random variables. We generated 90 random samples from this model and added 10 outliers, taking $x_1 = 0$ and $y_1 = 10$ for the outlying observations. Figure 3a shows the original data, with the 10 outlying observations inside the red circle. In Figure 3b, we plot the first AB-canonical variable pair; the divergence parameters are taken as $\alpha = 0.5$ and $\beta = 0.5$. It is seen that ABCA successfully extracts the canonical variables, whereas Figure 4a shows that standard CCA completely fails. In Figure 4b, we present the scatter plot of the first pair of canonical variables using the approach of Yin (2004) [1]. This approach is based on the Kullback-Leibler divergence, so it is a special case of ABCA with $\alpha = 1$ and $\beta = 0$. The values of the first AB-canonical coefficient for $(\alpha, \beta) = (0.5, 0.5)$ and $(\alpha, \beta) = (1, 0)$ are 0.9121 and 0.7107, respectively. Thus, we can make ABCA robust by choosing suitable tuning parameters.
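The contaminated sample of this example is straightforward to reproduce (our sketch under the stated assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, Q, n_out = 100, 5, 3, 10

X = rng.standard_normal((N, P))
Y = rng.standard_normal((N, Q))
Y[:, 0] = 1.0 + X[:, 0] + 0.05 * rng.standard_normal(N)   # y1 = 1 + x1 + eps

# Replace the last 10 observations by outliers at (x1, y1) = (0, 10),
# leaving 90 clean samples.
X[-n_out:, 0] = 0.0
Y[-n_out:, 0] = 10.0
```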

9.3. Tensor Data

Example 3: In this example, we generated data from the tensor-valued random variables, $\underline{\mathbf{X}}$ and $\underline{\mathbf{Y}}$. The dimensions of $\underline{\mathbf{X}}$ and $\underline{\mathbf{Y}}$ are taken as (4, 3, 2) and (3, 2, 2), respectively. $\underline{\mathbf{X}}$ is the explanatory variable, whose components are generated from independent N(0, 1) random variables. Let us define:
$$u_1 = \underline{\mathbf{X}} \bar{\times}_1 \mathbf{a}_x^{(1)} \bar{\times}_2 \mathbf{a}_x^{(2)} \bar{\times}_3 \mathbf{a}_x^{(3)},$$
$$u_2 = \underline{\mathbf{X}} \bar{\times}_1 \mathbf{b}_x^{(1)} \bar{\times}_2 \mathbf{b}_x^{(2)} \bar{\times}_3 \mathbf{b}_x^{(3)}.$$
The vectors, $\mathbf{a}_x^{(i)}$ and $\mathbf{b}_x^{(i)}$, $i = 1, 2, 3$, are generated from independent uniform$(-1/2, 1/2)$ random variables and then orthogonalized, so that $\mathbf{a}_x^{(i)T} \mathbf{b}_x^{(i)} = 0$, $i = 1, 2, 3$. $\underline{\mathbf{Y}}$ is the dependent variable, based on the following latent variables:
$$y_1^* = \cos(10 u_1) + \epsilon_1,$$
$$y_2^* = \frac{2}{100 u_2^2 + 1} + \epsilon_2,$$
where $\epsilon_1$ and $\epsilon_2$ are random errors with $\epsilon_i \sim 0.05\, N(0, 1)$, $i = 1, 2$. The relationship between $\underline{\mathbf{Y}}$ and the latent variables, $\mathbf{y}^* = (y_1^*, y_2^*)^T$, is assumed to be the linear combination:
$$y_{1,1,1} = \mathbf{c}_1^T \mathbf{y}^*, \quad y_{2,2,2} = \mathbf{c}_2^T \mathbf{y}^*,$$
and all other components of $\underline{\mathbf{Y}}$ are independent N(0, 1) random variables. The elements of the matrix, $C = (\mathbf{c}_1, \mathbf{c}_2)$, are generated in the same way as in Example 1. We generated a sample of size 100 from $\underline{\mathbf{X}}$ and $\underline{\mathbf{Y}}$.
The scatter plots of the latent variables are given in Figure 5a,b. We performed tensor ABCA on this data set with divergence parameters $\alpha = 0.5$ and $\beta = 0.5$. The first two tensor AB-canonical variable pairs are plotted in Figure 5c,d. The values of the first two tensor AB-canonical coefficients are 0.98671 and 0.9712. It is evident that tensor ABCA extracts the latent variables quite accurately.
Figure 5. (a) and (b): The scatter plots of the latent variables. (c) and (d): The scatter plots of the first two tensor AB-canonical variable pairs. It is clearly seen that the non-linear relationship is reconstructed.

9.4. Choice of Divergence Parameters

There is no universal way of selecting the divergence parameters, α and β. They generally control the trade-off between the efficiency and robustness of the procedure. Although they cover the whole two-dimensional plane, the values of the AB-divergence change very slowly for very large or very small values of the tuning parameters. Therefore, we are usually interested in choosing the parameters in the interval [0, 1]. For $\alpha = 1$ and $\beta = 1$, the AB-divergence turns out to be the $L_2$-distance between the two densities. The $L_2$-distance is regarded as a strongly robust divergence in the literature, but its robustness is achieved at some loss of efficiency (Basu et al., 1998 [13]; Scott, 2001 [31]). On the other hand, for $\alpha = 0$ and $\beta = 0$, the AB-divergence becomes the $L_2$-distance between the logarithms of the two densities, which may be regarded as non-robust. Therefore, a suitable choice of the parameters is needed to balance robustness and efficiency. In our simulation examples, α and β around (0.5, 0.5) seem to be a good choice.

10. Conclusion

We have used the AB-divergence measure to perform canonical correlation analysis. The resulting method can extract the hidden non-linear relationship between two sets of data, whereas standard CCA is designed to find only the linear relationship. Moreover, standard CCA is highly sensitive to outlying observations; by choosing suitable tuning parameters, α and β, for the AB-divergence, we can make ABCA robust against outliers. Our method is very general in the sense that it uses the AB-divergence, which is a general measure of discrepancy. Moreover, we have generalized the method to the case of tensor data, and we have also considered sparseness constraints.

Acknowledgements

The authors gratefully acknowledge the comments of the referees, which led to an improved version of the paper.

Conflict of Interest

The authors declare no conflict of interest.

References

  1. Yin, X. Canonical correlation analysis based on information theory. J. Multivar. Anal. 2004, 91, 161–176. [Google Scholar] [CrossRef]
  2. Yin, X.; Sriram, T. Common canonical variates for independent groups using information theory. Stat. Sin. 2008, 18, 335–353. [Google Scholar]
  3. Iaci, R.; Sriram, T.; Yin, X. Multivariate association and dimension reduction: A generalization of canonical correlation analysis. Biometrics 2010, 66, 1107–1118. [Google Scholar] [CrossRef] [PubMed]
  4. Iaci, R.; Sriram, T. Robust multivariate association and dimension reduction using density divergences. J. Multivar. Anal. 2013, 117, 281–295. [Google Scholar] [CrossRef]
  5. Wang, X.; Crowe, M.; Fyfe, C. Dual stream data exploration. Int. J. Data Min., Model. Manage. 2012, 4, 188–202. [Google Scholar] [CrossRef]
  6. Cichocki, A.; Cruces, S.; Amari, S.I. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 2011, 13, 134–170. [Google Scholar] [CrossRef]
  7. Lai, P.L.; Fyfe, C. Kernel and nonlinear canonical correlation analysis. Int. J. Neural Syst. 2000, 10, 365–377. [Google Scholar] [CrossRef] [PubMed]
  8. Shawe-Taylor, J.; Cristianini, N. Kernel Methods for Pattern Analysis; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  9. Breiman, L.; Friedman, J.H. Estimating optimal transformations for multiple regression and correlation. J. Am. Stat. Assoc. 1985, 80, 580–598. [Google Scholar] [CrossRef]
  10. Hotelling, H. Relations between two sets of variates. Biometrika 1936, 28, 321–377. [Google Scholar] [CrossRef]
  11. Amari, S.I. Integration of stochastic models by minimizing α-divergence. Neural Comput. 2007, 19, 2780–2796. [Google Scholar] [CrossRef] [PubMed]
  12. Mihoko, M.; Eguchi, S. Robust blind source separation by beta divergence. Neural Comput. 2002, 14, 1859–1886. [Google Scholar] [CrossRef] [PubMed]
  13. Basu, A.; Harris, I.R.; Hjort, N.L.; Jones, M. Robust and efficient estimation by minimising a density power divergence. Biometrika 1998, 85, 549–559. [Google Scholar] [CrossRef]
  14. Cichocki, A.; Zdunek, R.; Amari, S.I. Csiszár’s divergences for non-negative matrix factorization: Family of new algorithms. In Independent Component Analysis and Blind Signal Separation, Proceedings of Fifth International Conference, ICA 2004, Granada, Spain, 22–24 September 2004; Puntonet, C.G., Prieto, A., Eds.; Springer: Berlin, Heidelberg, Germany, 2006; pp. 32–39. [Google Scholar]
  15. Kompass, R. A generalized divergence measure for nonnegative matrix factorization. Neural Comput. 2007, 19, 780–791. [Google Scholar] [CrossRef] [PubMed]
  16. Févotte, C.; Bertin, N.; Durrieu, J.L. Nonnegative matrix factorization with the Itakura-Saito divergence with application to music analysis. Neural Comput. 2009, 21, 793–830. [Google Scholar] [CrossRef] [PubMed]
  17. Scott, D.W. Multivariate Density Estimation: Theory, Practice, and Visualization; Wiley: New York, NY, USA, 1992; Volume 1. [Google Scholar]
  18. Silverman, B.W. Density Estimation for Statistics and Data Analysis; Chapman & Hall/CRC: London, UK, 1986; Volume 26. [Google Scholar]
  19. Kim, J.S.; Scott, C. Robust kernel density estimation. J. Mach. Learn. Res. 2012, 13, 2529–2565. [Google Scholar]
  20. Byrd, R.H.; Hribar, M.E.; Nocedal, J. An interior point algorithm for large-scale nonlinear programming. SIAM J. Optim. 1999, 9, 877–900. [Google Scholar] [CrossRef]
  21. Byrd, R.H.; Gilbert, J.C.; Nocedal, J. A trust region method based on interior point techniques for nonlinear programming. Math. Program. 2000, 89, 149–185. [Google Scholar] [CrossRef]
  22. MATLAB code of ABCA. Available online: http://www.isical.ac.in/~abhijit_v/ABC.m (accessed on 17 July 2013).
  23. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; Chapman & Hall/CRC: New York, NY, USA, 1993; Volume 57. [Google Scholar]
  24. Davison, A.C.; Hinkley, D.V. Bootstrap Methods and Their Application; Cambridge University Press: Cambridge, UK, 1997; Volume 1. [Google Scholar]
  25. Kolda, T.G.; Bader, B.W. Tensor decompositions and applications. SIAM Rev. 2009, 51, 455–500. [Google Scholar] [CrossRef]
  26. Cichocki, A.; Zdunek, R.; Phan, A.H.; Amari, S.I. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation; Wiley: Chichester, UK, 2009. [Google Scholar]
  27. Torres, D.A.; Turnbull, D.; Barrington, L.; Lanckriet, G.R. Identifying words that are musically meaningful. In Proceedings of the 8th International Conference of Music Information Retrieval, Vienna, Austria, 23–27 September 2007; Volume 7, pp. 405–410.
  28. Witten, D.M.; Tibshirani, R.; Hastie, T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 2009, 10, 515–534. [Google Scholar] [CrossRef] [PubMed]
  29. Witten, D.M. A penalized matrix decomposition, and its applications. Ph.D. Thesis, Stanford University, Stanford, CA, USA, 2010. [Google Scholar]
  30. Allen, G.I. Sparse higher-order principal components analysis. In Proceedings of 15th International Conference on Artificial Intelligence and Statistics, Canary Islands, Spain, 20–22 April 2012; Volume 22, pp. 27–36.
  31. Scott, D.W. Parametric statistical modeling by minimum integrated square error. Technometrics 2001, 43, 274–285. [Google Scholar] [CrossRef]
