Article

An Approach to Canonical Correlation Analysis Based on Rényi’s Pseudodistances

1 Interdisciplinary Mathematics Institute, Complutense University of Madrid, 28040 Madrid, Spain
2 Probability-Statistics and Operational Research Unit, Department of Mathematics, University of Ioannina, 45110 Ioannina, Greece
* Author to whom correspondence should be addressed.
Entropy 2023, 25(5), 713; https://doi.org/10.3390/e25050713
Submission received: 30 January 2023 / Revised: 16 March 2023 / Accepted: 21 April 2023 / Published: 25 April 2023

Abstract: Canonical Correlation Analysis (CCA) infers a pairwise linear relationship between two groups of random variables, $\mathbf{X}$ and $\mathbf{Y}$. In this paper, we present a new procedure based on Rényi's pseudodistances (RP) that aims to detect both linear and non-linear relationships between the two groups. RP canonical analysis (RPCCA) finds canonical coefficient vectors, $\mathbf{a}$ and $\mathbf{b}$, by maximizing an RP-based measure. This new family includes Information Canonical Correlation Analysis (ICCA) as a particular case and extends the method to divergence measures that are inherently robust against outliers. We provide estimating techniques for RPCCA and show the consistency of the proposed estimated canonical vectors. Further, a permutation test for determining the number of significant pairs of canonical variables is described. The robustness properties of RPCCA are examined theoretically and empirically through a simulation study, concluding that RPCCA presents a competitive alternative to ICCA with an added advantage in terms of robustness against outliers and data contamination.

1. Introduction

Canonical Correlation Analysis (CCA) is a statistical technique used to identify and measure associations between two sets of variables, in the following denoted by $\mathbf{X}_{q\times 1}$ and $\mathbf{Y}_{p\times 1}$ ($q \leq p$). It is appropriate in situations where multiple regression would be used but there are multiple intercorrelated outcome variables. Hence, it allows us to summarize relationships into a smaller number of statistics while preserving the main facets of those relationships. CCA was first considered in [1] and has been widely used in the statistical literature; for example, to summarize relationships between sets of variables, to reduce the dimensionality of data, or to transform two sets of variables into a new dataset of uncorrelated variables as a preprocessing step for the multiple linear regression model. More insight about CCA can be found, e.g., in [2,3].
CCA looks for two direction vectors $\mathbf{a}, \mathbf{b}$ (canonical vectors) such that the linear combinations $U = \mathbf{a}^T\mathbf{X}$ and $V = \mathbf{b}^T\mathbf{Y}$, the so-called canonical variables, are as (linearly) correlated as possible. However, if a linear relationship does not exist between the pairs $\mathbf{a}^T\mathbf{X}$ and $\mathbf{b}^T\mathbf{Y}$, CCA may fail to detect these pairs of canonical vectors. In other words, CCA can only detect linear relations between the canonical variables, but other functional relationships may exist.
The linear restriction is a significant drawback of CCA when analyzing real data with highly non-linear relationships. For example, Ouali et al. [4] presented a real situation with non-linear relationships between variables regarding the representation of a hydrological process in the delineation of homogeneous regions. In their context, the two groups of variables under consideration were hydrological variables and meteorological and/or physiographical characteristics of watersheds, and their non-linear relationship depended essentially on the physiographic characteristics of the watersheds. Additionally, Ref. [5] presented a nice application of non-linear CCA to seasonal climate forecasting. In [6], some real-life data with complex non-linear relationships that cannot be properly captured by classical CCA are also presented. There is an extensive bibliography addressing non-linear CCA. Without wishing to cite all the existing literature on the topic, we would like to mention some interesting works on the subject: [7] (Chapter 6), [8,9,10,11] and references therein.
To shed light on this problem, let us consider the following situation described in [12]: let X = ( X 1 , X 2 ) T and Y = ( Y 1 , Y 2 ) T be a pair of random vectors such that
$$\mathbf{X} \sim N(\mathbf{0}, \mathbf{I}), \quad Y_1 = X_1^2 + Z, \quad Y_2 = Z,$$
with $Z \sim \chi_1^2$ independent of $\mathbf{X}$. In this case, $\mathrm{Cov}(\mathbf{X}, \mathbf{Y}) = \mathbf{0}_{2\times 2}$, and so the vectors $\mathbf{X}$ and $\mathbf{Y}$ are uncorrelated (there is no linear relationship between them). Consequently, classical CCA cannot detect that $Y_1$ is indeed related (although not linearly) to $X_1$, even though the variables are not independent. On the other hand, as the pair $(\mathbf{X}, \mathbf{Y})$ does not follow a normal distribution (and therefore uncorrelatedness does not imply independence), a hidden relationship may exist (and indeed it does exist!) that is not detected by CCA. Of course, under normality, rejecting any linear correlation using CCA implies independence between both variables.
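This situation is easy to reproduce by simulation. The following sketch is our own illustration (all variable names are ours): the sample cross-covariance between $\mathbf{X}$ and $\mathbf{Y}$ is essentially zero, while the quadratic dependence between $X_1$ and $Y_1$ is clearly visible.

```python
import numpy as np

# The example of [12]: X ~ N(0, I), Y1 = X1^2 + Z, Y2 = Z,
# with Z ~ chi^2_1 independent of X.
rng = np.random.default_rng(0)
n = 200_000
X = rng.standard_normal((n, 2))
Z = rng.chisquare(df=1, size=n)
Y = np.column_stack([X[:, 0] ** 2 + Z, Z])

# The population cross-covariance Cov(X, Y) is the 2x2 zero matrix,
# so classical CCA sees nothing...
cross_cov = np.cov(np.vstack([X.T, Y.T]))[:2, 2:]

# ...yet X1 and Y1 are strongly dependent: Cov(X1^2, Y1) = Var(X1^2) = 2.
quad_cov = np.cov(X[:, 0] ** 2, Y[:, 0])[0, 1]
print(np.abs(cross_cov).max(), quad_cov)
```

The first printed value is close to 0 and the second close to 2, confirming that the relationship is invisible to any linear measure of association.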
It is not surprising that CCA fails in the previous example, as CCA focuses on "linear trends", whereas the true underlying relation is quadratic. To overcome this drawback, in a pioneering paper, Yin [12] proposed the use of the Kullback–Leibler divergence and developed a new procedure called Informational Canonical Correlation Analysis (ICCA), aiming to also detect non-linear relationships between linear combinations of the components.
Let $U = \mathbf{a}^T\mathbf{X}$ and $V = \mathbf{b}^T\mathbf{Y}$ be linear combinations of $\mathbf{X}$ and $\mathbf{Y}$ defining a pair of canonical variables, with $\mathbf{a} \in \mathbb{R}^q$ and $\mathbf{b} \in \mathbb{R}^p$. We denote by $f_{UV}(u,v)$ the joint probability density function (PDF) of $(U,V)$, and by $f_U(u)$ (resp. $f_V(v)$) the marginal unidimensional PDF of $U$ (resp. $V$). From a statistical point of view, the canonical variables $U$ and $V$ are independent if their joint distribution coincides with the product of the marginal PDFs, $f_{UV}(u,v) = f_U(u) \times f_V(v)$; conversely, a strong dependence between $U$ and $V$ results in a large statistical distance between the joint PDF and the product of the marginals. A suitable divergence should then be adopted to measure the statistical closeness of the two PDFs. The Kullback–Leibler divergence is the most commonly used measure for distinguishing two distributions, and it is of great statistical importance in the field of information theory.
The Kullback–Leibler divergence between $f_{UV}(u,v)$ and $f_U(u) \times f_V(v)$ is given by
$$D_{KL}(\mathbf{a},\mathbf{b}) := D_{KL}(f_{UV}, f_U \times f_V) = \int_{\mathbb{R}^2} f_{UV}(u,v) \ln\frac{f_{UV}(u,v)}{f_U(u) f_V(v)}\, du\, dv. \tag{1}$$
The above divergence is not symmetric, so it quantifies the expected inaccuracy excess from using f U × f V as a model when the actual PDF is f U V . That is, the inaccuracy caused by assuming independence between the pair of canonical variables. Consequently, truly independent canonical variables should minimize the Kullback–Leibler divergence in Equation (1); conversely, functionally dependent canonical variables should maximize the divergence. For more details about the Kullback–Leibler divergence, see [13].
In this vein, ICCA aims to identify $q$ pairs of canonical variables $\mathbf{a}_i \in \mathbb{R}^q$ and $\mathbf{b}_i \in \mathbb{R}^p$, $i \leq q \leq p$, such that $D_{KL}(\mathbf{a}_i, \mathbf{b}_i) = \max_{\mathbf{a},\mathbf{b}} D_{KL}(\mathbf{a},\mathbf{b})$. However, the Kullback–Leibler divergence is invariant under linear transformations, and so there are infinitely many ways to define canonical vectors yielding the same objective function. For identification, we therefore constrain the canonical variables to have unit variance. Moreover, once a relationship is identified by a pair of canonical variables, we expect to exclude its effect from the subsequent canonical variables; for such a purpose, we also require that each pair of canonical variables is uncorrelated with any other pair. That is, ICCA finds $q$ linearly independent pairs of canonical variables with unit variance maximizing (in decreasing order) the corresponding Kullback–Leibler divergence. Mathematically, to compute each pair of variables, we need to solve the optimization problem $D_{KL}(\mathbf{a}_i, \mathbf{b}_i) = \max_{\mathbf{a},\mathbf{b}} D_{KL}(\mathbf{a},\mathbf{b})$ subject to $\mathbf{a}_i^T\Sigma_X\mathbf{a}_i = \mathbf{b}_i^T\Sigma_Y\mathbf{b}_i = 1$ and $\mathbf{a}_j^T\Sigma_X\mathbf{a}_i = \mathbf{b}_j^T\Sigma_Y\mathbf{b}_i = 0$ for $j = 1,\ldots,i-1$, where $\Sigma_X$ and $\Sigma_Y$ denote the variance–covariance matrices of $\mathbf{X}$ and $\mathbf{Y}$, respectively. The same constraints are applied for the RP.
Building on Yin's (2004) reinterpretation of canonical analysis, several procedures based on divergence and entropy measures have been proposed to overcome the limitations of CCA. For example, Mandal et al. [14] considered the $(\alpha,\beta)$-divergence measures defined in [15], and Iaci and Sriram [6] used the density power divergence measures defined in [16] as a measure of statistical closeness. In [17], canonical dependence based on the squared-loss mutual information was studied. Other interesting results regarding ICCA can be found in [18,19,20,21,22,23].
Despite its popularity, the Kullback–Leibler divergence association measure is quite sensitive to outlying observations, as pointed out in [24]. By outliers, we mean data that behave very differently from what is expected according to the law modeling the relation. The main purpose of this paper is to extend the ICCA procedure to a wider family of robust methods based on the RP divergence, which remains competitive with ICCA in terms of efficiency but provides a more stable estimation of the canonical vectors in the presence of contamination in the data.
The RP family, parameterized by a tuning parameter $\tau$ controlling the trade-off between robustness and efficiency, was first considered in Jones et al. [25]. Later, Broniatowski et al. [26] demonstrated that the RP is a proper divergence: it is positive for any two densities and for all values of the tuning parameter [26,27], and it is null if (and only if) both densities coincide. The theory in [26] for independent and identically distributed random variables was extended to the case of independent but not identically distributed random variables in [28]. This family of pseudodistances was termed RP because of its similarities with Rényi's divergence measures [29]. Rényi's pseudodistance has shown promising behavior in other statistical problems, providing robust minimum RP estimators with good asymptotic and robustness properties, and it includes the Kullback–Leibler divergence as a particular case at $\tau = 0$. For example, Toma and Leoni-Aubin [30] considered efficient and robust measures for general parametric models based on the RP, and Toma et al. [31] later developed a new criterion for model selection based on the RP. In [27], Castilla et al. introduced a family of Wald-type tests for the parameters of linear regression models, and these results were later extended to generalized linear regression models in [32,33]. Wald-type tests based on minimum RP estimators in bidimensional normal populations were considered in [34]. Jaenada et al. [35] introduced and studied minimum RP estimators under restricted parameter spaces, which are of great statistical interest in many practical applications such as hypothesis testing. Under the name of $\gamma$-entropy, Fujisawa and Eguchi [36] applied the RP to introduce robust estimators for general parametric families.
Motivated by the strong performance, in terms of robustness, of the minimum RP estimator across these different statistical models, we have adopted the RP divergence to extend the ICCA procedure.
The rest of the paper is organized as follows. The Rényi’s Pseudodistance Canonical Correlation Analysis (RPCCA) is introduced in Section 2, and some of its properties are studied. Next, an estimation design for computing the canonical vectors in practice using RPCCA is described in Section 3. In Section 4, the robustness of the RPCCA is theoretically established. Section 5 describes a permutation test to determine the number of significant canonical variables and thereby provide a dimension reduction method. In Section 6, a Monte Carlo simulation study is carried out to empirically evaluate the performance of the RPCCA and compare the proposed method with the ICCA in terms of estimation accuracy and robustness. An example with real data is studied in Section 6.3. Finally, some conclusions are drawn in Section 7.

2. Rényi’s Pseudodistance Canonical Correlation Analysis

Given two multidimensional random variables $\mathbf{X}$ and $\mathbf{Y}$, the RPCCA aims to identify two direction vectors $\mathbf{a}$ and $\mathbf{b}$ (the canonical vectors) such that the corresponding canonical variables $U = \mathbf{a}^T\mathbf{X}$ and $V = \mathbf{b}^T\mathbf{Y}$ are as dependent as possible. Such dependency is measured in terms of the RP between their joint distribution and the product of their marginal distributions. The RP of tuning parameter $\tau$ between the joint distribution of the bidimensional random variable $(U,V)$ and the product of the marginals, $f_U(u) \times f_V(v)$, is given for $\tau > 0$ by (cf. [26])
$$d_\tau(\mathbf{a},\mathbf{b}) = d_\tau(f_U \times f_V, f_{UV}) = \frac{1}{\tau+1}\ln\int_{\mathbb{R}^2} f_U^{\tau+1}(u) f_V^{\tau+1}(v)\,du\,dv - \frac{1}{\tau}\ln\int_{\mathbb{R}^2} f_U^{\tau}(u) f_V^{\tau}(v) f_{UV}(u,v)\,du\,dv + \frac{1}{\tau(\tau+1)}\ln\int_{\mathbb{R}^2} f_{UV}^{\tau+1}(u,v)\,du\,dv.$$
Hence, the RP measures the statistical discrepancy between the joint PDF of the canonical variables, $f_{UV}$, and the product of the marginal PDFs, $f_U \times f_V$, or, in other words, the loss in accuracy that comes with assuming independence.
For $\tau = 0$, the RP can be defined as the corresponding limit as $\tau \to 0$, yielding the Kullback–Leibler divergence:
$$d_0(\mathbf{a},\mathbf{b}) = \lim_{\tau\to 0} d_\tau(\mathbf{a},\mathbf{b}) = \lim_{\tau\to 0} d_\tau(f_U \times f_V, f_{UV}) = D_{KL}(f_{UV}, f_U \times f_V). \tag{2}$$
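To make the three-term expression concrete, the following sketch (our own illustration; all names are ours) approximates each integral on a grid for a bivariate normal pair with correlation $\rho = 0.5$: the RP is strictly positive under dependence, and letting $\tau \to 0$ recovers the Gaussian Kullback–Leibler value $-\frac{1}{2}\ln(1-\rho^2)$.

```python
import numpy as np

# Grid-based approximation of the three RP integrals for a bivariate normal
# (U, V) with standard normal marginals and correlation rho = 0.5.
rho = 0.5
u = np.linspace(-8.0, 8.0, 801)
du = u[1] - u[0]
U, V = np.meshgrid(u, u, indexing="ij")

f_u = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)                   # f_U = f_V
det = 1 - rho**2
f_uv = np.exp(-(U**2 - 2 * rho * U * V + V**2) / (2 * det)) / (2 * np.pi * np.sqrt(det))

def d_tau(tau):
    marg = (f_u**(tau + 1)).sum() * du                          # ∫ f_U^{τ+1}
    term1 = np.log(marg * marg) / (tau + 1)                     # ∬ f_U^{τ+1} f_V^{τ+1}
    cross = (np.outer(f_u, f_u)**tau * f_uv).sum() * du * du    # ∬ f_U^τ f_V^τ f_UV
    term2 = np.log(cross) / tau
    joint = (f_uv**(tau + 1)).sum() * du * du                   # ∬ f_UV^{τ+1}
    term3 = np.log(joint) / (tau * (tau + 1))
    return term1 - term2 + term3

print(d_tau(0.5))                                # positive: U and V are dependent
print(d_tau(1e-4), -0.5 * np.log(1 - rho**2))    # τ→0 recovers the KL value
```

Under normality, the value also agrees with the closed form derived later in the proof of Proposition 2.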
As discussed earlier, independent canonical variables lead to $d_\tau(\mathbf{a},\mathbf{b}) = 0$ and, conversely, strong dependency should result in large RP distances. The RPCCA procedure then aims to identify pairs of canonical vectors $\mathbf{a}_i \in \mathbb{R}^q$ and $\mathbf{b}_i \in \mathbb{R}^p$, $i \leq q \leq p$, such that
$$d_\tau(\mathbf{a}_i, \mathbf{b}_i) = \max_{\mathbf{a},\mathbf{b}} d_\tau(\mathbf{a},\mathbf{b}),$$
and, as before for identification, the canonical variables should have unit variance and be uncorrelated with any previous pair of canonical variables:
$$\mathbf{a}_i^T\Sigma_X\mathbf{a}_i = \mathbf{b}_i^T\Sigma_Y\mathbf{b}_i = 1, \quad \forall i,$$
$$\mathbf{a}_j^T\Sigma_X\mathbf{a}_i = \mathbf{b}_j^T\Sigma_Y\mathbf{b}_i = 0, \quad j = 1,\ldots,i-1,$$
where Σ X and Σ Y are the variance–covariance matrices of X and Y , respectively.
Note that, by Equation (2), the ICCA procedure presented in [12] is recovered at τ = 0 , and so the RPCCA generalizes ICCA.
Remark 1.
Given the random vectors $\mathbf{X}$ and $\mathbf{Y}$, RPCCA finds the vectors $\mathbf{a}_1, \mathbf{b}_1$ such that $\mathbf{a}_1^T\mathbf{X}$ and $\mathbf{b}_1^T\mathbf{Y}$ are maximally related, where the strength of the relation is measured via $d_\tau(\mathbf{a}_1, \mathbf{b}_1)$, as previously defined. Once these vectors $\mathbf{a}_1, \mathbf{b}_1$ are obtained, the procedure looks for a new pair of vectors $\mathbf{a}_2, \mathbf{b}_2$ such that $\mathbf{a}_1$ and $\mathbf{a}_2$ are uncorrelated (and likewise $\mathbf{b}_1, \mathbf{b}_2$) and $\mathbf{a}_2^T\mathbf{X}$ and $\mathbf{b}_2^T\mathbf{Y}$ are maximally related. Consequently,
$$d_\tau(\mathbf{a}_1, \mathbf{b}_1) \geq d_\tau(\mathbf{a}_2, \mathbf{b}_2).$$
Next, the procedure looks for $\mathbf{a}_3, \mathbf{b}_3$ uncorrelated with $\mathbf{a}_1, \mathbf{a}_2$ and $\mathbf{b}_1, \mathbf{b}_2$, respectively, and so on. Hence, it follows that
$$d_\tau(\mathbf{a}_i, \mathbf{b}_i) \geq d_\tau(\mathbf{a}_{i+1}, \mathbf{b}_{i+1}),$$
for any $i = 1,\ldots,q-1$. If $d_\tau(\mathbf{a}_i, \mathbf{b}_i) = 0$, then independence arises and the procedure stops. In practice, we will have an estimate of $d_\tau(\mathbf{a}_i, \mathbf{b}_i)$, and we will stop the procedure if this value does not exceed a certain threshold. This is applied in Section 5 to determine the number of components.
Let us consider again the example described in the Introduction, where $\mathbf{X} = (X_1, X_2)^T$ and $\mathbf{Y} = (Y_1, Y_2)^T$ satisfy
$$\mathbf{X} \sim N(\mathbf{0}, \mathbf{I}), \quad Y_1 = X_1^2 + Z, \quad Y_2 = Z, \quad Z \sim \chi_1^2.$$
The true first pair of canonical vectors is then $\mathbf{a}_1 = \mathbf{b}_1 = (1,0)^T$. Under the described setup, it follows that
$$f_{UV}(u,v) = f_{X_1,Y_1}(x,y) = \frac{1}{2\pi}\exp\left(-\frac{1}{2}y\right)\left(y - x^2\right)^{-1/2}, \quad y > x^2,\ x \in \mathbb{R},$$
and
$$f_U(u) = f_{X_1}(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right), \qquad f_V(v) = f_{Y_1}(y) = \frac{1}{2}\exp\left(-\frac{1}{2}y\right), \quad y > 0.$$
Clearly,
$$f_{UV}(u,v) \neq f_U(u) \times f_V(v),$$
and, because of the properties of the divergence,
$$d_\tau(\mathbf{a}_1, \mathbf{b}_1) > 0.$$
The last inequality holds because the RP divergence $d_\tau(\cdot,\cdot)$ reaches the value zero only if both arguments coincide, as discussed in Section 1 (see [26] for more details). In this case, RPCCA should identify a pair $\mathbf{a}_1, \mathbf{b}_1$ with a non-zero informational coefficient of canonical correlation defining the canonical variables $\mathbf{a}_1^T\mathbf{X}$ and $\mathbf{b}_1^T\mathbf{Y}$.
For practical use of RPCCA, it is interesting to note that the RP measure is invariant under invertible linear transformations. This equality does not hold for other extensions of ICCA, where only proportionality arises instead.
Proposition 1.
Consider two random variables $U$ and $V$, and take $R = cU$ and $S = eV$, where $c$ and $e$ are non-zero real numbers. (The result also holds if we consider two random vectors $\mathbf{U}, \mathbf{V}$ and take $\mathbf{R} = \mathbf{C}\mathbf{U}$ and $\mathbf{S} = \mathbf{D}\mathbf{V}$, where $\mathbf{C}$ and $\mathbf{D}$ are invertible matrices; in this case, the RP is computed with multidimensional integrals.) Then,
d τ ( f U × f V , f U V ) = d τ ( f R × f S , f R S ) .
Proof. 
By definition,
$$\begin{aligned}
d_\tau(f_R \times f_S, f_{RS}) &= \frac{1}{\tau+1}\ln\int f_R^{\tau+1}(r) f_S^{\tau+1}(s)\,dr\,ds + \frac{1}{\tau(\tau+1)}\ln\int f_{RS}^{\tau+1}(r,s)\,dr\,ds - \frac{1}{\tau}\ln\int f_R^{\tau}(r) f_S^{\tau}(s) f_{RS}(r,s)\,dr\,ds \\
&= \frac{1}{\tau+1}\ln\int f_U^{\tau+1}(u)\,\frac{1}{|c|^{\tau+1}}\, f_V^{\tau+1}(v)\,\frac{1}{|e|^{\tau+1}}\, |c||e|\,du\,dv + \frac{1}{\tau(\tau+1)}\ln\int f_{UV}^{\tau+1}(u,v)\,\frac{1}{(|c||e|)^{\tau+1}}\, |c||e|\,du\,dv - \frac{1}{\tau}\ln\int f_U^{\tau}(u)\,\frac{1}{|c|^{\tau}}\, f_V^{\tau}(v)\,\frac{1}{|e|^{\tau}}\, f_{UV}(u,v)\,\frac{1}{|c||e|}\, |c||e|\,du\,dv \\
&= \frac{1}{\tau+1}\ln\frac{1}{(|c||e|)^{\tau}}\int f_U^{\tau+1}(u) f_V^{\tau+1}(v)\,du\,dv + \frac{1}{\tau(\tau+1)}\ln\frac{1}{(|c||e|)^{\tau}}\int f_{UV}^{\tau+1}(u,v)\,du\,dv - \frac{1}{\tau}\ln\frac{1}{(|c||e|)^{\tau}}\int f_U^{\tau}(u) f_V^{\tau}(v) f_{UV}(u,v)\,du\,dv \\
&= \left(\frac{1}{\tau+1} + \frac{1}{\tau(\tau+1)} - \frac{1}{\tau}\right)\ln\frac{1}{(|c||e|)^{\tau}} + d_\tau(f_U \times f_V, f_{UV}) = d_\tau(f_U \times f_V, f_{UV}).
\end{aligned}$$
The next result establishes that RPCCA reduces to CCA in the case of normal distributions.
Proposition 2.
In the case of normal distributions, RPCCA coincides with CCA.
Proof. 
Consider normal populations, i.e., assume that the multidimensional random variables X and Y are jointly normally distributed,
$$\begin{pmatrix}\mathbf{X} \\ \mathbf{Y}\end{pmatrix} \sim N\left(\begin{pmatrix}\boldsymbol{\mu}_X \\ \boldsymbol{\mu}_Y\end{pmatrix}, \begin{pmatrix}\Sigma_X & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_Y\end{pmatrix}\right).$$
Therefore, the bidimensional random variable $(U,V) = (\mathbf{a}^T\mathbf{X}, \mathbf{b}^T\mathbf{Y})$ follows a bidimensional normal distribution whose mean vector is
$$\boldsymbol{\mu} = (\mu_1, \mu_2)^T = \left(E[\mathbf{a}^T\mathbf{X}], E[\mathbf{b}^T\mathbf{Y}]\right)^T = \left(\mathbf{a}^T\boldsymbol{\mu}_X, \mathbf{b}^T\boldsymbol{\mu}_Y\right)^T,$$
and whose variance–covariance matrix is given by
$$\begin{pmatrix}\sigma_1^2 & \sigma_1\sigma_2\rho \\ \sigma_1\sigma_2\rho & \sigma_2^2\end{pmatrix},$$
where
$$\sigma_1^2 = \mathrm{Var}(\mathbf{a}^T\mathbf{X}) = \mathbf{a}^T\Sigma_X\mathbf{a}, \quad \sigma_2^2 = \mathrm{Var}(\mathbf{b}^T\mathbf{Y}) = \mathbf{b}^T\Sigma_Y\mathbf{b} \quad \text{and} \quad \rho = \frac{\mathrm{Cov}(U,V)}{\sigma_1\sigma_2} = \frac{\mathbf{a}^T\Sigma_{XY}\mathbf{b}}{\sigma_1\sigma_2}.$$
On the other hand, the marginal densities $f_{\mu_1,\sigma_1}(u)$ and $f_{\mu_2,\sigma_2}(v)$ of $\mathbf{a}^T\mathbf{X}$ and $\mathbf{b}^T\mathbf{Y}$, respectively, are normal,
$$U \sim N(\mu_1, \sigma_1^2) \quad \text{and} \quad V \sim N(\mu_2, \sigma_2^2).$$
We first compute the RP between $f_{\mu_1,\sigma_1}(u) \times f_{\mu_2,\sigma_2}(v)$ and $f_{\mu_1,\mu_2,\sigma_1,\sigma_2,\rho}(u,v)$. Considering the results obtained in the Supplementary Materials (Appendix A) of [6], we have
$$\int_{\mathbb{R}^2} f_U(u)^{\tau+1} f_V(v)^{\tau+1}\,du\,dv = k_1^{\tau}(1+\tau)^{-1},$$
where $k_1 = (2\pi\sigma_1\sigma_2)^{-1}$, and
$$\int_{\mathbb{R}^2} f_{UV}(u,v)^{\tau+1}\,du\,dv = k_1^{\tau}(1+\tau)^{-1}(1-\rho^2)^{-\tau/2}.$$
On the other hand, it is not difficult to see that
$$\int_{\mathbb{R}^2} f_U^{\tau}(u) f_V^{\tau}(v) f_{UV}(u,v)\,du\,dv = k_1^{\tau}\left[(1+\tau(1+\rho))(1+\tau(1-\rho))\right]^{-1/2}.$$
Based on the previous quantities, we have
$$\begin{aligned}
d_\tau(\mathbf{a},\mathbf{b}) &= \frac{1}{\tau+1}\ln\left(k_1^{\tau}(1+\tau)^{-1}\right) - \frac{1}{\tau}\ln\left(k_1^{\tau}\left[(1+\tau(1+\rho))(1+\tau(1-\rho))\right]^{-1/2}\right) + \frac{1}{\tau(\tau+1)}\ln\left(k_1^{\tau}(1+\tau)^{-1}(1-\rho^2)^{-\tau/2}\right) \\
&= \ln\left(\left[(1+\tau(1+\rho))(1+\tau(1-\rho))\right]^{1/(2\tau)}(1+\tau)^{-1/\tau}(1-\rho^2)^{-1/(2(\tau+1))}\right).
\end{aligned}$$
For fixed $\tau$, the previous expression shows that $d_\tau(\mathbf{a},\mathbf{b})$ depends on $\rho$ only through $\rho^2$. Moreover, it is not difficult to show that $d_\tau(\mathbf{a},\mathbf{b})$ is an increasing function of $\rho^2$ for any $\tau$ (see Figure 1 for $\tau = 0.1, 0.3$ and $0.9$). To show this, it suffices to see that
$$f_\tau(\rho) = \left[(1+\tau(1+\rho))(1+\tau(1-\rho))\right]^{1/(2\tau)}(1+\tau)^{-1/\tau}(1-\rho^2)^{-1/(2(\tau+1))}, \quad \rho \in (-1,1),$$
is increasing in $\rho^2$, and so is its logarithmic transform. Now, note that
$$(1+\tau(1+\rho))(1+\tau(1-\rho)) = 1 + 2\tau + \tau^2(1-\rho^2).$$
So, writing $s = \rho^2$, it suffices to show that the function
$$g(s) = \left(1+2\tau+\tau^2(1-s)\right)^{1/(2\tau)}(1-s)^{-1/(2(\tau+1))}$$
is increasing in $s$. Taking derivatives with respect to $s$, we obtain
$$g'(s) = \left(1+2\tau+\tau^2(1-s)\right)^{1/(2\tau)-1}(1-s)^{-1/(2(\tau+1))-1}\left[-\frac{\tau}{2}(1-s) + \frac{1}{2(\tau+1)}\left(1+2\tau+\tau^2(1-s)\right)\right].$$
Thus, it suffices to check the non-negativity of
$$-\frac{\tau}{2}(1-s) + \frac{1}{2(\tau+1)}\left(1+2\tau+\tau^2(1-s)\right).$$
Finally, multiplying by $2(\tau+1) > 0$,
$$-\tau(\tau+1)(1-s) + 1 + 2\tau + \tau^2(1-s) = -\tau(1-s) + 1 + 2\tau = \tau\rho^2 + \tau + 1 > 0,$$
and the result holds. Thus, RPCCA is equivalent to classical CCA in the case of normal random variables. □
Under normal distributions, $d_\tau(\mathbf{a},\mathbf{b})$ is an increasing function of $\rho^2$ for any $\tau > 0$; hence, RPCCA also extends CCA, with the tuning parameter $\tau$ determining the sharpness of the distance $d_\tau(\mathbf{a},\mathbf{b})$ (or of the function $f_\tau(\cdot)$ in the proof of Proposition 2).
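The closed-form expression obtained in the proof is easy to check numerically. The sketch below (our own code, assuming unit variances so that $k_1$ cancels) verifies that $d_\tau$ increases with $\rho^2$ for the values of $\tau$ plotted in Figure 1, and that $\tau \to 0$ recovers the Gaussian Kullback–Leibler value $-\frac{1}{2}\ln(1-\rho^2)$.

```python
import numpy as np

# Closed-form d_tau(rho) from the proof of Proposition 2 (unit variances).
def d_tau_normal(rho, tau):
    prod = (1 + tau * (1 + rho)) * (1 + tau * (1 - rho))
    return (np.log(prod) / (2 * tau)
            - np.log(1 + tau) / tau
            - np.log(1 - rho**2) / (2 * (tau + 1)))

rho = np.linspace(0.0, 0.99, 100)
for tau in (0.1, 0.3, 0.9):                    # the values shown in Figure 1
    d = d_tau_normal(rho, tau)
    # d vanishes at rho = 0 (independence) and is monotone in rho on [0, 1)
    print(tau, bool(np.all(np.diff(d) > 0)))

# tau -> 0 limit vs. the Kullback-Leibler value for a bivariate normal
print(d_tau_normal(0.5, 1e-6), -0.5 * np.log(1 - 0.5**2))
```

This makes explicit the sense in which $\tau$ reshapes the same underlying dependence measure without changing its ordering in $\rho^2$.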

3. Consistency

We now focus on the practical side of the RPCCA estimation. In practice, the PDFs f U V , f U and f V are unknown; thus, they should be empirically estimated. Likewise, the RPCCA should be formulated for an empirical setup.
The RP, $d_\tau(\mathbf{a},\mathbf{b})$, can be expressed in terms of expected values as
$$d_\tau(\mathbf{a},\mathbf{b}) = \frac{1}{\tau(\tau+1)}\ln E_{f_{UV}}\left[f_{UV}^{\tau}(U,V)\right] - \frac{1}{\tau}\ln E_{f_{UV}}\left[f_U^{\tau}(U) f_V^{\tau}(V)\right] + \frac{1}{\tau+1}\ln\left(E_{f_U}\left[f_U^{\tau}(U)\right] E_{f_V}\left[f_V^{\tau}(V)\right]\right).$$
This interpretation of $d_\tau(\mathbf{a},\mathbf{b})$ makes the definition of its empirical estimator easier. Let $(\mathbf{X}_i, \mathbf{Y}_i)$, $i = 1,\ldots,n$, be a random sample of size $n$ from the multidimensional random variable $(\mathbf{X},\mathbf{Y})$. Then, an empirical estimator of $d_\tau(\mathbf{a},\mathbf{b})$ is given by
$$\hat{d}_{\tau}^{\,n}(\mathbf{a},\mathbf{b}) = \frac{1}{\tau(\tau+1)}\ln\left(\frac{1}{n}\sum_{i=1}^n \hat{f}_{UV,n}^{\,\tau}(u_i,v_i)\right) - \frac{1}{\tau}\ln\left(\frac{1}{n}\sum_{i=1}^n \hat{f}_{U,n}^{\,\tau}(u_i)\hat{f}_{V,n}^{\,\tau}(v_i)\right) + \frac{1}{\tau+1}\ln\left(\frac{1}{n}\sum_{i=1}^n \hat{f}_{U,n}^{\,\tau}(u_i)\cdot\frac{1}{n}\sum_{i=1}^n \hat{f}_{V,n}^{\,\tau}(v_i)\right). \tag{4}$$
Here, $\hat{f}_{U,n}(u)$, $\hat{f}_{V,n}(v)$ and $\hat{f}_{UV,n}(u,v)$ are kernel density estimators of $f_U(u)$, $f_V(v)$ and $f_{UV}(u,v)$, respectively, given by
$$\hat{f}_{U,n}(u) = \frac{1}{n a_{n1}}\sum_{i=1}^n K\left(\frac{u - u_i}{a_{n1}}\right), \quad u \in \mathbb{R}, \tag{6}$$
$$\hat{f}_{V,n}(v) = \frac{1}{n a_{n2}}\sum_{i=1}^n K\left(\frac{v - v_i}{a_{n2}}\right), \quad v \in \mathbb{R}, \tag{7}$$
and
$$\hat{f}_{UV,n}(u,v) = \frac{1}{n b_{n1} b_{n2}}\sum_{i=1}^n K\left(\frac{u - u_i}{b_{n1}}\right) K\left(\frac{v - v_i}{b_{n2}}\right). \tag{8}$$
For the PDF estimators, we will use the univariate Gaussian kernel with bandwidths $a_{nj} = 1.06\, n^{-0.2} s_j$ and $b_{nj} = n^{-1/6} s_j$ for $j = 1, 2$, where $s_1$ and $s_2$ are the corresponding sample standard deviations. This choice was proposed in [37] and adopted in many other extensions of ICCA, but other types of kernels could be considered instead, as long as they satisfy the conditions of Lemma 1 below. (When the distribution is known up to a parameter value, $f_\theta$, this information should be taken into account. The procedure is then the usual one in these situations: first, estimate the parameter $\theta$ by $\hat{\theta}$, and then use the distribution with the estimated parameter, $f_{\hat{\theta}}$, instead of $\hat{f}$.) Other interesting results about kernel density estimation can be found in [38,39].
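As a minimal sketch of these estimators (our own code and function names), the Gaussian kernel and the stated bandwidths can be implemented directly:

```python
import numpy as np

# Kernel density estimators with the Gaussian kernel and the bandwidths
# a_n = 1.06 n^{-0.2} s and b_n = n^{-1/6} s stated in the text.
def gaussian_kernel(t):
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def marginal_kde(sample):
    n = len(sample)
    h = 1.06 * n**-0.2 * sample.std(ddof=1)          # a_n bandwidth
    return lambda u: gaussian_kernel((u - sample[:, None]) / h).mean(axis=0) / h

def joint_kde(su, sv):
    n = len(su)
    hu = n**(-1 / 6) * su.std(ddof=1)                # b_n1 bandwidth
    hv = n**(-1 / 6) * sv.std(ddof=1)                # b_n2 bandwidth
    def f(u, v):
        ku = gaussian_kernel((u - su[:, None]) / hu)
        kv = gaussian_kernel((v - sv[:, None]) / hv)
        return (ku * kv).mean(axis=0) / (hu * hv)
    return f

# Toy check on a standard normal sample: the estimate at 0 should be close
# to the true density value 1/sqrt(2*pi) ~ 0.399.
rng = np.random.default_rng(1)
sample_u = rng.standard_normal(2000)
f_hat = marginal_kde(sample_u)
print(f_hat(np.array([0.0]))[0])
```

Plugging these estimators into Equation (4) gives the empirical RP to be maximized over the canonical vectors.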
Then, the estimated canonical vectors based on the RP with tuning parameter $\tau$ can be computed as
$$(\hat{\mathbf{a}}_n^{\tau}, \hat{\mathbf{b}}_n^{\tau}) = \arg\max_{\mathbf{a},\mathbf{b}} \hat{d}_{\tau}^{\,n}(\mathbf{a},\mathbf{b}), \quad \text{s.t.}\ (\mathbf{a}_n^{\tau})^T\hat{\Sigma}_{11}\mathbf{a}_n^{\tau} = 1 \ \text{and}\ (\mathbf{b}_n^{\tau})^T\hat{\Sigma}_{22}\mathbf{b}_n^{\tau} = 1,$$
where Σ ^ 11 and Σ ^ 22 are the empirical estimators of the variance–covariance matrices of X and Y , respectively.
We next establish the consistency of the estimated canonical vectors under some regularity conditions. That is, we will prove that the estimated canonical vectors a ^ n τ , b ^ n τ converge for large sample sizes to the true canonical vectors defining the underlying functional relationship. For such a result, it is necessary to present the following lemma whose proof can be found in [40].
Lemma 1.
Let $(\mathbf{X}_i, \mathbf{Y}_i)$, $i = 1,\ldots,n$, be i.i.d. replications of the multidimensional random variable $(\mathbf{X},\mathbf{Y})$. Consider a sequence $\{a_n\}_{n\in\mathbb{N}}$ such that $a_n > 0$ and $\lim_{n\to\infty} a_n = 0$. Assume
$$\sum_{n=1}^{\infty} e^{-\gamma n a_n^2} < \infty, \qquad \sum_{n=1}^{\infty} e^{-\gamma n a_n^4} < \infty, \qquad \forall\, \gamma > 0.$$
Consider a function $K$ of bounded variation. (Consider a function $g: \mathbb{R}^k \to \mathbb{R}$, and let $\mathcal{P}$ be the set of finite partitions of $\mathbb{R}^k$ into rectangles $p = \{[\mathbf{x}_j, \mathbf{y}_j),\ j = 1,\ldots,u_p\}$. Then, $g$ is said to be of bounded variation if $\sup_{p \in \mathcal{P}} \sum_{j=1}^{u_p} \left| \sum_{(\epsilon_1,\ldots,\epsilon_k) \in \{0,1\}^k} (-1)^{\sum_{i=1}^k \epsilon_i}\, g\left(\epsilon_1 x_{j1} + (1-\epsilon_1)y_{j1},\ldots,\epsilon_k x_{jk} + (1-\epsilon_k)y_{jk}\right) \right| < \infty$.) Suppose also that $f_U(\mathbf{a}^T\mathbf{x})$ is uniformly continuous in $\mathbf{a}$ and $\mathbf{x}$, $f_V(\mathbf{b}^T\mathbf{y})$ is uniformly continuous in $\mathbf{b}$ and $\mathbf{y}$, and $f_{UV}(\mathbf{a}^T\mathbf{x}, \mathbf{b}^T\mathbf{y})$ is uniformly continuous in $\mathbf{a}$, $\mathbf{x}$, $\mathbf{b}$ and $\mathbf{y}$. Then,
$$\sup_{\mathbf{a},\mathbf{x}} \left|\hat{f}_{U,n}(\mathbf{a}^T\mathbf{x}) - f_U(\mathbf{a}^T\mathbf{x})\right| \xrightarrow{a.s.} 0,$$
$$\sup_{\mathbf{b},\mathbf{y}} \left|\hat{f}_{V,n}(\mathbf{b}^T\mathbf{y}) - f_V(\mathbf{b}^T\mathbf{y})\right| \xrightarrow{a.s.} 0,$$
$$\sup_{\mathbf{a},\mathbf{b},\mathbf{x},\mathbf{y}} \left|\hat{f}_{UV,n}(\mathbf{a}^T\mathbf{x}, \mathbf{b}^T\mathbf{y}) - f_{UV}(\mathbf{a}^T\mathbf{x}, \mathbf{b}^T\mathbf{y})\right| \xrightarrow{a.s.} 0.$$
Note that the Gaussian kernel functions defined in Equations (6)–(8) satisfy the conditions of Lemma 1. Of course, any other choice of kernel should also satisfy these regularity conditions. Now, for any real value $b > 0$, let us define the set of indices of observations whose densities at $\mathbf{a}^T\mathbf{x}_i$ and $\mathbf{b}^T\mathbf{y}_i$ are bounded away from zero,
$$\chi_b = \left\{ i : f_{UV}^{\tau}(\mathbf{a}^T\mathbf{x}_i, \mathbf{b}^T\mathbf{y}_i) \geq b,\ f_U^{\tau}(\mathbf{a}^T\mathbf{x}_i) \geq b,\ f_V^{\tau}(\mathbf{b}^T\mathbf{y}_i) \geq b \right\},$$
and denote by n b the number of data outside this set. The next result establishes the consistency of the RPCCA.
Proposition 3.
Suppose the conditions of Lemma 1 hold. Assume $b \to 0$ such that
$$\frac{n_b}{n} \xrightarrow[n\to\infty]{P} 0,$$
and consider the estimated and true pairs of canonical vectors,
$$(\hat{\mathbf{a}}_n, \hat{\mathbf{b}}_n) = \arg\max_{\mathbf{a},\mathbf{b}} \hat{d}_{\tau}^{\,n}(\mathbf{a},\mathbf{b}) \quad \text{and} \quad (\mathbf{a}^*, \mathbf{b}^*) = \arg\max_{\mathbf{a},\mathbf{b}} d_\tau(\mathbf{a},\mathbf{b}).$$
Further, assume that the maximum ( a * , b * ) is unique. Then,
$$(\hat{\mathbf{a}}_n, \hat{\mathbf{b}}_n) \xrightarrow[n\to\infty]{P} (\mathbf{a}^*, \mathbf{b}^*).$$
Proof. 
Take sequences $\epsilon > 0$ and $b > 0$ (depending on $n$) such that $\epsilon \to 0$, $b \to 0$ and $\epsilon b^{-1} \to 0$ as $n \to \infty$.
By identification, we can assume $\hat{\mathbf{a}}_n^T\Sigma_{11}\hat{\mathbf{a}}_n = \hat{\mathbf{b}}_n^T\Sigma_{22}\hat{\mathbf{b}}_n = 1$. Let us suppose that
$$(\hat{\mathbf{a}}_n, \hat{\mathbf{b}}_n) \not\xrightarrow[n\to\infty]{P} (\mathbf{a}^*, \mathbf{b}^*).$$
Hence, there exists a subsequence of $\{(\hat{\mathbf{a}}_n, \hat{\mathbf{b}}_n)\}$ (denoted again by $(\hat{\mathbf{a}}_n, \hat{\mathbf{b}}_n)$ to avoid cumbersome notation) and a pair $(\mathbf{a}_0, \mathbf{b}_0)$ such that $\mathbf{a}_0^T\Sigma_{11}\mathbf{a}_0 = \mathbf{b}_0^T\Sigma_{22}\mathbf{b}_0 = 1$, $(\mathbf{a}_0, \mathbf{b}_0) \neq (\mathbf{a}^*, \mathbf{b}^*)$ and
$$(\hat{\mathbf{a}}_n, \hat{\mathbf{b}}_n) \to (\mathbf{a}_0, \mathbf{b}_0).$$
Now, applying Lemma 1, we know that
$$\sup \left|\hat{f}_{U,n}(\hat{\mathbf{a}}_n^T\mathbf{x}_i) - f_U(\hat{\mathbf{a}}_n^T\mathbf{x}_i)\right| \xrightarrow[n\to\infty]{a.s.} 0, \quad \sup \left|\hat{f}_{V,n}(\hat{\mathbf{b}}_n^T\mathbf{y}_i) - f_V(\hat{\mathbf{b}}_n^T\mathbf{y}_i)\right| \xrightarrow[n\to\infty]{a.s.} 0, \quad \sup \left|\hat{f}_{UV,n}(\hat{\mathbf{a}}_n^T\mathbf{x}_i, \hat{\mathbf{b}}_n^T\mathbf{y}_i) - f_{UV}(\hat{\mathbf{a}}_n^T\mathbf{x}_i, \hat{\mathbf{b}}_n^T\mathbf{y}_i)\right| \xrightarrow[n\to\infty]{a.s.} 0.$$
Thus, for $\tau > 0$,
$$\sup \left|\hat{f}_{U,n}^{\,\tau}(\hat{\mathbf{a}}_n^T\mathbf{x}_i) - f_U^{\tau}(\hat{\mathbf{a}}_n^T\mathbf{x}_i)\right| \xrightarrow[n\to\infty]{a.s.} 0, \quad \sup \left|\hat{f}_{V,n}^{\,\tau}(\hat{\mathbf{b}}_n^T\mathbf{y}_i) - f_V^{\tau}(\hat{\mathbf{b}}_n^T\mathbf{y}_i)\right| \xrightarrow[n\to\infty]{a.s.} 0, \quad \sup \left|\hat{f}_{UV,n}^{\,\tau}(\hat{\mathbf{a}}_n^T\mathbf{x}_i, \hat{\mathbf{b}}_n^T\mathbf{y}_i) - f_{UV}^{\tau}(\hat{\mathbf{a}}_n^T\mathbf{x}_i, \hat{\mathbf{b}}_n^T\mathbf{y}_i)\right| \xrightarrow[n\to\infty]{a.s.} 0.$$
Hence, for $n$ large enough,
$$\hat{f}_{U,n}^{\,\tau}(\hat{\mathbf{a}}_n^T\mathbf{x}_i) = f_U^{\tau}(\hat{\mathbf{a}}_n^T\mathbf{x}_i) + \Delta_{1i} = f_U^{\tau}(\mathbf{a}_0^T\mathbf{x}_i) + \delta_{1i},$$
$$\hat{f}_{V,n}^{\,\tau}(\hat{\mathbf{b}}_n^T\mathbf{y}_i) = f_V^{\tau}(\hat{\mathbf{b}}_n^T\mathbf{y}_i) + \Delta_{2i} = f_V^{\tau}(\mathbf{b}_0^T\mathbf{y}_i) + \delta_{2i},$$
$$\hat{f}_{UV,n}^{\,\tau}(\hat{\mathbf{a}}_n^T\mathbf{x}_i, \hat{\mathbf{b}}_n^T\mathbf{y}_i) = f_{UV}^{\tau}(\hat{\mathbf{a}}_n^T\mathbf{x}_i, \hat{\mathbf{b}}_n^T\mathbf{y}_i) + \Delta_{3i} = f_{UV}^{\tau}(\mathbf{a}_0^T\mathbf{x}_i, \mathbf{b}_0^T\mathbf{y}_i) + \delta_{3i}. \tag{10}$$
Here, $|\delta_{1i}|, |\delta_{2i}|, |\delta_{3i}| < \epsilon$. Note that $\ln\left(\frac{1}{n}\sum_{i=1}^n \hat{f}_{UV,n}^{\,\tau}(\hat{\mathbf{a}}_n^T\mathbf{x}_i, \hat{\mathbf{b}}_n^T\mathbf{y}_i)\right)$ can be written as
$$\ln\left(\frac{1}{n}\sum_{i=1}^n f_{UV}^{\tau}(\mathbf{a}_0^T\mathbf{x}_i, \mathbf{b}_0^T\mathbf{y}_i)\right) + \ln\left(\frac{\frac{1}{n}\sum_{i=1}^n \hat{f}_{UV,n}^{\,\tau}(\hat{\mathbf{a}}_n^T\mathbf{x}_i, \hat{\mathbf{b}}_n^T\mathbf{y}_i)}{\frac{1}{n}\sum_{i=1}^n f_{UV}^{\tau}(\mathbf{a}_0^T\mathbf{x}_i, \mathbf{b}_0^T\mathbf{y}_i)}\right).$$
Now, applying Equation (10), we obtain
$$\frac{\frac{1}{n}\sum_{i=1}^n \hat{f}_{UV,n}^{\,\tau}(\hat{\mathbf{a}}_n^T\mathbf{x}_i, \hat{\mathbf{b}}_n^T\mathbf{y}_i)}{\frac{1}{n}\sum_{i=1}^n f_{UV}^{\tau}(\mathbf{a}_0^T\mathbf{x}_i, \mathbf{b}_0^T\mathbf{y}_i)} = 1 + \frac{\frac{1}{n}\sum_{i=1}^n \delta_{3i}}{\frac{1}{n}\sum_{i=1}^n f_{UV}^{\tau}(\mathbf{a}_0^T\mathbf{x}_i, \mathbf{b}_0^T\mathbf{y}_i)}.$$
The same can be performed for the two other cases. As $|\delta_{3i}| < \epsilon$, it follows that
$$\left|\frac{1}{n}\sum_{i=1}^n \delta_{3i}\right| \leq \epsilon.$$
On the other hand,
$$\frac{1}{n}\sum_{i=1}^n f_{UV}^{\tau}(\mathbf{a}_0^T\mathbf{x}_i, \mathbf{b}_0^T\mathbf{y}_i) \geq \frac{1}{n}\sum_{i=1}^n I(i \in \chi_b)\, f_{UV}^{\tau}(\mathbf{a}_0^T\mathbf{x}_i, \mathbf{b}_0^T\mathbf{y}_i) \geq \frac{n - n_b}{n}\, b.$$
As $\epsilon b^{-1} \to 0$ and $n_b/n \to 0$, we conclude that
$$\frac{\frac{1}{n}\sum_{i=1}^n \delta_{3i}}{\frac{1}{n}\sum_{i=1}^n f_{UV}^{\tau}(\mathbf{a}_0^T\mathbf{x}_i, \mathbf{b}_0^T\mathbf{y}_i)} \to 0.$$
The same can be performed for the two other cases. Hence,
$$\hat{d}_{\tau}^{\,n}(\hat{\mathbf{a}}_n, \hat{\mathbf{b}}_n) = \frac{1}{\tau+1}\ln\left(\frac{1}{n^2}\sum_{i=1}^n f_U^{\tau}(\mathbf{a}_0^T\mathbf{x}_i)\sum_{i=1}^n f_V^{\tau}(\mathbf{b}_0^T\mathbf{y}_i)\right) + \frac{1}{\tau(\tau+1)}\ln\left(\frac{1}{n}\sum_{i=1}^n f_{UV}^{\tau}(\mathbf{a}_0^T\mathbf{x}_i, \mathbf{b}_0^T\mathbf{y}_i)\right) - \frac{1}{\tau}\ln\left(\frac{1}{n}\sum_{i=1}^n f_U^{\tau}(\mathbf{a}_0^T\mathbf{x}_i) f_V^{\tau}(\mathbf{b}_0^T\mathbf{y}_i)\right) + o(1) = \bar{d}_{\tau}^{\,n}(\mathbf{a}_0, \mathbf{b}_0) + o(1),$$
with
$$\bar{d}_{\tau}^{\,n}(\mathbf{a}_0, \mathbf{b}_0) = \frac{1}{\tau+1}\ln\left(\frac{1}{n^2}\sum_{i=1}^n f_U^{\tau}(\mathbf{a}_0^T\mathbf{x}_i)\sum_{i=1}^n f_V^{\tau}(\mathbf{b}_0^T\mathbf{y}_i)\right) + \frac{1}{\tau(\tau+1)}\ln\left(\frac{1}{n}\sum_{i=1}^n f_{UV}^{\tau}(\mathbf{a}_0^T\mathbf{x}_i, \mathbf{b}_0^T\mathbf{y}_i)\right) - \frac{1}{\tau}\ln\left(\frac{1}{n}\sum_{i=1}^n f_U^{\tau}(\mathbf{a}_0^T\mathbf{x}_i) f_V^{\tau}(\mathbf{b}_0^T\mathbf{y}_i)\right).$$
Note that $\bar{d}_{\tau}^{\,n}(\mathbf{a}_0, \mathbf{b}_0) - d_{\tau}(\mathbf{a}_0, \mathbf{b}_0)$ is given by
$$\frac{1}{\tau(\tau+1)}\left[\ln\left(\frac{1}{n}\sum_{i=1}^n f_{UV}^{\tau}(\mathbf{a}_0^T\mathbf{x}_i, \mathbf{b}_0^T\mathbf{y}_i)\right) - \ln E_{f_{UV}}\left[f_{UV}^{\tau}(\mathbf{a}_0^T\mathbf{X}, \mathbf{b}_0^T\mathbf{Y})\right]\right] + \frac{1}{\tau+1}\left[\ln\left(\frac{1}{n^2}\sum_{i=1}^n f_U^{\tau}(\mathbf{a}_0^T\mathbf{x}_i)\sum_{i=1}^n f_V^{\tau}(\mathbf{b}_0^T\mathbf{y}_i)\right) - \ln\left(E_{f_U}\left[f_U^{\tau}(\mathbf{a}_0^T\mathbf{X})\right] E_{f_V}\left[f_V^{\tau}(\mathbf{b}_0^T\mathbf{Y})\right]\right)\right] - \frac{1}{\tau}\left[\ln\left(\frac{1}{n}\sum_{i=1}^n f_U^{\tau}(\mathbf{a}_0^T\mathbf{x}_i) f_V^{\tau}(\mathbf{b}_0^T\mathbf{y}_i)\right) - \ln E_{f_{UV}}\left[f_U^{\tau}(\mathbf{a}_0^T\mathbf{X}) f_V^{\tau}(\mathbf{b}_0^T\mathbf{Y})\right]\right].$$
As $\ln$ is continuous, applying the Strong Law of Large Numbers, it follows that
$$\ln\left(\frac{1}{n}\sum_{i=1}^n f_{UV}^{\tau}(\mathbf{a}_0^T\mathbf{x}_i, \mathbf{b}_0^T\mathbf{y}_i)\right) \xrightarrow[n\to\infty]{a.s.} \ln E_{f_{UV}}\left[f_{UV}^{\tau}(\mathbf{a}_0^T\mathbf{X}, \mathbf{b}_0^T\mathbf{Y})\right].$$
We can proceed similarly for the other two terms. We conclude that $\bar{d}_{\tau}^{\,n}(\mathbf{a}_0, \mathbf{b}_0) \xrightarrow{P} d_{\tau}(\mathbf{a}_0, \mathbf{b}_0)$, and hence $\hat{d}_{\tau}^{\,n}(\hat{\mathbf{a}}_n, \hat{\mathbf{b}}_n) \xrightarrow{P} d_{\tau}(\mathbf{a}_0, \mathbf{b}_0)$. On the other hand, $\hat{d}_{\tau}^{\,n}(\hat{\mathbf{a}}_n, \hat{\mathbf{b}}_n) \geq \hat{d}_{\tau}^{\,n}(\mathbf{a}^*, \mathbf{b}^*)$ because $(\hat{\mathbf{a}}_n, \hat{\mathbf{b}}_n)$ is the optimum by definition.
Taking limits,
$$d_{\tau}(\mathbf{a}_0, \mathbf{b}_0) = \lim_{n\to\infty} \hat{d}_{\tau}^{\,n}(\hat{\mathbf{a}}_n, \hat{\mathbf{b}}_n) \geq \lim_{n\to\infty} \hat{d}_{\tau}^{\,n}(\mathbf{a}^*, \mathbf{b}^*) = d_{\tau}(\mathbf{a}^*, \mathbf{b}^*).$$
However, $d_{\tau}(\mathbf{a}^*, \mathbf{b}^*) \geq d_{\tau}(\mathbf{a}_0, \mathbf{b}_0)$ because $(\mathbf{a}^*, \mathbf{b}^*)$ is the optimum. Hence, as $(\mathbf{a}^*, \mathbf{b}^*)$ is the unique maximum, we conclude that
( a * , b * ) = ( a 0 , b 0 ) ,
a contradiction. □

4. Robustness

To motivate the inherent robustness property of the RPCCA procedure, we examine the behavior of the estimated divergence in Equation (4) for small values of the tuning parameter. The presented heuristic argument was first discussed in [6] for the density power divergence generalization of ICCA. Consider the estimated RP
$$\hat{d}_{\tau}^{\,n}(\mathbf{a},\mathbf{b}) := \frac{1}{\tau+1}\ln\left(\frac{1}{n}\sum_{i=1}^n \hat{f}_{U,n}^{\,\tau}(u_i)\cdot\frac{1}{n}\sum_{i=1}^n \hat{f}_{V,n}^{\,\tau}(v_i)\right) - \frac{1}{\tau}\ln\left(\frac{1}{n}\sum_{i=1}^n \hat{f}_{U,n}^{\,\tau}(u_i)\hat{f}_{V,n}^{\,\tau}(v_i)\right) + \frac{1}{\tau(\tau+1)}\ln\left(\frac{1}{n}\sum_{i=1}^n \hat{f}_{UV,n}^{\,\tau}(u_i,v_i)\right),$$
and let the tuning parameter $\tau$ tend to 0. Taking limits in the estimated divergence defined in Equation (4), the first term vanishes, and therefore
$$\hat{d}_{\tau}^{\,n}(\mathbf{a},\mathbf{b}) \approx -\frac{1}{\tau}\ln\left(\frac{1}{n}\sum_{i=1}^n \hat{f}_{U,n}^{\,\tau}(u_i)\hat{f}_{V,n}^{\,\tau}(v_i)\right) + \frac{1}{\tau(\tau+1)}\ln\left(\frac{1}{n}\sum_{i=1}^n \hat{f}_{UV,n}^{\,\tau}(u_i,v_i)\right).$$
We first study the limiting behavior of the first term,
$$l_\tau := -\frac{1}{\tau}\ln\left(\frac{1}{n}\sum_{i=1}^n \hat{f}_{U,n}^{\,\tau}(u_i)\hat{f}_{V,n}^{\,\tau}(v_i)\right).$$
For $\tau \to 0$, this term is an indeterminate form (0/0). Applying L'Hôpital's rule, we obtain
$$l_\tau \approx -\frac{\frac{1}{n}\sum_{i=1}^n \hat{f}_{U,n}^{\,\tau}(u_i)\hat{f}_{V,n}^{\,\tau}(v_i)\left[\ln \hat{f}_{U,n}(u_i) + \ln \hat{f}_{V,n}(v_i)\right]}{\frac{1}{n}\sum_{i=1}^n \hat{f}_{U,n}^{\,\tau}(u_i)\hat{f}_{V,n}^{\,\tau}(v_i)}.$$
Now, the denominator tends to 1 when $\tau \to 0$, so that
$$l_\tau \approx -\frac{1}{n}\sum_{i=1}^n \hat{f}_{U,n}^{\,\tau}(u_i)\hat{f}_{V,n}^{\,\tau}(v_i)\left[\ln \hat{f}_{U,n}(u_i) + \ln \hat{f}_{V,n}(v_i)\right].$$
Similarly, consider
m τ : = 1 τ ( τ + 1 ) ln 1 n i = 1 n f U V n ^ τ ( u i , v i )
and consider its L’Hôpital approximation given by
m τ 1 2 τ + 1 1 n i = 1 n f U V n ^ τ ( u i , v i ) ln f U V n ^ ( u i , v i ) 1 n i = 1 n f U V n ^ τ ( u i , v i ) 1 n i = 1 n f U V n ^ τ ( u i , v i ) ln f U V n ^ ( u i , v i ) .
Consequently,
\[
\hat{d}_{\tau n}(a, b) \approx \frac{1}{n}\left[\sum_{i=1}^{n} \hat{f}_{UV_n}^{\tau}(u_i, v_i) \ln \hat{f}_{UV_n}(u_i, v_i) - \sum_{i=1}^{n} \hat{f}_{U_n}^{\tau}(u_i)\, \hat{f}_{V_n}^{\tau}(v_i) \ln\left(\hat{f}_{U_n}(u_i)\, \hat{f}_{V_n}(v_i)\right)\right].
\]
Note that this approximation is valid for $\tau$ close to 0, but, for $\tau = 0$,
\[
d_0(a, b) = \lim_{\tau \to 0} \tilde{d}_\tau(a, b) = \frac{1}{n}\sum_{i=1}^{n} \ln \frac{\hat{f}_{UV_n}(u_i, v_i)}{\hat{f}_{U_n}(u_i)\, \hat{f}_{V_n}(v_i)} = \frac{1}{n}\left[\sum_{i=1}^{n} \ln \hat{f}_{UV_n}(u_i, v_i) - \sum_{i=1}^{n} \ln\left(\hat{f}_{U_n}(u_i)\, \hat{f}_{V_n}(v_i)\right)\right].
\]
This implies that $\hat{d}_{\tau n}(f_U \times f_V, f_{UV})$ can be seen as a weighted version of the empirical Kullback–Leibler divergence, with weights depending on $\hat{f}_{UV_n}^{\tau}(u_i, v_i)$, $\hat{f}_{U_n}^{\tau}(u_i)$ and $\hat{f}_{V_n}^{\tau}(v_i)$. Therefore, if the observations $x_i$, $y_i$, or both are outliers, the corresponding density estimates are small, so these observations receive less weight in the estimated distance than the rest of the data, making Rényi's pseudodistance more robust to outliers than the Kullback–Leibler divergence.
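To make the weighting effect concrete, the following minimal Python sketch (our own illustration, not the authors' released code; all function and variable names are ours) plugs Gaussian kernel density estimates into the estimated RP of Equation (4) and checks that an outlying pair receives a far-below-average weight $\hat{f}_{UV_n}^{\tau}(u_i, v_i)$:

```python
import numpy as np
from scipy.stats import gaussian_kde

def estimated_rp(u, v, tau):
    """Empirical RP between the product of the marginals and the joint
    density, with Gaussian kernel density estimates plugged in."""
    fu = gaussian_kde(u)(u)                                   # marginal of U at u_i
    fv = gaussian_kde(v)(v)                                   # marginal of V at v_i
    fuv = gaussian_kde(np.vstack([u, v]))(np.vstack([u, v]))  # joint at (u_i, v_i)
    t1 = np.log(np.mean(fu**tau) * np.mean(fv**tau)) / (tau + 1)
    t2 = -np.log(np.mean(fu**tau * fv**tau)) / tau
    t3 = np.log(np.mean(fuv**tau)) / (tau * (tau + 1))
    return t1 + t2 + t3

rng = np.random.default_rng(0)
u = rng.normal(size=200)
v = u + 0.3 * rng.normal(size=200)     # strongly dependent pair
u[0], v[0] = 8.0, -8.0                 # plant one gross outlier

tau = 0.3
rp = estimated_rp(u, v, tau)
# The outlier's joint-density value is tiny, so its weight f_UV^tau
# contributes far less than an average observation:
fuv = gaussian_kde(np.vstack([u, v]))(np.vstack([u, v]))
weights = fuv**tau / np.sum(fuv**tau)
print(weights[0] < 1.0 / len(u))       # True: the outlier is downweighted
```

The same computation with $\tau = 0$ would weight every observation equally, which is precisely why the empirical Kullback–Leibler divergence is more sensitive to contamination.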

5. Testing to Determine the Number of Pairs

In this section, a dimension reduction algorithm is described for determining the number of significant pairs of canonical vectors: the non-parametric sequential test [41,42].
In the classical approach to CCA, the maximum number of pairs $(a_i, b_i)$ is determined by the smallest index $j$ such that $(a_j, b_j)$ is the first pair satisfying $\rho(a_j^T X, b_j^T Y) = 0$; that is, the CCA should run until the best estimated pair exhibits linear independence. This stopping rule extends naturally to the RPCCA formulation by replicating the CCA dimension reduction algorithm with the RP divergence as the measure of dependence.
Let us denote by $d_\tau^i$ the maximum value achieved at the $i$-th iteration,
\[
d_\tau^i = \max_{a_i, b_i} d_\tau(a_i, b_i), \quad i = 1, \ldots, l = \min(q, p),
\]
subject to $a_i^T \Sigma_{11} a_i = b_i^T \Sigma_{22} b_i = 1$ and $a_j^T \Sigma_{11} a_i = b_j^T \Sigma_{22} b_i = 0$ for $j < i$. The sequence of maxima is decreasing and bounded below by $0$, the value indicating independence between the estimated canonical variables: $d_\tau^1 \geq d_\tau^2 \geq \cdots \geq d_\tau^l \geq 0$. Then, a stopping criterion for the maximum number of canonical correlations is naturally determined by the testing problem
\[
H_0: d_\tau^i = 0 \quad \text{vs.} \quad H_1: d_\tau^i > 0, \qquad i = 1, \ldots, l.
\]
If $H_0$ is not rejected, then all canonical variables from the $i$-th onward are not significantly related. Otherwise, the relation is significant, and the maximum number of significant canonical correlations is at least $i$. The exact sampling distribution of $d_\tau^i$ is difficult to obtain, but a non-parametric permutation test can be applied, as proposed in [24], to estimate the p-value of the test. Let us explain this procedure in some detail. Suppose there is a relationship between $a_i^T X$ and $b_i^T Y$ for some vectors $a_i, b_i$, i.e., $H_0$ does not hold. This means that there exists a function $f$ such that
\[
f(a_i^T X) \approx b_i^T Y.
\]
Our procedure will estimate the vectors $a_i$ and $b_i$, returning estimates $\hat{a}_i$ and $\hat{b}_i$ close to them. Consequently, we expect that, for the sample $(X_1, Y_1), \ldots, (X_n, Y_n)$, we will obtain
\[
f(\hat{a}_i^T X_j) \approx \hat{b}_i^T Y_j, \quad j = 1, \ldots, n.
\]
This translates into a large value of $d_{\tau}(\hat{a}_i^T X, \hat{b}_i^T Y)$ and, consequently, of the corresponding estimate $\hat{d}_{\tau n}(\hat{a}_i^T X, \hat{b}_i^T Y)$. Now, if we permute the data corresponding to $X$ while maintaining the order of the data corresponding to $Y$, any possible relationship is destroyed: the data corresponding to $X_i$ no longer correspond to individual $i$ in the sample, so they have nothing to do with $Y_i$. In other words, if we denote the reordered sample for $(X_1, \ldots, X_n)$ by $(X_1^*, \ldots, X_n^*)$, it follows that $\hat{d}_{\tau n}(c^T X^*, d^T Y) \approx 0$ for any $c, d$, indicating independence. Consequently, when the procedure looks for vectors $\hat{a}_i^*, \hat{b}_i^*$ such that
\[
\hat{d}_{\tau n}(\hat{a}_i^{*T} X^*, \hat{b}_i^{*T} Y) = \max_{a, b} \hat{d}_{\tau n}(a^T X^*, b^T Y),
\]
these vectors are not expected to model a strong relation (because none exists), so we expect $\hat{d}_{\tau n}(\hat{a}_i^{*T} X^*, \hat{b}_i^{*T} Y) \approx 0$. Hence, if $H_0$ does not hold and a relationship between the canonical variables exists, we expect that, for (almost) any permutation,
\[
\hat{d}_{\tau n}(\hat{a}_i^T X, \hat{b}_i^T Y) > \hat{d}_{\tau n}(\hat{a}_i^{*T} X^*, \hat{b}_i^{*T} Y).
\]
On the other hand, if $H_0$ holds, then $c^T X$ and $d^T Y$ are independent for any $c, d$. Consequently, for the best possible estimated vectors $\hat{a}_i, \hat{b}_i$, we will obtain
\[
\hat{d}_{\tau n}(\hat{a}_i^T X, \hat{b}_i^T Y) \approx 0.
\]
When considering a permutation of the values corresponding to $X$, independence will arise again, and hence, in this case, we expect
\[
\hat{d}_{\tau n}(\hat{a}_i^T X, \hat{b}_i^T Y) \approx \hat{d}_{\tau n}(\hat{a}_i^{*T} X^*, \hat{b}_i^{*T} Y).
\]
Of course, the number of possible permutations is $n!$, which is computationally infeasible for large $n$. Hence, we consider only a subset of randomly chosen permutations. Then, if $\hat{d}_\tau^{\,i,w}$ denotes the value of the index corresponding to the $w$-th randomly permuted sample, the estimated p-value of the test is given by
\[
\frac{1}{R} \sum_{w=1}^{R} I\left(\hat{d}_\tau^{\,i,w} > \hat{d}_\tau^{\,i}\right),
\]
where $R$ denotes the number of permutations considered. Yin [12] used $R = 1000$ for a permutation test for ICCA. If the p-value is smaller than the chosen significance level, the null hypothesis $d_\tau^i = 0$ is rejected, implying a significant relationship for the $i$-th pair of canonical variables, and the process is repeated for $i + 1$. Conversely, if the null hypothesis is not rejected, we assume that the canonical variables are independent and conclude that only the first $i - 1$ pairs of estimated canonical variables exhibit significant relationships. More details about this dimension reduction method can be found in [24].
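The permutation scheme above can be sketched in a few lines of Python. This is our own illustrative code, not the paper's implementation: a simple squared-correlation statistic stands in for the maximized estimated RP, and all names are hypothetical.

```python
import numpy as np

def rp_statistic(X, Y):
    # Placeholder dependence statistic (squared correlation of the first
    # coordinates). In RPCCA this would be the maximized estimated RP,
    # max_{a,b} d_tau_n(a^T X, b^T Y).
    return np.corrcoef(X[:, 0], Y[:, 0])[0, 1] ** 2

def permutation_pvalue(X, Y, stat=rp_statistic, R=1000, seed=0):
    """Estimate the p-value of H0: d_tau^i = 0 by permuting the rows of X
    while keeping Y fixed, which destroys any X-Y pairing."""
    rng = np.random.default_rng(seed)
    observed = stat(X, Y)
    exceed = sum(stat(X[rng.permutation(len(X))], Y) > observed
                 for _ in range(R))
    return exceed / R

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
Y = np.hstack([X[:, :1] + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])
pval = permutation_pvalue(X, Y, R=200)
print(pval)   # 0.0: no permuted sample beats the observed statistic
```

Under $H_0$ the observed statistic behaves like one more draw from the permutation distribution, so the estimated p-value is approximately uniform; under the alternative it concentrates near zero, as in this run.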

6. Simulation Study

6.1. Computational Methods

Consider $X = (x_1, \ldots, x_n)$ and $Y = (y_1, \ldots, y_n)$ as $p \times n$ and $q \times n$ matrices containing $n$ observations of the random vectors $X$ and $Y$, respectively. The estimation of the $i$-th pair of canonical vectors $(\hat{a}_i^\tau, \hat{b}_i^\tau)$ based on the RP with tuning parameter $\tau$ is computed through the constrained maximization problem
\[
(\hat{a}_i^{\tau}, \hat{b}_i^{\tau}) = \arg\max_{a_i, b_i} \hat{d}_{\tau}(a_i, b_i), \quad \text{s.t.} \quad (a_i^{\tau})^T \hat{\Sigma}_X a_i^{\tau} = 1 \ \text{and} \ (b_i^{\tau})^T \hat{\Sigma}_Y b_i^{\tau} = 1, \quad (a_i^{\tau})^T \hat{\Sigma}_X a_j^{\tau} = 0 \ \text{and} \ (b_i^{\tau})^T \hat{\Sigma}_Y b_j^{\tau} = 0, \quad j = 1, \ldots, i-1,
\]
where Σ ^ X and Σ ^ Y are the empirical estimators of the variance–covariance matrices of the multidimensional random variables X and Y , respectively.
The optimization problem constraints can be simplified by scaling the sample matrices to have zero mean and identity covariance as follows:
\[
\tilde{X} = \hat{\Sigma}_X^{-1/2}(X - \bar{X}), \qquad \tilde{Y} = \hat{\Sigma}_Y^{-1/2}(Y - \bar{Y}),
\]
where X ¯ and Y ¯ denote the corresponding sample mean vectors. From Proposition 1, the RPCCA is invariant under such linear transformations, and, consequently, the problem constraints are transformed into
\[
(a_i^{\tau})^T a_i^{\tau} = 1 \ \text{and} \ (b_i^{\tau})^T b_i^{\tau} = 1, \qquad (a_i^{\tau})^T a_j^{\tau} = 0 \ \text{and} \ (b_i^{\tau})^T b_j^{\tau} = 0, \quad j = 1, \ldots, i-1.
\]
The empirical covariance matrices may be degenerate, resulting in non-invertible matrices. In those cases, we can skip the scaling transformation and apply the estimation algorithm under the original constraints.
From the transformed canonical vectors, $\tilde{a}_i$ and $\tilde{b}_i$, the estimated canonical vectors in the original space are easily recovered as $\hat{a}_i = \hat{\Sigma}_X^{-1/2} \tilde{a}_i$ and $\hat{b}_i = \hat{\Sigma}_Y^{-1/2} \tilde{b}_i$.
Here, the constrained optimization is carried out iteratively using the non-linear constrained optimizer of the scipy.optimize package in Python, which implements a Sequential Quadratic Programming (SQP) method. The source code for the implementation is publicly available at https://github.com/MariaJaenada/Robust-Canonical-Correlations (accessed on 29 January 2023).
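The overall shape of such a constrained fit can be sketched with SciPy's SLSQP routine. This is a simplified stand-in, not the repository code: a squared-correlation objective replaces the KDE-based estimated RP, only the first pair (unit-norm constraints, no orthogonality to earlier pairs) is fitted, and all names are our own.

```python
import numpy as np
from scipy.optimize import minimize

def neg_objective(theta, X, Y, p):
    # Stand-in for -d_tau_n(a^T X, b^T Y): negative squared correlation
    # of the two projections (minimize => maximize dependence).
    a, b = theta[:p], theta[p:]
    return -np.corrcoef(X @ a, Y @ b)[0, 1] ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = np.hstack([(X @ np.array([1.0, -1.0, 0.0]))[:, None]
               + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])

# Whiten both samples so the constraints reduce to plain unit norms
# (using the invariance of Proposition 1): cov = L L^T => Z (L^{-1})^T.
Xt = (X - X.mean(0)) @ np.linalg.inv(np.linalg.cholesky(np.cov(X.T)).T)
Yt = (Y - Y.mean(0)) @ np.linalg.inv(np.linalg.cholesky(np.cov(Y.T)).T)
p, q = Xt.shape[1], Yt.shape[1]

cons = [{"type": "eq", "fun": lambda t: t[:p] @ t[:p] - 1.0},  # a'a = 1
        {"type": "eq", "fun": lambda t: t[p:] @ t[p:] - 1.0}]  # b'b = 1
res = minimize(neg_objective, x0=np.ones(p + q) / np.sqrt(p + q),
               args=(Xt, Yt, p), method="SLSQP", constraints=cons)
a_hat, b_hat = res.x[:p], res.x[p:]
print(abs(a_hat @ a_hat - 1.0) < 1e-4)   # unit-norm constraint holds
```

Subsequent pairs would add further equality constraints of the form $a_i^T a_j = 0$ and $b_i^T b_j = 0$ against the already estimated vectors, exactly as in the transformed problem above.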

6.2. Monte Carlo Simulation

We empirically examine the robustness of the RPCCA method through a Monte Carlo simulation. We consider a pair of random vectors, X = ( X 1 , , X 8 ) and Y = ( Y 1 , Y 2 , Y 3 ) , whose components satisfy a linear and a non-linear relationship of the form:
\[
Y_1 = (2X_1 + X_2 + X_3)^2 \quad \text{and} \quad Y_2 = X_2 - X_3.
\]
All remaining variables are generated independently: $X_1$, $X_2$, $X_6$, $X_7$ and $X_8$ are standard normal variables; $X_3$ follows a chi-square distribution with 7 degrees of freedom; $X_4$ follows a Student's $t$ distribution with 5 degrees of freedom; and $X_5$ follows a Fisher–Snedecor distribution with 3 and 12 degrees of freedom. Finally, $Y_3$ follows a Student's $t$ distribution with 9 degrees of freedom.
The true underlying canonical vectors are then $a_1 = (0, 1, -1, 0, 0, 0, 0, 0)$, $b_1 = (0, 1, 0)$ and $a_2 = (2, 1, 1, 0, 0, 0, 0, 0)$, $b_2 = (1, 0, 0)$. Note that they are orthogonal, and so are the related variables $Y_1$ and $Y_2$. Although the procedure computes unit-norm vectors, we describe the canonical vectors with integer coefficients, as they are easier to interpret. We name the first canonical vector $a_1$ because we empirically observed that the linear relationship is captured first.
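The simulation design above can be reproduced with a short script. This is our own sketch (not the authors' code): we read Equation (14) with $Y_2 = X_2 - X_3$, the sign that makes the stated canonical vectors orthogonal, and all function names are hypothetical.

```python
import numpy as np
from scipy import stats

def simulate_xy(n=100, seed=0):
    """Generate the (X, Y) sample: 8 X-components with the stated
    marginals, Y built from the linear and non-linear relationships."""
    rng = np.random.default_rng(seed)
    X = np.empty((n, 8))
    for j in (0, 1, 5, 6, 7):                          # X1, X2, X6, X7, X8
        X[:, j] = rng.normal(size=n)                   # standard normal
    X[:, 2] = stats.chi2(7).rvs(n, random_state=rng)   # X3 ~ chi2(7)
    X[:, 3] = stats.t(5).rvs(n, random_state=rng)      # X4 ~ t(5)
    X[:, 4] = stats.f(3, 12).rvs(n, random_state=rng)  # X5 ~ F(3, 12)
    Y = np.empty((n, 3))
    Y[:, 0] = (2 * X[:, 0] + X[:, 1] + X[:, 2]) ** 2   # non-linear relation
    Y[:, 1] = X[:, 1] - X[:, 2]                        # linear relation
    Y[:, 2] = stats.t(9).rvs(n, random_state=rng)      # Y3 ~ t(9)
    return X, Y

def contaminate(Y, eps=0.05, seed=1):
    """Swap Y1 and Y2 for a random eps-fraction of the observations."""
    rng = np.random.default_rng(seed)
    Yc = Y.copy()
    idx = rng.choice(len(Y), size=int(eps * len(Y)), replace=False)
    Yc[idx, 0], Yc[idx, 1] = Y[idx, 1], Y[idx, 0]
    return Yc

X, Y = simulate_xy()
Yc = contaminate(Y, eps=0.10)
print((Y != Yc).any(axis=1).sum())   # 10 contaminated rows
```

The `contaminate` helper implements the switching mechanism described next: exchanging $Y_1$ and $Y_2$ for a random $\varepsilon\%$ of the rows.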
We generate a random sample of the pair $X$ and $Y$ of size $n = 100$ and estimate the pairs of canonical vectors $\hat{a}_i$ and $\hat{b}_i$, $i = 1, 2$, such that the random variables $U_i = a_i^T X$ and $V_i = b_i^T Y$ are functionally interrelated. To examine the performance of the RPCCA method under contamination, we randomly switch the functional relationships in Equation (14) for $\varepsilon\%$ of the observations, with $\varepsilon = 5, 10, 15$ and $20$ denoting the contamination proportion. That is, for a random $\varepsilon\%$ of the observations, the values of $Y_1$ and $Y_2$ are exchanged, generating orthogonal outliers, since the functions defining $Y_1$ and $Y_2$ are orthogonal to each other. This contamination therefore degrades both relationships simultaneously, in orthogonal directions. We repeat the simulations over $R = 500$ replications and report averages of the following performance measures. The accuracy of the estimates is quantified by the absolute correlations between the estimated and true canonical variables, $|\rho(a_i, \hat{a}_i)| = |\rho(a_i^T X, \hat{a}_i^T X)|$ and $|\rho(b_i, \hat{b}_i)| = |\rho(b_i^T Y, \hat{b}_i^T Y)|$. Additionally, to evaluate the robustness of the method, we compute the $L_2$-norm between the canonical vectors fitted on uncontaminated and contaminated data, $\hat{a}$ and $\hat{a}_c$,
\[
L_2(\hat{a}, \hat{a}_c) = \|\hat{a} - \hat{a}_c\|_2,
\]
as well as the norm of the projection of $\hat{a}_c$ onto the orthogonal complement of the subspace spanned by the uncontaminated estimate, $\hat{a}$,
\[
P_2(\hat{a}, \hat{a}_c) = \|(I - \hat{a}\hat{a}^T)\hat{a}_c\|_2.
\]
The smaller the distances $L_2(\hat{a}, \hat{a}_c)$ and $P_2(\hat{a}, \hat{a}_c)$, the more stable the estimate, implying that the estimates are not greatly affected by the contamination and hence that the corresponding method is more robust. Summarizing, the correlations $\rho(\cdot, \cdot)$ between true and estimated canonical variables measure the accuracy of the method, whereas the distances $L_2$ and $P_2$ between canonical vectors estimated on pure and on contaminated data measure its robustness.
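These two stability measures are straightforward to compute. The following helper functions are our own illustration (hypothetical names) mirroring the definitions of $L_2$ and $P_2$ above, assuming a unit-norm uncontaminated estimate:

```python
import numpy as np

def l2_distance(a_hat, a_hat_c):
    """L2-norm between estimates fitted on pure and contaminated data."""
    return np.linalg.norm(a_hat - a_hat_c)

def p2_distance(a_hat, a_hat_c):
    """Norm of the projection of a_hat_c onto the orthogonal complement
    of the (unit-norm) uncontaminated estimate a_hat."""
    P = np.eye(len(a_hat)) - np.outer(a_hat, a_hat)
    return np.linalg.norm(P @ a_hat_c)

a = np.array([0.0, 1.0, 0.0])     # uncontaminated estimate (unit norm)
ac = np.array([0.1, 0.99, 0.1])   # slightly perturbed estimate
print(round(l2_distance(a, ac), 4))   # 0.1418
print(round(p2_distance(a, ac), 4))   # 0.1414
```

Note that $P_2$ ignores the component of $\hat{a}_c$ along $\hat{a}$, so it measures only the change in direction, whereas $L_2$ also captures changes in scale.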
Table 1 and Table 2 present all performance measures for the RPCCA method over a grid of tuning parameters ranging from 0 (corresponding to ICCA) to 0.8. All methods perform suitably well in terms of accuracy, achieving high absolute correlations between true and estimated canonical variables, even under contaminated scenarios. However, in the presence of contamination, the ICCA captures the linear relationship in the first component worse, as shown by the lower absolute correlations between the canonical variables, $\rho(\hat{a}_1^c, \hat{b}_1^c)$. Moreover, the RPCCA method with positive values of the tuning parameter produces more stable estimates of the canonical vectors, with smaller $P_2$ and $L_2$ distances between the uncontaminated and contaminated estimates in both components, $(\hat{b}_1, \hat{b}_1^c)$ and $(\hat{b}_2, \hat{b}_2^c)$, thus demonstrating its advantage in terms of robustness. Although the differences in performance are not dramatic, the gain in robustness at very little cost in accuracy relative to the ICCA makes the RPCCA very attractive.
On the other hand, if the underlying relationship is easily identified, the proposed robust RPCCA performs as well as the ICCA under pure data and outperforms the ICCA in the presence of contamination (Table 1). However, for $\tau > 0$, some loss of accuracy in the relationship identification under pure data is unavoidable (although small); hence, the tuning parameter should be chosen sufficiently close to zero (from the literature, less than 1) to provide an adequate compromise between efficiency loss and robustness gain. Moderate values of the tuning parameter, around 0.3, offer the best compromise, producing canonical estimators that are robust against data contamination with a small loss of efficiency with respect to the ICCA in the absence of contamination.

6.3. Real Data Application

We finally illustrate the applicability of our method with real-life data on the heredity of head shape in men. For this purpose, we use a well-known dataset from Frets [43] that records the head length and head breadth of the first and second sons of $n = 25$ families. The first and second sets of variables, $X$ and $Y$, respectively, are thus two-dimensional and represent the head length and head breadth of the corresponding son. From this dataset, we want to analyze whether there is a relationship between the head shapes of male offspring. The data have been widely used in the literature: Mardia et al. [2] and Yin [12] analyzed the canonical correlations between the first and second sons' head shapes using CCA and ICCA, respectively, and both found one significant pair of canonical variables with a strong linear relationship. Figure 2 shows the first (left) and second (right) pairs of canonical variables for the head data estimated by RPCCA with $\tau = 0$ (top) and $\tau = 0.5$ (bottom). As shown, both methods essentially coincide on the first pair of canonical variables, $x_1$ and $y_1$, with linear correlation coefficients of $\rho = 78.67\%$ ($\tau = 0$) and $\rho = 76.86\%$ ($\tau = 0.5$), as illustrated in the corresponding plots. For the second pair, the two procedures estimate very different canonical variables, and neither finds any clear functional relationship between linear combinations of the variables (as shown in Figure 2). Thus, we also conclude that there is only one significant pair of canonical variables.
Additionally, to illustrate the robustness advantage of our method (at a small loss of efficiency), we contaminate a single observation (obs. 24) in both vector variables, generating an outlying observation. We then apply RPCCA with $\tau = 0$ (corresponding to ICCA) and $\tau = 0.5$ to both the uncontaminated and the contaminated data. Table 3 presents the $P_2$ and $L_2$ distances between the first pair of canonical vectors (identifying the linear relationship) estimated under uncontaminated and contaminated data, with only one outlying observation. Because the sample size is small, a single outlier heavily influences the ICCA estimate, whereas the RPCCA method with $\tau = 0.5$ shows great stability in the canonical vector estimation. These results illustrate the advantage of the RPCCA in real-life applications, producing robust estimates of the canonical variables with a small loss of efficiency with respect to the ICCA in the absence of data contamination.

7. Conclusions

We have presented a robust generalization of the ICCA based on RP for identifying linear and non-linear relationships between two sets of variables. We have derived sample versions for estimating the canonical vectors in practice, and we have demonstrated the consistency of such estimators. Further, the robustness advantage of the RPCCA has been examined theoretically and empirically, concluding that the proposed RPCCA offers an appealing alternative to ICCA, competitive in terms of estimation accuracy and more robust against data contamination. The method manages to detect hidden functional relationships between linear combinations of the variables and suitably approximates the true underlying relationships, even under contaminated scenarios. Moreover, a permutation test for determining the number of significant pairs of canonical vectors is presented. Since the RPCCA is a parametric family, a data-driven algorithm for determining optimal values of the tuning parameter is a worthwhile pursuit for future research. Also, the methodology presented here can be extended in future works for identifying relationships between more than two sets of variables. The idea is to consider not only two random vectors but k random vectors as considered in [24] and to look for the linear combinations in all of them so that the RP between the marginal distributions and the whole distribution is as large as possible.

Author Contributions

Conceptualization, M.J., P.M., L.P. and K.Z.; methodology, M.J., P.M., L.P. and K.Z.; software, M.J., P.M., L.P. and K.Z.; validation, M.J., P.M., L.P. and K.Z.; formal analysis, M.J., P.M., L.P. and K.Z.; investigation, M.J., P.M., L.P. and K.Z.; resources, M.J., P.M., L.P. and K.Z.; data curation, M.J., P.M., L.P. and K.Z.; writing—original draft preparation, M.J., P.M., L.P. and K.Z.; writing—review and editing, M.J., P.M., L.P. and K.Z.; visualization, M.J., P.M., L.P. and K.Z.; supervision, M.J., P.M., L.P. and K.Z.; project administration, M.J., P.M., L.P. and K.Z.; funding acquisition, M.J., P.M., L.P. and K.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Spanish Grants PID2021-124933NB-I00 and FPU/018240.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We are very grateful to the referees and associate editor for their helpful comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of this study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
CCA | Canonical Correlation Analysis
ICCA | Information Canonical Correlation Analysis
RP | Rényi Pseudodistance
RPCCA | Rényi Pseudodistance Canonical Correlation Analysis

References

  1. Hotelling, H. Relations between two sets of variables. Biometrika 1936, 28, 321–377. [Google Scholar] [CrossRef]
  2. Mardia, K.; Kent, J.; Bibby, J. Multivariate Analysis; Academic Press: New York, NY, USA, 1979. [Google Scholar]
  3. Rencher, A.C.; Christensen, W.F. Methods of Multivariate Analysis, 3rd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2012. [Google Scholar]
  4. Ouali, D.; Chebana, F.; Ouarda, T.B.M.J. Non-linear canonical correlation analysis in regional frequency analysis. Stoch. Environ. Res. Risk Assess 2016, 30, 449–462. [Google Scholar] [CrossRef]
  5. Cannon, A.J.; Hsieh, W.W. Robust nonlinear canonical correlation analysis: Application to seasonal climate forecasting. Nonlinear Process. Geophys. 2008, 15, 221–232. [Google Scholar] [CrossRef]
  6. Iaci, R.; Sriram, T.N. Robust multivariate association and dimension reduction using density divergences. J. Multivar. Anal. 2013, 117, 281–295. [Google Scholar] [CrossRef]
  7. Gifi, A. Nonlinear Multivariate Analysis; Wiley-Blackwell: Hoboken, NJ, USA, 1990. [Google Scholar]
  8. Breiman, L.; Friedman, J.H. Estimating optimal transformations for multiple regression and correlation. J. Am. Stat. Assoc. 1985, 80, 580–598. [Google Scholar] [CrossRef]
  9. Lai, P.L.; Fyfe, C. Kernel and nonlinear canonical correlation analysis. Int. J. Neural Syst. 2000, 10, 365–377. [Google Scholar] [CrossRef] [PubMed]
  10. Painsky, A.; Feder, M.; Tishby, N. Nonlinear canonical correlation analysis: A compressed representation approach. Entropy 2020, 22, 208. [Google Scholar] [CrossRef] [PubMed]
  11. Van Der Burg, E.; de Leeuw, J. Non-linear canonical correlation. Br. J. Math. Stat. Psychol. 1983, 36, 54–80. [Google Scholar] [CrossRef]
  12. Yin, X. Canonical correlation analysis based on information theory. J. Multivar. Anal. 2004, 91, 161–176. [Google Scholar] [CrossRef]
  13. Pardo, L. Statistical Inference Based on Divergence Measures; Chapman and Hall: Boca Raton, FL, USA, 2006. [Google Scholar]
  14. Mandal, A.; Cichocki, A. Non-Linear Canonical Correlation Analysis Using Alpha-Beta Divergence. Entropy 2013, 15, 2788–2804. [Google Scholar] [CrossRef]
  15. Cichocki, A.; Cruces, S.; Amari, S.I. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 2011, 13, 134–170. [Google Scholar] [CrossRef]
  16. Basu, A.; Harris, I.R.; Hjort, N.L.; Jones, M.C. Robust and efficient estimation by minimising a density power divergence. Biometrika 1998, 85, 549–559. [Google Scholar] [CrossRef]
  17. Karasuyama, M.; Sugiyama, M. Canonical dependence analysis based on squared-loss mutual information. Neural Netw. 2012, 34, 46–55. [Google Scholar] [CrossRef] [PubMed]
  18. Nielsen, A.; Vestergaard, J.S. Canonical analysis based on mutual information. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 1068–1071. [Google Scholar]
  19. Romanazzi, M. Influence in canonical correlation analysis. Psychometrika 1992, 57, 237–259. [Google Scholar] [CrossRef]
  20. Sakar, C.O.; Kursun, O. A hybrid method for feature selection based on mutual information and canonical correlation analysis. In Proceedings of the 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010. [Google Scholar]
  21. Sakar, C.O.; Kursun, O. A method for combining mutual information and canonical correlation analysis: Predictive mutual information and its use in feature selection. Expert Syst. Appl. 2012, 39, 3333–3344. [Google Scholar] [CrossRef]
  22. Wang, Y.; Cang, S.; Yu, H. Mutual information inspired on feature selection using kernel canonical correlation analysis. Expert Syst. 2019, 4, 100014. [Google Scholar] [CrossRef]
  23. Bell, C.B. Mutual information and maximal correlation as measures of dependence. Ann. Math. Stat. 1962, 33, 587–595. [Google Scholar] [CrossRef]
  24. Iaci, R.; Yin, X.; Sriram, T.N.; Klingenberg, C.P. An informational measure of association and dimension reduction for multiple sets and groups with applications in morphometric analysis. J. Am. Stat. Assoc. 2008, 103, 1166–1176. [Google Scholar] [CrossRef]
  25. Jones, M.C.; Hjort, N.L.; Harris, I.R.; Basu, A. A comparison of related density-based minimum divergence estimators. Biometrika 2001, 88, 865–873. [Google Scholar] [CrossRef]
  26. Broniatowski, M.; Toma, A.; Vajda, I. Decomposable pseudodistance and applications in statistical estimation. J. Stat. Plan. Inference 2012, 142, 2574–2585. [Google Scholar] [CrossRef]
  27. Castilla, E.; Martín, N.; Muñoz, S.M.; Pardo, L. Robust Wald-type tests based on minimum Rényi pseudodistances estimators for the multiple regression model. J. Stat. Comput. Simul. 2020, 90, 2655–2680. [Google Scholar] [CrossRef]
  28. Castilla, E.; Jaenada, M.; Pardo, L. Estimation and testing on independent not identically distributed observations based on Rényi’s pseudodistances. IEEE Trans. Inf. Theory 2022, 68, 4588–4609. [Google Scholar] [CrossRef]
  29. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561. [Google Scholar]
  30. Toma, A.; Leoni-Aubin, S. Optimal robust M-estimators using Rényi pseudodistances. J. Multivar. Anal. 2013, 115, 259–273. [Google Scholar] [CrossRef]
  31. Toma, A.; Karagrigoriou, A.; Trentou, P. Robust model selection criteria based on pseudodistances. Entropy 2020, 22, 304. [Google Scholar] [CrossRef] [PubMed]
  32. Jaenada, M.; Pardo, L. The minimum Renyi’s Pseudodistances estimators for Generalized Linear Models. In Data Analysis and Related Applications: Theory and Practice; Proceeding of the ASMDA; Wiley: Athens, Greece, 2021. [Google Scholar]
  33. Jaenada, M.; Pardo, L. Robust statistical inference in generalized linear models based on minimum Renyi pseudodistance estimators. Entropy 2022, 24, 123. [Google Scholar] [CrossRef] [PubMed]
  34. Castilla, E.; Jaenada, M.; Martín, N.; Pardo, L. Robust approach for comparing two dependent normal populations through Wald-type tests based on Rényi’s pseudodistance estimators. Stat. Comput. 2023, 32, 100. [Google Scholar] [CrossRef]
  35. Jaenada, M.; Miranda, P.; Pardo, L. Robust Test Statistics Based on Restricted Minimum Rényi’s Pseudodistance Estimators. Entropy 2022, 24, 616. [Google Scholar] [CrossRef]
  36. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081. [Google Scholar] [CrossRef]
  37. Silverman, B.W. Density Estimation for Statistics and Data Analysis; Chapman and Hall: London, UK, 1986. [Google Scholar]
  38. Kim, J.S.; Scott, C. Robust kernel density estimation. J. Mach. Learn. Res. 2012, 13, 2529–2565. [Google Scholar]
  39. Scott, D.W. Multivariate Density Estimation: Theory, Practice, and Visualization; Wiley: New York, NY, USA, 1992; Volume 1. [Google Scholar]
  40. Rüschendorf, L. Consistency of estimators for multivariate density functions and for the mode. Sankhya Ser. A 1977, 39, 243–250. [Google Scholar]
  41. Davison, A.C.; Hinkley, D.V. Bootstrap Methods and Their Application; Cambridge University Press: Cambridge, UK, 1997; Volume 1. [Google Scholar]
  42. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; Chapman & Hall/CRC: New York, NY, USA, 1993; Volume 57. [Google Scholar]
  43. Frets, G.P. Heredity of head form in man. Genetica 1921, 3, 193–384. [Google Scholar] [CrossRef]
Figure 1. f τ ( ρ ) for different values of τ . τ = 0.1 (red), τ = 0.3 (green) and τ = 0.9 (black).
Figure 2. Pairs of canonical variables obtained from RPCCA with τ = 0 (top) and τ = 0.5 (bottom) for the head dataset.
Table 1. RPCCA error measures for the first canonical vector under different values of the tuning parameter τ .
$\tau$ | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8
Pure data
$\rho(\hat{a}_1, \hat{b}_1)$ | 0.92878 | 0.96111 | 0.95505 | 0.97044 | 0.97099 | 0.98082 | 0.97090 | 0.98256 | 0.98606
$\rho(a_1, \hat{a}_1)$ | 0.99908 | 0.99907 | 0.99893 | 0.99877 | 0.99856 | 0.99831 | 0.99799 | 0.99767 | 0.99728
$\rho(b_1, \hat{b}_1)$ | 0.99946 | 0.99951 | 0.99933 | 0.99938 | 0.99936 | 0.99931 | 0.99906 | 0.99916 | 0.99899
5% contamination
$\rho(\hat{a}_1^c, \hat{b}_1^c)$ | 0.65796 | 0.68689 | 0.71486 | 0.75556 | 0.76384 | 0.80646 | 0.80876 | 0.81990 | 0.81201
$\rho(a_1, \hat{a}_1^c)$ | 0.99801 | 0.99815 | 0.99805 | 0.99786 | 0.99749 | 0.99645 | 0.99353 | 0.98368 | 0.96927
$\rho(b_1, \hat{b}_1^c)$ | 0.99548 | 0.99571 | 0.99531 | 0.99538 | 0.99461 | 0.99345 | 0.99084 | 0.97936 | 0.96222
$P_2(\hat{a}_1, \hat{a}_1^c)$ | 0.41125 | 0.38246 | 0.36111 | 0.31247 | 0.30856 | 0.25109 | 0.25272 | 0.22822 | 0.23050
$P_2(\hat{b}_1, \hat{b}_1^c)$ | 0.35685 | 0.32477 | 0.31442 | 0.27277 | 0.26949 | 0.22345 | 0.22962 | 0.20914 | 0.21654
$L_2(\hat{a}_1, \hat{a}_1^c)$ | 0.56677 | 0.52712 | 0.49669 | 0.42880 | 0.42197 | 0.33991 | 0.34109 | 0.30249 | 0.30241
$L_2(\hat{b}_1, \hat{b}_1^c)$ | 0.41496 | 0.37628 | 0.36696 | 0.31681 | 0.31561 | 0.26241 | 0.27298 | 0.24999 | 0.26391
10% contamination
$\rho(\hat{a}_1^c, \hat{b}_1^c)$ | 0.41963 | 0.43878 | 0.46604 | 0.49193 | 0.53443 | 0.56155 | 0.57944 | 0.59613 | 0.60565
$\rho(a_1, \hat{a}_1^c)$ | 0.99698 | 0.99714 | 0.99712 | 0.99686 | 0.99563 | 0.99225 | 0.98450 | 0.96631 | 0.95789
$\rho(b_1, \hat{b}_1^c)$ | 0.99054 | 0.99018 | 0.98974 | 0.98968 | 0.98719 | 0.98401 | 0.97274 | 0.95167 | 0.93961
$P_2(\hat{a}_1, \hat{a}_1^c)$ | 0.68137 | 0.65989 | 0.63978 | 0.60585 | 0.57014 | 0.53623 | 0.51301 | 0.48214 | 0.47437
$P_2(\hat{b}_1, \hat{b}_1^c)$ | 0.56499 | 0.54370 | 0.53006 | 0.51765 | 0.48283 | 0.46007 | 0.44608 | 0.43238 | 0.43151
$L_2(\hat{a}_1, \hat{a}_1^c)$ | 0.94237 | 0.91248 | 0.88343 | 0.83560 | 0.78292 | 0.73427 | 0.69877 | 0.64896 | 0.63289
$L_2(\hat{b}_1, \hat{b}_1^c)$ | 0.63783 | 0.61685 | 0.60406 | 0.59298 | 0.55556 | 0.53199 | 0.52065 | 0.51211 | 0.51843
15% contamination
$\rho(\hat{a}_1^c, \hat{b}_1^c)$ | 0.26726 | 0.29776 | 0.32650 | 0.34987 | 0.40472 | 0.42547 | 0.43099 | 0.43524 | 0.45246
$\rho(a_1, \hat{a}_1^c)$ | 0.99662 | 0.99682 | 0.99670 | 0.99631 | 0.99405 | 0.98763 | 0.97539 | 0.95782 | 0.93472
$\rho(b_1, \hat{b}_1^c)$ | 0.98740 | 0.98702 | 0.98627 | 0.98541 | 0.98364 | 0.97541 | 0.95686 | 0.93536 | 0.91078
$P_2(\hat{a}_1, \hat{a}_1^c)$ | 0.83673 | 0.81093 | 0.78650 | 0.75948 | 0.70320 | 0.68752 | 0.68190 | 0.67188 | 0.64529
$P_2(\hat{b}_1, \hat{b}_1^c)$ | 0.67519 | 0.65622 | 0.64107 | 0.62355 | 0.58194 | 0.57677 | 0.58054 | 0.58166 | 0.57108
$L_2(\hat{a}_1, \hat{a}_1^c)$ | 1.16074 | 1.12367 | 1.08848 | 1.04925 | 0.96766 | 0.94157 | 0.92937 | 0.90983 | 0.86243
$L_2(\hat{b}_1, \hat{b}_1^c)$ | 0.75424 | 0.73561 | 0.72109 | 0.70581 | 0.66064 | 0.66160 | 0.67484 | 0.68431 | 0.68025
20% contamination
$\rho(\hat{a}_1^c, \hat{b}_1^c)$ | 0.19852 | 0.21736 | 0.22524 | 0.23800 | 0.27663 | 0.28001 | 0.30514 | 0.32119 | 0.31858
$\rho(a_1, \hat{a}_1^c)$ | 0.99661 | 0.99683 | 0.99652 | 0.99593 | 0.99429 | 0.99003 | 0.96500 | 0.94362 | 0.90237
$\rho(b_1, \hat{b}_1^c)$ | 0.98527 | 0.98503 | 0.98336 | 0.98251 | 0.97950 | 0.97282 | 0.94210 | 0.91641 | 0.86912
$P_2(\hat{a}_1, \hat{a}_1^c)$ | 0.90198 | 0.90524 | 0.89112 | 0.89756 | 0.85384 | 0.85092 | 0.82023 | 0.80367 | 0.79283
$P_2(\hat{b}_1, \hat{b}_1^c)$ | 0.72434 | 0.71723 | 0.71807 | 0.72166 | 0.69817 | 0.70049 | 0.69510 | 0.69435 | 0.70935
$L_2(\hat{a}_1, \hat{a}_1^c)$ | 0.45211 | 0.45326 | 0.44686 | 0.45094 | 0.42851 | 0.42746 | 0.41497 | 0.41101 | 0.41076
$L_2(\hat{b}_1, \hat{b}_1^c)$ | 0.61339 | 0.61723 | 0.60407 | 0.60759 | 0.59650 | 0.59591 | 0.58441 | 0.59536 | 0.59885
Table 2. RPCCA error measures for the second canonical vector under different values of the tuning parameter τ .
$\tau$ | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8
Pure data
$\rho(\hat{a}_2, \hat{b}_2)$ | 0.22829 | 0.19656 | 0.20033 | 0.18556 | 0.18386 | 0.17334 | 0.17650 | 0.16407 | 0.15426
$\rho(a_2, \hat{a}_2)$ | 0.99687 | 0.99704 | 0.99514 | 0.99387 | 0.98976 | 0.96917 | 0.94815 | 0.93493 | 0.88471
$\rho(b_2, \hat{b}_2)$ | 0.99464 | 0.99490 | 0.99122 | 0.98980 | 0.98375 | 0.96336 | 0.93706 | 0.91885 | 0.85931
5% contamination
$\rho(\hat{a}_2^c, \hat{b}_2^c)$ | 0.30303 | 0.29521 | 0.27510 | 0.24815 | 0.24522 | 0.21903 | 0.22096 | 0.19770 | 0.20928
$\rho(a_2, \hat{a}_2^c)$ | 0.91167 | 0.92137 | 0.91977 | 0.91112 | 0.91556 | 0.90693 | 0.88127 | 0.84600 | 0.82650
$\rho(b_2, \hat{b}_2^c)$ | 0.88049 | 0.89166 | 0.88662 | 0.88904 | 0.89241 | 0.88801 | 0.87192 | 0.83197 | 0.80588
$P_2(\hat{a}_2, \hat{a}_2^c)$ | 0.41815 | 0.38347 | 0.36114 | 0.32231 | 0.31958 | 0.29261 | 0.33041 | 0.32611 | 0.36974
$P_2(\hat{b}_2, \hat{b}_2^c)$ | 0.44509 | 0.41272 | 0.40340 | 0.36543 | 0.36662 | 0.33627 | 0.36107 | 0.34812 | 0.36182
$L_2(\hat{a}_2, \hat{a}_2^c)$ | 0.56221 | 0.51686 | 0.48609 | 0.43221 | 0.42597 | 0.38524 | 0.42792 | 0.41586 | 0.46555
$L_2(\hat{b}_2, \hat{b}_2^c)$ | 0.54952 | 0.50631 | 0.48727 | 0.43321 | 0.43065 | 0.38858 | 0.41934 | 0.39996 | 0.41901
10% contamination
$\rho(\hat{a}_2^c, \hat{b}_2^c)$ | 0.31085 | 0.29651 | 0.28563 | 0.28324 | 0.25836 | 0.24660 | 0.25507 | 0.23689 | 0.22360
$\rho(a_2, \hat{a}_2^c)$ | 0.76633 | 0.76784 | 0.77030 | 0.77828 | 0.75962 | 0.77061 | 0.78303 | 0.75727 | 0.71201
$\rho(b_2, \hat{b}_2^c)$ | 0.77035 | 0.75135 | 0.75242 | 0.75593 | 0.74394 | 0.75508 | 0.74693 | 0.74538 | 0.70453
$P_2(\hat{a}_2, \hat{a}_2^c)$ | 0.69844 | 0.67264 | 0.64760 | 0.62480 | 0.58957 | 0.56764 | 0.57146 | 0.56234 | 0.60946
$P_2(\hat{b}_2, \hat{b}_2^c)$ | 0.70333 | 0.68232 | 0.65851 | 0.63732 | 0.60797 | 0.57351 | 0.55445 | 0.54736 | 0.53981
$L_2(\hat{a}_2, \hat{a}_2^c)$ | 0.94522 | 0.91180 | 0.87884 | 0.84254 | 0.79453 | 0.76052 | 0.75866 | 0.73627 | 0.79154
$L_2(\hat{b}_2, \hat{b}_2^c)$ | 0.87344 | 0.84137 | 0.80061 | 0.76733 | 0.72179 | 0.67148 | 0.64907 | 0.64158 | 0.63347
15% contamination
$\rho(\hat{a}_2^c, \hat{b}_2^c)$ | 0.29041 | 0.26604 | 0.25341 | 0.24867 | 0.22370 | 0.22678 | 0.22668 | 0.21698 | 0.20791
$\rho(a_2, \hat{a}_2^c)$ | 0.62670 | 0.62511 | 0.62775 | 0.63798 | 0.63931 | 0.67347 | 0.66306 | 0.63881 | 0.61437
$\rho(b_2, \hat{b}_2^c)$ | 0.67113 | 0.65281 | 0.63298 | 0.64345 | 0.64183 | 0.66804 | 0.66294 | 0.66655 | 0.64393
$P_2(\hat{a}_2, \hat{a}_2^c)$ | 0.86840 | 0.83480 | 0.80722 | 0.78144 | 0.73384 | 0.71580 | 0.71970 | 0.73208 | 0.73667
$P_2(\hat{b}_2, \hat{b}_2^c)$ | 0.83973 | 0.81076 | 0.78283 | 0.74695 | 0.69787 | 0.68081 | 0.65889 | 0.63941 | 0.61576
$L_2(\hat{a}_2, \hat{a}_2^c)$ | 1.17901 | 1.13476 | 1.09703 | 1.06197 | 0.99294 | 0.96654 | 0.96359 | 0.97065 | 0.96509
$L_2(\hat{b}_2, \hat{b}_2^c)$ | 1.04721 | 1.00181 | 0.95442 | 0.89936 | 0.83020 | 0.80219 | 0.77413 | 0.74676 | 0.72249
20% contamination
$\rho(\hat{a}_2^c, \hat{b}_2^c)$ | 0.23981 | 0.23457 | 0.22347 | 0.22122 | 0.21361 | 0.20732 | 0.20959 | 0.19139 | 0.20603
$\rho(a_2, \hat{a}_2^c)$ | 0.51483 | 0.51794 | 0.52276 | 0.53520 | 0.54956 | 0.53813 | 0.56693 | 0.54726 | 0.56153
$\rho(b_2, \hat{b}_2^c)$ | 0.62369 | 0.59115 | 0.56066 | 0.56134 | 0.58636 | 0.58110 | 0.60681 | 0.61363 | 0.63613
$P_2(\hat{a}_2, \hat{a}_2^c)$ | 0.94170 | 0.92377 | 0.91740 | 0.90923 | 0.86834 | 0.86758 | 0.83873 | 0.82330 | 0.83708
$P_2(\hat{b}_2, \hat{b}_2^c)$ | 0.89672 | 0.87587 | 0.85688 | 0.82781 | 0.78143 | 0.75837 | 0.72490 | 0.70158 | 0.69711
$L_2(\hat{a}_2, \hat{a}_2^c)$ | 0.47541 | 0.46413 | 0.46168 | 0.45820 | 0.43798 | 0.43988 | 0.43203 | 0.42330 | 0.43539
$L_2(\hat{b}_2, \hat{b}_2^c)$ | 0.76644 | 0.73624 | 0.72641 | 0.69977 | 0.67411 | 0.63579 | 0.61011 | 0.61020 | 0.59094
Table 3. P^2 and L^2 distances between the estimated canonical vectors under uncontaminated and contaminated data.
                  τ = 0    τ = 0.5
P^2(â_1, â_1^c)   0.222    0.064
P^2(b̂_1, b̂_1^c)   0.131    0.059
L^2(â_1, â_1^c)   0.224    0.064
L^2(b̂_1, b̂_1^c)   1.995    0.059
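As a point of reference for the closeness measures tabulated above, the following sketch computes two such quantities between a "true" and an estimated canonical vector. The definitions here are illustrative assumptions, not taken from this excerpt: ρ is taken as the absolute cosine between the two vectors (invariant to the sign indeterminacy of canonical directions), and the L^2 discrepancy as the Euclidean distance between the unit-normalized vectors after sign alignment; the paper's exact definitions of ρ, P^2, and L^2 should be consulted.

```python
import numpy as np

def rho(a, b):
    # Assumed definition: absolute cosine similarity, so that a vector and
    # its sign-flipped or rescaled version give rho = 1.
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def l2_aligned(a, b):
    # Assumed definition: Euclidean distance between unit-normalized
    # vectors, taking the smaller of the two sign alignments.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return min(np.linalg.norm(a - b), np.linalg.norm(a + b))

# Hypothetical true and estimated canonical vectors for illustration.
a_true = np.array([0.6, 0.8])
a_est = np.array([0.58, 0.82])
print(rho(a_true, a_est))
print(l2_aligned(a_true, a_est))
```

Values of ρ near 1 and distances near 0 thus indicate that contamination has barely moved the estimated canonical direction, which is the pattern the robust (large-τ) estimators are expected to show in the tables above.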
Jaenada, M.; Miranda, P.; Pardo, L.; Zografos, K. An Approach to Canonical Correlation Analysis Based on Rényi’s Pseudodistances. Entropy 2023, 25, 713. https://doi.org/10.3390/e25050713