Geometric Estimation of Multivariate Dependency

Yasaei Sekeh, Salimeh; Hero, Alfred O.

doi:10.3390/e21080787

Open AccessArticle

Geometric Estimation of Multivariate Dependency

by

Salimeh Yasaei Sekeh

^*,†

and

Alfred O. Hero

Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA

^*

Author to whom correspondence should be addressed.

^†

Current address: School of Computing and Information Science, University of Maine, Orono, ME 04469, USA.

Entropy 2019, 21(8), 787; https://doi.org/10.3390/e21080787

Submission received: 21 May 2019 / Revised: 2 August 2019 / Accepted: 8 August 2019 / Published: 12 August 2019

(This article belongs to the Special Issue Women in Information Theory 2018)

Download

Browse Figures

Versions Notes

Abstract

:

This paper proposes a geometric estimator of dependency between a pair of multivariate random variables. The proposed estimator of dependency is based on a randomly permuted geometric graph (the minimal spanning tree) over the two multivariate samples. This estimator converges to a quantity that we call the geometric mutual information (GMI), which is equivalent to the Henze–Penrose divergence. between the joint distribution of the multivariate samples and the product of the marginals. The GMI has many of the same properties as standard MI but can be estimated from empirical data without density estimation; making it scalable to large datasets. The proposed empirical estimator of GMI is simple to implement, involving the construction of an minimal spanning tree (MST) spanning over both the original data and a randomly permuted version of this data. We establish asymptotic convergence of the estimator and convergence rates of the bias and variance for smooth multivariate density functions belonging to a Hölder class. We demonstrate the advantages of our proposed geometric dependency estimator in a series of experiments.

Keywords:

Henze–Penrose mutual information; Friedman–Rafsky test statistic; geometric mutual information; convergence rates; bias and variance tradeoff; optimization; minimal spanning trees

1. Introduction

Estimation of multivariate dependency has many applications in fields such as information theory, clustering, structure learning, data processing, feature selection, time series prediction, and reinforcement learning, see [1,2,3,4,5,6,7,8,9,10], respectively. It is difficult to accurately estimate the mutual information in high-dimensional settings, specially where the data is multivariate with an absolutely continuous density with respect to Lebesgue measure—the setting considered in this paper. An important and regular measure of dependency is the Shannon mutual information (MI), which has seen extensive use across many application domains. However, the estimation of mutual information can often be challenging. In this paper, we focus on a measure of MI that we call the Geometric MI (GMI). This MI measure is defined as the asymptotic large sample limit of a randomized minimal spanning tree (MST) statistic spanning the multivariate sample realizations. The GMI is related to a divergence measure called the Henze–Penrose divergence [11,12], and related to the multivariate runs test [13]. In [14,15], it was shown that this divergence measure can be used to specify a tighter bound for the Bayes error rate for testing if a random sample comes from one of two distributions the bound in [14,15] is tighter than previous divergence-type bounds such as the Bhattacharrya bound [16]. Furthermore, the authors of [17] proposed a non-parametric bound on multi-class classification Bayes error rate using a global MST graph.

Let

X

and

Y

be random variables with unknown joint density

f_{X Y}

and marginal densities

f_{X}

and

f_{Y}

, respectively, and consider two hypotheses:

H_{0}

,

X

and

Y

are independent and

H_{1}

,

X

and

Y

are dependent,

H_{0} : f_{X Y} = f_{X} f_{Y}, versus H_{1} : f_{X Y} \neq f_{X} f_{Y} .

The GMI is defined as the Henze–Penrose divergence between

f_{X Y}

and

f_{X} f_{Y}

which can be used as a dependency measure. In this paper, we prove that for large sample size n the randomized MST statistic spanning the original multivariate sample realizations and a randomly shuffled data set converges almost surely to the GMI measure. A direct implication of [14,15] is that the GMI provides a tighter bound on the Bayes misclassification rate for the optimal test of independence. In this paper, we propose an estimator based on a random permutation modification of the Friedman–Rafsky multivariate test statistic and show that under certain conditions the GMI estimator achieves the parametric mean square error (MSE) rate when the joint density is bounded and smooth. Importantly unlike other measures of MI, our proposed GMI estimator does not require explicit estimation of the joint and marginal densities.

Computational complexity is an important challenge in machine learning and data science. Most plug-in-based estimators, such as the kernel density estimator (KDE) or the K-nearest-neighbor (KNN) estimator with known convergence rate, require runtime complexity of

O (n^{2})

, which is not suitable for large scale applications. Noshad et al. proposed a graph theoretic direct estimation method based on nearest-neighbor ratios (NNR) [18]. The NNR estimator is based on k-NN graph and computationally more tractable than other competing estimators with complexity

O (k n log n)

. The construction of the minimal spanning tree lies at the heart of the GMI estimator proposed in this paper. Since the GMI estimator is based on the Euclidean MST the dual-tree algorithm by March et al. [19] can be applied. This algorithm is based on the construction of Borůvka [20] and implements the Euclidean MST in approximately

O (n l o g n)

time. In this paper, we experimentally show that for large sample size the proposed GMI estimator has faster runtime than the KDE plug-in method.

1.1. Related Work

Estimation of mutual information has a rich history. The most common estimators of MI are based on plug-in density estimation, e.g., using the histogram, kernel density or kNN density estimators [21,22]. Motivated by ensemble methods applied to divergence estimation [23,24], in [22] an ensemble method for combining multiple KDE bandwidths was proposed for estimating MI. Under certain smoothness conditions this ensemble MI estimator was shown to achieve parametric convergence rates.

Another class of estimators of multivariate dependency bypasses the difficult density estimation task. This class includes the statistically consistent estimators of Rényi-

α

and KL mutual information which are motivated by the asymptotic limit of the length of the KNN graph, [25,26] when joint density is smooth. The estimator of [27] builds on KNN methods for Rényi entropy estimation. The authors of [26], showed that when MI is large the KNN and KDE approaches are ill-suited for estimating MI since the joint density may be insufficiently smooth when there are strong dependencies. To overcome this issue an assumption on the smoothness of the density is required, see [28,29], and [23,24]. For all these methods, the optimal parametric rate of MSE convergence is achieved when the densities are either d,

(d + 1) / 2

or

d / 2

times differentiable [30]. In this paper, we assume that joint and marginal densities are smooth in the sense that they belong to Hölder continuous classes of densities

Σ_{d} (η, K)

, where the smoothness parameter

η \in (0, 1]

and the Lipschitz constant

K > 0

.

A MI measure based on the Pearson chi-square divergence was considered in [31] that is computational efficient and numerically stable. The authors of [27,32] used nearest-neighbor graph and minimal spanning tree approaches, respectively, to estimate Rényi mutual information. In [22], a non-parametric mutual information estimator was proposed using a weighted ensemble method with

O (1 / n)

parametric convergence rate. This estimator was based on plug-in density estimation, which is challenging in high dimension.

Our proposed dependency estimator differs from previous methods in the following ways. First, it estimates a different measure of mutual information, the GMI. Second, instead of using the KNN graph the estimator of GMI uses a randomized minimal spanning tree that spans the multivariate realizations. The proposed GMI estimator is motivated by the multivariate runs test of Friedman and Rafsky (FR) [33] which is a multivariate generalization of the univariate Smirnov maximum deviation test [34] and the Wald-Wolfowitz [35] runs test in one dimension. We also emphasize that the proposed GMI estimator does not require boundary correction, in contrast to other graph-based estimators, such as, the NNR estimator [18], scalable MI estimator [36], or cross match statistic [37].

1.2. Contribution

The contribution of this paper has three components

(1): We propose a novel non-parametric multivariate dependency measure, referred to as geometric mutual information (GMI), which is based on graph-based divergence estimation. The geometric mutual information is constructed using a minimal spanning tree and is a function of the Friedman–Rafsky multivariate test statistic.
(2): We establish properties of the proposed dependency measure analogous to those of Shannon mutual information, such as, convexity, concavity, chain rule, and a type of data-processing inequality.
(3): We derive a bound on the MSE rate for the proposed geometric estimator. An advantage of the estimator is that it achieves the optimal MSE rate without the need for boundary correction, which is required for most plug-in estimators.

1.3. Organization

The rest of the paper is organized as follows. In Section 2, we define the geometric mutual information and establish some of its mathematical properties. In Section 2.2 and Section 2.3, we introduce a statistically consistent GMI estimator and derive a bound on its mean square error convergence rate. In Section 3 we verify the theory through experiments.

Throughout the paper, we denote statistical expectation by

E

and the variance by abbreviation

Var

. Bold face type indicates random vectors. All densities are assumed to be absolutely continuous with respect to non-atomic Lebesgue measure.

2. The Geometric Mutual Information (GMI)

In this section, we first review the definition of the Henze–Penrose (HP) divergence measure defined by Berisha and Hero in [13,14]. The Henze–Penrose divergence between densities f and g with domain

R^{d}

for parameter

p \in (0, 1)

is defined as follows (see [13,14,15]):

D_{p} (f, g) = \frac{1}{4 p q} [\int \frac{{(p f (x) - q g (x))}^{2}}{p f (x) + q g (x)} d x - {(p - q)}^{2}],

(1)

where

q = 1 - p

. This functional is an f-divergence [38], equivalently, as an Ali-Silvey distance [39], i.e., it satisfies the properties of non-negativity, monotonicity, and joint convexity [15]. The measure (1) takes values in

[0, 1]

and

D_{p} (f, g) = 0

if and only if

f = g

almost surely.

The mutual information measure is defined as follows. Let

f_{X}

,

f_{Y}

, and

f_{X Y}

be the marginal and joint distributions, respectively, of random vectors

X \in R^{d_{x}}

,

Y \in R^{d_{y}}

where

d_{x}

and

d_{y}

are positive integers. Then by using (1), a Henze–Penrose generalization of the mutual information between

X

and

Y

, is defined by

\begin{matrix} I_{p} (X; Y) & = D_{p} (f_{X Y}, f_{X} f_{Y}) \\ = \frac{1}{4 p q} [\int \int \frac{{(p f_{X Y} (x, y) - q f_{X} (y) f_{Y} (y))}^{2}}{p f_{X Y} (x, y) + q f_{X} (x) f_{Y} (y)} d x d y - {(p - q)}^{2}] . \end{matrix}

(2)

We will show below that

I_{p} (X; Y)

has a geometric interpretation in terms of the large sample limit of a minimal spanning tree spanning n sample realizations of the merged labeled samples

X \cup Y

. Thus, we call

I_{p} (X; Y)

the GMI between

X

and

Y

. The GMI satisfies similar properties to other definitions of mutual information, such as Shannon and Rényi mutual information. Recalling (3) in [14], an alternative form of

I_{p}

is given by

\begin{matrix} I_{p} (X; Y) = 1 - A_{p} (X; Y) = \frac{u_{p} (X; Y)}{4 p q} - \frac{{(p - q)}^{2}}{4 p q}, \end{matrix}

(3)

where

\begin{matrix} A_{p} (X; Y) = \int \int \frac{f_{X Y} (x, y) f_{X} (x) f_{Y} (y)}{p f_{X Y} (x, y) + q f_{X} (x) f_{Y} (y)} d x d y = E_{X Y} [{(p \frac{f_{X Y} (X, Y)}{f_{X} (X) f_{Y} (Y)} + q)}^{- 1}], and \\ u_{p} (X; Y) = \int \int \frac{{(p f_{X Y} (x, y) - q f_{X} (x) f_{Y} (y))}^{2}}{p f_{X Y} (x, y) + q f_{X} (x) f_{Y} (y)} d x d y = 1 - 4 p q A_{p} (X; Y) . \end{matrix}

(4)

The function

A_{p} (X; Y)

was defined in [13] and is called the geometric affinity between

X

and

Y

. The next subsection of the paper is dedicated to the basic inequalities and properties of the proposed GMI measure (2).

2.1. Properties of the Geometric Mutual Information

In this subsection we establish basic inequalities and properties of the GMI,

I_{p}

, given in (2). The following theorem shows that

I_{p} (X; Y)

is a concave function in

f_{X}

and a convex function in

f_{Y | X}

. The proof is given in Appendix A.1.

Theorem 1.

Denote by

{\tilde{I}}_{p} (f_{X Y})

the GMI

I_{p} (X; Y)

when

X \in R^{d_{x}}

and

Y \in R^{d_{y}}

have joint density

f_{X Y}

. Then the GMI satisfies

(i): Concavity in $f_{X}$ : Let $f_{Y | X}$ be conditional density of $Y$ given $X$ and let $g_{X}$ and $h_{X}$ be densities on $R^{d_{x}}$ . Then for $λ_{1}, λ_{2} \in [0, 1]$ , $λ_{1} + λ_{2} = 1$

${\tilde{I}}_{p} (λ_{1} f_{Y | X} g_{X} + λ_{2} f_{Y | X} h_{X}) \geq λ_{1} {\tilde{I}}_{p} (f_{Y | X} g_{X}) + λ_{2} {\tilde{I}}_{p} (f_{Y | X} h_{X}) .$

(5)

The inequality is strict unless either $λ_{1}$ or $λ_{2}$ are zero or $h_{X} = g_{X}$ .
(ii): Convexity in $f_{Y | X}$ : Let $g_{Y | X}$ and $h_{Y | X}$ be conditional densities of $Y$ given $X$ and let $f_{X}$ be marginal density. Then for $λ_{1}, λ_{2} \in [0, 1]$ , $λ_{1} + λ_{2} = 1$

${\tilde{I}}_{p} (λ_{1} g_{Y | X} f_{X} + λ_{2} h_{Y | X} f_{X}) \leq λ_{1} {\tilde{I}}_{p} (g_{Y | X} f_{X}) + λ_{2} {\tilde{I}}_{p} (h_{Y | X} f_{X}) .$

(6)

The inequality is strict unless either $λ_{1}$ or $λ_{2}$ are zero or $h_{Y | X} = g_{Y | X}$ .

The GMI,

I_{p} (X; Y)

, satisfies properties analogous to the standard chain rule and the data-processing inequality [40]. For random variables

X \in R^{d_{x}}, Y \in R^{d_{y}}

, and

Z \in R^{d_{z}}

with conditional density

f_{X Y | Z}

we define the conditional GMI

\begin{matrix} I_{p} (X; Y | Z) = E_{Z} [{\tilde{I}}_{p} (f_{X Y | Z})], where \\ {\tilde{I}}_{p} (f_{X Y | Z}) = 1 - \int \int \frac{f_{X Y | Z} (x, y | z) f_{X | Z} (x | z) f_{Y | Z} (y | z)}{p f_{X Y | Z} (x, y | z) + q f_{X | Z} (x | z) f_{Y | Z} (y | z)} d x d y . \end{matrix}

(7)

The next theorem establishes a relation between the joint and conditional GMI.

Theorem 2.

For given d-dimensional random vector

X

with components

X_{1}, X_{2}, \dots, X_{d}

and random variable Y,

I_{p} (X; Y) \geq I_{p} (X_{1}; Y) - \sum_{i = 1}^{d - 1} (1 - I_{p} (X_{i}; Y | X^{i - 1})),

(8)

where

X^{i} : = X_{1}, X_{2}, \dots, X_{i}

and the conditional GMI

I_{p} (X_{i}; Y | X^{i - 1})

is defined in (7).

For

d = 2

Theorem 2 reduces to

I_{p} (X_{1}, X_{2}; Y) \geq I_{p} (X_{1}; Y) - (1 - I_{p} (X_{2}; Y | X_{1})),

(9)

Please note that when

\sum_{i = 1}^{d - 1} (1 - I_{p} (X_{i}; Y | X^{i - 1})) \geq 1

, the inequality (8) is trivial since

0 \leq I_{p} (X_{1}; Y) \leq 1

. The proof of Theorem 2 is given in Appendix A.2. Theorem 2 is next applied to the case where

X

and

Y

form a Markov chain. The proof of the following “leany” data-processing inequality (Proposition 1) is provided in Appendices section, Appendix A.3.

Proposition 1.

Suppose random vectors

X, Y, Z

form a Markov chain denoted,

X \to Y \to Z

, in the sense that

f_{X Y Z} = f_{X | Y} f_{Y | Z} f_{Z}

. Then for

p \in (0, 1)

I_{p} (Y; X) \geq I_{p} (Z; X) - {(p E_{X Y} [δ_{X, Y}] + (1 - p))}^{- 1},

(10)

where

δ_{X, Y} = \int \frac{f_{X | Y} (X | Y) f_{Z | Y} (z | Y)}{f_{X | Z} (X | z)} d z .

Furthermore, if both

X \to Y \to Z

and

X \to Z \to Y

together hold true, we have

I_{p} (Y; X) = I_{p} (Z; X)

.

The inequality in (10) becomes interpretable as the standard data-processing inequality

I_{p} (Y; X) \geq I_{p} (Z; X)

, when

E_{Z} [\frac{f (Z | Y)}{f (Z | X)}] = \infty,

since

E_{X Y} [δ_{X, Y}] = E_{X Y} (\frac{f (X | Y)}{f (X)} E_{Z} [\frac{f (Z | Y)}{f (Z | X)}]) .

2.2. The Friedman–Rafsky Estimator

Let a random sample

{x_{i}, y_{i}}_{i = 1}^{n}

from

f_{X Y} (x, y)

be available. Here we show that the GMI

I_{p} (X; Y)

can be directly estimated without estimating the densities. The estimator is inspired by the MST construction of [33] that provides a consistent estimate of the Henze–Penrose divergence [14,15]. We denote by

z_{i}

the i-th joint sample

x_{i}, y_{i}

and by

Z_{n}

the sample set

{z_{i}}_{i = 1}^{n}

. Divide the sample set

Z_{n}

into two subsets

Z_{n^{'}}^{'}

and

Z_{n^{″}}^{″}

with the proportion

α = n^{'} / n

and

β = n^{″} / n

, where

α + β = 1

.

Denote by

{\tilde{Z}}_{n^{″}}

the set

\{(x_{i_{k}}, y_{j_{k}}), k = 1, \dots, n^{″}, selected at random from Z_{n^{″}}^{″}\} :

This means that for each

z_{i k} = (x_{i k}, y_{i k}) \in Z_{n^{″}}^{″}

given the first element

x_{i k}

the second element

y_{i k}

is replaced by a randomly selected

y \in {y_{j k}}_{j = 1}^{n^{″}}

. This results in a random shuffling of the binary relation relating

y_{i k}

in

y_{j k}

. The estimator of

I_{p} (X; Y)

is derived based on the Friedman–Rafsky (FR) multivariate runs test statistic [33] on the concatenated data set,

Z_{n^{'}}^{'} \cup {\tilde{Z}}_{n^{″}}

. The FR test statistic is defined as the number of edges in the MST spanning the merged data set that connect a point in

Z_{n^{'}}^{'}

to a point in

{\tilde{Z}}_{n^{″}}

. This test statistic is denoted by

R_{n^{'}, n^{″}} : = R_{n^{'}, n^{″}} (Z_{n^{'}}^{'}, {\tilde{Z}}_{n^{″}})

. Please note that since the MST is unique with probability one (under the assumption that all density functions are Lebesgue continuous) then all inter point distances between nodes are distinct. This estimator converges to

I_{p} (X; Y)

almost surely as

n \to \infty

. The procedure is summarized in Algorithm 1.

Algorithm 1: MST-based estimator of GMI

Input: Data set

Z_{n} : = \{{(x_{i}, y_{i})}_{i = 1}^{n}\}

1: Find

\tilde{α}

using arguments in Section 2.4

2:

n^{'} \leftarrow \tilde{α} n

,

n^{″} \leftarrow (1 - \tilde{α}) n

3: Divide

Z_{n}

into two subsets

Z_{n^{'}}^{'}

and

Z_{n^{″}}^{″}

4:

{\tilde{Z}}_{n^{″}} \leftarrow {{(x_{i k}, y_{j k})}_{k = 1}^{n^{″}}

: shuffle first and second elements of pairs in

Z_{n^{″}}^{″}}

5:

\hat{Z} \leftarrow Z_{n^{'}}^{'} \cup {\tilde{Z}}_{n^{″}}^{″}

6: Construct MST on

\hat{Z}

7:

R_{n^{'}, n^{″}} \leftarrow #

edges connecting a node in

Z_{n^{'}}^{'}

to a node of

{\tilde{Z}}_{n^{″}}

8:

{\hat{I}}_{p} \leftarrow 1 - R_{n^{'}, n^{″}} \frac{n^{'} + n^{″}}{2 n^{'} n^{″}}

Output:

{\hat{I}}_{p}

, where

p = \tilde{α}

Theorem 3 shows that the output in Algorithm 1 estimates the GMI with parameter

p = α

. The proof is provided in Appendix A.4.

Theorem 3.

For given proportionality parameter

α \in (0, 1)

, choose

n^{'}

,

n^{″}

such that

n^{'} + n^{″} = n

and, as

n \to \infty

, we have

n^{'} / n \to α

and

n^{″} / n \to β = 1 - α

. Then

1 - R_{n^{'}, n^{″}} (Z_{n^{'}}^{'}, {\tilde{Z}}_{n^{″}}) \frac{n}{2 n^{'} n^{″}} \to I_{α} (X; Y), a . s .

(11)

Please note that the asymptotic limit in (11) depends on the proportionality parameter

α

. Later in Section 2.4, we discuss the choice of an optimal parameter

\tilde{α}

. In Figure 1, we illustrate the MST constructed over merged independent (

ρ = 0

) and highly dependent (

ρ = 0.9

) data sets drawn from two-dimensional normal distributions with correlation coefficients

ρ

. Notice that the edges of the MST connecting samples with different colors, corresponding to independent and dependent samples, respectively, are indicated in green. The total number of green edges is the FR test statistic

R_{n^{'}, n^{″}} (Z_{n^{'}}^{'}, {\tilde{Z}}_{n^{″}})

.

2.3. Convergence Rates

In this subsection we characterize the MSE convergence rates of the GMI estimator of Section 2.2 in the form of upper bounds on the bias and the variance. This MSE bound is given in terms of the sample size n, the dimension d, and the proportionality parameter

α

. Deriving convergence rates for mutual information estimators has been of interest in information theory and machine learning [22,27]. The rates are typically derived in terms of a smoothness condition on the densities, such as the Hölder condition [41]. Here we assume

f_{X}

,

f_{Y}

and

f_{X Y}

have support sets

S_{X}

,

S_{Y}

, and

S_{X Y} : = S_{X} \times S_{Y}

, respectively, and are smooth in the sense that they belong to Hölder continuous classes of densities

Σ_{d}^{s} (η,

K),

0 < η \leq 1

[42,43]:

Definition 1.

(Hölder class): Let

X \subset R^{d}

be a compact space. The Hölder class of functions

Σ_{d} (η, K)

, with Hölder parameters η and K, consists of functions g that satisfy

\begin{matrix} \{g : ∥ g (z) - p_{x}^{⌊ η ⌋} (z) ∥_{d} \leq K ∥ x - z ∥_{d}^{η}, x, z \in X\}, \end{matrix}

(12)

where

p_{x}^{k} (z)

is the Taylor polynomial (multinomial) of g of order k expanded about the point

x

and

⌊ η ⌋

is defined as the greatest integer strictly less than η.

To explore the optimal choice of parameter

α

we require bounds on the bias and variance bounds, provided in Appendix A.5. To obtain such bounds, we will make several assumptions on the absolutely continuous densities

f_{X}

,

f_{Y}

,

f_{X Y}

and support sets

S_{X}

,

S_{Y}

,

S_{X Y}

:

(A.1): Each of the densities belong to $Σ_{d} (η, K)$ with smoothness parameters $η$ and Lipschitz constant K.
(A.2): The volumes of the support sets are finite, i.e., $0 < V (S_{X}) < \infty, 0 < V (S_{Y}) < \infty$ .
(A.3): All densities are bounded i.e., there exist two sets of constants $C_{X}^{L}, C_{Y}^{L}, C_{X Y}^{L}$ and $C_{X}^{U}, C_{Y}^{U}, C_{X Y}^{U}$ such that $0 < C_{X}^{L} \leq f_{X} \leq C_{X}^{U} < \infty$ , $0 < C_{Y}^{L} \leq f_{Y} \leq C_{Y}^{U} < \infty$ and $0 < C_{X Y}^{L} \leq f_{X Y} \leq C_{X Y}^{U} < \infty$ .

The following theorem on the bias follows under assumptions (A.1) and (A.3):

Theorem 4.

For given

α \in (0, 1)

,

β = 1 - α

,

d \geq 2

, and

0 < η \leq 1

the bias of the

R_{n^{'}, n^{″}} : = R_{n^{'}, n^{″}} (Z_{n^{'}}^{'}, {\tilde{Z}}_{n^{″}})

satisfies

\begin{matrix} | \frac{E [R_{n^{'}, n^{″}}]}{n} - 2 α β \int \int \frac{f_{X Y} (x, y) f_{X} (x) f_{Y} (y)}{α f_{X Y} (x, y) + β f_{X} (x) f_{Y} (y)} d x d y | \\ \leq O (max \{n^{- η^{2} / (d (1 + η))}, {(β n)}^{- η / (1 + η)}, c_{d} n^{- 1}\}), \end{matrix}

(13)

where

c_{d}

is the largest possible degree of any vertex of MST on

Z_{n^{'}}^{'} \cup {\tilde{Z}}_{n^{″}}

. The explicit form of (13) is provided in Appendix A.5.

Please note that according to Theorem 13 in [44], the constant

c_{d}

is lower bounded by

Ω (\sqrt{d} 2^{n (1 - H (γ))})

,

γ = 2^{- d}

and

H (γ)

is the binary entropy i.e.,

H (γ) = - γ log γ - (1 - γ) log (1 - γ) .

A proof of Theorem 4 is given in Appendix A.5. The next theorem gives an upper bound on the variance of the FR estimator

R_{n^{'}, n^{″}}

. The proof of the variance result requires a different approach than the bias bound (the Efron–Stein inequality [45]). It is similar to arguments in ([46], Appendix A.3), and is omitted. In Theorem 5 we assume that the densities

f_{X}

,

f_{Y}

, and

f_{X Y}

are absolutely continuous and bounded (A.3).

Theorem 5.

Given

α \in (0, 1)

, the variance of the estimator

R_{n^{'}, n^{″}} : = R_{n^{'}, n^{″}} (Z_{n^{'}}^{'}, {\tilde{Z}}_{n^{″}})

is bounded by

V a r (\frac{R_{n^{'}, n^{″}}}{n}) \leq \frac{(1 - α) c_{d}}{n}, α = n^{'} / n,

(14)

where

c_{d}

is a constant depending only on the dimension d.

2.4. Minimax Parameter $α$

Recall assumptions (A.1), (A.2), and (A.3) in Section 2.3. The constant

α

can be chosen to minimize the maximum the MSE converges rate where the maximum is taken over the space of Hölder smooth joint densities

f_{X Y}

.

Throughout this subsection we use the following notations:

$ϵ_{X Y} : = f_{X Y} / f_{X} f_{Y}$ ,
$C_{ϵ}^{L} : = C_{X Y}^{L} / C_{X}^{U} C_{Y}^{U}$ and $C_{ϵ}^{U} : = C_{X Y}^{U} / C_{X}^{L} C_{Y}^{L}$ ,
$C_{n} : = C_{X Y}^{L} n / 2$ ,
$α_{0}^{L} : = \frac{2}{C_{n}}$ and $α_{0}^{U} : = min \{\frac{1}{4}, \frac{1 + 1 / C_{n}}{4 + 2 C_{ϵ}^{U}}, 1 - n^{η / d - 1}\}$ , where $η$ is the smoothness parameter,
$l_{n} : = ⌊n^{η / (d^{2} (1 + η))}⌋$ .

Now define

{\tilde{G}}_{ϵ_{X Y}, n}^{α, β} (x, y)

by

\frac{(ϵ_{X Y} (x, y) + 1 / (β C_{n})) (1 + ϵ_{X Y} (x y) + 1 / (β C_{n}))}{{(α + β ϵ_{X Y} (x, y))}^{2}}, β = 1 - α .

(15)

Consider the following optimization problem:

\begin{matrix} \min_{α} \max_{ϵ_{X Y}} & \tilde{Δ} (α, ϵ_{X Y}) + c_{d} (1 - α) n^{- 1} \\ subject to & C_{ϵ}^{L} \leq ϵ_{X Y} \leq C_{ϵ}^{U}, \\ α_{0}^{L} \leq α \leq α_{0}^{U}, \end{matrix}

(16)

where

\tilde{Δ} (α, ϵ_{X Y}) : = D (n, l_{n}, d, η) + \tilde{D} (n, l_{n}, d) C_{X Y}^{U} {\int \int}_{S_{X Y}} {\tilde{G}}_{ϵ_{X Y}, n}^{α, β} (x, y) d x d y,

(17)

and

\begin{matrix} D (n, l_{n}, d, η) = c_{2} l_{n}^{d} n^{- 1} + c_{d} 2^{d} n^{- 1} + c^{'} l_{n}^{d} n^{- η / d} + c l_{n}^{d} n^{- 1 / d} + 2 c_{1} l_{n}^{d - 1} n^{1 / d - 1} + c_{3} l_{n}^{- d η}, \end{matrix}

(18)

\begin{matrix} \tilde{D} (n, l_{n}, d) = 2 + n^{- 1} 2 c^{″} \sum_{i = 1}^{M} l_{n} l_{n}^{d} a_{i}^{- 1} + n^{- 3 / 2} 2 c_{1}^{'} \sum_{i = 1}^{M} l_{n} l_{n}^{d / 2} \sqrt{b_{i}} a_{i}^{2} \\ + n^{- 1} \sum_{i = 1}^{M} 2 n^{- 3 / 2} l_{n}^{- d / 2} \frac{\sqrt{b_{i}}}{a_{i}^{2}} {(n a_{i} l_{n}^{d} + n^{2} a_{i}^{2})}^{1 / 2} {(n b_{i} l_{n}^{d} + n^{2} b_{i}^{2})}^{1 / 2} . \end{matrix}

(19)

Please note that in (18),

c, c^{'}, c_{1}, c_{2}

are constants, and

c_{d}

only depends on the dimension d. Also, in (19),

a_{i}

and

b_{i}

are constants. Let

ϵ_{X Y}^{*}

be the optimal

ϵ_{X Y}

i.e.,

ϵ_{X Y}^{*}

be the solution of the optimization problem (16). Set

\begin{matrix} Ξ (α) : = \frac{d}{d α} (\tilde{Δ} (α, ϵ_{X Y}^{*}) + c_{d} (1 - α) n^{- 1}), \end{matrix}

(20)

such that

\tilde{Δ} (α, ϵ_{X Y}^{*})

is (17) when

ϵ_{X Y} = ϵ_{X Y}^{*}

. For

α \in [α_{0}^{L}, α_{0}^{U}]

, the optimal choice of

ϵ_{X Y}

in terms of maximizing the MSE is

ϵ_{X Y}^{*} = C_{ϵ}^{U}

and the saddle point for the parameter

α

, denoted by

\tilde{α}

, is given as follows:

$\tilde{α} = α_{0}^{U}$ , if $Ξ (α_{0}^{U}) < 0$ .
$\tilde{α} = α_{0}^{L}$ , if $Ξ (α_{0}^{L}) > 0$ .
$\tilde{α} = Ξ^{- 1} (0)$ , if $α_{0}^{L} \leq Ξ^{- 1} (0) \leq α_{0}^{U}$ .

Further details are given in Appendix A.6.

3. Simulation Study

In this section, numerical simulations are presented that illustrate the theory in Section 2. We perform multiple experiments to demonstrate the utility of the proposed GMI estimator of the HP-divergence in terms of the dimension d and the sample size n. Our proposed MST-based estimator of the GMI is compared to density plug-in estimators of the GMI, in particular the standard KDE density plug-in estimator of [22], where the convergence rates of Theorems 4 and 5 are validated. We use multivariate normal simulated data in the experiments. In this section, we also discuss the choice of the proportionality parameter

α

and compare runtime of the proposed GMI estimator approach with KDE method.

Here we perform four sets of experiments to illustrate the proposed GMI estimator. For the first set of experiments the MSE of the GMI estimator in Algorithm 1 is shown in Figure 2-left. The samples were drawn from d-dimensional normal distribution, with various sample sizes and dimensions

d = 6, 10, 12

. We selected the proportionality parameter

α = 0.3

and computed the MSE in terms of the sample size n. We show the log–log plot of MSE when n varies in

[100, 1500]

. Please note that the empirically optimal proportion

α

depends on n, so to avoid the computational complexity we fixed

α

for this experiment. The experimental result shown in Figure 2-left validates the theoretical MSE growth rates derived from (13) and (14), i.e., decreasing sub-linearly in n and increasing exponentially in d.

In Figure 2-right, we compare the proposed MST-based GMI estimator with the KDE-GMI estimator [22]. For the KDE approach, we estimated the joint and marginal densities and then plugged them into the proposed expression (2). The bandwidth h used for the KDE plug-in estimator was set as

h = n^{- 1 / (d + 1)}

. The choice of h minimizes the bound on the MSE of the plug-in estimator. We generated data from the two-dimensional normal distribution with zero mean and covariance matrix

(\begin{matrix} 1 & ρ \\ ρ & 1 \end{matrix}) .

(21)

The coefficient

ρ

is varied in range

[0.1, 0.9]

. The true GMI was computed by the Monte Carlo approximation to the integral (2). Please note that as

ρ

increases, the MST-GMI outperforms the KDE-GMI approach. In this set of experiments

α = 0.6

.

Figure 3 again compares the MST-GMI estimator with the KDE-GMI estimator. samples are drawn from the multivariate standard normal distribution with dimensions

d = 4

and

d = 12

. In both cases the proportionality parameter

α = 0.5

. The left plots in Figure 3 show the MSE (100 trials) of the GMI estimator implemented with an KDE estimator (with bandwidth as in Figure 2 i.e.,

h = n^{- 1 / (d + 1)}

) for dimensions

d = 4, 12

and various sample sizes. For all dimensions and sample sizes the MST-GMI estimator also outperforms the plug-in KDE-GMI estimator based on the estimated log–log MSE slope given in Figure 3 (left plots). The right plots in Figure 3 compares the MST-GMI with the KDE-GMI. In this experiment, the error bars denote standard deviations with 100 trials. We observe that for higher dimension

d = 12

and larger sample size n, the KDE-GMI approaches the true GMI at a slower rate than the MST-GMI estimator. This reflects the power of the proposed graph-based approach to estimating GMI.

The comparison between MSEs for various dimension d is shown in Figure 4 (left). This experiment highlights the impact of higher dimension on the GMI estimators. As expected, for larger sample size n, MSE decreases while for higher dimension it increases. In this setting, we have generated samples from standard normal distribution with size

n \in [10^{2}, 4 \times 10^{3}]

and

α = 0.5

. From Figure 4 (left) we observe that for larger sample size, MSE curves are ordered based on their corresponding dimensions. Results in Section 2.4 strongly depend on the lower bounds

C_{X}^{L}, C_{Y}^{L}, C_{X Y}^{L}

and upper bounds

C_{X}^{U}, C_{Y}^{U}, C_{X Y}^{U}

and provide optimal parameter

α

in the range

[α_{0}^{L}, α_{0}^{U}]

, therefore in the experiment section we only analyze one case where the lower bounds

C_{X}^{L}, C_{Y}^{L}, C_{X Y}^{L}

and upper bounds

C_{X}^{U}, C_{Y}^{U}, C_{X Y}^{U}

are known and the optimal

α

becomes

α_{0}^{L}

. Figure 4 (right) illustrates the MSE vs proportion parameter

α

when

n = 500, 10^{4}

samples are generated from truncated normal distribution with

ρ = 0.7, 0.5

. First, following Section 2.4, we compute the bound

[α_{0}^{L}, α_{0}^{U}]

and then derive the optimal

α

in this range. Therefore, each experiment with different sample size and

ρ

provides different range

[α_{0}^{L}, α_{0}^{U}]

. We observe that the MSE does not appeared a monotonic function in

α

and its behavior strongly depends on sample size n, d, and density functions’ bounds. Additional study of the dependency is described in Appendix A.6. In this set of experiments

Ξ (α_{0}^{L}) > 0

, therefore following the results in Section 2.4, we have

\tilde{α} = α_{0}^{L}

. In this experiment the optimal value of

α

is always the lower bound

α_{0}^{L}

and indicated in the Figure 4 (right).

The parameter

α

is studied further for three scenarios where the lower bounds

C_{X}^{L}, C_{Y}^{L}, C_{X Y}^{L}

and upper bounds

C_{X}^{U}, C_{Y}^{U}, C_{X Y}^{U}

are assumed unknown, therefore results in Section 2.4 are not applicable. In this set of experiments, we varied

α

in the range

(0, 1)

to divide our original sample. We generated sample from an isotropic multivariate standard normal distribution

(ρ = 0)

in all three scenarios (all features are independent). Therefore, the true GMI is zero and in all scenarios the GMI column, corresponding to the MST-GMI, is compared with zero. In each scenario we fixed dimension d and sample size n and varied

α = 0.2, 0.5, 0.8

. The dimension and sample size in Scenarios 1,2, and 3 are

d = 6, 8, 10

and

n = 1000, 1500, 2000

, respectively. In Table 1 the last column (

α

) stars the parameter

α \in {0.2, 0.5, 0.8}

with the minimum MSE and GMI

(I_{α})

in each scenario. Table 1 shows that in these sets of experiments when

α = 0.5

, the GMI estimator has less MSE (i.e., is more accurate) than when

α = 0.2

or

α = 0.8

. This experimentally demonstrates that if we split our training data, the proposed Algorithm 1 performs better with

α = 0.5

.

Finally, Figure 5 shows the runtime as a function of sample size n. We vary sample size in the range

[10^{3}, 10^{4}]

. Observe that for smaller number of samples the KDE-GMI method is slightly faster but as n becomes large we see significant relative speedup of the proposed MST-GMI method.

4. Conclusions

In this paper, we have proposed a new measure of mutual information, called Geometric MI (GMI), which is related to the Henze–Penrose divergence. The GMI can be viewed as dependency measure that is the limit of the Friedman–Rafsky test statistic, which depends on the MST over all data points. We established some properties of the GMI in terms of convexity/concavity, chain rule, and a type of data-processing inequality. A direct estimator of the GMI, called the MST-GMI, was introduced that uses random permutations of observed relationships between variables in the multivariate samples. An explicit form for the MSE convergence rate bound was derived that depends on a free parameter called the proportionality parameter. An asymptotically optimal form for this free parameter was given that minimizes the MSE convergence rate. Simulation studies were performed that illustrate and verify the theory.

Author Contributions

S.Y.S. wrote this article primarily under the supervision of A.O.H. as principle investigator (PI), and A.O.H. edited the paper. S.Y.S. provided the primary contributions for the proofs of all theorems and performed all experiments.

Funding

The work presented in this paper was partially supported by ARO grant W911NF-15-1-0479 and DOE grant DE-NA0002534.

Acknowledgments

The authors would like to thank Brandon Oselio for the helpful comments.

Conflicts of Interest

The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

Appendix A

We organize the appendices as follows: Theorem 1 which establishes convexity/concavity is proved in Appendix A.1. Appendix A.2 and Appendix A.3 establish the inequality (9) and (10) for given

p \in (0, 1)

, respectively. In Appendix A.4, we first prove that the set

{\tilde{Z}}_{n^{″}}

which is randomly generated from original dependent data, contains asymptotically independent samples. Later by using the generated independent sample

{\tilde{Z}}_{n^{″}}

we show that for given

α

the FR estimator of the GMI given in Algorithm 1 tends to

I_{α}

. Appendix A.5 dedicates proof of Theorem 4. The proportionality parameter (

α

) optimization strategy is presented in Appendix A.6.

Appendix A.1. Theorem 1

Proof.

The proof is similar to the result for standard (Shannon) mutual information. However, we require the following lemma, proven in analogous manner to the log-sum inequality:

Lemma A1.

For non-negative real numbers

α_{1}, \dots, α_{n}

and

β_{1}, \dots, β_{n}

, given

p \in (0, 1)

,

q = 1 - p

,

\sum_{i = 1}^{n} α_{i} {(p (\frac{β_{i}}{α_{i}}) + q)}^{- 1} \geq \sum_{i = 1}^{n} α_{i} {(p (\frac{\sum_{i = 1}^{n} β_{i}}{\sum_{i = 1}^{n} α_{i}}) + q)}^{- 1} .

Notice this follows by using the convex function

u (y) = y^{2} / (p + q y)

for any

p \in (0, 1)

,

q = 1 - p

, and the Jensen inequality.

Define the shorthand

\int_{x}

,

\int_{y}

, and

\int_{x y}

for

\int d x

,

\int d y

and

\int \int d x d y

, respectively. To prove part (i) of Theorem 1, we represent the LHS of (5) as:

\begin{matrix} {\tilde{I}}_{p} (λ_{1} f_{Y | X} g_{X} + λ_{2} f_{Y | X} h_{X}) = 1 - \int_{x y} (λ_{1} f_{Y | X} g_{X} + λ_{2} f_{Y | X} h_{X}) \times \\ {[p \frac{λ_{1} f_{Y | X} g_{X} + λ_{2} f_{Y | X} h_{X}}{(\int_{x} λ_{1} f_{Y | X} g_{X} + λ_{2} f_{Y | X} h_{X}) (\int_{y} λ_{1} f_{Y | X} g_{X} + λ_{2} f_{Y | X} h_{X})} + q]}^{- 1} \\ = 1 - \int_{x y} (λ_{1} f_{Y | X} g_{X} + λ_{2} f_{Y | X} h_{X}) {[p \frac{f_{Y | X}}{(\int_{x} λ_{1} f_{Y | X} g_{X} + λ_{2} f_{Y | X} h_{X})} + q]}^{- 1} . \end{matrix}

Furthermore, the RHS of (5) can be rewritten as

\begin{matrix} λ_{1} {\tilde{I}}_{p} (f_{Y | X} g_{X}) + λ_{2} {\tilde{I}}_{p} (f_{Y | X} h_{X}) \\ = 1 - \int_{x y} (λ_{1} f_{Y | X} g_{X} {[p \frac{f_{Y | X} g_{X}}{(\int_{x} f_{Y | X} g_{X}) (\int_{y} f_{Y | X} g_{X})} + q]}^{- 1} + λ_{2} f_{Y | X} h_{X} {[p \frac{f_{Y | X} h_{X}}{(\int_{x} f_{Y | X} h_{X}) (\int_{y} f_{Y | X} h_{X})} + q]}^{- 1}) \\ = 1 - \int_{x y} (λ_{1} f_{Y | X} g_{X} {[p \frac{f_{Y | X}}{\int_{x} f_{Y | X} g_{X}} + q]}^{- 1} + λ_{2} f_{Y | X} h_{X} {[p \frac{f_{Y | X}}{\int_{x} f_{Y | X} h_{X}} + q]}^{- 1}) . \end{matrix}

Thus, to prove

LHS \geq RHS

, we use the inequality below:

\begin{matrix} (λ_{1} f_{Y | X} g_{X} + λ_{2} f_{Y | X} h_{X}) {[p \frac{f_{Y | X}}{\int_{x} λ_{1} f_{Y | X} g_{X} + λ_{2} f_{Y | X} h_{X}} + q]}^{- 1} \\ \leq (λ_{1} f_{Y | X} g_{X}) {[p \frac{π}{\int_{x} f_{Y | X} g_{X}} + q]}^{- 1} + (λ_{2} f_{Y | X} h_{X}) {[p \frac{π}{\int_{x} f_{Y | X} h_{X}} + q]}^{- 1} . \end{matrix}

In Lemma A1, let

\begin{matrix} α_{1} = \frac{λ_{1} (\int_{x} f_{Y | X} g_{X}) (λ_{1} f_{Y | X} g_{X} + λ_{2} f_{Y | X} h_{X})}{\int_{x} λ_{1} f_{Y | X} g_{X} + λ_{2} f_{Y | X} h_{X}}, \\ α_{2} = \frac{λ_{2} (\int_{x} f_{Y | X} h_{X}) (λ_{1} f_{Y | X} g_{X} + λ_{2} f_{Y | X} h_{X})}{\int_{x} λ_{1} f_{Y | X} g_{X} + λ_{2} f_{Y | X} h_{X}}, \end{matrix}

and for

i = 1, 2

,

β_{i} = \frac{λ_{i} f_{Y | X} (λ_{1} f_{Y | X} g_{X} + λ_{2} f_{Y | X} h_{X})}{\int_{x} λ_{1} f_{Y | X} g_{X} + λ_{2} f_{Y | X} h_{X}} .

Then the claimed assertion (i) is obtained. Part (ii) follows by convexity of

D_{p}

and the following expression:

\begin{matrix} {\tilde{I}}_{p} (λ_{1} g_{Y | X} f_{X} + λ_{2} h_{Y | X} f_{X}) \\ = D_{p} (λ_{1} f_{X} g_{Y | X} + λ_{2} f_{X} h_{Y | X}, (\int_{x} λ_{1} f_{X} g_{Y | X} + λ_{2} f_{X} h_{Y | X}) (\int_{y} λ_{1} ϕ π_{1} + λ_{2} ϕ π_{2})) \\ = D_{p} (λ_{1} f_{X} g_{Y | X} + λ_{2} f_{X} h_{Y | X}, f_{X} (\int_{x} λ_{1} f_{X} g_{Y | X} + λ_{2} f_{X} h_{Y | X}) \\ = D_{p} (λ_{1} f_{X} g_{Y | X} + λ_{2} f_{X} h_{Y | X}, λ_{1} (\int_{x} f_{X} g_{Y | X}) (\int_{y} f_{X} g_{Y | X}) + λ_{2} (\int_{x} f_{X} h_{Y | X}) (\int_{y} f_{X} h_{Y | X})) . \end{matrix}

Therefore, the claim in (6) is proved. □

Appendix A.2. Theorem 2

Proof.

We start with (9). Given

p \in (0, 1)

and

q = 1 - p

, we can easily check that for positive

t > q

,

s > q

, such that

t, s \neq 1

:

(t + s) [(t s) + q (1 - t - s)] - p (t s) \geq 0 .

This implies

p (\frac{t - q}{p}) (\frac{s - q}{p}) + q \geq \frac{t s}{t + s} .

By substituting

\frac{f_{X_{1} Y} (x_{1}, y)}{f_{X_{1}} (x_{1}) f_{Y} (y)} = \frac{t - q}{p}, \frac{f_{X_{2} Y | X_{1}} (x_{2}, y | x_{1})}{f_{X_{2} | X_{1}} (x_{2} | x_{1}) f_{Y | X_{1}} (y | x_{1})} = \frac{s - q}{p},

we get

\begin{matrix} {(p \frac{f_{X_{1} X_{2} Y} (x_{1}, x_{2}, y)}{f_{X_{1} X_{2}} (x_{1}, x_{2}) f_{Y} (y)} + q)}^{- 1} \leq {(p \frac{f_{X_{1} Y} (x_{1}, y)}{f_{X_{1}} (x_{1}) f_{Y} (y)} + q)}^{- 1} \\ + {(p \frac{f_{X_{2} Y | X_{1}} (x_{2}, y | x_{1})}{f_{X_{2} | X_{1}} (x_{2} | x_{1}) f_{Y | X_{1}} (y | x_{1})} + q)}^{- 1} . \end{matrix}

(A1)

Consequently

\begin{matrix} I_{p} (X_{1}, X_{2}; Y) \geq I_{p} (X_{1}; Y) - E_{f} [{(p \frac{f_{X_{2} Y | X_{1}} (x_{2}, y | x_{1})}{f_{X_{2} | X_{1}} (x_{2} | x_{1}) f_{Y | X_{1}} (y | x_{1})} + q)}^{- 1}] . \end{matrix}

(A2)

Here f is the joint PDF of random vector

(X_{1}, X_{2}, Y)

. From the conditional GMI definition in (7) the expectation term in (A2) is equivalent to

1 - I_{p} (X_{2}; Y | X_{1})

. This completes the proof. □

Appendix A.3. Proposition 1

Proof.

Recall the Theorem 2, part (i). First from

X \to Y \to Z

we have

f_{X Y Z} = f_{X Y} f_{Z | Y}

and then by applying the Jensen inequality,

\begin{matrix} I_{p} (X; Y) = I_{p} (X; Y, Z) and \\ I_{p} (X; Y, Z) \geq I_{p} (Z; X) - E [{(p π (X, Y, Z) + q)}^{- 1}] \\ \geq I_{p} (Z; X) - {(p E [π (X, Y, Z)] + q)}^{- 1}, \end{matrix}

(A3)

where

π (x, y, z) = \frac{f_{Y X | Z} (y, x | z)}{f_{Y | Z} (y | z) f_{X | Z} (x | z)} .

Now by Markovian property we can immediately simplify the RHS in (A3) to the RHS in (10).

Furthermore, we can easily show that if

X \to Z \to Y

, we have

f_{X Y Z} = f_{Z X} f_{Y | Z}

and therefore

I_{p} (Z; X) = I_{p} (X; Y, Z)

. This together with (A3) proves that under both conditions

X \to Y \to Z

and

X \to Z \to Y

, the equality

I_{p} (X; Y) = I_{p} (Z; X)

holds true. □

Appendix A.4. Theorem 3

Proof.

We first derive two required Lemmas A2 and A3 below:

Lemma A2.

Consider random vector

Z = (X, Y)

with joint probability density function (pdf)

f_{X Y}

. Let

Z_{n} = {z_{1}, \dots, z_{n}} = {(x_{i}, y_{i})}_{i = 1}^{n}

be a set of samples with pdf

f_{X Y}

. Let

Z_{n^{'}}^{'}

and

Z_{n^{″}}^{″}

be two distinct subsets of

Z_{n}

such that

n^{'} + n^{″} = n

and sample proportion is

α = n^{'} / n

and

β = 1 - α

. Next, let

{\tilde{Z}}_{n^{″}} = {{\tilde{z}}_{1}, \dots, {\tilde{z}}_{n^{″}}}

be a set of pairs such that

{\tilde{z}}_{k} = (x_{i_{k}}, y_{j_{k}})

,

k = 1, \dots, n^{″}

are selected at random from

Z_{n^{″}}^{'}

. Denote

\tilde{Z} = (\tilde{X}, \tilde{Y})

as the random vector corresponding to samples in

{\tilde{Z}}_{n^{″}}

. Then as

n \to \infty

such that

n^{″}

also grows in a linked manner that

β \neq 0

then the distribution of

\tilde{Z}

convergences to

f_{X} \times f_{Y}

i.e., random vectors

\tilde{X}

and

\tilde{Y}

become mutually independent.

Proof. Consider two subsets

A, B \subset R^{n}

, then we have

\begin{matrix} P (\tilde{X} \in A, \tilde{Y} \in B) = E [I_{A} (\tilde{X}) . I_{B} (\tilde{Y})] = E [\sum_{i, j} I_{A} (X_{i}) . I_{B} (Y_{j}) . P ((\tilde{X}, \tilde{Y}) = (X_{i}, Y_{j}) | Z_{n})] . \end{matrix}

Here

I_{A}

stands for the indicator function. Please note that

P ((\tilde{X}, \tilde{Y}) = (X_{i}, Y_{j}) | Z_{n}) = \frac{1}{{n^{″}}^{2}},

and

X_{i}

and

Y_{j}

,

i \neq j

are independent, therefore

\begin{matrix} P (\tilde{X} \in A, \tilde{Y} \in B) = \frac{1}{{n^{″}}^{2}} \sum_{i \neq j} P (X_{i} \in A) P (Y_{j} \in B) + \frac{1}{{n^{″}}^{2}} \sum_{i = 1}^{n} P (X_{i} \in A, Y_{i} \in B) \\ = P (X_{i} \in A) P (Y_{j} \in B) + \frac{1}{n^{″}} \{P (X_{i} \in A, Y_{i} \in B) - P (X_{i} \in A) P (Y_{i} \in B)\}, \end{matrix}

this implies that

\begin{matrix} | P (\tilde{X} \in A, \tilde{Y} \in B) - P (\tilde{X} \in A) P (\tilde{Y} \in B) | \leq \frac{1}{n^{″}} \int \int | f_{X Y} (x, y) - f_{X} (x) f_{Y} (y) | d x d y . \end{matrix}

(A4)

On the other hand, we know that

n^{″} = β n

, so we get

\begin{matrix} | P (\tilde{X} \in A, \tilde{Y} \in B) - P (\tilde{X} \in A) P (\tilde{Y} \in B) | \leq \frac{1}{β n} \int \int | f_{X Y} (x, y) - f_{X} (x) f_{Y} (y) | d x d y . \end{matrix}

(A5)

From (A5), we observe that when

β

takes larger values the bound becomes tighter. So, if

n \to \infty

such that

n^{″}

also becomes large enough in a linked manner so that

β = constant

then the RHS in (A5) tends to zero. This implies that

\tilde{X}

and

\tilde{Y}

become independent when

n \to \infty

. □

An immediate result of Lemma A2 is the following:

Lemma A3.

For given random vector

Z_{n} = (X_{n}, Y_{n})

from joint density function

f_{X Y}

and with marginal density functions

f_{X}

and

f_{Y}

let

{\tilde{Z}}_{n^{″}} = \{{\tilde{z}}_{1}, \dots, {\tilde{z}}_{n^{″}}\}

be realization of random vector

\tilde{Z}

as in Lemma A2 with parameter

β = n^{″} / n

. Then for given points of

{\tilde{Z}}_{n^{″}}

at

\tilde{z} = (\tilde{x}, \tilde{y})

, we have

| f_{\tilde{Z}} (\tilde{x}, \tilde{y}) - f_{X} (\tilde{x}) f_{Y} (\tilde{y}) | = O (\frac{1}{β n}) .

(A6)

Now, we want to provide a proof of assertion (11). Consider two subsets

Z_{n^{'}}^{'}

and

{\tilde{Z}}_{n^{″}}

as described in Section 2.2. Assume that the components of sample

{\tilde{Z}}_{n^{″}}

follow density function

{\tilde{f}}_{\tilde{X} \tilde{Y}}

. Therefore by owing to Lemmas A2 and A3, when

n \to \infty

then

{\tilde{f}}_{\tilde{X} \tilde{Y}} \to f_{X} f_{Y}

. Let

M_{n^{'}}

and

N_{n^{″}}

be Poisson variables with mean

n^{'}

and

n^{″}

independent of one another and

{Z_{i}^{'}}

and

{{\tilde{Z}}_{j}}

. Assume two Poisson processes

Z_{n^{'}}^{'} = \{Z_{1}^{'}, \dots, Z_{M_{n^{'}}}^{'}\}

and

{\tilde{Z}}_{n^{″}} = \{{\tilde{Z}}_{1}, \dots, {\tilde{Z}}_{N_{n^{″}}}\}

, and denote the FR statistic

R_{n^{'}, n^{″}}^{'}

on these processes. Following the arguments in [13,46] we shall prove the following:

\frac{E [R_{n^{'}, n^{″}}^{'}]}{n^{'} + n^{″}} \to 2 α β \int \int \frac{f_{X, Y} (x, y) f_{X} (x) f_{Y} (y)}{α f_{X Y} (x, y) + β f_{X} (x) f_{Y} (y)} d x d y .

This follows due to

| R_{n^{'}, n^{″}}^{'} - R_{n^{'}, n^{″}} | \leq K_{d} (| M_{n^{'}} - n^{'} | + | N_{n^{″}} - n^{″} |)

, where

K_{d}

is a constant defined in Lemma 1, [13] and

n^{'} + n^{″} = n

. Thus,

{(n^{'} + n^{″})}^{- 1} E | R_{n^{'}, n^{″}}^{'} - R_{n^{'}, n^{″}} | \to 0

as

n \to \infty

. Let

W_{1}^{n^{'}, n^{″}}, W_{2}^{n^{'}, n^{″}}, \dots

be independent variables with common density

ϕ_{n^{'}, n^{″}} (x, y) = (n^{'} f_{X Y} (x, y) + n^{″} {\tilde{f}}_{\tilde{X} \tilde{Y}} (x, y)) / (n^{'} + n^{″}),

for

(x, y) \in R^{d} \times R^{d}

. Let

L_{n^{'}, n^{″}}

be an independent Poisson variable with mean

n^{'} + n^{″}

. Let

F_{n^{'}, n^{″}}^{'} = \{W_{1}^{n^{'}, n^{″}}, \dots, W_{L_{n^{'}, n^{″}}}^{n^{'}, n^{″}}\}

a non-homogeneous Poisson process of rate

n^{'} f_{X Y} + n^{″} {\tilde{f}}_{\tilde{X} \tilde{Y}}

. Assign mark 1 to a point in

F_{n^{'}, n^{″}}^{'}

with probability

n^{'} f_{X Y} (x, y) / (n^{'} f_{X Y} (x, y) + n^{″} {\tilde{f}}_{\tilde{X} \tilde{Y}} (x, y)),

and mark 2 otherwise. By the marking theorem [13,47], the FR test statistic

{\tilde{R}}_{n^{'}, n^{″}}

has the same distribution as

R_{n^{'}, n^{″}}^{'}

. Given points of

F_{n^{'}, n^{″}}^{'}

at

z^{'} = (x^{'}, y^{'})

and

z^{″} = (x^{″}, y^{″})

, the probability that they have different marks is given by (A7).

g_{n^{'}, n^{″}} (z^{'}, z^{″}) = \frac{n^{'} f_{X Y} (x^{'}, y^{'}) n^{″} {\tilde{f}}_{\tilde{X}, \tilde{Y}} (x^{″}, y^{″}) + n^{″} {\tilde{f}}_{\tilde{X}, \tilde{Y}} (x^{'}, y^{'}) n^{'} f_{X Y} (x^{″}, y^{″})}{(n^{'} f_{X Y} (x^{'}, y^{'}) + n^{″} {\tilde{f}}_{\tilde{X}, \tilde{Y}} (x^{'}, y^{'})) (n^{'} f_{X Y} (x^{″}, y^{″}) + n^{″} {\tilde{f}}_{\tilde{X}, \tilde{Y}} (x^{″}, y^{″}))},

(A7)

define

g (z^{'}, z^{″}) = \frac{α β (f_{X Y} (x^{'}, y^{'}) f_{X} (x^{″}) f_{Y} (y^{″}) + f_{X} (x^{'}) f_{Y} (y^{'}) f_{X Y} (x^{″}, y^{″}))}{(α f_{X Y} (x^{″}, y^{″}) + β f_{X} (x^{″}) f_{Y} (y^{″})) (α f_{X Y} (x^{'}, y^{'}) + β f_{X} (x^{'}) f_{Y} (y^{'}))},

(A8)

then

\begin{matrix} E [{\tilde{R}}_{n^{'}, n^{″}}^{'} | F_{n^{'}, n^{″}}^{'}] = \underset{i < j \leq L_{n^{'}, n^{″}}}{\sum \sum} g_{n} (W_{i}^{n^{'}, n^{″}}, W_{j}^{n^{'}, n^{″}}) I_{F_{n^{'}, n^{″}}^{'}} (W_{i}^{n^{'}, n^{″}}, W_{j}^{n^{'}, n^{″}}) . \end{matrix}

(A9)

Now recall (A8). We observe that

g_{n^{'}, n^{″}} (z^{'}, z^{″}) \to g (z^{'}, z^{″})

. Going back to (A9), we can write

\begin{matrix} E [{\tilde{R}}_{n^{'}, n^{″}}^{'}] = \underset{i < j \leq L_{n^{'}, n^{″}}}{\sum \sum} g_{n^{'}, n^{″}} (W_{i}^{n^{'}, n^{″}}, W_{j}^{n^{'}, n^{″}}) I_{F_{n^{'}, n^{″}}^{'}} (W_{i}^{n^{'}, n^{″}}, W_{j}^{n^{'}, n^{″}}) + o (n^{'} + n^{″}) . \end{matrix}

(A10)

For fixed

n^{'}

,

n^{″}

consider the collection:

F_{n^{'}, n^{″}} = \{W_{1}^{n^{'}, n^{″}}, \dots, W_{n^{'}, n^{″}}^{n^{'} + n^{″}}\} .

By the fact that

E [M_{n^{'}} + N_{n^{″}} - (n^{'} + n^{″})] = o (n^{'} + n^{″})

, we have

\begin{matrix} E [{\tilde{R}}_{n^{'}, n^{″}}^{'}] = \underset{i < j \leq n^{'} + n^{″}}{\sum \sum} g_{n^{'}, n^{″}} (W_{i}^{n^{'}, n^{″}}, W_{j}^{n^{'}, n^{″}}) I_{F_{n^{'}, n^{″}}} (W_{i}^{n^{'}}, W_{j}^{n^{″}}) + o (n^{'} + n^{″}) . \end{matrix}

(A11)

Introduce

ϕ (x, y) = α f_{X Y} (x, y) + β f_{X} (x) f_{Y} (y) .

Then

ϕ_{n^{'}, n^{″}} (x, y) \to ϕ (x, y)

uniformly as

n^{'} / n \to α

and

n^{″} / n \to β

. Thus, using Proposition 1 in [13], we get

\begin{matrix} \frac{E [{\tilde{R}}_{n^{'}, n^{″}}^{'}]}{n} \to \int g (z, z) ϕ (z) d z = \int \int \frac{2 α β f_{X Y} (x, y) f_{X} (x) f_{Y} (y)}{α f_{X Y} (x, y) + β f_{X} (x) f_{Y} (y)} d x d y . \end{matrix}

(A12)

□

Appendix A.5. Theorem 4

Proof.

We begin by providing a family of bias rate bounds for the FR test statistic

R_{n^{'}, n^{″}}

in terms of a parameter l. Assume

f_{X Y}

,

f_{X}

, and

f_{Y}

are in

Σ_{d} (η, K)

. Then by plugging the optimal l, we prove the bias rate bound given in (13).

Theorem A1.

Let

R_{n^{'}, n^{″}} : = R (Z_{n^{'}}, Z_{n^{″}})

be the FR test statistic. Then a bound on the bias rate of the

R_{n^{'}, n^{″}}

estimator for

0 < η \leq 1

,

d \geq 2

is given by

\begin{matrix} | \frac{E [R_{n^{'}, n^{″}}]}{n} - 2 α β \int \int \frac{f_{X Y} (x, y) f_{X} (x) f_{Y} (y)}{α f_{X Y} (x, y) + β f_{X} (x) f_{Y} (y)} d x | \\ \leq O (l^{d} {(n)}^{- η / d}) + O (l^{- d η}) + O (l^{d} β^{- 1} n^{- 1}) + O (c_{d} n^{- 1}), \end{matrix}

(A13)

where

0 < η \leq 1

is the Hölder smoothness parameter and

c_{d}

is the largest possible degree of any vertex of MST.

Set

\begin{matrix} α_{i} = α n a_{i} l^{d} (1 - a_{i} l^{- d}) + {(α n)}^{2} a_{i}^{2}, \\ β_{i} = β n b_{i} l^{d} (1 - b_{i} l^{- d}) + {(β n)}^{2} b_{i}^{2} . \end{matrix}

and

A_{f, n}^{β, α} (x, y) = \frac{2 f_{X Y} (x, y) (f_{X} (x) f_{Y} (y) + δ_{f} / (β n)) (f_{X Y} (x, y) \sqrt{α} + (f_{X} (x) f_{Y} (y) + δ_{f} / (β n)) \sqrt{β})}{a_{i}^{2} l^{- d} {(α f_{X Y} (x, y) + β (f_{X} (x) f_{Y} (y) + δ_{f} / (β n)))}^{2}},

(A14)

where

δ_{f} = \int \int | f_{X Y} (x, y) - f_{X} (x) f_{Y} (y) | d x d y,

(A15)

A more explicit form for the bound on the RHS is given below:

\begin{matrix} Δ (α, f_{X Y}, f_{X} f_{Y}) : = c_{2} l^{d} {(n)}^{- 1} + c_{d} 2^{d} {(n)}^{- 1} + O (l^{d} {(n)}^{- η / d}) + O (l^{d} {(n)}^{- 1 / 2}) \\ + O (c_{d} {(n)}^{- 1 / 2}) + 2 c_{1} l^{d - 1} {(n)}^{(1 / d) - 1} + δ_{f} ({(β n)}^{- 1}) \int \int \frac{2 α β f_{X Y} (x, y)}{α f_{X Y} (x, y) + β f_{X} (x) f_{Y} (y)} d x d y \\ + {(n)}^{- 1} \sum_{i = 1}^{M} 2 \int \int f_{X Y} (x, y) (f_{X} (x) f_{Y} (y) + δ_{f} / (β n)) (α_{i} β_{i} (α n a_{i} l^{- d} f_{X Y}^{2} (x, y) \\ + β n b_{i} l^{- d} {(f_{X} (x) f_{Y} (y) + δ_{f} / (β n))}^{2}))^{1 / 2} / {(α n a_{i} f_{X Y} (x, y) + β n b_{i} f_{X} (x) f_{Y} (y))}^{2} d x d y \\ + {(n)}^{- 1} \sum_{i = 1}^{M} O (l) \int \int l^{d} {(a_{i})}^{- 1} \frac{2 f_{X Y} (x, y) (f_{X} (x) f_{Y} (y) + δ_{f} / (β n))}{α f_{X Y} (x, y) + β f_{X} (x) f_{Y} (y)} d x d y + O (l^{- d η}) \\ + {(n)}^{- 3 / 2} \sum_{i = 1}^{M} O (l) \int \int l^{- d / 2} \sqrt{b_{i}} A_{f, n}^{β, α} (x, y) d x d y . \end{matrix}

(A16)

Proof. Consider two Poisson variables

M_{n^{'}}

and

N_{n^{″}}

with mean

n^{'}

and

n^{″}

respectively and independent of one another and

{Z_{i}^{'}}

and

{{\tilde{Z}}_{j}}

. Let

Z_{n^{'}}^{'}

and

{\tilde{Z}}_{n^{″}}

be the Poisson processes

{Z_{1}^{'}, \dots, Z_{M_{n^{'}}}^{'}}

and

{{\tilde{Z}}_{1}, \dots, {\tilde{Z}}_{N_{n^{″}}}}

. Likewise Appendix A.4, set

R_{n^{'}, n^{″}}^{'} = R (Z_{n^{'}}^{'}, {\tilde{Z}}_{n^{″}})

. Applying Lemma 1, and (12) in [13], we can write

| R_{n^{'}, n^{″}}^{'} - R_{n^{'}, n^{″}} | \leq c_{d} (| M_{n^{'}} - n^{'} | + | N_{n^{″}} - n^{″} |) .

(A17)

Here

c_{d}

denotes the largest possible degree of any vertex of the MST in

R^{d}

. Following the arguments in [46], we have

E [| M_{n^{'}} - n^{'} |] = O ({n^{'}}^{1 / 2})

and

E [| N_{n^{″}} - n^{″} |] = O ({n^{″}}^{1 / 2})

. Hence

\frac{E [R_{n^{'}, n^{″}}]}{n^{'} + n^{″}} = \frac{E [R_{n^{'}, n^{″}}^{'}]}{n^{'} + n^{″}} + O (c_{d} {(n^{'} + n^{″})}^{- 1 / 2}) .

(A18)

Next let

n_{i}^{'}

and

n_{i}^{″}

be independent binomial random variables with marginal densities

B (n^{'}, a_{i} l^{- d})

and

B (n^{″}, b_{i} l^{- d})

such that

a_{i}, b_{i}

are non-negative constants

a_{i} \leq b_{i}

and

\sum_{i = 1}^{l^{d}} a_{i} l^{- d} = \sum_{i = 1}^{l^{d}} b_{i} l^{- d} = 1

. Therefore, using the subadditivity property in Lemma 2.2, [46], we can write

\begin{matrix} E [R_{n^{'}, n^{″}}^{'}] \leq \sum_{i = 1}^{M} E [E [R_{n_{i}^{'}, n_{i}^{″}}^{'} | n_{i}^{'}, n_{i}^{″}]] + 2 c_{1} l^{d - 1} {(n^{'} + n^{″})}^{1 / d}, \end{matrix}

(A19)

where

M = l^{d}

, and

η > 0

is the Hölder smoothness parameter. Furthermore, for given

n_{i}^{'}

,

n_{i}^{″}

, let

W_{1}^{n_{i}^{'}, n_{i}^{″}}, W_{2}^{n_{i}^{'}, n_{i}^{″}}, \dots

be independent variables with common densities for

(x, y) \in R^{d} \times R^{d}

:

g_{n_{i}^{'}, n_{i}^{″}} (x, y) = (n_{i}^{'} f_{X Y} (x, y) + n_{i}^{″} {\tilde{f}}_{\tilde{X} \tilde{Y}} (x, y)) / (n_{i}^{'} + n_{i}^{″}) .

Denote

L_{n_{i}^{'}, n_{i}^{″}}

be an independent Poisson variable with mean

n_{i}^{'} + n_{i}^{″}

and

F_{n_{i}^{'}, n_{i}^{″}}^{'} = \{W_{1}^{n_{i}^{'}, n_{i}^{″}}, \dots, W_{L_{n_{i}^{'} . n_{i}^{″}}}^{n_{i}^{'}, n_{i}^{″}}\}

a non-homogeneous Poisson of rate

n_{i}^{'} f_{X Y} + n_{i}^{″} {\tilde{f}}_{\tilde{X} \tilde{Y}}

. Let

F_{n_{i}^{'}, n_{i}^{″}}

be the non-Poisson point process

\{W_{1}^{n_{i}^{'}, n_{i}^{″}}, \dots W_{n_{i}^{'} + n_{i}^{″}}^{n_{i}^{'}, n_{i}^{″}}\}

. Assign a mark from the set

{1, 2}

to each point of

F_{n_{i}^{'}, n_{i}^{″}}^{'}

. Let

{\tilde{Z}}_{n_{i}^{'}}^{'}

be the sets of points marked 1 with each probability

n_{i}^{'} f_{X Y} (x, y) / (n_{i}^{'} f_{X Y} (x, y) + n_{i}^{″} {\tilde{f}}_{\tilde{X} \tilde{Y}} (x, y))

and let

{\tilde{Z}}_{n_{i}^{″}}^{″}

be the set points with mark 2. Please note that owing to the marking theorem [47],

{\tilde{Z}}_{n_{i}^{'}}^{'}

and

{\tilde{Z}}_{n_{i}^{″}}^{″}

are independent Poisson processes with the same distribution as

Z_{n_{i}^{'}}^{'}

and

{\tilde{Z}}_{n_{i}^{″}}

, respectively. Considering

{\tilde{R}}_{n_{i}^{'}, n_{i}^{″}}^{'}

as FR test statistic on nodes in

{\tilde{Z}}_{n_{i}^{'}}^{'} \cup {\tilde{Z}}_{n_{i}^{″}}^{″}

, we have

E [R_{n_{i}^{'}, n_{i}^{″}}^{'} | n_{i}^{'}, n_{i}^{″}] = E [{\tilde{R}}_{n_{i}^{'}, n_{i}^{″}}^{'} | n_{i}^{'}, n_{i}^{″}] .

By the fact that

E [| M_{n^{'}} + N_{n^{″}} - n^{'} - n^{″} |] = O ({(n^{'} + n^{″})}^{1 / 2})

, we have

\begin{matrix} E [{\tilde{R}}_{n_{i}^{'}, n_{i}^{″}}^{'} | n_{i}^{'}, n_{i}^{″}] = E [E [{\tilde{R}}_{n_{i}^{'}, n_{i}^{″}}^{'} | F_{n_{i}^{'}, n_{i}^{″}}^{'}]] \\ = E [\underset{s < j < n_{i}^{'} + n_{i}^{″}}{\sum \sum} P_{n_{i}^{'}, n_{i}^{″}} (W_{s}^{n_{i}^{'}, n_{i}^{″}}, W_{j}^{n_{i}^{'}, n_{i}^{″}}) 1 \{(W_{s}^{n_{i}^{'}, n_{i}^{″}}, W_{j}^{n_{i}^{'}, n_{i}^{″}}) \in F_{n_{i}^{'}, n_{i}^{″}}\}] + O ({(n_{i}^{'} + n_{i}^{″})}^{1 / 2})) . \end{matrix}

Here

z^{'} = (x^{'}, y^{'})

,

z^{″} = (x^{″}, y^{″})

, and

P_{n_{i}^{'}, n_{i}^{″}} (z^{'}, z^{″})

is given in below:

\begin{matrix} P_{n_{i}^{'}, n_{i}^{″}} (z^{'}, z^{″}) : = P_{r} \{mark z^{'} \neq mark z^{″}, (z^{'}, z^{″}) \in F_{n_{i}^{'}, n_{i}^{″}}\} \\ = \frac{n_{i}^{'} f_{X Y} (x^{'}, y^{'}) n_{i}^{″} {\tilde{f}}_{\tilde{X} \tilde{Y}} (x^{″}, y^{″}) + n_{i}^{'} {\tilde{f}}_{\tilde{X} \tilde{Y}} (x^{'}, y^{'}) n_{i}^{″} f_{X, Y} (x^{″}, y^{″})}{(n_{i}^{″} f_{X Y} (x^{'}, y^{'}) + n_{i}^{″} {\tilde{f}}_{\tilde{X} \tilde{Y}} (x^{'}, y^{'})) (n_{1}^{'} f_{X Y} (x^{″}, y^{″}) + n_{i}^{″} {\tilde{f}}_{\tilde{X} \tilde{Y}} (x^{″}, y^{″}))} . \end{matrix}

Next set

\begin{matrix} α_{i} = n^{'} a_{i} l^{d} (1 - a_{i} l^{- d}) + {n^{'}}^{2} a_{i}^{2}, β_{i} = n^{″} b_{i} l^{d} (1 - b_{i} l^{- d}) + {n^{″}}^{2} b_{i}^{2} . \end{matrix}

By owing to the Lemma B.6 in [46] and applying analogous arguments, we can rewrite the expression in (A20):

\begin{matrix} E [{R^{'}}_{n^{'}, n^{″}}] \leq \sum_{i = 1}^{M} a_{i} b_{i} l^{- d} \int \int \frac{2 n^{'} n^{″} f_{X Y} (x, y) f_{\tilde{X} \tilde{Y}} (x, y)}{n^{'} a_{i} f_{X, Y} (x, y) + n^{″} b_{i} f_{\tilde{X} \tilde{Y}} (x, y)} d x d y + 2 c_{1} l^{d - 1} {(n^{'} + n^{″})}^{1 / d} \\ + \sum_{i = 1}^{M} 2 \int \int \frac{f_{X Y} (x, y) f_{\tilde{X} \tilde{Y}} (x, y) {(α_{i} β_{i} (n^{'} a_{i} l^{- d} f_{X Y}^{2} (x, y) + n^{″} b_{i} l^{- d} f_{\tilde{X} \tilde{Y}}^{2} (x, y)))}^{1 / 2}}{{(n^{'} a_{i} f_{X Y} (x, y) + n^{″} b_{i} f_{\tilde{X} \tilde{Y}} (x, y))}^{2}} d x d y \\ + \sum_{i = 1}^{M} E_{n_{i}^{'}, n_{i}^{″}} [(n_{i}^{'} + n_{i}^{″}) ς_{η} (l, n_{i}^{'}, n_{i}^{″})] + O (l^{d} {(n^{'} + n^{″})}^{1 - η / d}) + O (l^{d} {(n^{'} + n^{″})}^{1 / 2}), \end{matrix}

(A20)

where

ς_{η} (l, n_{i}^{'}, n_{i}^{″}) = (O (\frac{l}{n_{i}^{'} + n_{i}^{″}}) - \frac{2 l^{d}}{n_{i}^{'} + n_{i}^{″}}) \int g_{n_{i}^{'}, n_{i}^{″}} (z^{'}) P_{n_{i}^{'}, n_{i}^{″}} (z^{'}, z^{'}) d z^{'} + O (l^{- d η}) .

Going back to Lemma A3, we know that

f_{\tilde{X} \tilde{Y}} (x, y) = f_{X} (x) f_{Y} (y) + O (\frac{1}{β n}) .

Therefore, the first term on the RHS of (A20) is less and equal to

\begin{matrix} \sum_{i = 1}^{M} a_{i} b_{i} l^{- d} \int \int \frac{2 n^{'} n^{″} f_{X Y} (x, y) f_{X} (x) f_{Y} (y)}{n^{'} a_{i} f_{X, Y} (x, y) + n^{″} b_{i} f_{X} (x) f_{Y} (y)} d x d y \\ + (\frac{δ_{f}}{β n}) \sum_{i = 1}^{M} a_{i} b_{i} l^{- d} \int \int \frac{2 n^{'} n^{″} f_{X Y} (x, y)}{n^{'} a_{i} f_{X Y} (x, y) + n^{″} b_{i} f_{X} (x) f_{Y} (y)} d x d y, \end{matrix}

and the second term is less and equal to

\begin{matrix} \sum_{i = 1}^{M} 2 \int \int f_{X Y} (x, y) (f_{X} (x) f_{Y} (y) + δ_{f} / (β n)) (α_{i} β_{i} (n^{'} a_{i} l^{- d} f_{X Y}^{2} (x, y) + n^{″} b_{i} l^{- d} (f_{X} (x) f_{Y} (y) \\ + δ_{f} / (β n))^{2}))^{1 / 2} / {(n^{'} a_{i} f_{X Y} (x, y) + n^{″} b_{i} f_{X} (x) f_{Y} (y))}^{2} d x d y, \end{matrix}

where

δ_{f} = \int \int | f_{X Y} (x, y) - f_{X} (x) f_{Y} (y) | d x d y .

Recall the definition of the dual MST and FR statistic denoted by

R_{n^{'}, n^{″}}^{*}

following [46]:

Definition A1.

(Dual MST,

{MST}^{*}

and dual FR statistic

R_{m, n}^{*}

) Let

F_{i}

be the set of corner points associated with a particular subsection

Q_{i}

,

1 \leq i \leq l^{d}

of

{[0, 1]}^{d}

. Define the dual

{MST}^{*} (X_{m} \cup Y_{n} \cap Q_{i})

as the boundary MST graph in partition

Q_{i}

[48], which contains

X_{m}

and

Y_{n}

points falling inside partition cell

Q_{i}

and those corner points in

F_{i}

which minimize total MST length. Please note that it is allowed to connect the MSTs in

Q_{i}

and

Q_{j}

through points strictly contained in

Q_{i}

and

Q_{j}

and therefore corner points. Thus, the dual MST can connect the points in

Q_{i} \cup Q_{j}

by direct edges to pair to another point in

Q_{i} \cup Q_{j}

or by passing through the corner the corner points which are all connected in order to minimize the total weights. To clarify, assume that there are two points in

Q_{i} \cup Q_{j}

, then the dual MST consists of the two edges connecting these points to the corner if they are closer to a corner point otherwise the dual MST connects them to each other.

Furthermore,

R_{m, n}^{*} (X_{m}, Y_{n} \cap Q_{i})

is defined as the number of edges in the

{MST}^{*}

graph connecting nodes from different samples and number of edges connecting to the corner points. Please note that the edges connected to the corner nodes (regardless of the type of points) are always counted in the dual FR test statistic

R_{m, n}^{*}

.

Similarly, consider the Poisson processes samples and the FR test statistic over these samples, denoted by

{R^{'}}_{n^{'}, n^{″}}^{*}

. By superadditivity of the dual

R_{n^{'}, n^{″}}^{*}

[46], we have

\begin{matrix} E [R_{n^{'}, n^{″}}^{' *}] \geq \sum_{i = 1}^{M} a_{i} l^{- d} \int \int \frac{2 n^{'} n^{″} f_{X Y} (x, y) (f_{X} (x) f_{Y} (y) - δ_{f} / (β n))}{n^{'} f_{X Y} (x, y) + n^{″} (f_{X} (x) f_{Y} (y) - δ_{f} / (β n))} d x d y \\ - \sum_{i = 1}^{M} E_{n_{i}^{'}, n_{i}^{″}} [(n_{i}^{'} + n_{i}^{″}) ς_{η} (l, n_{i}^{'}, n_{i}^{″})] - O (l^{d} {(n^{'} + n^{″})}^{1 - η / d}) - O (l^{d} {(n^{'} + n^{″})}^{1 / 2}) - c_{2} l^{d} . \end{matrix}

(A21)

The first term of RHS in (A21) is greater or equal to

\begin{matrix} \int \int \frac{2 n^{'} n^{″} f_{X Y} (x, y) f_{X} (x) f_{Y} (y)}{n^{'} f_{X Y} (x, y) + n^{″} f_{X} (x) f_{Y} (y)} d x d y - \frac{δ_{f}}{β n} \int \int \frac{2 n^{'} n^{″} f_{X Y} (x, y)}{n^{'} f_{X Y} (x, y) + n^{″} f_{X} (x) f_{Y} (y)} d x d y . \end{matrix}

Furthermore,

\frac{E [R_{n^{'}, n^{″}}^{'}]}{n} + \frac{c_{d} 2^{d}}{n} \geq \frac{E [R_{n^{'}, n^{″}}^{' *}]}{n},

where

c_{d}

is the largest possible degree of any vertex of the MST in

R^{d}

, as before. Consequently, we have

\begin{matrix} | \frac{E [R_{n^{'}, n^{″}}^{'}]}{n} - \int \int \frac{2 α β f_{X Y} (x, y) f_{X} (x) f_{Y} (y)}{α f_{X Y} (x, y) + β f_{X} (x) f_{Y} (y)} d x d y | \leq B (α, f_{X Y}, f_{X} f_{Y}), \end{matrix}

(A22)

where

B

is defined in (A16) and

A_{f, n}^{β, α} (x, y)

has been introduced in (A14). The last line in (A22) follows from the fact that

\begin{matrix} \sum_{i = 1}^{M} E_{n_{i}^{'}, n_{i}^{″}} [(n_{i}^{'} + n_{i}^{″}) ς_{η} (l, n_{i}^{'}, n_{i}^{″})] \leq \sum_{i = 1}^{M} O (l) \int \int l^{- d / 2} \sqrt{b_{i}} A_{f, n}^{β, n^{'} / n} (x, y) d x d y \\ + \sum_{i = 1}^{M} O (l) \int \int l^{d} {(a_{i})}^{- 1} \frac{2 f_{X Y} (x, y) (f_{X} (x) f_{Y} (y) + O (δ_{f} / (β n)))}{n^{'} f_{X Y} (x, y) + n^{″} f_{X} (x) f_{Y} (y)} d x d y . \end{matrix}

Here

A_{f, n}^{β, n^{'} / n} (x, y)

is given as (A14) by substituting

n^{'} / n

in

α

such that

β = 1 - α

. Hence, the proof of Theorem A1 is completed. □

Going back to the proof of (13), without loss of generality assume that

(n) l^{- d} > 1

, for

d \geq 2

and

0 < η \leq 1

. We select l as a function of n and

β

to be the sequence increasing in n which minimizes the maximum of these rates:

l (n, β) = a r g min_{l} max \{l^{d} {(n)}^{- η / d}, l^{- η d}, l^{d} β^{- 1} n^{- 1}, c_{d} 2^{d} n^{- 1}\} .

(A23)

The solution

l = l (n, β)

is obtained when

l^{d} {(n)}^{- η / d} = l^{- η d}

, or equivalently

l = ⌊ {(n)}^{η / (d^{2} (η + 1))} ⌋

or when

l^{d} β^{- 1} n^{- 1} = l^{- η d}

, which implies

l = ⌊ {(β n)}^{1 / (d (1 + η))} ⌋

. Substitute this l in the bound (A13) to obtain the RHS expression in (13) for

d \geq 2

. □

Appendix A.6.

Our main goal in Section 2.4 was to find proportion

α

such that the parametric MSE rate depending on the joint density

f_{X Y}

and marginal densities

f_{X}, f_{Y}

is minimized. Recalling the explicit bias bound in (A16), it can be seen that this function will be a complicated function of

f_{X Y}

,

f_{X} f_{Y}

and

α

. By rearrangement of terms in (A13), we first find an upper bound for

Δ

in (A16), denoted by

\bar{Δ}

, as follows:

\begin{matrix} \bar{Δ} (α, f_{X Y}, f_{X} f_{Y}) = D (n, l_{n}, d, η) + \tilde{D} (n, l_{n}, d) E_{X Y} [G_{f, n}^{α, β} (X, Y)], \end{matrix}

(A24)

where

l_{n} : = ⌊n^{η / (d^{2} (1 + η))}⌋

. From Appendix A.5 we know that optimal l is given by (A23). One can check that for

α \leq 1 - n^{(η / d) - 1}

, the optimal

l = ⌊n^{η / (d^{2} (1 + η))}⌋

provides a tighter bound. In (A24), the constants D and

\bar{D}

are

\begin{matrix} D (n, l_{n}, d, η) = c_{2} l_{n}^{d} n^{- 1} + c_{d} 2^{d} n^{- 1} + c^{'} l_{n}^{d} n^{- η / d} + c l_{n}^{d} n^{- 1 / d} + 2 c_{1} l_{n}^{d - 1} n^{1 / d - 1} + c_{3} l_{n}^{- d η}, \end{matrix}

(A25)

\begin{matrix} \tilde{D} (n, l_{n}, d) = 2 + n^{- 1} 2 c^{″} \sum_{i = 1}^{M} l_{n} l_{n}^{d} a_{i}^{- 1} + n^{- 3 / 2} 2 c_{1}^{'} \sum_{i = 1}^{M} l_{n} l_{n}^{d / 2} \sqrt{b_{i}} a_{i}^{2} \\ + n^{- 1} \sum_{i = 1}^{M} 2 n^{- 3 / 2} l_{n}^{- d / 2} \frac{\sqrt{b_{i}}}{a_{i}^{2}} {(n a_{i} l_{n}^{d} + n^{2} a_{i}^{2})}^{1 / 2} {(n b_{i} l_{n}^{d} + n^{2} b_{i}^{2})}^{1 / 2} . \end{matrix}

(A26)

And the function

G_{f, n}^{α, β} (x, y)

is given as the following:

\begin{matrix} G_{f, n}^{α, β} (x, y) = (f_{X} (x) f_{Y} (y) + δ_{f} / (n β)) (\sqrt{α} f_{X Y} (x, y) \\ + \sqrt{β} (f_{X} (x) f_{Y} (y) + δ_{f} / (β n)) / {(α f_{X Y} (x, y) + β f_{X} (x) f_{Y} (y))}^{2}, \end{matrix}

(A27)

where

δ_{f}

is given in (A15). Next After all still the expression (A27) is complicated to optimize therefore we use the fact that

0 \leq α, β \leq 1

to bound the function

G_{f, n}^{α, β} (x, y)

. Define the set

Γ

Γ : = \{ϵ_{X Y} : | ϵ_{X Y} (t) - ϵ_{X Y} (t^{'}) | \leq \bar{K} {∥ t - t^{'} ∥}_{d}^{η}\},

where

\bar{K} = C_{ϵ}^{U} K \{C_{X Y}^{L} + C_{X}^{L} + C_{Y}^{L} C_{X}^{L} C_{X}^{U}\} .

Here K is the smoothness constant in the Hölder class. Notice that set

Γ

is a convex set. We bound

\bar{Δ}

by

\tilde{Δ} (α, ϵ_{X Y}) = D (n, l_{n}, d, η) + \tilde{D} (n, l_{n}, d) C_{X Y}^{U} \underset{S_{X Y}}{\int \int} {\tilde{G}}_{ϵ_{X Y}, n}^{α, β} (x, y) d x d y .

(A28)

Set

C_{n} = C_{X Y}^{L} n / 2

,

{\tilde{G}}_{n}^{α, β} (ϵ_{X Y}) = \frac{(ϵ_{X Y}^{- 1} (x, y) + {(β C_{n})}^{- 1}) (1 + ϵ_{X Y}^{- 1} (x y) + {(β C_{n})}^{- 1})}{{(α + β ϵ_{X Y}^{- 1} (x, y))}^{2}} .

(A29)

This simplifies to

{\tilde{G}}_{n}^{α, β} (ϵ_{X Y}) = \frac{(1 + {(β C_{n})}^{- 1} ϵ_{X Y}) (1 + ϵ_{X Y} + {(β C_{n})}^{- 1} ϵ_{X Y})}{{(α ϵ_{X Y} + β)}^{2}} .

(A30)

Under the condition

\frac{2}{C_{n}} \leq α \leq min \{\frac{1}{2} + \frac{1}{2 C_{n}}, \frac{1}{3} + \frac{2}{3 C_{n}}\},

(A31)

{\tilde{G}}_{n}^{α, β} (ϵ_{X Y})

is an increasing function in

ϵ

. Furthermore, for

α \leq \frac{1}{4}

and

C_{ϵ}^{L} \leq ϵ_{X Y} \leq min \{C_{ϵ}^{U}, θ^{U} (α)\}, where θ^{U} (α) = \frac{1 - 4 α + 1 / C_{n}}{2 α},

(A32)

the function

{\tilde{G}}_{n}^{α, β} (ϵ_{X Y})

is strictly concave. Next, to find an optimal

α

we consider the following optimization problem:

\begin{matrix} \min_{α} \max_{ϵ_{X Y} \in Γ} & \tilde{Δ} (α, ϵ_{X Y}) + c_{d} (1 - α) n^{- 1} \\ subject to & C_{ϵ}^{L} \leq ϵ_{X Y} \leq C_{ϵ}^{U}, \end{matrix}

(A33)

here

ϵ_{X Y} = f_{X Y} / f_{X} f_{Y}

,

C_{ϵ}^{U} = C_{X Y}^{U} / C_{X}^{L} C_{Y}^{L}

and

C_{ϵ}^{L} = C_{X Y}^{L} / C_{X}^{U} C_{Y}^{U}

, such that

C_{ϵ}^{L} \leq 1

. We know that under conditions (A31) and (A32), the function

{\tilde{G}}_{n}^{α, β}

is strictly concave and increasing in

ϵ_{X Y}

. We first solve the optimization problem:

\begin{matrix} \max_{ϵ_{X Y} \in Γ} & \underset{S_{X Y}}{\int \int} {\tilde{G}}_{n}^{α, β} (ϵ_{X Y} (x, y)) d x d y \\ subject to & θ_{ϵ}^{L} (α) V (S_{X Y}) \leq \underset{S_{X Y}}{\int \int} ϵ_{X Y} (x, y) d x d y \\ \leq θ_{ϵ}^{U} (α) V (S_{X Y}), \end{matrix}

(A34)

where

θ_{ϵ}^{L} (α) : = C_{ϵ}^{L}, θ_{ϵ}^{U} (α) : = min {C_{ϵ}^{U}, θ^{U} (α)} .

(A35)

The Lagrangian for this problem is

\begin{matrix} L (ϵ_{X Y}, λ_{1}, λ_{2}) = {\int \int}_{S_{X Y}} {\tilde{G}}_{n}^{α, β} (ϵ_{X Y} (x, y)) d x d y - λ_{1} ({\int \int}_{S_{X Y}} ϵ_{X Y} (x, y) d x d y - θ_{ϵ}^{U} (α) V (S_{X Y})) \\ - λ_{2} (θ_{ϵ}^{L} (α) V (S_{X Y}) - {\int \int}_{S_{X Y}} ϵ_{X Y} (x, y) d x d y) . \end{matrix}

In this case, the optimum

ϵ_{X Y}^{*}

is bounded,

θ_{ϵ}^{L} (α) \leq ϵ_{X Y}^{*} \leq θ_{ϵ}^{U} (α),

and the Lagrangian multiplier

λ_{1}^{*}, λ_{2}^{*} \geq 0

is such that

\min_{λ_{1}, λ_{2} \geq 0} \max_{ϵ_{X Y} \in Γ} L (ϵ_{X Y}, λ_{1}, λ_{2}) = L (ϵ_{X Y}^{*}, λ_{1}^{*}, λ_{2}^{*}) .

Set

G_{n}^{'} (ϵ_{X Y}) = \frac{d}{d ϵ_{X Y}} {\tilde{G}}_{n}^{α, β} (ϵ_{X Y}) .

In view of the concavity of

{\tilde{G}}_{ϵ_{X Y}, n}^{α, β}

and Lemma 1, page 227 in [49], maximizing

L (ϵ_{X Y}, λ_{1}^{*}, λ_{2}^{*})

over

ϵ_{X Y}

is equivalent to

{\int \int}_{S_{X Y}} \{G_{n}^{'} (ϵ_{X Y}^{*} (x, y)) - (λ_{1}^{*} - λ_{2}^{*})\} ϵ_{X Y} (x, y) d x d y \leq 0,

(A36)

for all

θ_{ϵ}^{L} (α) \leq ϵ_{X Y}^{*} \leq θ_{ϵ}^{U} (α)

, and

{\int \int}_{S_{X Y}} \{G_{n}^{'} (ϵ_{X Y}^{*} (x, y)) - (λ_{1}^{*} - λ_{2}^{*})\} ϵ_{X Y}^{*} (x, y) d x d y = 0 .

(A37)

Denote

{G^{'}}_{n}^{- 1}

the inverse function of

G_{n}^{'}

. Since

G_{n}^{'}

is strictly decreasing in

ϵ_{X Y}^{*}

(this is because

{\tilde{G}}_{n}^{α, β} (ϵ_{X Y})

is strictly concave, so that

{G^{'}}_{n}^{- 1}

is continuous and strictly decreasing in

ϵ_{X Y}^{*}

). From (A36) and (A37), we see immediately that on any interval

θ_{ϵ}^{L} (α) \leq ϵ_{X Y}^{*} \leq θ_{ϵ}^{U} (α)

, we have

ϵ_{X Y}^{*} = {G^{'}}_{n}^{- 1} (λ_{1}^{*} - λ_{2}^{*})

. We can write then

G_{n}^{'} (θ_{ϵ}^{U} (α)) \leq λ_{1}^{*} - λ_{2}^{*} \leq G_{n}^{'} (θ_{ϵ}^{L} (α)),

and

λ_{1}^{*}, λ_{2}^{*} \geq 0

. Next, we find the solution of

\min_{λ_{1}, λ_{2} \geq 0} {\bar{G}}_{n}^{α, β} (λ_{1}, λ_{2}), where

\begin{matrix} {\bar{G}}_{n}^{α, β} (λ_{1}, λ_{2}) = V (S_{X Y}) \{{\tilde{G}}_{n}^{α, β} ({G^{'}}_{n}^{- 1} (λ_{1} - λ_{2})) - (λ_{1} - λ_{2}) {G^{'}}_{n}^{- 1} (λ_{1} - λ_{2}) + λ_{1} θ_{ϵ}^{U} (α) - λ_{2} θ_{ϵ}^{L} (α)\} . \end{matrix}

The function

{\bar{G}}_{n}^{α, β} (λ_{1}, λ_{2})

is increasing in

λ_{1}

and

λ_{2}

, and therefore it takes its minimum at

(λ_{1}^{*}, λ_{2}^{*}) = (G_{n}^{'} (θ_{ϵ}^{U} (α)), 0)

. This implies that

ϵ_{X Y}^{*} = θ_{ϵ}^{U} (α)

. Returning to our primary minimization over

α

:

\begin{matrix} \min_{α} & \tilde{Δ} (α, ϵ_{X Y}^{*}) + c_{d} (1 - α) n^{- 1} \\ subject to & α_{0}^{L} \leq α \leq α_{0}^{U}, \end{matrix}

(A38)

where

α_{0}^{L} = \frac{2}{C_{n}}

and

α_{0}^{U} = min \{\frac{1}{4}, 1 - n^{η / d - 1}\}

. We know that

\frac{1}{4} \leq \frac{1}{3} + \frac{2}{3 C_{n}}

and

\frac{1}{4} \leq \frac{1}{2} + \frac{1}{2 C_{n}}

, therefore the condition below

\frac{2}{C_{n}} \leq α \leq min \{\frac{1}{4}, 1 - n^{η / d - 1}\},

implies the constraint

α_{0}^{L} \leq α \leq α_{0}^{U}

. Since the objective function (A38) is a complicated function in

α

, it is not feasible to determine whether it is a convex function in

α

. For this reason, let us solve the optimization problem in (A38) in a special case when

C_{ϵ}^{U} \leq θ^{U} (α)

. This implies

ϵ_{X Y}^{*} = C_{ϵ}^{U}

. Under assumption

C_{ϵ}^{U}

the objective function in (A38) is convex in

α

. Also, the case

C_{ϵ}^{U} \leq θ^{U} (α)

is equivalent to

α \leq \frac{1 + 1 / C_{n}}{4 + 2 C_{ϵ}^{U}}

. Therefore, in the optimization problem we have constraint

\frac{2}{C_{n}} \leq α \leq min \{\frac{1}{4}, \frac{1 + 1 / C_{n}}{4 + 2 C_{ϵ}^{U}}, 1 - n^{η / d - 1}\} .

We know that

\tilde{Δ} (α, ϵ_{X Y}^{*}) + c_{d} (1 - α) n^{- 1}

is convex over

α \in [α_{0}^{L}, α_{0}^{U}]

. Therefore, the problem becomes ordinary convex optimization problem. Let

\tilde{α}

,

{\tilde{λ}}_{1}

and

{\tilde{λ}}_{2}

be any points that satisfy the KKT conditions for this problem:

\begin{matrix} α_{0}^{L} - \tilde{α} \leq 0, \tilde{α} - α_{0}^{U} \leq 0, {\tilde{λ}}_{1}, {\tilde{λ}}_{2} \geq 0, \\ {\tilde{λ}}_{1} (α_{0}^{L} - \tilde{α}) = 0, {\tilde{λ}}_{2} (\tilde{α} - α_{0}^{U}) = 0, \\ \frac{d}{d α} (\tilde{Δ} (\tilde{α}, ϵ_{X Y}^{*}) + c_{d} (1 - \tilde{α}) n^{- 1}) - {\tilde{λ}}_{1} + {\tilde{λ}}_{2} = 0 . \end{matrix}

(A39)

Recall

Ξ (α)

from (20):

\begin{matrix} Ξ (α) = \frac{d}{d α} (\tilde{Δ} (α, ϵ_{X Y}^{*}) + c_{d} (1 - α) n^{- 1}), \end{matrix}

where

\tilde{Δ}

is given in (A28). So, the last condition in (A39) becomes

Ξ (\tilde{α}) = {\tilde{λ}}_{1} - {\tilde{λ}}_{2}

. We then have

α_{0}^{L} \leq Ξ^{- 1} ({\tilde{λ}}_{1} - {\tilde{λ}}_{2}) \leq α_{0}^{U},

where

Ξ^{- 1}

is inverse function of

Ξ

. Since

α_{0}^{L} \neq α_{0}^{U}

, at least one of

{\tilde{λ}}_{1}

or

{\tilde{λ}}_{2}

should be zero:

${\tilde{λ}}_{1} = 0$ , ${\tilde{λ}}_{2} \neq 0$ . Then $\tilde{α} = α_{0}^{U}$ and implies ${\tilde{λ}}_{2} = - Ξ (α_{0}^{U})$ . Since ${\tilde{λ}}_{2} > 0$ , so this leads to $Ξ (α_{0}^{U}) < 0$ .
${\tilde{λ}}_{2} = 0$ , ${\tilde{λ}}_{1} \neq 0$ . Then $\tilde{α} = α_{0}^{L}$ and implies ${\tilde{λ}}_{1} = Ξ (α_{0}^{L})$ . We know that ${\tilde{λ}}_{1} > 0$ , hence $Ξ (α_{0}^{L}) > 0$ .
${\tilde{λ}}_{1} = 0$ , ${\tilde{λ}}_{2} = 0$ . Then $\tilde{α} = Ξ^{- 1} (0)$ and so $α_{0}^{L} \leq Ξ^{- 1} (0) \leq α_{0}^{U}$ .

Consequently, by following the behavior of

Ξ (α)

with respect to

α_{0}^{L}

and

α_{0}^{U}

, we can often find the optimal

\tilde{α}

,

{\tilde{λ}}_{1}

and

{\tilde{λ}}_{2}

. For instance, if

Ξ (α)

is positive for all

α \in [α_{0}^{L}, α_{0}^{U}]

then we conclude that

\tilde{α} = α_{0}^{L}

.

References

Lewi, J.; Butera, R.; Paninski, L. Real-time adaptive information theoretic optimization of neurophysiology experiments. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2006; pp. 857–864. [Google Scholar]
Peng, H.C.; Herskovits, E.H.; Davatzikos, C. Bayesian Clustering Methods for Morphological Analysis of MR Images. In Proceedings of the IEEE International Symposium on Biomedical Imaging, Washington, DC, USA, 7–10 July 2002; pp. 485–488. [Google Scholar]
Moon, K.R.; Noshad, M.; Yasaei Sekeh, S.; Hero, A.O. Information theoretic structure learning with confidence. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017. [Google Scholar]
Brillinger, D.R. Some data analyses using mutual information. Braz. J. Probab. Stat. 2004, 18, 163–183. [Google Scholar]
Torkkola, K. Feature extraction by non parametric mutual information maximization. J. Mach. Learn. Res. 2003, 3, 1415–1438. [Google Scholar]
Vergara, J.R.; Estévez, P.A. A review of feature selection methods based on mutual information. Neural Comput. Appl. 2014, 24, 175–186. [Google Scholar] [CrossRef]
Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238. [Google Scholar] [CrossRef] [PubMed]
Peng, H.; Long, F.; Ding, C. Evaluation, Application, and Small Sample Performance. IEEE Trans. Pattern Anal. Mach. Intell. 1997, 19, 153–158. [Google Scholar]
Sorjamaa, A.; Hao, J.; Lendasse, A. Mutual information and knearest neighbors approximator for time series prediction. Lecture Notes Comput. Sci. 2005, 3697, 553–558. [Google Scholar]
Mohamed, S.; Rezende, D.J. Variational information maximization for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2015; pp. 2116–2124. [Google Scholar]
Neemuchwala, H.; Hero, A.O. Entropic graphs for registration. In Multi-Sensor Image Fusion and its Applications; CRC Press Book: Boca Raton, FL, USA, 2005; pp. 185–235. [Google Scholar]
Neemuchwala, H.; Hero, A.O.; Zabuawala, S.; Carson, P. Image registration methods in high-dimensional space. Int. J. Imaging Syst. Technol. 2006, 16, 130–145. [Google Scholar] [CrossRef] [Green Version]
Henze, N.; Penrose, M.D. On the multivarite runs test. Ann. Stat. 1999, 27, 290–298. [Google Scholar]
Berisha, V.; Hero, A.O. Empirical non-parametric estimation of the Fisher information. IEEE Signal Process. Lett. 2015, 22, 988–992. [Google Scholar] [CrossRef]
Berisha, V.; Wisler, A.; Hero, A.O.; Spanias, A. Empirically estimable classification bounds based on a nonparametric divergence measure. IEEE Trans. Signal Process. 2016, 64, 580–591. [Google Scholar] [CrossRef]
Kailath, T. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. Commun. Technol. 1967, 15, 52–60. [Google Scholar] [CrossRef]
Yasaei Sekeh, S.; Oselio, B.; Hero, A.O. Multi-class Bayes error estimation with a global minimal spanning tree. In Proceedings of the 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 2–5 October 2018. [Google Scholar]
Noshad, M.; Moon, K.R.; Yasaei Sekeh, S.; Hero, A.O. Direct Estimation of Information Divergence Using Nearest Neighbor Ratios. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017. [Google Scholar]
March, W.; Ram, P.; Gray, A. Fast Euclidean minimum spanning tree: Algorithm, analysis, and applications. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, 20–23 August 2010; pp. 603–612. [Google Scholar]
Borůvka, O. O jistém problému minimálním. Práce Moravské Pridovedecké Spolecnosti 1926, 3, 37–58. [Google Scholar]
Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066–138. [Google Scholar] [CrossRef] [PubMed]
Moon, K.R.; Sricharan, K.; Hero, A.O. Ensemble Estimation of Mutual Information. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 3030–3034. [Google Scholar]
Moon, K.R.; Hero, A.O. Multivariate f-divergence estimation with confidence. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 2420–2428. [Google Scholar]
Moon, K.R.; Sricharan, K.; Greenewald, K.; Hero, A.O. Improving convergence of divergence functional ensemble estimators. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016. [Google Scholar]
Leonenko, N.; Pronzato, L.; Savani, V. A class of Rényi information estimators for multidimensional densities. Ann. Stat. 2008, 36, 2153–2182. [Google Scholar] [CrossRef]
Gao, S.; Ver Steeg, G.; Galstyan, A. Efficient estimation of mutual information for strongly dependent variables. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 9–12 May 2015; pp. 277–286. [Google Scholar]
Pál, D.; Póczos, B.; Szapesvári, C. Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems 2010, Vancouver, BC, Canada, 6–9 December 2010. [Google Scholar]
Krishnamurthy, A.; Kandasamy, K.; Póczos, B.; Wasserman, L. Nonparametric estimation of Rényi divergence and friends. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 919–927. [Google Scholar]
Kandasamy, K.; Krishnamurthy, A.; Póczos, B.; Wasserman, L.; Robins, J. Nonparametric von mises estimators for entropies, divergences and mutual informations. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 397–405. [Google Scholar]
Singh, S.; Póczos, B. Analysis of k nearest neighbor distances with application to entropy estimation. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 1217–1225. [Google Scholar]
Sugiyama, M. Machine learning with squared-loss mutual information. Entropy 2012, 15, 80–112. [Google Scholar] [CrossRef]
Costa, A.; Hero, A.O. Geodesic entropic graphs for dimension and entropy estimation in manifold learning. IEEE Trans. Signal Process. 2004, 52, 2210–2221. [Google Scholar] [CrossRef]
Friedman, J.H.; Rafsky, L.C. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Ann. Stat. 1979, 7, 697–717. [Google Scholar] [CrossRef]
Smirnov, N.V. On the estimation of the discrepancy between empirical curves of distribution for two independent samples. Bull. Moscow Univ. 1939, 2, 3–6. [Google Scholar]
Wald, A.; Wolfowitz, J. On a test whether two samples are from the same population. Ann. Math. Stat. 1940, 11, 147–162. [Google Scholar] [CrossRef]
Noshad, M.; Hero, A.O. Scalable mutual information estimation using dependence graphs. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTAT), Brighton, UK, 12–17 May 2019. [Google Scholar]
Yasaei Sekeh, S.; Oselio, B.; Hero, A.O. A Dimension-Independent discriminant between distributions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018. [Google Scholar]
Csiszár, I. Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 1967, 2, 299–318. [Google Scholar]
Ali, S.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B (Methodol.) 1966, 28, 131–142. [Google Scholar] [CrossRef]
Cover, T.; Thomas, J. Elements of Information Theory, 1st ed.; John Wiley & Sons: Chichester, UK, 1991. [Google Scholar]
Härdle, W. Applied Nonparametric Regression; Cambridge University Press: Cambridge, UK, 1991. [Google Scholar]
Lorentz, G.G. Approximation of Functions; Holt, Rinehart and Winston: New York, NY, USA; Chicago, IL, USA; Toronto, ON, Canada, 1966. [Google Scholar]
Andersson, P. Characterization of pointwise Hölder regularity. Appl. Comput. Harmon. Anal. 1997, 4, 429–443. [Google Scholar] [CrossRef]
Robins, G.; Salowe, J.S. On the maximum degree of minimum spanning trees. In Proceedings of the SCG 94 Tenth Annual Symposium on Computational Geometry, Stony Brook, NY, USA, 6–8 June 1994; pp. 250–258. [Google Scholar]
Efron, B.; Stein, C. The jackknife estimate of variance. Ann. Stat. 1981, 9, 586–596. [Google Scholar] [CrossRef]
Yasaei Sekeh, S.; Noshad, M.; Moon, K.R.; Hero, A.O. Convergence Rates for Empirical Estimation of Binary Classification Bounds. arXiv 2018, arXiv:1810.01015. [Google Scholar]
Kingman, J.F.C. Poisson Processes; Oxford Univ. Press: Oxford, UK, 1993. [Google Scholar]
Yukich, J.E. Probability Theory of Classical Euclidean Optimization; Vol. 1675 of Lecture Notes in Mathematics; Springer: Berlin, Germany, 1998. [Google Scholar]
Luenberger, D.G. Optimization by Vector Space Methods; Wiley Professional Paperback Series; Wiley-Interscience: Hoboken, NJ, USA, 1969. [Google Scholar]

Figure 1. The MST and FR statistic of spanning the merged set of normal points when

X

and

Y

are independent (denoted in blue points) and when

X

and

Y

are highly dependent (denoted in red points). The FR test statistic is the number of edges in the MST that connect samples from different color nodes (denoted in green) and it is used to estimate the GMI

I_{p}

.

Figure 1. The MST and FR statistic of spanning the merged set of normal points when

X

and

Y

are independent (denoted in blue points) and when

X

and

Y

are highly dependent (denoted in red points). The FR test statistic is the number of edges in the MST that connect samples from different color nodes (denoted in green) and it is used to estimate the GMI

I_{p}

.

Figure 2. (left) Log–log plot of theoretical and experimental MSE of the proposed MST-based GMI estimator as a function of sample size n for

d = 6, 10, 12

and fixed smoothness parameter

η

. (right) The GMI estimator was implemented using two approaches, Algorithm 1 and KDE method where the KDE-GMI used KDE density estimators in the formula (2). In this experiment, samples are generated from the two-dimensional normal distribution with zero mean and covariance matrix (21) for various value of

ρ \in [0.1, 0.9]

.

Figure 2. (left) Log–log plot of theoretical and experimental MSE of the proposed MST-based GMI estimator as a function of sample size n for

d = 6, 10, 12

and fixed smoothness parameter

η

. (right) The GMI estimator was implemented using two approaches, Algorithm 1 and KDE method where the KDE-GMI used KDE density estimators in the formula (2). In this experiment, samples are generated from the two-dimensional normal distribution with zero mean and covariance matrix (21) for various value of

ρ \in [0.1, 0.9]

.

Figure 3. MSE log–log plots as a function of sample size n (left) for the proposed MST-GMI estimator (“Estimated GMI”) and the standard KDE-GMI plug-in estimator of GMI. The right column of plots correspond to the GMI estimated for dimension

d = 4

(top) and

d = 12

(bottom). In both cases the proportionality parameter

α

is

0.5

. The MST-GMI estimator in both plots for sample size n in

[700, 1600]

outperforms the KDE-GMI estimator, especially for larger dimensions.

Figure 3. MSE log–log plots as a function of sample size n (left) for the proposed MST-GMI estimator (“Estimated GMI”) and the standard KDE-GMI plug-in estimator of GMI. The right column of plots correspond to the GMI estimated for dimension

d = 4

(top) and

d = 12

(bottom). In both cases the proportionality parameter

α

is

0.5

. The MST-GMI estimator in both plots for sample size n in

[700, 1600]

outperforms the KDE-GMI estimator, especially for larger dimensions.

Figure 4. MSE log–log plots as a function of sample size n for the proposed FR estimator. We compare the MSE of our proposed FR estimator for various dimensions

d = 15, 20, 50

(left). As d increases, the blue curve takes larger values than green and orange curves i.e., MSE increases as d grows. However, this is more evidential for large sample size n. The second experiment (right) focuses on optimal proportion

α

for

n = 500, 10^{4}

and

ρ = 0.7, 0.5

.

\tilde{α}

is the optimal

α

for

α \in [α_{0}^{L}, α_{0}^{U}]

.

Figure 4. MSE log–log plots as a function of sample size n for the proposed FR estimator. We compare the MSE of our proposed FR estimator for various dimensions

d = 15, 20, 50

(left). As d increases, the blue curve takes larger values than green and orange curves i.e., MSE increases as d grows. However, this is more evidential for large sample size n. The second experiment (right) focuses on optimal proportion

α

for

n = 500, 10^{4}

and

ρ = 0.7, 0.5

.

\tilde{α}

is the optimal

α

for

α \in [α_{0}^{L}, α_{0}^{U}]

.

Figure 5. Runtime of KDE approach and proposed MST-based estimator of GMI vs sample size. The proposed GMI estimator achieves significant speedup, while for small sample size, the KDE method becomes overly fast. Please note that in this experiment the sample is generated from the Gaussian distribution in dimension

d = 2

.

Figure 5. Runtime of KDE approach and proposed MST-based estimator of GMI vs sample size. The proposed GMI estimator achieves significant speedup, while for small sample size, the KDE method becomes overly fast. Please note that in this experiment the sample is generated from the Gaussian distribution in dimension

d = 2

.

Table 1. Comparison between different scenarios of various dimensions and sample sizes in terms of parameter

α

. We applied the MST-GMI estimator to estimate the GMI (

I_{α}

) with

α = 0.2, 0.5, 0.8

. We varied dimension

d = 6, 8, 10

and sample size

n = 1000, 1500, 2000

in each scenario. We observe that for

α = {0.2, 0.5, 0.8}

, the MST-GMI estimator provides lowest MSE when

α = 0.5

indicated by star (*).

Table 1. Comparison between different scenarios of various dimensions and sample sizes in terms of parameter

α

. We applied the MST-GMI estimator to estimate the GMI (

I_{α}

) with

α = 0.2, 0.5, 0.8

. We varied dimension

d = 6, 8, 10

and sample size

n = 1000, 1500, 2000

in each scenario. We observe that for

α = {0.2, 0.5, 0.8}

, the MST-GMI estimator provides lowest MSE when

α = 0.5

indicated by star (*).

Overview Table for Different d, n, and $α$
Experiments	Dimension ( $d$ )	Sample Size ( $n$ )	GMI ( $I_{α}$ )	MSE ( $\times 10^{- 4}$ )	Parameter ( $α$ )
Scenario 1–1	6	1000	0.0229	12	0.2
Scenario 1–2	6	1000	0.0143	4.7944	0.5 *
Scenario 1–3	6	1000	0.0176	6.3867	0.8
Scenario 2–1	8	1500	0.0246	11	0.2
Scenario 2–2	8	1500	0.0074	1.6053	0.5 *
Scenario 2–3	8	1500	0.0137	5.3863	0.8
Scenario 3–1	10	2000	0.0074	2.3604	0.2
Scenario 3–2	10	2000	0.0029	0.54180	0.5 *
Scenario 3–3	10	2000	0.0262	11	0.8

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yasaei Sekeh, S.; Hero, A.O. Geometric Estimation of Multivariate Dependency. Entropy 2019, 21, 787. https://doi.org/10.3390/e21080787

AMA Style

Yasaei Sekeh S, Hero AO. Geometric Estimation of Multivariate Dependency. Entropy. 2019; 21(8):787. https://doi.org/10.3390/e21080787

Chicago/Turabian Style

Yasaei Sekeh, Salimeh, and Alfred O. Hero. 2019. "Geometric Estimation of Multivariate Dependency" Entropy 21, no. 8: 787. https://doi.org/10.3390/e21080787

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Geometric Estimation of Multivariate Dependency

Abstract

1. Introduction

1.1. Related Work

1.2. Contribution

1.3. Organization

2. The Geometric Mutual Information (GMI)

2.1. Properties of the Geometric Mutual Information

2.2. The Friedman–Rafsky Estimator

2.3. Convergence Rates

2.4. Minimax Parameter $α$

3. Simulation Study

4. Conclusions

Author Contributions

Funding

Acknowledgments

Conflicts of Interest

Appendix A

Appendix A.1. Theorem 1

Appendix A.2. Theorem 2

Appendix A.3. Proposition 1

Appendix A.4. Theorem 3

Appendix A.5. Theorem 4

Appendix A.6.

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Geometric Estimation of Multivariate Dependency

Abstract

1. Introduction

1.1. Related Work

1.2. Contribution

1.3. Organization

2. The Geometric Mutual Information (GMI)

2.1. Properties of the Geometric Mutual Information

2.2. The Friedman–Rafsky Estimator

2.3. Convergence Rates

2.4. Minimax Parameter α

3. Simulation Study

4. Conclusions

Author Contributions

Funding

Acknowledgments

Conflicts of Interest

Appendix A

Appendix A.1. Theorem 1

Appendix A.2. Theorem 2

Appendix A.3. Proposition 1

Appendix A.4. Theorem 3

Appendix A.5. Theorem 4

Appendix A.6.

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

2.4. Minimax Parameter $α$