Article

Neural Estimator of Information for Time-Series Data with Dependency

1 School of Electrical Engineering and Computer Science (EECS), KTH Royal Institute of Technology, 100 44 Stockholm, Sweden
2 Ericsson Research, 164 83 Stockholm, Sweden
* Author to whom correspondence should be addressed.
Entropy 2021, 23(6), 641; https://doi.org/10.3390/e23060641
Submission received: 26 April 2021 / Revised: 15 May 2021 / Accepted: 18 May 2021 / Published: 21 May 2021
(This article belongs to the Special Issue Deep Artificial Neural Networks Meet Information Theory)

Abstract

Novel approaches to estimating information measures using neural networks have received considerable attention in recent years, in both the information theory and machine learning communities. These neural estimators have been shown to converge to the true values when estimating mutual information and conditional mutual information from independent samples. However, if the samples in the dataset are not independent, the consistency of these estimators requires further investigation. This is of particular interest for more complex measures such as directed information, which is pivotal in characterizing causality and is meaningful over time-dependent variables. Extending the convergence proof to such cases is not trivial and demands further assumptions on the data. In this paper, we show that our neural estimator of conditional mutual information is consistent when the dataset is generated from a stationary and ergodic source; that is, the estimator converges asymptotically to the true value with probability one. Besides the universal function approximation property of neural networks, a core lemma in the convergence proof is Birkhoff's ergodic theorem. Additionally, we apply the technique to estimate directed information and demonstrate its effectiveness in simulations.

1. Introduction

In recent decades, tremendous effort has been devoted to exploring the capabilities of feed-forward networks and their applications in various areas. Novel machine learning (ML) techniques go beyond conventional classification and regression tasks and enable revisiting well-known problems in fundamental areas such as information theory. The function approximation power of neural networks is a compelling tool for estimating information-theoretic quantities such as entropy, KL-divergence, mutual information (MI), and conditional mutual information (CMI). As an example, MI is estimated with neural networks in [1], where numerical results show notable improvements over conventional methods for high-dimensional, correlated data.
Information-theoretic quantities are characterized by probability densities, and most classical approaches aim at estimating the densities. These techniques may vary depending on whether the random variables are discrete or continuous. In this paper, we focus on continuous random variables. Examples of conventional non-parametric methods to estimate these quantities are histogram and partitioning techniques, where the densities are approximated and plugged into the definitions of the quantities, or methods based on the distance of the k-th nearest neighbor [2]. Despite the vast application of nearest neighbor methods for estimating information-theoretic quantities, such as the technique proposed in [3], recent studies advocate using neural networks, with simulations demonstrating improved estimation accuracy in several scenarios [1,4]. In particular, the results indicate that as the dimension of the data increases, the estimation bias deteriorates less with neural estimators. In addition to superior performance, a neural estimator of information can serve as a stand-alone block coupled into a larger network. The estimator can then be trained simultaneously with the rest of the network and measure the flow of information among variables of the network. Therefore, it facilitates the implementation of ML setups with constraints on information measures (e.g., the information bottleneck [5] and representation learning [6]). These compelling features motivate exploring the benefits of neural networks for estimating other information measures and more complex data structures.
The cornerstone of neural estimators for MI is to approximate bounds on the relative entropy instead of computing it directly. These bounds are referred to as variational bounds and have recently gained attention due to their applications in ML problems. Examples are the lower bounds proposed originally in [7] by Donsker and Varadhan, and in [8] by Nguyen, Wainwright, and Jordan, referred to as the DV bound and the NWJ bound, respectively. Several variants of these bounds are reviewed in [9]. Variational bounds are tight, and the estimators proposed in [1,4,10,11] leverage this property, using neural networks to approximate the bounds and, correspondingly, the desired information measure. These estimators were shown to be consistent (i.e., the estimation converges asymptotically to the true value) and to suitably estimate MI and CMI when the samples are independently and identically distributed (i.i.d.). However, in several applications, such as time-series analysis, natural language processing, or estimating information rates in communication channels with feedback, there exists a dependency among the samples in the data. In this paper, we investigate analytically the convergence of our neural estimator and verify the performance of the method in estimating several information quantities.
Consider several random processes whose realizations are dependent in time. In addition to common information-theoretic measures such as MI and CMI, more complex quantities can be studied that are paramount in representing these processes. For instance, the (temporal) causal relationship between two random processes has been expressed with quantities such as directed information (DI) [12,13] and transfer entropy (TE) [14]. Both DI and TE have a variety of applications in different areas. In communication systems, DI characterizes the capacity of a channel with feedback [15], and it has further applications in areas including portfolio theory [16], source coding [17], and control theory [18], where DI is exploited as a measure of privacy in a cloud-based control setup. Additionally, DI was introduced as a measure of causal dependency in [19], which led to a series of works in that direction with applications in neuroscience [20,21] and social networks [22,23]. TE is also a well-celebrated measure in neuroscience [24,25] and the physics community [26,27] for quantifying causality in time series. In this paper, we investigate the capability of the neural estimator proposed in [11] when the samples in the data are not generated independently.
Conventional approaches to estimating the KL-divergence and MI, such as nearest neighbor methods, can be used for non-i.i.d. data, for example, to estimate DI [28] and TE [29,30]. However, it is possible to leverage the benefits of neural estimators highlighted in [1] even though the data are generated from a source with dependency among its realizations. In a recent work [31], the authors estimate TE using the neural estimator for CMI introduced in [4]. Additionally, recurrent neural networks (RNN) are proposed in [32] to capture the time dependency for estimating DI. However, showing the convergence of these estimators requires further theoretical investigation. Although the neural estimators are shown to be consistent in [1,4,11] for i.i.d. data, the extension of the proofs to dependent data needs to be addressed. In [32], the authors address the consistency of the estimation of DI by referring to the universal approximation of RNNs [33] and Breiman's ergodic theorem [34]. Because RNNs are more complicated to implement and tune, in this paper we use simple feed-forward neural networks, as proposed in [1,4,11]. A conventional step to go beyond i.i.d. processes is to investigate stationary and ergodic Markov processes, which have numerous applications in modeling real-world systems. Many convergence results for i.i.d. data, such as the law of large numbers, can be extended to ergodic processes; however, this generalization is not always trivial. The estimator proposed in [11] exhibits major improvements in estimating the CMI. Nevertheless, it is based on a k-nearest neighbors (k-NN) sampling technique, which makes the extension of the convergence proofs to non-i.i.d. data more involved. The main contribution of this paper is to provide convergence results and consistency proofs for this neural estimator when the data are stationary and ergodic Markov.
The paper is organized as follows. Notation and basic definitions are introduced in Section 2. Then, in Section 3, the neural estimator and its procedures are explained, and the convergence of the estimator is studied when the data are generated from a Markov source. Next, we provide simulation results in Section 4 for synthetic scenarios and verify the effectiveness of our technique in estimating CMI and DI. Finally, we conclude the paper in Section 5 and suggest potential future directions.

2. Preliminaries

We begin by describing the notation used throughout the paper, followed by the main definitions. We then review the variational bounds that form the basis of our neural estimator.

2.1. Notation

Random variables and their realizations are denoted by capital and lower-case letters, respectively. Given two integers i and j, a sequence of random variables $X_i, X_{i+1}, \ldots, X_j$ is written $X_i^j$, or simply $X^j$ when $i = 1$. For a stochastic process $\mathbf{Z}$, a randomly generated sample is denoted by the random variable Z. We indicate sets with calligraphic notation (e.g., $\mathcal{X}$). The space of d-dimensional real vectors is written $\mathbb{R}^d$. The probability density function (PDF) of a random variable X at $X = x$ is denoted by $p_X(x)$, or equivalently $p(x)$, and the distribution of X by $P_X$, or simply P. The PDF of multiple random variables $X_1, \ldots, X_i$ is $p_{X_1 \cdots X_i}(x_1, \ldots, x_i)$, represented for simplicity by $p(x_1, \ldots, x_i)$ in the paper. For the distribution P, $\mathbb{E}_P[\cdot]$ denotes the expectation with respect to its density $p(\cdot)$. All logarithms are in base e.
The convergence of the sequence $X_n$ almost surely (or with probability one) to X is denoted by $X_n \xrightarrow{\text{a.s.}} X$ and is defined as:
$$\mathbb{P}\left(\lim_{n\to\infty} X_n = X\right) = 1.$$

2.2. Information Measures

The information-theoretic quantities of interest in this work can be written in terms of a KL-divergence, and the available neural estimators originally aim to estimate this quantity. For a random variable X with support $\mathcal{X} \subseteq \mathbb{R}^d$, the KL-divergence between two PDFs p(x) and q(x) is defined as:
$$D\left(p(x)\,\|\,q(x)\right) := \mathbb{E}_P\left[\log\frac{p(X)}{q(X)}\right]. \tag{1}$$
Then, the CMI can be defined using the KL-divergence as:
$$I(X;Y|Z) := D\left(p(x,y,z)\,\|\,p(x|z)\,p(y,z)\right), \tag{2}$$
where Y and Z are random variables with supports $\mathcal{Y}$ and $\mathcal{Z}$, which are subsets of $\mathbb{R}^d$. In this paper, we focus on extending the estimators of CMI to non-i.i.d. data, where the samples in time-series data might not be independently and identically distributed (e.g., they are generated from a Markov process); nonetheless, our method and consistency proofs are fairly general and can be applied to estimating the KL-divergence as well. Consider a sequence of random samples $\{(X_i,Y_i,Z_i)\}_{i=1}^n$ generated from the joint process $(\mathbf{X},\mathbf{Y},\mathbf{Z})$, where the samples are not necessarily i.i.d. A simple step toward this extension is to verify that the previous neural estimators, e.g., [11], can be used to estimate $I(X;Y|Z)$, where $(X,Y,Z) \sim p(x,y,z)$ and the processes $(\mathbf{X},\mathbf{Y},\mathbf{Z})$ are Markov, as in the following assumption.
Assumption 1.
$(\mathbf{X},\mathbf{Y},\mathbf{Z})$ are jointly stationary and ergodic first-order Markov with marginal density $p(x,y,z)$. The extension of the results to d-th order Markov processes is straightforward.
To explore the generalization of neural estimators further, one can investigate their capability for information measures that involve dependent random variables. Consider the pairs $\{(X_i,Y_i)\}_{i=1}^n$ to be samples of the processes $(\mathbf{X},\mathbf{Y})$. If the generated samples are dependent in time, the causal relationship between the processes can be measured with quantities such as DI and TE, defined as:
$$I(X^n \to Y^n) := \sum_{i=1}^n I(X^i; Y_i \mid Y^{i-1}), \tag{3}$$
$$T_{X \to Y}(i) := I(X_{i-J}^{i-1}; Y_i \mid Y_{i-L}^{i-1}), \tag{4}$$
where J and L are parameters of the TE that determine the length of the memory considered for $\mathbf{X}$ and $\mathbf{Y}$, respectively. Both quantities are functions of the CMI, and Figure 1 visualizes the corresponding variables in each CMI term for DI and TE. In particular, each CMI term in (3) quantifies the amount of shared information between $X^i$ and $Y_i$ conditioned on $Y^{i-1}$, i.e., it excludes the effect of the causal history of $\mathbf{Y}$. In a more general form, to express the causal effect of the process $\mathbf{X}$ on $\mathbf{Y}$ while conditioning causally on $\mathbf{Z}$, DI is normalized with respect to n, which defines the directed information rate (DIR):
$$I(\mathbf{X}\to\mathbf{Y}\,\|\,\mathbf{Z}) := \lim_{n\to\infty}\frac{1}{n}\, I(X^n \to Y^n \,\|\, Z^n) = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n I(X^i; Y_i \mid Y^{i-1}, Z^i). \tag{5}$$
By assuming the processes to be Markov, (5) can be simplified (see [23,35,36]). Explicitly, if both $(\mathbf{X},\mathbf{Y},\mathbf{Z})$ and $(\mathbf{Y},\mathbf{Z})$ are stationary and ergodic first-order Markov, then from (5) the DIR simplifies to:
$$I(\mathbf{X}\to\mathbf{Y}\,\|\,\mathbf{Z}) = I(X^2; Y_2 \mid Y_1, Z^2), \tag{6}$$
where the CMI is with respect to the stationary density $p(x^2, y^2, z^2)$ of the Markov model. To generalize this approach, let us define the maximum Markov order ($o_{\max}$) of a set of processes as the minimum number o such that the Markov order of the joint random variables of any subset of the processes is less than or equal to o. So if $o_{\max} = l$ for $(\mathbf{X},\mathbf{Y},\mathbf{Z})$, then from (5) we can simplify the DIR term as:
$$I(\mathbf{X}\to\mathbf{Y}\,\|\,\mathbf{Z}) = I(X^{l+1}; Y_{l+1} \mid Y^l, Z^{l+1}). \tag{7}$$
The following example shows how the DIR can be computed for a linear data model, and emphasizes the difference when the DIR is causally conditioned on another process.
Example 1.
Consider the following linear model, where $\{W_i\}_{i=1}^{\infty}$, $\{W_i'\}_{i=1}^{\infty}$, and $\{W_i''\}_{i=1}^{\infty}$ are uncorrelated white Gaussian noise processes with variances $\sigma_x^2$, $\sigma_y^2$, and $\sigma_z^2$, respectively:
$$X_i = W_i, \qquad Y_i = a\, Y_{i-1} + Z_{i-1} + W_i', \qquad Z_i = X_i + W_i'',$$
for some $|a| < 1$, where $(X_0, Y_0, Z_0)$ are distributed according to the stationary distribution of the processes $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$. This model satisfies Assumption 1 with $o_{\max} = 1$, so $I(\mathbf{X}\to\mathbf{Y})$ can be computed as:
$$I(\mathbf{X}\to\mathbf{Y}) = I(X_1^2; Y_2 \mid Y_1) = \frac{1}{2}\log\left(1 + \frac{\sigma_x^2}{\sigma_y^2 + \sigma_z^2}\right),$$
while from (7):
$$I(\mathbf{X}\to\mathbf{Y}\,\|\,\mathbf{Z}) = I(X_1^2; Y_2 \mid Y_1, Z_1^2) = 0.$$
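As a numerical sanity check of Example 1 (our own illustration, not part of the original paper), the following Python sketch simulates the model with arbitrary assumed parameter values and compares a Gaussian plug-in estimate of $I(X_1^2;Y_2|Y_1)$, computed from sample covariances via differential entropies, with the closed-form expression above.

```python
import numpy as np

# Sanity check for Example 1: all variables are jointly Gaussian, so the CMI
# has a closed form via I(A;B|C) = h(A,C) + h(B,C) - h(C) - h(A,B,C).
# Parameter values below are assumptions for illustration, not from the paper.
a, sx, sy, sz = 0.5, 1.0, 1.0, 1.0
n = 200_000
rng = np.random.default_rng(0)

X = rng.normal(0.0, sx, n)            # X_i = W_i
Z = X + rng.normal(0.0, sz, n)        # Z_i = X_i + W''_i
Wy = rng.normal(0.0, sy, n)
Y = np.zeros(n)
for i in range(1, n):                 # Y_i = a Y_{i-1} + Z_{i-1} + W'_i
    Y[i] = a * Y[i - 1] + Z[i - 1] + Wy[i]

def gaussian_entropy(*parts):
    """Differential entropy of jointly Gaussian rows, from the sample covariance."""
    S = np.atleast_2d(np.cov(np.vstack(parts)))
    return 0.5 * np.linalg.slogdet(2 * np.pi * np.e * S)[1]

def gaussian_cmi(A, B, C):
    return (gaussian_entropy(A, C) + gaussian_entropy(B, C)
            - gaussian_entropy(C) - gaussian_entropy(A, B, C))

# Consecutive pairs serve as samples of the stationary law (burn-in ignored).
A_ = np.vstack([X[:-1], X[1:]])       # (X_1, X_2)
B_ = Y[1:][None, :]                   # Y_2
C_ = Y[:-1][None, :]                  # Y_1
print("plug-in estimate:", gaussian_cmi(A_, B_, C_))
print("closed form     :", 0.5 * np.log(1 + sx**2 / (sy**2 + sz**2)))
```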
As emphasized earlier, (7) holds when $(\mathbf{X},\mathbf{Y},\mathbf{Z})$ and $(\mathbf{Y},\mathbf{Z})$ are Markov with order l. The CMI estimators can then potentially be used to estimate the DIR. However, the consistency of the estimation still needs to be investigated, since the samples are not independent. Before introducing our technique, we review the basics of estimating information measures with neural networks.

2.3. Estimating the Variational Bound

The estimators proposed in [1,4,11] are all based on tight lower bounds on the KL-divergence, such as the DV bound, introduced in [7]:
$$D\left(p(x)\,\|\,q(x)\right) \ge \sup_{f\in\mathcal{F}}\; \mathbb{E}_P\left[f(X)\right] - \log\mathbb{E}_Q\left[\exp\left(f(X)\right)\right], \tag{9}$$
where p and q are two PDFs defined over $\mathcal{X}$ with corresponding distributions P and Q, respectively, and $\mathcal{F}$ is any class of functions $f:\mathcal{X}\to\mathbb{R}$ such that the two expectations exist and are finite. Consider a neural network with parameters $\theta\in\Theta$; then $\mathcal{F}$ can be taken as the class of all functions constructed with this neural network by choosing different values for the parameters θ. In more detail, let f(x) be the end-to-end function of a neural network with parameters $\theta\in\Theta$; the optimization on the right-hand side (RHS) of (9) is then equivalent to optimizing over Θ (as performed in [1]). Alternatively, we can leverage the fact that the DV bound is tight when the function is chosen as:
$$f^*(x) = \log\frac{p(x)}{q(x)} \qquad \forall x \in \mathcal{X}. \tag{10}$$
Thus, the neural network can approximate $f^*(x)$ directly and the lower bound can be computed accordingly (as performed in [4,11]).
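To make the role of (9) concrete, the following minimal Python sketch computes the empirical DV lower bound for a given approximation f of the log density ratio; the function and variable names are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def dv_lower_bound(f, samples_p, samples_q):
    """Empirical version of the RHS of (9): E_P[f(X)] - log E_Q[exp(f(X))]."""
    fp = np.apply_along_axis(f, 1, np.atleast_2d(samples_p))  # f on P-samples
    fq = np.apply_along_axis(f, 1, np.atleast_2d(samples_q))  # f on Q-samples
    # log-mean-exp of f over the Q-samples, computed stably
    log_mean_exp = np.logaddexp.reduce(fq) - np.log(len(fq))
    return fp.mean() - log_mean_exp
```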
Definition 1.
For the PDFs $p(x,y,z)$ and $p(x|z)\,p(y,z)$, define the corresponding distributions on $\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}$ to be $\tilde{P}$ and $\tilde{Q}$, respectively.
Since the CMI can be stated as a KL-divergence (2), the DV bound can be written for the CMI as below:
$$I(X;Y|Z) \ge \sup_{f\in\mathcal{F}}\; \mathbb{E}_{\tilde{P}}\left[f(X,Y,Z)\right] - \log\mathbb{E}_{\tilde{Q}}\left[\exp\left(f(X,Y,Z)\right)\right], \tag{11}$$
and the bound is tight by choosing
$$f^*(x,y,z) = \log\frac{p(x,y,z)}{p(x|z)\,p(y,z)} \qquad \forall (x,y,z) \in \mathcal{X}\times\mathcal{Y}\times\mathcal{Z}. \tag{12}$$
The main barrier to computing this bound for $f^*(x,y,z)$ is that the densities are unknown. This challenge is addressed in [4,11] by proposing neural classifiers that can approximate $f^*(x,y,z)$ without knowing the densities. Below we review the steps of the estimation technique provided in [11]:
(1) Construct the joint batch, containing samples generated according to $p(x,y,z)$.
(2) Construct the product batch, containing samples generated according to $p(x|z)\,p(y,z)$.
(3) Train the neural network with a particular loss function, which we explain later, to approximate $f^*(x,y,z)$, i.e., the logarithm of the density ratio $\frac{p(x,y,z)}{p(x|z)\,p(y,z)}$.
(4) Compute (11) using the batches and the approximated function.
To show the consistency of the estimation with this approach, it is crucial to verify that the empirical average with respect to each sample batch converges asymptotically to the corresponding expectation. Additionally, the neural network should be designed and trained to be capable of approximating the density ratio. For i.i.d. data samples, the authors in [4,11] provided the proofs in the form of concentration bounds. In this paper, we extend these proofs to non-i.i.d. data by providing convergence results for the special case of stationary and ergodic Markov processes. In the remainder of the paper, we denote the data by $\{(X_i,Y_i,Z_i)\}_{i=1}^n$, which are consecutive samples of the stationary Markov processes $(\mathbf{X},\mathbf{Y},\mathbf{Z})$ with marginal PDF $p(x,y,z)$.

3. Main Results

In this section, we describe our proposed neural estimator in detail. To create the batches, the estimator is equipped with a k-NN sampling block designed so that the empirical average over the samples converges to the desired expectation. Next, we describe the roadmap to show the convergence of the estimation to the true value (i.e., the consistency analysis).

3.1. Batch Construction

To create the joint batch, it is sufficient to pick tuples $(X_i, Y_i, Z_i)$ randomly from the available data. Below we define the joint batch formally using an auxiliary random variable that indicates whether an instance is selected or not (see also Algorithm 1 for the implementation).
Algorithm 1: Construction of the joint batch
Definition 2 (Joint batch).
Let $W_i \sim \mathrm{Ber}(\alpha)$ for $i = 1,\ldots,n$ be independent random variables, and $I_{\alpha,n}(W^n) := \{i \mid i \in \{1,\ldots,n\},\ W_i = 1\}$. Then $B_{joint}^{\alpha}$ is defined as
$$B_{joint}^{\alpha} := \left\{(X_i, Y_i, Z_i) \mid i \in I_{\alpha,n}\right\},$$
where we use $I_{\alpha,n}$ to simplify the notation.
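The construction in Definition 2 is straightforward to implement; the following is a minimal Python sketch of our reading of Algorithm 1 (which appears only as a figure above), with illustrative names.

```python
import numpy as np

def joint_batch(x, y, z, alpha, rng):
    """Sketch of Algorithm 1 / Definition 2: sample i is kept iff W_i ~ Ber(alpha) equals 1."""
    w = rng.random(len(x)) < alpha        # W_1, ..., W_n
    idx = np.flatnonzero(w)               # the index set I_{alpha,n}
    return [(x[i], y[i], z[i]) for i in idx]
```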
Please note that, by the law of large numbers, the length of the joint batch is asymptotically $\alpha n$. Next, to construct the product batch, we use the method based on the k-NN technique introduced in [11]. Below we define our method, denoted the isolated k-NN technique, and explain how the product batch is constructed (see also Algorithm 2).
Algorithm 2: Construction of the product batch
Definition 3 (Product batch).
For $s < n$, let $W_i \sim \mathrm{Bernoulli}(\alpha)$ for $i = 1,\ldots,s$ be independent random variables, and
$$I_{\alpha,s}(W^s) := \{i \mid i \in \{1,\ldots,s\},\ W_i = 1\} \quad\text{and}\quad I_{\alpha,s}^{c}(W^s) := \{1,\ldots,n\} \setminus I_{\alpha,s}(W^s).$$
Then, for any $\zeta\in\mathcal{Z}$ and given the data $\{(x_i,y_i,z_i)\}_{i=1}^n$, define $A_{\alpha,k,n,s}(\zeta, z^n, w^s)$ as the set of indices of the k nearest neighbors of ζ (in Euclidean distance) among $\{z_i\}$ for $i \in I_{\alpha,s}^c(w^s)$. Formally, let $\pi: \{1,\ldots,n-s\} \to I_{\alpha,s}^c(W^s)$ be a bijection such that $\|\zeta - z_{\pi(1)}\|_2 \le \cdots \le \|\zeta - z_{\pi(n-s)}\|_2$. Then, $A_{\alpha,k,n,s}(\zeta, z^n, w^s) := \{\pi(1),\ldots,\pi(k)\}$. So the product batch can be defined as:
$$B_{prod}^{\alpha,s} := \left\{(X_{j(i)}, Y_i, Z_i) \mid i \in I_{\alpha,s}(W^s),\ j(i) \in A_{\alpha,k,n,s}(Z_i, Z^n, W^s)\right\}.$$
Hereafter, we write $I_{\alpha,s}$, $I_{\alpha,s}^c$, and $A_{\alpha}(\zeta)$ for short, as the remaining parameters can be understood from the context. We refer to this sampling technique as isolated k-NN in the sequel. An example is also provided in Figure 2 for the case of $k = 2$.
Remark 1.
Here we emphasize that the isolated indices are selected from the first s indices of the samples, while the neighbors are searched among all n indices of the data except the ones in $I_{\alpha,s}(w^s)$. Additionally, note that the length of the product batch is asymptotically $\alpha s k$, because s and k also tend to ∞ as $n\to\infty$, as we will see in the assumptions of Proposition 2.
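For concreteness, the following Python sketch implements our reading of Algorithm 2 (shown only as a figure above), i.e., the isolated k-NN construction of Definition 3; a brute-force neighbor search is used for clarity, and all names are illustrative.

```python
import numpy as np

def product_batch(x, y, z, alpha, s, k, rng):
    """Sketch of Algorithm 2 / Definition 3 (isolated k-NN product batch)."""
    n = len(x)
    w = rng.random(s) < alpha                       # W_1, ..., W_s
    isolated = np.flatnonzero(w)                    # I_{alpha,s}
    rest = np.setdiff1d(np.arange(n), isolated)     # I^c_{alpha,s}
    z_arr = np.asarray(z, dtype=float).reshape(n, -1)
    batch = []
    for i in isolated:
        # k nearest neighbors of Z_i (Euclidean) among the non-isolated indices
        dist = np.linalg.norm(z_arr[rest] - z_arr[i], axis=1)
        neighbors = rest[np.argsort(dist)[:k]]      # A_alpha(Z_i)
        batch.extend((x[j], y[i], z[i]) for j in neighbors)
    return batch
```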

3.2. Training the Classifier

As explained earlier, the optimal function for a tight lower bound on the CMI is given by the density ratio, and to compute it we use the function approximation power of neural networks. Consider a feedforward neural network whose last layer is equipped with the sigmoid function. The network is parameterized by $\theta \in \Theta \subseteq \mathbb{R}^h$, where h is the number of parameters, and the neural network function is denoted by $\omega_\theta: \mathcal{X}\times\mathcal{Y}\times\mathcal{Z} \to [0,1]$. For an input $(X,Y,Z)$ of the network, let $C\in\{0,1\}$ denote the class of the input, which indicates whether the tuple is generated according to $p(x,y,z)$ or $p(x|z)\,p(y,z)$. To be explicit, the input is either picked from the joint batch (class $C = 1$) or the product batch (class $C = 0$), and the goal is to learn the network parameters such that the network can distinguish the class of new (unseen) queries. Let the loss function be the binary cross-entropy function. Then, for ω any function with inputs (x,y,z) and range [0,1], the expected loss is defined as:
$$\mathcal{L}(\omega) := -\,\mathbb{E}\left[C\log\omega(X,Y,Z) + (1-C)\log\left(1-\omega(X,Y,Z)\right)\right]. \tag{15}$$
It is well established that by minimizing $\mathcal{L}(\omega)$, the solution $\omega^*$ represents the probability of classifying the input in class $C = 1$ given the input data, i.e., $\mathbb{P}(C=1 \mid x,y,z)$. In fact, as shown in [11] (Lemma 1), if the prior distribution on the classes is unbiased, taking the derivative in (15) gives:
$$\Gamma(x,y,z) = \frac{p(x,y,z)}{p(x|z)\,p(y,z)} = \frac{\omega^*(x,y,z)}{1-\omega^*(x,y,z)}. \tag{16}$$
So from (12), the optimal function can be expressed with $\Gamma(x,y,z)$ as:
$$f^*(x,y,z) = \log\Gamma(x,y,z) \qquad \forall (x,y,z) \in \mathcal{X}\times\mathcal{Y}\times\mathcal{Z}. \tag{17}$$
Therefore, by training the neural network, we can approximate the optimal function $f^*(x,y,z)$ and estimate the lower bound on the CMI.
Consider the neural network $\omega_\theta$; the empirical loss function is then defined as:
$$\mathcal{L}_{emp}(\omega_\theta) := -\frac{1}{2|B_{joint}^{\alpha}|}\sum_{(X,Y,Z)\in B_{joint}^{\alpha}} \log\omega_\theta(X,Y,Z) \;-\; \frac{1}{2|B_{prod}^{\alpha,s}|}\sum_{(X,Y,Z)\in B_{prod}^{\alpha,s}} \log\left(1-\omega_\theta(X,Y,Z)\right), \tag{18}$$
and the optimal parameters are obtained by solving the following problem:
$$\hat{\theta} := \arg\min_{\theta\in\Theta}\; \mathcal{L}_{emp}(\omega_\theta). \tag{19}$$
Consequently, we can approximate the density ratio $\Gamma(x,y,z)$ from (16):
$$\hat{\Gamma}(x,y,z) = \frac{\omega_{\hat{\theta}}(x,y,z)}{1-\omega_{\hat{\theta}}(x,y,z)}. \tag{20}$$
To avoid boundary values (i.e., $\omega_{\hat{\theta}}(x,y,z)$ close to zero or one), the output of the neural network is clipped to $[\tau, 1-\tau]$ for some small $\tau > 0$.
Remark 2.
Please note that $\hat{\Gamma}(x,y,z)$ approximates the density ratio if the batch sizes $|B_{joint}^{\alpha}|$ and $|B_{prod}^{\alpha,s}|$ are balanced. Otherwise, (20) requires a correction coefficient (see [11]). To fulfill this, given the number of samples n, one can choose the parameters such that $\alpha n = \alpha s k$, i.e., $n = sk$. Then, by the law of large numbers, the batches will asymptotically be balanced.
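A minimal PyTorch sketch of this training step is given below; the architecture, optimizer, and hyper-parameters are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Sketch of the classifier of Section 3.2; architecture and hyper-parameters
# are illustrative assumptions, not the paper's exact setup.
class Classifier(nn.Module):
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # omega_theta in [0, 1]
        )

    def forward(self, xyz):
        return self.net(xyz).squeeze(-1)

def train_classifier(model, joint, prod, epochs=200, lr=1e-3):
    """Minimize the empirical cross-entropy loss (18) over the two batches."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss()
    xj = torch.as_tensor(joint, dtype=torch.float32)   # class C = 1
    xp = torch.as_tensor(prod, dtype=torch.float32)    # class C = 0
    for _ in range(epochs):
        opt.zero_grad()
        loss = 0.5 * bce(model(xj), torch.ones(len(xj))) \
             + 0.5 * bce(model(xp), torch.zeros(len(xp)))
        loss.backward()
        opt.step()
    return model

def gamma_hat(model, xyz, tau=1e-3):
    """Density-ratio estimate (20), with the network output clipped to [tau, 1 - tau]."""
    with torch.no_grad():
        w = model(torch.as_tensor(xyz, dtype=torch.float32)).clamp(tau, 1 - tau)
    return w / (1 - w)
```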

3.3. Estimation of the DV Bound

The final step in the estimation of the CMI is to compute the lower bound (11) empirically using $\hat{\Gamma}(x,y,z)$. By substituting the expectations with empirical averages over the samples in the joint and the product batch, the CMI estimator is defined as:
$$\hat{I}_{DV}^{n}(X;Y|Z) := \frac{1}{|B_{joint}^{\alpha}|}\sum_{(x,y,z)\in B_{joint}^{\alpha}} \log\hat{\Gamma}(x,y,z) \;-\; \log\left(\frac{1}{|B_{prod}^{\alpha,s}|}\sum_{(x,y,z)\in B_{prod}^{\alpha,s}} \hat{\Gamma}(x,y,z)\right). \tag{21}$$
In practice, to mitigate the inaccuracy induced by sampling from the original data, the training and estimation are repeated over several sampling trials. The steps for implementing the estimator are described in Algorithm 3. In the next part, we provide the convergence results for our estimator that validate substituting the expectations in (11) with empirical averages over the joint and the product batch. We then show the convergence of the overall estimation to the true CMI value.
Algorithm 3: Estimation of CMI
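Putting the pieces together, the following sketch computes (21) and averages over T re-sampling trials as in Algorithm 3, reusing the joint_batch, product_batch, Classifier, train_classifier, and gamma_hat sketches above; it assumes scalar (d = 1) processes for brevity.

```python
import numpy as np
import torch

def estimate_cmi(model, joint_eval, prod_eval, tau=1e-3):
    """Empirical DV estimate (21) evaluated on held-out batches."""
    g_joint = gamma_hat(model, joint_eval, tau)
    g_prod = gamma_hat(model, prod_eval, tau)
    return (torch.log(g_joint).mean() - torch.log(g_prod.mean())).item()

def estimate_cmi_avg(x, y, z, alpha, s, k, T=20, seed=0):
    """Sketch of Algorithm 3: re-sample the batches T times and average."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(T):
        bj = np.array(joint_batch(x, y, z, alpha, rng), dtype=np.float32)
        bp = np.array(product_batch(x, y, z, alpha, s, k, rng), dtype=np.float32)
        # split each batch in half: train on one half, evaluate on the other
        model = train_classifier(Classifier(bj.shape[1]), bj[::2], bp[::2])
        estimates.append(estimate_cmi(model, bj[1::2], bp[1::2]))
    return float(np.mean(estimates))
```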

3.4. Consistency Analysis

The consistency of our neural estimator (i.e., showing that the estimator converges to the true value) is based on the universal function approximation power of neural networks and concentration results for the samples collected in the joint batch and in the product batch using the isolated k-NN. Informally, Hornik's function approximation theorem [37] guarantees that feedforward neural networks are capable of fitting any continuous function. So, depending on the true density of the data, there exists a choice of parameters $\tilde{\theta}$ that enables approximating the desired function with any arbitrary accuracy. Next, we show that the empirical loss function $\mathcal{L}_{emp}(\omega_\theta)$ concentrates around its mean $\mathcal{L}(\omega_\theta)$ for any θ. Combining these tools, we are able to minimize the empirical loss function as in (19), and we expect $\hat{\theta}$ to be close to $\tilde{\theta}$ asymptotically; thus, eventually $\hat{\Gamma}(x,y,z)$ properly approximates $\Gamma(x,y,z)$. Additionally, the empirical computation of the DV bound concentrates around the expected value, which yields the consistency of the end-to-end estimation of the CMI.
In this paper, we put the main focus on extending the concentration results provided in [11] (Proposition 1) to Markov data. Although conventionally many asymptotic results for i.i.d. data are assumed to hold for Markov data as well, the required extensions here are more involved due to the additional complexity of the k-NN method. In the following, we first show the convergence of the empirical average for the joint batch,
$$\frac{1}{|B_{joint}^{\alpha}|}\sum_{(X,Y,Z)\in B_{joint}^{\alpha}} g(X,Y,Z) \longrightarrow \mathbb{E}_{\tilde{P}}\left[g(X,Y,Z)\right], \tag{22}$$
where $g(\cdot)$ is any measurable function such that the expectation exists and is finite. As the product batch collects samples corresponding to the k nearest neighbors, convergence results for nearest neighbor regression are invoked to show that the empirical average for the product batch converges to the expectation with respect to the product distribution $\tilde{Q}$,
$$\frac{1}{|B_{prod}^{\alpha,s}|}\sum_{(X,Y,Z)\in B_{prod}^{\alpha,s}} g(X,Y,Z) \longrightarrow \mathbb{E}_{\tilde{Q}}\left[g(X,Y,Z)\right]. \tag{23}$$
Then, we conclude the consistency of the overall estimation.

3.4.1. Convergence for the Joint Batch

One well-known extension of the law of large numbers to non-i.i.d. processes is Birkhoff's ergodic theorem, which is the basis of our proof of the following proposition on the convergence of the sample average over the joint batch.
Proposition 1.
Consider the sequence of random variables $\{(X_i,Y_i,Z_i)\}_{i=1}^n$ generated under Assumption 1, and the distribution $\tilde{P}$ in Definition 1. For any measurable function $g(\cdot)$ such that $\mathbb{E}_{\tilde{P}}\left[g(X,Y,Z)\right]$ exists and is finite,
$$\frac{1}{|B_{joint}^{\alpha}|}\sum_{(X,Y,Z)\in B_{joint}^{\alpha}} g(X,Y,Z) \xrightarrow{\text{a.s.}} \mathbb{E}_{\tilde{P}}\left[g(X,Y,Z)\right]. \tag{24}$$
Proof. 
See Appendix A. ☐

3.4.2. Convergence for the Product Batch

From Definition 3, the empirical summation over all samples in the product batch is equivalent to averaging $|I_{\alpha,s}|$ k-NN regressions. Considering a sequence of pairs $\{(U_i,V_i)\}_{i=1}^n$ generated from stationary ergodic processes $(\mathbf{U},\mathbf{V})$, k-NN regression denotes the problem of estimating $m(u) := \mathbb{E}[V \mid U=u]$ with $m_n(u) := \frac{1}{k(n)}\sum_{j=1}^{k(n)} V_{r_j}$, where $r_j$ refers to the j-th nearest neighbor of u among $U_1,\ldots,U_n$. This problem has been well studied when the pairs $(U_i,V_i)$ are generated i.i.d. For example, in [38] the authors show the convergence of $m_n(u)$ as:
$$\mathbb{P}\left(\int \left|m_n(u) - m(u)\right| p(u)\, du \ge \epsilon\right) \le \exp\left(-n\, a\, \epsilon^2\right),$$
for some positive constant a, when $k(n)\to\infty$ and $k(n)/n \to 0$. However, if the pairs are not independent, the convergence results require a stronger condition, known as the geometric ϕ-mixing condition or the geometric ergodicity condition [39,40]. As argued in [39], geometric ergodicity is not a restrictive assumption and holds for a wide range of processes (see also [41]). For instance, linear autoregressive processes are geometrically ergodic [41] (Ch. 15.5.2). Below we review the ϕ-mixing condition.
Definition 4
(ϕ-mixing condition). A process $\mathbf{U}$ is ϕ-mixing if, for a sequence $\{\phi_n\}_{n\in\mathbb{N}}$ of positive numbers satisfying $\phi_n \to 0$ as $n\to\infty$, for any integer $i > 0$ we have:
$$\left|P(A \cap B) - P(A)\,P(B)\right| \le \phi_i\, P(A),$$
for all $n > 0$ and all sets A and B that are members of $\sigma(U_1,\ldots,U_n)$ and $\sigma(U_{n+i}, U_{n+i+1},\ldots)$, respectively. If $\{\phi_n\}$ is a geometric sequence, $\mathbf{U}$ is called geometrically ϕ-mixing.
To show the convergence of the empirical average over the product batch, we make the following assumptions.
Assumption 2.
The sequence $\{(X_i,Y_i,Z_i)\}_{i=1}^n$ is geometrically ϕ-mixing.
Assumption 3.
We assume that $\mathcal{Y}$ and $\mathcal{Z}$ are compact.
Proposition 2.
Let the sequence of random variables $\{(X_i,Y_i,Z_i)\}_{i=1}^n$ be generated under Assumptions 1–3, and choose k(n) and s(n) such that:
$$s(n)\,k(n) = n, \qquad k(n)\to\infty, \qquad s(n)\to\infty, \qquad \frac{k(n)}{(\log n)^2}\to\infty. \tag{25}$$
Consider $\tilde{Q}$ defined in Definition 1. Then, for any function $g(\cdot)$ such that $\mathbb{E}_{\tilde{Q}}\left[g(X,Y,Z)\right]$ exists and is finite and, additionally,
$$\left|g(x,y_1,z) - g(x,y_2,z)\right| < L_g\, |y_1 - y_2| \qquad \forall x\in\mathcal{X},\ z\in\mathcal{Z},\ y_1,y_2\in\mathcal{Y}, \tag{26}$$
where $L_g > 0$ is the Lipschitz constant, we have that:
$$\frac{1}{|B_{prod}^{\alpha,s}|}\sum_{(X,Y,Z)\in B_{prod}^{\alpha,s}} g(X,Y,Z) \xrightarrow{\text{a.s.}} \mathbb{E}_{\tilde{Q}}\left[g(X,Y,Z)\right]. \tag{27}$$
Proof. 
See Appendix B. ☐
Remark 3.
Examples of choices for k(n) and s(n) satisfying (25) are, for instance, $k(n) = n^{1/2}$ and $k(n) = (\log n)^{2+\epsilon}$ for some $\epsilon > 0$, with $s(n) = n/k(n)$. Please note that in [11], the consistency results are shown for $k(n) = \Theta(n^{1/2})$. However, the convergence result in [11] (Theorem 1) is an explicit bound, so the condition on k(n) can be relaxed (choosing a smaller k(n)) when we are only interested in the asymptotic behavior.
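As a small illustration (our own, not from the paper), one such schedule can be computed as follows; note that $k(n)/(\log n)^2$ indeed grows with n.

```python
import numpy as np

def knn_schedule(n):
    """One choice satisfying (25): k(n) = n**0.5 and s(n) = n / k(n)."""
    k = max(2, int(np.sqrt(n)))
    s = n // k                       # so that s(n) * k(n) is approximately n
    return k, s

# e.g., n = 20000 gives k = 141, s = 141; k / (log n)^2 is about 1.4 and grows with n
k, s = knn_schedule(20_000)
print(k, s, k / np.log(20_000) ** 2)
```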

3.4.3. Convergence of the Overall Estimation

To complete our analysis of the consistency of the neural estimator, it is required to show that the loss function is properly approximated and that it converges to the optimal loss as n increases. The following assumptions on the neural network and the densities enable us to show this convergence.
Assumption 4.
For a network $\omega_\theta$ parameterized by $\theta\in\Theta$, the assumption holds if Θ is closed, $\Theta \subseteq \{\theta \mid \|\theta\|_2 \le K\}$ for some constant $K > 0$, and $\omega_\theta$ is B-Lipschitz in θ for some constant $B > 0$, uniformly over (x,y,z), i.e.,
$$\left|\omega_{\theta_1}(x,y,z) - \omega_{\theta_2}(x,y,z)\right| \le B\,\|\theta_1 - \theta_2\|_2 \qquad \forall \theta_1,\theta_2\in\Theta,\ (x,y,z)\in\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}.$$
Assumption 5.
There exist $0 < p_{\min} < p_{\max} < \infty$ such that for all $(x,y,z)\in\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}$, the values of $p(x,y,z)$ and $p(x|z)\,p(y,z)$ both lie in the interval $[p_{\min}, p_{\max}]$, and it holds that
$$\frac{p_{\min}}{p_{\max} + p_{\min}} \ge \tau,$$
which guarantees that $\tau \le \omega^* \le 1-\tau$.
The following theorem concludes the consistency of the end-to-end estimator.
Theorem 1.
Let Assumptions 1–5 hold, and let k(n) and s(n) satisfy (25). Then the CMI estimator $\hat{I}_{DV}^{n}(X;Y|Z)$ (defined in (21)) converges strongly to $I(X;Y|Z)$, i.e.,
$$\hat{I}_{DV}^{n}(X;Y|Z) \xrightarrow{\text{a.s.}} I(X;Y|Z).$$
Proof. 
See Appendix D. ☐
In the next section, we apply our estimator in several synthetic scenarios to verify its capability in estimating CMI and DI.

4. Simulation Results

In this section, we experiment with our proposed estimator of CMI and DI in the following auto-regressive model, which is widely used in different applications, including wireless communications [42], defining causal notions in econometrics [43], and modeling traffic flow [44], among others:
$$\begin{bmatrix} X_i \\ Y_i \\ Z_i \end{bmatrix} = A \begin{bmatrix} X_i \\ Y_i \\ Z_i \end{bmatrix} + B \begin{bmatrix} X_{i-1} \\ Y_{i-1} \\ Z_{i-1} \end{bmatrix} + \begin{bmatrix} N_i^x \\ N_i^y \\ N_i^z \end{bmatrix}, \tag{30}$$
where A and B are 3 × 3 matrices and the remaining variables are d-dimensional row vectors. A models the instantaneous effect of $X_i$, $Y_i$, and $Z_i$ on each other, and its diagonal elements are zero, while B models the effect of the previous time instant. $N_i^x$, $N_i^y$, and $N_i^z$ (referred to as noise in some contexts) are independent and generated i.i.d. according to zero-mean Gaussian distributions with covariance matrices $\sigma_x^2 I_d$, $\sigma_y^2 I_d$, and $\sigma_z^2 I_d$, respectively (i.e., the dimension is d and the components are uncorrelated). Please note that this model fulfills Assumptions 1 and 2 with appropriate initial random variables. Although Gaussian random variables do not take values in a compact set, so Assumption 3 does not hold, we could use truncated Gaussian distributions instead. Such an adjustment does not significantly change the statistics of the generated dataset, since the probability of observing a value far from the mean is negligible.
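As a reference for reproducing the experiments, the following sketch generates data from model (30); solving the instantaneous part gives $V_i = (I-A)^{-1}(B\, V_{i-1} + N_i)$. The burn-in length is our assumption for reaching approximate stationarity and is not specified in the paper.

```python
import numpy as np

def generate_var(A, B, n, d=1, sigmas=(1.0, 1.0, 1.0), seed=0, burn_in=1000):
    """Draw n samples of (X_i, Y_i, Z_i) from model (30); returns three (n, d) arrays."""
    rng = np.random.default_rng(seed)
    A, B = np.asarray(A, float), np.asarray(B, float)
    Minv = np.linalg.inv(np.eye(3) - A)      # resolves the instantaneous coupling
    sig = np.asarray(sigmas).reshape(3, 1)
    v = np.zeros((3, d))
    out = np.empty((n, 3, d))
    for t in range(n + burn_in):
        noise = sig * rng.standard_normal((3, d))
        v = Minv @ (B @ v + noise)           # V_i = (I - A)^{-1} (B V_{i-1} + N_i)
        if t >= burn_in:
            out[t - burn_in] = v
    return out[:, 0], out[:, 1], out[:, 2]   # X, Y, Z
```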
In the following, we test the capability of our estimator in estimating both conditional mutual information (CMI) and directed information (DI). In both cases, n samples are generated from the model and the batches are constructed according to Algorithms 1 and 2. Then, following Algorithm 3, the joint and product batches are split randomly in half to construct training and evaluation sets; the parameters of the classifier are trained on the training set and the final estimation is computed on the evaluation set (code is available at https://github.com/smolavipour/Neural-Estimator-of-Information-non-i.i.d, accessed on 20 May 2021).
To verify the performance of our technique, we also compare it with the approach taken in [4,31], which is as follows. Conditional mutual information can be computed by subtracting two mutual information terms, i.e.,
$$I(X;Y|Z) = I(X;Y,Z) - I(X;Z). \tag{31}$$
So instead of estimating the CMI term directly, one can use a neural estimator, such as the classifier-based estimator in [4] or the MINE estimator [1], to estimate each MI term in (31). In what follows, we refer to this technique as MI-diff, since it computes the difference between two MI terms.

4.1. Estimating Conditional Mutual Information

In this scenario, we estimate $I(X_1;Y_1|Z_1)$ when A and B are chosen to be:
$$A = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}. \tag{32}$$
Then from (30), the CMI can be computed as below:
$$\begin{aligned} I(X_1;Y_1|Z_1) &= h(X_1|Z_1) - h(X_1|Y_1,Z_1) \\ &= h(Y_0 + Y_1 + N_1^x \mid Z_1) - h(Y_0 + N_1^x \mid Y_1, Z_1) \\ &= h(Y_0 + Y_1 + N_1^x) - h(Y_0 + N_1^x) \\ &= \frac{d}{2}\log\left(1 + \frac{\sigma_y^2 + \sigma_z^2}{\sigma_x^2 + \sigma_y^2 + \sigma_z^2}\right). \end{aligned}$$
Each estimated value is an average of T = 20 estimations, where in each round the batches are re-selected while keeping the dataset fixed. This procedure is repeated for 10 Monte Carlo trials, and the data are re-generated for each trial. The hyper-parameters and settings of the experiment are provided in Table 1. In Figure 3, the CMI is estimated (as $\hat{I}_{DV}^{n,T}(X_1;Y_1|Z_1)$ in Algorithm 3) with $n = 2\times 10^4$ samples of dimension d = 1, with $\sigma_y = 2$, $\sigma_z = 2$, and varying $\sigma_x$. It can be observed that the estimator properly estimates the CMI, while the variance of the estimation remains small. The latter can be inferred from the shaded region, which indicates the range of estimated CMI for a particular $\sigma_x$ over all Monte Carlo trials. Next, the experiment is repeated for d = 10 and the results are depicted in Figure 4, where we compare our estimation of CMI with the MI-diff approach explained in (31), with each MI term estimated by the classifier-based estimator proposed in [4]. It can be observed that the means of both estimators are similar; nonetheless, estimating the CMI directly is more accurate and exhibits less variation than the MI-diff approach. Additionally, our method is faster, since it computes the information term only once, while in the MI-diff approach two different classifiers are trained to estimate the two MI terms.

4.2. Estimating Directed Information

DI can explain the underlying causal relationship among processes, a notion with wide applications in various areas. For example, consider a social network where the activities of users are monitored (e.g., the message times, as studied in [23]). The DI between these time-series data expresses how the activity of one user affects the activity of the others. In addition to such data-analytic applications, DI characterizes the capacity of communication channels with feedback; by estimating the capacity, transmission rates and powers can be adjusted in radio communications (see, for example, [32]). Now in this experiment, consider a network of three processes $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$, such that the time-series data are modeled with (30) with d = 1, where
$$A = 0, \qquad B = \begin{bmatrix} 0 & 0 & 0 \\ b_1 & 0 & 0 \\ 0 & b_2 & 0 \end{bmatrix}. \tag{33}$$
In this model, whose relations are depicted in Figure 5, the process $\mathbf{X}$ affects $\mathbf{Y}$ with a delay, and similarly the signal of $\mathbf{Y}$ appears in $\mathbf{Z}$ at the next time instant, while independent noise accumulates at both steps. The DIR from $\mathbf{X} \to \mathbf{Y}$ in this network can be computed as follows:
$$\begin{aligned} I(\mathbf{X}\to\mathbf{Y}) &= \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n I(X^i;Y_i \mid Y^{i-1}) = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n H(Y_i \mid Y^{i-1}) - H(Y_i \mid X^i, Y^{i-1}) \\ &= \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n H(Y_i) - H(Y_i \mid X_{i-1}) = \frac{1}{2}\log\left(1 + \frac{b_1^2\,\sigma_x^2}{\sigma_y^2}\right). \end{aligned} \tag{34}$$
Similarly, for the link $\mathbf{Y}\to\mathbf{Z}$, we have:
$$\begin{aligned} I(\mathbf{Y}\to\mathbf{Z}) &= \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n I(Y^i;Z_i \mid Z^{i-1}) = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n H(Z_i \mid Z^{i-1}) - H(Z_i \mid Y^i, Z^{i-1}) \\ &= \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n H(Z_i) - H(Z_i \mid Y_{i-1}) = \frac{1}{2}\log\left(1 + \frac{b_1^2 b_2^2\,\sigma_x^2 + b_2^2\,\sigma_y^2}{\sigma_z^2}\right). \end{aligned} \tag{35}$$
Next, we can compute the true DIR for the link $\mathbf{X}\to\mathbf{Z}$ as:
$$\begin{aligned} I(\mathbf{X}\to\mathbf{Z}) &= \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n I(X^i;Z_i \mid Z^{i-1}) = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n H(Z_i \mid Z^{i-1}) - H(Z_i \mid X^i, Z^{i-1}) \\ &= \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n H(Z_i) - H(Z_i \mid X_{i-2}) = \frac{1}{2}\log\left(1 + \frac{b_1^2 b_2^2\,\sigma_x^2}{b_2^2\,\sigma_y^2 + \sigma_z^2}\right). \end{aligned} \tag{36}$$
Please note that the DIR corresponding to other links (i.e., the above links in the reverse direction) is zero by similar computations. Suppose we represent the causal relationships with a directed graph, where a link between two nodes exists if the corresponding DIR is non-zero. Then according to (34)–(36), the causal relationships are described with the graph of Figure 6a.
To estimate the DIR, note that the processes are Markov and the maximum Markov order ($o_{\max}$) for the set of all processes is $o_{\max} = 2$, according to (30) and (33). Hence, by (7), we can estimate the DIR with the CMI estimator. For instance, the DIR for the processes $(\mathbf{X},\mathbf{Y})$ can be obtained as:
$$\hat{I}_{DV}^{n}(\mathbf{X}\to\mathbf{Y}) := \hat{I}_{DV}^{n}(X^3; Y_3 \mid Y^2), \tag{37}$$
where the right-hand side is computed similarly to (21). We performed the experiment with $n = 2\times 10^5$ samples of dimension d = 1 generated according to the model (30) and (33) with $b_1 = 1$, $b_2 = 2$, $\sigma_x = 3$, $\sigma_y = 2$, and $\sigma_z = 1$, while the settings of the neural network were chosen as in Table 1. The estimated values are stated in Table 2. It can be seen that the bias of the estimator is fairly small, while the variance of the estimations is negligible. This is in line with the observations in [11] when estimating CMI in the i.i.d. case.
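To run the CMI estimator on such time-series data, one first has to stack the time-embedded coordinates of (7); the sketch below (with illustrative names, assuming scalar processes) builds the samples of $(X^{l+1}, Y_{l+1}, Y^l)$, optionally appending $Z^{l+1}$ for the causally conditioned DIR.

```python
def dir_embedding(x, y, z=None, l=2):
    """Build (X^{l+1}, Y_{l+1}, conditioning) tuples for estimating the DIR via (7)."""
    n = len(x)
    rows = []
    for i in range(l, n):
        a = list(x[i - l:i + 1])            # X_{i-l}, ..., X_i     (plays X^{l+1})
        b = [y[i]]                          # Y_i                   (plays Y_{l+1})
        c = list(y[i - l:i])                # Y_{i-l}, ..., Y_{i-1} (plays Y^l)
        if z is not None:
            c += list(z[i - l:i + 1])       # Z_{i-l}, ..., Z_i     (plays Z^{l+1})
        rows.append((a, b, c))
    return rows
```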
Although $I(\mathbf{X}\to\mathbf{Z}) > 0$, intuitively $\mathbf{X}$ affects $\mathbf{Z}$ causally only through $\mathbf{Y}$, which suggests that $I(\mathbf{X}\to\mathbf{Z}\,\|\,\mathbf{Y}) = 0$. This phenomenon is referred to as the proxy effect when studying directed information graphs (see [45]). In fact, the graphical representation of the causal relationships can be simplified using the notion of causally conditioned DIR, as depicted in Figure 6b. To see this formally, note that (30) and (33) yield:
$$\begin{aligned} I(\mathbf{X}\to\mathbf{Z}\,\|\,\mathbf{Y}) &= \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n I(X^i;Z_i \mid Y^i, Z^{i-1}) = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n H(Z_i \mid Y^i, Z^{i-1}) - H(Z_i \mid X^i, Y^i, Z^{i-1}) \\ &= \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n H(Z_i \mid Y_{i-1}) - H(Z_i \mid Y_{i-1}) = 0. \end{aligned} \tag{38}$$
Considering $o_{\max} = 2$, the causally conditioned DIR terms can be estimated with our CMI estimator according to (7); for instance,
$$\hat{I}_{DV}^{n}(\mathbf{X}\to\mathbf{Y}\,\|\,\mathbf{Z}) := \hat{I}_{DV}^{n}(X^3; Y_3 \mid Y^2, Z^3). \tag{39}$$
The estimation results are provided in Table 3 for all the links, where for each link we averaged over T = 20 estimations (as in Algorithm 3); the procedure is then repeated for 10 Monte Carlo trials, in each of which a new dataset is generated according to the model.
In this experiment, we did not explore the effect of higher data dimensions, although one should note that for the causally conditioned DIR estimation, even with d = 1 the neural network is fed with inputs of size 9. Nevertheless, the performance of this estimator in higher dimensions with i.i.d. data has been studied in [11], and the challenges of dealing with high dimensions when the data have dependency can be considered a future direction of this work. Additionally, although the information about $o_{\max}$ may not always be available in practice, it can be approximated by data-driven approaches similar to the method described in [45].

5. Conclusions and Future Directions

In this paper, we explored the potential of a neural estimator for information measures when there exist time dependencies among the samples. We extended the analysis of the convergence of the estimation and provided experimental results to show the performance of the estimator in practice. Furthermore, we compared our estimation method with a similar approach taken in [4,31] (which we denoted as MI-diff), and demonstrations on synthetic scenarios show that the variances of our estimations are smaller. The main contribution, however, is the derivation of proofs of convergence when the data are generated from a Markov source. Our estimator is based on a k-NN method to re-sample the dataset such that the empirical average over the samples converges to the expectation with respect to a certain density. The convergence result derived for the re-sampling technique is stand-alone and can be adopted in other sampling applications.
Our proposed estimator can potentially be used in the areas of information theory, communication systems, and machine learning. For instance, the capacity of channels with feedback can be characterized with directed information and estimated with our estimator, which can be investigated as a future direction. Furthermore, in machine learning applications where the data have some form of dependency (either spatial or temporal), regularizing the training with an information flow measure requires the estimator to capture causality, which is accounted for in our technique. Finally, information measures can be used in modeling and controlling complex systems, and the results in this work can provide meaningful measures such as conditional dependence and causal influence.

Author Contributions

Conceptualization, S.M.; methodology, S.M., H.G., and G.B.; software, S.M.; validation, S.M., H.G., G.B. and M.S.; formal analysis, S.M., H.G., and G.B.; investigation, S.M. and G.B.; resources, M.S.; data curation, S.M.; writing—original draft preparation, S.M. and H.G.; writing—review and editing, S.M., H.G., G.B. and M.S.; visualization, S.M.; supervision, G.B. and M.S.; project administration, M.S.; funding acquisition, M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Knut and Alice Wallenberg Foundation, the Swedish Foundation for Strategic Research, and the Swedish Research Council under contract 2019-03606.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
PDF	Probability density function
IID	Independent and identically distributed
MI	Mutual information
CMI	Conditional mutual information
DI	Directed information
DIR	Directed information rate
TE	Transfer entropy
DV	Donsker–Varadhan
NWJ	Nguyen–Wainwright–Jordan
k-NN	k nearest neighbors
ML	Machine learning
RNN	Recurrent neural network

Appendix A. Proof of Proposition 1

To show the convergence stated in the proposition, let us first introduce the following lemma, which is a variant of Birkhoff's ergodic theorem for the case where the samples are not necessarily consecutive.
Lemma A1.
Let $U^n$ be n observations of a stationary and ergodic Markov process, where $U_i \in \mathcal{U}$ and $\mathcal{U} \subseteq \mathbb{R}^d$. Then, if $\mathbb{E}[g(U)]$ exists and is finite,
$$\frac{1}{|I_{\alpha,n}|}\sum_{j\in I_{\alpha,n}} g(U_j) \xrightarrow{\text{a.s.}} \mathbb{E}\left[g(U)\right], \tag{A1}$$
where $I_{\alpha,n}$ is defined in Definition 2 and the empirical average is taken to be zero when $|I_{\alpha,n}| = 0$.
Proof. 
Consider $W_1,\ldots,W_n$ generated i.i.d. with $W_i \sim \mathrm{Bernoulli}(\alpha)$. From the definition of $I_{\alpha,n}$, we can write the summation equivalently as
$$\sum_{j\in I_{\alpha,n}} g(U_j) = \sum_{i=1}^n W_i\, g(U_i). \tag{A2}$$
Since the $W_i$'s are independent of $g(U_i)$, the pairs $(W_i, g(U_i))$ are also stationary and ergodic Markov, so from Birkhoff's ergodic theorem,
$$\frac{1}{n}\sum_{i=1}^n W_i\, g(U_i) - \mathbb{E}\left[W g(U)\right] \xrightarrow{\text{a.s.}} 0, \tag{A3}$$
and since $\mathbb{E}[W g(U)] = \mathbb{E}[W]\,\mathbb{E}[g(U)] = \alpha\,\mathbb{E}[g(U)]$,
$$\frac{1}{n}\sum_{i=1}^n W_i\, g(U_i) \xrightarrow{\text{a.s.}} \alpha\,\mathbb{E}\left[g(U)\right]. \tag{A4}$$
On the other hand, from the strong law of large numbers:
$$\frac{|I_{\alpha,n}|}{n} = \frac{1}{n}\sum_{i=1}^n W_i \xrightarrow{\text{a.s.}} \alpha. \tag{A5}$$
From (A4) and (A5), and since the summation in (A5) is bounded,
$$\frac{1}{|I_{\alpha,n}|}\sum_{j\in I_{\alpha,n}} g(U_j) \xrightarrow{\text{a.s.}} \mathbb{E}\left[g(U)\right], \tag{A6}$$
and the proof is complete. ☐
Using Lemma A1, the proof of Proposition 1 becomes trivial by letting $U_i = (X_i, Y_i, Z_i)$, since the triple is a sample of a jointly stationary and ergodic Markov process. Noting that $|I_{\alpha,n}| = |B_{joint}^{\alpha}|$ concludes the proof of the proposition.

Appendix B. Proof of Proposition 2

To show the convergence of the empirical average over samples in the product batch, we begin by reviewing convergence results for k-NN regression.
Lemma A2
([39] (Theorem 2-a)). Consider a sequence $\{(U_i,V_i)\}_{i=1}^n$ that is stationary and geometrically ϕ-mixing (see Definition 4). If $k(n)/n \to 0$ and $k(n)/(\log n)^2 \to \infty$, then
$$\sup_u \left|m_n(u) - m(u)\right| \xrightarrow{\text{a.s.}} 0.$$
Now to extend Lemma A2 to the case where the samples are randomly selected for the regression, we show the following lemmas.
Lemma A3.
Let $\{(X_i,Y_i,Z_i)\}_{i=1}^n$ be generated under Assumptions 1–3. If $k(n)/n \to 0$ and $k(n)/(\log n)^2 \to \infty$, and for any $y \in \mathcal{Y}$, $\mathbb{E}_{P_{X|Z}}\left[g(X,y,Z) \mid Z=z\right]$ exists and is finite, then we have, for all y:
$$\sup_z \left| \tilde{g}(y,z) - \mathbb{E}_{P_{X|Z}}\left[g(X,y,Z) \mid Z=z\right] \right| \xrightarrow{\text{a.s.}} 0, \tag{A7}$$
where
$$\tilde{g}(y,z) := \frac{1}{k(n)}\sum_{j=1}^{k(n)} g(X_{r_j}, y, z), \tag{A8}$$
and $r_j$ refers to the index of the j-th nearest neighbor of z among $\{Z_i\}_{i=1}^n$.
Proof. 
The proof follows directly from Lemma A2, as y is fixed in (A7). ☐
Lemma A4.
Let $\{(X_i,Y_i,Z_i)\}_{i=1}^n$ be generated under Assumptions 1–3. Then, if k(n) and s(n) fulfill the assumptions in (25), and for any $y\in\mathcal{Y}$, $\mathbb{E}_{P_{X|Z}}\left[g(X,y,Z) \mid Z=z\right]$ exists and is finite, then for all y:
$$\sup_z \left| \bar{g}(y,z,W^{s(n)}) - \mathbb{E}_{P_{X|Z}}\left[g(X,y,Z) \mid Z=z\right] \right| \xrightarrow{\text{a.s.}} 0,$$
where
$$\bar{g}(y,z,W^{s(n)}) := \frac{1}{k(n)}\sum_{l \in A_{\alpha,k(n),n,s(n)}(z,\, Z^n,\, W^{s(n)})} g(X_l, y, z), \tag{A9}$$
and $A_{\alpha,k(n),n,s(n)}(z, Z^n, W^{s(n)})$ and $W^{s(n)}$ are defined in Definition 3.
Proof. 
See Appendix C. ☐
Lemma A5.
For the sequence $\{(X_i,Y_i,Z_i)\}_{i=1}^n$ defined in Lemma A4:
$$\left| \bar{g}(Y_{s(n)}, Z_{s(n)}, W^{s(n)}) - \mathbb{E}_{P_{X|Z}}\left[g(X,Y,Z) \mid Y = Y_{s(n)},\, Z = Z_{s(n)}\right] \right| \xrightarrow{\text{a.s.}} 0, \tag{A10}$$
where $s(n) < n$ and the convergence is with respect to the random variables $Y_{s(n)}$, $Z_{s(n)}$, $W^{s(n)}$, and the sequence.
Proof. 
To simplify the notation, we write $\bar{g}(y,z)$ instead of $\bar{g}(y,z,W^{s(n)})$ in this proof. Since $\mathcal{Y}$ is compact, for any $\epsilon > 0$ there exist M finite balls with radius $\epsilon/L_g$ and centers $\tilde{y}_j$, $j = 1,\ldots,M$, that cover $\mathcal{Y}$. Then, from the triangle inequality, we have:
$$\mathbb{P}\left(\lim_{n\to\infty}\sup_{y,z}\left|\bar{g}(y,z) - \mathbb{E}_{P_{X|Z}}[g(X,y,Z)\mid Z=z]\right| \le 2\epsilon\right) \ge \mathbb{P}\left(\lim_{n\to\infty}\sup_{y,z}\left(|\Delta^{(1)}(y,z)| + |\Delta^{(2)}(y,z)| + |\Delta^{(3)}(y,z)|\right) \le 2\epsilon\right), \tag{A11}$$
where
$$\Delta^{(1)}(y,z) := \bar{g}(y,z) - \bar{g}(\tilde{y}_j, z), \tag{A12}$$
$$\Delta^{(2)}(y,z) := \bar{g}(\tilde{y}_j, z) - \mathbb{E}_{P_{X|Z}}\left[g(X,\tilde{y}_j,Z)\mid Z=z\right], \tag{A13}$$
$$\Delta^{(3)}(y,z) := \mathbb{E}_{P_{X|Z}}\left[g(X,\tilde{y}_j,Z)\mid Z=z\right] - \mathbb{E}_{P_{X|Z}}\left[g(X,y,Z)\mid Z=z\right], \tag{A14}$$
and $\tilde{y}_j$ is the center of the ball containing y. Note that
$$\lim_{n\to\infty}\sup_{y,z}\left(|\Delta^{(1)}(y,z)| + |\Delta^{(2)}(y,z)| + |\Delta^{(3)}(y,z)|\right) \le \lim_{n\to\infty}\sup_{y,z}|\Delta^{(1)}(y,z)| + \lim_{n\to\infty}\sup_{y,z}|\Delta^{(2)}(y,z)| + \lim_{n\to\infty}\sup_{y,z}|\Delta^{(3)}(y,z)| \tag{A15}$$
$$\le 2\epsilon + \lim_{n\to\infty}\sup_{y,z}|\Delta^{(2)}(y,z)|, \tag{A16}$$
where (A16) follows from (26) and the radius of the balls being $\epsilon/L_g$. Thus, (A11) yields:
$$\mathbb{P}\left(\lim_{n\to\infty}\sup_{y,z}\left|\bar{g}(y,z) - \mathbb{E}_{P_{X|Z}}[g(X,y,Z)\mid Z=z]\right| \le 2\epsilon\right) \ge \mathbb{P}\left(\lim_{n\to\infty}\sup_{y,z}|\Delta^{(2)}(y,z)| \le 0\right)$$
$$\ge \mathbb{P}\left(\lim_{n\to\infty}\max_{\tilde{y}_j}\sup_z |\Delta^{(2)}(\tilde{y}_j,z)| \le 0\right) \tag{A17}$$
$$= \mathbb{P}\left(\max_{\tilde{y}_j}\lim_{n\to\infty}\sup_z |\Delta^{(2)}(\tilde{y}_j,z)| \le 0\right) \tag{A18}$$
$$\ge 1 - \sum_{j=1}^{M}\mathbb{P}\left(\lim_{n\to\infty}\sup_z |\Delta^{(2)}(\tilde{y}_j,z)| > 0\right) \tag{A19}$$
$$= 1, \tag{A20}$$
where (A17) holds by the definition (A13), (A18) follows since $\tilde{y}_j$ is independent of n, and the last step is due to Lemma A4. Finally, since (A20) holds for any $\epsilon > 0$, according to [46] (Prop 1.13) it is concluded that:
$$\mathbb{P}\left(\lim_{n\to\infty}\sup_{y,z}\left|\bar{g}(y,z) - \mathbb{E}_{P_{X|Z}}[g(X,y,Z)\mid Z=z]\right| = 0\right) = 1. \tag{A21}$$
Consider now the probability space $(\Omega, \mathcal{F}, P)$. For any $y\in\mathcal{Y}$ and $z\in\mathcal{Z}$, $\bar{g}(y,z)$ can be expressed equivalently as $\bar{g}(y,z;\psi): \Omega \to \mathbb{R}$. Consider the functions $Y_{s(n)}(\psi): \Omega \to \mathcal{Y}$ and $Z_{s(n)}(\psi): \Omega \to \mathcal{Z}$; then from (A21):
$$\mathbb{P}\left(\psi\in\Omega : \lim_{n\to\infty}\left|\bar{g}\big(Y_{s(n)}(\psi), Z_{s(n)}(\psi);\psi\big) - \mathbb{E}_{P_{X|Z}}\big[g(X,Y,Z)\mid Y=Y_{s(n)}(\psi),\, Z=Z_{s(n)}(\psi)\big]\right| = 0\right) \ge \mathbb{P}\left(\psi\in\Omega : \lim_{n\to\infty}\sup_{y,z}\left|\bar{g}(y,z;\psi) - \mathbb{E}_{P_{X|Z}}[g(X,y,Z)\mid Z=z]\right| = 0\right) = 1, \tag{A22}$$
which implies that:
$$\left|\bar{g}(Y_{s(n)}, Z_{s(n)}, W^{s(n)}) - \mathbb{E}_{P_{X|Z}}\left[g(X,Y,Z)\mid Y=Y_{s(n)},\, Z=Z_{s(n)}\right]\right| \xrightarrow{\text{a.s.}} 0, \tag{A23}$$
and the proof of Lemma A5 is concluded. ☐
Now that the required tools have been introduced, we can continue the proof of Proposition 2. From Definition 3 and (A9), the LHS of (27) can be expressed as:
$$\frac{1}{k(n)\,|I_{\alpha,s(n)}|}\sum_{(X,Y,Z)\in B_{prod}^{\alpha,s}} g(X,Y,Z) = \frac{1}{|I_{\alpha,s(n)}|}\sum_{i=1}^{s(n)} W_i\, \bar{g}(Y_i, Z_i, W^{s(n)}). \tag{A24}$$
Let us define:
$$\Delta_i := \bar{g}(Y_i, Z_i, W^{s(n)}) - \mathbb{E}_{P_{X|Z}}\left[g(X,Y,Z)\mid Y=Y_i,\, Z=Z_i\right];$$
then from Lemma A5, we obtain that:
$$|\Delta_{s(n)}| \xrightarrow{\text{a.s.}} 0. \tag{A25}$$
As a result, we can show the following strong convergence:
$$\mathbb{P}\left(\lim_{n\to\infty}\frac{1}{s(n)}\sum_{i=1}^{s(n)} W_i \Delta_i = 0\right) \ge \mathbb{P}\left(\lim_{n\to\infty} W_{s(n)}\Delta_{s(n)} = 0\right) \tag{A26}$$
$$\ge \mathbb{P}\left(\lim_{n\to\infty}|\Delta_{s(n)}| = 0\right) \tag{A27}$$
$$= 1, \tag{A28}$$
where (A26) holds since $s(n)\to\infty$ by (25) and using the Cesàro mean ([47] (Theorem 4.2.3)), (A27) holds since $W_{s(n)} \in \{0,1\}$, and the equality in the last step follows from (A25). In other words,
$$\frac{1}{s(n)}\sum_{i=1}^{s(n)} W_i\,\bar{g}(Y_i,Z_i,W^{s(n)}) - \frac{1}{s(n)}\sum_{i=1}^{s(n)} W_i\, \mathbb{E}_{P_{X|Z}}\left[g(X,Y,Z)\mid Y=Y_i,\, Z=Z_i\right] \xrightarrow{\text{a.s.}} 0. \tag{A29}$$
Next, since the sequence $\{(W_i,Y_i,Z_i)\}_{i=1}^{s(n)}$ is stationary and ergodic, using Birkhoff's ergodic theorem we have:
$$\frac{1}{s(n)}\sum_{i=1}^{s(n)} W_i\, \mathbb{E}_{P_{X|Z}}\left[g(X,Y,Z)\mid Y=Y_i,\, Z=Z_i\right] \xrightarrow{\text{a.s.}} \mathbb{E}_{P_W P_{YZ}}\left[W\, \mathbb{E}_{P_{X|Z}}\left[g(X,Y,Z)\mid Y, Z\right]\right]. \tag{A30}$$
As W is generated independently,
$$\mathbb{E}_{P_W P_{YZ}}\left[W\, \mathbb{E}_{P_{X|Z}}\left[g(X,Y,Z)\mid Y, Z\right]\right] = \mathbb{E}[W]\; \mathbb{E}_{\tilde{Q}}\left[g(X,Y,Z)\right]. \tag{A31}$$
To complete the proof, note that
$$\frac{|I_{\alpha,s(n)}|}{s(n)} \xrightarrow{\text{a.s.}} \mathbb{E}[W]. \tag{A32}$$
Therefore, from (A24) and (A29)–(A32), and $|B_{prod}^{\alpha,s}| = k(n)\,|I_{\alpha,s(n)}|$, we conclude that:
$$\frac{1}{|B_{prod}^{\alpha,s}|}\sum_{(X,Y,Z)\in B_{prod}^{\alpha,s}} g(X,Y,Z) \xrightarrow{\text{a.s.}} \mathbb{E}_{\tilde{Q}}\left[g(X,Y,Z)\right], \tag{A33}$$
and the proof is complete. ☐

Appendix C. Proof of Lemma A4

According to Definition 3, the index set $I_{\alpha,s(n)}$ is determined by the sequence $W^{s(n)}$. Therefore, $A_{\alpha,k(n),n,s(n)}(z, Z^n, W^{s(n)})$ denotes the set of indices of the k(n) nearest neighbors of z among $\{Z_i \mid i \in I_{\alpha,s(n)}^c\}$, unlike in Lemma A3, where the neighbors can be chosen among the whole sequence $\{Z_i\}_{i=1}^n$. Hence, the first step is to verify the ϕ-mixing condition for the isolated k-NN method, where some indices are excluded. Intuitively, if $\{(X_i,Y_i,Z_i)\}_{i=1}^n$ is ϕ-mixing, then the sequence $\{(X_i,Y_i,Z_i)\}_{i\in I_{\alpha,s(n)}^c}$ is also ϕ-mixing, since the random jumps make the asymptotic independence (see Definition 4) occur at a faster rate. We can indeed show that the sequence $\{(X_i,Y_i,Z_i)\}_{i\in I_{\alpha,s(n)}^c}$ satisfies the mixing condition required for Lemma A3, as expressed in the following.
The basis of the proof of Lemma A2, and thus of Lemma A3, is Collomb's inequality [48] (Theorem 2.2.1), which provides a concentration bound similar to Hoeffding's inequality for ϕ-mixing variables. For instance, if $\mathbf{U}$ is a ϕ-mixing process with $\mathbb{E}[U_i] = 0$, $|U_i| \le a_1$, $\mathbb{E}[U_i^2] \le a_2$, and $\mathbb{E}[|U_i|] \le a_3$, the inequality states that:
$$\mathbb{P}\left(\left|\sum_{i=1}^n U_i\right| > \epsilon\right) \le \exp\left(\frac{3\sqrt{e}\, n\,\phi_t}{t} - a_4\,\epsilon + 6\, a_4^2\, n\left(a_2 + 4 a_1 a_3 \sum_{i=1}^{t}\phi_i\right)\right), \tag{A34}$$
for some integer $t < n$ and real $a_4$ such that $a_1 a_4 t \le 1/4$. In order to show a similar inequality for $\{U_i\}_{i\in I_{\alpha,s(n)}^c}$, we have that:
$$\mathbb{P}\left(\left|\sum_{i\in I_{\alpha,s(n)}^c} U_i\right| > \epsilon\right) = \mathbb{P}\left(\left|\sum_{i=1}^{n} U_i - \sum_{i=1}^{s(n)} W_i U_i\right| > \epsilon\right) \le \mathbb{P}\left(\left|\sum_{i=1}^{n} U_i\right| > \frac{\epsilon}{2}\right) + \mathbb{P}\left(\left|\sum_{i=1}^{s(n)} W_i U_i\right| > \frac{\epsilon}{2}\right), \tag{A35}$$
where both terms in (A35) are bounded by exponential terms and can be dominated by either of them. Thus, as $n\to\infty$ and $s(n)\to\infty$ (by assumption (25)), both terms tend to zero and Collomb's inequality applies to the summation over the sub-sequence of samples remaining after the isolation. In other words, the required mixing condition holds for the new sequence $\{(X_i,Y_i,Z_i)\}_{i\in I_{\alpha,s(n)}^c}$, and the result in Lemma A2 can be extended to this lemma.
Next, it remains to verify the conditions of Lemma A2 on k(n). From (25) we have
$$\frac{k(n)}{|I_{\alpha,s(n)}^c|} \le \frac{k(n)}{n - s(n)} = \frac{1}{s(n)\left(1 - \frac{1}{k(n)}\right)} \xrightarrow{\text{a.s.}} 0, \tag{A36}$$
which yields that:
$$\frac{k(n)}{\left(\log |I_{\alpha,s(n)}^c|\right)^2} \ge \frac{k(n)}{(\log n)^2} \xrightarrow{\text{a.s.}} \infty. \tag{A37}$$
Therefore, the conditions of Lemma A2 hold, and from Lemma A3 it follows that for all $y\in\mathcal{Y}$:
$$\sup_z \left|\bar{g}(y,z,W^{s(n)}) - \mathbb{E}_{P_{X|Z}}\left[g(X,y,Z)\mid Z=z\right]\right| \xrightarrow{\text{a.s.}} 0,$$
which concludes the proof of the lemma. ☐

Appendix D. Proof of Theorem 1

Based on the universal function approximation theory of neural networks [37], [4] (Lemma 4) implies that for any $\epsilon_0 > 0$ there exists $\tilde{\theta} \in \Theta$ such that:
$$\left|\mathcal{L}(\omega_{\tilde{\theta}}) - \mathcal{L}^*\right| < \frac{\epsilon_0}{2}, \tag{A38}$$
where $\mathcal{L}^* := \mathcal{L}(\omega^*)$, and $\mathcal{L}(\omega)$ and $\omega^*$ were defined in (15). Moreover, from Propositions 1 and 2, for any $\theta\in\Theta$ the empirical loss $\mathcal{L}_{emp}(\omega_\theta)$ defined in (18) converges asymptotically to the expected loss $\mathcal{L}(\omega_\theta)$. This is obtained by letting $g(x,y,z) = \log(\omega_\theta(x,y,z))$ and $g(x,y,z) = \log(1-\omega_\theta(x,y,z))$ in Propositions 1 and 2, respectively, and noting Remark 2. Thus, we have:
$$\mathcal{L}_{emp}(\omega_\theta) \xrightarrow{\text{a.s.}} \mathcal{L}(\omega_\theta). \tag{A39}$$
Since $\Theta \subseteq \mathbb{R}^h$ and $\|\theta\|_2 \le K$ for all $\theta\in\Theta$, Θ can be covered with a finite number $N(\Theta,r)$ of balls of radius r, where $N(\Theta,r)$ is bounded [49]:
$$N(\Theta, r) \le \left(\frac{2K\sqrt{h}}{r}\right)^{h}. \tag{A40}$$
Let $\{\theta_1,\ldots,\theta_{N(\Theta,r)}\}$ denote the centers of the covering balls, and let $j_n$ be the index of the ball that $\hat{\theta}$ belongs to. Then, from the triangle inequality we have:
$$\left|\mathcal{L}_{emp}(\omega_{\hat{\theta}}) - \mathcal{L}(\omega_{\hat{\theta}})\right| \le \left|\mathcal{L}_{emp}(\omega_{\hat{\theta}}) - \mathcal{L}_{emp}(\omega_{\theta_{j_n}})\right| + \left|\mathcal{L}_{emp}(\omega_{\theta_{j_n}}) - \mathcal{L}(\omega_{\theta_{j_n}})\right| + \left|\mathcal{L}(\omega_{\theta_{j_n}}) - \mathcal{L}(\omega_{\hat{\theta}})\right| \le \left|\mathcal{L}_{emp}(\omega_{\theta_{j_n}}) - \mathcal{L}(\omega_{\theta_{j_n}})\right| + \frac{2Br}{\tau}, \tag{A41}$$
where the second inequality holds due to the Lipschitz continuity of $\omega_\theta$ stated in Assumption 4. From the union bound, for any $\epsilon > 0$ we have:
$$\mathbb{P}\left(\lim_{n\to\infty}\left|\mathcal{L}_{emp}(\omega_{\hat{\theta}}) - \mathcal{L}(\omega_{\hat{\theta}})\right| > \frac{\epsilon}{2}\right) \le N(\Theta,r)\; \mathbb{P}\left(\lim_{n\to\infty}\left|\mathcal{L}_{emp}(\omega_{\theta_{j_n}}) - \mathcal{L}(\omega_{\theta_{j_n}})\right| > \frac{\epsilon}{2} - \frac{2Br}{\tau}\right) \tag{A42}$$
$$= 0, \tag{A43}$$
where (A42) holds due to (A41), applying a union bound over all centers $\theta_j$ and choosing $r < \frac{\epsilon\tau}{4B}$, and the last step follows by exploiting the strong convergence in (A39). As a result, with probability one:
$$\lim_{n\to\infty}\mathcal{L}(\omega_{\hat{\theta}}) \le \lim_{n\to\infty}\mathcal{L}_{emp}(\omega_{\hat{\theta}}) + \frac{\epsilon}{2} \tag{A44}$$
$$\le \lim_{n\to\infty}\mathcal{L}_{emp}(\omega_{\tilde{\theta}}) + \frac{\epsilon}{2} \tag{A45}$$
$$= \mathcal{L}(\omega_{\tilde{\theta}}) + \frac{\epsilon}{2} \tag{A46}$$
$$\le \mathcal{L}^* + \epsilon, \tag{A47}$$
where (A44) is obtained from (A43), (A45) holds since $\hat{\theta}$ minimizes $\mathcal{L}_{emp}(\omega_\theta)$, and (A46) follows from (A39). Finally, the last step is derived using (A38) and choosing $\epsilon_0 = \epsilon$.
To conclude the proof, note that if Assumption 5 holds, then from [4] (Lemma 6), and taking similar steps as in [11] (Lemma 8), it is implied that for any given $\epsilon > 0$, with probability one as $n \to \infty$:
$$\mathbb{E}_{\tilde{P}}\left[ \left| \omega^*(X,Y,Z) - \omega_{\hat{\theta}}(X,Y,Z) \right| \,\middle|\, \hat{\theta} \right] \le \eta, \qquad \mathbb{E}_{\tilde{Q}}\left[ \left| \omega^*(X,Y,Z) - \omega_{\hat{\theta}}(X,Y,Z) \right| \,\middle|\, \hat{\theta} \right] \le \eta, \tag{A48}$$
where $\eta := (1-\tau)\, p_{\max} \sqrt{2\lambda\epsilon/p_{\min}}$, with $\lambda$ being the Lebesgue measure corresponding to $\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}$. Note that the expectations in (A48) are random variables due to $\hat{\theta}$. Let us define $I_{DV}^n(X; Y|Z)$ as:
$$I_{DV}^n(X; Y|Z) := \mathbb{E}_{\tilde{P}}\left[ \log \hat{\Gamma}(X,Y,Z) \,\middle|\, \hat{\theta} \right] - \log \mathbb{E}_{\tilde{Q}}\left[ \hat{\Gamma}(X,Y,Z) \,\middle|\, \hat{\theta} \right]. \tag{A49}$$
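In practice, the estimate $\hat{I}_{DV}^n$ in (21) replaces these expectations with empirical means over the two batches. A minimal sketch follows, assuming a trained PyTorch classifier and the same batch layout as before; the clipping of the classifier output to $[\tau, 1-\tau]$, which enforces the bound on $\hat{\Gamma}$ used below, is made explicit, and all names are illustrative assumptions.

```python
import torch

# Sketch of the Donsker-Varadhan form: E_P[log Gamma] - log E_Q[Gamma],
# with the likelihood ratio Gamma = w / (1 - w) computed from the trained
# classifier and w clipped to [tau, 1 - tau].
def dv_cmi_estimate(w_theta_hat, joint_batch, prod_batch, tau=1e-3):
    with torch.no_grad():
        w_joint = torch.clamp(w_theta_hat(joint_batch), tau, 1.0 - tau)
        w_prod = torch.clamp(w_theta_hat(prod_batch), tau, 1.0 - tau)
        gamma_joint = w_joint / (1.0 - w_joint)  # Gamma on B_joint
        gamma_prod = w_prod / (1.0 - w_prod)     # Gamma on B_prod
        return torch.log(gamma_joint).mean() - torch.log(gamma_prod.mean())
```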
Thus, by the triangle inequality we have:
$$\left| \hat{I}_{DV}^n(X;Y|Z) - I(X;Y|Z) \right| \le \left| \hat{I}_{DV}^n(X;Y|Z) - I_{DV}^n(X;Y|Z) \right| + \left| I_{DV}^n(X;Y|Z) - I(X;Y|Z) \right|, \tag{A50}$$
where $\hat{I}_{DV}^n(X;Y|Z)$ was defined in (21).
To bound the first term, note that by the triangle inequality:
$$\left| \hat{I}_{DV}^n(X;Y|Z) - I_{DV}^n(X;Y|Z) \right| \le \Delta_{DV} + \Delta'_{DV},$$
where
$$\Delta_{DV} := \left| \frac{1}{|B_{joint}^\alpha|} \sum_{(X,Y,Z) \in B_{joint}^\alpha} \log \hat{\Gamma}(X,Y,Z) - \mathbb{E}_{\tilde{P}}\left[ \log \hat{\Gamma}(X,Y,Z) \,\middle|\, \hat{\theta} \right] \right|$$
and
$$\Delta'_{DV} := \left| \log\left( \frac{1}{|B_{prod}^{\alpha,s}|} \sum_{(X,Y,Z) \in B_{prod}^{\alpha,s}} \hat{\Gamma}(X,Y,Z) \right) - \log \mathbb{E}_{\tilde{Q}}\left[ \hat{\Gamma}(X,Y,Z) \,\middle|\, \hat{\theta} \right] \right|.$$
Since $\hat{\Gamma}(\cdot)$ is bounded as:
$$\frac{\tau}{1-\tau} \le \hat{\Gamma}(X,Y,Z) \le \frac{1-\tau}{\tau},$$
by the Lipschitz continuity of $\log(\cdot)$ on this interval it follows that:
$$\left| \hat{I}_{DV}^n(X;Y|Z) - I_{DV}^n(X;Y|Z) \right| \le \Delta_{DV} + \Delta''_{DV},$$
where
$$\Delta''_{DV} := \frac{1-\tau}{\tau} \left| \frac{1}{|B_{prod}^{\alpha,s}|} \sum_{(X,Y,Z) \in B_{prod}^{\alpha,s}} \hat{\Gamma}(X,Y,Z) - \mathbb{E}_{\tilde{Q}}\left[ \hat{\Gamma}(X,Y,Z) \,\middle|\, \hat{\theta} \right] \right|.$$
Both $\Delta_{DV}$ and $\Delta''_{DV}$ converge strongly to zero from Propositions 1 and 2, respectively; i.e., for any given $\epsilon > 0$, we have that:
$$\mathbb{P}\left( \lim_{n\to\infty} \Delta_{DV} > \frac{\epsilon}{4} \right) = 0, \tag{A52}$$
$$\mathbb{P}\left( \lim_{n\to\infty} \Delta''_{DV} > \frac{\epsilon}{4} \right) = 0. \tag{A53}$$
To bound the second term in (A50), the triangle inequality yields:
$$\left| I_{DV}^n(X;Y|Z) - I(X;Y|Z) \right| \le \left| \mathbb{E}_{\tilde{P}}\left[ \log \hat{\Gamma}(X,Y,Z) - \log \Gamma(X,Y,Z) \,\middle|\, \hat{\theta} \right] \right| + \left| \log \mathbb{E}_{\tilde{Q}}\left[ \hat{\Gamma}(X,Y,Z) \,\middle|\, \hat{\theta} \right] - \log \mathbb{E}_{\tilde{Q}}\left[ \Gamma(X,Y,Z) \right] \right|.$$
Thus, from (A48) and the Lipschitz continuity of $\Gamma$, $\hat{\Gamma}$, and $\log(\cdot)$, it follows that:
$$\mathbb{P}\left( \lim_{n\to\infty} \left| \mathbb{E}_{\tilde{P}}\left[ \log \hat{\Gamma}(X,Y,Z) - \log \Gamma(X,Y,Z) \,\middle|\, \hat{\theta} \right] \right| > \frac{\eta}{\tau(1-\tau)} \right) = 0, \tag{A54}$$
$$\mathbb{P}\left( \lim_{n\to\infty} \left| \log \mathbb{E}_{\tilde{Q}}\left[ \hat{\Gamma}(X,Y,Z) \,\middle|\, \hat{\theta} \right] - \log \mathbb{E}_{\tilde{Q}}\left[ \Gamma(X,Y,Z) \right] \right| > \frac{\eta}{\tau^2} \right) = 0. \tag{A55}$$
Then, combining (A50) and (A52)–(A55), it is concluded that, with probability one as $n \to \infty$:
$$\left| \hat{I}_{DV}^n(X;Y|Z) - I(X;Y|Z) \right| \le \Delta_{DV} + \Delta''_{DV} + \frac{\eta}{\tau(1-\tau)} + \frac{\eta}{\tau^2} \le \frac{\epsilon}{4} + \frac{\epsilon}{4} + \frac{\epsilon}{2} = \epsilon,$$
where the last step holds by choosing $\eta = \frac{\tau^2(1-\tau)\epsilon}{2}$, and $\epsilon$ and $\epsilon_0$ accordingly. In other words,
$$\hat{I}_{DV}^n(X;Y|Z) \xrightarrow{a.s.} I(X;Y|Z),$$
and the proof of Theorem 1 is completed. ☐

References

1. Belghazi, M.I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Courville, A.; Hjelm, D. MINE: Mutual Information Neural Estimation. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 531–540.
2. Wang, Q.; Kulkarni, S.R.; Verdú, S. Universal estimation of information measures for analog sources. Found. Trends Commun. Inf. Theory 2009, 5, 265–353.
3. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138.
4. Mukherjee, S.; Asnani, H.; Kannan, S. CCMI: Classifier based Conditional Mutual Information Estimation. In Proceedings of the Uncertainty in Artificial Intelligence, Tel Aviv, Israel, 22–25 July 2019.
5. Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 22–24 September 1999; pp. 368–377.
6. Hjelm, R.D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; Bengio, Y. Learning deep representations by mutual information estimation and maximization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
7. Donsker, M.D.; Varadhan, S.R.S. Asymptotic evaluation of certain Markov process expectations for large time, I. Commun. Pure Appl. Math. 1975, 28, 1–47.
8. Nguyen, X.; Wainwright, M.J.; Jordan, M.I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory 2010, 56, 5847–5861.
9. Poole, B.; Ozair, S.; van den Oord, A.; Alemi, A.A.; Tucker, G. On variational lower bounds of mutual information. In Proceedings of the NeurIPS Workshop on Bayesian Deep Learning, Montréal, QC, Canada, 7–8 December 2018.
10. Molavipour, S.; Bassi, G.; Skoglund, M. Conditional Mutual Information Neural Estimator. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 5025–5029.
11. Molavipour, S.; Bassi, G.; Skoglund, M. Neural Estimators for Conditional Mutual Information Using Nearest Neighbors Sampling. IEEE Trans. Signal Process. 2021, 69, 766–780.
12. Marko, H. The bidirectional communication theory - a generalization of information theory. IEEE Trans. Commun. 1973, 21, 1345–1351.
13. Massey, J. Causality, Feedback and Directed Information. In Proceedings of the International Symposium on Information Theory and Its Applications (ISITA), Honolulu, HI, USA, 27–30 November 1990; pp. 303–305.
14. Schreiber, T. Measuring information transfer. Phys. Rev. Lett. 2000, 85, 461.
15. Kramer, G. Directed Information for Channels with Feedback. Ph.D. Thesis, Department of Information Technology and Electrical Engineering, ETH Zurich, Zürich, Switzerland, 1998.
16. Permuter, H.H.; Kim, Y.H.; Weissman, T. Interpretations of directed information in portfolio theory, data compression, and hypothesis testing. IEEE Trans. Inf. Theory 2011, 57, 3248–3259.
17. Venkataramanan, R.; Pradhan, S.S. Source coding with feed-forward: Rate-distortion theorems and error exponents for a general source. IEEE Trans. Inf. Theory 2007, 53, 2154–2179.
18. Tanaka, T.; Skoglund, M.; Sandberg, H.; Johansson, K.H. Directed information and privacy loss in cloud-based control. In Proceedings of the American Control Conference (ACC), Seattle, WA, USA, 24–26 May 2017; pp. 1666–1672.
19. Rissanen, J.; Wax, M. Measures of mutual and causal dependence between two time series (Corresp.). IEEE Trans. Inf. Theory 1987, 33, 598–601.
20. Quinn, C.J.; Coleman, T.P.; Kiyavash, N.; Hatsopoulos, N.G. Estimating the directed information to infer causal relationships in ensemble neural spike train recordings. J. Comput. Neurosci. 2011, 30, 17–44.
21. Cai, Z.; Neveu, C.L.; Baxter, D.A.; Byrne, J.H.; Aazhang, B. Inferring neuronal network functional connectivity with directed information. J. Neurophysiol. 2017, 118, 1055–1069.
22. Ver Steeg, G.; Galstyan, A. Information transfer in social media. In Proceedings of the 21st International Conference on World Wide Web, Lyon, France, 16–20 April 2012; pp. 509–518.
23. Quinn, C.J.; Kiyavash, N.; Coleman, T.P. Directed information graphs. IEEE Trans. Inf. Theory 2015, 61, 6887–6909.
24. Vicente, R.; Wibral, M.; Lindner, M.; Pipa, G. Transfer entropy - a model-free measure of effective connectivity for the neurosciences. J. Comput. Neurosci. 2011, 30, 45–67.
25. Chávez, M.; Martinerie, J.; Le Van Quyen, M. Statistical assessment of nonlinear causality: Application to epileptic EEG signals. J. Neurosci. Methods 2003, 124, 113–128.
26. Spinney, R.E.; Lizier, J.T.; Prokopenko, M. Transfer entropy in physical systems and the arrow of time. Phys. Rev. E 2016, 94, 022135.
27. Runge, J. Quantifying information transfer and mediation along causal pathways in complex systems. Phys. Rev. E 2015, 92, 062829.
28. Murin, Y. k-NN Estimation of Directed Information. arXiv 2017, arXiv:1711.08516.
29. Faes, L.; Kugiumtzis, D.; Nollo, G.; Jurysta, F.; Marinazzo, D. Estimating the decomposition of predictive information in multivariate systems. Phys. Rev. E 2015, 91, 032904.
30. Baboukani, P.S.; Graversen, C.; Alickovic, E.; Østergaard, J. Estimating Conditional Transfer Entropy in Time Series Using Mutual Information and Nonlinear Prediction. Entropy 2020, 22, 1124.
31. Zhang, J.; Simeone, O.; Cvetkovic, Z.; Abela, E.; Richardson, M. ITENE: Intrinsic Transfer Entropy Neural Estimator. arXiv 2019, arXiv:1912.07277.
32. Aharoni, Z.; Tsur, D.; Goldfeld, Z.; Permuter, H.H. Capacity of Continuous Channels with Memory via Directed Information Neural Estimator. arXiv 2020, arXiv:2003.04179.
33. Schäfer, A.M.; Zimmermann, H.G. Recurrent neural networks are universal approximators. Int. J. Neural Syst. 2007, 17, 253–263.
34. Breiman, L. The individual ergodic theorem of information theory. Ann. Math. Stat. 1957, 28, 809–811.
35. Kontoyiannis, I.; Skoularidou, M. Estimating the directed information and testing for causality. IEEE Trans. Inf. Theory 2016, 62, 6053–6067.
36. Molavipour, S.; Bassi, G.; Skoglund, M. Testing for directed information graphs. In Proceedings of the Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 3–6 October 2017; pp. 212–219.
37. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366.
38. Devroye, L.; Gyorfi, L.; Krzyzak, A.; Lugosi, G. On the strong universal consistency of nearest neighbor regression function estimates. Ann. Stat. 1994, 22, 1371–1385.
39. Collomb, G. Nonparametric time series analysis and prediction: Uniform almost sure convergence of the window and k-NN autoregression estimates. Statistics 1985, 16, 297–307.
40. Yakowitz, S. Nearest-neighbour methods for time series analysis. J. Time Ser. Anal. 1987, 8, 235–247.
41. Meyn, S.P.; Tweedie, R.L. Markov Chains and Stochastic Stability; Springer Science & Business Media: Dordrecht, The Netherlands, 2012.
42. Raleigh, G.G.; Cioffi, J.M. Spatio-temporal coding for wireless communication. IEEE Trans. Commun. 1998, 46, 357–366.
43. Granger, C.W.J. Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica 1969, 37, 424–438.
44. Kamarianakis, Y.; Prastacos, P. Space–time modeling of traffic flow. Comput. Geosci. 2005, 31, 119–133.
45. Molavipour, S.; Bassi, G.; Čičić, M.; Skoglund, M.; Johansson, K.H. Causality Graph of Vehicular Traffic Flow. arXiv 2020, arXiv:2011.11323.
46. Ross, S.M.; Peköz, E.A. A Second Course in Probability. 2007. Available online: www.bookdepository.com/publishers/Pekozbooks (accessed on 20 May 2021).
47. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 1999.
48. Györfi, L.; Härdle, W.; Sarda, P.; Vieu, P. Nonparametric Curve Estimation from Time Series; Springer: Berlin/Heidelberg, Germany, 2013; Volume 60.
49. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2014.
Figure 1. The memory considered for the conditional mutual information terms in directed information (left) and transfer entropy (right) at time instance $i$. To compute directed information (left), the effect of $X^i$ (i.e., $X_i$ and all its past samples) on $Y_i$ is considered, while the history of $Y_i$ is excluded. However, for transfer entropy (right), the effect of $X_{i-J}^{i-1}$ (i.e., the previous $J$ samples before $X_i$) on $Y_i$ is accounted for, while we exclude the history of $Y_i$. Note that the memory lengths ($J$ and $L$) for transfer entropy may differ.
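The window structure in Figure 1 can be made concrete with a small sketch. Assuming the samples are stored in NumPy arrays, the following hypothetical helpers return the (X-part, Y-history, target) triple entering each conditional mutual information term.

```python
import numpy as np

# Each CMI term at time i has the form I(X-part; Y_i | Y-history).
# Directed information keeps the full past, while transfer entropy keeps
# only the last J (resp. L) samples; slicing conventions are assumptions.
def di_windows(x: np.ndarray, y: np.ndarray, i: int):
    """Directed information term: (X^i, Y^{i-1}, Y_i)."""
    return x[: i + 1], y[:i], y[i]

def te_windows(x: np.ndarray, y: np.ndarray, i: int, J: int, L: int):
    """Transfer entropy term: (X_{i-J}^{i-1}, Y_{i-L}^{i-1}, Y_i)."""
    return x[i - J : i], y[i - L : i], y[i]
```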
Figure 2. Construction of the product batch from the dataset, which is shown in the left table. Let $w_i = 1$, and let the $z$ components of the rows marked with '*' (indexed by $j_1$ and $j_2$) be in the $k$ nearest neighborhood of $z_i$ for $k = 2$. We then pack the triples $(x_{j_1}, y_i, z_i)$ and $(x_{j_2}, y_i, z_i)$ into the product batch, as shown in the right table.
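The packing illustrated in Figure 2 can be sketched as follows, assuming the samples are NumPy arrays with $z$ of shape $(n, d)$ and that the candidate indices correspond to the non-isolated set; the function and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

# For each isolated index i, find the k nearest neighbors of z_i among the
# candidate rows and pack the triples (x_j, y_i, z_i) into the product batch.
def product_batch(x, y, z, isolated_idx, candidate_idx, k=2):
    tree = cKDTree(z[candidate_idx])          # neighbors searched among these
    batch = []
    for i in isolated_idx:
        _, nn = tree.query(z[i], k=k)         # k nearest z-neighbors of z_i
        for j in np.atleast_1d(nn):
            jj = candidate_idx[j]             # index back into the full data
            batch.append((x[jj], y[i], z[i])) # mix x_j with (y_i, z_i)
    return batch
```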
Figure 3. Estimated CMI for the AR-1 model in (30) using $n = 2 \times 10^4$ samples with $d = 1$. The shaded region shows the range of the estimated values over the Monte Carlo trials.
Figure 4. Estimated CMI for the AR-1 model in (30) using $n = 2 \times 10^4$ samples with $d = 10$. The shaded region shows the range of the estimated values over the Monte Carlo trials. Blue shades correspond to estimation with our method, yellow shades correspond to estimation with the MI-diff approach, and the green shade is the overlap of the two areas.
Figure 5. Causal relationship of the processes.
Figure 6. Graphical representation of the causal influences between the processes using pairwise directed information (a) and causally conditioned directed information (b).
Table 1. Hyper-parameters.

Hidden units:    64
Hidden layers:   2 (64 × 64)
Activation:      ReLU
τ:               $10^{-3}$
Optimizer:       Adam
Learning rate:   $10^{-3}$
Epochs:          200
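For concreteness, a classifier matching the hyper-parameters in Table 1 could look as follows; the input dimension (the concatenated size of $(x, y, z)$) and the surrounding training loop are omitted assumptions.

```python
import torch.nn as nn
import torch.optim as optim

# Two hidden layers of 64 ReLU units and a sigmoid output, as in Table 1.
def build_classifier(input_dim: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(input_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 1), nn.Sigmoid(),
    )

model = build_classifier(input_dim=3)           # e.g., scalar x, y, and z
opt = optim.Adam(model.parameters(), lr=1e-3)   # optimizer per Table 1
# ... train for 200 epochs with the cross-entropy loss sketched earlier.
```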
Table 2. True and estimated DIR.

                  True DIR    Estimation with Our Method (Mean ± Std)
$I(X \to Y)$      0.59        0.57 ± 0.00
$I(X \to Z)$      0.57        0.55 ± 0.00
$I(Y \to Z)$      1.99        1.92 ± 0.01
$I(Y \to X)$      0           0.00 ± 0.00
$I(Z \to X)$      0           0.00 ± 0.00
$I(Z \to Y)$      0           0.00 ± 0.00
Table 3. True and estimated DIR.

                       True DIR    Estimation with Our Method (Mean ± Std)
$I(X \to Y \| Z)$      0.59        0.57 ± 0.00
$I(X \to Z \| Y)$      0           0.00 ± 0.00
$I(Y \to Z \| X)$      1.42        1.52 ± 0.01
$I(Y \to X \| Z)$      0           0.01 ± 0.00
$I(Z \to X \| Y)$      0           0.01 ± 0.00
$I(Z \to Y \| X)$      0           0.01 ± 0.00
