Article

# Polynomial Representations of High-Dimensional Observations of Random Processes

Zhejiang University/University of Illinois at Urbana-Champaign Institute, Haining 314400, China
Mathematics 2021, 9(2), 123; https://doi.org/10.3390/math9020123
Submission received: 15 October 2020 / Revised: 4 December 2020 / Accepted: 21 December 2020 / Published: 7 January 2021

## Abstract

The paper investigates the problem of performing a correlation analysis when the number of observations is large. In such cases, it is often necessary to combine the random observations to achieve dimensionality reduction of the problem. A novel class of statistical measures is obtained by approximating the Taylor expansion of a general multivariate scalar symmetric function by a univariate polynomial in the variable given as a simple sum of the original random variables. The mean value of the polynomial is then a weighted sum of statistical central sum-moments with application-dependent weights. Computing the sum-moments is computationally efficient and amenable to mathematical analysis, provided that the distribution of the sum of random variables can be obtained. Among several auxiliary results also obtained, the first-order sum-moments corresponding to sample means are used to reduce the numerical complexity of linear regression by partitioning the data into disjoint subsets. The illustrative examples provided assume first- and second-order Markov processes.

## 1. Introduction

The interest in developing and improving statistical methods and models is driven by the ever increasing volume and variety of data. Making sense of data often requires uncovering the patterns and relationships it contains. One of the most commonly used universal tools for data exploration is the evaluation of cross-correlation among different data sets and of autocorrelation of a data sequence with itself. The correlation values can be utilized to carry out sensitivity analysis, to forecast future values, to visualize complex relationships among subsystems, and to evaluate other important spatio-temporal statistical properties. However, it is well established that correlation does not imply causation.
The correlations measure linear dependencies between pairs of random variables which is implicitly exploited in linear regression. The correlation values rely on estimated or empirical statistical moments such as sample mean and sample variance. If the mean values are removed from data, the correlations are referred to as covariances. The pairwise correlations can be represented as a fully connected graph with edges parameterized by time shifts and possibly by amplitude adjustments between the corresponding data sequences. However, the graph may become excessively complex when the number of random variables considered is large, for example, when analyzing multiple long sequences of random observations.
The problem of extending the notion of correlations and covariances to more than two random variables has been considered previously in the literature. In particular, the multirelation has been defined as a measure of linearity for multiple random variables in [1]. This measure is based on a geometric analysis of linear regression. Under the assumption of a multivariate Student-t distribution, the statistical significance of the multirelation coefficient is defined in terms of eigenvalues of the correlation matrix in [2]. A univariate correlation measure for multiple random variables is defined in [3] to be the sum of elements on the main diagonal of the covariance matrix. Linear regression is again utilized in [4] to define the multiple correlation coefficient. It is derived from the coefficient of determination of linear regression as the proportion of the variance in the dependent variable which is predictable from the independent variables.
The distance correlation measure between two random vectors based on the hypothesis testing of independence of random variables is proposed in [5]. Different operations on random variables including ordering, weighting, and a nonlinear transformation are assumed in [6] to define a family of Minkowski distance measures for random vectors. A maximal correlation measure for random vectors is introduced in [7]. It generalizes the concept of the maximal information coefficient, while nonlinear transformations are also used to allow assessment of nonlinear correlations. Similarly to [3], the sample cross-correlation between multiple realizations of two random vectors is shown in [8] to be proportional to the sum of diagonal elements of the product of the corresponding correlation matrices. The reference [9] investigates two-dimensional general and central moments for random matrices which are invariant to similarity transformations. Two new distance measures for random vectors are defined in [10] assuming the joint characteristic function. These measures can also be used to test for statistical independence of random vectors. The most recent paper [11] derives a linear correlation measure for multiple random variables using determinants of the correlation submatrices.
This brief literature survey indicates that there are still no commonly accepted correlation measures for multiple random variables. The measures considered in the literature are either rigorously derived but mathematically rather complicated, or they are one of many modifications of simpler but well-understood measures. In this paper, it is shown that by constraining the complexity of the multivariate Taylor series to reduce the number of its parameters or degrees-of-freedom, the Taylor series can be rewritten as a finite-degree univariate polynomial. The independent variable of this polynomial is a simple sum of the random variables considered. The polynomial coefficients are real-valued constants, and they are application dependent. The polynomial defines a many-to-one transformation of multiple random variables to another scalar random variable. The mean value of the polynomial represents a broad class of polynomial measures which can be used for any number of random variables. The mean value of each polynomial element corresponds to a general or central moment of the sum of random variables. Therefore, these moments are referred to here as sum-moments. In the case of multiple random vectors, similarly to computing the correlation or covariance matrix by first concatenating the vectors into one long vector, the sum-moments can be readily defined and computed for such a concatenated random vector. The main advantages of assuming sum-moments to study statistical properties of multiple random variables or multiple random vectors are the clarity in understanding their statistical significance, and the mathematical simplicity of their definition. Moreover, as long as the distribution of the sum of random variables can be found, a closed-form expression for the sum-moments can be obtained.
Before introducing the polynomial representations of random vectors in Section 4 together with central and general sum-moments, a number of auxiliary results are presented in Section 2 and Section 3. In particular, Section 2 summarizes the key results and concepts from the literature concerning stationary random processes and the estimation of their parameters via linear regression and the method of moments, and describes how to generate the 1st and the 2nd order Markov processes. Section 3 extends Section 2 by deriving additional results which are used in Section 4 and Section 5, such as a low-complexity approximation of linear regression and a procedure to generate multiple Gaussian processes with defined autocorrelation and cross-correlation. The main results of the paper are obtained in Section 4, including the definition of a class of polynomial statistical measures and sum-moments for multiple random variables and random processes. Other related concepts involving sums of random variables are also reviewed. Section 5 provides several examples to illustrate and evaluate the results obtained in the paper. The paper is concluded in Section 6.

## 2. Background

This section reviews key concepts and results from the literature which are used to develop new results in the subsequent sections. Specifically, the following concepts are briefly summarized: stationarity of random processes, estimation of general and central moments and of correlation and covariance using the method of moments, definitions of the cosine similarity and the Minkowski distance, parameter estimation via linear regression, generation of the 1st and the 2nd order Markov processes, and selected properties of polynomials and multivariate functions. Note that the more straightforward proofs of some lemmas are only sketched rather than fully elaborated.

#### 2.1. Random Processes

Consider a real-valued one-dimensional random process $x ( t ) ∈ R$ over a continuous time $t ∈ R$. The process is observed at N discrete time instances $t 1 < t 2 < … < t N$ corresponding to N random variables $X i = x ( t i )$, $i = 1 , … , N$. The random variables $X = { X 1 , … , X N } ∈ R N$ are completely statistically described by their joint density function $f x ( X )$, so that $f x ( X ) ≥ 0$ and $∫ R N f x ( X ) d X = 1$. The process $x ( t )$ is further assumed to be K-th order stationary [12].
Definition 1.
A random process $x ( t )$ is K-th order stationary, if,
$f x ( X 1 , … , X K ; t 1 , … , t K ) = f x ( X 1 , … , X K ; t 1 + t 0 , … , t K + t 0 ) ∀ t 0 ∈ R .$
Lemma 1.
The K-th order stationary process is also $( K − k )$-th order stationary, $k = 1 , 2 , … , K − 1$, for any subset ${ X i 1 , … , X i K − k } ⊆ { X 1 , … , X N }$.
Proof.
The unwanted random variables can be integrated out from the joint density. □
Unless otherwise stated, in the sequel, all observations of random processes are assumed to be stationary.
The expectation, $E X = ∫ R x f x ( x ) d x$, of a random variable X is a measure of its mean value. A linear correlation between two random variables $x ( t 1 )$ and $x ( t 2 )$ is defined as [12],
$R_x(t_1, t_2) = \mathrm{corr}[x(t_1), x(t_2)] = \mathrm{E}[x(t_1) \, x(t_2)] = R_x(t_2 - t_1) = R_x(\tau) .$
The (auto-) covariance measures a linear dependency between two zero-mean random variables, i.e.,
$C_x(t_1, t_2) = \mathrm{cov}[x(t_1), x(t_2)] = \mathrm{E}[(x(t_1) - \mathrm{E}[x(t_1)])(x(t_2) - \mathrm{E}[x(t_2)])] = C_x(t_2 - t_1) .$
It can be shown that the maximum of $C x ( τ )$, $τ = t 2 − t 1$, occurs for $τ = 0$, corresponding to the variance of a stationary process $x ( t )$, i.e.,
$C_x(0) = \mathrm{var}[x(t)] = \mathrm{E}[(x(t) - \mathrm{E}[x(t)])^2] \quad \forall t \in \mathbb{R} .$
The covariance $C_x(\tau)$ can be normalized, so that, $-1 \le C_x(\tau)/C_x(0) \le 1$. Furthermore, for real-valued regular processes, the covariance has an even symmetry, i.e., $C_x(\tau) = C_x(-\tau)$ [12].
For $N > 2$, it is convenient to define a random vector, $X = [ X 1 , … , X N ] T$, where $( · ) T$ denotes the transpose, and the corresponding mean vector, $X ¯ = E X$. Then, the covariance matrix,
$C_x = \mathrm{E}[(X - \bar{X})(X - \bar{X})^T] \in \mathbb{R}^{N \times N}$
has as its elements the pairwise covariances, $[ C x ] i j = C x ( t j − t i )$, $i , j = 1 , 2 , … , N$.
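For uniformly spaced samples of a stationary process, the covariance matrix above is Toeplitz, since each element depends only on the time lag $j - i$. A minimal sketch, assuming unit sample spacing and an exponential autocovariance model (the 1st order Markov form used later in Section 2.3):

```python
import numpy as np

def stationary_cov_matrix(acov, N):
    """Build [C_x]_{ij} = C_x(t_j - t_i) for unit-spaced samples of a
    stationary process from its autocovariance function acov(k)."""
    idx = np.arange(N)
    return acov(np.subtract.outer(idx, idx))

# Illustrative autocovariance model C_x(k) = sigma^2 * a^|k| (assumed values)
sigma2, a = 2.0, 0.5
C = stationary_cov_matrix(lambda k: sigma2 * a ** np.abs(k), 5)
```

The resulting matrix is symmetric with the variance $\sigma^2$ on its main diagonal.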
Assume now the case of two K-th order stationary random processes $x 1 ( t )$ and $x 2 ( t )$, and the discrete time observations,
$X_{1i} = x_1(t_{1i}), \quad i = 1, 2, \ldots, N_1; \qquad X_{2i} = x_2(t_{2i}), \quad i = 1, 2, \ldots, N_2 .$
Using the set notation, ${ X i } 1 : K = { X 1 , X 2 , … , X K }$, the jointly or mutually stationary processes imply time-shift invariance of their joint density function.
Definition 2.
Random processes $x 1 ( t )$ and $x 2 ( t )$ are K-th order jointly stationary, if,
$f x ( { X 1 i } 1 : K , { X 2 i } 1 : K ; { t 1 i } 1 : K , { t 2 i } 1 : K ) = f x ( { X 1 i } 1 : K , { X 2 i } 1 : K ; { t 1 i + t 0 } 1 : K , { t 2 i + t 0 } 1 : K ) ∀ t 0 ∈ R$
is satisfied for all subsets, ${ X 1 i } 1 : K ⊆ { X 1 i } 1 : N 1$, $K ≤ N 1$, and ${ X 2 i } 1 : K ⊆ { X 2 i } 1 : N 2$, $K ≤ N 2$.
Lemma 2.
The K-th order joint stationarity implies the joint stationarity of all orders smaller than K.
Proof.
The claim follows from marginalization of the joint density function to remove unwanted variables. □
The cross-covariance of random variables $X 1 = x 1 ( t 1 )$ and $X 2 = x 2 ( t 2 )$ being discrete time observations of jointly stationary random processes $x 1 ( t )$ and $x 2 ( t )$, respectively, is defined as,
$C_{x_1 x_2}(t_1, t_2) = \mathrm{cov}[X_1, X_2] = \mathrm{E}[(x_1(t_1) - \mathrm{E}[x_1(t_1)])(x_2(t_2) - \mathrm{E}[x_2(t_2)])] = C_{x_1 x_2}(t_2 - t_1) .$
The cross-covariance can be again normalized, so it is bounded as,
$-1 \le C_{x_1 x_2}(\tau) / \sqrt{\mathrm{var}[x_1(t)] \, \mathrm{var}[x_2(t)]} \le 1 .$
Note that, unlike for autocovariance, the maximum of $C x 1 x 2 ( τ )$ can occur for any value of the argument $τ$.
The covariance matrix for the random vectors $X_1 = [X_{11}, \ldots, X_{1N_1}]^T$ and $X_2 = [X_{21}, \ldots, X_{2N_2}]^T$ having the means, $\bar{X}_1 = \mathrm{E}[X_1]$ and $\bar{X}_2 = \mathrm{E}[X_2]$, respectively, is computed as,
$C_{x_1 x_2} = \mathrm{E}[(X_1 - \bar{X}_1)(X_2 - \bar{X}_2)^T] \in \mathbb{R}^{N_1 \times N_2} .$
Its elements are the covariances, $[ C x 1 x 2 ] i j = C x 1 x 2 ( t 1 i − t 2 j )$, $i = 1 , … , N 1$, $j = 1 , … , N 2$.
In addition to the first order (mean value) and the second order (covariance) statistical moments, higher order statistics of a random variable X are given by the general and the central moments defined, respectively, as [13],
$g_m(X) = \mathrm{E}[|X|^m] \quad \mathrm{and} \quad \mu_m(X) = \mathrm{E}[|X - \mathrm{E}[X]|^m], \quad m = 1, 2, \ldots$
where $|X|$ denotes the absolute value of the scalar variable X. The positive integer-valued moments (1) facilitate mathematically tractable integration, and prevent producing complex numbers when the argument is negative. Note also that the absolute value in (1) is necessary if X is complex-valued, or if m is odd, in order to make the moments real-valued and convex. The central moment can be normalized by the variance as,
$\mu_m(X) = \mathrm{E}[|X - \mathrm{E}[X]|^m] / \left( \mathrm{E}[(X - \mathrm{E}[X])^2] \right)^{m/2} .$
The cosine similarity between two equal-length vectors, $X 1 = [ X 11 , … , X 1 N ] T$ and $X 2 = [ X 21 , … , X 2 N ] T$, is defined as [14],
$S_{\cos}(X_1, X_2) = \frac{\sum_{i=1}^N X_{1i} X_{2i}}{\sqrt{\sum_{i=1}^N X_{1i}^2} \sqrt{\sum_{i=1}^N X_{2i}^2}} .$
The Minkowski distance between two equal-length vectors $X 1$ and $X 2$ is defined as the $l m$-norm [6], i.e.,
$S_{\mathrm{mnk}}(X_1, X_2) = \|X_1 - X_2\|_m = \left( \sum_{i=1}^N |X_{1i} - X_{2i}|^m \right)^{1/m}, \quad m = 1, 2, \ldots$
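Both measures follow directly from their definitions; a minimal sketch, where the vectors and the choice $m = 2$ are illustrative:

```python
import numpy as np

def cosine_similarity(x1, x2):
    """S_cos(X1, X2) = sum_i X1i X2i / (sqrt(sum_i X1i^2) sqrt(sum_i X2i^2))."""
    return np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))

def minkowski_distance(x1, x2, m):
    """S_mnk(X1, X2) = (sum_i |X1i - X2i|^m)^(1/m), the l_m-norm of X1 - X2."""
    return np.sum(np.abs(x1 - x2) ** m) ** (1.0 / m)

x1 = np.array([1.0, 2.0, 3.0])   # illustrative vectors
x2 = np.array([2.0, 4.0, 6.0])
s = cosine_similarity(x1, x2)    # collinear vectors give similarity 1
d = minkowski_distance(x1, x2, m=2)
```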

#### 2.2. Estimation Methods

Statistical moments can be empirically estimated from measured data using sample moments. Such an inference strategy is referred to as the method of moments [15]. In particular, under the ergodicity assumption, a natural estimator of the mean value of a random variable X from its measurements $X_i$ is the sample moment,
$X ¯ = E X ≈ 1 N ∑ i = 1 N X i .$
The sample mean estimator is unbiased and consistent [15]. More generally, the sample mean estimator of the first moment of a transformed random variable $h ( X )$ is,
$E h ( X ) ≈ 1 N − d ∑ i = 1 N h ( X i )$
where d is the number of degrees of freedom used in the transformation h, i.e., the number of other parameters which must be estimated. For example, the variance of X is estimated as,
$var X = E | X − X ¯ | 2 ≈ 1 N − 1 ∑ i = 1 N ( X i − X ¯ ^ ) 2$
where $X ¯ ^$ is the estimate of the mean value $X ¯$ of X.
Assuming zero-mean random sequences ${ X_{1i} }_{1:N_1}$ and ${ X_{2i} }_{1:N_2}$, their autocovariance and cross-covariance, respectively, are estimated as [15],
$C_x(k) = \mathrm{E}[X_i X_{i+k}] \approx \frac{1}{N-k} \sum_{i=1}^{N-k} X_i X_{i+k}, \quad k \ll N ,$
$C_{x_1 x_2}(k) = \mathrm{E}[X_{1i} X_{2(i+k)}] \approx \frac{1}{N-k} \sum_{i=1}^{N-k} X_{1i} X_{2(i+k)}, \quad k \ll N = \min(N_1, N_2) .$
Since these estimators are only consistent, the condition $k \ll N$ is necessary to combine a sufficient number of samples and thus achieve an acceptable estimation accuracy.
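A sketch of the autocovariance estimator, validated against an AR(1) (1st order Markov) sequence $x(n) = a\,x(n-1) + b\,u(n)$ whose theoretical autocovariance is $\sigma^2 a^{|k|}$ with $\sigma^2 = b^2/(1-a^2)$; the parameter values are illustrative:

```python
import numpy as np

def sample_autocov(x, k):
    """C_x(k) ~ 1/(N-k) sum_i X_i X_{i+k}, for zero-mean samples and k << N."""
    n = len(x)
    return np.dot(x[: n - k], x[k:]) / (n - k)

# Generate a long AR(1) sequence driven by white Gaussian noise u(n)
rng = np.random.default_rng(0)
a, b, n = 0.9, 1.0, 200_000
u = rng.standard_normal(n)
x = np.empty(n)
x[0] = u[0]
for i in range(1, n):
    x[i] = a * x[i - 1] + b * u[i]

sigma2 = b**2 / (1 - a**2)   # theoretical variance
c1 = sample_autocov(x, 1)    # should approach sigma2 * a for large n
```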
The parameters of a random process can be estimated by fitting a suitable data model to the measurements $X_i$. Denote such a data model as, $\mathrm{E}[X_i] = M_i(P)$, $i = 1, 2, \ldots, N$. Assuming the least-squares (LS) criterion, the vector of parameters, $P = [P_1, \ldots, P_D]^T \in \mathbb{R}^D$, is estimated as,
$P ^ = argmin P ∑ i = 1 N ( X i − M i ( P ) ) 2 .$
For continuous parameters, the minimum is obtained using the derivatives, i.e., let $\frac{d}{dP} M_i(P) = \dot{M}_i(P)$, and,
$\frac{d}{dP} \sum_{i=1}^N (X_i - M_i(P))^2 \stackrel{!}{=} 0 \;\Leftrightarrow\; \sum_{i=1}^N \dot{M}_i(\hat{P}) M_i(\hat{P}) = \sum_{i=1}^N \dot{M}_i(\hat{P}) X_i .$
For a linear data model, $M i ( P ) = w i T P$ where $w i = [ w 1 i , … , w D i ] T$ are known coefficients, the LS estimate can be obtained in the closed-form, i.e.,
$\hat{P} = \left( \sum_{i=1}^N w_i w_i^T \right)^{-1} \sum_{i=1}^N w_i X_i$
where $( · ) − 1$ denotes the matrix inverse.
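The closed-form linear LS estimate can be sketched as follows; the straight-line model and the noise-free data below are illustrative assumptions:

```python
import numpy as np

def linear_ls(W, x):
    """Closed-form LS estimate P = (sum_i w_i w_i^T)^(-1) sum_i w_i X_i,
    where the rows of W are the known coefficient vectors w_i^T."""
    return np.linalg.solve(W.T @ W, W.T @ x)

# Recover a noise-free straight line X_i = 2 + 3 t_i (illustrative data)
t = np.linspace(0.0, 1.0, 11)
W = np.column_stack([np.ones_like(t), t])   # w_i = [1, t_i]^T
x = 2.0 + 3.0 * t
P = linear_ls(W, x)
```

Using `np.linalg.solve` instead of explicitly inverting $\sum_i w_i w_i^T$ is the numerically preferred way to evaluate this expression.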

#### 2.3. Generating Random Processes

The task is to generate a discrete time stationary random process with a given probability density and a given autocovariance. The usual strategy is to generate a correlated Gaussian process followed by a nonlinear memoryless transformation. For instance, the autoregressive (AR) process described by the second order difference equation with constant coefficients $a 1$, $a 2$, and $b > 0$ [12], i.e.,
$x ( n ) + a 1 x ( n − 1 ) + a 2 x ( n − 2 ) = b u ( n )$
generates the 2nd order Markov process from a zero-mean white (i.e., uncorrelated) process $u(n)$. For $a_1 = -2\sqrt{(1-a)/(1+a)}$ and $a_2 = (1-a)/(1+a)$, $0 < a < 1$, corresponding to a double pole of the recursion, this process has the autocovariance,
$C_{2\mathrm{MP}}(k) = \underbrace{\frac{b^2 (1+a)^2}{4 a^3}}_{\sigma^2} \underbrace{\left( \frac{1-a}{1+a} \right)^{|k|/2}}_{\doteq\, (1-a)^{|k|}} (1 + a|k|) \doteq \sigma^2 (1-a)^{|k|} (1 + a|k|) = \sigma^2 e^{-|k| (-\ln(1-a))} (1 + a|k|) .$
On the other hand, for $a_1 = -a$, $0 < a < 1$, and $a_2 = 0$, the AR process,
$x(n) - a \, x(n-1) = b \, u(n)$
generates the 1st order Markov process with the autocovariance,
$C_{1\mathrm{MP}}(k) = \underbrace{\frac{b^2}{1 - a^2}}_{\sigma^2} a^{|k|} = \sigma^2 e^{-\alpha |k|}, \quad a = e^{-\alpha} .$
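The 2nd order Markov autocovariance can be cross-checked numerically. The sketch below assumes the AR(2) coefficients correspond to a double pole at $r = \sqrt{(1-a)/(1+a)}$, so that the impulse response is $h(n) = b(n+1)r^n$ and the variance is $b^2 \sum_n (n+1)^2 r^{2n}$; the values of $a$ and $b$ are illustrative:

```python
import numpy as np

a, b = 0.5, 1.0
r = np.sqrt((1.0 - a) / (1.0 + a))   # double pole of the AR(2) recursion

# Impulse response of x(n) - 2r x(n-1) + r^2 x(n-2) = b u(n): h(n) = b(n+1) r^n
n = np.arange(2000)
h = b * (n + 1) * r**n

var_numeric = np.dot(h, h)            # C(0) = b^2 sum_n (n+1)^2 r^(2n)
c1_numeric = np.dot(h[:-1], h[1:])    # C(1)

var_formula = b**2 * (1 + a) ** 2 / (4 * a**3)   # sigma^2 in the text
c1_formula = var_formula * r * (1 + a)           # sigma^2 r^|k| (1 + a|k|), k = 1
```

The numeric sums agree with the closed-form variance $b^2(1+a)^2/(4a^3)$ and the lag-one value $\sigma^2 r (1+a)$ to machine precision.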
Lemma 3.
[16] The stationary random process $x ( k )$ with autocovariance $C x ( k )$ is transformed by a linear time-invariant system with real-valued impulse response $h ( k )$ into another stationary random process $y ( k ) = ∑ i = − ∞ ∞ h ( i ) x ( k − i ) ≡ h ( k ) ⊛ x ( k )$ with autocovariance, $C y ( k ) = h ( k ) ⊛ h ( − k ) ⊛ C x ( k )$. The symbol, ⊛, denotes (discrete time) convolution.
Proof.
By definition, the output covariance, $C y ( k , l ) = E ( y ( k ) − E y ( k ) ) ( y ( l ) − E y ( l ) )$. Substituting $y ( k ) = ∑ i = − ∞ ∞ h ( i ) x ( k − i )$, and rearranging, the covariance, $C y ( k − l ) = ∑ i ∑ m h ( m ) C x ( i − m ) h ( k − i ) = h ( k ) ⊛ h ( − k ) ⊛ C x ( k )$. □
Lemma 4.
A stationary random process at the output of a linear or nonlinear time-invariant system remains stationary.
Proof.
For any multivariate function $h ( x ) ∈ R$ and any $t 0 ∈ R$, the expectation,
$E h ( x ) = ∫ R N h ( x ) f X ( x ; t 1 , … , t N ) d x = ∫ R N h ( x ) f X ( x ; t 1 − t 0 , … , t N − t 0 ) d x$
assuming Definition 1 and provided that dimension N of observations is at most equal to the order of stationarity K. □
For shorter sequences, the linear transformation, $X = T U$, can be used to generate a normally distributed vector $X ∈ R N$ having the covariance matrix, $C x = T T T$, from uncorrelated Gaussian vector $U ∈ R N$. The mean, $E X = T E U$. For longer sequences, a linear time-invariant filter can be equivalently used as indicated by Lemma 3.
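A sketch of the linear transformation $X = TU$ for a short sequence, using a Cholesky factor as one valid choice of $T$ satisfying $T T^T = C_x$; the target covariance model below is an assumed example:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 3

# Assumed target covariance: exponential model C_x(i, j) = 0.8^|i-j|
idx = np.arange(N)
C_target = 0.8 ** np.abs(np.subtract.outer(idx, idx))

# One valid transformation with T T^T = C_x is the Cholesky factor
T = np.linalg.cholesky(C_target)
U = rng.standard_normal((N, 100_000))   # uncorrelated Gaussian vectors
X = T @ U                               # correlated Gaussian vectors
C_sample = X @ X.T / U.shape[1]         # empirical covariance
```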

#### 2.4. Polynomials and Multivariate Functions

Lemma 5.
A univariate twice differentiable function $p(x)$ is convex, if and only if, $\frac{d^2}{dx^2} p(x) = \ddot{p}(x) \ge 0$ for $\forall x \in \mathbb{R}$. More generally, a twice differentiable multivariate function $f: \mathbb{R}^N \mapsto \mathbb{R}$ is convex, if and only if, the domain of f is convex, and its Hessian is positive semidefinite, i.e., $\nabla^2 f \ge 0$ for $\forall x \in \operatorname{dom} f$.
Proof.
See [17] (Sec. 3.1). □
Consequently, convex polynomials can be generated as follows.
Lemma 6.
Let $Q ∈ R m × m$ be a positive semidefinite matrix, and assume a polynomial, $p ¨ ( x ) = ∑ i = 0 2 m − 2 p i x i$ for $∀ x ∈ R$ where $p i = ∑ k + l = i Q k l$. Then, for any $q 0 , q 1 ∈ R$, the polynomial $p ( x )$ of degree $2 m$,
$p ( x ) = q 0 + q 1 x + ∑ i = 0 2 m − 2 p i ( i + 1 ) ( i + 2 ) x 2 + i$
is convex.
Proof.
Let $x = [ x 0 , x 1 , … , x m − 1 ] T$. Then, $x T Q x = ∑ i = 0 2 m − 2 p i x i = p ¨ ( x ) ≥ 0$ for $∀ x$, since $Q$ is positive semidefinite. Using Lemma 5 concludes the proof. □
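Lemma 6 can be checked numerically; the matrix $Q = A^T A$ below is an arbitrary positive semidefinite example, and the helper `convex_poly_coeffs` is an illustrative name:

```python
import numpy as np

def convex_poly_coeffs(Q, q0, q1):
    """Ascending-power coefficients of the degree-2m polynomial
    p(x) = q0 + q1 x + sum_i p_i / ((i+1)(i+2)) x^(i+2),
    where p_i = sum_{k+l=i} Q_kl (cf. Lemma 6)."""
    m = Q.shape[0]
    p = np.zeros(2 * m - 1)
    for k in range(m):
        for l in range(m):
            p[k + l] += Q[k, l]
    coeffs = np.zeros(2 * m + 1)
    coeffs[0], coeffs[1] = q0, q1
    for i in range(2 * m - 1):
        coeffs[i + 2] = p[i] / ((i + 1) * (i + 2))
    return coeffs

# Q = A^T A is positive semidefinite by construction (illustrative example)
A = np.array([[1.0, 2.0], [0.0, 1.0]])
Q = A.T @ A
c = convex_poly_coeffs(Q, q0=1.0, q1=-3.0)

# The second derivative p''(x) = sum_i p_i x^i should be nonnegative for all x
xs = np.linspace(-5.0, 5.0, 101)
ppp = sum(c[i + 2] * (i + 1) * (i + 2) * xs**i for i in range(len(c) - 2))
```

The linear part $q_0 + q_1 x$ does not affect convexity, which is why $q_0$ and $q_1$ may be chosen freely.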
For a non-negative integer $m ∈ { 0 } ∪ N + = { 0 , 1 , 2 , … }$, assume the following notations to simplify the subsequent mathematical expressions [18]:
$n = \{n_1, n_2, \ldots, n_N\} \in \{0, 1, 2, \ldots\}^N$
$x = \{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^N$
$\|x\|_p = \left( |x_1|^p + |x_2|^p + \cdots + |x_N|^p \right)^{1/p}$
$\|x\|_1 = |x_1| + |x_2| + \cdots + |x_N|$
$|x|_1 = x_1 + x_2 + \cdots + x_N$
$h = \{h_1, h_2, \ldots, h_N\} \in \mathbb{R}^N$
$h^n = h_1^{n_1} h_2^{n_2} \cdots h_N^{n_N}$
$\partial^n f(x) = \partial_1^{n_1} \partial_2^{n_2} \cdots \partial_N^{n_N} f(x) = \frac{\partial^{|n|_1}}{\partial x_1^{n_1} \partial x_2^{n_2} \cdots \partial x_N^{n_N}} f(x)$
$m! = \prod_{i=1}^m i, \qquad n! = n_1! \, n_2! \cdots n_N!$
Note that $|x|_1$ denotes the sum of the elements of $x$ whereas $\|x\|_1$ is the sum of the absolute values of its elements.
Lemma 7.
The m-th power of a finite sum can be expanded as [19],
$|x|_1^m = \sum_{|n|_1 = m} \frac{m!}{n!} \, x^n .$
Proof.
See [18]. □
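Lemma 7 can be checked numerically by brute force over all multi-indices with $|n|_1 = m$; the function name and the test vector below are illustrative:

```python
from itertools import product
from math import factorial, prod

def multinomial_expansion(x, m):
    """Evaluate sum_{|n|_1 = m} m!/n! x^n, the right-hand side of Lemma 7."""
    total = 0.0
    for n in product(range(m + 1), repeat=len(x)):
        if sum(n) == m:
            coef = factorial(m) // prod(factorial(ni) for ni in n)
            total += coef * prod(xi**ni for xi, ni in zip(x, n))
    return total

x = [1.0, 2.0, -0.5]
lhs = sum(x) ** 3                  # |x|_1^m with m = 3
rhs = multinomial_expansion(x, 3)
```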
Theorem 1.
The multivariate Taylor’s expansion of a $( m + 1 )$-order differentiable function $f : R N ↦ R$ about the point $x ∈ R N$ is written as [19],
$f(x + h) = f(x) + \sum_{1 \le |n|_1 \le m} \frac{\partial^n f(x)}{n!} h^n + \sum_{|n|_1 = m+1} \frac{\partial^n f(x + t h)}{n!} h^n$
for some $t ∈ ( 0 , 1 )$.
Proof.
See [18,19]. □
Definition 3.
A multivariate function $f ( x ) = f ( x 1 , … , x N )$ is said to be symmetric, if, for any permutation of its arguments $x$ denoted as $x ′$, $f ( x ) = f ( x ′ )$.

## 3. Background Extensions

In this section, additional results are obtained which are used in the next section to introduce statistical sum-moments for random vectors. In particular, the mean cosine similarity, the mean Minkowski distance as well as the higher central moments for random vectors are defined. A polynomial approximation of univariate functions is shown to be a linear regression problem. A numerically efficient solution of the LS problem is derived. Finally, a procedure to generate multiple Gaussian processes with defined autocovariance and cross-covariance is devised.
Recall the second moment of a random variable X, i.e.,
$μ 2 = E ( X − c ) 2 , c ∈ R .$
It is straightforward to show that $\mu_2$ is minimized for $c = \mathrm{E}[X]$, giving the variance of X. On the other hand, $\mu_2 = 0$, if and only if $c = \mathrm{E}[X] \pm \sqrt{-\mathrm{var}[X]}$, which is complex-valued unless $\mathrm{var}[X] = 0$.
For random vectors, both cosine similarity and Minkowski distance are random variables. Assuming that the vectors are jointly stationary, and their elements are identically distributed, the mean cosine similarity can be defined as,
$\bar{S}_{\cos}(X_1, X_2) = \frac{\sum_{i=1}^N \mathrm{E}[X_{1i} X_{2i}]}{\sqrt{\sum_{i=1}^N \left( \mathrm{E}[X_{1i}^2] - \bar{X}_{1i}^2 \right)} \sqrt{\sum_{i=1}^N \left( \mathrm{E}[X_{2i}^2] - \bar{X}_{2i}^2 \right)}} = \frac{1}{N} \sum_{i=1}^N \frac{\mathrm{E}[X_{1i} X_{2i}]}{\sigma_1 \sigma_2} = \frac{1}{N} \sum_{i=1}^N \rho_{1i,2i}$
where $σ 1 2 = var X 1 i$, $σ 2 2 = var X 2 i$, and, $ρ 1 i , 2 i$ denotes the Pearson correlation coefficient of the i-th elements of the vectors $X 1$ and $X 2$. It should be noted that this definition of the mean cosine similarity does not account for other correlations, $E X 1 i X 2 j$, $i ≠ j$.
The mean Minkowski distance for random vectors can be defined as,
$\bar{S}_{\mathrm{mnk}}(X_1, X_2) = \left( \sum_{i=1}^N \mathrm{E}[|X_{1i} - X_{2i}|^m] \right)^{1/m}, \quad m = 1, 2, \ldots$
Recognizing the m-th general moment in (5), the m-th power of the mean Minkowski distance can be normalized as,
$\tilde{S}_{\mathrm{mnk}}^m(X_1, X_2) = \frac{\sum_{i=1}^N \mathrm{E}[|X_{1i} - X_{2i}|^m]}{N \left( \mathrm{E}[(X_{1i} - X_{2i})^2] \right)^{m/2}} = \bar{\mu}_m(X_1 - X_2), \quad m = 1, 2, \ldots$
where the average Minkowski distance between two random vectors is,
$μ ¯ m ( X 1 − X 2 ) = 1 N ∑ i = 1 N μ m ( X 1 i − X 2 i ) , m = 1 , 2 , …$
Furthermore, note that for $m = 2$,
$\frac{1}{2} \left( \mathrm{E}[\|X_1 - X_2\|_2^2] + \mathrm{E}[\|X_1 + X_2\|_2^2] \right) = \mathrm{E}[\|X_1\|_2^2] + \mathrm{E}[\|X_2\|_2^2] .$
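The $m = 2$ identity above is the parallelogram law, which holds for every realization and hence also in expectation; a quick numeric check with arbitrary vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.standard_normal(6)   # arbitrary realizations
x2 = rng.standard_normal(6)

# 1/2 (||x1 - x2||^2 + ||x1 + x2||^2) = ||x1||^2 + ||x2||^2
lhs = 0.5 * (np.sum((x1 - x2) ** 2) + np.sum((x1 + x2) ** 2))
rhs = np.sum(x1**2) + np.sum(x2**2)
```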
Assuming positive integers $m = { m 1 , m 2 , … , m N }$, the higher order joint central moments of random vector $X = { X i } 1 : N$ can be defined as,
$μ m 1 , … , m N ( X 1 , … , X N ) = E ∏ i = 1 N ( X i − X ¯ i ) m i$
or, using more compact index notation as,
$μ m ( X ) = E ( X − X ¯ ) m .$

#### 3.1. Linear LS Estimation

The linear LS estimation can be used to fit a degree $( D − 1 )$ polynomial to N samples of a random process $x ( t )$ at discrete time instances $t 1 , t 2 , … , t N$. Hence, consider the polynomial data model,
$E x ( t ) ≈ M ( t ; P ) = ∑ k = 1 D P k t k − 1 .$
Denoting $w i = [ t i 0 , t i 1 , … , t i D − 1 ] T$, the linear LS solution (2) gives the estimates,
$\hat{P} = \left( \sum_{i=1}^N \begin{bmatrix} t_i^0 & t_i^1 & \cdots & t_i^{D-1} \\ t_i^1 & t_i^2 & \cdots & t_i^{D} \\ \vdots & \vdots & \ddots & \vdots \\ t_i^{D-1} & t_i^{D} & \cdots & t_i^{2D-2} \end{bmatrix} \right)^{-1} \sum_{i=1}^N x(t_i) \begin{bmatrix} t_i^0 \\ t_i^1 \\ \vdots \\ t_i^{D-1} \end{bmatrix} .$
Assuming $D = 2$ parameters, i.e., the linear LS regression for a straight line, the parameters $P 1$ and $P 2$ to be estimated must satisfy the following equality:
$\frac{d}{dP_1} \sum_{i=1}^N \left( X_i - w_{1i} P_1 - w_{2i} P_2 \right)^2 = -2 \sum_{i=1}^N w_{1i} X_i + 2 \sum_{i=1}^N w_{1i}^2 P_1 + 2 \sum_{i=1}^N w_{1i} w_{2i} P_2 \stackrel{!}{=} 0 .$
Denoting the weighted sums, $\bar{X} = \sum_{i=1}^N w_{1i} X_i$, $\|w_1\|_2^2 = \sum_{i=1}^N w_{1i}^2$, and $\bar{w}_{12} = \sum_{i=1}^N w_{1i} w_{2i}$, a necessary but not sufficient condition for the linear LS estimation of parameters $P_1$ and $P_2$ is,
$\bar{X} = \|w_1\|_2^2 P_1 + \bar{w}_{12} P_2 .$
In the LS terminology, the values $\|w_1\|_2^2$ and $\bar{w}_{12}$ represent independent variables whereas $\bar{X}$ is a dependent variable.
Note that all N measurements are used in (7). However, if N is sufficiently large, the data points could be split into two disjoint sets of $N 1$ and $N 2$ elements, respectively, $N 1 + N 2 = N$. For convenience, denote the sums,
$\bar{X}_1 = \frac{1}{a_1} \sum_{i \in I_1} w_{1i} X_i, \quad \bar{W}_{11} = \frac{1}{a_1} \sum_{i \in I_1} w_{1i}^2, \quad \bar{W}_{12} = \frac{1}{a_1} \sum_{i \in I_1} w_{1i} w_{2i},$
$\bar{X}_2 = \frac{1}{a_2} \sum_{i \in I_2} w_{1i} X_i, \quad \bar{W}_{21} = \frac{1}{a_2} \sum_{i \in I_2} w_{1i}^2, \quad \bar{W}_{22} = \frac{1}{a_2} \sum_{i \in I_2} w_{1i} w_{2i},$
where $a 1 , a 2 > 1$ are some constants (to be determined later), and $I 1$ and $I 2$ are two disjoint index sets, such that $I 1 ∪ I 2 = { 1 , 2 , … , N }$, and the cardinality, $| I 1 | = N 1$ and $| I 2 | = N 2$. Note also that,
$\|w_1\|_2^2 = a_1 \bar{W}_{11} + a_2 \bar{W}_{21}, \quad \bar{w}_{12} = a_1 \bar{W}_{12} + a_2 \bar{W}_{22}, \quad \bar{X} = a_1 \bar{X}_1 + a_2 \bar{X}_2 .$
Consequently, the approximate LS estimates of parameters $P 1$ and $P 2$ are readily computed as,
$\begin{bmatrix} \hat{P}_1 \\ \hat{P}_2 \end{bmatrix} = \begin{bmatrix} \bar{W}_{11} & \bar{W}_{12} \\ \bar{W}_{21} & \bar{W}_{22} \end{bmatrix}^{-1} \begin{bmatrix} \bar{X}_1 \\ \bar{X}_2 \end{bmatrix} .$
There are $2^N$ ways to split the N data points into two disjoint subsets indexed by $I_1$ and $I_2$. More importantly, the estimates (9) do not guarantee the optimum LS fit, i.e., achieving the minimum squared error (MSE). However, the complexity of performing the LS fit is greatly reduced by splitting the data, since independently of the value of $N \gg 1$, only a $2 \times 2$ matrix needs to be inverted. The optimum LS fit and the approximate LS fit (9) are depicted in Figure 1. The points $A_1$ and $A_2$ in Figure 1 correspond to the data subsets indexed by $I_1$ and $I_2$, respectively. The mid-point, $B = a_1 A_1 + a_2 A_2$, follows from (8). Note that B is always located at the intersection of the optimum and the approximate lines $LS_{\mathrm{opt}}$ and $LS_{\mathrm{apr}}$, respectively. The vertical arrows at points $A_1$ and $A_2$ in Figure 1 indicate that the dependent values $\bar{X}_1$ and $\bar{X}_2$ are random variables.
The larger the variation of the gradient of the line $LS_{\mathrm{apr}}$ in Figure 1, the larger the uncertainty, and thus the probability that the line $LS_{\mathrm{apr}}$ deviates from the optimum regression line $LS_{\mathrm{opt}}$. Since the line $LS_{\mathrm{apr}}$ is defined by the points $\bar{X}_1$ and $\bar{X}_2$, and always, $B \in LS_{\mathrm{apr}}$, the spread of the random variables $\bar{X}_1$ and $\bar{X}_2$ about their mean values affects the likelihood that $LS_{\mathrm{apr}}$ deviates from $LS_{\mathrm{opt}}$. In particular, given $0 < p < 1$, there exists $\xi > 0$, such that the probability of the gradient $G_{\mathrm{apr}}$ of $LS_{\mathrm{apr}}$ being within the given bounds is at least,
$\Pr\left[ b_l(\xi) < G_{\mathrm{apr}} < b_u(\xi) \right] \ge p$
where
$b_l(\xi) = \frac{\mathrm{E}[\bar{X}_2] - \xi \sqrt{\mathrm{var}[\bar{X}_2]} - \mathrm{E}[\bar{X}_1] - \xi \sqrt{\mathrm{var}[\bar{X}_1]}}{\bar{W}_{22} - \bar{W}_{12}}, \qquad b_u(\xi) = \frac{\mathrm{E}[\bar{X}_2] + \xi \sqrt{\mathrm{var}[\bar{X}_2]} - \mathrm{E}[\bar{X}_1] + \xi \sqrt{\mathrm{var}[\bar{X}_1]}}{\bar{W}_{22} - \bar{W}_{12}} .$
For stationary measurements (cf. (9)), the means and the variances in (10) are equal to,
$\mathrm{E}[\bar{X}_1] = \mathrm{E}[\bar{X}] \, N_1/(a_1 N), \quad \mathrm{var}[\bar{X}_1] = \mathrm{var}[\bar{X}] \, N_1/(a_1^2 N),$
$\mathrm{E}[\bar{X}_2] = \mathrm{E}[\bar{X}] \, N_2/(a_2 N), \quad \mathrm{var}[\bar{X}_2] = \mathrm{var}[\bar{X}] \, N_2/(a_2^2 N) .$
The uncertainty in computing $G_{\mathrm{apr}}$ from data, and thus, also the probability of the line $LS_{\mathrm{apr}}$ deviating from the line $LS_{\mathrm{opt}}$, is proportional to the width of the interval in (10), i.e.,
$b_u(\xi) - b_l(\xi) = \frac{2 \xi \left( \sqrt{\mathrm{var}[\bar{X}_1]} + \sqrt{\mathrm{var}[\bar{X}_2]} \right)}{\bar{W}_{22} - \bar{W}_{12}} \propto \frac{\sqrt{N_1}/a_1 + \sqrt{N_2}/a_2}{\bar{W}_{22} - \bar{W}_{12}} .$
Consequently, the numerator of (11) must be minimized, and the denominator maximized.
In order to minimize the numerator in (11), it is convenient to choose, $a_1 = N_1^\nu$ and $a_2 = N_2^\nu$, where $\nu \in \mathbb{R}^+$ is a constant to be optimized. It is straightforward to show that, for any fixed $\nu > 1/2$, the expression, $(N_1^{1/2-\nu} + N_2^{1/2-\nu})$, is a convex function of the split $N_1$ (with $N_2 = N - N_1$), i.e., it has a unique global minimum. Hence, a necessary condition to reduce the approximation error is that $a_1 > \sqrt{N_1}$ and $a_2 > \sqrt{N_2}$. For numerical convenience, let $a_1 = N_1$ and $a_2 = N_2$. Then, the dependent and independent variables assumed in (9) become arithmetic averages, and the optimum index subsets have the cardinality,
$|I_1| = |I_2| = N/2 \qquad (N\ \text{even})$
$|I_1| = |I_2| \pm 1 = (N \pm 1)/2 \qquad (N\ \text{odd}) .$
In order to maximize the denominator in (11), assume that the data points entering (9) are sorted by the independent variable $w_{1i}$, i.e., let $w_{11} < w_{12} < \ldots < w_{1N}$. This ordering and the condition (12) suggest that the disjoint index sets $I_1$ and $I_2$ maximizing the difference, $\bar{W}_{22} - \bar{W}_{12}$, are,
$I_1 = \{1, 2, \ldots, N/2\}, \quad I_2 = \{N/2 + 1, \ldots, N\} \qquad (N\ \text{even})$
$I_1 = \{1, 2, \ldots, (N-1)/2\}, \quad I_2 = \{(N+1)/2, \ldots, N\}, \quad \text{or,}$
$I_1 = \{1, 2, \ldots, (N+1)/2\}, \quad I_2 = \{(N+3)/2, \ldots, N\} \qquad (N\ \text{odd}) .$
Such a partitioning corresponds to splitting the data into two equal (N-even) or approximately equal (N-odd) sized subsets by the median (2-quantile) index point.
In summary, the approximate linear LS regression can be computed efficiently and with good accuracy by splitting the data into multiple disjoint subsets, calculating the average data points in each of these subsets, and then solving a set of linear equations with the same (or a smaller) number of unknown parameters. The data splitting should exploit the ordering of data points by one of the independent variables. The accuracy of the approximate LS regression can be expected to improve with the number of data points N. Numerical evaluation of the approximate LS regression is considered in Section 5.
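The median-split approximation of the straight-line LS fit can be sketched as follows, with $a_1 = N_1$ and $a_2 = N_2$ so that the variables in (9) become plain averages; the simulated line and noise level are illustrative assumptions:

```python
import numpy as np

def approx_line_fit(t, x):
    """Approximate straight-line LS fit: sort by t, split at the median,
    and solve the 2 x 2 system of subset averages (a_k = N_k)."""
    order = np.argsort(t)
    t, x = t[order], x[order]
    h = len(t) // 2
    W = np.array([[1.0, t[:h].mean()], [1.0, t[h:].mean()]])
    xbar = np.array([x[:h].mean(), x[h:].mean()])
    return np.linalg.solve(W, xbar)   # [intercept, slope]

rng = np.random.default_rng(3)
t = np.linspace(0.0, 10.0, 200)
x = 1.0 + 0.5 * t + 0.1 * rng.standard_normal(t.size)   # illustrative data
P = approx_line_fit(t, x)
```

Only a $2 \times 2$ system is solved regardless of N, which is the complexity saving described above.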

#### 3.2. Generating Pairwise-Correlated Gaussian Processes

How to generate a single correlated Gaussian process is well established in the literature, and it has been described in Section 2.3. Moreover, the linear transformation to generate correlated Gaussian variables from uncorrelated Gaussian variables does not have to be square. A sufficient condition on the rank of linear transformation to obtain a positive definite covariance matrix is given by the following lemma.
Lemma 8.
The matrix, $T^T T$, is positive definite, provided that the matrix, $T \in \mathbb{R}^{N_1 \times N_2}$, has rank $N_2$.
Proof.
The rank $N_2$ of $T$ implies that $T$ consists of $N_2$ linearly independent columns and that $N_1 \ge N_2$. The matrix $T^T T$ is positive definite, provided that $U^T T^T T U = \|T U\|_2^2 > 0$ for all nonzero $U \in \mathbb{R}^{N_2}$, where $\|\cdot\|_2$ denotes the Euclidean norm of a vector. Since the columns of $T$ are linearly independent, $\|T U\|_2 = 0$, if and only if $U = 0$. □
Corollary 1.
The matrix, $T T^T$, is positive definite, provided that the matrix, $T \in R^{N_1 \times N_2}$, has rank $N_1$.
Corollary 2.
A rank $N_1$ linear transformation $T$ of an uncorrelated Gaussian vector $U \in R^{N_2}$ generates $N_1 \leq N_2$ correlated Gaussian variables having the (positive definite) covariance matrix, $T T^T$.
Furthermore, it is often necessary to generate multiple mutually correlated Gaussian processes with given autocorrelation as well as cross-correlation.
Lemma 9.
The linear transformation,
$x 1 x 2 = T 1 K 0 T 2 u 1 u 2$
generates a pair of correlated Gaussian vectors $x_1 \in R^{N_1}$ and $x_2 \in R^{N_2}$ from uncorrelated zero-mean, unit-variance Gaussian vectors $u_1, u_2 \in R^N$ where $0$ denotes a zero matrix, and according to Corollary 2, it is necessary that, $\max(N_1, N_2) \leq N$. The corresponding (auto-) correlation and cross-correlation matrices are,
$E\, u_1 u_1^T = I, \quad E\, u_2 u_2^T = I, \quad E\, u_1 u_2^T = 0$
$C x 1 = E x 1 x 1 T = T 1 T 1 T + K K T , C x 2 = E x 2 x 2 T = T 2 T 2 T , C x 1 x 2 = E x 1 x 2 T = K T 2 T = C x 2 x 1 T$
where $I$ denotes an identity matrix.
Proof.
The proof is straightforward by substituting (13) into the definitions of $C_{x_1}$, $C_{x_2}$, and $C_{x_1 x_2}$. □
The following corollary details the procedure described in Lemma 9.
Corollary 3.
Given $T_2$, calculate the (auto-) correlation matrix, $C_{x_2} = T_2 T_2^T$, or vice versa. Then, given the cross-correlation matrix, $C_{x_1 x_2}$, compute $K = C_{x_1 x_2} T_2 C_{x_2}^{-T}$. Finally, given $T_1$, calculate the (auto-) correlation matrix, $C_{x_1} = T_1 T_1^T + K K^T$, or, obtain $T_1$ by solving the matrix equation, $T_1 T_1^T = C_{x_1} - K K^T$. Note that the matrix equation, $T T^T = C$, can be solved for $T$ using the eigenvalue (singular value) decomposition, $C = U \Lambda U^T$, where $U$ is a unitary matrix and $\Lambda$ is a diagonal matrix of eigenvalues. Then, $T = U \Lambda^{1/2}$.
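The procedure of Corollary 3 can be sketched numerically. The following NumPy check is illustrative (not part of the paper): the matrices $T_1$, $T_2$ and the target cross-covariance are arbitrary choices, and a pseudoinverse stands in for the inversion step in the rectangular case.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4          # length of the uncorrelated vectors u1, u2
N1, N2 = 3, 3  # lengths of the correlated outputs x1, x2

# Choose T2, which fixes the autocovariance C_x2 = T2 T2^T.
T2 = rng.standard_normal((N2, N))
C_x2 = T2 @ T2.T

# Choose a target cross-covariance, then K such that K T2^T = C_x1x2
# (a pseudoinverse covers the rectangular case).
C_x1x2 = 0.1 * rng.standard_normal((N1, N2))
K = C_x1x2 @ np.linalg.pinv(T2.T)

# Choose T1, which then fixes C_x1 = T1 T1^T + K K^T.
T1 = rng.standard_normal((N1, N))
C_x1 = T1 @ T1.T + K @ K.T

# Generate samples via the block transformation (13) and check the
# empirical covariances against the prescribed ones.
M = 500000
u1 = rng.standard_normal((N, M))
u2 = rng.standard_normal((N, M))
x1 = T1 @ u1 + K @ u2      # upper block row of (13)
x2 = T2 @ u2               # lower block row of (13)

assert np.allclose(x1 @ x1.T / M, C_x1, atol=0.1)
assert np.allclose(x2 @ x2.T / M, C_x2, atol=0.1)
assert np.allclose(x1 @ x2.T / M, C_x1x2, atol=0.1)
```

The empirical second moments match the prescribed auto- and cross-covariances up to the Monte Carlo error.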

## 4. Polynomial Statistics and Sum-Moments for Vectors of Random Variables

The main objective of this section is to define a universal function to effectively measure the statistics of random vectors and random processes observed at multiple discrete time instances. The measure function should: (1) be universally applicable for an arbitrary number of random variables and random vectors, (2) be symmetric, so that all random variables are considered equally, (3) lead to mathematically tractable expressions, and (4) be convex to allow defining convex optimization problems.
Let $f : R N ↦ R$ denote such a mapping or transformation of N random variables $X i$ to a scalar random variable Y, i.e.,
$Y = f ( X 1 , X 2 , … , X N ) = f ( x ) .$
In order to satisfy the symmetry requirement, the random variables $X i$ can be first combined as,
$Y = f ( X 1 ∘ X 2 ∘ ⋯ ∘ X N )$
using a binary commutative operator, ∘, such as addition or multiplication. In the case of addition, it is important that the function f is nonlinear. The nonlinearity can also be used to limit the extreme values of the combined variables.
For a random process $x ( t )$, the random variables are defined as, $X i = x ( t i )$. Define a vector of discrete time instances, $t = ( t 1 , … , t N )$, and assume the index notation, $x ( t ) = { X 1 = x ( t 1 ) , … , X N = x ( t N ) }$. Then the mapping (14) can be rewritten as,
$Y = f ( x ( t ) ) = F ( t ) = F ( t 1 , t 2 , … , t N ) .$
The mean value is the most important statistical property of the random variable Y. In addition, if the process $x ( t )$ is K-th order stationary, the dimension of the mean mapping for N observations is reduced by one.
Lemma 10.
The mean of $Y = f ( x ( t ) )$ for N discrete time observations of a K-th order stationary random process has dimension $( N − 1 )$, i.e., if $N ≤ K$,
$\bar Y = E f(x(t)) = C(t_2 - t_1, t_3 - t_1, \ldots, t_N - t_1) = C(\tau_1, \tau_2, \ldots, \tau_{N-1}) = C(\tau)$
where $τ i = t i + 1 − t 1$, $i = 1 , 2 , … , N − 1$.
Proof.
Assuming Lemma 1 and the first sample $X 1 = x ( t 1 )$ to be a reference, the joint probability density of N process observations becomes, $f X ( x ; 0 , t 2 − t 1 , … , t N − t 1 ) ≡ f ˜ X ( x ; t 2 − t 1 , … , t N − t 1 )$, so the corresponding statistical moments have dimension $( N − 1 )$. □
In optimization problems, it is useful to consider the gradient of $Y ¯$, i.e.,
$∇ Y ¯ = ∂ Y ¯ ∂ τ 1 , ∂ Y ¯ ∂ τ 2 , … , ∂ Y ¯ ∂ τ N − 1 = ∂ Y ¯ ∂ T ∂ T ∂ τ 1 , ∂ T ∂ τ 2 , … , ∂ T ∂ τ N − 1$
where $T$ is an application-dependent measure of the vector $\tau$ such as the norms, $T = \|\tau\|_1 = |\tau_1| + \cdots + |\tau_{N-1}|$, or $T = \|\tau\|_\infty = \max(|\tau_1|, \ldots, |\tau_{N-1}|)$.
In general, assuming the Taylor’s series defined in Theorem 1 on p. 8, a multivariate function of a random vector $x$ can be expanded about the mean $x ¯ = E x$ as,
$f ( x ¯ + h ) ≈ f ( x ¯ ) + ∑ l = 1 m ∑ | n | 1 = l ∂ n f ( x ) n ! h n .$
Thus, the value of $f ( x ¯ + h )$ is a weighted sum of $h n$ plus an offset $f ( x ¯ )$. More importantly, if the partial derivatives $∂ n f ( x )$ are replaced with the coefficients $( l ! p l )$ which are independent of $n$, the number of parameters in (15) is greatly reduced. Moreover, instead of precisely determining the values of $p l ∈ R$ to obtain the best possible approximation of the original function f, it is useful as well as sufficient to constrain the Taylor expansion (15) to the class of functions that are exactly constructed as,
$f ( x ¯ + h ) = f ( x ¯ ) + ∑ l = 1 m p l ∑ | n | 1 = l l ! n ! h n = ∑ l = 0 m p l ( h 1 + ⋯ + h N ) l$
where $p_0 = f(\bar x)$. The function expansion (16) represents an m-th degree polynomial in the variable $|h|_1$. The coefficients $p_l$ of this polynomial can be set using Lemma 6, so that the polynomial is convex.
The key realization is that the polynomial functions (16) have all the desired properties specified at the beginning of this section.
Claim 1.
A multivariate function $f : R N ↦ R$ having the desirable properties to measure the statistics of a random vector $x$ is the m-th degree polynomial,
$Y = f(x) = \sum_{l=0}^{m} p_l \Big( \sum_{i=1}^{N} (X_i - E X_i) \Big)^l$
where $p_0 = f(\bar x)$, and the values of m and of the coefficients $p_l \in R$ are determined by the application requirements.
The polynomial function (17) can be used for any number of observations N. It is symmetric, so all observations are treated equally. Moreover, it is amenable to integration and to applying the expectation operator. The convexity can be achieved by Lemma 6.
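As a minimal sketch (assuming NumPy; the helper name `poly_statistic` and the weight values are illustrative, not from the paper), the statistic (17) can be evaluated and its permutation symmetry checked as follows:

```python
import numpy as np

def poly_statistic(x, mean, p):
    """Evaluate (17): Y = sum_l p_l * (sum_i (x_i - mean_i))^l, with p = [p_0, ..., p_m]."""
    h = float(np.sum(np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)))
    return sum(p_l * h ** l for l, p_l in enumerate(p))

x = np.array([0.3, -1.2, 0.7, 0.4])
mean = np.zeros(4)                 # assume E X_i = 0
p = [1.0, 0.0, 0.5, 0.1]           # illustrative weights p_0, ..., p_3

y = poly_statistic(x, mean, p)
y_perm = poly_statistic(x[::-1], mean, p)   # any permutation yields the same Y
assert np.isclose(y, y_perm)
```

Since Y depends on the observations only through their centered sum, the mean of Y becomes a weighted sum of the central sum-moments defined below.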

#### 4.1. Related Concepts

Assume the scalar function f defined in (17) for a K-th order stationary random process $x(t)$. Define the auxiliary random variable,
$Z ( a ) = 1 a ∑ i = 1 N ( X i − X ¯ i )$
where a is a normalization constant, $a \neq 0$. The expression (17) can then be rewritten as, $Y(a) = \sum_{l=0}^m p_l Z^l(a)$. For $a = 1$, $Z(1)$ has a zero mean, and the variance, $\operatorname{var} Z(1) = \sum_{i,j} E (X_i - \bar X_i)(X_j - \bar X_j)$. For $a = N$, $Z(N)$ represents a sample mean, and its variance is equal to, $\operatorname{var} Z(N) = \operatorname{var} Z(1)/N^2$. For $a = \sqrt{N}$, the variance of $Z(\sqrt{N})$ is normalized by the number of dimensions, i.e., $\operatorname{var} Z(\sqrt{N}) = \operatorname{var} Z(1)/N$.
For $p l = s l / l !$, in the limit of large m, Equation (17) gives,
$Y = lim m → ∞ ∑ l = 0 m s l l ! ( Z ( a ) ) l = e s Z ( a ) .$
The mean, $E Y = E e s Z ( a )$, is the moment generating function of the random variable $Z ( a )$.
In data processing, the sample mean is intended to be an estimate of the true population mean, i.e., $N ≫ 1$ is required. Here, $Z ( a )$ is calculated over a finite number of vector or process dimensions, so it is a random variable for $∀ a ∈ R \ { 0 }$. The variable $Z ( N )$ should be then referred to as an arithmetic average or a center of gravity of the random vector $X$ in the Euclidean space $R N$, i.e.,
$Z ( N ) ≜ X ¯ = 1 N ∑ i = 1 N X i ∈ R .$
Note that (18) is not an $l 1$-norm, since the variables $X i$ are not summed with absolute values.
If the random variables $X i$ are independent, the distribution of $Z ( a )$ is given by convolution of their marginal distributions. For correlated observations, if the characteristic function, $f ˜ ( s / a ) = E e j · | X | 1 s / a$, of the sum $| X | 1$ can be obtained, the distribution of $Z ( a ) = | X | 1 / a$ is computed as the inverse transform,
$f Z ( Z ( a ) ) = 1 2 π ∫ − ∞ ∞ e − j s Z ( a ) f ˜ s a d s .$
Many other properties involving the sums of random variables can be obtained. For instance, if the random variables $X i$ are independent and have zero mean, then,
$E Z m ( 1 ) − E ( Z ( 1 ) − X N ) m = E X N m m = 2 , 3 E X N 4 + 6 ∑ i = 1 N − 1 E X N 2 E X i 2 m = 4 E X N 5 + 10 ∑ i = 1 N − 1 E X N 2 E X i 3 + E X N 3 E X i 2 m = 5 .$
Considering Claim 1, an important statistic for a random vector can be defined as the m-th central sum-moment.
Definition 4.
The m-th central sum-moment of random vector $X ∈ R N$ is computed as,
$\textstyle\sum \mu_m(X) = \mu_m(|X|_1) = E \Big( \sum_{i=1}^{N} (X_i - \bar X_i) \Big)^m, \quad m = 1, 2, \ldots$
Lemma 11.
The second central sum-moment of random vector is equal to the sum of all elements of its covariance matrix, i.e.,
$\textstyle\sum \mu_2(X) = \mu_2(|X - \bar X|_1) = E \Big( \sum_{i=1}^{N} (X_i - \bar X_i) \Big)^2 = \sum_{i,j=1}^{N} \operatorname{cov}(X_i, X_j) = |C_x|_1 .$
Furthermore, the second central sum-moment is also equal to the variance of $|X|_1$, i.e.,
$\textstyle\sum \mu_2(X) = \operatorname{var} |X|_1 = \operatorname{var} \sum_{i=1}^{N} (X_i - \bar X_i) .$
Proof.
The first equality is shown by expanding the expectation, and substituting for elements of the covariance matrix, $[ C x ] i , j = cov X i , X j$. The second expression follows by noting that $∑ i = 1 N ( X i − X ¯ i )$ has zero mean. □
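Lemma 11 can be verified numerically. The sketch below is illustrative (assuming NumPy; the covariance model and parameter values are arbitrary choices): it draws correlated Gaussian samples with a 1st order Markov covariance and compares the Monte Carlo estimate of the second central sum-moment with the sum of the covariance entries.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, alpha = 5, 400000, 0.5

# 1st order Markov (exponential) covariance, C_ij = exp(-alpha |i - j|)
idx = np.arange(N)
C = np.exp(-alpha * np.abs(idx[:, None] - idx[None, :]))
X = np.linalg.cholesky(C) @ rng.standard_normal((N, M))   # zero-mean samples, cov C

S = X.sum(axis=0)                      # |X|_1 (no absolute values taken)
sum_moment2 = np.mean(S ** 2)          # Monte Carlo second central sum-moment
assert np.isclose(sum_moment2, C.sum(), rtol=0.02)   # sum of all covariance entries
assert np.isclose(sum_moment2, S.var(), rtol=0.02)   # variance of the sum
```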
In the literature, there are other measures involving sums of random variables. For instance, in Mean-Field Theory, the model dimensionality is reduced by representing N-dimensional vectors by their center of gravity [20]. The central point of a vector is also used in the first order approximation of multivariate functions in [21] and in the model overall variance in [22].
In Measure Theory [23], the total variation (TV) of a real-valued univariate function, $x : ( t 0 , t N ) ↦ R$, is defined as the supremum over all possible partitions $P : t 0 < t 1 < ⋯ < t N$ of the interval $( t 0 , t N )$, i.e.,
$\mathrm{TV}(x) = \sup_P \sum_{i=0}^{N-1} | x(t_{i+1}) - x(t_i) | .$
The TV concept can be adopted for observations $X i = x ( t i )$ of a stationary random process $x ( t )$ at discrete time instances, ${ t i } 0 : N$. A mathematically tractable mean TV measure can be defined as,
$TV ¯ 2 ( x ) = sup P E ∑ i = 0 N − 1 | X i + 1 − X i | 2 = sup P 2 N ( E X i 2 − cov X i + 1 , X i ) .$
Jensen’s inequality for a random vector assuming equal weights can be stated as [17],
$E \Big| \frac{1}{N} \sum_{i=1}^{N} (X_i - \bar X_i) \Big|^m \leq \frac{1}{N} \sum_{i=1}^{N} E |X_i - \bar X_i|^m .$
Alternatively, exchanging the expectation and summation, Jensen’s inequality becomes,
$∑ i = 1 N | E X i − X ¯ i | m ≤ E ∑ i = 1 N | X i − X ¯ i | m .$
Furthermore, if the right-hand side of (19) is to be minimized and $m = 2$, the inequality in (19) changes to equality. In particular, consider the minimum mean square error (MMSE) estimation of a vector of random parameters $P = { P i } 1 : N$ from measurements $X$. Denoting $P ¯ i ( X ) = E P i | X$, conditioned on $X$, the MMSE estimator $P ^ ( X )$ minimizes [15],
$\min_{\hat P|X} E \sum_{i=1}^{N} (\hat P_i(X) - P_i)^2 = \min_{\hat P|X} E \sum_{i=1}^{N} \big[ (\hat P_i(X) - \bar P_i(X)) - (P_i - \bar P_i(X)) \big]^2 = \min_{\hat P|X} \Big[ \sum_{i=1}^{N} (\hat P_i(X) - \bar P_i(X))^2 + E \sum_{i=1}^{N} (P_i - \bar P_i(X))^2 \Big]$
since the cross-terms have zero conditional mean; the minimum is attained for $\hat P_i(X) = \bar P_i(X) = E[P_i | X]$,
where the expectations are over the conditional distribution $f P | X$.
In signal processing, a length N moving average (MA) filter transforms the input sequence $X i$ into an output sequence $Y i$ by discrete-time convolution ⊛, i.e.,
$Y_i = \sum_{j=0}^{N-1} X_{i-j} = [\, \underbrace{1 \ 1 \ \ldots \ 1}_{N} \,] ⊛ X_i .$
The (auto-) correlations of the input and output sequences are related by Lemma 3 on p. 6, i.e.,
$C_Y(i) = \mathbb{1}_N ⊛ \mathbb{1}_N ⊛ C_x(i) = \sum_{j=-N+1}^{N-1} (N - |j|)\, C_x(i - j)$ where $\mathbb{1}_N$ denotes the length N all-ones sequence.
Note that if the input process is stationary, then the input and output processes are jointly stationary.
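The convolution identity for the output (auto-) correlation can be checked numerically. In the following illustrative sketch (assuming NumPy and the unnormalized all-ones MA filter; parameter values are arbitrary), the triangle weights $(N-|j|)$ arise as $\mathbb{1}_N ⊛ \mathbb{1}_N$:

```python
import numpy as np

N, alpha, K = 4, 0.5, 30
k = np.arange(-K, K + 1)
C_x = np.exp(-alpha * np.abs(k))      # 1st order Markov autocovariance, sigma^2 = 1

# C_Y(k) = sum_j (N - |j|) C_x(k - j): convolve twice with the all-ones filter
tri = np.convolve(np.ones(N), np.ones(N))   # triangle weights (N - |j|)
C_y = np.convolve(C_x, tri, mode="same")

# Direct check of the output variance, var(Y) = sum_{a,b} C_x(a - b)
var_y = sum(np.exp(-alpha * abs(a - b)) for a in range(N) for b in range(N))
assert np.isclose(C_y[K], var_y)
```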

#### 4.2. Multiple Random Processes

The major complication with observing, evaluating, and processing multiple random processes is how to achieve their time alignment and amplitude normalization (scaling). Focusing here on the time alignment problem only, denote the discrete time observation instances of L random processes as,
$t l = { t l 1 < t l 2 < … < t l N l } , l = 1 , 2 , … , L .$
Assume that the first time instance $t l 1$ of every process serves as a reference. Then, there are $( L − 1 )$ uncertainties in time alignment of L processes, i.e.,
$Δ l = ( t l 1 − t 11 ) ∈ R , l = 2 , 3 , … , L .$
The $( L − 1 )$ values $Δ l$ are unknown parameters which must be estimated. Note also that the difference,
$Δ l − Δ k = t l 1 − t k 1$
represents an unknown time shift between the process $x l ( t )$ and $x k ( t )$.
For any multivariate stationary distribution of observations of two random processes, the corresponding cross-correlation normally attains a maximum for some time shift between these processes [12]. Hence, a usual strategy for aligning the observed sequences is to locate the maximum value of their cross-correlation. The time shifts $Δ l$, $l = 1 , 2 , … , L$, are then estimated as,
$Δ ^ l = argmax Δ C x 1 x l ( Δ ) , Δ ∈ { ( t l i − t 11 ) } i = 1 , … , N l .$
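A toy illustration of this alignment strategy follows (assuming NumPy; the AR(1) coefficient and the shift are arbitrary choices, not values from the paper). The shift between two copies of a 1st order Markov sequence is located as the argmax of their empirical cross-correlation:

```python
import numpy as np

rng = np.random.default_rng(2)
M, shift = 5000, 7

# A 1st order Markov (AR(1)) sequence; x2 is an exact copy of x1 shifted by 7.
e = rng.standard_normal(M + shift)
x = np.empty(M + shift)
x[0] = e[0]
for i in range(1, M + shift):
    x[i] = 0.7 * x[i - 1] + e[i]
x1 = x[:M]
x2 = x[shift:shift + M]               # x2[i] = x1[i + shift]

# Estimate the shift as the argmax of the empirical cross-correlation
lags = np.arange(-20, 21)
xcorr = [np.dot(x1[20:M - 20], x2[20 - d:M - 20 - d]) for d in lags]
shift_hat = int(lags[np.argmax(xcorr)])
assert shift_hat == shift
```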
Assuming the center values $X ¯ 1$ and $X ¯ 2$ in (18) as scalar representations of the vectors $X 1$ and $X 2$, their cross-covariance can be computed as,
$cov | X 1 | 1 , | X 2 | 1 = N 2 cov X ¯ 1 , X ¯ 2 = N 2 E ( X ¯ 1 − E X 1 ) ( X ¯ 2 − E X 2 ) .$
The task now is how to generalize the pairwise cross-covariance (22) to the case of multiple random vectors having possibly different lengths. If all random vectors of interest are concatenated into one single vector, the m-th joint central sum-moment can be defined by utilizing Claim 1.
Definition 5.
The m-th central sum-moment for L random processes with $N l$ observations, $l = 1 , 2 , … , L$, is computed as,
$\textstyle\sum \mu_m(X_1, \ldots, X_L) = \mu_m(|X_1|_1 + \ldots + |X_L|_1) = E \Big( \sum_{l=1}^{L} \sum_{i=1}^{N_l} (X_{li} - \bar X_{li}) \Big)^m .$
Lemma 12.
The second central sum-moment for L random processes with $N l$ observations, $l = 1 , 2 , … , L$, is equal to the sum of all pairwise covariances, i.e.,
$∑ μ 2 ( X 1 , … , X L ) = ∑ l , k = 1 L ∑ i , j = 1 N l cov X l i , X k j = ∑ l , k = 1 L | cov X l , X k | 1 = var | X 1 | 1 + … + | X L | 1 .$
Proof.
The expression can be obtained by expanding the sum and then applying the expectation. □
Many other properties of central and noncentral sum-moments can be obtained. For example, assuming two equal-length vectors $X 1$ and $X 2$, it is straightforward to show that,
$E\big( |X_1|_1 + |X_2|_1 \big)^2 - E\big( |X_1|_1^2 + |X_2|_1^2 \big) = 2 \sum_{i,j=1}^{N} E\, X_{1i} X_{2j}$
$E\big( |X_1|_1 + |X_2|_1 \big)^2 - E\, \|X_1 - X_2\|_2^2 = \sum_{\substack{i,j=1 \\ i \neq j}}^{N} E\big( X_{1i} X_{1j} + X_{2i} X_{2j} \big) + 2 \sum_{i,j=1}^{N} E\, X_{1i} X_{2j} + 2 \sum_{i=1}^{N} E\, X_{1i} X_{2i} .$

## 5. Illustrative Examples

This section provides examples to quantitatively evaluate the results developed in the previous sections. In particular, the accuracy of approximate linear LS regression proposed in Section 3.1 is assessed to justify its lower computational complexity. The central sum-moments introduced in Section 4 are compared assuming correlated Gaussian processes. Finally, several signal processing problems involving the 1st order Markov process are investigated.

#### 5.1. Linear Regression

Consider a classical one-dimensional linear LS regression problem with independent and identically normally distributed errors. The errors are also assumed to be independent of all other data model parameters. The data points are generated as,
$X_i = \Delta\, i, \quad Y_i = P_2 X_i + P_1 + E_i, \quad i = 1, 2, \ldots, N$
where $E_i$ are zero-mean, uncorrelated Gaussian samples having the equal variance $\sigma_e^2$, and $P_1$ and $P_2$ are unknown parameters to be estimated. This LS problem can be solved exactly using the expression (2), and substituting $w_{1i} = 1$ and $w_{2i} = \Delta i$, $\forall i = 1, 2, \ldots, N$. Alternatively, to avoid inverting the $N \times N$ data matrix, the procedure devised in Section 3.1 suggests splitting the data into two equal-size subsets, computing the average data point for each subset, and then solving the corresponding set of two equations with two unknowns. Specifically, the set of two equations with the unknown parameters $\hat P_1$ and $\hat P_2$ is,
$\bar Y_1 \equiv \frac{2}{N} \sum_{i=1}^{N/2} Y_i = \hat P_1 + \hat P_2 \frac{\Delta}{4}(2 + N), \qquad \bar Y_2 \equiv \frac{2}{N} \sum_{i=N/2+1}^{N} Y_i = \hat P_1 + \hat P_2 \frac{\Delta}{4}(2 + 3N)$
assuming N is even, and using, $\sum_{i=1}^{N/2} \Delta i = \frac{\Delta}{8} N(2+N)$, and $\sum_{i=N/2+1}^{N} \Delta i = \frac{\Delta}{8} N(2+3N)$, so that the average abscissas of the two subsets are $\frac{\Delta}{4}(2+N)$ and $\frac{\Delta}{4}(2+3N)$, respectively. The closed-form solution of (23) is,
$\hat P_1 = \frac{1}{2N} \big[ (2 + 3N) \bar Y_1 - (2 + N) \bar Y_2 \big], \qquad \hat P_2 = \frac{2}{\Delta N} (\bar Y_2 - \bar Y_1) .$
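The split-and-average estimator can be compared against the exact LS fit in a short sketch (illustrative, assuming NumPy; here the slope estimate is written as the difference of the half-sample means of $Y$ divided by the difference $\Delta N/2$ of the half-sample mean abscissas):

```python
import numpy as np

rng = np.random.default_rng(3)
N, Delta = 40, 1.0
P1, P2, sigma_e = 1.5, 0.3, 1.0

X = Delta * np.arange(1, N + 1)
Y = P2 * X + P1 + sigma_e * rng.standard_normal(N)

# Average each half of the (ordered) data, then solve two equations in two unknowns.
Y1, Y2 = Y[:N // 2].mean(), Y[N // 2:].mean()
P1_hat = ((2 + 3 * N) * Y1 - (2 + N) * Y2) / (2 * N)
P2_hat = 2 * (Y2 - Y1) / (Delta * N)

# Exact LS solution for comparison (np.polyfit solves the normal equations).
P2_ls, P1_ls = np.polyfit(X, Y, 1)
assert abs(P2_hat - P2_ls) < 0.05 and abs(P1_hat - P1_ls) < 1.0
```

The two estimates agree closely while the approximate solution avoids forming and inverting the full data matrix.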
As a numerical example, assume the true values $P_1 = 1.5$, $P_2 = 0.3$, $E\, E_i = 0$, $\operatorname{var} E_i = 1$, and $N = 40$ data points. Figure 2 shows the intervals $\big( \bar T - \sqrt{\operatorname{var} T}, \; \bar T + \sqrt{\operatorname{var} T} \big)$ versus the subset size $1 \leq N_1 \leq N/2$ for the random variable T defined as,
$T = 100 S apr ( N ) − S opt ( N ) S opt ( N )$
where $S apr ( N ) = ∑ i = 1 N ( Y i − P ^ 1 − P ^ 2 X i ) 2$ and $S opt ( N ) = ∑ i = 1 N ( Y i − P 1 − P 2 X i ) 2$ are the total MSEs. In the limit, $lim N → ∞ ( S apr ( N ) − S opt ( N ) ) = 0$, since a sufficiently large subset of data is as good as the complete set of data. For finite N, it is likely that $S apr ( N ) > S opt ( N )$, so the lower bounds in Figure 2 converge much faster to zero than the upper bounds.

#### 5.2. Comparison of Central Moments

Assuming Lemma 11 and Equation (3), the second central sum-moment of the 2nd order Markov process of length N is,
$∑ μ 2 ( X ) = ∑ i , j = 1 N C 2 MP ( i − j ) = σ 2 N + 2 ∑ i = 1 N − 1 i ( 1 + ( N − i ) α ) e − α ( N − i ) .$
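The closed form can be cross-checked against the direct double sum over the 2nd order Markov covariance (3); a small illustrative sketch (assuming NumPy; the helper name is not from the paper):

```python
import numpy as np

def sum_moment2_2mp(N, alpha, sigma2=1.0):
    """Closed form: sigma^2 (N + 2 sum_i i (1 + (N - i) alpha) exp(-alpha (N - i)))."""
    i = np.arange(1, N)
    return sigma2 * (N + 2.0 * np.sum(i * (1 + (N - i) * alpha) * np.exp(-alpha * (N - i))))

# Direct evaluation: sum of all entries of the 2nd order Markov covariance (3)
N, alpha = 10, 0.5
k = np.arange(N)
d = np.abs(k[:, None] - k[None, :])
C = (1 + alpha * d) * np.exp(-alpha * d)
assert np.isclose(sum_moment2_2mp(N, alpha), C.sum())
```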
These moments are compared in Figure 3 for different values of the sequence length N, three values of the parameter $α$, and assuming $σ 2 = 1$. It can be observed that the values of the second central sum-moment are increasing with N and the level of correlation, $e − α$.
Consider now the following three central moments of order $m = 1, 2, \ldots$, i.e.,
$\bar S_{\mathrm{mnk}}(N) = E \Big( \sum_{i=1}^{N} |\sqrt{N}\, X_i|^m \Big)^{1/m}, \qquad \textstyle\sum \mu_m(N) = E \Big| \sum_{i=1}^{N} X_i \Big|^m, \qquad \sum \tilde\mu_m(N) = E \Big( \sum_{i=1}^{N} |X_i| \Big)^m .$
The moment, $\bar S_{\mathrm{mnk}}$, is the mean Minkowski distance; the scaling by $\sqrt{N}$ is introduced to facilitate the comparison with the other two moments, i.e., the mean sum-moment $\sum \mu_m$, and the mean sum-moment $\sum \tilde\mu_m$ having the samples summed as the $l_1$ norm. More importantly, assuming a correlated Gaussian process $X_i = x(t_i)$, the central sum-moment can be readily obtained in a closed form, whereas the closed-form expressions for the other two metrics may be mathematically intractable. In particular, by factoring the covariance matrix as, $C_x = T_x T_x^T$, the correlated Gaussian vector can be expressed as, $X = T U$. Then the sum of elements, $|X|_1 = \mathbb{1}^T T U$, and the m-th central sum-moment can be computed as,
$\textstyle\sum \mu_m(N) = E |\mathbb{1}^T T U|^m = \|\mathbb{1}^T T\|_2^m \, E|U|^m = \|\mathbb{1}^T T\|_2^m \, \frac{2^{m/2}}{\sqrt{\pi}} \Gamma\big( \frac{m+1}{2} \big)$
where U is a zero-mean, unit-variance Gaussian random variable, and $Γ$ denotes the gamma function.
Figure 4 shows all three moments as a function of sequence length N for three values of parameter $α$ assuming the 1st order Markov process. The vertical axis in Figure 4 is scaled by $1 / N$ for convenience. Note that, for uncorrelated, i.e., independent Gaussian samples, the moments $∑ μ 2$ and $∑ μ ˜ 2$ are identical. More importantly, all three moments are strictly increasing with the number of samples N and with the moment order m.
Finally, the second central sum-moment can be visualized in two dimensions. Consider $N = 3$ observations of a zero-mean real-valued stationary random process, $X i = x ( t i )$, $i = 1 , 2 , 3$. Let $∑ μ 2 = E ( X 1 + X 2 + X 3 ) 2 = ∑ i , j = 1 3 E X i X j$. Assuming the 1st order Markov correlation model, $E X i X j ∝ e − 0.5 | τ |$, Figure 5 shows the values of the central sum-moment $∑ μ 2$ versus the sample distances $τ 1 = ( t 2 − t 1 )$ and $τ 2 = ( t 3 − t 1 )$. Several symmetries can be observed in Figure 5. In particular, the central sum-moment $∑ μ 2$ is symmetric about the axis $τ 1 = τ 2$ as well as about the axis $τ 1 = − τ 2$. These symmetries are consequences of the following equalities:
$∑ μ 2 ( τ 1 , τ 2 ) = ∑ μ 2 ( τ 2 , τ 1 ) ∑ μ 2 ( τ 1 , τ 2 ) = ∑ μ 2 ( − τ 1 , − τ 2 ) ∀ τ 1 , τ 2 .$

#### 5.3. Signal Processing Problems for the 1st Order Markov Process

Consider the 1st order Markov process observed at the output of a length N MA filter. According to (21), the (auto-) covariance of the output process is,
$C 1 MP + MA ( k ; α ) = ∑ j = − N + 1 N − 1 ( N − | j | ) σ 2 e − α | k − j | .$
Conjecture 1.
The MA filtering of the 1st order Markov process generates nearly a 2nd order Markov process.
The parameter of the 2nd order Markov process approximating the combined (auto-) covariance (24) can be obtained using the LS regression fit, i.e.,
$α ^ = argmin α ˜ ∑ k ( C 1 MP + MA ( k ; α ) − C 2 MP ( k ; α ˜ ) ) 2 .$
Substituting (3) and (24) into (25), and setting the first derivative equal to zero, the LS estimate $\hat\alpha$ must satisfy the linear equation,
$\sum_k \Big( \hat\alpha |k| + 1 + W_{-1}\big( {-C_{\mathrm{1MP+MA}}(k;\alpha)}/{e} \big) \Big) = 0$
which can be readily solved for $\hat\alpha$, where $W_{-1}$ denotes the lower branch of the Lambert function [24].
A discrete time sequence of N elements has the (auto-) covariance constrained to $( 2 N − 1 )$ time indexes as indicated in (24). Assuming the length N MA filter, and that there are $( n x N )$ samples, $n x = 1 , 2 , …$, of the 1st order Markov process available, the (auto-) covariance (24) has the overall length $2 N ( n x + 1 ) − 3$ samples. Figure 6 compares the MSE,
$MSE ( n x ) = 100 × ∑ k ( C 1 MP + MA ( k ; α ) − C 2 MP ( k ; α ^ ) ) 2 ∑ k C 1 MP + MA 2 ( k ; α )$
of the LS fit of the (auto-) covariance of the 2nd order Markov process to the combined (auto-) covariance of the 1st order Markov process and the impulse response of the MA filter assuming three values of $α$ and two values of $n x$. Given $α$ and $n x$, Figure 6 shows that the best LS fit occurs for a certain value of the MA filter length N. It can be concluded that, in general, the 1st order Markov process changes to the 2nd order Markov process at the output of the MA filter.
The second problem to investigate is a linear MMSE (LMMSE) prediction of the 1st order Markov process observed at the output of a MA filter. In particular, given N samples $X i = x ( t i )$, $i = 1 , 2 , … , N$, of random process having the (auto-) covariance (24), the task is to predict its future value, $X N + 1 = x ( t N + 1 )$, $t N + 1 > t N$.
In general, the impulse response $h$ of the LMMSE filter to estimate an unknown scalar parameter P from the measurements $X$ is computed as [15],
$h = \big( E (X - \bar X)(X - \bar X)^T \big)^{-1} E \big[ (P - \bar P)(X - \bar X) \big] .$
Here, the unknown parameter $P = X N + 1$, and $E X N + 1 X i = C 1 MP + MA ( N + 1 − i )$ and $E X i X j = C 1 MP + MA ( i − j )$ in (26) which gives the length N LMMSE filter,
$h = [\, \underbrace{0 \ 0 \ \ldots \ 0}_{N-1} \ \ C_{\mathrm{1MP+MA}}(1)/\sigma_X^2 \,]^T .$
Consequently, the predicted value is $\hat X_{N+1} = X_N\, C_{\mathrm{1MP+MA}}(1)/\sigma_X^2$. Note that the same procedure, but excluding the MA filter, gives the LMMSE estimate, $\hat X_{N+1} = X_N\, C_{\mathrm{1MP}}(1)/\sigma_X^2$.
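For the plain 1st order Markov process, the fact that only the most recent sample carries weight can be confirmed by solving the LMMSE normal equations directly (an illustrative NumPy sketch under arbitrary parameter values, not the paper's code):

```python
import numpy as np

N, alpha, sigma2 = 6, 0.4, 1.0
k = np.arange(N + 1)
C1mp = sigma2 * np.exp(-alpha * k)        # 1st order Markov autocovariance

# LMMSE filter h solves C_X h = c, where c_i = E X_{N+1} X_i = C(N + 1 - i)
C_X = sigma2 * np.exp(-alpha * np.abs(k[:N, None] - k[None, :N]))
c = C1mp[N - np.arange(N)]                # lags N, N-1, ..., 1
h = np.linalg.solve(C_X, c)

# Only the most recent sample matters: h = [0, ..., 0, C(1)/C(0)]
rho = C1mp[1] / C1mp[0]                   # = exp(-alpha)
assert np.allclose(h, np.r_[np.zeros(N - 1), rho], atol=1e-9)
```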
The last problem to consider is a time alignment of two zero-mean, jointly stationary processes. It is assumed that the normalized cross-covariance of these two processes is,
$E X 1 i X 2 j E X 1 i 2 E X 2 i 2 = e − α | i − j | ( 1 + α | i − j | ) .$
Denote the uncertainty in determining the difference, $( i − j )$, as $Δ$. In order to estimate the unknown parameters $α$ and $Δ$, the left-hand side of (27) can be estimated by the method of moments, i.e., let $E X i 2 ≈ 1 N ∑ i = 1 N X i 2$, and, $E X 1 i X 2 j ≈ 1 N ∑ i = 1 N X 1 i X 2 | i − Δ |$. The cross-covariance (27) can be then rewritten as,
$v k = ∑ i = 1 N X 1 i X 2 | i − Δ − k | ∑ i = 1 N X 1 i 2 ∑ i = 1 N X 2 i 2 = e − α | Δ + k | ( 1 + α | Δ + k | ) , k = 0 , 1 , 2 , …$
Utilizing the Lambert function $W − 1$, the cross-covariance can be rewritten further as,
$α | Δ + k | = − 1 − W − 1 − v k e ≜ v ˜ k , k = 0 , 1 , 2 , …$
Assuming, without loss of generality, that $Δ ≥ 0$, the absolute value in (28) can be ignored. Consequently, the unknown parameters $α$ and $Δ$ can be obtained as a linear LS fit to N measured and calculated values $v ˜ k$ in the linear model (28).
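The Lambert-function linearization can be demonstrated on noiseless model values (an illustrative sketch assuming NumPy and SciPy's `lambertw`; the true parameter values are arbitrary choices):

```python
import numpy as np
from scipy.special import lambertw

alpha_true, delta_true = 0.3, 2.5
n = np.arange(8)                                    # lags k = 0, 1, ..., 7
v = np.exp(-alpha_true * (delta_true + n)) * (1 + alpha_true * (delta_true + n))

# Linearize with the W_{-1} branch: v_tilde_k = -1 - W_{-1}(-v_k / e) = alpha (Delta + k)
v_tilde = -1.0 - lambertw(-v / np.e, k=-1).real

# Linear LS fit in (28): slope = alpha, intercept = alpha * Delta
slope, intercept = np.polyfit(n, v_tilde, 1)
alpha_hat, delta_hat = slope, intercept / slope
assert np.isclose(alpha_hat, alpha_true) and np.isclose(delta_hat, delta_true)
```

With noisy measured values $v_k$, the same linear fit yields the LS estimates of $\alpha$ and $\Delta$.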

## 6. Conclusions

The development of a novel statistical measure to enable correlation analysis for multiple random vectors commenced by summarizing background knowledge on the statistical description of discrete time random processes. This was then extended with the derivation of several supporting results which were used in the subsequent sections. Specifically, it was shown that linear regression can be effectively approximated by splitting the data into disjoint subsets and assuming only one average data point within each subset. In addition, a procedure for generating multiple Gaussian processes with the prescribed autocovariance and cross-covariance was devised. The main result of the paper was obtained by assuming the Taylor expansion of multivariate symmetric scalar functions, and then approximating this expansion by a univariate polynomial. The single polynomial variable is a simple sum of the variables in the original multivariate function. The polynomial approximation represents a mapping from multiple discrete time observations of a random process to a scalar random variable. The mean value of this variable is a weighted sum of canonical central moments with increasing orders. These moments were named central sum-moments to reflect how they are defined. The sum-moments were then discussed in light of other similar concepts such as total variation, Mean Field Theory, and moving average sequence filtering. Illustrative examples were studied in the last section of the paper. In particular, the accuracy of the approximate linear regression was evaluated quantitatively assuming two disjoint data subsets. Assuming the 1st and the 2nd order Markov processes, the central sum-moments were compared with the mean Minkowski distance. For Gaussian processes, the central sum-moments can be obtained in closed form. The remaining problems investigated moving average filtering of the 1st order Markov process and its prediction using a linear MMSE filter.

## Funding

This research received no external funding.


## Conflicts of Interest

The author declares no conflict of interest.

## Abbreviations

The following abbreviations are used in the paper:
1MP — 1st order Markov process
2MP — 2nd order Markov process
AR — autoregressive
LMMSE — linear minimum mean square error
LS — least squares
MMSE — minimum mean square error
MSE — mean square error
MA — moving average
TV — total variation
$|\cdot|$ — absolute value, set cardinality, sum of vector elements
$E\,\cdot$ — expectation
$\operatorname{corr}\cdot$ — correlation
$\operatorname{cov}\cdot$ — covariance
$\operatorname{var}\cdot$ — variance
$(\cdot)^{-1}$ — matrix inverse
$(\cdot)^{T}$ — matrix/vector transpose
$f_X$ — distribution of a random variable X
$f$, $\dot f$, $\ddot f$ — function f, and its first and second derivatives
$\mathbb{N}_+$ — positive non-zero integers
$\mathbb{R}$, $\mathbb{R}_+$ — real numbers, positive real numbers
$\bar X$ — mean value of random variable X
$X_{ij}$ — j-th sample of process i
$W_{-1}$ — Lambert function

## References

1. Drezner, Z. Multirelation—A correlation among more than two variables. Comput. Stat. Data Anal. 1995, 19, 283–292.
2. Dear, R.; Drezner, Z. On the significance level of the multirelation coefficient. J. Appl. Math. Decis. Sci. 1997, 1, 119–130.
3. Geiß, S.; Einax, J. Multivariate correlation analysis—A method for the analysis of multidimensional time series in environmental studies. Chemom. Intell. Lab. Syst. 1996, 32, 57–65.
4. Abdi, H. Multiple correlation coefficient. In Encyclopedia of Measurements and Statistics; SAGE: Los Angeles, CA, USA, 2007; pp. 648–655.
5. Székely, G.J.; Rizzo, M.L.; Bakirov, N.K. Measuring and testing dependence by correlation of distances. Ann. Stat. 2007, 35, 2769–2794.
6. Merigó, J.M.; Casanovas, M. A New Minkowski Distance Based on Induced Aggregation Operators. Int. J. Comput. Intell. Syst. 2011, 4, 123–133.
7. Nguyen, H.V.; Müller, E.; Vreeken, J.; Efros, P.; Böhm, K. Multivariate maximal correlation analysis. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. II-775–II-783.
8. Josse, J. Measuring multivariate association and beyond. Stat. Surv. 2016, 10, 132–167.
9. Shu, X.; Zhang, Q.; Shi, J.; Qi, Y. A comparative study on weighted central moment and its application in 2D shape retrieval. Information 2016, 7, 10.
10. Böttcher, B.; Keller-Ressel, M.; Schilling, R.L. Distance multivariance: New dependence measures for random vectors. Ann. Stat. 2019, 47, 2757–2789.
11. Wang, J.; Zheng, N. Measures of correlation for multiple variables. arXiv 2020, arXiv:1401.4827.
12. Gardner, W.A. Introduction to Random Processes with Applications, 2nd ed.; McGraw-Hill: New York, NY, USA, 1990.
13. Papoulis, A.; Pillai, S.U. Probability, Random Variables, and Stochastic Processes, 4th ed.; McGraw-Hill: New York, NY, USA, 2002.
14. Giller, G.L. The Statistical Properties of Random Bitstreams and the Sampling Distribution of Cosine Similarity. SSRN Preprint 2012.
15. Kay, S.M. Fundamentals of Statistical Signal Processing: Estimation Theory; Prentice Hall: Upper Saddle River, NJ, USA, 1993; Volume I.
16. Oppenheim, A.V.; Schafer, R.W.; Buck, J.R. Discrete-Time Signal Processing, 3rd ed.; Prentice Hall: Upper Saddle River, NJ, USA, 2009.
17. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
18. Folland, G.B. Higher-Order Derivatives and Taylor’s Formula in Several Variables. Available online: https://sites.math.washington.edu/~folland/Math425/taylor2.pdf (accessed on 1 December 2020).
19. Apostol, T. Mathematical Analysis, 2nd ed.; Pearson: London, UK, 1973.
20. Yedidia, J.S. An idiosyncratic journey beyond mean field theory. In Advanced Mean Field Methods; MIT Press: Cambridge, MA, USA, 2001; pp. 21–36.
21. Sobol, I.M. Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Math. Comput. Simul. 2001, 55, 271–280.
22. Saltelli, A.; Ratto, M.; Tarantola, S.; Campolongo, F. Sensitivity analysis for chemical models. Chem. Rev. 2005, 105, 2811–2828.
23. Shirali, S. A Concise Introduction to Measure Theory; Springer: Berlin/Heidelberg, Germany, 2018.
24. Abramowitz, M.; Stegun, I.A. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables; Dover: Mineola, NY, USA, 1974.
Figure 1. The exact ($L S opt$) and the reduced complexity ($L S apr$) linear LS regression.
Figure 2. The relative total mean-square error of the approximate linear LS regression.
Figure 3. The second central sum-moments of the 2nd order Markov process with parameter $α$ and length N.
Figure 4. The Minkowski (blue), sum-moment (black), and sum-moment of absolute values (red) mean statistics for the 1st order Markov sequence of length N. Columns: different values $α$. Rows: different values m.
Figure 5. The second central sum-moment as a function of time differences between $N = 3$ observations of a stationary random process.
Figure 6. The MSE of approximating the (auto-) covariance of the 1st order Markov process at the output of length N MA filter by the (auto-) covariance of the 2nd order Markov process.

## Share and Cite

MDPI and ACS Style

Loskot, P. Polynomial Representations of High-Dimensional Observations of Random Processes. Mathematics 2021, 9, 123. https://doi.org/10.3390/math9020123
