2.1. Maximum Entropy Method
The maximum entropy method (MEM) for the estimation of probability density is a rigorous method for producing unbiased parametric estimates [39]. A brief summary of the multivariate case for d variables is provided as background for related previous work. The variables are described as a vector in a d-dimensional space given by $\mathbf{x} = (x_1, x_2, \ldots, x_d)$. Moments in the i-th direction are given as $\mu_i^{(m)} = \langle x_i^{m} \rangle$. To describe different moments for each component of $\mathbf{x}$, it is convenient to define the multi-index $\mathbf{m} = (m_1, m_2, \ldots, m_d)$. A general moment in terms of $\mathbf{m}$ is given as $\mu_{\mathbf{m}} = \langle x_1^{m_1} x_2^{m_2} \cdots x_d^{m_d} \rangle$. These product moments can be further generalized by using d Cartesian product functions given by

$\mu_{\mathbf{m}} = \langle \Phi_{\mathbf{m}}(\mathbf{x}) \rangle$ with $\Phi_{\mathbf{m}}(\mathbf{x}) = \prod_{j=1}^{d} \phi_{m_j}(x_j)$, (1)

where $\phi_{m_j}$ is a one-dimensional basis function of order $m_j$.
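For example, with d = 2 and the multi-index $\mathbf{m} = (2, 1)$ (an illustrative choice, not a case from the benchmarks below), the product moment is

$\mu_{(2,1)} = \langle x_1^{2}\, x_2 \rangle = \int x_1^{2}\, x_2\, p(x_1, x_2)\, dx_1\, dx_2,$

which coincides with $\langle \Phi_{(2,1)}(\mathbf{x}) \rangle$ when monomial basis functions $\phi_{m_j}(x_j) = x_j^{m_j}$ are used.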
The MEM obtains an unbiased density $\hat{p}(\mathbf{x})$ by maximizing Shannon entropy while satisfying all known constraints, including the normalization condition. This can be expressed as an optimization problem by maximizing the expression

$-\int \hat{p}(\mathbf{x}) \ln \hat{p}(\mathbf{x})\, d\mathbf{x} + \lambda_0 \left( \int \hat{p}(\mathbf{x})\, d\mathbf{x} - 1 \right) + \sum_{\mathbf{m}} \lambda_{\mathbf{m}} \left( \int \Phi_{\mathbf{m}}(\mathbf{x})\, \hat{p}(\mathbf{x})\, d\mathbf{x} - \mu_{\mathbf{m}} \right).$ (2)

Solving the extremum problem formally gives the solution to Equation (2) to be of the form

$\hat{p}(\mathbf{x}) = A \exp\left( \sum_{\mathbf{m}} \lambda_{\mathbf{m}} \Phi_{\mathbf{m}}(\mathbf{x}) \right),$ (3)

where $\hat{p}(\mathbf{x})$ is the sample estimate of the probability density function (PDF), A is a normalization constant, and $\{\lambda_{\mathbf{m}}\}$ is the set of Lagrange coefficients. Determining $\{\lambda_{\mathbf{m}}\}$ for a set of moments $\{\mu_{\mathbf{m}}\}$ is computationally difficult, particularly as the number of moments becomes large and as the number of dimensions increases. Computational methods have been developed for a limited number of moments and dimensions [22,25], including more recent methods based on fractional moments. In other works, MEM is combined with dimension reduction techniques [26,27] to mitigate the computational difficulties of higher dimensions. A more sophisticated approach first creates a histogram of the sampled data to obtain a rough estimate, and then estimates the coefficients of 1D slices of the histogram in each dimension. These coefficients are then expanded in d dimensions to produce a grid of coefficients used in Equation (3) with Legendre polynomials as a basis set [23,25].
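To make the functional form of Equation (3) concrete, the short sketch below evaluates a one-dimensional density of the form $A \exp(\sum_j \lambda_j T_j(x))$ with a Chebyshev basis and fixes A by numerical quadrature. The coefficient values are arbitrary and the use of NumPy is an implementation assumption; this is not the fitting procedure used to determine the Lagrange coefficients.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def mem_density_unnorm(lambdas, x):
    """Unnormalized MEM form on [-1, 1]: exp(sum_j lambda_j * T_j(x))."""
    return np.exp(C.chebval(x, lambdas))

def mem_density(lambdas, n_quad=2001):
    """Return p(x) = A * exp(sum_j lambda_j * T_j(x)), with A fixed by a simple
    numerical quadrature so that p integrates to one on [-1, 1]."""
    grid = np.linspace(-1.0, 1.0, n_quad)
    dx = grid[1] - grid[0]
    A = 1.0 / (np.sum(mem_density_unnorm(lambdas, grid)) * dx)
    return lambda x: A * mem_density_unnorm(lambdas, x)

# Illustrative coefficients only; in the estimator described below the coefficients
# are fitted to the data rather than chosen by hand.
p_hat = mem_density(np.array([0.0, 0.3, -1.2, 0.1]))
x = np.linspace(-1.0, 1.0, 2001)
print(np.sum(p_hat(x)) * (x[1] - x[0]))  # close to 1.0
```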
Our recent nonparametric probability density estimator for univariate data [33,34,35,36] begins with MEM to obtain the general form of the solution given in Equation (3) for the case d = 1. Chebyshev polynomials are used as a basis, with the domain on [−1, 1] after the data are scaled. The Lagrange coefficients are optimized based on a trial cumulative distribution function (CDF), $u = \hat{F}(x) = \int_{-1}^{x} \hat{p}(x')\, dx'$, where $\hat{F}$ is evaluated for all data points $\{x_k\}$ in the random sample. The set $\{u_k\} = \{\hat{F}(x_k)\}$ will be uniform on [0, 1] when the parameter set $\{\lambda_m\}$ is correct within statistical resolution. Instead of specifying a preset number of Lagrange multipliers, the coefficients are optimized by a random search method, adding more terms to the expansion if necessary. The process continues until the CDF ensures that $\{u_k\}$ is distributed as sampled uniform random data (SURD). The deviation from SURD is measured using a scoring function based on single order statistics, which is sample size invariant [34]. The method is stable for several hundred Lagrange multipliers. Key features of the method include outlier removal and adaptive resolution, which avoid the bandwidth and boundary problems of KDE methods. A generalization of this univariate density estimation method to higher dimensions using the form given in Equation (3) is not possible, because the properties of order statistics are lost once d > 1 is considered. Nevertheless, a multivariate PDF can be obtained by calculating many 1D PDFs in terms of a product array of conditional probability functions.
2.2. Product Array of Conditional Probabilities
Similar in strategy to other methods [23,25], a multivariate probability density is estimated using multiple strips of one-dimensional conditional probabilities. The multivariate joint probability distribution is expressed as a product of one-dimensional conditional and marginal probability distributions. To illustrate the method, consider the simplest multivariable case of two variables. The joint probability of two variables is expressed as a product of conditional and marginal probabilities as $p(x_1, x_2) = p(x_2 \mid x_1)\, p(x_1)$. Estimating the conditional and marginal probability densities separately using the 1D PDF estimator, the joint probability density is estimated as $\hat{p}(x_1, x_2) = \hat{p}(x_2 \mid x_1)\, \hat{p}(x_1)$.
Figure 1a demonstrates a two-dimensional example with a 5 by 5 grid of cells, numbered from 1 to 25, superimposed over a scatter plot of sample data from a bivariate standard Gaussian distribution. The grid lines shown are used to stratify the data into strips, and the number of grid lines is the same in each dimension. For dimension d, the number of grid lines is heuristically determined by Equation (4) as a function of the sample size n. The scale factor of 100 in Equation (4) provides a good balance between accuracy and computation time. Moreover, Equation (4) suggests a practical minimum number of grid lines to use, and, read in reverse, it gives a reasonable guide for how many independent observations are needed to produce a good nonparametric estimate as a function of dimension. Since the details of a particular problem affect the minimum sample size required for an accurate estimate, this heuristic sets a reasonable reference. For instance, sharp features or heavy tails in the probability density require more samples, while fewer samples suffice in high dimensions when there are strong correlations between the d random variables.
As seen in Figure 1, the grid line spacing is not uniform, because the spacing adapts to the density of the data in each dimension. This is achieved by first calculating the marginal densities with the 1D PDF estimator for each variable. The range of the CDF on [0, 1] is divided into equally spaced marks separated by a constant probability increment set by the number of grid lines from Equation (4). To map from [0, 1] back to the variable, the locations of the grid lines are obtained using the inverse of the estimated CDF, which is the estimated quantile function $\hat{Q}_j$ for the j-th variable. The grid lines therefore divide the data into equal fractions for the j-th variable, at the locations given by Equation (5). For the case shown in Figure 1, each variable is partitioned into five strips. In general, this method creates one cell for each combination of strips across the d variables, giving the 25 cells of Figure 1a.
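The adaptive grid placement can be sketched as follows: place equally spaced marks on [0, 1] and map them back through each variable's quantile function. The sketch below substitutes the empirical quantile function (numpy.quantile) for the quantile function obtained from the 1D PDF estimator, and the five-strip choice simply mirrors Figure 1.

```python
import numpy as np

def adaptive_grid_lines(samples, n_strips=5):
    """Place interior grid lines per variable at equal probability spacing,
    mapped back to the variable through an (empirical) quantile function.

    samples: array of shape (n, d); returns a list of d arrays holding the
    n_strips - 1 interior grid-line locations for each variable."""
    probs = np.arange(1, n_strips) / n_strips           # 0.2, 0.4, 0.6, 0.8 for 5 strips
    return [np.quantile(samples[:, j], probs) for j in range(samples.shape[1])]

rng = np.random.default_rng(1)
data = rng.standard_normal((2500, 2))                   # bivariate standard Gaussian
lines = adaptive_grid_lines(data, n_strips=5)
print(lines[0])  # close to the 0.2/0.4/0.6/0.8 quantiles of a standard normal
```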
The estimated density is evaluated along the centerline midway between the two grid lines that define the boundaries of a data strip, as shown in Figure 1b for the $x_1$ variable. The blue points represent all observations whose $x_1$ value falls within the strip (−0.82 < $x_1$ < −0.21). These points are projected onto the vertical centerline (ignoring the $x_1$ values) to create a 1D problem where the probability density is a function of $x_2$ along the centerline positioned at the center of the strip. This process is repeated for the five centerlines representing each data strip. Special points along the estimated conditional probabilities corresponding to cell centers are evaluated to obtain an estimate of the joint probability density at the center of each cell. Interpolation is used to obtain estimates anywhere else in the hypercube if needed. The joint probability density estimate at the center of cell #9 in Figure 1a, for example, is the product of two 1D PDF estimates:

$\hat{p}\big(x_1^{(j_1)}, x_2^{(j_2)}\big) = \hat{p}\big(x_2^{(j_2)} \mid x_1^{(j_1)}\big)\, \hat{p}\big(x_1^{(j_1)}\big),$ (6)

where $x_1^{(j_1)}$ and $x_2^{(j_2)}$ denote the cell-center coordinates indexed by the strips $(j_1, j_2)$ containing the cell.
Because each conditional probability density and marginal probability density is individually normalized by the 1D estimator, multiplying a conditional probability density by a marginal probability density means that normalizing the joint probability density function is unnecessary; normalization of the entire constructed joint probability density is automatic. Since there is no smoothing function (as used in KDE through a selected kernel), the shrinkage that always exists with KDE is avoided. The general procedure of using a product array of conditional probability densities and a marginal probability density is markedly stable and accurate.
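A minimal two-variable sketch of this construction is given below, following the strip of Figure 1b. It substitutes scipy.stats.gaussian_kde for the 1D PDF estimator (so the shrinkage-free property discussed above does not carry over to the sketch), and the sample size and quantile-based cell centers are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
x1, x2 = rng.standard_normal((2, 2500))            # bivariate standard Gaussian sample

# One data strip in x1 (the Figure 1b strip boundaries are reused for illustration).
lo, hi = -0.82, -0.21
in_strip = (x1 > lo) & (x1 < hi)

p1 = gaussian_kde(x1)                              # marginal density of x1
p2_given_strip = gaussian_kde(x2[in_strip])        # conditional density along the centerline

x1_center = 0.5 * (lo + hi)                        # centerline location in x1
x2_centers = np.quantile(x2, (np.arange(5) + 0.5) / 5)   # cell centers along x2

# Joint density at the five cell centers within this strip (cf. Equation (6)).
joint_at_centers = p2_given_strip(x2_centers) * p1([x1_center])[0]
print(joint_at_centers)
```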
This process is iterated for each additional variable. Equation (6) generalizes to d dimensions by writing a joint probability density estimate as a conditional probability density estimate multiplied by a marginal probability density estimate, given as

$\hat{p}(x_1, x_2, \ldots, x_d) = \hat{p}(x_d \mid x_1, \ldots, x_{d-1})\, \hat{p}(x_1, \ldots, x_{d-1}).$ (7)

The 1D PDF estimator makes conditional probability density estimates based on stratified data assigned to a centerline. At the next iteration, the marginal probability density for the first d − 1 variables is expanded into another conditional probability density and a marginal probability density of d − 2 variables. Recursively repeating this process until a marginal probability density estimate of a single remaining variable is reached yields the expression

$\hat{p}\big(x_1^{(j_1)}, x_2^{(j_2)}, \ldots, x_d^{(j_d)}\big) = \hat{p}\big(x_1^{(j_1)}\big) \prod_{k=2}^{d} \hat{p}\big(x_k^{(j_k)} \mid x_1^{(j_1)}, \ldots, x_{k-1}^{(j_{k-1})}\big),$ (8)

where the value $x_k^{(j_k)}$ denotes a specific value of the k-th variable, $x_k$. The specific point is labeled using the index $j_k$, and it lies on a centerline for the k-th variable, midway between two grid lines that bound a data strip. In particular, each $j_k$ corresponds to a specific $x_k^{(j_k)}$ that is set by Equation (5). Although the factorization in Equation (8) is valid for all continuous values of the variables, the functions are evaluated only at the discrete grid points indicated. Indeed, all discrete points used to represent the d-dimensional joint probability density are initially determined by the univariate marginal probability densities for each dimension, as described above, where the grid lines are defined through Equation (5).
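As a worked instance of the recursion for an assumed d = 3, expanding Equation (7) twice gives the fully factored form of Equation (8):

$\hat{p}(x_1, x_2, x_3) = \hat{p}(x_3 \mid x_1, x_2)\, \hat{p}(x_1, x_2) = \hat{p}(x_3 \mid x_1, x_2)\, \hat{p}(x_2 \mid x_1)\, \hat{p}(x_1),$

with each factor evaluated at the grid points $x_k^{(j_k)}$ defined through Equation (5).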
In this approach, the 1D PDF estimator is applied many times: once per variable to obtain the marginal PDFs as an initialization step (i.e., d times), and once per centerline representing a conditional probability density based on stratified data. For example, the 1D PDF estimator is used seven times for the case shown in Figure 1, where d = 2 and each variable is divided into five strips (two marginal estimates plus five conditional estimates). In general, variable k contributes one conditional estimate per combination of strips of the preceding k − 1 variables, so with $N_s$ strips per variable the total number of calls is $d + \sum_{k=2}^{d} N_s^{\,k-1}$, which grows rapidly with both the dimension and the number of strips. A summary of the main steps at a high level is as follows (a minimal code sketch follows the list):
Estimate 1D marginal densities for each variable.
Define strip boundaries from the marginal CDF of each variable at equal probability spacing.
Estimate the conditional probability density at the centerline of each strip.
Initialize the joint probability density at the center of each cell to the marginal density of the first variable.
For each of the remaining variables, iterate v = 1 to d − 1:
- (a) Extract the sample points for successive stratification as v is incremented.
- (b) Project the sample points found in (a) onto the centerline representing a special point.
- (c) Estimate the conditional probability density using the 1D PDF estimator per centerline.
- (d) Multiply the estimated conditional probability density from (c) into the current value per cell.
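A minimal sketch of the listed steps for general d is shown below. It again substitutes scipy.stats.gaussian_kde for the 1D PDF estimator and uses empirical quantiles to place the grid, whereas the method described above obtains the grid from the estimated marginal CDFs; the strip count and sample size are illustrative.

```python
import itertools
import numpy as np
from scipy.stats import gaussian_kde

def product_array_density(data, n_strips=5, estimator_1d=gaussian_kde):
    """Sketch of the listed steps for d variables. Returns (centers, grid), where
    grid[(j1, ..., jd)] approximates the joint density at the corresponding cell center."""
    n, d = data.shape
    edge_probs = np.linspace(0.0, 1.0, n_strips + 1)
    center_probs = (np.arange(n_strips) + 0.5) / n_strips
    # Steps 1-2: per-variable strip boundaries and cell centers at equal probability spacing.
    edges = [np.quantile(data[:, j], edge_probs) for j in range(d)]
    centers = [np.quantile(data[:, j], center_probs) for j in range(d)]
    # Step 4: initialize every cell with the marginal density of the first variable.
    marg = estimator_1d(data[:, 0])(centers[0])
    grid = marg.reshape((n_strips,) + (1,) * (d - 1)) * np.ones((n_strips,) * d)
    # Step 5: for each remaining variable, stratify on all previously processed variables,
    # estimate the conditional density on each centerline, and multiply it into the cells.
    for k in range(1, d):
        for strip in itertools.product(range(n_strips), repeat=k):
            mask = np.ones(n, dtype=bool)
            for j, s in enumerate(strip):
                mask &= (data[:, j] >= edges[j][s]) & (data[:, j] <= edges[j][s + 1])
            vals = estimator_1d(data[mask, k])(centers[k])   # conditional at cell centers
            grid[strip] *= vals.reshape((n_strips,) + (1,) * (d - k - 1))
    return centers, grid

rng = np.random.default_rng(3)
sample = rng.standard_normal((5000, 3))
_, joint = product_array_density(sample, n_strips=4)
print(joint.shape)  # (4, 4, 4) grid of cell-center density estimates
```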
Note that dimension reduction methods could be used as a preprocessing step before this algorithm is used when dealing with high-dimensional data. This line of analysis will be reported elsewhere. Importantly, the implementation of a transformation as a preprocessing step does not change the algorithm presented in this paper.
2.3. Synthetic Data Benchmark
A Gaussian mixture model (GMM) of various types [12,15,16] is often used to benchmark estimation methods. The general form of a GMM with m modes is given as

$p(\mathbf{x}) = \sum_{i=1}^{m} w_i\, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ with $\sum_{i=1}^{m} w_i = 1$, (9)

where a Gaussian distribution in d dimensions is given as

$\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^d\, |\boldsymbol{\Sigma}|}} \exp\!\left[ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right].$ (10)

To assess the accuracy and speed of the multivariate estimator, five families of GMMs are considered, with parameters given in Table 1. For each case, the covariance matrices depend on a correlation factor k, and several values of k are considered.
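A benchmark of this form can be set up as sketched below: define the mixture of Equations (9) and (10), sample from it, and evaluate its exact density for later error calculations. The two-mode layout, weights, and correlation factor shown are placeholders; the actual benchmark parameters are those of Table 1.

```python
import numpy as np
from scipy.stats import multivariate_normal

def make_gmm(d=3, k=0.5):
    """Two-mode example GMM with covariance off-diagonals set by a correlation factor k."""
    cov = np.full((d, d), k) + (1.0 - k) * np.eye(d)
    means = [np.zeros(d), 2.0 * np.ones(d)]              # modes along the body diagonal
    weights = np.array([0.5, 0.5])
    comps = [multivariate_normal(mean=m, cov=cov) for m in means]
    pdf = lambda x: sum(w * c.pdf(x) for w, c in zip(weights, comps))
    def sample(n, rng):
        counts = rng.multinomial(n, weights)             # component counts for n draws
        return np.vstack([c.rvs(size=ni, random_state=rng) for c, ni in zip(comps, counts)])
    return pdf, sample

rng = np.random.default_rng(4)
pdf, sample = make_gmm(d=3, k=0.5)
data = sample(10000, rng)
print(data.shape, pdf(np.zeros(3)))
```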
In addition, a single-mode Cauchy model is considered. The joint PDF is expressed as a function of the marginal and cumulative PDFs using a copula probability density, $c(\cdot)$, according to Sklar's theorem as

$p(x_1, \ldots, x_d) = c\big(F_1(x_1), \ldots, F_d(x_d)\big) \prod_{j=1}^{d} p_j(x_j),$ (11)

where the same Cauchy distribution is used for all the marginals in Equation (11). Five different copulas were considered for generating correlated random samples: Gaussian, t, Frank, Clayton, and Gumbel. The Gaussian and t copulas are easily extended to an arbitrary number of dimensions and were thus calculated separately and compared for all cases. The remaining Archimedean copulas were restricted to the analysis of bivariate data only. The copula correlations are set to produce an effect analogous to the correlation factor k used in the covariance matrices for the GMMs. The Cauchy distribution is given by

$p(x) = \frac{1}{\pi \gamma \left[ 1 + \left( \frac{x - x_0}{\gamma} \right)^{2} \right]},$ (12)

and in standard form the location and scale parameters are set to $x_0 = 0$ and $\gamma = 1$, respectively. Note that for distributions with a single mode, performance is unaffected by translations and scaling of variables. Hence, only standard parameters are reported.
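Correlated Cauchy samples of the kind described here can be generated with a Gaussian copula as sketched below: draw correlated standard normals, map them to uniforms with the normal CDF, and push the uniforms through the Cauchy quantile function. The correlation value and dimension are placeholders.

```python
import numpy as np
from scipy import stats

def gaussian_copula_cauchy(n, d=2, rho=0.5, rng=None):
    """Sample n points whose marginals are standard Cauchy and whose dependence
    comes from a Gaussian copula with constant pairwise correlation rho."""
    rng = np.random.default_rng() if rng is None else rng
    corr = np.full((d, d), rho) + (1.0 - rho) * np.eye(d)
    z = rng.multivariate_normal(np.zeros(d), corr, size=n)   # correlated standard normals
    u = stats.norm.cdf(z)                                     # copula: uniforms on (0, 1)
    return stats.cauchy.ppf(u)                                # standard Cauchy marginals

samples = gaussian_copula_cauchy(5000, d=2, rho=0.5, rng=np.random.default_rng(5))
print(samples.shape)
```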
A visualization of a typical estimate is shown in Figure 2 for the third GMM (two heavily overlapped modes) listed in Table 1. To assess the accuracy of an estimate, the mean squared error (MSE) between the estimate and the known distribution is used. The MSE is defined as

$\mathrm{MSE} = \frac{1}{N_e} \sum_{i=1}^{N_e} \big( \hat{p}(\mathbf{x}_i) - p(\mathbf{x}_i) \big)^2,$ (13)

for $N_e$ evaluation points $\{\mathbf{x}_i\}$, which are obtained by applying the estimated quantile functions to a hypercube {g} of regular, uniformly spaced points on [0, 1], covering all possible combinations across the d variables. In addition to the direct formula of Equation (13), the MSE was also calculated for subsamples of points. First, the estimates at all evaluation points are sorted from largest to smallest. Sweeping over a range of percentiles allows the estimates to be stratified into nine distinct bins. For each bin, the MSE is calculated for all points that fall into the bin. It was found that the MSE is distributed approximately uniformly across all bins for all cases discussed in this subsection.
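The error measure can be sketched as below: build the evaluation grid from per-variable quantiles, apply Equation (13), and additionally report the MSE within percentile bins of the estimated density. Empirical quantiles, the callable signatures of p_true and p_hat, and nine equal-count bins are assumptions of the sketch.

```python
import numpy as np

def mse_on_quantile_grid(p_true, p_hat, samples, pts_per_dim=10, n_bins=9):
    """MSE between p_hat and p_true on a hypercube of evaluation points obtained by
    mapping uniformly spaced probabilities through each variable's quantile function,
    plus the MSE within percentile bins of the estimated density.

    p_true, p_hat: callables taking an (N, d) array of points and returning densities."""
    n, d = samples.shape
    probs = (np.arange(pts_per_dim) + 0.5) / pts_per_dim
    axes = [np.quantile(samples[:, j], probs) for j in range(d)]   # per-variable quantiles
    mesh = np.meshgrid(*axes, indexing="ij")
    pts = np.stack([m.ravel() for m in mesh], axis=1)              # all combinations
    err2 = (p_hat(pts) - p_true(pts)) ** 2
    order = np.argsort(p_hat(pts))[::-1]                           # largest estimate first
    bins = np.array_split(err2[order], n_bins)
    return err2.mean(), [b.mean() for b in bins]
```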
An important property of a product array of conditional probabilities to assess is the sensitivity to variable ordering when calculating Equation (8) using the five-step process outlined above. For d variables, there are d! possible orderings that can be executed. A multivariate distribution for d = 4 is constructed using the fifth GMM listed in Table 1 with four modes, designed to ensure that each dimension has its own distinct features. Density estimates for sample sizes 100 and 1000 are calculated and compared over 10 samples for each of the 24 possible orderings. At each of the grid points determined by Equation (5), which do not depend on the variable order, there is no discernible difference between the mean variance due to different variable orders and the variance obtained between samples (data not shown). Based on this result, no conditions are imposed on how to choose the variable order. It is worth noting that an average could be performed over different variable orders if significant differences between orders occurred. However, no case has been found where accuracy or speed is sensitive to variable ordering. The effect of differences in variable ordering, if any, will be investigated when dimension reduction methods, such as principal component analysis, are combined with this algorithm.
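Ordering sensitivity can be probed directly by permuting the columns of the sample before constructing the estimate, as sketched below; product_array_density refers to the earlier illustrative sketch, not to the authors' implementation, and the spread statistic is simply the largest standard deviation across orderings.

```python
import itertools
import numpy as np

def ordering_spread(data, estimate_fn, n_strips=4):
    """Rebuild the density grid under every variable ordering and report the largest
    standard deviation of the cell-center estimates across orderings."""
    d = data.shape[1]
    grids = []
    for order in itertools.permutations(range(d)):        # all d! orderings
        _, grid = estimate_fn(data[:, list(order)], n_strips=n_strips)
        # Undo the axis permutation so that cells line up across orderings.
        grids.append(np.moveaxis(grid, range(d), order))
    return np.stack(grids).std(axis=0).max()

# Example: spread = ordering_spread(sample, product_array_density)
```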
As a further benchmark, the first through fourth GMMs listed in Table 1 are used to assess the speed and accuracy of the method. The GMM parameters are such that all variable orderings are statistically the same, because the distributions are centered on the origin and along a single direction (the body diagonal) in d dimensions. This means that the center positions of all Gaussian distributions within the mixture lie along a one-dimensional line that is not optimal for any of the variables. These datasets were generated for two through six variables, using degenerate parameters in each dimension to provide a clear demonstration of trends as the dimension increases or the correlation factor is changed. In addition, dozens of other distributions were generated and analyzed (data not shown) to verify that symmetrical GMMs are not artificially more accurate than cases without special symmetries. Therefore, as a fair representation of the accuracy of the method, the average MSE is calculated over 10 trials for six distributions: the four GMMs in Table 1, a single-mode Cauchy modeled with a Gaussian copula, and an additional single-mode Cauchy using a t-copula. Results spanning five orders of magnitude in sample size are listed in Table 2.
Several trends are noticeable and are summarized in the plots of Figure 3, averaged over the four GMMs and the two Cauchy distributions. The top row shows calculation time as a function of sample size with no correlation (left) and as a function of linear correlation for a sample size of 100,000 (right). The bottom row of Figure 3 shows the accuracy for the same parameters. In summary, as sample size increases, computational time increases and accuracy improves, as expected. As variables become strongly correlated, accuracy and computational time both decrease. For all distributions, higher dimensions require more time to estimate. Somewhat less intuitively, accuracy improves when the dimension is increased, which can be observed regardless of the subsampling applied to Equation (13).