Article

Local Intrinsic Dimensionality, Entropy and Statistical Divergences

1 School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC 3010, Australia
2 School of Computer Science, Fudan University, Shanghai 200437, China
* Author to whom correspondence should be addressed.
This work was partially conducted when M.E.H. was with the National Institute of Informatics, Japan.
Entropy 2022, 24(9), 1220; https://doi.org/10.3390/e24091220
Submission received: 4 July 2022 / Revised: 22 August 2022 / Accepted: 26 August 2022 / Published: 30 August 2022
(This article belongs to the Special Issue Entropies, Divergences, Information, Identities and Inequalities)

Abstract

Properties of data distributions can be assessed at both global and local scales. At a highly localized scale, a fundamental measure is the local intrinsic dimensionality (LID), which assesses growth rates of the cumulative distribution function within a restricted neighborhood and characterizes the geometry of a local neighborhood. In this paper, we explore the connection of LID to other well-known measures for complexity assessment and comparison, namely, entropy and statistical distances or divergences. In an asymptotic context, we develop new analytical expressions for these quantities in terms of LID. This reveals the fundamental nature of LID as a building block for characterizing and comparing data distributions, opening the door to new methods for distributional analysis at a local scale.

1. Introduction

Fundamental activities for analyzing data include both an ability to characterize data complexity and an ability to make comparisons between distributions. Widely used measures for these activities include entropy (for assessing uncertainty) and statistical divergences or distances (to compare distributions) [1]. Such analysis can be performed at either a global scale across the entire data distribution or at a local scale, in the vicinity of a given location in the distribution.
An important measure of global complexity is intrinsic dimensionality, which captures the effective number of degrees of freedom needed to describe the entire dataset. On the other hand, local intrinsic dimensionality (LID) [2] is capable of characterizing the complexity of the data distribution around a specified query location, thus capturing the number of degrees of freedom present at a local scale. LID is a unitless quantity that can also be interpreted as a relative growth rate of probability measure within an expanding neighborhood around the specified query location, or the intrinsic dimension of the space immediately around the query point.
Our focus in this paper is to characterize entropy and statistical divergences at a highly local scale, for an asymptotically small vicinity around a specified location. We show that it is possible to leverage properties that arise from LID based characterizations of lower tail distributions [3], to develop analytical expressions for a wide selection of entropy variants and statistical divergences, in both univariate and multivariate settings. This yields expressions for tail entropies and tail divergences.
Analytical characterizations for tail divergences and tail entropies are appealing from a number of perspectives. These are as follows:
  • For univariate scenarios, working with the tail of a distribution of a single variable, we can conduct:
    Temporal analysis: when a distribution models some property varying over time (e.g., survival analysis), we can analyze the entropy of a univariate distribution within an asymptotically short window of time, or the divergence between two univariate distributions within an asymptotically short window of time.
    Distance-based analysis: when a distribution models distances from a query location to its nearest neighbors and the distances are induced by a global data distribution. Here, our results can be used for analysis of tail entropy or divergence between distributions within an asymptotically small distance interval. In the case of the latter, this can provide insight into multivariate properties, since under minimal assumptions the divergences between univariate distance distributions provide lower bounds for distances between multivariate distributions [4,5]. This is applicable for models such as generative adversarial networks (GANs), where it is important to test correspondence between synthetic and true distributions at a local level [6].
  • For multivariate scenarios where we are analyzing distributions with multiple variables:
    If an assumption of locally spherical symmetry of the distribution holds, then we can directly compute the tail entropy of a distribution or the divergence between two tail distributions in the vicinity of a single point. Such an assumption is suitable for analyzing data distributions for many types of physical systems such as fluids, glasses, metals and polymers, where local isotropy holds.
A key challenge in developing analytical characterizations for tail entropies and tail divergences is how to avoid or minimize assumptions about the form of the local distribution in the vicinity of the query (for example, assumptions such as a local normal distribution or a local uniform distribution). As we will see, analytical results are in fact possible: as the neighborhood radius asymptotically tends to zero, the tail distribution (a truncated distribution induced from the global distribution) is guaranteed to converge to a generalized Pareto distribution (GPD), with the GPD parameter determined by the LID value of the tail distribution. The technical challenge is to rigorously delineate the circumstances under which it is possible to leverage this relationship to achieve a dramatic simplification of the integrals required to compute varieties of tail entropy or distribution divergences. Our results in this paper show that such simplifications are in fact possible for a wide range of tail entropies and divergences. This allows us to characterize and analyze fundamental properties of local neighborhood geometry, with results holding asymptotically for essentially all smooth data distributions.
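To make the convergence claim above concrete, here is a minimal numerical sketch (our own illustration, not part of the paper; it assumes NumPy and SciPy are available). For distances to the origin under a 3-dimensional standard Gaussian, the distance CDF F is the chi CDF with 3 degrees of freedom, and the tail-conditioned CDF F(t)/F(w) approaches the power law (t/w)^3 as the tail length w shrinks, consistent with an LID value of 3 at the origin.

```python
# Numerical sketch (illustration only): the tail-conditioned distance CDF of a
# 3-dimensional standard Gaussian approaches the power law (t/w)^3 as w -> 0.
import numpy as np
from scipy.stats import chi

n = 3                      # ambient dimension; the LID at the origin equals n here
F = chi(df=n).cdf          # CDF of the Euclidean distance to the origin
for w in [1.0, 0.1, 0.01]:
    t = np.linspace(1e-9, w, 500)
    gap = np.max(np.abs(F(t) / F(w) - (t / w) ** n))
    print(f"w = {w:5.2f}   max deviation from (t/w)^n: {gap:.2e}")
```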
In summary, our key contributions are the development of substantial new theory that asymptotically relates tail entropy, divergences and LID. It builds on and extends an earlier work by Bailey et al. [3], which focused solely on univariate entropies, without reference to divergences or multivariate settings. Specifically in this paper, we:
  • Formulate technical lemmas which delineate when it is possible to substitute certain types of tail distributions by simple formulations that depend only on their associated LID values.
  • Use these lemmas to compute univariate tail formulations of entropy, cross entropy, cumulative entropy, entropy power and generalized q-entropies, all in terms of the LID values of the original tail distributions.
  • Use these lemmas to compute tail formulations of univariate statistical divergences and distances (Kullback–Leibler divergence, Jensen–Shannon divergence, Hellinger distance, χ 2 divergence, α -divergence, Wasserstein distance and L 2 distance).
  • Extend the univariate results to a multivariate context, when local spherical symmetry of the distribution holds.

2. Related Work

The core of our study involves intrinsic dimensionality (ID) and we begin by reviewing previous work on this topic.
There is a long history of work on ID, and this can be assessed either globally (for every data point) or locally (with respect to a chosen query point). Surveys of the field provide a good overview [7,8,9]. In the global case, a range of previous works have focused on topological models and appropriate estimation methods [10,11,12]. Such examples encompass techniques such as PCA and its variants [13], graph based methods [14] and fractal models [7,15]. Other approaches such as IDEA [16,17], DANCo [18] or 2-NN estimate the (global) intrinsic dimension based on concentration of norms and angles, or 2-nearest neighbors [19].
Local intrinsic dimensionality focuses on the intrinsic dimension of a particular query point and has been used in a range of applications. These include modeling deformation in granular materials [20,21], climate science [22,23], dimension reduction via local PCA [24], similarity search [25], clustering [26], outlier detection [27], statistical manifold learning [28], adversarial example detection [29], adversarial nearest neighbor characterization [30,31] and deep learning understanding [32,33]. In deep learning, it has been shown that adversarial examples are associated with high LID estimates, a characteristic that can be leveraged to build accurate adversarial example detectors [29]. It has also been found that the LID of deep representations [33] learned by Deep Neural Networks (DNNs) or of the raw input data [34,35] is correlated with the generalization performance of DNNs. A ‘dimensionality expansion’ phenomenon has been observed when DNNs overfit to noisy class labels [32], and this can be leveraged to develop improved loss functions. The use of a “cross-LID” measure to evaluate the quality of synthetic examples generated by GANs has been proposed in [36]. Connections between local intrinsic dimensionality and global intrinsic dimensionality were explored by Romano et al. in [37]. In the area of climate science and dynamical systems, a formulation similar to local intrinsic dimensionality has been developed and referred to as the local dimension or instantaneous dimension [22,23,38], using links to extreme value theoretic methods. It has proved useful as a measure to characterize the predictability of states and explain system dynamics.
For local intrinsic dimensionality, a popular estimator is the maximum likelihood estimator, studied in the Euclidean setting by Levina and Bickel [39] and later formulated under the more general assumptions of extreme value theory by Houle [2] and Amsaleg et al. [40], who showed it to be equivalent to the classic Hill estimator [41]. Other local estimators include expected simplex skewness [42], the tight locality estimator [43], the MiND framework [17], manifold adaptive dimension [44], statistical distance [45] and angle-based approaches [46]. Smoothing approaches for estimation have also been used with success [47,48].
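As a concrete illustration of the estimators discussed above, the following sketch implements one common form of the maximum likelihood (Hill-type) LID estimator of [39,40,41]. The exact normalization (averaging over k versus k-1 distances, and the choice of reference radius) varies between presentations, so the constants here should be read as one reasonable convention rather than a definitive prescription.

```python
# Minimal sketch of a maximum-likelihood (Hill-type) LID estimator:
#   LID_hat = -( mean_i ln(r_i / r_max) )^(-1),
# computed from the k nearest-neighbor distances of a query point, with r_max the
# largest of those distances.  Conventions differ slightly across the literature.
import numpy as np

def lid_mle(distances):
    """Estimate LID from an array of k positive nearest-neighbor distances."""
    r = np.sort(np.asarray(distances, dtype=float))
    r_max = r[-1]
    return -1.0 / np.mean(np.log(r[:-1] / r_max))   # exclude r_max (its log-ratio is 0)

# Sanity check: distances with CDF F(r) = r^m on [0, 1] have LID equal to m.
rng = np.random.default_rng(0)
m, k = 4.0, 5000
sample = rng.uniform(size=k) ** (1.0 / m)           # inverse-CDF sampling from F(r) = r^m
print(f"true LID = {m},  estimated LID = {lid_mle(sample):.2f}")
```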
Local intrinsic dimensionality is closely related to (univariate) distance distributions. Fundamental relations for interpoint distances, connecting multivariate distributions and univariate distributions have been explored by both [4,5]. The former showed that two multivariate distributions are equal whenever the interpoint distances both within and between samples have the same univariate distribution, while the latter showed that two multivariate distributions F and G are different if their univariate distance distributions from some randomly chosen point z are different. This can form the basis of a two sample test for comparing F and G. These studies have implications for our work in this paper, since they characterize the role that comparison between univariate distributions can play as a necessary condition for comparing equality of multivariate distributions.
Our work in this paper formulates results for different varieties of entropy and different types of divergences. Entropy is a fundamental notion used across many scientific disciplines. A good overview of its role in information theory is presented in [49]. Entropy power (the exponential of entropy) is commonly used in signal processing and information theory, and is a building block for the well-known Shannon entropy power inequality which can be used to analyze the convolution of two independent random variables [50]. Entropy power goes under the name of perplexity in the field of natural language processing [51] and true diversity in the field of ecology [52]. It also corresponds to the volume of the smallest set that contains most of the probability measure [49], and it can be interpreted as a measure of statistical dispersion [53]. It is also related to Fisher information via Stam’s inequality [54].
Cumulative entropy was formulated in [55] and is a modification of cumulative residual entropy [56]. It is popular in reliability theory, where it is used to characterize uncertainty over time intervals. Apart from reliability theory, it has been used in data mining tasks such as dependency analysis [57] and subspace cluster analysis [58], where it has proved effective due to its good estimation properties. These data mining investigations have used cumulative entropy at a global level (over the entire data domain), rather than at the local (tail) level, as in our study. Generalized variants based on Tsallis q-statistics have been developed for both entropy [59] and cumulative entropy [60]. Inclusion of the extra q parameter can assist with higher robustness to anomalies and better fitting to characteristics of data distributions. Tail entropy has been used in financial applications for measuring the expected shortfall [61] in the upper tail using quantization. This is different from our context, where our exclusive focus is on lower tails and we develop exact results for an asymptotic regime in which the lower tail size approaches zero.
Divergences between probability distributions are a fundamental building block in statistics and are used to assess the degree to which one probability distribution is different from another probability distribution. They have a wide range of formulations [1] and applications, which range from use as objective functions in supervised and unsupervised machine learning [62], to hypothesis and two sample or goodness of fit testing in statistics [63], as well as generative modeling in deep learning, particularly using the Wasserstein distance [64]. Asymptotic forms of KL divergence have been investigated by Contreras-Reyes [65], for comparison of multivariate asymmetric heavy-tailed distributions.
Finally, we note that this work considerably expands a recent study by Bailey et al. [3], which established relationships between tail entropies and LID. This current paper extends and generalizes that work in several directions: (i) We establish general lemmas that provide sufficient conditions for when it is possible to substitute a tail distribution with components such as a power law, inside an integral. The techniques of [3] were specially crafted for specific integrals. (ii) We provide results for statistical divergences and distances (the work of [3] only considers entropy). (iii) We show how to formulate results for the multivariate context (as [3] only considers univariate scenarios).

3. Local Intrinsic Dimensionality

In this section, we summarize the LID model using the presentation of [2]. LID can be regarded as a continuous extension of the expansion dimension [66,67]. Like earlier expansion-based models of intrinsic dimension, its motivation comes from the relationship between volume and radius in an expanding ball, where (as originally stated in [68]) the volume of the ball is taken to be the probability measure associated with the region it encloses. The probability as a function of radius—denoted by F ( r ) —has the form of a univariate cumulative distribution function (CDF). The model formulation (as stated in [2]) generalizes this notion to real-valued functions F for which F ( 0 ) = 0 , under appropriate assumptions of smoothness.
Definition 1
([2]). Let F be a real-valued function that is non-zero over some open interval containing $r\in\mathbb{R}$, $r\neq 0$. The intrinsic dimensionality of F at r is defined as follows whenever the limit exists:
$$\mathrm{IntrDim}_F(r)\;\triangleq\;\lim_{\epsilon\to 0}\frac{\ln\big(F((1+\epsilon)\,r)\,/\,F(r)\big)}{\ln(1+\epsilon)}.$$
When F satisfies certain smoothness conditions in the vicinity of r, its intrinsic dimensionality has a convenient known form:
Theorem 1
([2]). Let F be a real-valued function that is non-zero over some open interval containing $r\in\mathbb{R}$, $r\neq 0$. If F is continuously differentiable at r, then (using $F'(r)$ to denote the derivative $\frac{\mathrm{d}F(r)}{\mathrm{d}r}$)
$$\mathrm{ID}_F(r)\;\triangleq\;\frac{r\cdot F'(r)}{F(r)}\;=\;\mathrm{IntrDim}_F(r).$$
Let $\mathbf{x}$ be a location of interest within a data domain $S$ for which the distance measure $d:S\times S\to\mathbb{R}_{\geq 0}$ has been defined. To any generated sample $s\in S$ we associate the distance $d(\mathbf{x},s)$; in this way, a global distribution that produces the sample s can be said to induce the random value $d(\mathbf{x},s)$ from a local distribution of distances taken with respect to $\mathbf{x}$. The CDF $F(r)$ of the local distance distribution is simply the probability of the sample distance lying within a threshold r; that is, $F(r)\triangleq\Pr[d(\mathbf{x},s)\leq r]$. In characterizing the local intrinsic dimensionality in the vicinity of location $\mathbf{x}$, we are interested in the limit of $\mathrm{ID}_F(r)$ as the distance r tends to 0, which we denote by
$$\mathrm{ID}_F\;\triangleq\;\lim_{r\to 0^+}\mathrm{ID}_F(r).$$
Henceforth, when we refer to the local intrinsic dimensionality (LID) of a function F, or of a point x whose induced distance distribution has F as its CDF, we will take ‘LID’ to mean the quantity ID F . In general, ID F is not necessarily an integer. In practice, estimation of the LID at x would give an indication of the dimension of the submanifold containing x that best fits the distribution.
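As a simple illustration of Theorem 1 and the limit above (our own example, not taken from the paper), consider the smooth function F(r) = r^2 (1 + r): its intrinsic dimensionality ID_F(r) = r F'(r) / F(r) = (2 + 3r)/(1 + r) tends to the LID value 2 as r tends to 0.

```python
# Illustration: for F(r) = r^2 (1 + r),
#   ID_F(r) = r F'(r) / F(r) = (2 + 3r) / (1 + r)  ->  2   as r -> 0.
F  = lambda r: r**2 * (1 + r)
dF = lambda r: 2 * r + 3 * r**2          # F'(r)

for r in [1.0, 0.1, 0.01, 0.001]:
    print(f"r = {r:6.3f}   ID_F(r) = {r * dF(r) / F(r):.4f}")
```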
The function $\mathrm{ID}_F$ can be seen to fully characterize its associated function F. This result is analogous to a foundational result from the statistical theory of extreme values (EVT), in that it corresponds under an inversion transformation to the Karamata representation theorem [69] for the upper tails of regularly varying functions. For more information on EVT and how the LID model relates to the extreme-value-theoretic generalized Pareto distribution, we refer the reader to [2,70,71].
Theorem 2 (LID Representation Theorem
[2]). Let $F:\mathbb{R}\to\mathbb{R}$ be a real-valued function, and assume that $\mathrm{ID}_F$ exists. Let x and w be values for which $x/w$ and $F(x)/F(w)$ are both positive. If F is non-zero and continuously differentiable everywhere in the interval $[\min\{x,w\},\max\{x,w\}]$, then
$$\frac{F(x)}{F(w)}\;=\;\Big(\frac{x}{w}\Big)^{\mathrm{ID}_F}\cdot A_F(x,w),\quad\text{where}\quad A_F(x,w)\;\triangleq\;\exp\left(\int_x^w\frac{\mathrm{ID}_F-\mathrm{ID}_F(t)}{t}\,\mathrm{d}t\right),$$
whenever the integral exists.
In [2], conditions on x and w are provided under which the factor $A_F(x,w)$ can be seen to tend to 1 as $x,w\to 0$. The convergence characteristics of F to its asymptotic form are expressed by the factor $A_F(x,w)$, which is related to the slowly varying component of functions as studied in EVT [70]. As we will show in later sections, we make use of the LID Representation Theorem in our analysis of the limits of tail entropy variants under a form of normalization.

4. Definitions of Tail Entropies and Tail Dissimilarity Measures

In this section, we present the formulations of entropy, divergences and distances that will be studied in the later sections, in the light of the model of local intrinsic dimensionality outlined in Section 3. These entropies and dissimilarity measures will all be conditioned on the lower tails of smooth functions on domains bounded from below at zero. In each case, the formulations involve one or more non-negative real-valued functions whose restriction to [ 0 , w ] satisfies certain smooth growth properties:
Definition 2.
Let $F:\mathbb{R}_{\geq 0}\to\mathbb{R}_{\geq 0}$ be a function that is positive except at $F(0)=0$. We say that F is a smooth growth function if
  • There exists a value $r>0$ such that F is monotonically increasing over $(0,r)$;
  • F is continuous over $[0,r)$;
  • F is differentiable over $(0,r)$; and
  • The local intrinsic dimensionality $\mathrm{ID}_F$ exists and is positive.
Given a smooth growth function F and a value $w>0$, we define $F_w(t)\triangleq F(t)/F(w)$. If F is the CDF of some random variable $X\geq 0$, then $F_w(t)=\Pr[X\leq t\mid X\leq w]$, which can in turn be interpreted as the CDF of the distribution of X conditioned to the lower tail $[0,w]$. It is easy to see that for a sufficiently small choice of w, $F_w$ must also be a smooth growth function. Its derivative $F'_w(t)=F'(t)/F(w)$ exists since $F'(t)$ exists, and can thus be regarded as the probability density function (PDF) of the restriction of F to $[0,w]$. In addition, it can easily be shown (using Theorem 1) that the LID of $F_w$ is identical to that of F.
If the monotonicity of the function F is strict over the domain of interest $[0,r)$, its inverse function $F^{-1}$ exists and satisfies the smooth growth conditions within some neighborhood of the origin. Moreover, $F_w^{-1}$ is also a smooth growth function over $[0,1]$, with $F_w^{-1}(0)=0$ and $F_w^{-1}(1)=w$.
The following tail entropy, tail divergence and tail distance formulations all apply to any functions F and G satisfying the conditions stated above; in particular, they involve one or more of $F_w$, $F'_w$, $G_w$, $G'_w$, and (if the monotonicity of the functions is strict) $F_w^{-1}$ and $G_w^{-1}$. In their definitions, the only difference between the tail variants and the original versions is that the distributions are conditioned on the lower tail $[0,w]$. In the tail measures involving one or more of $F_w$, $F'_w$, $G_w$ and $G'_w$, integration is performed over the lower tail rather than the entire distributional range $[0,+\infty)$; for the variant involving $F_w^{-1}$ and $G_w^{-1}$, integration is performed over $[0,1]$ for values of w constrained to the lower tail.
We begin with (differential) tail entropy. Entropy is perhaps the most fundamental and widely used model of data complexity and can be regarded as a measure of the uncertainty of a distribution. Differential entropy assesses the expected surprisal of a random variable and can take negative values.
Definition 3
(Tail Entropy). The entropy of F conditioned on $[0,w]$ is
$$H(F,w)\;\triangleq\;-\int_0^w F'_w(t)\,\ln F'_w(t)\;\mathrm{d}t.$$
The tail entropy is equal to $\mathbb{E}[-\ln F'_w]$, the expected value of the negative (tail) log-likelihood. It is also possible to define the variance of the (tail) log-likelihood; this is known as the varentropy. To understand this further, note that one may define the information content of a random variable X with density function f to be $-\ln f(X)$. The entropy (uncertainty) then corresponds to the expected value of the information content of X, and the varentropy corresponds to the variance of the information content of X. The varentropy was introduced by Song [72] as an intrinsic measure of the shape of a distribution and has been explored in a range of studies [73,74,75].
Definition 4
(Tail varentropy). The varentropy of F conditioned on $[0,w]$ is
$$\mathrm{VarH}(F,w)\;\triangleq\;\int_0^w F'_w(t)\,\ln^2 F'_w(t)\;\mathrm{d}t\;-\;\left(\int_0^w F'_w(t)\,\ln F'_w(t)\;\mathrm{d}t\right)^{\!2}.$$
The cumulative entropy is a variant of entropy proposed in [55,56] due to its attractive theoretical properties. Tail conditioning on the cumulative entropy has the same general form as that of the tail entropy. Cumulative entropy [55,56] is an information-theoretic measure popular in reliability theory, where it is used to model uncertainty over time intervals. It corresponds to the expected value of the mean inactivity time. Compared to ordinary Shannon differential entropy, cumulative entropy has certain attractive properties, such as non-negativity and ease of estimation.
Definition 5
(Cumulative Tail Entropy). The cumulative entropy of F conditioned on $[0,w]$ is
$$\mathrm{cH}(F,w)\;\triangleq\;-\int_0^w F_w(t)\,\ln F_w(t)\;\mathrm{d}t.$$
The entropy power is the exponential of the entropy, and is also known as perplexity in the natural language processing community. It corresponds to the volume of the smallest set that contains most of the probability measure [49], and can be interpreted as a measure of statistical dispersion [53]. There are several standard definitions of entropy power in the research literature. For our purposes, we adopt the simplest—the exponential of Shannon entropy—for our definition conditioned to the tail.
Definition 6
(Tail Entropy Power). The entropy power of F conditioned on $[0,w]$ is defined to be
$$\mathrm{HP}(F,w)\;\triangleq\;\exp\big(H(F,w)\big).$$
In the introduction, we briefly mentioned some motivation for the entropy power HP ( F , w ) . We can add to this as follows:
  • It can be interpreted as a diversity. Observe that when F is a (univariate) uniform distance distribution ranging over the interval $[0,w]$, we have $\mathrm{ID}_F=1$ and $\mathrm{HP}(F,w)=w$. In other words, the entropy power is equal to the ‘effective diversity’ of the distribution (the number of neighbor distance possibilities); this is verified numerically in the sketch after this list.
  • Given two different queries, each with its own neighborhood, one query with tail entropy power equal to 2 and the other with tail entropy power equal to 4, we can say that the distance distribution of the second query is twice as diverse as that of the first query.
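The first point above can be confirmed with a minimal numerical sketch (assuming SciPy is available; this is our own check, not taken from the paper): for a uniform distance distribution on [0, w], the tail entropy equals ln(w), so the tail entropy power recovers the interval length w.

```python
# Sketch: for the uniform tail density F'_w(t) = 1/w on [0, w],
#   H(F, w) = -int_0^w (1/w) ln(1/w) dt = ln(w),  so  HP(F, w) = exp(H(F, w)) = w.
import numpy as np
from scipy.integrate import quad

for w in [0.5, 1.0, 2.0]:
    H, _ = quad(lambda t: -(1.0 / w) * np.log(1.0 / w), 0.0, w)
    print(f"w = {w}:  H = {H:.4f} (ln w = {np.log(w):.4f}),  HP = exp(H) = {np.exp(H):.4f}")
```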
For each of the tail entropy variants introduced above, we also propose analogous variants based on the q-entropy formulation due to Tsallis [59]. Generalized Tsallis entropies [59,60] are a family of entropies characterized via an exponent parameter q applied to the probabilities, in which the traditional (Shannon) entropy variants are obtained as the special case when q is allowed to tend to 1. The use of such a parameter q can often facilitate more accurate fitting of data characteristics and robustness to outliers.
Definition 7
(Tail q-Entropy). For any $q>0$ ($q\neq 1$), the q-entropy of F conditioned on $[0,w]$ is defined to be
$$H_q(F,w)\;\triangleq\;\frac{1}{q-1}\left(1-\int_0^w F'_w(t)^{\,q}\;\mathrm{d}t\right)\;=\;\frac{1}{q-1}\int_0^w\Big(F'_w(t)-F'_w(t)^{\,q}\Big)\,\mathrm{d}t.$$
Definition 8
(Cumulative Tail q-Entropy). For any $q>0$ ($q\neq 1$), the cumulative q-entropy of F conditioned on $[0,w]$ is defined to be
$$\mathrm{cH}_q(F,w)\;\triangleq\;\frac{1}{q-1}\int_0^w\Big(F_w(t)-F_w(t)^{\,q}\Big)\,\mathrm{d}t.$$
We define the tail q-entropy power using the q-exponential function from Tsallis statistics [59], $\exp_q(x)\triangleq[1+(1-q)x]^{\frac{1}{1-q}}$. Note that L’Hôpital’s rule can be used to show that $\exp_q(x)\to e^x$ as $q\to 1$.
Definition 9
(Tail q-Entropy Power). For any $q>0$ ($q\neq 1$), the q-entropy power of F conditioned on $[0,w]$ is defined to be
$$\mathrm{HP}_q(F,w)\;\triangleq\;\big[1+(1-q)\,H_q(F,w)\big]^{\frac{1}{1-q}}.$$
We next define the tail cross entropy. Cross entropy can be used to compare two probability distributions and is often employed as a loss function in machine learning, comparing a true distribution and a learned distribution. From an information theoretic perspective, cross entropy corresponds to the expected coding length when a wrong distribution G is assumed while the data actually follows a distribution F.
Definition 10
(Tail Cross Entropy). The cross entropy from F to G, conditioned on $[0,w]$, is defined to be
$$\mathrm{XH}(F;G,w)\;\triangleq\;-\int_0^w F'_w(t)\,\ln G'_w(t)\;\mathrm{d}t.$$
Similar to entropy power, we can also define the cross entropy power, which is the exponential of the cross entropy.
Definition 11
(Tail Cross Entropy Power). The cross entropy power from F to G, conditioned on $[0,w]$, is defined to be
$$\mathrm{XHP}(F;G,w)\;\triangleq\;\exp\left(-\int_0^w F'_w(t)\,\ln G'_w(t)\;\mathrm{d}t\right).$$
A classic and fundamental method for comparing two probability distributions is the Kullback–Leibler divergence (KL Divergence) [76]. K L ( F , G ) measures the degree to which a probability distribution G is different from a reference probability distribution F. It is a member of both the family of f-divergences and Bregman divergences. It is widely used in statistics, machine learning and information theory.
Definition 12
(Tail KL Divergence). The Kullback–Leibler divergence from F to G, conditioned on $[0,w]$, is defined to be
$$\mathrm{KL}(F;G,w)\;\triangleq\;\int_0^w F'_w(t)\,\ln\frac{F'_w(t)}{G'_w(t)}\;\mathrm{d}t.$$
The tail KL divergence can be connected to the tail entropy and the tail cross entropy through the relationship $\mathrm{KL}(F;G,w)=\mathrm{XH}(F;G,w)-H(F,w)$.
The Jensen–Shannon divergence (JS divergence) [77] is another popular measure of distance between probability distributions. It is based on the KL divergence, but unlike the KL, the square root of the JS divergence is a true metric.
Definition 13
(Tail JS Divergence). The Jensen–Shannon divergence between F and G, conditioned on $[0,w]$, is defined to be
$$\mathrm{JS}(F;G,w)\;\triangleq\;\frac{\mathrm{KL}(F;M,w)+\mathrm{KL}(G;M,w)}{2},\quad\text{where}\quad M(t)=\frac{F(t)+G(t)}{2}.$$
The tail JS divergence can also be written in terms of the tail entropies: $\mathrm{JS}(F;G,w)=H\big(\tfrac{F+G}{2},w\big)-\tfrac{1}{2}\big(H(F,w)+H(G,w)\big)$.
The L2 distance is the squared Euclidean distance when comparing two probability distributions. It is part of the family of β divergences when setting β = 2 [78].
Definition 14
(Tail L2 Distance). The L2 distance between F and G, conditioned on $[0,w]$, is defined to be
$$\mathrm{L2D}(F;G,w)\;\triangleq\;\int_0^w\big(F'_w(t)-G'_w(t)\big)^2\;\mathrm{d}t.$$
The Hellinger distance [79] is a true metric for comparing two probability distributions. The squared Hellinger distance is a member of the family of f-divergences and is part of the family of α-divergences when setting $\alpha=\tfrac{1}{2}$ [80].
Definition 15
(Tail Hellinger Distance). The Hellinger distance between F and G, conditioned on $[0,w]$, is defined to be
$$\mathrm{HD}(F;G,w)\;\triangleq\;\frac{1}{2}\int_0^w\Big(\sqrt{F'_w(t)}-\sqrt{G'_w(t)}\Big)^{2}\;\mathrm{d}t.$$
The χ 2 divergence between two probability distributions [81] is a member of the family of f divergences and is part of the family of α divergences when setting α = 2 [80].
Definition 16
(Tail $\chi^2$-Divergence). The $\chi^2$ divergence between F and G, conditioned on $[0,w]$, is defined to be
$$\chi^2\mathrm{D}(F;G,w)\;\triangleq\;\int_0^w\frac{\big(F'_w(t)-G'_w(t)\big)^2}{G'_w(t)}\;\mathrm{d}t.$$
The asymmetric α-divergence [80] is another member of the family of f-divergences. When $\alpha=2$, it is proportional to the $\chi^2$ divergence. When $\alpha=0.5$, it is proportional to the squared Hellinger distance. As $\alpha\to 1$, it converges to the KL divergence.
Definition 17
(Tail α-Divergence). The α-divergence from F to G, conditioned on $[0,w]$, is defined to be
$$\alpha\mathrm{D}(F;G,w)\;\triangleq\;\frac{1}{\alpha(1-\alpha)}\int_0^w\Big(\alpha F'_w(t)+(1-\alpha)G'_w(t)-F'_w(t)^{\alpha}\,G'_w(t)^{1-\alpha}\Big)\,\mathrm{d}t.$$
The Wasserstein distance between two probability distributions is also known as the Kantorovich–Rubinstein metric [82] or the earth mover’s distance. It has become very popular as part of the loss function used in generative adversarial networks [83]. In the univariate case it can be expressed in a simple analytic form.
Definition 18
(Tail Wasserstein Distance). The p-th Wasserstein distance between F and G, conditioned on $[0,w]$, is defined to be
$$\mathrm{WD}_p(F;G,w)\;\triangleq\;\left(\int_0^1\big|F_w^{-1}(u)-G_w^{-1}(u)\big|^{p}\,\mathrm{d}u\right)^{\!\frac{1}{p}}.$$
For some of the aforementioned tail measures, we will also consider a normalization of the entropy, divergence or distance (as the case may be) with respect to w, the length of the tail. In Section 5 and Section 6, we will show that as w tends to zero, the limits of these (possibly normalized) tail entropies and tail divergences can be expressed in terms of the local intrinsic dimensionalities of F and G. The notation for these variants, and our results for their limits in terms of ID F and ID G , are summarized in Table 1.

5. Simplification of Tail Measures

Next, we present the main theoretical contributions of the paper: three technical lemmas that will later be used to establish relationships between local intrinsic dimensionality and a variety of tail measures based on entropy, divergences or distances. The results presented in this section all apply asymptotically, as the tail boundary tends toward zero.
Each of the three lemmas allows, under certain conditions, the simplification of limits of integrals involving smooth growth functions of the form $F_w$ (as defined in Section 4), or their associated first derivatives $F'_w$ or inverse functions $F_w^{-1}$. The limit integral simplifications allow for the substitution of the function (or derivative, or inverse) by expressions that involve one or more of the following: the LID value of the function, the variable of integration, or the tail boundary w. Moreover, the lemmas require that the integrand be monotone with respect to small variations in the targeted function.
The first lemma allows terms of the form $F_w$ (resembling the CDF of a tail-conditioned distribution) to be converted into a term that depends only on the variable of integration, the tail length w, and the local intrinsic dimension $\mathrm{ID}_F$.
Lemma 1.
Let F be a smooth growth function over the interval $[0,r)$. Consider the function $\phi:\mathbb{R}_+^2\to\mathbb{R}$ admitting a representation of the form
$$\phi(t,w)\;\triangleq\;\psi\big(t,w,z(t,w)\big),$$
where:
  • $\psi:\mathbb{R}_+^3\to\mathbb{R}$;
  • $z(t,w)=F_w(t)=F(t)/F(w)$; and
  • for all fixed choices of t and w satisfying $0<t\leq w<r$, $\psi(t,w,z)$ is monotone and continuously partially differentiable with respect to z over the interval $z\in(0,1]$.
Then
$$\lim_{w\to 0^+}\int_0^w\phi(t,w)\,\mathrm{d}t\;\triangleq\;\lim_{w\to 0^+}\int_0^w\psi\big(t,w,F_w(t)\big)\,\mathrm{d}t\;=\;\lim_{w\to 0^+}\int_0^w\psi\Big(t,w,\big(\tfrac{t}{w}\big)^{\mathrm{ID}_F}\Big)\,\mathrm{d}t,$$
whenever the latter limit exists or diverges to $+\infty$ or $-\infty$.
Proof. 
Since F is assumed to be a smooth growth function, the limit $\mathrm{ID}_F=\lim_{v\to 0^+}\mathrm{ID}_F(v)$ exists and is positive. We present an ‘epsilon-delta’ argument based on this limit. For any real value $\epsilon>0$ satisfying $\epsilon<\min\{r,\mathrm{ID}_F\}$, there must exist a value $0<\delta<\epsilon$ such that $v<\delta$ implies that $|\mathrm{ID}_F(v)-\mathrm{ID}_F|<\epsilon$. Therefore, when $0<t\leq w<\delta$,
$$\big|\ln A_F(t,w)\big|\;=\;\left|\int_t^w\frac{\mathrm{ID}_F-\mathrm{ID}_F(v)}{v}\,\mathrm{d}v\right|\;<\;\epsilon\cdot\int_t^w\frac{1}{v}\,\mathrm{d}v\;=\;\epsilon\cdot\ln\frac{w}{t}.$$
Exponentiating, we obtain the bounds
$$\Big(\frac{w}{t}\Big)^{-\epsilon}\;<\;A_F(t,w)\;<\;\Big(\frac{w}{t}\Big)^{\epsilon}.$$
Applying this bound together with Theorem 2, the ratio $F_w(t)=F(t)/F(w)$ can be seen to satisfy
$$\Big(\frac{t}{w}\Big)^{\mathrm{ID}_F+\epsilon}\;<\;\frac{F(t)}{F(w)}\;=\;A_F(t,w)\cdot\Big(\frac{t}{w}\Big)^{\mathrm{ID}_F}\;<\;\Big(\frac{t}{w}\Big)^{\mathrm{ID}_F-\epsilon}.\qquad\text{(1)}$$
Over the domain of interest $0<t\leq w<\delta$, the assumption that $0<\epsilon<\min\{r,\mathrm{ID}_F\}$ ensures that $0<\frac{t}{w}\leq 1$, and that the upper and lower bounds of Inequality (1) lie in the interval $(0,1]$. Since $\psi(t,w,z)$ has been assumed to be monotone with respect to $z\in(0,1]$, the maximum and minimum attained by ψ over choices of z restricted to any (closed) subinterval of $(0,1]$ must occur at opposite endpoints of the subinterval. With this in mind, for any choice of $\epsilon\in(0,\min\{r,\mathrm{ID}_F\})$, Inequality (1) implies that
$$B_{\min}(t,w,\epsilon)\;\leq\;\psi\big(t,w,F_w(t)\big)\;\leq\;B_{\max}(t,w,\epsilon)\quad\text{and}\quad\int_0^w B_{\min}(t,w,\epsilon)\,\mathrm{d}t\;\leq\;\int_0^w\psi\big(t,w,F_w(t)\big)\,\mathrm{d}t\;\leq\;\int_0^w B_{\max}(t,w,\epsilon)\,\mathrm{d}t,$$
where
$$B_{\min}(t,w,\epsilon)\;\triangleq\;\min\Big\{\psi\big(t,w,\big(\tfrac{t}{w}\big)^{\mathrm{ID}_F-\epsilon}\big),\;\psi\big(t,w,\big(\tfrac{t}{w}\big)^{\mathrm{ID}_F+\epsilon}\big)\Big\},\qquad B_{\max}(t,w,\epsilon)\;\triangleq\;\max\Big\{\psi\big(t,w,\big(\tfrac{t}{w}\big)^{\mathrm{ID}_F-\epsilon}\big),\;\psi\big(t,w,\big(\tfrac{t}{w}\big)^{\mathrm{ID}_F+\epsilon}\big)\Big\}.$$
Since $\psi(t,w,z)$ and $\int_0^w\psi(t,w,z)\,\mathrm{d}t$ are also continuously partially differentiable with respect to z over $z\in(0,1]$,
$$\lim_{\epsilon\to 0^+}B_{\min}(t,w,\epsilon)\;=\;\lim_{\epsilon\to 0^+}B_{\max}(t,w,\epsilon)\;=\;\psi\Big(t,w,\big(\tfrac{t}{w}\big)^{\mathrm{ID}_F}\Big)\quad\text{and}\quad\lim_{\substack{\epsilon\to 0^+\\ w<\epsilon}}\int_0^w B_{\min}(t,w,\epsilon)\,\mathrm{d}t\;=\;\lim_{\substack{\epsilon\to 0^+\\ w<\epsilon}}\int_0^w B_{\max}(t,w,\epsilon)\,\mathrm{d}t\;=\;\lim_{w\to 0^+}\int_0^w\psi\Big(t,w,\big(\tfrac{t}{w}\big)^{\mathrm{ID}_F}\Big)\,\mathrm{d}t.$$
It therefore follows from the squeeze theorem for integrals that
$$\lim_{w\to 0^+}\int_0^w\psi\big(t,w,F_w(t)\big)\,\mathrm{d}t\;=\;\lim_{w\to 0^+}\int_0^w\psi\Big(t,w,\big(\tfrac{t}{w}\big)^{\mathrm{ID}_F}\Big)\,\mathrm{d}t,$$
whenever the right-hand limit exists or diverges. □
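The substitution sanctioned by Lemma 1 can be checked numerically for a concrete smooth growth function. The sketch below (our own illustration, with the hypothetical choice ψ(t, w, z) = -ln z, which is monotone in z on (0, 1]) uses F(t) = t^2 (1 + t), whose LID is 2, and shows the ratio of the two integrals in the lemma approaching 1 as w tends to 0.

```python
# Sketch: F(t) = t^2 (1 + t) is a smooth growth function with ID_F = 2, and
# psi(t, w, z) = -ln(z) is monotone in z on (0, 1].  Lemma 1 asserts that the
# integrals of psi(t, w, F_w(t)) and psi(t, w, (t/w)^ID_F) over [0, w] become
# interchangeable as w -> 0; their ratio approaches 1.
import numpy as np
from scipy.integrate import quad

F = lambda t: t**2 * (1 + t)
ID_F = 2.0

for w in [1.0, 0.1, 0.01, 0.001]:
    lhs, _ = quad(lambda t: -np.log(F(t) / F(w)), 0.0, w)      # psi(t, w, F_w(t))
    rhs, _ = quad(lambda t: -np.log((t / w) ** ID_F), 0.0, w)  # psi(t, w, (t/w)^ID_F)
    print(f"w = {w:7.3f}   ratio = {lhs / rhs:.4f}")
```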
In a manner similar to that of the preceding lemma, the following result allows terms of the form $F_w^{-1}$ (the inverse of $F_w$) to be converted into a term that depends only on the variable of integration, the tail length w, and the local intrinsic dimension $\mathrm{ID}_F$. Here, in order to ensure the existence of the inverse function, F (and by extension $F_w$ and $F_w^{-1}$) must be strictly monotonically increasing over the tail.
Lemma 2.
Let F be a smooth growth function over the interval $[0,r)$. Let us also assume that, over the interval, the monotonicity of F is strict. Consider the function $\phi:\mathbb{R}_+^2\to\mathbb{R}$ admitting a representation of the form
$$\phi(u,w)\;\triangleq\;\psi\big(u,w,z(u,w)\big),$$
where:
  • $\psi:\mathbb{R}_+^3\to\mathbb{R}$;
  • $z(u,w)=F_w^{-1}(u)$ for all $w\in(0,r)$, where $F_w(t)\triangleq F(t)/F(w)$ is restricted to values of t in $[0,w]$; and
  • for all fixed choices of u and w satisfying $u\in[0,1]$ and $0<w<r$, $\psi(u,w,z)$ is monotone and continuously partially differentiable with respect to z over the interval $z\in(0,r)$.
Then
$$\lim_{w\to 0^+}\int_0^1\phi(u,w)\,\mathrm{d}u\;\triangleq\;\lim_{w\to 0^+}\int_0^1\psi\big(u,w,F_w^{-1}(u)\big)\,\mathrm{d}u\;=\;\lim_{w\to 0^+}\int_0^1\psi\big(u,w,w\,u^{\frac{1}{\mathrm{ID}_F}}\big)\,\mathrm{d}u,$$
whenever the latter limit exists or diverges to $+\infty$ or $-\infty$.
Proof. 
First, we note that the strict monotonicity of F implies that for all $u\in[0,1]$ and $w\in(0,r)$, the value $F_w^{-1}(u)$ is uniquely defined when $F_w$ is restricted to $[0,w]$.
As in the proof of Lemma 1, an ‘epsilon-delta’ argument based on the existence of the limit $\mathrm{ID}_F=\lim_{v\to 0^+}\mathrm{ID}_F(v)$ yields the following: for any real value $\epsilon>0$ satisfying $\epsilon<\min\{r,\mathrm{ID}_F\}$, there exists a value $\delta\in(0,\epsilon)$ such that
$$\Big(\frac{t}{w}\Big)^{\mathrm{ID}_F+\epsilon}\;<\;F_w(t)\;=\;\frac{F(t)}{F(w)}\;<\;\Big(\frac{t}{w}\Big)^{\mathrm{ID}_F-\epsilon}$$
holds for all $0<t\leq w<\delta$. Solving for t through exponentiation of the bounds, and then setting $t=F_w^{-1}(u)$, we obtain
$$w\cdot F_w(t)^{\frac{1}{\mathrm{ID}_F-\epsilon}}\;<\;t\;<\;w\cdot F_w(t)^{\frac{1}{\mathrm{ID}_F+\epsilon}}\;\;\Longrightarrow\;\;w\cdot F_w\big(F_w^{-1}(u)\big)^{\frac{1}{\mathrm{ID}_F-\epsilon}}\;<\;F_w^{-1}(u)\;<\;w\cdot F_w\big(F_w^{-1}(u)\big)^{\frac{1}{\mathrm{ID}_F+\epsilon}}\;\;\Longrightarrow\;\;w\,u^{\frac{1}{\mathrm{ID}_F-\epsilon}}\;<\;F_w^{-1}(u)\;<\;w\,u^{\frac{1}{\mathrm{ID}_F+\epsilon}}.$$
The remainder of the proof follows essentially the same path as that of Lemma 1. Over the domain of interest $0<t\leq w<\delta$, the assumption that $0<\epsilon<\min\{r,\mathrm{ID}_F\}$ ensures that $0<\frac{t}{w}\leq 1$, and that $F_w^{-1}(u)$ and its bounding values $w\,u^{\frac{1}{\mathrm{ID}_F\mp\epsilon}}$ lie in the interval $(0,w]$. Since $\psi(u,w,z)$ has been assumed to be monotone with respect to $z\in(0,r)$, the maximum and minimum attained by ψ over choices of z restricted to any (closed) subinterval of $(0,r)$ must occur at opposite endpoints. Therefore, for any choice of $\epsilon\in(0,\min\{r,\mathrm{ID}_F\})$,
$$C_{\min}(u,w,\epsilon)\;\leq\;\psi\big(u,w,F_w^{-1}(u)\big)\;\leq\;C_{\max}(u,w,\epsilon)\quad\text{and}\quad\int_0^1 C_{\min}(u,w,\epsilon)\,\mathrm{d}u\;\leq\;\int_0^1\psi\big(u,w,F_w^{-1}(u)\big)\,\mathrm{d}u\;\leq\;\int_0^1 C_{\max}(u,w,\epsilon)\,\mathrm{d}u,$$
where
$$C_{\min}(u,w,\epsilon)\;\triangleq\;\min\Big\{\psi\big(u,w,w\,u^{\frac{1}{\mathrm{ID}_F-\epsilon}}\big),\;\psi\big(u,w,w\,u^{\frac{1}{\mathrm{ID}_F+\epsilon}}\big)\Big\},\qquad C_{\max}(u,w,\epsilon)\;\triangleq\;\max\Big\{\psi\big(u,w,w\,u^{\frac{1}{\mathrm{ID}_F-\epsilon}}\big),\;\psi\big(u,w,w\,u^{\frac{1}{\mathrm{ID}_F+\epsilon}}\big)\Big\}.$$
Since $\psi(u,w,z)$ is also continuously partially differentiable with respect to z over $z\in(0,r)$,
$$\lim_{\epsilon\to 0^+}C_{\min}(u,w,\epsilon)\;=\;\lim_{\epsilon\to 0^+}C_{\max}(u,w,\epsilon)\;=\;\psi\big(u,w,w\,u^{\frac{1}{\mathrm{ID}_F}}\big)\quad\text{and}\quad\lim_{\substack{\epsilon\to 0^+\\ w<\epsilon}}\int_0^1 C_{\min}(u,w,\epsilon)\,\mathrm{d}u\;=\;\lim_{\substack{\epsilon\to 0^+\\ w<\epsilon}}\int_0^1 C_{\max}(u,w,\epsilon)\,\mathrm{d}u\;=\;\lim_{w\to 0^+}\int_0^1\psi\big(u,w,w\,u^{\frac{1}{\mathrm{ID}_F}}\big)\,\mathrm{d}u.$$
It therefore follows from the squeeze theorem for integrals that
$$\lim_{w\to 0^+}\int_0^1\psi\big(u,w,F_w^{-1}(u)\big)\,\mathrm{d}u\;=\;\lim_{w\to 0^+}\int_0^1\psi\big(u,w,w\,u^{\frac{1}{\mathrm{ID}_F}}\big)\,\mathrm{d}u,$$
whenever the right-hand limit exists or diverges. □
The third lemma facilitates the conversion of a term of the form $F'_w$ into $F_w$, together with a factor that depends only on the variable of integration and $\mathrm{ID}_F$. Since F is assumed to be a smooth growth function, $F_w$ must be smooth as well, and therefore $F_w$ satisfies the conditions of Theorem 1 over $[0,w)$. Hence, $F'_w$ can be substituted by an expression involving $F_w$:
$$F'_w(t)\;=\;\frac{\mathrm{ID}_{F_w}(t)}{t}\cdot F_w(t)\;=\;\frac{\mathrm{ID}_F(t)}{t}\cdot F_w(t).$$
The substitution comes at the cost of introducing a non-constant factor $\mathrm{ID}_F(t)$. The following lemma shows that $\mathrm{ID}_F(t)$ can in turn be substituted by the constant $\mathrm{ID}_F$, provided that certain monotonicity assumptions are satisfied.
Lemma 3.
Let F be a smooth growth function over the interval $[0,r)$. Consider the function $\phi:\mathbb{R}_+^2\to\mathbb{R}$ admitting a representation of the form
$$\phi(t,w)\;\triangleq\;\psi\big(t,w,z(t,w)\big),$$
where:
  • $\psi:\mathbb{R}_+^3\to\mathbb{R}$;
  • $z(t,w)=\mathrm{ID}_F(t)$; and
  • there exists a value $\gamma\in(0,\mathrm{ID}_F)$ such that for all fixed choices of t satisfying $0<t\leq w<r$, $\psi(t,w,z)$ is monotone with respect to z over the interval $z\in(\mathrm{ID}_F-\gamma,\mathrm{ID}_F+\gamma)$.
Then
$$\lim_{w\to 0^+}\int_0^w\phi(t,w)\,\mathrm{d}t\;\triangleq\;\lim_{w\to 0^+}\int_0^w\psi\big(t,w,\mathrm{ID}_F(t)\big)\,\mathrm{d}t\;=\;\lim_{w\to 0^+}\int_0^w\psi\big(t,w,\mathrm{ID}_F\big)\,\mathrm{d}t,$$
whenever the latter limit exists or diverges to $+\infty$ or $-\infty$.
Proof. 
Since F is assumed to be a smooth growth function, the limit $\mathrm{ID}_F=\lim_{v\to 0^+}\mathrm{ID}_F(v)$ exists and is positive. We present an ‘epsilon-delta’ argument based on this limit. For any real value $\epsilon>0$ satisfying $\epsilon<\min\{r,\gamma\}$, there must exist a value $0<\delta<\epsilon$ such that $v<\delta$ implies that $|\mathrm{ID}_F(v)-\mathrm{ID}_F|<\epsilon$.
Since $\psi(t,w,z)$ has been assumed to be monotone with respect to z over the interval $z\in(\mathrm{ID}_F-\gamma,\mathrm{ID}_F+\gamma)$, the restriction $v<\delta<\epsilon<\min\{r,\gamma\}$ ensures that $\psi(t,w,z)$ is monotone over the entire domain of interest $0<t\leq w<\delta$. Therefore, the maximum and minimum attained by ψ over choices of z restricted to any (closed) subinterval of $(\mathrm{ID}_F-\gamma,\mathrm{ID}_F+\gamma)$ must occur at opposite endpoints of the subinterval. As in the proof of Lemma 1,
$$D_{\min}(t,w,\epsilon)\;\leq\;\psi\big(t,w,\mathrm{ID}_F(t)\big)\;\leq\;D_{\max}(t,w,\epsilon)\quad\text{and}\quad\int_0^w D_{\min}(t,w,\epsilon)\,\mathrm{d}t\;\leq\;\int_0^w\psi\big(t,w,\mathrm{ID}_F(t)\big)\,\mathrm{d}t\;\leq\;\int_0^w D_{\max}(t,w,\epsilon)\,\mathrm{d}t,$$
where
$$D_{\min}(t,w,\epsilon)\;\triangleq\;\min\big\{\psi(t,w,\mathrm{ID}_F-\epsilon),\;\psi(t,w,\mathrm{ID}_F+\epsilon)\big\},\qquad D_{\max}(t,w,\epsilon)\;\triangleq\;\max\big\{\psi(t,w,\mathrm{ID}_F-\epsilon),\;\psi(t,w,\mathrm{ID}_F+\epsilon)\big\}.$$
Since $\psi(t,w,z)$ is also continuously partially differentiable with respect to z over the range $(\mathrm{ID}_F-\gamma,\mathrm{ID}_F+\gamma)$,
$$\lim_{\epsilon\to 0^+}D_{\min}(t,w,\epsilon)\;=\;\lim_{\epsilon\to 0^+}D_{\max}(t,w,\epsilon)\;=\;\psi\big(t,w,\mathrm{ID}_F\big)\quad\text{and}\quad\lim_{\substack{\epsilon\to 0^+\\ w<\epsilon}}\int_0^w D_{\min}(t,w,\epsilon)\,\mathrm{d}t\;=\;\lim_{\substack{\epsilon\to 0^+\\ w<\epsilon}}\int_0^w D_{\max}(t,w,\epsilon)\,\mathrm{d}t\;=\;\lim_{w\to 0^+}\int_0^w\psi\big(t,w,\mathrm{ID}_F\big)\,\mathrm{d}t.$$
It therefore follows from the squeeze theorem for integrals that
$$\lim_{w\to 0^+}\int_0^w\psi\big(t,w,\mathrm{ID}_F(t)\big)\,\mathrm{d}t\;=\;\lim_{w\to 0^+}\int_0^w\psi\big(t,w,\mathrm{ID}_F\big)\,\mathrm{d}t,$$
whenever the right-hand limit exists or diverges. □

6. Derivation of the Limits of Tail Measures

In this section, we will see how the three substitution lemmas can be applied to the limits of tail measures of entropy, divergence or distance, so as to produce formulations that depend only on the local intrinsic dimensions of the functions involved. All three lemmas require that the integrand be monotone with respect to small variations in the term that is targeted for substitution. In the discussion, we choose two tail measures as running examples: the tail KL divergence and the second tail Wasserstein distance ($p=2$).

6.1. Handling Derivatives of Smooth Growth Functions

In the case of the tail KL divergence, Theorem 1 allows us to substitute out the first derivatives $F'_w$ and $G'_w$ in favor of the functions $F_w$ and $G_w$:
$$\mathrm{KL}(F;G,w)\;=\;\int_0^w F'_w(t)\,\ln\frac{F'_w(t)}{G'_w(t)}\,\mathrm{d}t\;=\;\int_0^w\frac{\mathrm{ID}_F(t)\,F_w(t)}{t}\,\ln\frac{\mathrm{ID}_F(t)\,F_w(t)}{\mathrm{ID}_G(t)\,G_w(t)}\,\mathrm{d}t.$$

6.2. Substitution of LID Functions by Constants

In the limit of the tail KL divergence, the functions $\mathrm{ID}_F(t)$ and $\mathrm{ID}_G(t)$ can be replaced by the constants $\mathrm{ID}_F$ and $\mathrm{ID}_G$, respectively, through three successive applications of Lemma 3. To verify that the monotonicity condition of the lemma is satisfied, we choose one of the terms and replace it by a new variable, z:
$$\lim_{w\to 0^+}\mathrm{KL}(F;G,w)\;=\;\lim_{w\to 0^+}\int_0^w\frac{z\,F_w(t)}{t}\,\ln\frac{\mathrm{ID}_F(t)\,F_w(t)}{\mathrm{ID}_G(t)\,G_w(t)}\,\mathrm{d}t.$$
For any fixed values of t and w, it is easy to see that the integrand is locally monotone in the vicinity of $z=\mathrm{ID}_F(t)$: if the factor $\ln\frac{\mathrm{ID}_F(t)\,F_w(t)}{\mathrm{ID}_G(t)\,G_w(t)}$ is positive, a small increase in z (above the value $\mathrm{ID}_F(t)$) results in an increase in the value of the integrand, and a small decrease causes the integrand to decrease. If instead the logarithmic factor were negative, an increase in z would result in a decrease in the value of the integrand. Either way, the integrand is monotone in the vicinity of $z=\mathrm{ID}_F(t)$ at each fixed value of t and w. Its monotonicity condition thus being satisfied, Lemma 3 allows the targeted instance of $\mathrm{ID}_F(t)$ to be substituted by $\mathrm{ID}_F$:
$$\lim_{w\to 0^+}\mathrm{KL}(F;G,w)\;=\;\lim_{w\to 0^+}\int_0^w\frac{\mathrm{ID}_F\,F_w(t)}{t}\,\ln\frac{\mathrm{ID}_F(t)\,F_w(t)}{\mathrm{ID}_G(t)\,G_w(t)}\,\mathrm{d}t.$$
Similarly, it can be verified that the new integrand is monotone in each of the remaining two factors $\mathrm{ID}_F(t)$ and $\mathrm{ID}_G(t)$; consequently, they too can be substituted by $\mathrm{ID}_F$ and $\mathrm{ID}_G$, one at a time, to yield
$$\lim_{w\to 0^+}\mathrm{KL}(F;G,w)\;=\;\lim_{w\to 0^+}\int_0^w\frac{\mathrm{ID}_F\,F_w(t)}{t}\,\ln\frac{\mathrm{ID}_F\,F_w(t)}{\mathrm{ID}_G\,G_w(t)}\,\mathrm{d}t.$$

6.3. Elimination of Tail-Conditioned Smooth Growth Functions

Now that the tail KL divergence has been reformulated in terms of the tail-conditioned smooth growth functions $F_w$ and $G_w$, these two functions can be substituted out via three successive applications of Lemma 1, so as to obtain the limit of an integral involving only the variable of integration t and the constants w, $\mathrm{ID}_F$ and $\mathrm{ID}_G$:
$$\lim_{w\to 0^+}\mathrm{KL}(F;G,w)\;=\;\lim_{w\to 0^+}\int_0^w\frac{\mathrm{ID}_F}{t}\Big(\frac{t}{w}\Big)^{\mathrm{ID}_F}\,\ln\!\left(\frac{\mathrm{ID}_F}{\mathrm{ID}_G}\Big(\frac{t}{w}\Big)^{\mathrm{ID}_F-\mathrm{ID}_G}\right)\mathrm{d}t.$$
As in the previous step, in which $\mathrm{ID}_F(t)$ and $\mathrm{ID}_G(t)$ were substituted out, the monotonicity conditions of Lemma 1 can easily be verified.
Now that the integral involves only constants and the variable t, it can be solved straightforwardly using integration by parts, yielding
$$\lim_{w\to 0^+}\mathrm{KL}(F;G,w)\;=\;\lim_{w\to 0^+}\left(\frac{\mathrm{ID}_G}{\mathrm{ID}_F}-\ln\frac{\mathrm{ID}_G}{\mathrm{ID}_F}-1\right)\;=\;\frac{\mathrm{ID}_G}{\mathrm{ID}_F}-\ln\frac{\mathrm{ID}_G}{\mathrm{ID}_F}-1.$$
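The limit just derived can be checked numerically. In the sketch below (our own verification), F and G are smooth growth functions with LID values 2 and 5, each perturbed by a slowly varying factor; the tail KL divergence computed by quadrature approaches ID_G/ID_F - ln(ID_G/ID_F) - 1 as w tends to 0.

```python
# Sketch: F(t) = t^2 (1 + t) and G(t) = t^5 (1 + 2t) have ID_F = 2 and ID_G = 5.
# The tail KL divergence approaches ID_G/ID_F - ln(ID_G/ID_F) - 1 as w -> 0.
import numpy as np
from scipy.integrate import quad

a, b = 2.0, 5.0
F  = lambda t: t**a * (1 + t)
G  = lambda t: t**b * (1 + 2 * t)
dF = lambda t: t**(a - 1) * (a + (a + 1) * t)       # F'(t)
dG = lambda t: t**(b - 1) * (b + 2 * (b + 1) * t)   # G'(t)

limit = b / a - np.log(b / a) - 1.0                 # the Itakura-Saito form
for w in [1.0, 0.1, 0.01, 0.001]:
    fw = lambda t: dF(t) / F(w)                     # tail-conditioned density F'_w
    gw = lambda t: dG(t) / G(w)                     # tail-conditioned density G'_w
    kl, _ = quad(lambda t: fw(t) * np.log(fw(t) / gw(t)), 0.0, w)
    print(f"w = {w:7.3f}   KL(F;G,w) = {kl:.4f}   limit = {limit:.4f}")
```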

6.4. Elimination of the Inverses of Tail-Conditioned Smooth Growth Functions

We now turn our attention to the limit of the tail Wasserstein distance for the case $p=2$. Using Lemma 2, the inverse functions $F_w^{-1}$ and $G_w^{-1}$ can be substituted out, provided that the monotonicity requirements are satisfied. However, immediate application of the lemma to $F_w^{-1}(u)$ or $G_w^{-1}(u)$ does not necessarily work; to see this, consider substituting $F_w^{-1}(u)$ by the new variable z:
$$\mathrm{WD}_2(F;G,w)^2\;=\;\int_0^1\big(F_w^{-1}(u)-G_w^{-1}(u)\big)^2\,\mathrm{d}u\;=\;\int_0^1\big(z-G_w^{-1}(u)\big)^2\,\mathrm{d}u.$$
Clearly, the integrand is not necessarily monotone in z in the vicinity of those values of the integration variable u where $G_w^{-1}(u)=z$.
Instead, we expand the squared difference and apply Lemma 2 to each of the resulting four occurrences of $F_w^{-1}$ and $G_w^{-1}$, one by one. By way of illustration, we consider substitution by z for the factor $F_w^{-1}(u)$ in the cross term:
$$\lim_{w\to 0^+}\mathrm{WD}_2(F;G,w)^2\;=\;\lim_{w\to 0^+}\int_0^1\big(F_w^{-1}(u)-G_w^{-1}(u)\big)^2\,\mathrm{d}u\;=\;\lim_{w\to 0^+}\int_0^1\Big(F_w^{-1}(u)^2-2\,F_w^{-1}(u)\,G_w^{-1}(u)+G_w^{-1}(u)^2\Big)\,\mathrm{d}u\;=\;\lim_{w\to 0^+}\int_0^1\Big(F_w^{-1}(u)^2-2\,z\cdot G_w^{-1}(u)+G_w^{-1}(u)^2\Big)\,\mathrm{d}u.$$
With respect to small variations in the variable z about the value $F_w^{-1}(u)$, and noting that $G_w^{-1}$ is always non-negative, the integrand is easily seen to be monotone in z whenever $G_w^{-1}(u)$ is non-zero: for any increase in z, the value of the integrand decreases, and for any decrease in z, the value of the integrand increases. Lemma 2 can therefore be applied, producing
$$\lim_{w\to 0^+}\mathrm{WD}_2(F;G,w)^2\;=\;\lim_{w\to 0^+}\int_0^1\Big(F_w^{-1}(u)^2-2\,w\,u^{\frac{1}{\mathrm{ID}_F}}\cdot G_w^{-1}(u)+G_w^{-1}(u)^2\Big)\,\mathrm{d}u.$$
After three more applications of Lemma 2, followed by taking the square root of the integral, we obtain
$$\lim_{w\to 0^+}\mathrm{WD}_2(F;G,w)\;=\;\lim_{w\to 0^+}\left(\int_0^1\Big(w^2 u^{\frac{2}{\mathrm{ID}_F}}-2\,w^2 u^{\frac{1}{\mathrm{ID}_F}+\frac{1}{\mathrm{ID}_G}}+w^2 u^{\frac{2}{\mathrm{ID}_G}}\Big)\,\mathrm{d}u\right)^{\!\frac{1}{2}}\;=\;\lim_{w\to 0^+}w\cdot\left(\frac{1}{\frac{2}{\mathrm{ID}_F}+1}-\frac{2}{\frac{1}{\mathrm{ID}_F}+\frac{1}{\mathrm{ID}_G}+1}+\frac{1}{\frac{2}{\mathrm{ID}_G}+1}\right)^{\!\frac{1}{2}}\;=\;0.$$

6.5. Normalization

Even though the limit of the second tail Wasserstein distance is zero and therefore uninteresting, we observe that by normalizing it by the tail length w, we arrive at a more useful result:
$$\lim_{w\to 0^+}\frac{1}{w}\,\mathrm{WD}_2(F;G,w)\;=\;\left(\frac{1}{\frac{2}{\mathrm{ID}_F}+1}-\frac{2}{\frac{1}{\mathrm{ID}_F}+\frac{1}{\mathrm{ID}_G}+1}+\frac{1}{\frac{2}{\mathrm{ID}_G}+1}\right)^{\!\frac{1}{2}}.$$
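The normalized limit above is easy to verify numerically. The sketch below (our own check) compares a quadrature evaluation of (1/w)·WD_2 for exact power-law tails, for which F_w^{-1}(u) = w·u^(1/ID_F) and G_w^{-1}(u) = w·u^(1/ID_G), against the closed form; for exact power laws the two agree for every w, while for general smooth growth functions the agreement holds only in the limit.

```python
# Sketch: with F_w^{-1}(u) = w u^(1/ID_F) and G_w^{-1}(u) = w u^(1/ID_G),
# (1/w) WD_2(F;G,w) equals the closed form derived above.
import numpy as np
from scipy.integrate import quad

a, b = 2.0, 5.0                                   # ID_F and ID_G
integral, _ = quad(lambda u: (u**(1 / a) - u**(1 / b))**2, 0.0, 1.0)
numeric = np.sqrt(integral)
closed  = np.sqrt(1 / (2 / a + 1) - 2 / (1 / a + 1 / b + 1) + 1 / (2 / b + 1))
print(f"quadrature: {numeric:.6f}   closed form: {closed:.6f}")
```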
In general, reweighting by a power of w may be required to expose a relationship between the tail limit of an entropy measure or divergence and an expression in terms of the local intrinsic dimensions of the functions involved. Since local intrinsic dimension is a unitless quantity, in order to establish a non-trivial formulation solely in terms of LID values, any tail measure whose values are not unitless will generally require some form of normalization.

6.6. Summary of Results

Table 1 provides a summary of results. All the results stated in Table 1 can be derived either using the techniques outlined earlier in this section, or through direct substitution of another result in the table. The derivations are outlined in Table 2 (tail entropy variants), Table 3 (tail divergence variants), Table 4 (tail distance variants) and Table 5 (tail Wasserstein distances). Most of these derivations are straightforward; however, for two of the tail measures, some clarifications are required.
Generally speaking, for the normalized tail Wasserstein distances with p non-integer or p odd (Table 5), Lemma 2 cannot be applied, due to the absolute value operation in the integrand. It may happen that the functions $F^{-1}(u)$ and $G^{-1}(u)$ have crossing points for many (possibly even infinitely many) values of u between 0 and 1. At these values of u, $F^{-1}(u)-G^{-1}(u)=0$, and neither $z-G^{-1}(u)$ nor $F^{-1}(u)-y$ would be monotone in the vicinity of $z=F^{-1}(u)$ or $y=G^{-1}(u)$, as the case may be.
For the tail JS divergence (Table 3), the derivation relies on the fact that the LID of the sum (or average) of two non-negative smooth growth functions is the smaller of the two individual LID values. This is an implication of the fact that $\lim_{t\to 0^+}V(t)/W(t)=0$ whenever the smooth growth functions $V(t)$ and $W(t)$ satisfy $0<\mathrm{ID}_W<\mathrm{ID}_V$ (see [84] for more details). Accordingly, if $\mathrm{ID}_F\neq\mathrm{ID}_G$, then the function (F or G) with the smaller LID value must have the same LID value as the average function $M(t)=\frac{F(t)+G(t)}{2}$, and the other function (G or F) must have LID value equal to the maximum of the two. From these observations, the derivation can be seen to hold.
The result for the limit of the tail KL divergence has an interesting interpretation in light of the so-called Itakura–Saito (IS) divergence (or distance) [85]:
$$d_{\mathrm{IS}}(\mathbf{x}\,|\,\mathbf{y})\;=\;\sum_{i=1}^{n}\left(\frac{x_i}{y_i}-\ln\frac{x_i}{y_i}-1\right).$$
As the tail boundary w tends to 0, the tail KL divergence between smooth functions F and G tends to the (univariate) IS divergence between their associated LID values $\mathrm{ID}_G$ and $\mathrm{ID}_F$:
$$\lim_{w\to 0^+}\mathrm{KL}(F;G,w)\;=\;\frac{\mathrm{ID}_G}{\mathrm{ID}_F}-\ln\frac{\mathrm{ID}_G}{\mathrm{ID}_F}-1\;=\;d_{\mathrm{IS}}(\mathrm{ID}_G\,|\,\mathrm{ID}_F).$$
When F and G are interpreted as the CDFs of distance distributions, the shape parameters of the extreme-value-theoretic generalized Pareto distributions (GPDs) that asymptotically characterize their lower tails are known to equal $1/\mathrm{ID}_F$ and $1/\mathrm{ID}_G$, respectively [40]. Since the ratio of these parameters is equal to (the reciprocal of) the ratio of LID values, the tail KL divergence between F and G can also be interpreted as tending to the IS divergence between GPD parameters.
The IS divergence is popular as an objective for the matrix factorization of audio spectra [86], where it assesses the loss of using an entry $y_{i,j}$ to approximate a true entry $x_{i,j}$; more precisely, when approximating a matrix $\mathbf{V}$ by a factorization $\mathbf{W}\mathbf{H}$, the loss is $\sum_{i,j}d_{\mathrm{IS}}\big([\mathbf{V}]_{ij}\,|\,[\mathbf{W}\mathbf{H}]_{ij}\big)$. The IS divergence is a convenient choice for this scenario due to its scale-free property ($d_{\mathrm{IS}}(x\,|\,y)=d_{\mathrm{IS}}(\alpha x\,|\,\alpha y)$ for any $\alpha>0$), which gives the same relative weight to both small and large values of $x_i$ and $y_i$, since they appear only through the ratio $x_i/y_i$. This is important for scenarios such as audio spectra, where the magnitudes of $x_i$ and $y_i$ can vary greatly.
The Itakura–Saito divergence falls into the family of so-called Bregman divergences (or distances) [87], which have a geometric interpretation as the difference between the value of a convex generator function at $\mathbf{x}$ on the one hand, and the value at $\mathbf{x}$ of the hyperplane tangent to the generator at $\mathbf{y}$ on the other. Bregman divergences are a highly expressive family of distances with a wide range of applications [88]. For the IS divergence, the convex generator function is the negative logarithm $-\sum_{i=1}^n\ln x_i$. Interestingly, the KL divergence is also a Bregman divergence, with its convex generator being the negative entropy function $\sum_{i=1}^n x_i\ln x_i$ [89].
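As a small check on the Bregman interpretation above (our own illustration), the Bregman divergence generated by the negative logarithm, B(x, y) = φ(x) - φ(y) - φ'(y)(x - y) with φ(x) = -ln x, reduces exactly to the univariate IS divergence x/y - ln(x/y) - 1.

```python
# Sketch: the Bregman divergence with convex generator phi(x) = -ln(x) is
#   B(x, y) = phi(x) - phi(y) - phi'(y) (x - y) = x/y - ln(x/y) - 1 = d_IS(x | y).
import numpy as np

phi  = lambda x: -np.log(x)
dphi = lambda x: -1.0 / x

bregman = lambda x, y: phi(x) - phi(y) - dphi(y) * (x - y)
d_is    = lambda x, y: x / y - np.log(x / y) - 1.0

for x, y in [(2.0, 5.0), (5.0, 2.0), (3.0, 3.0)]:
    print(f"x = {x}, y = {y}:  Bregman = {bregman(x, y):.6f}   d_IS = {d_is(x, y):.6f}")
```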

7. Extension to Multivariate Distributions

Thus far, our results have focused on a univariate scenario, wherein entropy and divergence variants were shown to be asymptotically equivalent to formulations involving the local intrinsic dimensionalities of smooth distributions of a single random variable. As discussed in Section 3, these results can be applied to distance-based analysis, through characterizations involving the LIDs of local (univariate) distance distributions induced by the overall (global) multivariate distribution. These characterizations are indirect, in that they do not explicitly involve (nor do they require) any knowledge of the underlying global distribution and its parameters. However, characterizations in terms of induced distance distributions may not be entirely satisfying when the nature of the global multivariate distribution is either known or assumed. In this section, we will assume that our domain $S$ is the n-dimensional space $\mathbb{R}^n$ equipped with the Euclidean distance, $d(\mathbf{x},\mathbf{y})=\|\mathbf{x}-\mathbf{y}\|$. Within $S$, we will also assume that we are given a data distribution $\mathcal{D}$ with probability density function $p:\mathbb{R}^n\to\mathbb{R}_{\geq 0}$.

7.1. Multivariate Tail Distributions with Local Spherical Symmetry

Within the Euclidean domain, the challenge is to analyze distributions in terms of the probability measure captured within volumes associated with a distributional tail. However, unlike in univariate distributions, there is no universally accepted notion of ‘distributional tail’ for multivariate distributions. For our purposes, given a distance $r>0$, we define the tail of $\mathcal{D}$ of length r to be the region enclosed by the ball of radius r centered at the origin; that is, $B(r)\triangleq\{\mathbf{x}\in\mathbb{R}^n:\|\mathbf{x}\|\leq r\}$. The boundary of the tail is the $(n-1)$-dimensional surface of $B(r)$, which we denote by $\partial B(r)\triangleq\{\mathbf{x}\in\mathbb{R}^n:\|\mathbf{x}\|=r\}$.
To enable tractable analysis, we will assume that the PDF can be expressed in terms of a locally spherically symmetrical function. One example of where local spherical symmetry can be expected to hold is for a locally isotropic context. This is a common assumption for physical systems, including metals, glasses, fluids and polymers, for which the distribution locally surrounding a particle in the system does not have a directional preference.
Formally, we say that a density function f is locally spherically symmetrical within radius w if, for all $\|\mathbf{x}\|\leq w$, we have $f(\mathbf{x})=f(r)$ for some univariate function f, where $r=\|\mathbf{x}\|$. For f to be locally spherically symmetrical, it suffices that $f(\mathbf{x})$ be equal to $f(\mathbf{y})$ whenever $0\leq\|\mathbf{x}\|=\|\mathbf{y}\|\leq w$. This assumption implies the existence of a function f for which $f(\mathbf{x})=f(r)$, and therefore that f must be constant over all points of the sphere $\partial B(r)$.
The probability measure captured by $B(r)$, which we denote by $F(r)$, is obtained through the integration of f over this ball:
$$F(r)\;\triangleq\;\int_{B(r)}f\;\mathrm{d}B(r).$$
It is not difficult to see that the univariate function F is simply the CDF of the distribution of distances to the origin induced by the global distribution $\mathcal{D}$. If F is differentiable over the tail interval $(0,r]$, then the integral of $F'$ over this interval exists, and equals F:
$$\int_{B(r)}f\;\mathrm{d}B(r)\;=\;F(r)\;=\;\int_0^r F'(t)\,\mathrm{d}t.\qquad\text{(2)}$$
The derivative $F'(r)$ can therefore be interpreted as the PDF of the radial distance distribution as measured from the origin.
For spherically symmetric distributions in Euclidean spaces, the multivariate density and the radial density are related through a factor that depends on the surface area of spheres. The formulae for the volume of a ball in $\mathbb{R}^n$ and for the $(n-1)$-dimensional surface area of its boundary are given by
$$V_n(r)\;\triangleq\;\frac{\pi^{n/2}}{\Gamma\big(\tfrac{n}{2}+1\big)}\,r^{\,n}\qquad\text{and}\qquad S_{n-1}(r)\;\triangleq\;\frac{2\,\pi^{n/2}}{\Gamma\big(\tfrac{n}{2}\big)}\,r^{\,n-1},$$
respectively, where Γ is the usual gamma function, with $\Gamma(n)=(n-1)!$ when n is a positive integer and $\Gamma\big(n+\tfrac{1}{2}\big)=\big(n-\tfrac{1}{2}\big)\big(n-\tfrac{3}{2}\big)\cdots\tfrac{1}{2}\,\sqrt{\pi}$ when n is a non-negative integer. Furthermore, the volume and surface area have a simple relationship that allows for easy conversion between the two:
$$r\cdot S_{n-1}(r)\;=\;n\cdot V_n(r).\qquad\text{(3)}$$
Lemma 4
([90]). Let $\mathbf{X}$ be an n-dimensional random vector that is spherically symmetric with radial distribution $\mathcal{R}$. Then $\mathbf{X}$ has a density $f(\mathbf{x})$ if and only if $\mathcal{R}$ has a density s, in which case
$$s(r)\;=\;f(\mathbf{x})\cdot S_{n-1}(r)\qquad\text{for }\|\mathbf{x}\|=r.$$
If the density f is locally spherically symmetric over $B(r)$ and the associated radial CDF F is a smooth growth function, Equation (2) and Lemma 4 together give us the following relationship between the radial density $F'$ and the multivariate density f:
$$f(\mathbf{x})\;=\;\frac{F'(\|\mathbf{x}\|)}{S_{n-1}(\|\mathbf{x}\|)}$$
whenever $\|\mathbf{x}\|\leq r$. Conditioning the distribution to the ball $B(r)$, the tail distribution PDF becomes
$$f_r(\mathbf{x})\;\triangleq\;\frac{f(\mathbf{x})}{\int_{B(r)}f\;\mathrm{d}B(r)}\;=\;\frac{F'(\|\mathbf{x}\|)}{S_{n-1}(\|\mathbf{x}\|)\cdot F(r)}\;=\;\frac{F'_r(\|\mathbf{x}\|)}{S_{n-1}(\|\mathbf{x}\|)}.$$
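Lemma 4 and the resulting relationship between f and the radial density can be sanity-checked for a concrete spherically symmetric distribution (our own example, assuming SciPy is available): for an n-dimensional standard Gaussian, the radial density is the chi density with n degrees of freedom, and it factors as the multivariate density at radius r times the surface area S_{n-1}(r).

```python
# Sketch (check of Lemma 4): for an n-dimensional standard Gaussian, the radial
# density s(r) is the chi density with n degrees of freedom, and
#   s(r) = f(x) * S_{n-1}(r)   with   f(x) = (2*pi)^(-n/2) * exp(-r^2 / 2),  r = |x|.
import numpy as np
from scipy.stats import chi
from scipy.special import gamma

n = 3
S = lambda r: 2 * np.pi**(n / 2) / gamma(n / 2) * r**(n - 1)   # surface area S_{n-1}(r)
f = lambda r: (2 * np.pi)**(-n / 2) * np.exp(-r**2 / 2)        # multivariate density at radius r

for r in [0.5, 1.0, 2.0]:
    print(f"r = {r}:  chi pdf = {chi(df=n).pdf(r):.6f}   f * S_(n-1) = {f(r) * S(r):.6f}")
```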

7.2. Multivariate Tail Entropy Variants

The aforementioned relationships between multivariate and radial densities can be immediately used to compute the various tail entropies for the locally spherically symmetric multivariate case. Useful background on evaluating radial integrals can be found in Baker [91]. For example, the multivariate Tail Entropy is
H ( f , w ) B ( w ) f w ln f w d B ( w ) = 0 w F w ( t ) S n 1 ( t ) ln F w ( t ) S n 1 ( t ) · S n 1 ( t ) d t = 0 w F w ( t ) ln F w ( t ) S n 1 ( t ) d t .
Although the multivariate formulation of Tail Entropy $H(f,w)$ resembles that of the univariate formulation $H(F,w)$, the two are not identical. Nevertheless, the multivariate formulation can still be simplified using the technical lemmas introduced in Section 5. In much the same way as for the univariate Tail Entropy Power, we can use Theorem 1 together with Lemmas 1 and 3 to determine the limit of $H(f,w)$ as $w$ tends to 0. Replacing $F_w'(t)$ by $\frac{1}{t}\,\mathrm{ID}_F(t)\, F_w(t)$, then $\mathrm{ID}_F(t)$ by $\mathrm{ID}_F$, and finally $F_w(t)$ by $\left(\frac{t}{w}\right)^{\mathrm{ID}_F}$, we obtain
$$\lim_{w\to 0} H(f,w) \;=\; \lim_{w\to 0}\; -\int_0^w F_w'(t)\,\ln\frac{F_w'(t)}{t^{\,n-1}\,S_{n-1}(1)}\,\mathrm{d}t \;=\; \lim_{w\to 0}\; -\int_0^w \frac{\mathrm{ID}_F}{t}\left(\frac{t}{w}\right)^{\mathrm{ID}_F} \ln\!\left(\frac{\mathrm{ID}_F}{t^{\,n}\,S_{n-1}(1)}\left(\frac{t}{w}\right)^{\mathrm{ID}_F}\right)\mathrm{d}t$$
$$=\; \lim_{w\to 0}\; -\int_0^w \frac{\mathrm{ID}_F}{w^{\mathrm{ID}_F}}\, t^{\,\mathrm{ID}_F - 1}\, \ln\!\left(\frac{\mathrm{ID}_F}{w^{\mathrm{ID}_F}\,S_{n-1}(1)}\, t^{\,\mathrm{ID}_F - n}\right)\mathrm{d}t \;=\; \lim_{w\to 0}\; -\int_0^w \frac{\mathrm{ID}_F}{w^{\mathrm{ID}_F}}\, t^{\,\mathrm{ID}_F - 1} \left[\, \ln\frac{\mathrm{ID}_F}{w^{\mathrm{ID}_F}\,S_{n-1}(1)} + (\mathrm{ID}_F - n)\ln t \,\right] \mathrm{d}t.$$
Solving the integral, and then using Equation (3) to convert the surface area factor S n 1 to an expression involving the volume V n , we eventually arrive at
$$\lim_{w\to 0} H(f,w) \;=\; \lim_{w\to 0}\; 1 - \frac{n}{\mathrm{ID}_F} - \ln\frac{\mathrm{ID}_F}{w^{\,n}\,S_{n-1}(1)} \;=\; \lim_{w\to 0}\; 1 - \frac{n}{\mathrm{ID}_F} - \ln\frac{\mathrm{ID}_F}{w\,S_{n-1}(w)} \;=\; \lim_{w\to 0}\; 1 - \frac{n}{\mathrm{ID}_F} - \ln\frac{\mathrm{ID}_F}{n} + \ln V_n(w),$$
which fails to converge to a non-trivial limit even when the Tail Entropy is reweighted by $V_n(w)$ (or, indeed, by any other polynomial in $w$). However, the Tail Entropy Power, when normalized by $V_n(w)$, does converge to a strictly positive value:
$$\lim_{w\to 0} \frac{1}{V_n(w)}\,\mathrm{HP}(f,w) \;\triangleq\; \lim_{w\to 0} \frac{1}{V_n(w)}\,\exp\!\big(H(f,w)\big) \;=\; \lim_{w\to 0} \frac{1}{V_n(w)}\,\exp\!\left(1 - \frac{n}{\mathrm{ID}_F} - \ln\frac{\mathrm{ID}_F}{n} + \ln V_n(w)\right) \;=\; \frac{1}{\varphi}\,\exp\!\left(1 - \frac{1}{\varphi}\right), \quad \text{where } \varphi = \frac{\mathrm{ID}_F}{n}.$$
As one might expect in the n-dimensional Euclidean setting, the (normalized asymptotic) multivariate Tail Entropy Power is maximized whenever ID F , the local intrinsic dimensionality of the associated radial CDF F, is equal to n.
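A brief numerical sketch of this limiting expression (with an arbitrarily chosen grid of ratio values) confirms that $\frac{1}{\varphi}\exp\!\left(1 - \frac{1}{\varphi}\right)$ is maximized at $\varphi = 1$, where it attains the value 1:

```python
# Sketch: the asymptotic Normalized Tail Entropy Power (1/phi) * exp(1 - 1/phi)
# as a function of phi = ID_F / n, evaluated on a grid.
import numpy as np

phi = np.linspace(0.05, 3.0, 6000)
values = (1.0 / phi) * np.exp(1.0 - 1.0 / phi)

print("maximizing phi:", phi[np.argmax(values)])  # close to 1.0
print("maximum value :", values.max())            # close to 1.0
```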

7.3. Multivariate Cumulative Tail Entropy

In the multivariate setting, cumulative entropy is defined in terms of the distributional tail, according to the notion laid out in Section 7.1. In place of the usual probability density $f(x)$, the entropy function is applied to the probability measure associated with the ball centered at the origin with radius $\|x\|$; that is, with
$$\Pr\big[\|X\| \le \|x\|\big] \;\triangleq\; \int_{B(\|x\|)} f \,\mathrm{d}B(\|x\|) \;=\; F(\|x\|).$$
Note that since $F(\|x\|) = F(\|y\|)$ whenever $\|x\| = \|y\|$, the quantity $F(\|x\|)$ is locally spherically symmetric even when the underlying density function $f$ is not.
We can adapt the multivariate formulation of cumulative residual entropy that was originally proposed by Rao [56]. The multivariate Cumulative Tail Entropy, conditioned to a distributional tail of radius $w$, is expressed as a multivariate integral involving $F_w(\|x\|)$, or as a radial integral involving $F_w$, as follows:
$$\mathrm{cH}(f, w) \;\triangleq\; -\int_{x \in B(w)} F_w(\|x\|)\,\ln F_w(\|x\|)\,\mathrm{d}B(w) \;=\; -\int_0^w F_w(t)\,\ln F_w(t)\cdot S_{n-1}(t)\,\mathrm{d}t.$$
As in the treatment of the univariate tail entropies, we can use Lemma 1 to determine the limit of $\mathrm{cH}(f,w)$ as $w$ tends to 0. Replacing $F_w(t)$ by $\left(\frac{t}{w}\right)^{\mathrm{ID}_F}$,
$$\lim_{w\to 0} \mathrm{cH}(f,w) \;=\; \lim_{w\to 0}\; -\int_0^w t^{\,n-1}\,S_{n-1}(1)\, F_w(t)\,\ln F_w(t)\,\mathrm{d}t \;=\; \lim_{w\to 0}\; -\int_0^w t^{\,n-1}\,S_{n-1}(1) \left(\frac{t}{w}\right)^{\mathrm{ID}_F} \ln\!\left(\frac{t}{w}\right)^{\mathrm{ID}_F} \mathrm{d}t \;=\; \lim_{w\to 0}\; -\int_0^w \frac{S_{n-1}(1)\,\mathrm{ID}_F}{w^{\mathrm{ID}_F}}\cdot t^{\,\mathrm{ID}_F + n - 1}\,\big(\ln t - \ln w\big)\,\mathrm{d}t.$$
Solving the integral, and then converting the surface area factor S n 1 to a volume factor V n using Equation (3), we obtain
$$\lim_{w\to 0} \mathrm{cH}(f,w) \;=\; \lim_{w\to 0}\; w\,S_{n-1}(w)\cdot\frac{\mathrm{ID}_F}{(\mathrm{ID}_F + n)^2} \;=\; \lim_{w\to 0}\; V_n(w)\cdot\frac{\varphi}{(\varphi+1)^2}, \quad \text{where } \varphi = \frac{\mathrm{ID}_F}{n}.$$
Although the multivariate Cumulative Tail Entropy vanishes as the tail boundary w tends to zero, when normalized by the tail volume V n ( w ) it converges to a strictly positive value:
$$\lim_{w\to 0} \frac{1}{V_n(w)}\,\mathrm{cH}(f,w) \;=\; \frac{\varphi}{(\varphi+1)^2}.$$
Again, as with the Normalized Tail Entropy Power, the (asymptotic, normalized) multivariate Cumulative Tail Entropy is maximized whenever $\varphi = 1$; that is, when $\mathrm{ID}_F = n$.
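The closed form above can also be checked by direct numerical integration. The sketch below (with arbitrarily chosen $n$, $\mathrm{ID}_F$ and $w$, and assuming the ideal power-law tail $F_w(t) = (t/w)^{\mathrm{ID}_F}$) compares $\mathrm{cH}(f,w)/V_n(w)$ against $\varphi/(\varphi+1)^2$:

```python
# Sketch: numerical check that cH(f, w) / V_n(w) matches phi / (phi + 1)^2
# for an ideal power-law tail F_w(t) = (t / w) ** ID_F.
import math
from scipy.integrate import quad

n, ID_F, w = 5, 3.0, 1.0
phi = ID_F / n

S1 = 2 * math.pi ** (n / 2) / math.gamma(n / 2)                # S_{n-1}(1)
V_n_w = math.pi ** (n / 2) / math.gamma(n / 2 + 1) * w ** n    # V_n(w)

def integrand(t):
    if t <= 0.0:
        return 0.0                                             # limit of the integrand at t = 0
    F_w = (t / w) ** ID_F
    return -F_w * math.log(F_w) * S1 * t ** (n - 1)            # -F_w ln F_w * S_{n-1}(t)

cH, _ = quad(integrand, 0.0, w)
print(cH / V_n_w, phi / (phi + 1) ** 2)                        # the two values agree
```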

7.4. Multivariate Tail Divergences

Several of the tail divergence measures, when considered in the multivariate setting under the assumption of local spherical symmetry, turn out to be identical to those of the radial (univariate) setting. As an example, consider the multivariate Tail KL Divergence, defined as
$$\mathrm{KL}(f;\, g, w) \;\triangleq\; \int_{B(w)} f_w \,\ln\frac{f_w}{g_w}\,\mathrm{d}B(w).$$
Applying Lemma 4 and integrating radially over the tail, we see that
$$\mathrm{KL}(f;\, g, w) \;=\; \int_0^w \frac{F_w'(t)}{S_{n-1}(t)}\,\ln\frac{F_w'(t)/S_{n-1}(t)}{G_w'(t)/S_{n-1}(t)} \cdot S_{n-1}(t)\,\mathrm{d}t \;=\; \int_0^w F_w'(t)\,\ln\frac{F_w'(t)}{G_w'(t)}\,\mathrm{d}t \;=\; \mathrm{KL}(F;\, G, w),$$
the Tail KL Divergence of $F$ and $G$, which (as stated in Table 1) has the limit $\frac{\mathrm{ID}_G}{\mathrm{ID}_F} - \ln\frac{\mathrm{ID}_G}{\mathrm{ID}_F} - 1$ as the tail length $w$ tends to zero.
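A quick numerical illustration of this limit (a sketch under assumptions of our own choosing: ideal power-law tail CDFs $F_w(t) = (t/w)^{\mathrm{ID}_F}$ and $G_w(t) = (t/w)^{\mathrm{ID}_G}$, with arbitrary LID values):

```python
# Sketch: the Tail KL Divergence between ideal power-law tails approaches
# rho - ln(rho) - 1, where rho = ID_G / ID_F.
import math
from scipy.integrate import quad

ID_F, ID_G, w = 4.0, 6.0, 0.01
rho = ID_G / ID_F

def f_w(t):  # density of F_w(t) = (t / w) ** ID_F
    return ID_F * t ** (ID_F - 1) / w ** ID_F

def g_w(t):  # density of G_w(t) = (t / w) ** ID_G
    return ID_G * t ** (ID_G - 1) / w ** ID_G

# start slightly above 0 to avoid the indeterminate 0/0 ratio at the endpoint
kl, _ = quad(lambda t: f_w(t) * math.log(f_w(t) / g_w(t)), 1e-9 * w, w)
print(kl, rho - math.log(rho) - 1)  # the two values agree closely
```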
Similarly, it can easily be seen that the multivariate versions of the JS Divergence, the Hellinger Distance, the $\chi^2$-Divergence and the $\alpha$-Divergence all have radial integral formulations identical to their corresponding univariate versions.

7.5. Observations

The general strategy for deriving these results is essentially the same as for the multivariate Tail Entropy: first use Lemma 4 to convert the multidimensional integral to an integral in one dimension, then use the technical lemmas of Section 5 to simplify the univariate integral as before.
Our results for the locally spherically symmetric multivariate case are shown in Table 6; however, since their derivations greatly resemble those of the analogous univariate cases, we omit the details. Some remarks:
  1. A result for the Wasserstein Distance is not included, since its formulation does not generalize straightforwardly to higher dimensions, unlike the other divergence measures.
  2. The normalizations and weightings used depend only on the tail volume $V_n(w)$ and (for the Tsallis entropy variants) the parameter $q$. This generalizes our earlier univariate results, where normalization was performed with regard to the tail length $w$.
  3. All the multivariate tail variants considered in Table 6 are elegant generalizations of their corresponding univariate formulations, and all explicitly depend either on the ratios between the LIDs and the dimension $n$ of the space ($\varphi = \mathrm{ID}_F/n$ and $\gamma = \mathrm{ID}_G/n$), or on the ratio of two LID values ($\rho = \mathrm{ID}_G/\mathrm{ID}_F = \gamma/\varphi$). Among these, the Normalized Entropy Power and the Normalized Cumulative Entropy are maximized when $\mathrm{ID}_F = n$, which can occur when the tail distribution is uniform. The Varentropy is minimized when $\mathrm{ID}_F = n$, which can occur when the tail distribution is uniform, in which case the variance of the log-likelihood is zero.
  4. As mentioned in Related Work, a number of previous studies in deep learning have found that the local intrinsic dimension of learned representations is lower than the dimension of the full space [32,33,34,35] (i.e., $\mathrm{ID}_F < n$), and that the learning process progressively reduces the local intrinsic dimension. Consider a concrete example where $n = 100$, $\mathrm{ID}_F = 12$, and the learning process reduces $\mathrm{ID}_F$ at a point from 12 to 11 (a small numerical sketch of this example appears after the remarks below). The consequent effect on entropy can be interpreted from two different perspectives, either as an increase in tail distance entropy or a decrease in tail location entropy:
    • Considering univariate normalized entropy power or normalized cumulative entropy (Table 1), reduction of ID F corresponds to an increase in entropy. Here, the entropy is measuring the uncertainty of the univariate random variable modeling distances to nearest neighbors. Thus, reduction of ID F corresponds to an increase in “distance entropy”.
    • Considering multivariate normalized entropy power or multivariate normalized cumulative entropy (Table 6), reduction of $\mathrm{ID}_F$ corresponds to a decrease in entropy. Here, the entropy is measuring the uncertainty of the multivariate random variable modeling locations of nearest neighbors, assuming local spherical symmetry. So reduction of $\mathrm{ID}_F$ corresponds to a decrease in “location entropy”.
We will see a visualization of these scenarios in Section 7.6.
  5. All four of the multivariate tail divergences listed in Table 6, as well as the Hellinger Distance, have radial integral formulations that are identical to their univariate counterparts. All the divergences and distances (including the Weighted L2 Distance) are minimized when $\mathrm{ID}_F = \mathrm{ID}_G$.
  6. By setting $n = 1$, we can recover the univariate results from Table 1. However, note that the range of integration used in Table 6 is a hypersphere of radius $w$, which for $n = 1$ is the interval $[-w, w]$. In contrast, the integral formulations listed in Table 1 were taken over the interval $[0, w]$. For some results, this means a minor (constant factor of 2) difference between Table 1 and the result from Table 6 when $n = 1$.
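The following sketch evaluates the Normalized Tail Entropy Power limits from Table 1 and Table 6 for the $n = 100$, $\mathrm{ID}_F : 12 \to 11$ example from remark 4 (the function names and the choice of entropy measure are illustrative only): the univariate (“distance”) value increases while the multivariate (“location”) value decreases.

```python
# Sketch of remark 4: Normalized Tail Entropy Power limits from Table 1 (univariate,
# distance entropy) and Table 6 (multivariate, location entropy) for n = 100.
import math

def nhp_univariate(id_f: float) -> float:
    """Table 1 limit: (1 / ID_F) * exp(1 - 1 / ID_F)."""
    return (1.0 / id_f) * math.exp(1.0 - 1.0 / id_f)

def nhp_multivariate(id_f: float, n: int) -> float:
    """Table 6 limit: (1 / phi) * exp(1 - 1 / phi), with phi = ID_F / n."""
    phi = id_f / n
    return (1.0 / phi) * math.exp(1.0 - 1.0 / phi)

n = 100
for id_f in (12.0, 11.0):
    print(id_f, nhp_univariate(id_f), nhp_multivariate(id_f, n))
# As ID_F drops from 12 to 11, the univariate value rises (more distance entropy)
# while the multivariate value falls (less location entropy).
```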

7.6. Visualization of Behavior

Our results in Table 6 relate local intrinsic dimensionality to entropies and divergences. When analyzing an $n$-dimensional global distribution such as the standard normal distribution or the uniform distribution, the dimension of every sub-manifold (i.e., the local intrinsic dimensionality $\mathrm{ID}_F$) will be $n$. However, our interest is in situations where the local intrinsic dimensionality differs from the representation dimension $n$. To provide further intuition on this aspect, two plots are shown in Figure 1.
Figure 1a compares the behavior of the normalized entropy power and the normalized cumulative entropy (multiplied by a constant factor of 4) in $n$-dimensional space, as the ratio $\varphi = \mathrm{ID}_F/n$ is varied. We see that these measures have similar trends, and that they are maximized when $\mathrm{ID}_F = n$. We also see that when $1 \le \mathrm{ID}_F < n$, these entropic measures will decrease if $\mathrm{ID}_F$ is decreased (for a fixed $n$). On the other hand, if $n = 1$ and $\mathrm{ID}_F \ge 1$, then these entropic measures will increase if $\mathrm{ID}_F$ is decreased, where $n = 1$ corresponds to the scenario in which we are modeling the uncertainty of a distance distribution. This illustrates remark number 4 from Section 7.5 above.
Figure 1b compares the behavior of different tail divergences as the ratio $\rho = \mathrm{ID}_G/\mathrm{ID}_F$ varies. The divergences shown are the KL divergence, the Jensen–Shannon divergence and the Hellinger distance. These measures have similar trends as $\rho$ varies, and are minimized (and equal to zero) when $\mathrm{ID}_F = \mathrm{ID}_G$. Also, the Hellinger distance is bounded above by 1.
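The curves in Figure 1 can be reproduced directly from the closed-form limits; the sketch below is a reconstruction using matplotlib, not the code used to produce the published figure. Here the Hellinger limit is written in the equivalent form $1 - 2\sqrt{\rho}/(1+\rho)$ implied by the definition used in Table 4.

```python
# Sketch: plotting the closed-form limits behind Figure 1.
import numpy as np
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# (a) entropy measures versus phi = ID_F / n
phi = np.linspace(0.05, 3.0, 400)
ax1.plot(phi, (1 / phi) * np.exp(1 - 1 / phi), label="Normalized Entropy Power")
ax1.plot(phi, 4 * phi / (phi + 1) ** 2, label="4 x Normalized Cumulative Entropy")
ax1.set_xlabel("ID_F / n")
ax1.legend()

# (b) divergences versus rho = ID_G / ID_F
rho = np.linspace(0.05, 3.0, 400)
tau = np.minimum(rho, 1 / rho)
ax2.plot(rho, rho - np.log(rho) - 1, label="KL divergence")
ax2.plot(rho, 0.5 * (tau - np.log(tau) - 1), label="JS divergence")
ax2.plot(rho, 1 - 2 * np.sqrt(rho) / (1 + rho), label="Hellinger distance")
ax2.set_xlabel("ID_G / ID_F")
ax2.legend()

plt.tight_layout()
plt.show()
```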

8. Conclusions

In this theoretical investigation, we have established asymptotic relationships between tail entropy variants, tail divergences and the theory of local intrinsic dimensionality. Our results are derived under the assumption that the distribution(s) under consideration are analyzed in a highly local context, within the distributional tail(s): an asymptotically small neighborhood whose radius approaches zero. These results show that tail entropies and tail divergences depend in a fundamental way on local intrinsic dimensionality, and they help form a theoretical foundation for cross-fertilization between intrinsic dimensionality research and entropy research. As future work, we plan to investigate the potential of these new characterizations in a range of application settings: for example, as a basis in machine learning for characterizing and improving representations and representation learning, and for understanding the behavior of physical systems such as fluids and characterizing their critical transitions in time and space.
Our results from both the univariate and multivariate cases show that the tail entropies and divergences considered in this paper depend only on (i) the embedding (representation) dimension in which the distribution is situated, and (ii) the local intrinsic dimension(s) of the distribution(s). Furthermore, in many cases the dependence involves the ratio between the intrinsic dimension and the embedding dimension.
Consider the context of distance-based analysis, in which a distribution models distances from a central query location to its nearest neighbors, with the distances induced by the global data. In this situation, our characterization of entropy might be termed ‘personalized’, in that the entropy expresses the uncertainty (or complexity) from the perspective of the query, with regard to the distances to samples within an asymptotically small neighborhood. Phrased another way, these local entropies are ‘observer-dependent’, since they are tied to the choice of query (the observer). This can be contrasted with the more common notion of entropy, where one analyzes a global distribution, and there is no requirement of a query point or its local neighborhood.
As alluded to in the introduction, divergences between tail distributions could be used for the comparison of real and synthetic distributions, as is commonly required for generative adversarial networks (GANs). Given a particular query location, we may either: (i) compute the divergence between the univariate tail distance distributions of synthetic and real examples, as measured from the query point; or (ii) compute the divergence between the multivariate tail distributions of synthetic and real examples around the query, under an assumption of local isotropy. Our results show that, under the assumption of local spherical symmetry, the use of divergences (such as KL) between tail distance distributions is asymptotically equivalent to the standard multivariate formulations of the same divergences, when restricted to the neighborhoods around locations of interest. For future work, it will be interesting to consider whether it is possible to further extend our multivariate results to elliptically symmetric distributions or skew-elliptical distributions, such as those studied by Contreras-Reyes [65].
Lastly, our results in Table 1 and Table 6 show theoretical relationships for entropies and divergences, but in practice one must estimate these measures from data samples. A natural approach is to first estimate local intrinsic dimensionality values such as $\mathrm{ID}_F$ and $\mathrm{ID}_G$ using any desired estimator (such as the maximum likelihood estimator [39,40,41]), and then plug the estimated LID values into the desired tail entropy or tail divergence formula. For example, an estimator of the (univariate) Normalized Cumulative Entropy could be obtained by computing $\widehat{\mathrm{ID}}_F / (\widehat{\mathrm{ID}}_F + 1)^2$, where $\widehat{\mathrm{ID}}_F$ is the estimated LID of the distance distribution $F$.
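A minimal sketch of this plug-in strategy (the estimator shown follows the Hill-type maximum likelihood form of [39,40,41]; the data, query and neighborhood size are arbitrary toy choices):

```python
# Sketch: estimate ID_F from k-nearest-neighbor distances with the MLE / Hill-type
# estimator, then plug it into the Table 1 limit ID_F / (ID_F + 1)^2.
import numpy as np

def lid_mle(knn_dists: np.ndarray) -> float:
    """MLE of LID from the distances to the k nearest neighbors of a query."""
    d = np.sort(knn_dists)
    return -1.0 / np.mean(np.log(d[:-1] / d[-1]))   # negative reciprocal of mean log-ratio

def normalized_cumulative_entropy(id_f: float) -> float:
    """Plug-in estimate of the univariate limit from Table 1."""
    return id_f / (id_f + 1) ** 2

rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 8))                   # toy data: 8-dimensional Gaussian
query = np.zeros(8)
knn = np.sort(np.linalg.norm(data - query, axis=1))[:100]

id_hat = lid_mle(knn)
print("estimated LID:", id_hat)                     # roughly the ambient dimension 8
print("plug-in normalized cumulative entropy:", normalized_cumulative_entropy(id_hat))
```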

Author Contributions

Conceptualization, J.B., M.E.H. and X.M.; methodology, J.B., M.E.H. and X.M.; formal analysis, J.B., M.E.H. and X.M.; writing—original draft preparation, J.B., M.E.H. and X.M.; writing—review and editing, J.B., M.E.H. and X.M. All authors have read and agreed to the published version of the manuscript.

Funding

James Bailey acknowledges the support of ARC Discovery Grant DP170102472. Michael E. Houle acknowledges the financial support of JSPS Kakenhi Kiban (B) Research Grant 18H03296.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Basseville, M. Divergence measures for statistical data processing—An annotated bibliography. Signal Process. 2013, 93, 621–633. [Google Scholar] [CrossRef]
  2. Houle, M.E. Local Intrinsic Dimensionality I: An Extreme-Value-Theoretic Foundation for Similarity Applications. In Proceedings of the International Conference on Similarity Search and Applications, Munich, Germany, 4–6 October 2017; pp. 64–79. [Google Scholar]
  3. Bailey, J.; Houle, M.E.; Ma, X. Relationships Between Local Intrinsic Dimensionality and Tail Entropy. In Proceedings of the Similarity Search and Applications—Proc. of the 14th International Conference, SISAP 2021, Dortmund, Germany, 29 September–1 October 2021. [Google Scholar]
  4. Heller, R.; Heller, Y. Multivariate tests of association based on univariate tests. In Advances in Neural Information Processing Systems 29 (NIPS 2016); Lee, D.D., Sugiyama, M., von Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 208–216. [Google Scholar]
  5. Maa, J.; Pearl, D.; Bartoszynski, R. Reducing multidimensional two-sample data to one-dimensional interpoint comparisons. Ann. Stat. 1996, 24, 1069–1074. [Google Scholar] [CrossRef]
  6. Li, A.; Qi, J.; Zhang, R.; Ma, X.; Ramamohanarao, K. Generative image inpainting with submanifold alignment. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, Hong Kong, 10–16 August 2019; pp. 811–817. [Google Scholar]
  7. Camastra, F.; Staiano, A. Intrinsic dimension estimation: Advances and open problems. Inf. Sci. 2016, 328, 26–41. [Google Scholar] [CrossRef]
  8. Campadelli, P.; Casiraghi, E.; Ceruti, C.; Rozza, A. Intrinsic Dimension Estimation: Relevant Techniques and a Benchmark Framework. Math. Probl. Eng. 2015, 2015, 759567. [Google Scholar] [CrossRef]
  9. Verveer, P.J.; Duin, R.P.W. An evaluation of intrinsic dimensionality estimators. IEEE Trans. Pattern Anal. Mach. Intell. 1995, 17, 81–86. [Google Scholar] [CrossRef]
  10. Bruske, J.; Sommer, G. Intrinsic dimensionality estimation with optimally topology preserving maps. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 572–575. [Google Scholar] [CrossRef]
  11. Pettis, K.W.; Bailey, T.A.; Jain, A.K.; Dubes, R.C. An intrinsic dimensionality estimator from near-neighbor information. IEEE Trans. Pattern Anal. Mach. Intell. 1979, 1, 25–37. [Google Scholar] [CrossRef]
  12. Navarro, G.; Paredes, R.; Reyes, N.; Bustos, C. An empirical evaluation of intrinsic dimension estimators. Inf. Syst. 2017, 64, 206–218. [Google Scholar] [CrossRef]
  13. Jolliffe, I.T. Principal Component Analysis; Springer: Berlin/Heidelberg, Germany, 2002. [Google Scholar]
  14. Costa, J.A.; Hero III, A.O. Entropic Graphs for Manifold Learning. In Proceedings of the 37th Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; Volume 1, pp. 316–320. [Google Scholar]
  15. Hein, M.; Audibert, J.Y. Intrinsic dimensionality estimation of submanifolds in Rd. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005; pp. 289–296. [Google Scholar]
  16. Rozza, A.; Lombardi, G.; Rosa, M.; Casiraghi, E.; Campadelli, P. IDEA: Intrinsic Dimension Estimation Algorithm. In Proceedings of the International Conference on Image Analysis and Processing, Ravenna, Italy, 14–16 September 2011; pp. 433–442. [Google Scholar]
  17. Rozza, A.; Lombardi, G.; Ceruti, C.; Casiraghi, E.; Campadelli, P. Novel High Intrinsic Dimensionality Estimators. Mach. Learn. 2012, 89, 37–65. [Google Scholar] [CrossRef]
  18. Ceruti, C.; Bassis, S.; Rozza, A.; Lombardi, G.; Casiraghi, E.; Campadelli, P. DANCo: An intrinsic dimensionality estimator exploiting angle and norm concentration. Pattern Recognit. 2014, 47, 2569–2581. [Google Scholar] [CrossRef]
  19. Facco, E.; d’Errico, M.; Rodriguez, A.; Laio, A. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Sci. Rep. 2017, 7, 12140. [Google Scholar] [CrossRef]
  20. Zhou, S.; Tordesillas, A.; Pouragha, M.; Bailey, J.; Bondell, H. On local intrinsic dimensionality of deformation in complex materials. Nat. Sci. Rep. 2021, 11, 10216. [Google Scholar] [CrossRef]
  21. Tordesillas, A.; Zhou, S.; Bailey, J.; Bondell, H. A representation learning framework for detection and characterization of dead versus strain localization zones from pre- to post- failure. Granul. Matter 2022, 24, 75. [Google Scholar] [CrossRef]
  22. Faranda, D.; Messori, G.; Yiou, P. Dynamical proxies of North Atlantic predictability and extremes. Sci. Rep. 2017, 7, 41278. [Google Scholar] [CrossRef]
  23. Messori, G.; Harnik, N.; Madonna, E.; Lachmy, O.; Faranda, D. A dynamical systems characterization of atmospheric jet regimes. Earth Syst. Dynam. 2021, 12, 233–251. [Google Scholar] [CrossRef]
  24. Kambhatla, N.; Leen, T.K. Dimension Reduction by Local Principal Component Analysis. Neural Comput. 1997, 9, 1493–1516. [Google Scholar] [CrossRef]
  25. Houle, M.E.; Ma, X.; Nett, M.; Oria, V. Dimensional Testing for Multi-Step Similarity Search. In Proceedings of the IEEE 12th International Conference on Data Mining, Brussels, Belgium, 10–13 December 2012; pp. 299–308. [Google Scholar]
  26. Campadelli, P.; Casiraghi, E.; Ceruti, C.; Lombardi, G.; Rozza, A. Local Intrinsic Dimensionality Based Features for Clustering. In Proceedings of the International Conference on Image Analysis and Processing, Naples, Italy, 9–13 September 2013; pp. 41–50. [Google Scholar]
  27. Houle, M.E.; Schubert, E.; Zimek, A. On the correlation between local intrinsic dimensionality and outlierness. In Proceedings of the International Conference on Similarity Search and Applications, Lima, Peru, 7–9 October 2018; pp. 177–191. [Google Scholar]
  28. Carter, K.M.; Raich, R.; Finn, W.G.; Hero, A.O., III. FINE: Fisher Information Non-parametric Embedding. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 2093–2098. [Google Scholar] [CrossRef]
  29. Ma, X.; Li, B.; Wang, Y.; Erfani, S.M.; Wijewickrema, S.N.R.; Schoenebeck, G.; Song, D.; Houle, M.E.; Bailey, J. Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–15. [Google Scholar]
  30. Amsaleg, L.; Bailey, J.; Barbe, D.; Erfani, S.M.; Houle, M.E.; Nguyen, V.; Radovanović, M. The Vulnerability of Learning to Adversarial Perturbation Increases with Intrinsic Dimensionality. In Proceedings of the IEEE Workshop on Information Forensics and Security, Rennes, France, 4–7 December 2017; pp. 1–6. [Google Scholar]
  31. Amsaleg, L.; Bailey, J.; Barbe, A.; Erfani, S.M.; Furon, T.; Houle, M.E.; Radovanović, M.; Nguyen, X.V. High Intrinsic Dimensionality Facilitates Adversarial Attack: Theoretical Evidence. IEEE Trans. Inf. Forensics Secur. 2021, 16, 854–865. [Google Scholar] [CrossRef]
  32. Ma, X.; Wang, Y.; Houle, M.E.; Zhou, S.; Erfani, S.M.; Xia, S.; Wijewickrema, S.N.R.; Bailey, J. Dimensionality-Driven Learning with Noisy Labels. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 3361–3370. [Google Scholar]
  33. Ansuini, A.; Laio, A.; Macke, J.H.; Zoccolan, D. Intrinsic dimension of data representations in deep neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 6111–6122. [Google Scholar]
  34. Pope, P.; Zhu, C.; Abdelkader, A.; Goldblum, M.; Goldstein, T. The intrinsic dimension of images and its impact on learning. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021. [Google Scholar]
  35. Gong, S.; Boddeti, V.N.; Jain, A.K. On the intrinsic dimensionality of image representations. In Proceedings of the CVPR, Long Beach, CA, USA, 5–20 June 2019; pp. 3987–3996. [Google Scholar]
  36. Barua, S.; Ma, X.; Erfani, S.M.; Houle, M.H.; Bailey, J. Quality Evaluation of GANs Using Cross Local Intrinsic Dimensionality. arXiv 2019, arXiv:1905.00643. [Google Scholar]
  37. Romano, S.; Chelly, O.; Nguyen, V.; Bailey, J.; Houle, M.E. Measuring Dependency via Intrinsic Dimensionality. In Proceedings of the ICPR16, Cancun, Mexico, 4–8 December 2016; pp. 1207–1212. [Google Scholar]
  38. Lucarini, V.; Faranda, D.; de Freitas, A.; de Freitas, J.; Holland, M.; Kuna, T.; Nicol, M.; Todd, M.; Vaienti, S. Extremes and Recurrence in Dynamical Systems; Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts; Wiley: Hoboken, NJ, USA, 2016. [Google Scholar]
  39. Levina, E.; Bickel, P.J. Maximum Likelihood Estimation of Intrinsic Dimension. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 13–18 December 2004; pp. 777–784. [Google Scholar]
  40. Amsaleg, L.; Chelly, O.; Furon, T.; Girard, S.; Houle, M.E.; Kawarabayashi, K.; Nett, M. Extreme-Value-Theoretic Estimation of Local Intrinsic Dimensionality. Data Min. Knowl. Discov. 2018, 32, 1768–1805. [Google Scholar] [CrossRef]
  41. Hill, B.M. A Simple General Approach to Inference About the Tail of a Distribution. Ann. Stat. 1975, 3, 1163–1174. [Google Scholar] [CrossRef]
  42. Johnsson, K.; Soneson, C.; Fontes, M. Low bias local intrinsic dimension estimation from expected simplex skewness. IEEE TPAMI 2015, 37, 196–202. [Google Scholar] [CrossRef] [PubMed]
  43. Amsaleg, L.; Chelly, O.; Houle, M.E.; Kawarabayashi, K.; Radovanović, R.; Treeratanajaru, W. Intrinsic dimensionality estimation within tight localities. In Proceedings of the 2019 SIAM International Conference on Data Mining, Calgary, AB, Canada, 2–4 May 2019; pp. 181–189. [Google Scholar]
  44. Farahmand, A.M.; Szepesvári, C.; Audibert, J.Y. Manifold-adaptive dimension estimation. In Proceedings of the 24th International Conference on Machine Learning, Corvalis, OR, USA, 20–24 June 2007; pp. 265–272. [Google Scholar]
  45. Block, A.; Jia, Z.; Polyanskiy, Y.; Rakhlin, A. Intrinsic Dimension Estimation Using Wasserstein Distances. arXiv 2021, arXiv:2106.04018. [Google Scholar]
  46. Thordsen, E.; Schubert, E. ABID: Angle Based Intrinsic Dimensionality—Theory and analysis. Inf. Syst. 2022, 108, 101989. [Google Scholar] [CrossRef]
  47. Carter, K.M.; Raich, R.; Hero III, A.O. On Local Intrinsic Dimension Estimation and Its Applications. IEEE Trans. Signal Process. 2010, 58, 650–663. [Google Scholar] [CrossRef]
  48. Tempczyk, P.; Golinski, A.; Spurek, P.; Tabor, J. LIDL: Local Intrinsic Dimension estimation using approximate Likelihood. In Proceedings of the ICLR 2021 Workshop on Geometrical and Topological Representation Learning, Online, 7 May 2021. [Google Scholar]
  49. Cover, T.M.; Thomas, J.A. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing); Wiley-Interscience: Hoboken, NJ, USA, 2006. [Google Scholar]
  50. Rioul, O. Information Theoretic Proofs of Entropy Power Inequalities. IEEE Trans. Inf. Theory 2011, 57, 33–55. [Google Scholar] [CrossRef]
  51. Jelinek, F.; Mercer, R.L.; Bahl, L.R.; Baker, J.K. Perplexity—A measure of the difficulty of speech recognition tasks. J. Acoust. Soc. Am. 1977, 62, S63. [Google Scholar] [CrossRef]
  52. Jost, L. Entropy and diversity. Oikos 2006, 113, 363–375. [Google Scholar] [CrossRef]
  53. Kostal, L.; Lansky, P.; Pokora, O. Measures of statistical dispersion based on Shannon and Fisher information concepts. Inf. Sci. 2013, 235, 214–223. [Google Scholar] [CrossRef]
  54. Stam, A.J. Some inequalities satisfied by the quantities of information of Fisher and Shannon. Inf. Control. 1959, 2, 101–112. [Google Scholar] [CrossRef]
  55. Di Crescenzo, A.; Longobardi, M. On cumulative entropies. J. Stat. Plan. Inference 2009, 139, 4072–4087. [Google Scholar] [CrossRef]
  56. Rao, M.; Chen, Y.; Vemuri, B.C.; Wang, F. Cumulative residual entropy: A new measure of information. IEEE Trans. Inf. Theory 2004, 50, 1220–1228. [Google Scholar] [CrossRef]
  57. Nguyen, H.V.; Mandros, P.; Vreeken, J. Universal Dependency Analysis. In Proceedings of the 2016 SIAM International Conference on Data Mining, Miami, FL, USA, 5–7 May 2016; pp. 792–800. [Google Scholar] [CrossRef]
  58. Böhm, K.; Keller, F.; Müller, E.; Nguyen, H.V.; Vreeken, J. CMI: An Information-Theoretic Contrast Measure for Enhancing Subspace Cluster and Outlier Detection. In Proceedings of the 13th SIAM International Conference on Data Mining, Austin, TX, USA, 2–4 May 2013; pp. 198–206. [Google Scholar] [CrossRef]
  59. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487. [Google Scholar] [CrossRef]
  60. Calì, C.; Longobardi, M.; Ahmadi, J. Some properties of cumulative Tsallis entropy. Phys. A Stat. Mech. Its Appl. 2017, 486, 1012–1021. [Google Scholar] [CrossRef]
  61. Pele, D.T.; Lazar, E.; Mazurencu-Marinescu-Pele, M. Modeling Expected Shortfall Using Tail Entropy. Entropy 2019, 21, 1204. [Google Scholar] [CrossRef]
  62. MacKay, D.J. Information Theory, Inference, and Learning Algorithms, 1st ed.; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  63. Kac, M.; Kiefer, J.; Wolfowitz, J. On tests of normality and other tests of goodness of fit based on distance methods. Ann. Math. Stat. 1955, 26, 189–211. [Google Scholar] [CrossRef]
  64. Nowozin, S.; Cseke, B.; Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Proceedings of the 30th Annual Conference on Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 271–279. [Google Scholar]
  65. Contreras-Reyes, J. Asymptotic form of the Kullback-Leibler divergence for multivariate asymmetric heavy-tailed distributions. Phys. A Stat. Mech. Its Appl. 2014, 395, 200–208. [Google Scholar] [CrossRef]
  66. Houle, M.E.; Kashima, H.; Nett, M. Generalized Expansion Dimension. In Proceedings of the IEEE 12th International Conference on Data Mining Workshops, Brussels, Belgium, 10 December 2012; pp. 587–594. [Google Scholar]
  67. Karger, D.R.; Ruhl, M. Finding nearest neighbors in growth-restricted metrics. In Proceedings of the 34th ACM Symposium on Theory of Computing, Montreal, QC, Canada, 19–21 May 2002; pp. 741–750. [Google Scholar]
  68. Houle, M.E. Dimensionality, Discriminability, Density and Distance Distributions. In Proceedings of the IEEE 13th International Conference on Data Mining Workshops, Dallas, TX, USA, 7–10 December 2013; pp. 468–473. [Google Scholar]
  69. Karamata, J. Sur un mode de croissance régulière. Théorèmes fondamentaux. Bull. Société Mathématique Fr. 1933, 61, 55–62. [Google Scholar] [CrossRef]
  70. Coles, S.; Bawa, J.; Trenner, L.; Dorazio, P. An Introduction to Statistical Modeling of Extreme Values; Springer: Berlin/Heidelberg, Germany, 2001; Volume 208. [Google Scholar]
  71. Houle, M.E. Local Intrinsic Dimensionality II: Multivariate Analysis and Distributional Support. In Proceedings of the International Conference on Similarity Search and Applications, Munich, Germany, 4–6 October 2017; pp. 80–95. [Google Scholar]
  72. Song, K. Renyi information, log likelihood and an intrinsic distribution measure. J. Statist. Plann. Inference 2001, 93, 51–69. [Google Scholar] [CrossRef]
  73. Buono, F.; Longobardi, M. Varentropy of past lifetimes. arXiv 2020, arXiv:2008.07423. [Google Scholar]
  74. Maadani, S.; Borzadaran, G.R.M.; Roknabadi, A.H.R. Varentropy of order statistics and some stochastic comparisons. Commun. Stat. Theory Methods 2021, 51, 6447–6460. [Google Scholar] [CrossRef]
  75. Raqab, M.Z.; Bayoud, H.A.; Qiu, G. Varentropy of inactivity time of a random variable and its related applications. IMA J. Math. Control. Inf. 2021, 39, 132–154. [Google Scholar] [CrossRef]
  76. Kullback, S.; Leibler, R. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  77. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151. [Google Scholar] [CrossRef]
  78. Basu, A.; Harris, I.R.; Hjort, N.L.; Jones, M.C. Robust and efficient estimation by minimising a density power divergence. Biometrika 1998, 85, 549–559. [Google Scholar] [CrossRef]
  79. Hellinger, E. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J. Für Die Reine Und Angew. Math. 1909, 136, 210–271. [Google Scholar] [CrossRef]
  80. Cichocki, A.; Amari, S. Families of Alpha- Beta- and Gamma- Divergences: Flexible and Robust Measures of Similarities. Entropy 2010, 12, 1532–1568. [Google Scholar] [CrossRef]
  81. Pearson, K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1900, 50, 157–175. [Google Scholar] [CrossRef]
  82. Kantorovich, L.V. Mathematical Methods of Organizing and Planning Production. Manag. Sci. 1939, 6, 366–422. [Google Scholar] [CrossRef]
  83. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017; Precup, D., Teh, Y.W., Eds.; PMLR: Cambridge, MA, USA, 2017; Volume 70, pp. 214–223. [Google Scholar]
  84. Houle, M.E. Local Intrinsic Dimensionality III: Density and Similarity. In Proceedings of the International Conference on Similarity Search and Applications, Copenhagen, Denmark, 30 September–2 October 2020. [Google Scholar]
  85. Itakura, F.; Saito, S. Analysis synthesis telephony based on the maximum likelihood method. In Proceedings of the 6th International Congress on Acoustics, Tokyo, Japan, 21–28 August 1968; pp. C17–C20. [Google Scholar]
  86. Fevotte, C.; Bertin, N.; Durrieu, J. Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis. Neural Comput. 2009, 21, 793–830. [Google Scholar] [CrossRef]
  87. Bregman, L.M. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
  88. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904. [Google Scholar] [CrossRef]
  89. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
  90. Fang, K.W.; Kotz, S.; Wang Ng, K. Symmetric Multivariate and Related Distributions; CRC Press: Boca Raton, FL, USA, 2018. [Google Scholar]
  91. Baker, J.A. Integration of Radial Functions. Math. Mag. 1999, 72, 392–395. [Google Scholar] [CrossRef]
Figure 1. Visualization of selected measures from Table 6: (a) entropy behavior as the ratio $\mathrm{ID}_F/n$ varies; (b) divergence/distance behavior as the ratio $\mathrm{ID}_G/\mathrm{ID}_F$ varies.
Table 1. Asymptotic equivalences between LID formulations and tail measures of entropy or divergence. In each case, the functions $F$ and $G$ are assumed to be smooth growth functions. In addition, for the Normalized Wasserstein Distance, $F$ and $G$ must be strictly monotonically increasing, thereby guaranteeing that the inverses of $F_w$ and $G_w$ exist near zero. In some cases, for the asymptotic limit to exist non-trivially (that is, to be both finite and non-zero), the tail entropy or tail divergence must be normalized by the multiplicative factor $\frac{1}{w}$ or $w$. For the Tail Entropy and Tail Cross Entropy, no reweighting by powers of $w$ can lead to a non-trivial asymptotic limit as $w$ tends to zero.
Tail MeasureFormulationLimit as w 0 +
Entropy H ( F , w ) = 0 w F w ( t ) ln F w ( t ) d t Diverges (no reweighting possible)
Varentropy VarH ( F , w ) = 0 w F w ( t ) ln 2 F w ( t ) d t 0 w F w ( t ) ln F w ( t ) d t 2 1 1 ID F 2
q-Entropy H q ( F , w ) = 1 q 1 0 w F w ( t ) F w ( t ) q d t 1 q 1   if   q < 1 , diverges if q > 1
Normalized Cumulative Entropy 1 w cH ( F , w ) = 1 w 0 w F w ( t ) ln F w ( t ) d t ID F ( ID F + 1 ) 2
Normalized Cumulative q-Entropy 1 w cH q ( F , w ) = 1 w ( q 1 ) 0 w F w ( t ) F w ( t ) q d t ID F ( ID F + 1 ) ( q ID F + 1 )   if    q 1
Normalized Entropy Power 1 w HP ( F , w ) = 1 w exp H ( F , w ) 1 ID F exp 1 1 ID F
Normalized q-Entropy Power 1 w HP q ( F , w ) = 1 w 1 + ( 1 q ) H q ( F , w ) 1 1 q ( ID F ) q q ID F q + 1 1 1 q
if    q 1    and    q ID F q + 1 > 0
Cross Entropy XH ( F ; G , w ) = 0 w F w ( t ) ln G w ( t ) d t Diverges (no reweighting possible)
Normalized Cross Entropy Power 1 w XHP ( F ; G , w ) = 1 w exp 0 w F w ( t ) ln G w ( t ) d t 1 ID G exp ID G 1 ID F
KL Divergence KL ( F ; G , w ) = 0 w F w ( t ) ln F w ( t ) G w ( t ) d t ρ ln ρ 1 ; ρ = ID G ID F
JS Divergence JS ( F ; G , w ) = 1 2 KL F ; F + G 2 , w + KL G ; F + G 2 , w 1 2 τ ln τ 1 ; τ = min { ρ , 1 ρ } ; ρ = ID G ID F
Weighted L2 Distance w L 2 D ( F ; G , w ) = w 0 w F w ( t ) G w ( t ) 2 d t ID F ID G 2 2 ( ID F + ID G 1 ) 1 + 1 ( 2 ID F 1 ) ( 2 ID G 1 )
ID F > 1 2 ; ID G > 1 2
Hellinger Distance HD ( F ; G , w ) = 1 2 0 w F w ( t ) G w ( t ) 2 d t 1 ρ 1 + ρ ; ρ = ID G ID F
χ 2 -Divergence χ 2 D ( F ; G , w ) = 0 w F w ( t ) G w ( t ) 2 G w ( t ) d t 1 ρ 2 ρ ( 2 ρ ) ; ρ = ID G ID F ; ρ < 2
α -Divergence α D ( F ; G , w ) = 1 α ( 1 α ) 0 w α F w ( t ) + ( 1 α ) G w ( t ) 1 α ( 1 α ) 1 1 α ρ α 1 + ( 1 α ) ρ α
F w ( t ) α G w ( t ) 1 α d t ρ = ID G ID F ; α + ρ ( 1 α ) > 0
Normalized Wasserstein Distance 1 w WD p ( F ; G , w ) = 1 w 0 1 F w 1 ( u ) G w 1 ( u ) p d u 1 p p = 2 1 2 ID F + 1 2 1 ID F + 1 ID G + 1 + 1 2 ID G + 1  
p even:  j = 0 p ( 1 ) j j p ( p j ) · ( ID F ) 1 + j · ( ID G ) 1 + 1 1 p  
Table 2. Derivations of asymptotic relationships between tail entropy variants and local intrinsic dimensionality. Each step shows the equivalences between the formulations when $w$ is allowed to tend to zero. In the comments column, for each step of the derivation, the lemmas invoked are stated, as well as any additional assumptions made. If a normalization or other weighting is needed to avoid divergence, or convergence to a constant (independent of $F$), the details are shown in a comment in the final step. In all cases, $F$ is assumed to be a smooth growth function.
Tail MeasureDerivation StepsComments
Entropy H ( F , w ) 0 w F w ( t ) ln F w ( t ) d t
    → 0 w ID F ( t ) F w ( t ) t ln ID F ( t ) F w ( t ) t d t using Theorem 1
    → 0 w ID F F w ( t ) t ln ID F F w ( t ) t d t using Lemma 3
    → 0 w ID F t t w ID F ln ID F t t w ID F d t using Lemma 1
    → 1 1 ID F ln ID F w no reweighting
Varentropy VarH ( F , w ) 0 w F w ( t ) ln 2 F w ( t ) d t 0 w F w ( t ) ln F w ( t ) d t 2
       → 0 w ID F ( t ) F w ( t ) t ln 2 ID F ( t ) F w ( t ) t d t 0 w ID F ( t ) F w ( t ) t ln ID F ( t ) F w ( t ) t d t 2 using Theorem 1
       → 0 w ID F F w ( t ) t ln 2 ID F F w ( t ) t d t 0 w ID F F w ( t ) t ln ID F F w ( t ) t d t 2 using Lemma 3
       → 0 w ID F t t w ID F ln 2 ID F t t w ID F d t 0 w ID F t t w ID F ln ID F t t w ID F d t 2 using Lemma 1
       → 1 1 ID F 2
q-Entropy H q ( F , w ) 1 q 1 0 w F w ( t ) F w ( t ) q d t q > 1
      → 1 q 1 0 w ID F ( t ) F w ( t ) t ID F ( t ) F w ( t ) t q d t using Theorem 1
      → 1 q 1 0 w ID F F w ( t ) t ID F F w ( t ) t q d t using Lemma 3
      → 1 q 1 0 w ID F t t w ID F ID F t t w ID F q d t using Lemma 1
      → 1 q 1 1 1 w q 1 · ID F q q ID F q + 1
Cumulative Entropy cH ( F , w ) 0 w F w ( t ) ln F w ( t ) d t
     → 0 w t w ID F ln t w ID F d t using Lemma 1
     → w ID F ( ID F + 1 ) 2 weight by 1 w
Cumulative q-Entropy cH q ( F , w ) 1 q 1 0 w F w ( t ) F w ( t ) q d t q 1
     → 1 q 1 0 w t w ID F t w q ID F d t using Lemma 1
     → w ID F ( ID F + 1 ) ( q ID F + 1 ) weight by 1 w
Entropy Power HP ( F , w ) exp H ( F , w )
      → exp 1 1 ID F ln ID F w by substitution
      → w 1 ID F exp 1 1 ID F weight by 1 w
q-Entropy Power HP q ( F , w ) 1 + ( 1 q ) H q ( F , w ) 1 1 q q 1
     → 1 + ( 1 q ) · 1 q 1 1 1 w q 1 · ID F q q ID F q + 1 1 1 q by substitution
     → w ID F q q ID F q + 1 1 1 q weight by 1 w
Table 3. Derivations of asymptotic relationships between tail divergences and local intrinsic dimensionality. Each step shows the equivalences between the formulations when w is allowed to tend to zero. In the comments column, for each step of the derivation, the lemmas invoked are stated, as well as any additional assumptions made. If a normalization or weighting is needed, the details are shown in a comment in the final step. In all cases, F and G are assumed to be smooth growth functions.
Tail MeasureDerivation StepsComments
Cross Entropy XH ( F ; G , w ) 0 w F w ( t ) ln G w ( t ) d t
      → 0 w ID F ( t ) F w ( t ) t ln ID G ( t ) G w ( t ) t d t using Theorem 1
      → 0 w ID F F w ( t ) t ln ID G G w ( t ) t d t using Lemma 3
      → 0 w ID F t t w ID F ln ID G t t w ID G d t using Lemma 1
      → ID G 1 ID F ln ID G w no reweighting
Cross Entropy Power XHP ( F ; G , w ) exp XH ( F ; G , w )
        → exp ID G 1 ID F ln ID G w by substitution
        → w 1 ID G exp ID G 1 ID F weight by 1 w
KL Divergence KL ( F ; G , w ) 0 w F w ( t ) ln F w ( t ) G w ( t ) d t
      → 0 w ID F ( t ) F w ( t ) t ln ID F ( t ) F w ( t ) ID G ( t ) G w ( t ) d t using Theorem 1
      → 0 w ID F F w ( t ) t ln ID F F w ( t ) ID G G w ( t ) d t using Lemma 3
      → 0 w ID F t t w ID F ln ID F ID G t w ID F ID G d t using Lemma 1
      → ρ ln ρ 1 ρ = ID G ID F
JS Divergence JS ( F ; G , w ) 1 2 KL ( F ; M , w ) + KL ( G ; M , w ) M ( t ) = 1 2 F ( t ) + G ( t )
      → 1 2 ID M ID F ln ID M ID F 1 + ID M ID G ln ID M ID G 1 ID M = min { ID F , ID G }
      → 1 2 ID M B + ID M ID M ln ID M B ln ID M ID M 2 let B = max { ID F , ID G }
      → 1 2 τ ln τ 1 τ = min ID G ID F , ID F ID G
Table 4. Derivations of asymptotic relationships between tail distances and local intrinsic dimensionality. Each step shows the equivalences between the formulations when w is allowed to tend to zero. In the comments column, for each step of the derivation, the lemmas invoked are stated, as well as any additional assumptions made. For each tail distance, the first step of the derivations shows an expansion by which the monotonicity of each factor can be verified. If a normalization or weighting is needed, the details are shown in a comment in the final step. In all cases, F and G are assumed to be smooth growth functions.
Tail MeasureDerivation StepsComments
L2 Distance L 2 D ( F ; G , w ) 0 w F w ( t ) G w ( t ) 2 d t
       → 0 w ID F ( t ) F w ( t ) t ID G ( t ) G w ( t ) t 2 d t using Theorem 1
       → 0 w ID F F w ( t ) t 2 2 ID F F w ( t ) t · ID G G w ( t ) t + ID G G w ( t ) t 2 d t using Lemma 3
       → 0 w ID F 2 t 2 t w 2 ID F 2 ID F ID G t 2 t w ID F + ID G + ID G 2 t 2 t w 2 ID G d t using Lemma 1
       → 1 w · ID F ID G 2 2 ( ID F + ID G 1 ) 1 + 1 ( 2 ID F 1 ) ( 2 ID G 1 ) weight by w
Hellinger Distance HD ( F ; G , w ) 1 2 0 w F w ( t ) G w ( t ) 2 d t
      → 1 2 0 w ID F ( t ) F w ( t ) t ID G ( t ) G w ( t ) t 2 d t using Theorem 1
      → 1 2 0 w ID F F w ( t ) t 2 ID F F w ( t ) · ID G G w ( t ) t + ID G G w ( t ) t d t using Lemma 3
      → 0 w 1 2 t ID F t w ID F 2 ID F ID G t w ( ID F + ID G ) / 2 + ID G t w ID G d t using Lemma 1
      → 1 ρ 1 + ρ ρ = ID G ID F
χ 2 -Divergence χ 2 D ( F ; G , w ) 0 w F w ( t ) G w ( t ) 2 G w ( t ) d t
       → 0 w ID F ( t ) F w ( t ) t ID G ( t ) G w ( t ) t 2 t ID G ( t ) G w ( t ) d t using Theorem 1
       → 0 w ID F F w ( t ) t 2 2 ID F F w ( t ) t · ID G G w ( t ) t + ID G G w ( t ) t 2 t ID G G w ( t ) d t using Lemma 3
       → 0 w ID F 2 t 2 t w 2 ID F 2 ID F ID G t 2 t w ID F + ID G + ID G 2 t 2 t w 2 ID G t ID G w t ID G d t using Lemma 1
       → 1 ρ 2 ρ ( 2 ρ ) ρ = ID G ID F
α -Divergence α D ( F ; G , w ) 1 α ( 1 α ) 0 w α F w ( t ) + ( 1 α ) G w ( t ) F w ( t ) α G w ( t ) 1 α d t
      → 1 α ( 1 α ) 0 w α ID F ( t ) F w ( t ) t + ( 1 α ) ID G ( t ) G w ( t ) t ID F ( t ) F w ( t ) t α ID G ( t ) G w ( t ) t 1 α d t using Theorem 1
      → 1 α ( 1 α ) 0 w α ID F F w ( t ) t + ( 1 α ) ID G G w ( t ) t ID F F w ( t ) t α ID G G w ( t ) t 1 α d t using Lemma 3
      → 1 α ( 1 α ) 0 w α ID F t t w ID F + ( 1 α ) ID G t t w ID G ( ID F ) α ( ID G ) 1 α t t w α ID F + ( 1 α ) ID G d t using Lemma 1
      → 1 α ( 1 α ) 1 ( ID F ) α ( ID G ) 1 α α ID F + ( 1 α ) ID G
      → 1 α ( 1 α ) 1 1 α ρ α 1 + ( 1 α ) ρ α ρ = ID G ID F
Table 5. Derivations of asymptotic relationships between tail Wasserstein distances and local intrinsic dimensionality. Each step shows the equivalences between the formulations when w is allowed to tend to zero. In the comments column, for each step of the derivation, the lemmas invoked are stated, as well as any additional assumptions made. Normalization details are shown in a comment in the final step. In all cases, F and G are assumed to be invertible smooth growth functions.
Tail MeasureDerivation StepsComments
Wasserstein Distance WD 2 ( F ; G , w ) 0 1 F w 1 ( u ) G w 1 ( u ) 2 d u
       → 0 1 F w 1 ( u ) 2 2 F w 1 ( u ) · G w 1 ( u ) + G w 1 ( u ) 2 d u
p = 2        → 0 1 w 2 u 2 ID F 2 w 2 u 1 ID F + 1 ID G + w 2 u 2 ID G d u using Lemma 2
       → w 1 2 ID F + 1 2 1 ID F + 1 ID G + 1 + 1 2 ID G + 1 weight by 1 w
Wasserstein Distance WD p ( F ; G , w ) 0 1 F w 1 ( u ) G w 1 ( u ) p d u 1 p
p N ,   p even       → 0 1 j = 0 p ( 1 ) j j p F w 1 ( u ) p j G w 1 ( u ) j d u 1 p
       → 0 1 j = 0 p ( 1 ) j j p w u 1 ID F p j w u 1 ID G j d u 1 p using Lemma 2
       → w j = 0 p ( 1 ) j j p ( p j ) · ( ID F ) 1 + j · ( ID G ) 1 + 1 1 p weight by 1 w
Table 6. Asymptotic equivalences between LID formulations and tail measures of entropy or divergence for locally spherically symmetric distributions in the n-dimensional Euclidean setting. In each case, the density functions are assumed to be f and g, and the CDFs F and G of their induced distance distributions are assumed to be smooth growth functions. In the results, V n ( r ) and S n 1 ( r ) denote the volume and surface area of the n-dimensional ball with radius r (respectively). In some cases, for the asymptotic limit to exist non-trivially (that is, to be both finite and non-zero), the tail entropy or tail divergence must be normalized by some multiplicative factor dependent on the tail volume V n ( w ) .
Tail MeasureFormulationLimit as w 0 +
Entropy H ( f , w ) = B ( w ) f w ln f w d B ( w ) = 0 w F w ( t ) ln F w ( t ) S n 1 ( t ) d t Diverges (no reweighting possible)
Varentropy VarH ( f , w ) = B ( w ) f w ln 2 f w d B ( w ) B ( w ) f w ln f w d B ( w ) 2 1 1 φ 2
    = 0 w F w ( t ) ln 2 F w ( t ) S n 1 ( t ) d t 0 w F w ( t ) ln F w ( t ) S n 1 ( t ) d t 2 φ = ID F n
q-Entropy H q ( f , w ) = 1 q 1 B ( w ) f w f w q d B ( w ) 1 q 1   if   q < 1
     = 1 q 1 0 w F w ( t ) F w ( t ) q S n 1 ( t ) q 1 d t diverges if q > 1
Normalized 1 V n ( w ) cH ( f , w ) = 1 V n ( w ) x B ( w ) F w ( x ) ln F w ( x ) d B ( w ) φ ( φ + 1 ) 2
Cumulative Entropy        = 1 V n ( w ) 0 w F w ( t ) ln F w ( t ) · S n 1 ( t ) d t φ = ID F n
Normalized 1 V n ( w ) cH q ( f , w ) = 1 V n ( w ) · 1 q 1 x B ( w ) F w ( x ) F w ( x ) q d B ( w ) φ ( q φ + 1 ) ( φ + 1 )   if    q 1
Cumulative q-Entropy       = 1 V n ( w ) · 1 q 1 0 w F w ( t ) F w ( t ) q · S n 1 ( t ) d t φ = ID F n
Normalized Entropy Power 1 V n ( w ) HP ( f , w ) = 1 V n ( w ) exp H ( f , w ) 1 φ exp 1 1 φ ; φ = ID F n
Normalized q-Entropy Power 1 V n ( w ) HP q ( f , w ) = 1 V n ( w ) 1 + ( 1 q ) H q ( f , w ) 1 1 q φ q q φ q + 1 1 1 q ;    φ = ID F n
if    q 1    and    q φ q + 1 > 0
Cross Entropy XH ( f ; g , w ) = B ( w ) f w ln g w d B ( w ) = 0 w F w ( t ) ln G w ( t ) S n 1 ( t ) d t Diverges (no reweighting possible)
Normalized Cross Entropy Power 1 V n ( w ) XHP ( f ; g , w ) = 1 V n ( w ) exp XH ( f ; g , w ) 1 γ exp γ 1 φ ; φ = ID F n ; γ = ID G n
Weighted V n ( w ) · L 2 D ( f ; g , w ) = V n ( w ) B ( w ) f w g w 2 d B ( w ) φ γ 2 2 ( φ + γ 1 ) 1 + 1 ( 2 φ 1 ) ( 2 γ 1 )
L2 Distance     = V n ( w ) 0 w 1 S n 1 ( t ) F w ( t ) G w ( t ) 2 d t φ = ID F n ; γ = ID G n
ID F > 1 2 ; ID G > 1 2
Hellinger Distance HD ( f ; g , w ) = 1 2 B ( w ) f w g w 2 d B ( w ) 1 ρ 1 + ρ
      = 1 2 0 w F w ( t ) G w ( t ) 2 d t ρ = ID G ID F
χ 2 -Divergence χ 2 D ( f ; g , w ) = B ( w ) f w g w 2 g w d B ( w ) 1 ρ 2 ρ ( 2 ρ )
        = 0 w F w ( t ) G w ( t ) 2 G w ( t ) d t ρ = ID G ID F ; ρ < 2
α -Divergence α D ( f ; g , w ) = 1 α ( 1 α ) B ( w ) α f w + ( 1 α ) g w f w α g w 1 α d B ( w ) 1 α ( 1 α ) 1 1 α ρ α 1 + ( 1 α ) ρ α
      = 1 α ( 1 α ) 0 w α F w ( t ) + ( 1 α ) G w ( t ) ρ = ID G ID F
            F w ( t ) α G w ( t ) 1 α d t Require α + ρ ( 1 α ) > 0
KL Divergence KL ( f ; g , w ) = B ( w ) f w ln f w g w d B ( w ) = 0 w F w ( t ) ln F w ( t ) G w ( t ) d t ρ ln ρ 1 ; ρ = ID G ID F
JS Divergence JS ( f ; g , w ) = 1 2 KL f ; f + g 2 , w + KL g ; f + g 2 , w τ ln τ 1 2 ; τ = min { ρ , 1 ρ } ; ρ = ID G ID F