Article

Diffeological Statistical Models, the Fisher Metric and Probabilistic Mappings

Institute of Mathematics, Czech Academy of Sciences, Zitna 25, 11567 Praha 1, Czech Republic
Mathematics 2020, 8(2), 167; https://doi.org/10.3390/math8020167
Submission received: 20 December 2019 / Revised: 19 January 2020 / Accepted: 21 January 2020 / Published: 30 January 2020
(This article belongs to the Special Issue Geometry and Topology in Statistics)

Abstract

We introduce the notion of a $C^k$-diffeological statistical model, which allows us to apply the theory of diffeological spaces to (possibly singular) statistical models. In particular, we introduce a class of almost 2-integrable $C^k$-diffeological statistical models that encompasses all known statistical models for which the Fisher metric is defined. This class contains a statistical model which does not appear in the Ay–Jost–Lê–Schwachhöfer theory of parametrized measure models. Then, we show that, for any positive integer $k$, the class of almost 2-integrable $C^k$-diffeological statistical models is preserved under probabilistic mappings. Furthermore, the monotonicity theorem for the Fisher metric also holds for this class. As a consequence, the Fisher metric on an almost 2-integrable $C^k$-diffeological statistical model $P \subseteq \mathcal{P}(\mathcal{X})$ is preserved under any probabilistic mapping $T: \mathcal{X} \rightsquigarrow \mathcal{Y}$ that is sufficient w.r.t. $P$. Finally, we extend the Cramér–Rao inequality to the class of 2-integrable $C^k$-diffeological statistical models.

1. Introduction

In mathematical statistics, the notions of a statistical model and of a parameterized statistical model are of central importance [1]. For a measurable space $\mathcal{X}$, let us denote by $\mathcal{P}(\mathcal{X})$ the space of all probability measures on $\mathcal{X}$. According to currently accepted theories, see e.g., [1] and the references therein, a statistical model is a subset $P_{\mathcal{X}} \subseteq \mathcal{P}(\mathcal{X})$, and a parameterized statistical model is a parameter set $\Theta$ together with a mapping $\mathbf{p}: \Theta \to \mathcal{P}(\mathcal{X})$. The image $\mathbf{p}(\Theta) \subseteq \mathcal{P}(\mathcal{X})$ is a statistical model endowed with the parameterization $\mathbf{p}: \Theta \to \mathbf{p}(\Theta)$. If the parameter set $\Theta$ is a smooth manifold, then we can study a statistical model $\mathbf{p}(\Theta)$, endowed with a parameterization $\mathbf{p}: \Theta \to \mathbf{p}(\Theta) \subseteq \mathcal{P}(\mathcal{X})$, by applying differential geometric techniques to $\Theta$ and to smooth mappings $\mathbf{p}: \Theta \to \mathcal{P}(\mathcal{X})$.
This idea lies at the heart of information geometry, a field within mathematical statistics in which (parameterized) statistical models are studied using techniques of differential geometry [2,3,4,5]. In the book “Information Geometry” by Ay, Jost, Lê, and Schwachhöfer, a parameterized statistical model is a triple $(M, \mathcal{X}, \mathbf{p})$ where $M$ is a Banach manifold, $\mathcal{X}$ is a measurable space, and $\mathbf{i} \circ \mathbf{p}: M \xrightarrow{\mathbf{p}} \mathcal{P}(\mathcal{X}) \xrightarrow{\mathbf{i}} \mathcal{S}(\mathcal{X})$ is a $C^1$-map. Here $\mathcal{S}(\mathcal{X})$ is the Banach space of all signed finite measures on $\mathcal{X}$ endowed with the total variation norm $\|\cdot\|_{TV}$, and $\mathbf{i}$ is the natural inclusion. We would like to emphasize that the concept of a parameterized statistical model introduced in [5,6,7] encompasses statistical models endowed with the structure of a finite dimensional manifold [2,3,8], or with the structure of an infinite dimensional Banach manifold [9]. The theory of parameterized measure models, moreover, allows us to study singular statistical models $P_{\mathcal{X}}$ using differential geometric techniques, whenever $P_{\mathcal{X}}$ is endowed with a parameterization by a Banach manifold.
In this study, inspired by the theory of diffeological spaces founded by Souriau and developed further by many people, we shall generalize the concept of a parameterized statistical model to the concept of a $C^k$-diffeological statistical model $P_{\mathcal{X}} \subseteq \mathcal{P}(\mathcal{X})$, which, by definition, is a subset of $\mathcal{P}(\mathcal{X})$ endowed with a compatible $C^k$-diffeology. We shall show that the concept of a $C^k$-diffeological statistical model is more flexible than the concept of a parameterized statistical model. In particular, the image $\mathbf{p}(M)$ of any parameterized statistical model $(M, \mathcal{X}, \mathbf{p})$ carries a natural compatible $C^1$-diffeology. Moreover, for any $k \in \mathbb{N}^+$, any subset of $\mathcal{P}(\mathcal{X})$ can be provided with a compatible $C^k$-diffeology (and hence has the structure of a $C^k$-diffeological statistical model).
Furthermore, not every subset of $\mathcal{P}(\mathcal{X})$ can be written as $\mathbf{p}(M)$ for some parameterized statistical model $(M, \mathcal{X}, \mathbf{p})$. Hence the class of $C^1$-diffeological statistical models is strictly larger than the class of statistical models parameterized by Banach manifolds, as in the Ay–Jost–Lê–Schwachhöfer theory. We also extend conceptually many results of the Ay–Jost–Lê–Schwachhöfer theory concerning the differential geometry of parameterized statistical models and their applications to statistics to the class of $C^k$-diffeological statistical models, using the theory of probabilistic mappings developed in a recent work by Jost, Lê, Luu and Tran [10].
Our paper is organized as follows. In the second section we introduce the notions of $C^k$-diffeological statistical models, almost 2-integrable $C^k$-diffeological statistical models, and 2-integrable $C^k$-diffeological statistical models. In the third section we recall the notion of a probabilistic mapping and related results from [10], and prove that the class of (almost 2-integrable, resp. 2-integrable) $C^k$-diffeological statistical models is preserved under probabilistic mappings (Theorem 1). Then we extend the monotonicity of the Fisher metric on 2-integrable parameterized statistical models to the class of almost 2-integrable $C^k$-diffeological statistical models (Theorem 2). In the last section, we prove a diffeological version of the Cramér–Rao inequality (Theorem 3), which extends previously known versions of the Cramér–Rao inequality in [5,11]. We conclude our paper with a discussion of some future directions and open questions.

2. Almost 2-Integrable Diffeological Statistical Models

Given a statistical model $P \subseteq \mathcal{P}(\mathcal{X})$, which we also denote by $P_{\mathcal{X}}$, it is known that $P_{\mathcal{X}}$ is endowed with a natural geometric structure induced from the Banach space $(\mathcal{S}(\mathcal{X}), \|\cdot\|_{TV})$.
Definition 1.
(cf. [5], Definition 3.2, p. 141) (1) Let $(V, \|\cdot\|)$ be a Banach space, let $X \xrightarrow{\mathbf{i}} V$ be an arbitrary subset, where $\mathbf{i}$ denotes the inclusion, and let $x_0 \in X$. Then $v \in V$ is called a tangent vector of $X$ at $x_0$ if there is a $C^1$-map $c: \mathbb{R} \to X$, i.e., the composition $\mathbf{i} \circ c: \mathbb{R} \to V$ is a $C^1$-map, such that $c(0) = x_0$ and $\dot{c}(0) = v$.
(2) The tangent (double) cone $C_x X$ at a point $x \in X$ is defined as the subset of the tangent space $T_x V = V$ that consists of all tangent vectors of $X$ at $x$. The tangent space $T_x X$ is the linear hull of the tangent cone $C_x X$.
(3) The tangent cone fibration $CX$ (resp. the tangent fibration $TX$) is the union $\bigcup_{x \in X} C_x X$ (resp. $\bigcup_{x \in X} T_x X$), which is a subset of $V \times V$ and is therefore endowed with the topology induced from $V \times V$.
Remark 1.
(1) The notion of a tangent cone in Definition 1 occurs in a similar fashion in the theory of singular spaces; see e.g., [12], §3, [13], §3, [14], p. 166.
(2) Definition 1 differs from [5], Definition 3.1, in that, in Definition 1, the domain of a $C^1$-curve $c$ is $\mathbb{R}$, whereas in [5] the domain of a $C^1$-curve $c$ is $(-\varepsilon, \varepsilon)$. Since $(-\varepsilon, \varepsilon)$ is diffeomorphic to $\mathbb{R}$, the two choices of the domain of $c$ are equivalent.
Example 1.
Let us consider a mixture family $P_{\mathcal{X}}$ of probability measures $p_\eta \mu_0$ on $\mathcal{X}$ dominated by $\mu_0 \in \mathcal{P}(\mathcal{X})$, where the density functions $p_\eta$ are of the following form:
$$p_\eta(x) := g_1(x)\,\eta_1 + g_2(x)\,\eta_2 + g_3(x)\,(1 - \eta_1 - \eta_2) \quad \text{for } x \in \mathcal{X}.$$
Here $g_i$, for $i = 1, 2, 3$, are nonnegative functions on $\mathcal{X}$ such that $E_{\mu_0}(g_i) = 1$, and $\eta = (\eta_1, \eta_2) \in D_b \subseteq \mathbb{R}^2$ is a parameter, which will be specified as follows. Let us divide the square $D = [0,1] \times [0,1] \subseteq \mathbb{R}^2$ into smaller squares and color them black and white as on a chessboard. Let $D_b$ be the closure of the subset of $D$ colored black. If $\eta$ is an interior point of $D_b$, then $C_{p_\eta} P_{\mathcal{X}} = \mathbb{R}^2$. If $\eta$ is a boundary point of $D_b$, then $C_{p_\eta} P_{\mathcal{X}} = \mathbb{R}$. If $\eta$ is a corner point of $D_b$, then $C_{p_\eta} P_{\mathcal{X}}$ consists of two intersecting lines.
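The mixture family above is easy to experiment with numerically. The following is a minimal sketch on a discretised sample space, with hypothetical choices of $g_1, g_2, g_3$, and with $\eta$ restricted to the simplex $\eta_1 + \eta_2 \le 1$ so that nonnegativity of $p_\eta$ is immediate; all helper names are ours, not from the paper.

```python
import numpy as np

# Discretise X = [0, 1] into n atoms with uniform base measure mu_0.
# g1, g2, g3 are nonnegative with E_{mu_0}(g_i) = 1 (hypothetical choices).
n = 1000
x = np.linspace(0.0, 1.0, n)
mu0 = np.full(n, 1.0 / n)                  # uniform probability weights

def normalise(g):
    """Rescale g >= 0 so that E_{mu_0}(g) = 1."""
    return g / np.dot(g, mu0)

g1 = normalise(1.0 + x)
g2 = normalise(2.0 - x)
g3 = normalise(np.ones(n))

def p_eta(eta1, eta2):
    """Mixture density p_eta = g1*eta1 + g2*eta2 + g3*(1 - eta1 - eta2)."""
    return g1 * eta1 + g2 * eta2 + g3 * (1.0 - eta1 - eta2)

# Every eta with eta1, eta2 >= 0 and eta1 + eta2 <= 1 gives a probability
# density with respect to mu_0:
for eta in [(0.2, 0.3), (0.5, 0.5), (0.0, 1.0)]:
    p = p_eta(*eta)
    assert np.all(p >= 0) and abs(np.dot(p, mu0) - 1.0) < 1e-9
```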
  • Let $P_{\mathcal{X}}$ be a statistical model. Then it is known that any $v \in C_\xi P_{\mathcal{X}}$ is dominated by $\xi$. Hence the logarithmic representation of $v$,
$$\log v := dv/d\xi,$$
is an element of $L^1(\mathcal{X}, \xi)$. The set $\{\log v \mid v \in C_\xi P_{\mathcal{X}}\}$ is a subset of $L^1(\mathcal{X}, \xi)$. We denote it by $\log(C_\xi P_{\mathcal{X}})$ and call it the logarithmic representation of $C_\xi P_{\mathcal{X}}$.
  • Next we want to put a Riemannian metric on a statistical model $P_{\mathcal{X}}$, i.e., to put a positive quadratic form $\mathfrak{g}$ on each tangent space $T_\xi P_{\mathcal{X}} \subseteq L^1(\mathcal{X}, \xi)$. The space $L^1(\mathcal{X}, \xi)$ does not have a natural metric, but its subspace $L^2(\mathcal{X}, \xi)$ is a Hilbert space.
Definition 2.
A statistical model $P_{\mathcal{X}}$ will be called almost 2-integrable if
$$\log(C_\xi P_{\mathcal{X}}) \subseteq L^2(\mathcal{X}, \xi)$$
for all $\xi \in P_{\mathcal{X}}$. In this case we define the Fisher metric $\mathfrak{g}$ on $P_{\mathcal{X}}$ as follows. For each $v, w \in C_\xi P_{\mathcal{X}}$,
$$\mathfrak{g}_\xi(v, w) := \langle \log v, \log w \rangle_{L^2(\mathcal{X}, \xi)} = \int_{\mathcal{X}} \log v \cdot \log w \, d\xi. \tag{4}$$
Since $T_\xi P_{\mathcal{X}}$ is the linear hull of $C_\xi P_{\mathcal{X}}$, Formula (4) extends uniquely to a positive quadratic form on $T_\xi P_{\mathcal{X}}$, which is called the Fisher metric.
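On a finite sample space, the defining formula for $\mathfrak{g}$ reduces to a weighted sum, which the following sketch (a finite-dimensional rendering in our own notation) illustrates:

```python
import numpy as np

# Fisher metric on a finite sample space X = {0,...,n-1}: measures are
# weight vectors, a tangent vector v is a signed measure with total mass
# zero dominated by xi, and its logarithmic representation dv/dxi is the
# componentwise quotient v/xi.
def fisher_metric(xi, v, w):
    """g_xi(v, w) = sum over X of (dv/dxi)*(dw/dxi) dxi."""
    return float(np.sum((v / xi) * (w / xi) * xi))

xi = np.array([0.2, 0.3, 0.5])        # a probability measure on 3 atoms
v = np.array([0.1, -0.1, 0.0])        # tangent vectors: total mass zero
w = np.array([0.0, 0.2, -0.2])

assert abs(fisher_metric(xi, v, w) - fisher_metric(xi, w, v)) < 1e-15
assert fisher_metric(xi, v, v) > 0    # positive on nonzero tangent vectors
```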
Example 2.
Let us reconsider Example 1. Recall that our statistical model $P_{\mathcal{X}}$ is parameterized by the map
$$\mathbf{p}: D_b \to \mathcal{S}(\mathcal{X}), \quad \eta \mapsto p_\eta \cdot \mu_0,$$
which is the restriction of the affine map $L: \mathbb{R}^2 \to \mathcal{S}(\mathcal{X})$ defined by the same formula. Hence, any tangent vector $\tilde{v} \in T_{\mathbf{p}(\eta)} P_{\mathcal{X}}$ can be written as $\tilde{v} = d\mathbf{p}(v)$, where $v \in T_\eta D_b$. For $v = (v_1, v_2) \in T_\eta D_b$, we have $d\mathbf{p}(v) = [(g_1 - g_3)v_1 + (g_2 - g_3)v_2]\,\mu_0$. If $g_i(x) > 0$ for all $x \in \mathcal{X}$ and $i = 1, 2, 3$, then $p_\eta(x) > 0$ for all $x \in \mathcal{X}$ and all $\eta \in D_b$. Therefore
$$\log d\mathbf{p}(v)\big|_{\mathbf{p}(\eta)} = \frac{d\,\mathbf{p}(v)}{d(p_\eta \mu_0)} = \frac{(g_1 - g_3)v_1 + (g_2 - g_3)v_2}{p_\eta} \in L^1(\mathcal{X}, \mathbf{p}(\eta)).$$
Hence $P_{\mathcal{X}}$ is almost 2-integrable if
$$\frac{g_1 - g_3}{\sqrt{p_\eta}},\ \frac{g_2 - g_3}{\sqrt{p_\eta}} \in L^2(\mathcal{X}, \mu_0) \quad \text{for all } \eta \in D_b.$$
In this case we have
$$\mathfrak{g}\big|_{\mathbf{p}(\eta)}\big(d\mathbf{p}(v), d\mathbf{p}(w)\big) = \big\langle \log d\mathbf{p}(v), \log d\mathbf{p}(w) \big\rangle_{L^2(\mathcal{X}, \mathbf{p}(\eta))}.$$
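The Fisher matrix of the mixture family can be computed numerically from the formula above: $\mathfrak{g}_{ab}(\eta) = \int (g_a - g_3)(g_b - g_3)/p_\eta \, d\mu_0$ for $a, b \in \{1, 2\}$. A sketch on a discretised $\mathcal{X} = [0,1]$, with hypothetical densities chosen so that the two score directions are linearly independent:

```python
import numpy as np

# Fisher matrix of the mixture family on a discretised X = [0, 1]:
# g_{ab}(eta) = \int (g_a - g_3)(g_b - g_3) / p_eta  d mu_0.
n = 2000
x = np.linspace(0.0, 1.0, n)
mu0 = np.full(n, 1.0 / n)

def normalise(g):
    return g / np.dot(g, mu0)

g1, g2, g3 = normalise(1.0 + x), normalise(np.exp(x)), np.ones(n)

def fisher_matrix(eta1, eta2):
    p = g1 * eta1 + g2 * eta2 + g3 * (1.0 - eta1 - eta2)
    basis = [g1 - g3, g2 - g3]            # the two score directions
    return np.array([[np.dot(u * v / p, mu0) for v in basis] for u in basis])

G = fisher_matrix(0.3, 0.4)
assert np.allclose(G, G.T)                # symmetric
assert np.all(np.linalg.eigvalsh(G) > 0)  # positive definite here
```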
Next we shall introduce the notion of a $C^k$-diffeological statistical model.
Definition 3.
For $k \in \mathbb{N}^+$ and a nonempty set $X$, a $C^k$-diffeology of $X$ is a set $\mathcal{D}$ of mappings $\mathbf{p}: U \to X$, where $U$ is an open domain in $\mathbb{R}^n$ and $n$ runs over the nonnegative integers, such that the following three axioms are satisfied.
D1. Covering. The set $\mathcal{D}$ contains the constant mappings $\mathbf{x}: r \mapsto x$, defined on $\mathbb{R}^n$, for all $x \in X$ and for all $n \in \mathbb{N}$.
D2. Locality. Let $\mathbf{p}: U \to X$ be a mapping. If for every point $r \in U$ there exists an open neighborhood $V$ of $r$ such that $\mathbf{p}|_V$ belongs to $\mathcal{D}$, then the map $\mathbf{p}$ belongs to $\mathcal{D}$.
D3. Smooth compatibility. For every element $\mathbf{p}: U \to X$ of $\mathcal{D}$, for every open domain $V$ in some $\mathbb{R}^m$, and for every $\psi \in C^k(V, U)$, the composition $\mathbf{p} \circ \psi$ belongs to $\mathcal{D}$.
A $C^k$-diffeological space is a nonempty set equipped with a $C^k$-diffeology $\mathcal{D}$. Elements $\mathbf{p}: U \to X$ of $\mathcal{D}$ will be called $C^k$-maps from $U$ to $X$.
A statistical model $P_{\mathcal{X}}$ endowed with a $C^k$-diffeology $\mathcal{D}_{\mathcal{X}}$ will be called a $C^k$-diffeological statistical model if, for any map $\mathbf{p}: U \to P_{\mathcal{X}}$ in $\mathcal{D}_{\mathcal{X}}$, the composition $\mathbf{i} \circ \mathbf{p}: U \to \mathcal{S}(\mathcal{X})$ is a $C^k$-map.
Remark 2.
(1) In [14], Iglesias-Zemmour considered only $C^\infty$-diffeologies. The notion of a $C^k$-diffeology, as given in Definition 3, is a straightforward adaptation of the concept of a smooth diffeology given in [14], §1.5.
(2) As $(\mathcal{S}(\mathcal{X}), \|\cdot\|_{TV})$ is a Banach space, by [15], Lemma 3.11, p. 30, a compatible $C^\infty$-diffeology on a statistical model $P_{\mathcal{X}}$ is determined by its smooth curves $c: \mathbb{R} \to P_{\mathcal{X}}$.
(3) Given a $C^k$-diffeological statistical model $(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ and $\xi \in P_{\mathcal{X}}$, the tangent cone $C_\xi(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ is the subset of $C_\xi P_{\mathcal{X}}$ consisting of the tangent vectors $\dot{c}(0)$ of $C^k$-curves $c: \mathbb{R} \to P_{\mathcal{X}}$ in $\mathcal{D}_{\mathcal{X}}$ such that $c(0) = \xi$. Similarly, the tangent space $T_\xi(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ is the linear hull of $C_\xi(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$.
(4) Let $(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ be a $C^k$-diffeological statistical model and $V$ a locally convex vector space. A map $\varphi: P_{\mathcal{X}} \to V$ is called Gateaux-differentiable on $(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ if for any $C^k$-curve $c: \mathbb{R} \to P_{\mathcal{X}}$ in $\mathcal{D}_{\mathcal{X}}$ the composition $\varphi \circ c: \mathbb{R} \to V$ is differentiable. We recommend [15] for differential calculus on locally convex vector spaces.
Example 3.
(1) Let $(M, \mathcal{X}, \mathbf{p})$ be a parametrized statistical model. Then $(\mathbf{p}(M), \mathcal{D}_{\mathcal{X}})$ is a $C^1$-diffeological statistical model, where $\mathcal{D}_{\mathcal{X}}$ consists of all $C^1$-maps $q: U \to \mathbf{p}(M)$, with $U$ an open domain in some $\mathbb{R}^n$, such that there exists a $C^1$-map $\psi_M: U \to M$ with $q = \mathbf{p} \circ \psi_M$.
(2) Let $P_{\mathcal{X}}$ be a statistical model. Then $P_{\mathcal{X}}$ can be endowed with the structure of a $C^k$-diffeological statistical model for any $k \in \mathbb{N}^+$, where its diffeology $\mathcal{D}_{\mathcal{X}}(k)$ consists of all mappings $\mathbf{p}: U \to P_{\mathcal{X}}$ such that the composition $\mathbf{i} \circ \mathbf{p}: U \to \mathcal{S}(\mathcal{X})$ is of class $C^k$, where $U$ is any open domain in $\mathbb{R}^n$ for $n \in \mathbb{N}$.
(3) Let $\mathcal{X}$ be the closed interval $[0,1]$. Let $P_{\mathcal{X}} := \{ f \cdot \mu_0 \mid f \in C(\mathcal{X}),\ \int_{\mathcal{X}} f \, d\mu_0 = 1,\ f(x) > 0 \text{ for all } x \in \mathcal{X} \}$. We claim that there does not exist a parameterized statistical model $(M, \mathcal{X}, \mathbf{p})$ such that $P_{\mathcal{X}} = \mathbf{p}(M)$. Assume the opposite, i.e., there is a $C^1$-map $\mathbf{p}: M \to \mathcal{S}(\mathcal{X})$ such that $\mathbf{p}(M) = P_{\mathcal{X}}$. Then for any $m \in M$ we would have $d\mathbf{p}(T_m M) = T_{\mathbf{p}(m)} P_{\mathcal{X}} = \{ f \in C(\mathcal{X}) \mid \int_{\mathcal{X}} f \, d\mu_0 = 0 \}$. However, this is not the case, as it is known that the space $C([0,1])$ cannot be the image of a bounded linear map from a Banach space $M$ to $L^1([0,1])$; see e.g., [16], p. 1434.
Definition 4.
A $C^k$-diffeological statistical model $(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ will be called almost 2-integrable if $\log(C_\xi(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})) \subseteq L^2(\mathcal{X}, \xi)$ for all $\xi \in P_{\mathcal{X}}$.
An almost 2-integrable $C^k$-diffeological statistical model $(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ will be called 2-integrable if, for any $C^k$-map $\mathbf{p}: U \to P_{\mathcal{X}}$ in $\mathcal{D}_{\mathcal{X}}$, the function $v \mapsto |d\mathbf{p}(v)|_{\mathfrak{g}}$ is continuous on $TU$.
Example 4.
(1) By [5], Theorem 3.2, p. 155, a parameterized statistical model $(M, \mathcal{X}, \mathbf{p})$ is 2-integrable if and only if $(\mathbf{p}(M), \mathbf{p}(\mathcal{D}_M))$ is a 2-integrable $C^1$-diffeological statistical model.
(2) The $C^1$-diffeological statistical model $(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}}(1))$ in Example 3(3) is 2-integrable, though there is no parameterized statistical model $(M, \mathcal{X}, \mathbf{p})$ such that $\mathbf{p}(M) = P_{\mathcal{X}}$.
(3) Let $\mathcal{X}$ be a measurable space and $\lambda$ a σ-finite measure on $\mathcal{X}$. In [17], p. 274, Friedrich considered the family $\mathcal{P}(\lambda) := \{ \mu \in \mathcal{P}(\mathcal{X}) \mid \mu \ll \lambda \}$, endowed with the following diffeology $\mathcal{D}(\lambda)$: a curve $c: \mathbb{R} \to \mathcal{P}(\lambda)$ is a $C^1$-curve if
$$\log \dot{c}(t) \in L^2(\mathcal{X}, c(t)).$$
Hence $(\mathcal{P}(\lambda), \mathcal{D}(\lambda))$ is an almost 2-integrable $C^1$-diffeological statistical model.
Remark 3.
The axiomatics of Espaces différentiels, which later became diffeological spaces, were introduced by J.-M. Souriau at the beginning of the 1980s [18]. Diffeology is a variant of the theory of differentiable spaces, introduced and developed a few years earlier by K. T. Chen [19]. Having worked with a different theory of smooth structures on singular spaces [12,13], I appreciate the elegance of diffeology for its consistent and simple treatment of smooth structures on (possibly infinite dimensional) singular spaces. The best source for diffeology is the monograph by P. Iglesias-Zemmour [14].

3. Probabilistic Mappings

In 1962, Lawvere proposed a categorical approach to probability theory in which morphisms are Markov kernels; most importantly, he supplied the space $\mathcal{P}(\mathcal{X})$ with a natural σ-algebra $\Sigma_w$, making the notion of a Markov kernel, and hence many constructions in probability theory and mathematical statistics, functorial.
Let us recall the definition of $\Sigma_w$. Given a measurable space $\mathcal{X}$, let $F_s(\mathcal{X})$ denote the linear space of simple functions on $\mathcal{X}$. Recall that $\mathcal{S}(\mathcal{X})$ is the space of all signed finite measures on $\mathcal{X}$. There is a natural homomorphism $I: F_s(\mathcal{X}) \to \mathcal{S}(\mathcal{X})' := \mathrm{Hom}(\mathcal{S}(\mathcal{X}), \mathbb{R})$, $f \mapsto I_f$, defined by integration: $I_f(\mu) := \int_{\mathcal{X}} f \, d\mu$ for $f \in F_s(\mathcal{X})$ and $\mu \in \mathcal{S}(\mathcal{X})$. Following Lawvere [20], we define $\Sigma_w$ to be the smallest σ-algebra on $\mathcal{S}(\mathcal{X})$ such that $I_f$ is measurable for all $f \in F_s(\mathcal{X})$. Let $\mathcal{M}(\mathcal{X})$ denote the space of all finite nonnegative measures on $\mathcal{X}$. We also denote by $\Sigma_w$ the restriction of $\Sigma_w$ to $\mathcal{M}(\mathcal{X})$, to $\mathcal{M}^*(\mathcal{X}) := \mathcal{M}(\mathcal{X}) \setminus \{0\}$, and to $\mathcal{P}(\mathcal{X})$.
  • For a topological space $\mathcal{X}$ we shall consider the natural Borel σ-algebra $\mathcal{B}(\mathcal{X})$. Then every continuous function is measurable w.r.t. $\mathcal{B}(\mathcal{X})$. If $\mathcal{X}$ is, moreover, a metric space, then $\mathcal{B}(\mathcal{X})$ is the smallest σ-algebra making every continuous function measurable ([21], Lemma 2.13).
  • Let $C_b(\mathcal{X})$ be the space of bounded continuous functions on a topological space $\mathcal{X}$. We denote by $\tau_v$ the smallest topology on $\mathcal{S}(\mathcal{X})$ such that for any $f \in C_b(\mathcal{X})$ the map $I_f: (\mathcal{S}(\mathcal{X}), \tau_v) \to \mathbb{R}$ is continuous. We also denote by $\tau_v$ the restriction of $\tau_v$ to $\mathcal{M}(\mathcal{X})$ and $\mathcal{P}(\mathcal{X})$; on $\mathcal{P}(\mathcal{X})$ this is the weak topology, which generates the weak convergence of probability measures. It is known that $(\mathcal{P}(\mathcal{X}), \tau_v)$ is separable and metrizable if and only if $\mathcal{X}$ is ([21], Theorem 3.1.4, p. 104). If $\mathcal{X}$ is separable and metrizable, then the Borel σ-algebra on $\mathcal{P}(\mathcal{X})$ generated by $\tau_v$ coincides with $\Sigma_w$.
Definition 5.
([10], Definition 2.4) A probabilistic mapping (or an arrow) from a measurable space $\mathcal{X}$ to a measurable space $\mathcal{Y}$ is a measurable mapping from $\mathcal{X}$ to $(\mathcal{P}(\mathcal{Y}), \Sigma_w)$.
We shall denote by $\overline{T}: \mathcal{X} \to (\mathcal{P}(\mathcal{Y}), \Sigma_w)$ the measurable mapping defining/generating a probabilistic mapping $T: \mathcal{X} \rightsquigarrow \mathcal{Y}$. Similarly, for a measurable mapping $\mathbf{p}: \mathcal{X} \to \mathcal{P}(\mathcal{Y})$ we shall denote by $\underline{\mathbf{p}}: \mathcal{X} \rightsquigarrow \mathcal{Y}$ the generated probabilistic mapping. Note that a probabilistic mapping is denoted by a curved arrow and a measurable mapping by a straight arrow.
Example 5.
([10], Example 2.6) (1) Assume that $\mathcal{X}$ is separable and metrizable. Then the identity mapping $Id_{\mathcal{P}}: (\mathcal{P}(\mathcal{X}), \tau_v) \to (\mathcal{P}(\mathcal{X}), \tau_v)$ is continuous, and hence measurable w.r.t. the Borel σ-algebra $\Sigma_w = \mathcal{B}(\tau_v)$. Consequently, $Id_{\mathcal{P}}$ generates a probabilistic mapping $ev: (\mathcal{P}(\mathcal{X}), \mathcal{B}(\tau_v)) \rightsquigarrow (\mathcal{X}, \mathcal{B}(\mathcal{X}))$, and we write $\overline{ev} = Id_{\mathcal{P}}$. Similarly, for any measurable space $\mathcal{X}$, we also have an arrow (a probabilistic mapping) $ev: (\mathcal{P}(\mathcal{X}), \Sigma_w) \rightsquigarrow \mathcal{X}$ generated by the measurable mapping $\overline{ev} = Id_{\mathcal{P}}$.
(2) Let $\delta_x$ denote the Dirac measure concentrated at $x$. It is known that the map $\delta: \mathcal{X} \to (\mathcal{P}(\mathcal{X}), \Sigma_w)$, $x \mapsto \delta(x) := \delta_x$, is measurable [22]. If $\mathcal{X}$ is a topological space, then the map $\delta: \mathcal{X} \to (\mathcal{P}(\mathcal{X}), \tau_v)$ is continuous, since the composition $I_f \circ \delta: \mathcal{X} \to \mathbb{R}$ is continuous for any $f \in C_b(\mathcal{X})$. Hence, if $\kappa: \mathcal{X} \to \mathcal{Y}$ is a measurable mapping between measurable spaces (resp. a continuous mapping between separable metrizable spaces), then the map $\overline{\kappa} := \delta \circ \kappa: \mathcal{X} \to \mathcal{P}(\mathcal{Y})$ is a measurable mapping (resp. a continuous mapping). We regard $\kappa$ as a probabilistic mapping defined by $\delta \circ \kappa: \mathcal{X} \to \mathcal{P}(\mathcal{Y})$. In particular, the identity mapping $Id: \mathcal{X} \to \mathcal{X}$ of a measurable space $\mathcal{X}$ is a probabilistic mapping generated by $\delta: \mathcal{X} \to \mathcal{P}(\mathcal{X})$. Graphically speaking, any straight arrow (a measurable mapping) $\kappa: \mathcal{X} \to \mathcal{Y}$ between measurable spaces can be seen as a curved arrow (a probabilistic mapping).
Given a probabilistic mapping $T: \mathcal{X} \rightsquigarrow \mathcal{Y}$, we define a linear map $S(T): \mathcal{S}(\mathcal{X}) \to \mathcal{S}(\mathcal{Y})$, called a Markov morphism, as follows ([2], Lemma 5.9, p. 72):
$$S(T)(\mu)(B) := \int_{\mathcal{X}} \overline{T}(x)(B) \, d\mu(x)$$
for any $\mu \in \mathcal{S}(\mathcal{X})$ and $B \in \Sigma_{\mathcal{Y}}$.
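On finite spaces a probabilistic mapping is simply a row-stochastic matrix, and $S(T)$ is matrix multiplication. The sketch below (our own finite-dimensional rendering) also checks the adjunction between $S(T)$ and the pullback $T^*$ on functions that appears later in this section:

```python
import numpy as np

# On finite spaces a probabilistic mapping T: X ~> Y is a row-stochastic
# matrix K: K[x, :] is the probability measure T_bar(x) on Y.  The Markov
# morphism S(T) pushes a (signed) measure mu on X forward to mu @ K,
# and the pullback T^* acts on functions f on Y by K @ f.
K = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])          # |X| = 3, |Y| = 2

mu = np.array([0.3, 0.3, 0.4])      # probability measure on X
nu = mu @ K                         # S(T)(mu), a measure on Y

f = np.array([1.0, -1.0])           # bounded function on Y
# The defining adjunction: int T^*(f) d mu = int f d S(T)(mu).
assert abs(np.dot(K @ f, mu) - np.dot(f, nu)) < 1e-12
assert abs(nu.sum() - 1.0) < 1e-12  # P(T) maps P(X) to P(Y)
```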
Proposition 1.
Assume that $T: \mathcal{X} \rightsquigarrow \mathcal{Y}$ is a probabilistic mapping.
(1) Then $T$ induces a bounded linear map $S(T): \mathcal{S}(\mathcal{X}) \to \mathcal{S}(\mathcal{Y})$ w.r.t. the total variation norm $\|\cdot\|_{TV}$. The restriction $M(T)$ of $S(T)$ to $\mathcal{M}(\mathcal{X})$ (resp. $P(T)$ of $S(T)$ to $\mathcal{P}(\mathcal{X})$) maps $\mathcal{M}(\mathcal{X})$ to $\mathcal{M}(\mathcal{Y})$ (resp. $\mathcal{P}(\mathcal{X})$ to $\mathcal{P}(\mathcal{Y})$).
(2) Probabilistic mappings are morphisms in the category of measurable spaces; i.e., for any probabilistic mappings $T_1: \mathcal{X} \rightsquigarrow \mathcal{Y}$ and $T_2: \mathcal{Y} \rightsquigarrow \mathcal{Z}$, we have
$$M(T_2 \circ T_1) = M(T_2) \circ M(T_1), \qquad P(T_2 \circ T_1) = P(T_2) \circ P(T_1).$$
(3) $M$ and $P$ are faithful functors.
(4) If $\nu \ll \mu \in \mathcal{M}(\mathcal{X})$, then $M(T)(\nu) \ll M(T)(\mu)$.
Remark 4.
The first assertion of Proposition 1 is due to Chentsov [2], Lemma 5.9, p. 72. The second assertion has been proven in [10], Theorem 2.14 (1), extending Giry’s result in [22]. The third assertion has been proven in [10]. The last assertion of Proposition 1 is due to Morse–Sacksteder [23], Proposition 5.1.
We shall also denote the map $S(T)$ by $T_*$ if no confusion can arise.
Given a probabilistic mapping $T: \mathcal{X} \rightsquigarrow \mathcal{Y}$ and a $C^k$-diffeological statistical model $(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$, we define a $C^k$-diffeological space $(T_*(P_{\mathcal{X}}), T_*(\mathcal{D}_{\mathcal{X}}))$ as the image of $\mathcal{D}_{\mathcal{X}}$ by $T_*$ ([14], §1.43, p. 24). In other words, a mapping $\mathbf{p}: U \to T_*(P_{\mathcal{X}})$ belongs to $T_*(\mathcal{D}_{\mathcal{X}})$ if and only if it satisfies the following condition: for every $r \in U$ there exists an open neighborhood $V \subseteq U$ of $r$ such that either $\mathbf{p}|_V$ is a constant mapping, or there exists a mapping $\mathbf{q}: V \to P_{\mathcal{X}}$ in $\mathcal{D}_{\mathcal{X}}$ such that $\mathbf{p}|_V = T_* \circ \mathbf{q}$.
Theorem 1.
Let $T: \mathcal{X} \rightsquigarrow \mathcal{Y}$ be a probabilistic mapping and $(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ a $C^k$-diffeological statistical model.
(1) Then $(T_*(P_{\mathcal{X}}), T_*(\mathcal{D}_{\mathcal{X}}))$ is a $C^k$-diffeological statistical model.
(2) If $(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ is an almost 2-integrable $C^k$-diffeological statistical model, then so is $(T_*(P_{\mathcal{X}}), T_*(\mathcal{D}_{\mathcal{X}}))$.
(3) If $(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ is a 2-integrable $C^k$-diffeological statistical model, then so is $(T_*(P_{\mathcal{X}}), T_*(\mathcal{D}_{\mathcal{X}}))$.
Proof. 
(1) The first assertion is straightforward, since $T_*: \mathcal{S}(\mathcal{X}) \to \mathcal{S}(\mathcal{Y})$ is a bounded linear map by Proposition 1(1).
(2) Assume that $(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ is an almost 2-integrable $C^k$-diffeological statistical model and $v \in C_\xi(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$. Then there exists a $C^k$-map $c: \mathbb{R} \to P_{\mathcal{X}}$ in $\mathcal{D}_{\mathcal{X}}$ such that $c(0) = \xi$ and $\dot{c}(0) = v$. Since $T_*: \mathcal{S}(\mathcal{X}) \to \mathcal{S}(\mathcal{Y})$ is a bounded linear map,
$$\frac{d}{dt}\Big|_{t=0} T_* \circ c = T_*(v).$$
By the monotonicity theorem ([5], Corollary 5.1, p. 260), we have
$$\left\| \frac{d\,T_*(v)}{d\,T_*(\xi)} \right\|_{L^2(\mathcal{Y}, T_*(\xi))} \le \left\| \frac{dv}{d\xi} \right\|_{L^2(\mathcal{X}, \xi)}. \tag{8}$$
This proves that $(T_*(P_{\mathcal{X}}), T_*(\mathcal{D}_{\mathcal{X}}))$ is almost 2-integrable.
(3) Assume that $(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ is a 2-integrable $C^k$-diffeological statistical model. Let $c': \mathbb{R} \to T_*(P_{\mathcal{X}})$ be an element of $T_*(\mathcal{D}_{\mathcal{X}})$. Then $c' = T_* \circ c$, where $c: \mathbb{R} \to P_{\mathcal{X}}$ is an element of $\mathcal{D}_{\mathcal{X}}$, i.e., $\mathbf{i} \circ c: \mathbb{R} \to \mathcal{S}(\mathcal{X})$ is of class $C^k$ and $(\mathbb{R}, \mathcal{X}, c)$ is a 2-integrable parameterized statistical model. By [5], Theorem 5.4, p. 264, $(\mathbb{R}, \mathcal{Y}, T_* \circ c)$ is a 2-integrable parameterized statistical model. Combined with the first assertion of Theorem 1, this proves the last assertion of Theorem 1. □
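The monotonicity inequality used in step (2) can be checked numerically on finite spaces, where it is the data-processing property of the quadratic form $\sum_x v_x^2/\xi_x$. A sketch in our own notation:

```python
import numpy as np

# Monotonicity on finite spaces: pushing forward along a Markov kernel K
# never increases the Fisher quadratic form g_xi(v, v) = sum(v**2 / xi).
rng = np.random.default_rng(0)

def fisher_sq(xi, v):
    return float(np.sum(v**2 / xi))

for _ in range(100):
    K = rng.random((4, 3))
    K /= K.sum(axis=1, keepdims=True)      # row-stochastic kernel X ~> Y
    xi = rng.random(4); xi /= xi.sum()     # probability measure on X
    v = rng.random(4); v -= v.mean()       # tangent vector: total mass 0
    assert fisher_sq(xi @ K, v @ K) <= fisher_sq(xi, v) + 1e-12
```

The inequality follows from the Cauchy–Schwarz inequality applied fibrewise, which is exactly the mechanism behind the monotonicity theorem cited above.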
Denote by $L^\infty(\mathcal{X})$ the space of bounded measurable functions on a measurable space $\mathcal{X}$. Given a probabilistic mapping $T: \mathcal{X} \rightsquigarrow \mathcal{Y}$, we define a linear map $T^*: L^\infty(\mathcal{Y}) \to L^\infty(\mathcal{X})$ as follows ([10], (2.2)):
$$T^*(f)(x) := I_f(\overline{T}(x)) = \int_{\mathcal{Y}} f \, d\overline{T}(x), \tag{9}$$
which coincides with the classical formula (5.1) in [2], p. 66, for the transformation of a bounded measurable function $f$ under a Markov morphism (i.e., a probabilistic mapping) $T$. In particular, if $\kappa: \mathcal{X} \to \mathcal{Y}$ is a measurable mapping, then we have $\kappa^*(f)(x) = f(\kappa(x))$, since $\overline{\kappa} = \delta \circ \kappa$.
Definition 6.
([10], Definition 2.22, cf. [23]) Let $P_{\mathcal{X}} \subseteq \mathcal{P}(\mathcal{X})$ and $P_{\mathcal{Y}} \subseteq \mathcal{P}(\mathcal{Y})$. A probabilistic mapping $T: \mathcal{X} \rightsquigarrow \mathcal{Y}$ will be called sufficient for $P_{\mathcal{X}}$ if there exists a probabilistic mapping $\underline{\mathbf{p}}: \mathcal{Y} \rightsquigarrow \mathcal{X}$ such that for all $\mu \in P_{\mathcal{X}}$ and $h \in L^\infty(\mathcal{X})$ we have
$$T_*(h\mu) = \underline{\mathbf{p}}^*(h) \, T_*(\mu), \quad \text{i.e.,} \quad \underline{\mathbf{p}}^*(h) = \frac{d\,T_*(h\mu)}{d\,T_*(\mu)} \in L^1(\mathcal{Y}, T_*(\mu)). \tag{10}$$
In this case we shall call the measurable mapping $\mathbf{p}: \mathcal{Y} \to \mathcal{P}(\mathcal{X})$ defining the probabilistic mapping $\underline{\mathbf{p}}: \mathcal{Y} \rightsquigarrow \mathcal{X}$ a conditional mapping for $T$.
Example 6.
Assume that $\kappa: \mathcal{X} \to \mathcal{Y}$ is a measurable mapping (i.e., a statistic) which is a probabilistic mapping sufficient for $P_{\mathcal{X}} \subseteq \mathcal{P}(\mathcal{X})$. Let $\mathbf{p}: \mathcal{Y} \to \mathcal{P}(\mathcal{X})$, $y \mapsto p_y$, be a conditional mapping for $\kappa$. By (9), $\underline{\mathbf{p}}^*(1_A)(y) = p_y(A)$, and we rewrite (10) as follows:
$$p_y(A) = \frac{d\,\kappa_*(1_A \mu)}{d\,\kappa_* \mu} \in L^1(\mathcal{Y}, \kappa_*(\mu)). \tag{11}$$
The RHS of (11) is the conditional measure of $\mu$ applied to $A$ w.r.t. the measurable mapping $\kappa$. The equality (11) implies that this conditional measure is regular and independent of $\mu$. Thus the notion of sufficiency of a measurable mapping $\kappa$ for $P_{\mathcal{X}}$ coincides with the classical notion of sufficiency of $\kappa$ for $P_{\mathcal{X}}$; see e.g., [2], p. 28, [24], Definition 2.8, p. 85. We also note that the equality in (11) is understood as an equality of equivalence classes in $L^1(\mathcal{Y}, \kappa_*(\mu))$, and hence every statistic $\kappa'$ that coincides with a sufficient statistic $\kappa$ outside a set of zero $\mu$-measure, for all $\mu \in P_{\mathcal{X}}$, is also a sufficient statistic for $P_{\mathcal{X}}$.
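For a concrete instance of this classical notion of sufficiency, the following sketch verifies that for two i.i.d. Bernoulli draws the conditional measure $p_y$ determined by the sum statistic is independent of the parameter; the example and all names are ours, not from the paper.

```python
import numpy as np
from itertools import product

# Sufficiency of the statistic kappa(x1, x2) = x1 + x2 for two i.i.d.
# Bernoulli(theta) draws: the conditional measure
# p_y(A) = d kappa_*(1_A mu) / d kappa_*(mu)  on Y = {0, 1, 2}
# is the same for every theta, as Example 6 requires.
X = list(product([0, 1], repeat=2))          # four atoms of X = {0,1}^2

def conditional(theta, A):
    """Return the vector (p_y(A))_{y=0,1,2} under mu_theta."""
    mu = {x: theta ** sum(x) * (1 - theta) ** (2 - sum(x)) for x in X}
    result = []
    for y in (0, 1, 2):
        fibre = [x for x in X if sum(x) == y]
        total = sum(mu[x] for x in fibre)    # kappa_*(mu)({y})
        result.append(sum(mu[x] for x in fibre if x in A) / total)
    return np.array(result)

A = {(0, 1), (1, 1)}                         # event "second draw equals 1"
assert np.allclose(conditional(0.3, A), conditional(0.7, A))
```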
Example 7.
(cf. [2], Lemma 2.8, p. 28) Assume that $\mu \in \mathcal{P}(\mathcal{X})$ has a regular conditional distribution w.r.t. a statistic $\kappa: \mathcal{X} \to \mathcal{Y}$; i.e., there exists a measurable mapping $\mathbf{p}: \mathcal{Y} \to \mathcal{P}(\mathcal{X})$, $y \mapsto p_y$, such that
$$E_\mu^{\sigma(\kappa)}(1_A \mid y) = p_y(A)$$
for any $A \in \Sigma_{\mathcal{X}}$ and $y \in \mathcal{Y}$. Let $\Theta$ be a set and $P := \{\nu_\theta \in \mathcal{P}(\mathcal{X}) \mid \theta \in \Theta\}$ a parameterized family of probability measures dominated by $\mu$. If there exists a function $h: \mathcal{Y} \times \Theta \to \mathbb{R}$ such that for all $\theta \in \Theta$ we have
$$\nu_\theta = h(\kappa(x), \theta) \, \mu, \tag{13}$$
then $\kappa$ is sufficient for $P$, since, for any $\theta \in \Theta$,
$$\mathbf{p}^*(1_A) = \frac{d\,\kappa_*(1_A \nu_\theta)}{d\,\kappa_* \nu_\theta}$$
does not depend on $\theta$. Condition (13) is the Fisher–Neyman sufficiency condition for a family of dominated measures.
Example 8.
Let $\kappa: \mathcal{X} \to \mathcal{Y}$ be a measurable 1-1 mapping. Then for any statistical model $P_{\mathcal{X}} \subseteq \mathcal{P}(\mathcal{X})$, the statistic $\kappa$ is sufficient w.r.t. $P_{\mathcal{X}}$, since, for any $A \in \Sigma_{\mathcal{X}}$ and any $\mu \in P_{\mathcal{X}}$, we have
$$\frac{d\,\kappa_*(1_A \mu)}{d\,\kappa_* \mu} = (\kappa^{-1})^*(1_A) \in L^1(\mathcal{Y}, \kappa_*(\mu)).$$
Next, we shall show that probabilistic mappings do not increase the Fisher metric on almost 2-integrable $C^k$-diffeological statistical models. Thus the Fisher metric serves as an “information quantity” of almost 2-integrable $C^k$-diffeological statistical models.
Theorem 2.
Let $T: \mathcal{X} \rightsquigarrow \mathcal{Y}$ be a probabilistic mapping and $(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ an almost 2-integrable $C^k$-diffeological statistical model. Then for any $\mu \in P_{\mathcal{X}}$ and any $v \in T_\mu(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$, we have
$$\mathfrak{g}_\mu(v, v) \ge \mathfrak{g}_{T_*\mu}(T_*v, T_*v),$$
with equality if $T$ is sufficient w.r.t. $P_{\mathcal{X}}$.
Proof. 
The monotonicity assertion of Theorem 2 follows from (8). The second assertion of Theorem 2 follows from the first assertion, taking into account Theorem 2.8.2 in [10], which states the existence of a probabilistic mapping $\mathbf{p}: \mathcal{Y} \rightsquigarrow \mathcal{X}$ such that $\mathbf{p}_*(T_*(P_{\mathcal{X}})) = P_{\mathcal{X}}$, and therefore $\mathbf{p}_*(T_*(\mathcal{D}_{\mathcal{X}})) = \mathcal{D}_{\mathcal{X}}$. □
Let us apply Theorem 2 to Example 4(3), originally from [17]. In [17], Satz 1, p. 274, Friedrich considered the group $G(\mathcal{X}, \Sigma_{\mathcal{X}}, \lambda)$ of all measurable 1-1 mappings $\Phi: \mathcal{X} \to \mathcal{X}$ such that $\Phi_*(\lambda) \sim \lambda$. Clearly $\Phi_*(\mathcal{P}(\lambda)) \subseteq \mathcal{P}(\lambda)$. Example 8 says that $\Phi$ is a sufficient statistic w.r.t. $\mathcal{P}(\lambda)$. Hence Theorem 2 implies the following.
Corollary 1.
([17], Satz 1) The group $G(\mathcal{X}, \Sigma_{\mathcal{X}}, \lambda)$ acts isometrically on $\mathcal{P}(\lambda)$.
Remark 5.
Theorem 2 extends the monotonicity theorem ([5], Theorem 5.5, p. 265) for 2-integrable parameterized statistical models. (As we remark in Section 5, Theorem 2 can easily be extended to the case of almost $l$-integrable $C^k$-diffeological measure models.)

4. The Cramér–Rao Inequality for 2-Integrable Diffeological Statistical Models

In this section we shall prove a version of the Cramér–Rao inequality for estimators with values in a 2-integrable $C^k$-diffeological statistical model.
Definition 7.
Let $P_{\mathcal{X}} \subseteq \mathcal{P}(\mathcal{X})$ be a statistical model. An estimator is a map $\hat{\sigma}: \mathcal{X} \to P_{\mathcal{X}}$.
Assume that $V$ is a locally convex topological vector space. Then we denote by $Map(P_{\mathcal{X}}, V)$ the space of all mappings $\varphi: P_{\mathcal{X}} \to V$, and by $V'$ the topological dual of $V$. It is usually easier to estimate only a “coordinate” $\varphi(\xi)$ of a probability measure $\xi \in P_{\mathcal{X}}$, which determines $\xi$ uniquely if $\varphi$ is an embedding.
Definition 8.
Let $P_{\mathcal{X}}$ be a statistical model and $\varphi \in Map(P_{\mathcal{X}}, V)$. A $\varphi$-estimator $\hat{\sigma}_\varphi$ is a composition $\varphi \circ \hat{\sigma}: \mathcal{X} \xrightarrow{\hat{\sigma}} P_{\mathcal{X}} \xrightarrow{\varphi} V$.
Example 9.
Assume that $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric positive definite kernel function, and let $V$ be the associated RKHS. For any $x \in \mathcal{X}$, we denote by $k_x$ the function on $\mathcal{X}$ defined by $k_x(y) := k(x, y)$ for any $y \in \mathcal{X}$. Then $k_x$ is an element of $V$. Let $P_{\mathcal{X}} = \mathcal{P}(\mathcal{X})$. Then we define the kernel mean embedding $\varphi: \mathcal{P}(\mathcal{X}) \to V$ as follows [25]:
$$\varphi(\xi) := \int_{\mathcal{X}} k_x \, d\xi(x),$$
where the integral should be understood as a Bochner integral.
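An empirical version of the kernel mean embedding is straightforward to compute. The sketch below uses a Gaussian RBF kernel (one common choice of symmetric positive definite kernel; our own choice, not prescribed by the paper) and compares $\varphi(\xi)(y) = E_\xi[k(X, y)]$ against its closed form for a standard Gaussian $\xi$:

```python
import numpy as np

# Empirical kernel mean embedding with a Gaussian RBF kernel.  For an
# empirical measure xi = (1/n) sum_i delta_{x_i}, the embedding is
# phi(xi) = (1/n) sum_i k_{x_i}, a function we can evaluate pointwise.
def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * (x - y) ** 2)

def kernel_mean(samples, y, gamma=1.0):
    """Evaluate phi(xi)(y) = \\int k(x, y) d xi(x) for empirical xi."""
    return float(np.mean(rbf(np.asarray(samples), y, gamma)))

rng = np.random.default_rng(1)
samples = rng.normal(size=5000)
# By the reproducing property, phi(xi)(y) = E_xi[k(X, y)]; for X ~ N(0,1)
# and gamma = 1 this expectation equals exp(-y^2/3) / sqrt(3).
y = 0.5
closed_form = np.exp(-y**2 / 3.0) / np.sqrt(3.0)
assert abs(kernel_mean(samples, y) - closed_form) < 0.05
```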
Remark 6.
(1) In classical statistics (see e.g., [26], §13, p. 51, [27], p. 4, [8], §4, p. 82, [5], Definition 5.1, p. 277) one considers only parameter estimations for parameterized statistical models. In this case, an estimator is a map from $\mathcal{X}$ to the parameter set $\Theta$ of a statistical model $\mathbf{p}(\Theta) \subseteq \mathcal{P}(\mathcal{X})$. Usually one assumes that the parameterization $\mathbf{p}: \Theta \to \mathbf{p}(\Theta)$ is 1-1; hence, a parameter estimation is equivalent to a nonparametric estimation in the sense of Definition 7. Note that the ultimate aim of a statistical experiment is to estimate the probability measure generating the observable of the experiment. In general, we can only assume that the unknown generating probability measure belongs to a statistical model $P_{\mathcal{X}} \subseteq \mathcal{P}(\mathcal{X})$. In this case, we need to use nonparametric estimation; see e.g., [28], p. 1. Note that, by Example 3, $P_{\mathcal{X}}$ has a natural structure of a $C^1$-diffeological statistical model.
(2) The notion of a $\varphi$-estimator occurs in a similar fashion in classical statistics; see e.g., [26], p. 52, where the author calls similar estimators substitution estimators, and [29], Definition 1.2, p. 4, where the authors consider estimands, which are versions of $\varphi$-estimators for a parameter estimation problem; see [5], p. 279.
For $\varphi \in Map(P_{\mathcal{X}}, V)$ and $l \in V'$ we denote by $\varphi^l$ the composition $l \circ \varphi$. Then we set
$$L^2_\varphi(\mathcal{X}, P_{\mathcal{X}}) := \big\{ \hat{\sigma}: \mathcal{X} \to P_{\mathcal{X}} \,\big|\, \varphi^l \circ \hat{\sigma} \in L^2_\xi(\mathcal{X}) \text{ for all } \xi \in P_{\mathcal{X}} \text{ and } l \in V' \big\}.$$
For $\hat{\sigma} \in L^2_\varphi(\mathcal{X}, P_{\mathcal{X}})$ we define the $\varphi$-mean value of $\hat{\sigma}$, denoted by $\varphi_{\hat{\sigma}}: P_{\mathcal{X}} \to V''$, as follows (cf. [5], (5.54), p. 279):
$$\varphi_{\hat{\sigma}}(\xi)(l) := E_\xi(\varphi^l \circ \hat{\sigma}) \quad \text{for } \xi \in P_{\mathcal{X}} \text{ and } l \in V'.$$
Let us identify $V$ with a subspace of $V''$ via the canonical pairing.
The difference $b^\varphi_{\hat{\sigma}} := \varphi_{\hat{\sigma}} - \varphi \in Map(P_{\mathcal{X}}, V'')$ will be called the bias of the $\varphi$-estimator $\hat{\sigma}_\varphi$.
For all $\xi \in P_{\mathcal{X}}$ we define a quadratic function $MSE^\varphi_\xi[\hat{\sigma}]$ on $V'$, called the mean square error quadratic function at $\xi$, by setting, for $l, h \in V'$ (cf. [5], (5.56), p. 279),
$$MSE^\varphi_\xi[\hat{\sigma}](l, h) := E_\xi\big[ \big(\varphi^l \circ \hat{\sigma}(x) - \varphi^l(\xi)\big) \cdot \big(\varphi^h \circ \hat{\sigma}(x) - \varphi^h(\xi)\big) \big].$$
Similarly, the variance quadratic function of the $\varphi$-estimator $\hat{\sigma}_\varphi$ at $\xi \in P_{\mathcal{X}}$ is the quadratic form $V^\varphi_\xi[\hat{\sigma}]$ on $V'$ such that, for all $l, h \in V'$ (cf. [5], (5.57), p. 279),
$$V^\varphi_\xi[\hat{\sigma}](l, h) = E_\xi\big[ \big(\varphi^l \circ \hat{\sigma}(x) - E_\xi(\varphi^l \circ \hat{\sigma})\big) \cdot \big(\varphi^h \circ \hat{\sigma}(x) - E_\xi(\varphi^h \circ \hat{\sigma})\big) \big].$$
Then it is known that ([5], (5.58), p. 279)
$$MSE^\varphi_\xi[\hat{\sigma}](l, h) = V^\varphi_\xi[\hat{\sigma}](l, h) + \big\langle b^\varphi_{\hat{\sigma}}(\xi), l \big\rangle \cdot \big\langle b^\varphi_{\hat{\sigma}}(\xi), h \big\rangle.$$
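The decomposition above is the familiar bias-variance identity. For a scalar coordinate $\varphi$ it can be verified empirically; the deliberately biased estimator below is a hypothetical illustration, not from the paper:

```python
import numpy as np

# Empirical check of the decomposition MSE = Var + bias^2 for a scalar
# "coordinate" phi.  Estimate phi(xi) = E_xi[X] under xi = N(2, 1) with
# the (deliberately biased, hypothetical) estimator sigma_hat(x) = 0.9*x.
rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=1.0, size=200000)
est = 0.9 * x
true_value = 2.0

mse = np.mean((est - true_value) ** 2)
var = np.var(est)
bias = np.mean(est) - true_value
# The identity holds exactly for empirical moments:
assert abs(mse - (var + bias**2)) < 1e-9
```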
Remark 7.
Assume that $V$ is a real Hilbert space with scalar product $\langle \cdot, \cdot \rangle$ and associated norm $\|\cdot\|$. Then the scalar product defines a canonical isomorphism $V = V'$, $v(w) := \langle v, w \rangle$ for all $v, w \in V$. For $\hat{\sigma} \in L^2_\varphi(\mathcal{X}, P_{\mathcal{X}})$, the mean square error $MSE^\varphi_\xi(\hat{\sigma})$ of the $\varphi$-estimator $\hat{\sigma}_\varphi$ is defined by
$$MSE^\varphi_\xi(\hat{\sigma}) := E_\xi\big( \| \varphi \circ \hat{\sigma} - \varphi(\xi) \|^2 \big). \tag{16}$$
The RHS of (16) is well-defined, since $\hat{\sigma} \in L^2_\varphi(\mathcal{X}, P_{\mathcal{X}})$, and therefore
$$\langle \varphi \circ \hat{\sigma}(x), \varphi \circ \hat{\sigma}(x) \rangle \in L^1(\mathcal{X}, \xi) \quad \text{and} \quad \langle \varphi \circ \hat{\sigma}(x), \varphi(\xi) \rangle \in L^2(\mathcal{X}, \xi).$$
Similarly, we define the variance of the $\varphi$-estimator $\hat{\sigma}_\varphi$ at $\xi$ as follows:
$$V^\varphi_\xi(\hat{\sigma}) := E_\xi\big( \| \varphi \circ \hat{\sigma} - E_\xi(\varphi \circ \hat{\sigma}) \|^2 \big).$$
If $V$ has a countable orthonormal basis $(v_i)_{i \in \mathbb{N}}$, then we have
$$MSE^\varphi_\xi(\hat{\sigma}) = \sum_{i=1}^\infty MSE^\varphi_\xi[\hat{\sigma}](v_i, v_i),$$
$$V^\varphi_\xi(\hat{\sigma}) = \sum_{i=1}^\infty V^\varphi_\xi[\hat{\sigma}](v_i, v_i).$$
Now, we assume that $(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ is an almost 2-integrable $C^k$-diffeological statistical model. For any $\xi \in P_{\mathcal{X}}$, let $T^{\mathfrak{g}}_\xi(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ be the completion of $T_\xi(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ w.r.t. the Fisher metric $\mathfrak{g}$. Since $T^{\mathfrak{g}}_\xi(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})$ is a Hilbert space, the map
$$L_{\mathfrak{g}}: T^{\mathfrak{g}}_\xi(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}}) \to \big(T^{\mathfrak{g}}_\xi(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})\big)', \quad L_{\mathfrak{g}}(v)(w) := \langle v, w \rangle_{\mathfrak{g}},$$
is an isomorphism. Then we define the inverse $\mathfrak{g}^{-1}$ of the Fisher metric on $\big(T^{\mathfrak{g}}_\xi(P_{\mathcal{X}}, \mathcal{D}_{\mathcal{X}})\big)'$ as follows:
$$\langle L_{\mathfrak{g}} v, L_{\mathfrak{g}} w \rangle_{\mathfrak{g}^{-1}} := \langle v, w \rangle_{\mathfrak{g}}.$$
Definition 9.
(cf. [5], Definition 5.18, p. 281) Assume that $\hat{\sigma} \in L^2_\varphi(\mathcal{X}, P_{\mathcal{X}})$. We shall call $\hat{\sigma}$ a $\varphi$-regular estimator if, for all $l \in V'$, the function $\xi \mapsto \| \varphi^l \circ \hat{\sigma} \|_{L^2(\mathcal{X}, \xi)}$ is locally bounded, i.e., for all $\xi_0 \in P_{\mathcal{X}}$,
$$\limsup_{\xi \to \xi_0} \| \varphi^l \circ \hat{\sigma} \|_{L^2(\mathcal{X}, \xi)} < \infty.$$
Proposition 2.
Assume that $(P_X, D_X)$ is a 2-integrable $C^k$-diffeological statistical model, $V$ is a topological vector space, $\varphi \in \mathrm{Map}(P_X, V)$ and $\hat\sigma : X \to P_X$ is a φ-regular estimator. Then the $V$-valued function $\varphi^{\hat\sigma}$ is Gateaux-differentiable on $(P_X, D_X)$. Furthermore, for any $l \in V$, the differential $d\varphi^{\hat\sigma}_l(\xi)$ extends to an element of $\big(T^g_{\xi}(P_X, D_X)\big)'$ for all $\xi \in P_X$.
Proof. 
Assume that a map $c : \mathbb{R} \to P_X$ belongs to $D_X$. Then $(\mathbb{R}, X, c)$ is a 2-integrable parametrized statistical model. By Lemma 5.2 in [5], p. 282, the composition $\varphi^{\hat\sigma} \circ c$ is differentiable. This proves the first assertion of Proposition 2.
Next, we shall show that $d\varphi^{\hat\sigma}(\xi)$ extends to an element of $\big(T^g_{\xi}(P_X, D_X)\big)'$ for all $\xi \in P_X$. Let $X \in C_{\xi}(P_X, D_X)$ and let $c : \mathbb{R} \to P_X$ be a $C^k$-curve such that $c(0) = \xi$ and $\dot c(0) = X$. By Lemma 5.3 in [5], p. 284, we have
$$X(\varphi^{\hat\sigma}_l) = \int_X \big(\varphi_l\circ\hat\sigma(x) - E_{\xi}(\varphi_l\circ\hat\sigma)\big)\cdot\log X \, d\xi(x), \tag{20}$$
where $\varphi_l\circ\hat\sigma(x) - E_{\xi}(\varphi_l\circ\hat\sigma) \in L^2(X, \xi)$. Denote by $\Pi_{\xi} : L^2(X,\xi)\cdot\xi \to T^g_{\xi} P_X$ the orthogonal projection. Set
$$\mathrm{grad}_g(\varphi^{\hat\sigma}_l) := \Pi_{\xi}\big[\big(\varphi_l\circ\hat\sigma(x) - E_{\xi}(\varphi_l\circ\hat\sigma)\big)\cdot\xi\big] \in T^g_{\xi} P_X.$$
Then we rewrite (20) as follows:
$$X(\varphi^{\hat\sigma}_l) = \big\langle \mathrm{grad}_g(\varphi^{\hat\sigma}_l), X \big\rangle_g. \tag{21}$$
Hence $d\varphi^{\hat\sigma}_l$ is the restriction of $L_g\big(\mathrm{grad}_g(\varphi^{\hat\sigma}_l)\big) \in \big(T^g_{\xi}(P_X, D_X)\big)'$. This completes the proof of Proposition 2. □
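For a finite sample space the gradient formula in the proof can be verified directly. Take $X = \{1,2,3\}$, let $P_X$ be the open probability simplex with the Fisher metric $\langle u, v\rangle_g = \sum_i u_i v_i/\xi_i$ on tangent vectors (signed measures of total mass zero), fix a statistic $\sigma$, and consider $f(\xi) = E_\xi(\sigma)$. This is an illustrative sketch of the projection formula in this special case, not the general construction; the numbers are arbitrary:

```python
# Sketch on the open probability simplex over X = {1, 2, 3}.
# Tangent vectors at xi are vectors X with sum(X) = 0; the Fisher
# metric is <u, v>_g = sum_i u_i v_i / xi_i.  For f(xi) = E_xi(sigma),
# grad_g f = (sigma - E_xi(sigma)) * xi satisfies X(f) = <grad_g f, X>_g.
xi = [0.2, 0.3, 0.5]
sigma = [1.0, 4.0, -2.0]                  # a fixed statistic on X
X = [0.1, -0.3, 0.2]                      # tangent vector: entries sum to 0

Ef = sum(s * p for s, p in zip(sigma, xi))         # E_xi(sigma)
grad = [(s - Ef) * p for s, p in zip(sigma, xi)]   # grad_g f at xi

# Directional derivative of f(xi) = sum_i sigma_i xi_i along X:
Xf = sum(s * x for s, x in zip(sigma, X))
# Fisher inner product <grad, X>_g:
pairing = sum(g * x / p for g, x, p in zip(grad, X, xi))
assert abs(Xf - pairing) < 1e-9
```

The cancellation in the pairing, $\sum_i (\sigma_i - E_\xi\sigma)\,X_i = \sum_i \sigma_i X_i$ because $\sum_i X_i = 0$, is exactly why centering by $E_\xi(\varphi_l\circ\hat\sigma)$ in the projection formula changes nothing.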
For any $\xi \in P_X$, we denote by $(g^{\varphi}_{\hat\sigma})^{-1}(\xi)$ the following quadratic form on $V$:
$$(g^{\varphi}_{\hat\sigma})^{-1}(\xi)(l, k) := \big\langle d\varphi^{\hat\sigma}_l, d\varphi^{\hat\sigma}_k \big\rangle_{g^{-1}}(\xi) = \big\langle \mathrm{grad}_g(\varphi^{\hat\sigma}_l), \mathrm{grad}_g(\varphi^{\hat\sigma}_k) \big\rangle_g.$$
Theorem 3
(Diffeological Cramér–Rao inequality). Let $(P_X, D_X)$ be a 2-integrable $C^k$-diffeological statistical model, $\varphi$ a $V$-valued function on $P_X$, and $\hat\sigma \in L^2_{\varphi}(X, P_X)$ a φ-regular estimator. Then the difference $V^{\varphi}_{\xi}[\hat\sigma] - (g^{\varphi}_{\hat\sigma})^{-1}(\xi)$ is a positive semi-definite quadratic form on $V$ for any $\xi \in P_X$.
Proof. 
To prove Theorem 3 it suffices to show that for any $l \in V$ we have
$$E_{\xi}\Big(\big(\varphi_l\circ\hat\sigma - E_{\xi}(\varphi_l\circ\hat\sigma)\big)^2\Big) \ge \big\|\mathrm{grad}_g(\varphi^{\hat\sigma}_l)\big\|^2_g. \tag{23}$$
Clearly (23) follows from (21): taking $X = \mathrm{grad}_g(\varphi^{\hat\sigma}_l)$ in (20) and applying the Cauchy–Schwarz inequality in $L^2(X,\xi)$ yields $\|\mathrm{grad}_g(\varphi^{\hat\sigma}_l)\|^2_g \le \|\varphi_l\circ\hat\sigma - E_{\xi}(\varphi_l\circ\hat\sigma)\|_{L^2(X,\xi)} \cdot \|\mathrm{grad}_g(\varphi^{\hat\sigma}_l)\|_g$. This completes the proof of Theorem 3. □
Theorem 3 is an extension of the general Cramér–Rao inequality ([11], Theorem 2); see also [5], Theorem 5.7, p. 286.
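As a concrete sanity check of the inequality in its classical scalar form, consider the Bernoulli family with a single observation: the Fisher information is $I(p) = 1/(p(1-p))$, and the unbiased estimator $\hat\sigma(x) = x$ has variance $p(1-p)$, so the Cramér–Rao bound $\mathrm{Var} \ge I(p)^{-1}$ is attained with equality. A minimal numerical sketch:

```python
# Cramér-Rao sanity check for the Bernoulli(p) family, one observation.
# Fisher information: I(p) = 1/(p(1-p)); the unbiased estimator
# sigma_hat(x) = x has variance p(1-p), so Var = I(p)^{-1} exactly.
p = 0.3
fisher_info = 1.0 / (p * (1.0 - p))

# Mean and variance of sigma_hat(x) = x under Bernoulli(p), by enumeration:
mean = 0 * (1 - p) + 1 * p
var = (0 - mean)**2 * (1 - p) + (1 - mean)**2 * p

assert var >= 1.0 / fisher_info - 1e-9       # Cramér-Rao inequality
assert abs(var - 1.0 / fisher_info) < 1e-9   # equality for this estimator
```

Equality here reflects that $x$ is an efficient estimator of $p$; for a biased or inefficient estimator the difference in Theorem 3 is strictly positive definite.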

5. Discussion

The notion of a $k$-integrable parametrized measure model, introduced in [6,7] (see also [5]), can be extended in the same way to the notion of an almost $k$-integrable diffeological measure model.
(1) There are two main differences between parameterized statistical models and $C^k$-diffeological statistical models. First, the parameter space of a parameterized statistical model is a single smooth Banach manifold, whereas the parameter spaces of a $C^k$-diffeological statistical model can be different but compatible. Secondly, the parameter spaces of a $C^k$-diffeological statistical model are finite dimensional. If $k = \infty$, this assumption is well-motivated [14]; see also Remark 2 (2).
(2) It would be interesting to apply the theory of $C^k$-diffeological statistical models to stochastic processes. It is known that Banach manifolds are not suitable for many questions of global analysis, see e.g., [15], p. 1, and therefore the theory of parameterized measure models might have limited applications to stochastic processes. On the other hand, there are many open questions in the theory of $C^\infty$-diffeological spaces; e.g., we do not know under which conditions we can define the Levi–Civita connection on a Riemannian $C^\infty$-diffeological space. Furthermore, the theory of $C^k$-diffeological spaces with $k \neq \infty$ has not been considered before.
(3) The variational calculus founded by Leibniz and Newton is a cornerstone of differential geometry and modern analysis. In our opinion, it is best expressed in the language of diffeological spaces, which declare which mappings into a diffeological space are smooth. This language is a counterpart to the language of ringed spaces in algebraic geometry, which declare which functions are algebraic.

Funding

This research was funded by the Institutional Research Plan RVO:67985840 and by the Grant Agency of Czech Republic, grant number GAČR-18-01953J.

Acknowledgments

The author would like to thank Patrick Iglesias-Zemmour for a stimulating discussion on diffeology, Lorenz Schwachhöfer for helpful comments on an early version of this paper and Tat Dat To for the suggestion to consider Friedrich’s examples in [17]. A part of this paper was completed during the Workshop “Information Geometry” in Toulouse 14–18 October 2019. The author would like to thank the organizers, and especially Stephane Puechmorel, for their invitation and hospitality during the workshop. The author is grateful to the anonymous referees for their critical comments and suggestions, which helped her to significantly improve the exposition of this paper.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. McCullagh, P. What is a statistical model? Ann. Stat. 2002, 30, 1225–1310. [Google Scholar] [CrossRef]
  2. Chentsov, N. Statistical Decision Rules and Optimal Inference; Nauka: Moscow, Russia, 1972; English translation in: Translation of Math. Monograph vol. 53, Amer. Math. Soc.: Providence, RI, USA, 1982. [Google Scholar]
  3. Amari, S. Differential-Geometric Methods in Statistics; Lecture Notes in Statistics 28; Springer: Heidelberg, Germany, 1985. [Google Scholar]
  4. Amari, S. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Berlin, Germany, 2016; Volume 194. [Google Scholar]
  5. Ay, N.; Jost, J.; Lê, H.V.; Schwachhöfer, L. Information Geometry; Springer Nature: Cham, Switzerland, 2017. [Google Scholar]
  6. Ay, N.; Jost, J.; Lê, H.V.; Schwachhöfer, L. Information geometry and sufficient statistics. Probab. Theory Relat. Fields 2015, 162, 327–364. [Google Scholar] [CrossRef] [Green Version]
  7. Ay, N.; Jost, J.; Lê, H.V.; Schwachhöfer, L. Parameterized measure models. Bernoulli 2018, 24, 1692–1725. [Google Scholar] [CrossRef] [Green Version]
  8. Amari, S.; Nagaoka, H. Methods of Information Geometry; Translations of Mathematical Monographs 191; Amer. Math. Soc.: Providence, RI, USA, 2000. [Google Scholar]
  9. Pistone, G.; Sempi, C. An infinite-dimensional structure on the space of all the probability measures equivalent to a given one. Ann. Stat. 1995, 23, 1543–1561. [Google Scholar] [CrossRef]
  10. Jost, J.; Lê, H.V.; Luu, D.H.; Tran, T.D. Probabilistic mappings and Bayesian nonparametrics. arXiv 2019, arXiv:1905.11448. [Google Scholar]
  11. Lê, H.V.; Jost, J.; Schwachhöfer, L. The Cramér-Rao Inequality on Singular Statistical Models. In Proceedings of the Conference “Geometric Science of Information”, GSI 2017, Paris, France, 7–9 November 2017; LNCS. Springer Nature: Cham, Switzerland, 2017; Volume 10589, pp. 552–560. [Google Scholar]
  12. Lê, H.V.; Somberg, P.; Vanžura, J. Smooth structures on pseudomanifolds with isolated conical singularities. Acta Math. Vietnam. 2013, 38, 33–54. [Google Scholar] [CrossRef] [Green Version]
  13. Lê, H.V.; Somberg, P.; Vanžura, J. Poisson smooth structures on stratified symplectic spaces. In The Springer Proceedings in Mathematics & Statistics “Mathematics in the 21st Century, 6th World Conference”, Lahore, March 2013; Springer: Basel, Switzerland, 2015; Volume 98, Chapter 7; pp. 181–204. [Google Scholar]
  14. Iglesias-Zemmour, P. Diffeology; Amer. Math. Soc.: Providence, RI, USA, 2013. [Google Scholar]
  15. Kriegl, A.; Michor, P.W. The Convenient Setting of Global Analysis; Amer. Math. Soc.: Providence, RI, USA, 1997. [Google Scholar]
  16. Grabiner, S. Range of products of operators. Can. J. Math. 1974, XXVI, 1430–1441. [Google Scholar] [CrossRef]
  17. Friedrich, T. Die Fisher-Information und symplektische Strukturen. Math. Nachr. 1991, 153, 273–296. [Google Scholar] [CrossRef]
  18. Souriau, J.-M. Groupes différentiels. In Lecture Notes in Mathematics, Vol. 836; Springer: Berlin, Germany, 1980; pp. 91–128. [Google Scholar]
  19. Chen, K.T. Iterated path integrals. Bull. Am. Math. Soc. 1977, 83, 831–879. [Google Scholar] [CrossRef] [Green Version]
  20. Lawvere, W.F. The Category of Probabilistic Mappings. 1962. Unpublished. Available online: https://ncatlab.org/nlab/files/lawvereprobability1962.pdf (accessed on 19 December 2019).
  21. Bogachev, V.I. Weak Convergence of Measures; Mathematical Surveys and Monographs; Amer. Math. Soc.: Providence, RI, USA, 2018; Volume 234. [Google Scholar]
  22. Giry, M. A categorical approach to probability theory. In Categorical Aspects of Topology and Analysis; Banaschewski, B., Ed.; Lecture Notes in Mathematics; Springer: Berlin/Heidelberg, Germany, 1982; Volume 915, pp. 68–85. [Google Scholar]
  23. Morse, N.; Sacksteder, R. Statistical isomorphism. Ann. Math. Stat. 1966, 37, 203–214. [Google Scholar] [CrossRef]
  24. Schervish, M.J. Theory of Statistics, 2nd ed.; Springer: New York, NY, USA, 1997. [Google Scholar]
  25. Muandet, K.; Fukumizu, K.; Sriperumbudur, B.; Schölkopf, B. Kernel Mean Embedding of Distributions: A Review and Beyond. Found. Trends Mach. Learn. 2017, 10, 1–141. [Google Scholar] [CrossRef]
  26. Borovkov, A.A. Mathematical Statistics; Gordon and Breach Science Publishers: Amsterdam, The Netherlands, 1998. [Google Scholar]
  27. Ibragimov, I.A.; Has’minskii, R.Z. Statistical Estimation: Asymptotic Theory; Springer: New York, NY, USA, 1981. [Google Scholar]
  28. Tsybakov, A.B. Introduction to Nonparametric Estimation; Springer Science+Business Media: New York, NY, USA, 2009. [Google Scholar]
  29. Lehmann, E.L.; Casella, G. Theory of Point Estimation, 2nd ed.; Springer: New York, NY, USA, 1998. [Google Scholar]

Share and Cite

MDPI and ACS Style

Lê, H.V. Diffeological Statistical Models, the Fisher Metric and Probabilistic Mappings. Mathematics 2020, 8, 167. https://doi.org/10.3390/math8020167
