Proceeding Paper

Geometric Learning of Hidden Markov Models via a Method of Moments Algorithm †

Berlin Chen, Cyrus Mostajeran and Salem Said
1 Princeton Neuroscience Institute, Princeton University, Princeton, NJ 08544, USA
2 School of Physical and Mathematical Sciences, Nanyang Technological University (NTU), Singapore 637371, Singapore
3 Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, UK
4 CNRS, Laboratoire Jean Kuntzmann, Université Grenoble-Alpes, 38400 Grenoble, France
* Author to whom correspondence should be addressed.
Presented at the 41st International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Paris, France, 18–22 July 2022.
Phys. Sci. Forum 2022, 5(1), 10; https://doi.org/10.3390/psf2022005010
Published: 3 November 2022

Abstract
We present a novel algorithm for learning the parameters of hidden Markov models (HMMs) in a geometric setting where the observations take values in Riemannian manifolds. In particular, we elevate a recent second-order method of moments algorithm that incorporates non-consecutive correlations to a more general setting where observations take place in a Riemannian symmetric space of non-positive curvature and the observation likelihoods are Riemannian Gaussians. The resulting algorithm decouples into a Riemannian Gaussian mixture model estimation algorithm followed by a sequence of convex optimization procedures. We demonstrate through examples that the learner can result in significantly improved speed and numerical accuracy compared to existing learners.

1. Introduction

Hidden Markov models (HMMs) describe states with Markovian dynamics that are hidden, in the sense that they are only accessible via observations by a noisy sensor. Specifically, at every time-step $k$, an observation $y_k$ is sampled from an observation space $\mathcal{Y}$ according to the HMM's observation likelihoods, which specify the probability of making a particular observation conditioned on the system being in a certain state. Despite their structural simplicity, HMMs are capable of modeling complex signals. They have become a standard tool for the modeling of stochastic time series [1] in recent decades and have found applications in a wide range of fields, including computational biology [2,3], signal and image analysis [4], speech recognition [5,6], and financial modeling [7].
In order to apply an HMM, it is often necessary to estimate its parameters from data. The standard approach to estimating the parameters of an HMM is based on a maximum likelihood (ML) criterion. Numerical algorithms for computing the ML estimate are dominated by iterative local-search procedures that aim to maximize the likelihood of the observed data, such as the expectation-maximization (EM) algorithm [1,4]. Unfortunately, these schemes are only guaranteed to converge to local stationary points of the typically non-convex likelihood function and as a result often become trapped in local optima. Thus, to have a chance of converging to a global optimum, a good initialization is usually required. Another drawback of such methods is their significant computational cost: each iteration is expensive on large datasets, which leads to long runtimes.
In order to overcome such challenges, methods of moments have been introduced for HMMs [8,9,10,11,12,13,14]. Originally, these methods relied on empirical estimation of correlations between consecutive pair- or triplet-wise observations to compute estimates of the HMM parameters. Although computationally attractive, such methods suffered from a loss of accuracy due to a focus on low-order correlations in the data. In response, Mattila et al. [15,16] extended these methods to include non-consecutive correlations in the data, resulting in improved accuracy while retaining their attractive computational properties.

1.1. Hidden Markov Models with Manifold-Valued Observations

The development and analysis of statistical procedures and optimization algorithms on manifolds and nonlinear spaces more broadly have been the subject of intense and growing research interest in recent decades due to the ubiquity of manifold-valued data in a wide range of applications [17,18,19,20,21,22,23]. Since the application of Euclidean algorithms to such data often has a significantly negative impact on the accuracy and interpretability of the results, it is necessary to devise algorithms that respect the intrinsic geometry of the data. In this work, we turn our attention to HMMs with observations in a Riemannian manifold [24,25]. In particular, we restrict our attention to the class of models with observations in Riemannian symmetric spaces of non-positive curvature, which include hyperbolic spaces, as well as spaces of real, complex, and quaternionic positive definite matrices. We have three motivations for this restriction: (1) standard operations on such spaces have relatively favorable computational properties due to symmetries, (2) there exists a theory of Riemannian Gaussian distributions on such spaces together with associated algorithms such as Riemannian Gaussian mixture estimation [26,27], and (3) they apply to a substantial class of problems involving manifold-valued data, including applications with data in the form of covariance matrices [27].

1.2. Contributions and Paper Outline

Our main contribution in this paper is to extend the second-order method of moments algorithm with non-consecutive correlations developed by Mattila et al. [15,16] to the setting of HMMs with observations in a Riemannian symmetric space of non-positive curvature, where the observation likelihoods take the form of Riemannian Gaussians [27,28]. The paper is organized as follows. In Section 2, we describe HMMs with manifold-valued observations and review the necessary geometric background. In Section 3, we review the method of moments algorithms for HMMs and describe how they manifest in the geometric setting. In Section 4, we present a number of simulations based on these algorithms and conclude with a discussion in Section 5.

1.3. Notation

We denote the $i$-th entry of a vector by $[\,\cdot\,]_i$, and the element at row $i$ and column $j$ of a matrix by $[\,\cdot\,]_{ij}$. Vectors are assumed to be column vectors unless transposed. The vector of all ones is denoted $\mathbf{1}$. We interpret inequalities between vectors and matrices to hold element-wise. The operator $\operatorname{diag}$ acts on a vector and returns the matrix with that vector on the diagonal and all other elements set to zero. The matrix Frobenius norm is denoted $\|\cdot\|_F$. The probability of an event $A$ is denoted $\mathbb{P}(A)$.

2. Hidden Markov Models on Manifolds

We consider a discrete-time hidden Markov model with a finite-state Markov chain on the state space $\mathcal{X} = \{1, \dots, N\}$ with time-homogeneous $N \times N$ transition probability matrix $P$ with elements
$$[P]_{ij} = \mathbb{P}[x_{k+1} = j \mid x_k = i].$$
The initial and stationary distributions of the HMM exist under appropriate assumptions and are denoted by $\pi_0 \in \mathbb{R}^N$ and $\pi \in \mathbb{R}^N$, respectively. The HMM is said to be stationary if $\pi_0 = \pi$.
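As a quick illustration of these definitions, the following minimal Python sketch computes the stationary distribution from $P$ and simulates the hidden chain; for concreteness it uses the 3-state transition matrix of Example 1 in Section 4.1.

```python
import numpy as np

# Transition matrix of the 3-state chain from Example 1 (Section 4.1).
P = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

# Stationary distribution: left eigenvector of P for eigenvalue 1,
# i.e., pi satisfies pi^T P = pi^T, normalized to sum to one.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()

# Simulate the hidden state sequence x_1, ..., x_D.
rng = np.random.default_rng(0)
D, x = 10_000, [0]
for _ in range(D - 1):
    x.append(rng.choice(3, p=P[x[-1]]))

print("stationary distribution:", np.round(pi, 3))
print("empirical state frequencies:", np.round(np.bincount(x, minlength=3) / D, 3))
```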
We assume that the states are hidden and can only be accessed through observations in a Riemannian symmetric space of non-positive curvature, so that the Riemannian Gaussian distribution with probability density function
$$p(y \mid \bar{y}, \sigma) = \frac{1}{Z(\sigma)} \exp\left( -\frac{d^2(y, \bar{y})}{2\sigma^2} \right)$$
with respect to the Riemannian volume measure $dv(y)$ on $\mathcal{Y}$ is well-defined for any $\bar{y} \in \mathcal{Y}$ and $\sigma > 0$, as outlined in [27]. Here, $d(\cdot, \cdot)$ denotes the Riemannian distance function on $\mathcal{Y}$ and $Z(\sigma)$ denotes the normalization factor of the Riemannian Gaussian, whose efficient computation has been the subject of interest in recent years [28,29,30,31]. We assume that the observations are sampled from $\mathcal{Y}$ according to conditional probability densities
$$B(y_k = y \mid x_k = j) = p(y \mid \bar{y}_j, \sigma_j),$$
for $j = 1, \dots, N$, where $p(\cdot \mid \bar{y}_j, \sigma_j)$ is a Riemannian Gaussian density function of the form (2) with mean $\bar{y}_j \in \mathcal{Y}$ and dispersion $\sigma_j > 0$.
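Evaluating this observation likelihood only requires the Riemannian distance and the normalization factor of the underlying space. A minimal sketch, assuming `dist` and `Z` are callables supplied for the chosen manifold (concrete Poincaré-disk versions appear in Section 4.1):

```python
import numpy as np

def riemannian_gaussian_pdf(y, y_bar, sigma, dist, Z):
    """Riemannian Gaussian density p(y | y_bar, sigma) of the form (2).

    dist : callable returning the Riemannian distance d(y, y_bar)
    Z    : callable returning the normalization factor Z(sigma)
    """
    return np.exp(-dist(y, y_bar) ** 2 / (2.0 * sigma ** 2)) / Z(sigma)
```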
To use an HMM for applications such as filtering or prediction, its model parameters must be specified or estimated in advance. This task can be formulated as the following learning problem for HMMs:
Problem 1. 
Given a sequence $y_1, \dots, y_D$ of observations in $\mathcal{Y}$ generated by an HMM with known state space $\mathcal{X} = \{1, \dots, N\}$, estimate the conditional probability densities $B$ and the matrix of transition probabilities $P$.
The learning problem is well-posed under the standard assumptions that the HMM is ergodic (irreducible and aperiodic) and identifiable [4,10,15,16]. A special case of the learning problem that is worth noting is that of the known-sensor HMM, in which the observation likelihoods B are assumed to be known. Known-sensor HMMs are motivated by applications in which the sensor is designed by the user, such as a target tracking system whose sensor specifications can be determined prior to deployment.
Various methods since the inception of HMMs have focused on maximizing the likelihood jointly over $B$ and $P$; however, recent efforts have demonstrated the potential of methods that decouple the problem [12,13] and estimate $B$ and $P$ sequentially. Specifically, in parametric-output HMMs (e.g., Gaussian HMMs), the observation likelihoods are estimated via a general mixture model learner in a first step, followed by the identification of the transition matrix $P$ in a second step [12]. For the first step, assuming that the underlying Markov chain is well behaved (e.g., recurrent) and mixes rapidly, each observation $y_k$ from a stationary HMM can be interpreted as having been sampled from the mixture density
$$p(y) = \sum_{i=1}^{N} [\pi]_i \, B(y \mid \bar{y}_i, \sigma_i).$$
Since we are assuming that the observation likelihoods belong to the family of isotropic Riemannian Gaussians on $\mathcal{Y}$, the density (4) can be estimated using one of several algorithms for the estimation of mixtures of Riemannian Gaussian distributions, including expectation-maximization (EM) [26,27], stochastic EM [32], and online variants [33]. The second step is then equivalent to the identification of a known-sensor HMM.
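In code, stage 1 produces mixture weights (an estimate of $\pi$) together with component means and dispersions; evaluating the fitted mixture density (4) at a point is then a weighted sum. A minimal sketch, again assuming `dist` and `Z` are the distance and normalization functions of the manifold at hand:

```python
import numpy as np

def mixture_pdf(y, weights, means, sigmas, dist, Z):
    """Evaluate the stationary mixture density (4) at an observation y."""
    return sum(w * np.exp(-dist(y, m) ** 2 / (2.0 * s ** 2)) / Z(s)
               for w, m, s in zip(weights, means, sigmas))
```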

3. Method of Moments Algorithms for Geometric Learning of Hidden Markov Models

3.1. Method of Moments for HMMs

We begin with a brief review of the method of moments algorithm for HMMs developed by Mattila et al. in [15]. The significance of this work is that it extends the previous method of moments algorithms for HMMs, which were based on correlations between consecutive pair- or triplet-wise observations, to include non-consecutive correlations in the data. In doing so, the authors improve the accuracy of the approach by reducing the amount of information in the data that is left unused, while maintaining the computationally attractive properties of the previous method of moments algorithms.
Before presenting the algorithm in the setting of HMMs with manifold-valued observations, we briefly summarize the key steps of the second-order algorithm of Mattila et al. [15] in the simplest setting, where the observations take values in a finite observation alphabet $\{1, \dots, Y\}$ with a known $N \times Y$ observation matrix $B$:
$$[B]_{ij} = \mathbb{P}[y_k = j \mid x_k = i].$$
Methods of moments for HMMs (e.g., [8,9,10,11,12,13,14]) involve the empirical estimation of low-order correlations in the data, such as pairs $\mathbb{P}[y_k, y_{k+1}]$ or triplets $\mathbb{P}[y_k, y_{k+1}, y_{k+2}]$, followed by computation of the HMM parameter estimates by minimizing the discrepancy between the empirical estimates and their analytical expressions via a series of convex optimization problems. In Mattila et al. [15], the authors extend such methods to include non-consecutive correlations of the form $\mathbb{P}[y_k, y_{k+\tau}]$ with $\tau = 1, 2, \dots, \bar{\tau}$, where $\bar{\tau}$ is a user-defined lag parameter.
The lag-$\tau$ second-order moments $M_2(k, \tau) \in \mathbb{R}^{Y \times Y}$ of the HMM are defined as the matrices
$$[M_2(k, \tau)]_{ij} = \mathbb{P}[y_k = i,\ y_{k+\tau} = j],$$
where $i, j = 1, \dots, Y$ and $\tau \geq 0$. The case $\tau = 0$ reduces to the first-order moments $[M_1(k)]_i = \mathbb{P}[y_k = i]$, where $M_1(k) \in \mathbb{R}^{Y}$, which for notational convenience is expressed as a special case of the second-order moments by writing $M_2(k, 0) = \operatorname{diag}(M_1(k))$. For a stationary HMM (i.e., $\pi_0 = \pi$), it can be readily verified that the lag-$\tau$ second-order moments are related to the HMM parameters according to the equations
$$M_2(k, \tau) = B^T \operatorname{diag}(\pi)\, P^{\tau} B, \qquad M_2(k, 0) = \operatorname{diag}(B^T \pi),$$
for any $\tau > 0$.
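These model-side moments are simple matrix products. A direct NumPy transcription, used below as the analytical counterpart in the moment-matching step (a sketch under the stationarity assumption $\pi_0 = \pi$):

```python
import numpy as np

def analytical_moments(B, P, pi, tau_bar):
    """Lag-tau second-order moments of a stationary HMM.

    Returns [M2(0), M2(1), ..., M2(tau_bar)] with
    M2(0) = diag(B^T pi) and M2(tau) = B^T diag(pi) P^tau B for tau >= 1.
    """
    moments = [np.diag(B.T @ pi)]
    P_tau = np.eye(len(pi))
    for _ in range(tau_bar):
        P_tau = P_tau @ P              # build up P^tau one lag at a time
        moments.append(B.T @ np.diag(pi) @ P_tau @ B)
    return moments
```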
The lag-$\tau$ second-order moments can be empirically estimated from data as $\hat{M}_2(\tau)$ according to the equation
$$[\hat{M}_2(\tau)]_{ij} = \frac{1}{D - \tau} \sum_{k=1}^{D - \tau} \mathbb{I}\{ y_k = i,\ y_{k+\tau} = j \},$$
for $\tau = 0, 1, \dots, \bar{\tau}$, where $D$ is the number of observations and $\mathbb{I}$ denotes the indicator function. The next step in the method is moment matching: the discrepancy between the empirical estimates $\hat{M}_2(\tau)$ and their analytical expressions is minimized by solving the following convex (quadratic) optimization problems:
  • Solve
$$\min_{\hat{\pi} \in \mathbb{R}^{N}} \left\| \hat{M}_2(0) - \operatorname{diag}(B^T \hat{\pi}) \right\|_F^2 \quad \text{s.t.} \quad \hat{\pi} \geq 0, \quad \mathbf{1}^T \hat{\pi} = 1,$$
    and set $\hat{A}(0) = \operatorname{diag}(\hat{\pi})$.
  • For $\tau = 1, \dots, \bar{\tau}$, solve
$$\min_{\hat{P}(\tau) \in \mathbb{R}^{N \times N}} \left\| \hat{M}_2(\tau) - B^T \hat{A}(\tau - 1)\, \hat{P}(\tau)\, B \right\|_F^2 \quad \text{s.t.} \quad \hat{P}(\tau) \geq 0, \quad \hat{P}(\tau)\, \mathbf{1} = \mathbf{1},$$
    and set $\hat{A}(\tau) = \hat{A}(\tau - 1)\, \hat{P}(\tau)$.
The output of the above moment-matching procedure is a sequence $\hat{A}(0), \dots, \hat{A}(\bar{\tau})$. In the final step, we use this sequence to estimate the transition matrix $P$ by solving the following least-squares problem, which by construction incorporates information from every lag:
$$\min_{\hat{P} \in \mathbb{R}^{N \times N}} \left\| \begin{bmatrix} \hat{A}(0) \\ \vdots \\ \hat{A}(\bar{\tau} - 1) \end{bmatrix} \hat{P} - \begin{bmatrix} \hat{A}(1) \\ \vdots \\ \hat{A}(\bar{\tau}) \end{bmatrix} \right\|_F^2 \quad \text{s.t.} \quad \hat{P} \geq 0, \quad \hat{P}\, \mathbf{1} = \mathbf{1}.$$
The dominant contribution to the computational cost of the above algorithm is independent of the data size $D$ and scales linearly with the number of lags $\bar{\tau}$ included. In contrast, each iteration of the EM algorithm has a complexity of $O(N^2 D)$. In addition to favorable computational properties, it is shown in [15,16] that the above algorithm is strongly consistent under reasonable assumptions. That is, as the number of samples grows, we expect the estimate of the transition matrix $P$ to converge to its true value.
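To make the discrete-observation pipeline concrete, the following sketch estimates the empirical lag-$\tau$ moments and solves the sequence of moment-matching problems above with CVXPY; the observation sequence `ys` (values in {0, ..., Y−1}), the known observation matrix `B`, and the lag parameter `tau_bar` are assumed inputs, and the QP solver choice is left to CVXPY's defaults.

```python
import numpy as np
import cvxpy as cp

def empirical_moments(ys, Y, tau_bar):
    """Empirical lag-tau pair probabilities from a discrete observation sequence."""
    D = len(ys)
    M_hat = []
    for tau in range(tau_bar + 1):
        M = np.zeros((Y, Y))
        for k in range(D - tau):
            M[ys[k], ys[k + tau]] += 1.0
        M_hat.append(M / (D - tau))     # M_hat[0] is diagonal by construction
    return M_hat

def method_of_moments(M_hat, B, tau_bar):
    """Moment matching for a known-sensor HMM: returns the estimated transition matrix."""
    N = B.shape[0]
    ones = np.ones(N)

    # Step 1: estimate the stationary distribution from the lag-0 moments.
    pi_hat = cp.Variable(N, nonneg=True)
    cp.Problem(cp.Minimize(cp.sum_squares(M_hat[0] - cp.diag(B.T @ pi_hat))),
               [ones @ pi_hat == 1]).solve()
    A = [np.diag(pi_hat.value)]

    # Step 2: one convex problem per lag tau = 1, ..., tau_bar.
    for tau in range(1, tau_bar + 1):
        P_tau = cp.Variable((N, N), nonneg=True)
        cp.Problem(cp.Minimize(cp.sum_squares(M_hat[tau] - B.T @ A[-1] @ P_tau @ B)),
                   [P_tau @ ones == ones]).solve()
        A.append(A[-1] @ P_tau.value)

    # Final step: stacked least-squares problem over all lags.
    P_hat = cp.Variable((N, N), nonneg=True)
    top, bottom = np.vstack(A[:-1]), np.vstack(A[1:])
    cp.Problem(cp.Minimize(cp.sum_squares(top @ P_hat - bottom)),
               [P_hat @ ones == ones]).solve()
    return P_hat.value
```

Each subproblem touches the data only through the small matrices $\hat{M}_2(\tau)$, which is what makes the cost of this stage independent of the data size $D$.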

3.2. Geometric Learning of HMMs Using Method of Moments

We now return to the problem of estimating the parameters of an HMM with observations in a Riemannian manifold $\mathcal{Y}$ via an extension of the second-order method of moments presented earlier. We assume the conditional probability densities to be given by Riemannian Gaussians of the form (2). The first stage of the process is to estimate the means and variances of the observation densities from data by employing a Riemannian Gaussian mixture learner [27,32,33]. In the case of a known-sensor HMM, this stage is unnecessary as the observation densities are known a priori. In the next stage, we use a kernel trick outlined in [12,16] to extend the pairwise correlations $M_2(\tau)$ between discrete-valued observations to an analogous quantity $H(\tau) \in \mathbb{R}^{N \times N}$ applicable in the setting of continuous observation spaces. $H$ is then related to the parameters of the HMM according to the equations
$$H(0) = \operatorname{diag}(K \pi), \qquad H(\tau) = K^T \operatorname{diag}(\pi)\, P^{\tau} K,$$
for $\tau = 1, \dots, \bar{\tau}$, $\bar{\tau} \in \mathbb{N}$, where $\pi$ is the stationary distribution of the HMM, which can be estimated from (4), and $K \in \mathbb{R}^{N \times N}$ is defined as
$$[K]_{ij} = \int_{\mathcal{Y}} B(y \mid x = i)\, B(y \mid x = j)\, dv(y).$$
The $N \times N$ matrix $K$ in (13) is called the effective observation matrix in [12,16] and replaces the $N \times Y$ observation matrix (5). We can compute $K$ using Monte Carlo techniques based on sampling from Riemannian Gaussians [27].
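One simple Monte Carlo scheme follows from the observation that $[K]_{ij}$ is the expectation of $B(y \mid x = j)$ when $y$ is drawn from the $i$-th observation density. A sketch under that interpretation, assuming a routine `sample_gaussian(mean, sigma, n)` that draws samples from a Riemannian Gaussian (e.g., via the sampling schemes of [27]) and a routine `pdf(y, mean, sigma)` evaluating the density (2):

```python
import numpy as np

def effective_observation_matrix(means, sigmas, sample_gaussian, pdf, n_samples=10_000):
    """Monte Carlo estimate of K, with [K]_ij = E_{y ~ B(.|x=i)}[ B(y | x=j) ]."""
    N = len(means)
    K = np.zeros((N, N))
    for i in range(N):
        samples = sample_gaussian(means[i], sigmas[i], n_samples)  # y ~ B(. | x = i)
        for j in range(N):
            K[i, j] = np.mean([pdf(y, means[j], sigmas[j]) for y in samples])
    return K
```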
The elements of the left-hand side of (12) can be interpreted as expectations with respect to the joint probability distribution of $y_k$ and $y_{k+\tau}$, and can be empirically estimated from HMM observations as
$$[\hat{H}(0)]_{ii} = \frac{1}{D} \sum_{k=1}^{D} B(y_k \mid x = i),$$
$$[\hat{H}(\tau)]_{ij} = \frac{1}{D - \tau} \sum_{k=1}^{D - \tau} B(y_k \mid x = i)\, B(y_{k+\tau} \mid x = j),$$
in analogy with the empirical estimate (8) employed in the case of HMMs with a discrete observation space.
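Given the matrix of likelihood evaluations, these estimates are simple averages. A minimal sketch, assuming `L` is a D×N array with `L[k, i] = B(y_k | x = i)`, computable with the Riemannian Gaussian density sketched in Section 2:

```python
import numpy as np

def empirical_H(L, tau_bar):
    """Empirical estimates H_hat(0), ..., H_hat(tau_bar) from likelihoods L (D x N)."""
    D = L.shape[0]
    H_hat = [np.diag(L.mean(axis=0))]                      # H_hat(0): diagonal of averages
    for tau in range(1, tau_bar + 1):
        H_hat.append(L[:D - tau].T @ L[tau:] / (D - tau))  # entry (i, j) of H_hat(tau)
    return H_hat
```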
Following the estimation of $H(\tau)$ and the computation of $K$, the moment-matching procedure now takes the form of minimizing the discrepancy between the empirical estimates $\hat{H}(\tau)$ and the corresponding analytical expressions in (12). Specifically, in the case of the known-sensor HMM, we solve the following sequence of convex (quadratic) optimization problems:
  • Solve
$$\min_{\hat{\pi} \in \mathbb{R}^{N}} \left\| \hat{H}(0) - \operatorname{diag}(K^T \hat{\pi}) \right\|_F^2 \quad \text{s.t.} \quad \hat{\pi} \geq 0, \quad \mathbf{1}^T \hat{\pi} = 1,$$
    and set $\hat{A}(0) = \operatorname{diag}(\hat{\pi})$.
  • For $\tau = 1, \dots, \bar{\tau}$, solve
$$\min_{\hat{P}(\tau) \in \mathbb{R}^{N \times N}} \left\| \hat{H}(\tau) - K^T \hat{A}(\tau - 1)\, \hat{P}(\tau)\, K \right\|_F^2 \quad \text{s.t.} \quad \hat{P}(\tau) \geq 0, \quad \hat{P}(\tau)\, \mathbf{1} = \mathbf{1},$$
    and set $\hat{A}(\tau) = \hat{A}(\tau - 1)\, \hat{P}(\tau)$.
The output is once again a sequence $\hat{A}(0), \dots, \hat{A}(\bar{\tau})$, which is used to compute an estimate of the transition matrix $P$ by solving (11).
To summarize, the algorithm follows a two-stage procedure to learn the parameters of an HMM with observations in a Riemannian manifold admitting well-defined Gaussian densities of the form (2) from data. In stage 1, Riemannian Gaussian mixture estimation is employed to compute estimates for the conditional likelihoods B, which are then used in stage 2 to compute an estimate for the transition probabilities P by solving a series of convex optimization problems.

4. Simulations

We now present the results of several numerical experiments on learning HMMs with manifold-valued observations. In the first example, observations take place in the Poincaré disk model of hyperbolic 2-space. Poincaré models of hyperbolic spaces have been a subject of increasing interest in machine learning in recent years due to their ability to efficiently represent hierarchical data [34]. In the second example, we consider a model with observations in the manifold of 2 × 2 symmetric positive definite (SPD) matrices equipped with the standard affine-invariant Rao-Fisher metric [26].

4.1. Example 1: Observations in Hyperbolic Space

We consider the example of an HMM with $N = 3$ hidden states with initial distribution $\pi_0 = (1, 0, 0)^T$ and transition matrix
$$P = \begin{pmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{pmatrix},$$
and observations generated from a Riemannian Gaussian model in the Poincaré disk $\mathcal{Y} = \{ y \in \mathbb{C} : |y| < 1 \}$ with associated means $\bar{y}_1 = 0$, $\bar{y}_2 = 0.29 + 0.82i$, $\bar{y}_3 = 0.29 + 0.82i$ and standard deviations $\sigma_1 = 0.1$, $\sigma_2 = 0.4$, $\sigma_3 = 0.4$, as studied in [24] in the context of estimation using the EM algorithm. The Riemannian distance function $d(\cdot, \cdot)$ and the Riemannian Gaussian normalization factor $Z(\sigma)$ are given by
$$d(y, z) = \operatorname{acosh}\left( 1 + \frac{2\, |y - z|^2}{(1 - |y|^2)(1 - |z|^2)} \right), \qquad Z(\sigma) = 2\pi \sqrt{\frac{\pi}{2}}\, \sigma\, e^{\sigma^2 / 2} \operatorname{erf}\!\left( \frac{\sigma}{\sqrt{2}} \right),$$
respectively, where $\operatorname{erf}$ denotes the error function [35].
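For reference, a direct transcription of these two formulas in Python (a sketch; points are represented as complex numbers, and the means and dispersions below are those of the example):

```python
import numpy as np
from scipy.special import erf

def poincare_distance(y, z):
    """Riemannian distance on the Poincare disk (points as complex numbers, |y| < 1)."""
    return np.arccosh(1.0 + 2.0 * abs(y - z) ** 2
                      / ((1.0 - abs(y) ** 2) * (1.0 - abs(z) ** 2)))

def poincare_Z(sigma):
    """Normalization factor of the Riemannian Gaussian on the Poincare disk."""
    return (2.0 * np.pi * np.sqrt(np.pi / 2.0) * sigma
            * np.exp(sigma ** 2 / 2.0) * erf(sigma / np.sqrt(2.0)))

# Observation densities of the example: means and dispersions of the three states.
means = [0.0, 0.29 + 0.82j, 0.29 + 0.82j]
sigmas = [0.1, 0.4, 0.4]
y = 0.1 + 0.2j  # an arbitrary point in the disk
likelihoods = [np.exp(-poincare_distance(y, m) ** 2 / (2 * s ** 2)) / poincare_Z(s)
               for m, s in zip(means, sigmas)]
print(np.round(likelihoods, 4))
```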
We employed the second-order method of moments algorithm of Section 3.2 to learn the parameters of this HMM from observations alone. The model was fitted on 20 HMM chains, each with 10,000 observations. In our implementation, we used the mixture estimation algorithm of [26] to estimate the density (4). The full results are reported in Table 1, where the true and estimated Gaussian means are denoted by $\bar{y}_i$ and $\hat{y}_i$, respectively. On repeating the experiment with varying $\bar{\tau}$ and the same random seed (and hence the same estimates for means and dispersions by construction), we observed that incorporating non-consecutive data (i.e., $\bar{\tau} > 1$) up to $\bar{\tau} = 3$ significantly improved our estimate of $P$ and produced a more accurate estimate than alternative algorithms [24,25]. Comparing the empirical performance of our algorithm to the numerical results reported in [24], we observed that our algorithm performed competitively while requiring only a fraction of the runtime with the same number of observations. In comparison to the online learning algorithm of [25], which we applied to the same learning problem, we observed improved performance for $\bar{\tau} > 1$, with the method of moments algorithm with $\bar{\tau} = 3$ producing the most accurate estimate of $P$ out of all considered methods. Interestingly, the runtime of our algorithm was not noticeably affected by the choice of $\bar{\tau}$ in this example, since the mixture estimation and the computation of $K$ (13) accounted for the dominant contribution to the computational cost.

4.2. Example 2: Observations in the Manifold of 2 × 2 SPD Matrices with N = 5 Hidden States

We now consider an HMM with $N = 5$ hidden states that are accessible through noisy observations in the manifold of $2 \times 2$ SPD matrices, generated from a Riemannian Gaussian model with means $\bar{y}_i$ and standard deviations $\sigma_i$ given in Table 2. Here, the Riemannian distance function $d(\cdot, \cdot)$ and the Riemannian Gaussian normalization factor $Z(\sigma)$ are given by
$$d(y, z) = \left\| \log\!\left( y^{-1/2} z\, y^{-1/2} \right) \right\|_F, \qquad Z(\sigma) = (2\pi)^{3/2}\, \sigma^2\, e^{\sigma^2 / 4} \operatorname{erf}\!\left( \frac{\sigma}{2} \right).$$
While the expression for the Riemannian distance function holds for SPD matrices of any dimension, the analytical expression for $Z(\sigma)$ in (20) is only valid in the $2 \times 2$ case. Nonetheless, $Z(\sigma)$ can be directly computed or approximated for higher-dimensional SPD matrices [26,27,28,29,30,31].
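A corresponding sketch for the SPD case, assuming SciPy for the matrix logarithm and error function; the matrix square root is taken via the symmetric eigendecomposition, and the $2 \times 2$ normalization factor is transcribed as reconstructed above:

```python
import numpy as np
from scipy.special import erf
from scipy.linalg import eigh, logm

def spd_distance(y, z):
    """Affine-invariant Riemannian distance between SPD matrices y and z."""
    w, V = eigh(y)
    y_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T   # y^{-1/2}
    return np.linalg.norm(logm(y_inv_sqrt @ z @ y_inv_sqrt), 'fro')

def spd2_Z(sigma):
    """Normalization factor of the Riemannian Gaussian on 2x2 SPD matrices."""
    return (2.0 * np.pi) ** 1.5 * sigma ** 2 * np.exp(sigma ** 2 / 4.0) * erf(sigma / 2.0)

# Example: distance between the first two Gaussian means of Table 2.
y1 = np.array([[1.646, 0.056], [0.056, 2.379]])
y2 = np.array([[2.294, 0.744], [0.744, 1.415]])
print(spd_distance(y1, y2), spd2_Z(0.1))
```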
The transition matrix $P$ of the underlying Markov chain is
$$P = \begin{pmatrix} 0.3 & 0.1 & 0.2 & 0.1 & 0.3 \\ 0.1 & 0.4 & 0.2 & 0.2 & 0.1 \\ 0.2 & 0.2 & 0.3 & 0.1 & 0.2 \\ 0.1 & 0.1 & 0.2 & 0.5 & 0.1 \\ 0.4 & 0.1 & 0.1 & 0.1 & 0.3 \end{pmatrix}.$$
We employed our proposed geometric second-order method of moments algorithm with $\bar{\tau} = 1$ to sequentially estimate the underlying Gaussian model and the probability transition matrix from 10,000 observations. The results of the Gaussian mixture estimation procedure are reported in Table 2 and demonstrate a high level of accuracy. The estimated Riemannian Gaussian model with means $\hat{y}_i$ and standard deviations $\hat{\sigma}_i$, as well as the observations used to learn the model, are visualized in Figure 1.
The estimated transition matrix $\hat{P}$ is
$$\hat{P} = \begin{pmatrix} 0.291 & 0.088 & 0.195 & 0.092 & 0.334 \\ 0.104 & 0.409 & 0.185 & 0.188 & 0.114 \\ 0.199 & 0.206 & 0.297 & 0.098 & 0.200 \\ 0.091 & 0.113 & 0.202 & 0.482 & 0.112 \\ 0.407 & 0.105 & 0.106 & 0.083 & 0.299 \end{pmatrix},$$
which yields a relative approximation error of
$$\frac{\| P - \hat{P} \|_F}{\| P \|_F} = 0.050$$
with respect to the Frobenius norm. The mean error in the estimated transition probabilities is
$$\frac{1}{N^2} \sum_{i,j=1}^{N} \left| [P]_{ij} - [\hat{P}]_{ij} \right| \approx 0.01.$$
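These two error figures can be reproduced directly from the matrices above (a small check, with `P` and `P_hat` holding the matrices as printed):

```python
import numpy as np

P = np.array([[0.3, 0.1, 0.2, 0.1, 0.3],
              [0.1, 0.4, 0.2, 0.2, 0.1],
              [0.2, 0.2, 0.3, 0.1, 0.2],
              [0.1, 0.1, 0.2, 0.5, 0.1],
              [0.4, 0.1, 0.1, 0.1, 0.3]])
P_hat = np.array([[0.291, 0.088, 0.195, 0.092, 0.334],
                  [0.104, 0.409, 0.185, 0.188, 0.114],
                  [0.199, 0.206, 0.297, 0.098, 0.200],
                  [0.091, 0.113, 0.202, 0.482, 0.112],
                  [0.407, 0.105, 0.106, 0.083, 0.299]])

rel_error = np.linalg.norm(P - P_hat) / np.linalg.norm(P)   # relative Frobenius error
mean_error = np.abs(P - P_hat).mean()                       # mean absolute entry error
print(round(rel_error, 3), round(mean_error, 3))
```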

5. Conclusions

In this paper, we have shown that the recent method of moments algorithms for HMMs can be generalized to geometric settings in which observations take place in Riemannian manifolds. We observed (through simple numerical simulations) that the documented advantages of the method of moments algorithms, including their competitive accuracies and attractive computational and statistical properties, may continue to hold in the geometric setting. Nonetheless, we expect unique computational challenges to arise in applications involving high-dimensional Riemannian manifolds. Specifically, using Markov chain Monte Carlo (MCMC) algorithms to compute the effective observation matrix K defined in (13) may become prohibitively expensive in high dimensions, which is not the case in the Euclidean setting as K admits a closed-form analytic expression for multivariate Gaussian HMMs. Thus, a key technical challenge for the effective application of the proposed algorithm in problems involving high-dimensional manifolds is to devise algorithms for the efficient and scalable computation of K. Further developments of the approach may include extensions to models that incorporate third- or higher-order moments or more elaborate dynamics and control inputs.

Author Contributions

Conceptualization, B.C., C.M. and S.S.; writing, methodology, and analysis, B.C. and C.M.; supervision, C.M. and S.S. All authors have read and agreed to the published version of the manuscript.

Funding

B.C. acknowledges funding from the Faculty of Mathematics at the University of Cambridge as part of the Cambridge Mathematics Placements (CMP) program. C.M. was supported by an NTU Presidential Postdoctoral Fellowship and an Early Career Research Fellowship at Fitzwilliam College, Cambridge.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Krishnamurthy, V. Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing; Cambridge University Press: Cambridge, UK, 2016.
2. Durbin, R.; Eddy, S.R.; Krogh, A.; Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids; Cambridge University Press: Cambridge, UK, 1998.
3. Vidyasagar, M. Hidden Markov Processes: Theory and Applications to Biology; Princeton University Press: Princeton, NJ, USA, 2014.
4. Cappé, O.; Moulines, E.; Rydén, T. Inference in Hidden Markov Models; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2005.
5. Rabiner, L. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286.
6. Gales, M.; Young, S. The application of hidden Markov models in speech recognition. Found. Trends Signal Process. 2008, 1, 195–304.
7. Mamon, R.S.; Elliott, R.J. Hidden Markov Models in Finance; Springer: Berlin/Heidelberg, Germany, 2007.
8. Chang, J.T. Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency. Math. Biosci. 1996, 137, 51–73.
9. Mossel, E.; Roch, S. Learning nonsingular phylogenies and hidden Markov models. Ann. Appl. Probab. 2006, 16, 583–614.
10. Hsu, D.; Kakade, S.M.; Zhang, T. A spectral algorithm for learning Hidden Markov Models. J. Comput. Syst. Sci. 2012, 78, 1460–1480.
11. Anandkumar, A.; Hsu, D.; Kakade, S.M. A Method of Moments for Mixture Models and Hidden Markov Models. In Proceedings of Machine Learning Research, Proceedings of the 25th Annual Conference on Learning Theory; Mannor, S., Srebro, N., Williamson, R.C., Eds.; PMLR: Edinburgh, UK, 2012; Volume 23, pp. 33.1–33.34.
12. Kontorovich, A.; Nadler, B.; Weiss, R. On Learning Parametric-Output HMMs. In Proceedings of the 30th International Conference on Machine Learning, ICML'13, Atlanta, GA, USA, 16–21 June 2013; JMLR.org; Volume 28, pp. III-702–III-710.
13. Mattila, R.; Rojas, C.R.; Krishnamurthy, V.; Wahlberg, B. Asymptotically Efficient Identification of Known-Sensor Hidden Markov Models. IEEE Signal Process. Lett. 2017, 24, 1813–1817.
14. Huang, K.; Fu, X.; Sidiropoulos, N. Learning Hidden Markov Models from Pairwise Co-occurrences with Application to Topic Modeling. In Proceedings of Machine Learning Research, Proceedings of the 35th International Conference on Machine Learning; Dy, J., Krause, A., Eds.; PMLR: Edinburgh, UK, 2018; Volume 80, pp. 2068–2077.
15. Mattila, R.; Rojas, C.; Moulines, E.; Krishnamurthy, V.; Wahlberg, B. Fast and Consistent Learning of Hidden Markov Models by Incorporating Non-Consecutive Correlations. In Proceedings of Machine Learning Research, Proceedings of the 37th International Conference on Machine Learning; Daumé, H.D., III, Singh, A., Eds.; PMLR: Edinburgh, UK, 2020; Volume 119, pp. 6785–6796.
16. Mattila, R. Hidden Markov Models: Identification, Inverse Filtering and Applications. Ph.D. Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2020.
17. Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2009.
18. Barachant, A.; Bonnet, S.; Congedo, M.; Jutten, C. Multiclass Brain–Computer Interface Classification by Riemannian Geometry. IEEE Trans. Biomed. Eng. 2012, 59, 920–928.
19. Boumal, N.; Mishra, B.; Absil, P.A.; Sepulchre, R. Manopt, a Matlab Toolbox for Optimization on Manifolds. J. Mach. Learn. Res. 2014, 15, 1455–1459.
20. Pennec, X.; Sommer, S.; Fletcher, T. Riemannian Geometric Statistics in Medical Image Analysis; Academic Press: Cambridge, MA, USA, 2020.
21. Miolane, N.; Guigui, N.; Brigant, A.L.; Mathe, J.; Hou, B.; Thanwerdas, Y.; Heyder, S.; Peltre, O.; Koep, N.; Zaatiti, H.; et al. Geomstats: A Python Package for Riemannian Geometry in Machine Learning. J. Mach. Learn. Res. 2020, 21, 1–9.
22. Mostajeran, C.; Grussler, C.; Sepulchre, R. Geometric Matrix Midranges. SIAM J. Matrix Anal. Appl. 2020, 41, 1347–1368.
23. Van Goffrier, G.W.; Mostajeran, C.; Sepulchre, R. Inductive Geometric Matrix Midranges. IFAC-PapersOnLine 2021, 54, 584–589.
24. Said, S.; Le Bihan, N.; Manton, J. Hidden Markov chains and fields with observations in Riemannian manifolds. IFAC-PapersOnLine 2021, 54, 719–724.
25. Tupker, Q.; Said, S.; Mostajeran, C. Online Learning of Riemannian Hidden Markov Models in Homogeneous Hadamard Spaces. In Geometric Science of Information; Nielsen, F., Barbaresco, F., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 37–44.
26. Said, S.; Bombrun, L.; Berthoumieu, Y.; Manton, J.H. Riemannian Gaussian Distributions on the Space of Symmetric Positive Definite Matrices. IEEE Trans. Inf. Theory 2017, 63, 2153–2170.
27. Said, S.; Hajri, H.; Bombrun, L.; Vemuri, B.C. Gaussian Distributions on Riemannian Symmetric Spaces: Statistical Learning With Structured Covariance Matrices. IEEE Trans. Inf. Theory 2018, 64, 752–772.
28. Said, S.; Mostajeran, C.; Heuveline, S. Gaussian distributions on Riemannian symmetric spaces of nonpositive curvature. In Handbook of Statistics; Elsevier: Amsterdam, The Netherlands, 2022.
29. Heuveline, S.; Said, S.; Mostajeran, C. Gaussian Distributions on Riemannian Symmetric Spaces in the Large N Limit. In Geometric Science of Information; Nielsen, F., Barbaresco, F., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 20–28.
30. Santilli, L.; Tierz, M. Riemannian Gaussian distributions, random matrix ensembles and diffusion kernels. Nucl. Phys. B 2021, 973, 115582.
31. Said, S.; Heuveline, S.; Mostajeran, C. Riemannian statistics meets random matrix theory: Towards learning from high-dimensional covariance matrices. IEEE Trans. Inf. Theory 2022, submitted for publication.
32. Zanini, P.; Said, S.; Cavalcante, C.C.; Berthoumieu, Y. Stochastic EM algorithm for mixture estimation on manifolds. In Proceedings of the 2017 IEEE 7th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), Curacao, The Netherlands, 10–13 December 2017; pp. 1–5.
33. Zanini, P.; Said, S.; Berthoumieu, Y.; Congedo, M.; Jutten, C. Riemannian Online Algorithms for Estimating Mixture Model Parameters. In Geometric Science of Information (GSI 2017); Springer: Berlin/Heidelberg, Germany, 2017.
34. Nickel, M.; Kiela, D. Poincaré Embeddings for Learning Hierarchical Representations. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
35. Said, S.; Bombrun, L.; Berthoumieu, Y. New Riemannian Priors on the Univariate Normal Model. Entropy 2014, 16, 4015–4031.
Figure 1. Visual representation of the Riemannian Gaussian model estimated from 10,000 observations from three vantage points: top view (left), side view (middle), and front view (right). Each $2 \times 2$ SPD-valued observation is plotted as a point in the interior of the pointed convex cone $\{ (a, b, c) \in \mathbb{R}^3 : a \geq 0,\ ac - b^2 \geq 0 \}$. The shaded compact regions within the cone are superlevel sets of the 5 estimated Riemannian Gaussian densities that represent the observation likelihoods.
Table 1. Comparison of the performance of the method of moments algorithm proposed in this paper against previously published algorithms for estimating HMMs with observations in the Poincaré disk.

| | EM algorithm from [24] | Online algorithm from [25] | Proposed algorithm, (a) $\bar{\tau} = 1$, (b) $\bar{\tau} = 2$, (c) $\bar{\tau} = 3$ |
| Mean error, $\left( \sum_i d^2(\bar{y}_i, \hat{y}_i) \right)^{1/2}$ | 0.88 | 0.97 | 0.69 |
| Dispersion error, $\left( \sum_i (\sigma_i - \hat{\sigma}_i)^2 \right)^{1/2}$ | 0.42 | 0.37 | 0.34 |
| Transition matrix error, $\| P - \hat{P} \|_F$ | 0.35 | 0.30 | (a) 0.42, (b) 0.26, (c) 0.21 |
| Average runtime | ~1 h | ~190 s | ~20 s |
Table 2. True and estimated Riemannian Gaussian mixture model parameters. $\hat{y}_i$ and $\hat{\sigma}_i$ denote the estimated Riemannian Gaussian means and standard deviations, respectively. $\pi$ and $\hat{\pi}$ denote the true and estimated stationary distributions, respectively. The $2 \times 2$ SPD means are written row-wise as $[a, b;\ b, c]$.

| $i$ | $\bar{y}_i$ | $\hat{y}_i$ | $\sigma_i$ | $\hat{\sigma}_i$ | $[\pi]_i$ | $[\hat{\pi}]_i$ |
| 1 | [1.646, 0.056; 0.056, 2.379] | [1.642, 0.051; 0.051, 2.383] | 0.1 | 0.099 | 0.227 | 0.229 |
| 2 | [2.294, 0.744; 0.744, 1.415] | [2.300, 0.743; 0.743, 1.412] | 0.1 | 0.100 | 0.171 | 0.159 |
| 3 | [2.631, 0.127; 0.127, 1.277] | [2.642, 0.128; 0.128, 1.277] | 0.1 | 0.099 | 0.199 | 0.201 |
| 4 | [0.674, 0.454; 0.454, 2.056] | [0.672, 0.454; 0.454, 2.057] | 0.1 | 0.101 | 0.195 | 0.195 |
| 5 | [1.829, 0.919; 0.919, 1.602] | [1.830, 0.920; 0.920, 1.604] | 0.1 | 0.101 | 0.207 | 0.216 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
