Article

Clustering Matrix Variate Longitudinal Count Data

Sanjeena Subedi
School of Mathematics and Statistics, Carleton University, Ottawa, ON K1S 5B6, Canada
Analytics 2023, 2(2), 426-437; https://doi.org/10.3390/analytics2020024
Submission received: 24 November 2022 / Revised: 28 January 2023 / Accepted: 3 April 2023 / Published: 5 May 2023
(This article belongs to the Special Issue Feature Papers in Analytics)

Abstract
Matrix variate longitudinal discrete data can arise in transcriptomics studies when data are collected for $N$ genes at $r$ conditions over $t$ time points, so that each observation $\mathbf{Y}_n$, for $n = 1, \ldots, N$, can be written as an $r \times t$ matrix. When dealing with such data, the number of parameters in the model can be greatly reduced by exploiting the matrix variate structure, and the components of the covariance matrix then also admit a meaningful interpretation. In this work, a mixture of matrix variate Poisson log-normal distributions is introduced for clustering longitudinal read counts from RNA-seq studies. To account for the longitudinal nature of the data, a modified Cholesky decomposition is utilized for a component of the covariance structure. Furthermore, a parsimonious family of models is developed by imposing constraints on elements of these decompositions. The models are applied to both real and simulated data, and it is demonstrated that the proposed approach can recover the underlying cluster structure.

1. Introduction

Biological studies are a major source of longitudinal data. When modelling such longitudinal datasets, it is important to take into account the correlations among measurements at different time points. Gene expression time course studies present important clustering and classification problems. Understanding how different genes are modulated over time during an event of interest can provide key insight into their involvement in various biological pathways [1,2,3,4]. Cluster analysis allows us to group genes into clusters with similar patterns, or ‘expression profiles’, over time.
Model-based clustering has been shown to be an effective approach for clustering a wide variety of biological datasets, including microarray data [5,6,7], RNA-seq data [8,9,10], microbiome data [11,12], and flow cytometry data [13,14]. Several studies have utilized cluster analysis to gain novel insights into various biological phenomena, such as the identification of novel tumour subtypes [15,16], understanding disease progression [17,18], and understanding crops’ responses to abiotic stresses [19,20].
Model-based clustering utilizes finite mixture models [21] for cluster analysis; it assumes that a population is a mixture of $G$ subpopulations or components, each of which can be modelled using a probability distribution. A random vector $\mathbf{Y}$ is said to arise from a parametric finite mixture distribution if its density can be written in the form
$$f(\mathbf{y} \mid \boldsymbol{\Theta}) = \sum_{g=1}^{G} \pi_g\, f(\mathbf{y} \mid \boldsymbol{\theta}_g), \qquad (1)$$
where $\pi_g \in [0, 1]$, with $\sum_{g=1}^{G} \pi_g = 1$, is the mixing proportion of the $g$th component, $f(\mathbf{y} \mid \boldsymbol{\theta}_g)$ is the density of the $g$th component, and $\boldsymbol{\Theta} = (\pi_1, \ldots, \pi_G, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_G)$ denotes the model parameters. The choice of an appropriate probability density/mass function depends on the data type.
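As a concrete illustration of the mixture density in Equation (1), the following minimal sketch (in Python, with hypothetical Gaussian components and made-up parameter values; it is not code from this paper) evaluates $f(\mathbf{y} \mid \boldsymbol{\Theta})$ as a weighted sum of component densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(y, pis, thetas):
    # f(y | Theta) = sum_g pi_g f(y | theta_g), as in Equation (1)
    return sum(pi * multivariate_normal.pdf(y, mean=m, cov=S)
               for pi, (m, S) in zip(pis, thetas))

# Two-component toy example: pi = (0.6, 0.4), Gaussian component densities
pis = [0.6, 0.4]
thetas = [(np.zeros(2), np.eye(2)),
          (np.full(2, 3.0), 2.0 * np.eye(2))]
print(mixture_density(np.array([1.0, 1.0]), pis, thetas))
```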
Various approaches have been developed for clustering time course gene expression data (e.g., [6,7,22,23,24,25]). However, most statistical approaches for clustering gene expression data are tailored to microarray studies. While this is partly because RNA-seq technology is more recent than microarrays, the computational cost of fitting the multivariate discrete models needed for RNA-seq data has also posed challenges. The transcriptomics data arising from RNA-seq studies are discrete, high-dimensional, over-dispersed count data, and efficiently analyzing these data remains a challenge. Because of the restrictive mean-variance relationship of the Poisson distribution, the negative binomial distribution emerged as the univariate distribution of choice [26,27]. However, the multivariate negative binomial distribution [28] is seldom used in practice due to the computational cost of fitting such a model [29].
Recently, ref. [10] proposed mixtures of multivariate Poisson log-normal (MPLN) models for clustering over-dispersed multivariate count data. In the MPLN model, the counts of the $n$th gene (with $n = 1, 2, \ldots, N$) are modelled using a hierarchical Poisson log-normal distribution such that $Y_{nj} \mid X_{nj} \sim \text{Poisson}(e^{X_{nj} + \log c_j})$ and $\mathbf{X}_n \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\mathbf{X}_n = (X_{n1}, X_{n2}, \ldots, X_{np})'$, $\mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes a $p$-dimensional Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$, and $c_j$ is a known constant representing the normalized library size of the $j$th sample. This hierarchical structure allows for over-dispersion similar to the negative binomial distribution, but it also allows for correlation between the variables. An efficient framework for parameter estimation for mixtures of MPLN distributions that utilizes a variational Gaussian approximation was developed by [30].
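To make the two-stage hierarchy concrete, the sketch below simulates from this model under assumed toy values of $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, and $c_j$ (illustrative only; it is not the estimation code of [10] or [30]):

```python
import numpy as np

rng = np.random.default_rng(1)

def rmpln(n, mu, Sigma, c):
    # Stage 1: latent X_n ~ N_p(mu, Sigma)
    X = rng.multivariate_normal(mu, Sigma, size=n)
    # Stage 2: Y_nj | X_nj ~ Poisson(exp(X_nj + log c_j))
    return rng.poisson(np.exp(X + np.log(c)))

mu = np.array([2.0, 2.5, 3.0])
Sigma = 0.3 * np.eye(3) + 0.1   # correlated latent layer -> over-dispersed, correlated counts
counts = rmpln(5, mu, Sigma, c=np.ones(3))
print(counts)
```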
In RNA-seq studies, it is common to obtain expression levels of $N$ genes at $r$ conditions over $t$ occasions in one study. A natural framework for modelling such data is a matrix variate approach in which each observation $\mathbf{Y}_n$ is an $r \times t$ matrix. Alternatively, a multivariate framework can be utilized, where each observation is written as a vector $\operatorname{vec}(\mathbf{Y}_n)$ of dimensionality $rt = r \times t$. In [31], the authors developed mixtures of matrix variate Poisson log-normal (MVPLN) distributions. In the MVPLN model, $Y_{n,ij} \mid X_{n,ij} \sim \text{Poisson}(e^{X_{n,ij} + \log C_{ij}})$ and $\mathbf{X}_n \sim \mathcal{N}_{r \times t}(\mathbf{M}, \boldsymbol{\Phi}, \boldsymbol{\Omega})$, where $\mathbf{X}_n$ is an $r \times t$ matrix, $\mathcal{N}_{r \times t}(\mathbf{M}, \boldsymbol{\Phi}, \boldsymbol{\Omega})$ denotes a matrix normal distribution with location matrix $\mathbf{M}$ and scale matrices $\boldsymbol{\Phi}$ and $\boldsymbol{\Omega}$, and $\mathbf{C}$ is an $r \times t$ matrix whose entry $C_{ij}$ is the normalized library size of the $i$th condition at the $j$th time point. Mathematically, the MVPLN model is equivalent to $Y_{n,ij} \mid X_{n,ij} \sim \text{Poisson}(e^{X_{n,ij} + \log C_{ij}})$ and $\operatorname{vec}(\mathbf{X}_n) \sim \mathcal{N}_{rt}(\operatorname{vec}(\mathbf{M}), \boldsymbol{\Sigma} = \boldsymbol{\Phi} \otimes \boldsymbol{\Omega})$, where $\mathcal{N}_{rt}$ is an $rt$-dimensional multivariate normal distribution and $\otimes$ is the Kronecker product. By adopting a matrix variate form, the large $rt \times rt$ covariance matrix of the latent variable $\mathbf{X}$ can be written as the Kronecker product of two smaller scale matrices, i.e., $\boldsymbol{\Sigma}_{rt \times rt} = \boldsymbol{\Phi}_{r \times r} \otimes \boldsymbol{\Omega}_{t \times t}$. This greatly reduces the number of parameters in $\boldsymbol{\Sigma}$.
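The parameter reduction can be seen directly. A minimal sketch (with toy, assumed $\boldsymbol{\Phi}$ and $\boldsymbol{\Omega}$) that builds $\boldsymbol{\Sigma} = \boldsymbol{\Phi} \otimes \boldsymbol{\Omega}$ and counts free parameters:

```python
import numpy as np

r, t = 3, 4
Phi = np.eye(r) + 0.2 * (1 - np.eye(r))   # r x r condition scale matrix (toy)
Omega = 0.5 ** np.abs(np.subtract.outer(np.arange(t), np.arange(t)))  # t x t, AR(1)-like (toy)

Sigma = np.kron(Phi, Omega)               # rt x rt covariance of vec(X)
print(Sigma.shape)                        # (12, 12)

# Unstructured Sigma vs. separable Phi (x) Omega
# (the Kronecker factors are identified only up to one multiplicative constant)
print(r * t * (r * t + 1) // 2)                 # 78 free parameters
print(r * (r + 1) // 2 + t * (t + 1) // 2 - 1)  # 15 free parameters
```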
In Section 2, we extend the mixtures of matrix variate Poisson log-normal distributions to the clustering of matrix variate longitudinal RNA-seq data by incorporating a modified Cholesky decomposition of the scale matrix $\boldsymbol{\Omega}$, which captures the covariances among the $t$ occasions of the latent variable $\mathbf{X}$. Furthermore, by constraining the components of this decomposition to be equal or different across groups, a family of eight models is obtained. Parameter estimation is performed using a variational variant of the expectation-maximization algorithm. In Section 3, the proposed models are applied to simulated and real datasets, and Section 4 concludes the paper.

2. Methods

A matrix variate Poisson log-normal distribution was proposed by [31] for modelling RNA-seq data. It arises as a hierarchical combination of independent Poisson distributions with a matrix variate Gaussian distribution. Suppose $\mathbf{Y}_1, \ldots, \mathbf{Y}_N$ are $N$ observations from a matrix variate Poisson log-normal distribution, where the $n$th observation $\mathbf{Y}_n$ is an $r \times t$ matrix representing $r$ conditions and $t$ time points. The matrix variate Poisson log-normal model for RNA-seq data can be written as
$$Y_{n,ij} \mid X_{n,ij} \sim \text{Poisson}\left(e^{X_{n,ij} + \log C_{ij}}\right) \quad \text{and} \quad \mathbf{X}_n \sim \mathcal{N}_{r \times t}(\mathbf{M}, \boldsymbol{\Phi}, \boldsymbol{\Omega}),$$
where $\mathbf{M}$ is an $r \times t$ matrix of means, $\mathbf{C}$ is an $r \times t$ matrix of fixed, known constants that account for the differences in library sizes across samples, and $\boldsymbol{\Phi}$ and $\boldsymbol{\Omega}$ are $r \times r$ and $t \times t$ scale matrices, respectively. The least-squares predictor of the unobserved latent variable $X_{n,ij}$ of the $n$th observation from the $i$th condition at the $j$th time point can be written as
$$\hat{X}_{n,ij} = M_{ij} + \sum_{k=1}^{j-1} (-\rho_{jk}) \left( X_{n,ik} - M_{ik} \right) + \sqrt{d_j}\, \varepsilon_j,$$
where $X_{n,i1}, X_{n,i2}, \ldots, X_{n,i(j-1)}$ are the unobserved latent variables from the preceding $j-1$ time points, $\varepsilon_j \sim \mathcal{N}(0, 1)$, and the $\rho_{jk}$ and $d_j$ are the autoregressive parameters and innovation variances, respectively [32]. Thus, when the responses are recorded over $t$ time points, the parameter $\boldsymbol{\Omega}$, which captures the covariance among the $t$ time points, can be parameterized to account for the relationship between measurements at different time points. Specifically, $\boldsymbol{\Omega}$ can be decomposed using the modified Cholesky decomposition [7,23,32] such that $\mathbf{T} \boldsymbol{\Omega} \mathbf{T}' = \mathbf{D}$, or equivalently $\boldsymbol{\Omega}^{-1} = \mathbf{T}' \mathbf{D}^{-1} \mathbf{T}$, where $\mathbf{D}$ is a unique diagonal matrix with the innovation variances $d_1, \ldots, d_t$ on its diagonal, and $\mathbf{T}$ is a unique lower triangular matrix with 1s on the diagonal and the autoregressive parameters ($\rho_{jk}$, with $j > k$) as the off-diagonal elements:
$$\mathbf{T} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \rho_{21} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{t1} & \rho_{t2} & \cdots & 1 \end{pmatrix}.$$
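A minimal numerical sketch of this decomposition (assuming a standard Cholesky-based construction; the sign convention linking the off-diagonal entries of $\mathbf{T}$ to the autoregressive coefficients varies across references):

```python
import numpy as np

def modified_cholesky(Omega):
    # Find unit lower triangular T and diagonal D with T @ Omega @ T.T = D
    C = np.linalg.cholesky(Omega)               # Omega = C C', C lower triangular
    T = np.diag(np.diag(C)) @ np.linalg.inv(C)  # rescale rows of C^{-1} -> unit diagonal
    D = np.diag(np.diag(C) ** 2)                # innovation variances d_1, ..., d_t
    return T, D

Omega = 0.7 ** np.abs(np.subtract.outer(np.arange(4), np.arange(4)))  # toy t x t matrix
T, D = modified_cholesky(Omega)
assert np.allclose(T @ Omega @ T.T, D)
print(np.round(T, 3), np.round(np.diag(D), 3))
```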
In the context of a G-component mixture of matrix variate Poisson log-normal distributions [31], Equation (1) can be written as
$$f(\mathbf{Y} \mid \boldsymbol{\Theta}) = \sum_{g=1}^{G} \pi_g\, f(\mathbf{Y} \mid \mathbf{M}_g, \boldsymbol{\Phi}_g, \boldsymbol{\Omega}_g),$$
where $\pi_g > 0$ is the mixing proportion of the $g$th component, with $\sum_{g=1}^{G} \pi_g = 1$; $f(\mathbf{Y} \mid \mathbf{M}_g, \boldsymbol{\Phi}_g, \boldsymbol{\Omega}_g)$ is the marginal distribution function of the $g$th component with parameters $\mathbf{M}_g$, $\boldsymbol{\Phi}_g$, and $\boldsymbol{\Omega}_g$; and $\boldsymbol{\Theta}$ denotes all the model parameters.

2.1. Longitudinal Data and Family of Models

Given the longitudinal nature of the $t$ time points, one can utilize the modified Cholesky decomposition for $\boldsymbol{\Omega}_g$ such that $\boldsymbol{\Omega}_g^{-1} = \mathbf{T}_g' \mathbf{D}_g^{-1} \mathbf{T}_g$. The number of parameters in $\boldsymbol{\Omega}_g$, i.e., $t(t+1)/2$, grows quadratically with the number of time points, and this is further compounded in mixture models, as $G$ different $\boldsymbol{\Omega}_g$ matrices need to be estimated. Thus, similar to [7], the matrices $\mathbf{T}_g$ and $\mathbf{D}_g$ can be constrained to be equal or different across groups, and an isotropic constraint, $\mathbf{D}_g = \delta_g \mathbf{I}$ with $\delta_g$ a scalar, can also be imposed on $\mathbf{D}_g$. The various combinations of these constraints result in a family of eight models (see Table 1).
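The parameter counts in Table 1 follow directly from these constraints; a small sketch (our own illustrative helper, not part of the paper) that reproduces the counts:

```python
def omega_params(model, G, t):
    # model is a three-letter code, e.g., "VVA": T_g constraint, D_g constraint, D_g diagonal type
    T_part = (G if model[0] == "V" else 1) * t * (t - 1) // 2   # autoregressive parameters
    if model[2] == "A":                                         # anisotropic: t innovation variances
        D_part = (G if model[1] == "V" else 1) * t
    else:                                                       # isotropic: D_g = delta_g * I
        D_part = G if model[1] == "V" else 1
    return T_part + D_part

for m in ("VVA", "EVA", "VEA", "EEA", "VVI", "EVI", "VEI", "EEI"):
    print(m, omega_params(m, G=2, t=4))
```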
Parameter estimation is outlined in Section 2.2. In cluster analysis, the best constraint for a specific dataset is also unknown a priori. Selection of the best-fitting model among the eight models in the family for a given dataset is performed using a model selection criterion, which is discussed in detail in Section 2.3.

2.2. Parameter Estimation

Parameter estimation for mixture models is typically conducted using the traditional expectation-maximization (EM; [33]) algorithm. In the case of the MVPLN model, this requires computing the posterior expectations $E(\mathbf{X} \mid \mathbf{Y})$ and $E(\mathbf{X}\mathbf{X}' \mid \mathbf{Y})$. However, the posterior distribution of the latent variable, i.e., $\mathbf{X} \mid \mathbf{Y}$, does not have a known form; thus, a Markov chain Monte Carlo expectation-maximization (MCMC-EM) algorithm is typically employed to estimate $E(\mathbf{X} \mid \mathbf{Y})$ and $E(\mathbf{X}\mathbf{X}' \mid \mathbf{Y})$ empirically. Such an approach can be computationally intensive [10,31].
Subedi and Browne [30] developed an efficient parameter estimation algorithm for the multivariate Poisson log-normal distribution using variational approximations [34]. This approach uses a computationally convenient approximating density in place of a more complex but ‘true’ posterior density, chosen by minimizing the Kullback-Leibler (KL) divergence between the true and approximating densities. Given an approximating density $q(\mathbf{X})$, the marginal log probability mass function can be written as $\log f_{\mathbf{Y}}(\mathbf{Y}) = F(q, \mathbf{Y}) + D_{KL}(q \,\|\, f)$, where $D_{KL}(q \,\|\, f) = \int q(\mathbf{X}) \log \frac{q(\mathbf{X})}{f(\mathbf{X} \mid \mathbf{Y})}\, d\mathbf{X}$ is the KL divergence between $f(\mathbf{X} \mid \mathbf{Y})$ and the approximating distribution $q(\mathbf{X})$, and $F(q, \mathbf{Y}) = \int \left[ \log f(\mathbf{Y}, \mathbf{X}) - \log q(\mathbf{X}) \right] q(\mathbf{X})\, d\mathbf{X}$ is the evidence lower bound (ELBO). A Gaussian density is utilized as the approximating density in variational Gaussian approximations. Similar to [31], assuming $q(\mathbf{X}_{ng}) = \mathcal{N}_{r \times t}(\boldsymbol{\xi}_{ng}, \boldsymbol{\Delta}_{ng}, \boldsymbol{\kappa}_{ng})$, the ELBO for observation $\mathbf{Y}_n$ from the $g$th component can be written as
$$\begin{aligned} F(q_{ng}, \mathbf{Y}_n) ={} & -\sum_{i=1}^{r} \sum_{j=1}^{t} \exp\left\{ (\boldsymbol{\xi}_{ng})_{ij} + \tfrac{1}{2} (\boldsymbol{\Delta}_{ng})_{ii} (\boldsymbol{\kappa}_{ng})_{jj} + \log C_{ij} \right\} + \left[ \operatorname{vec}(\boldsymbol{\xi}_{ng}) + \log \operatorname{vec}(\mathbf{C}) \right]' \operatorname{vec}(\mathbf{Y}_n) \\ & - \sum_{k=1}^{rt} \log\left( \operatorname{vec}(\mathbf{Y}_n)_k ! \right) - \frac{t}{2} \log |\boldsymbol{\Phi}_g| + \frac{r}{2} \log \left| \mathbf{T}_g' \mathbf{D}_g^{-1} \mathbf{T}_g \right| \\ & - \frac{1}{2} \Big\{ \left[ \operatorname{vec}(\boldsymbol{\xi}_{ng}) - \operatorname{vec}(\mathbf{M}_g) \right]' \left[ \boldsymbol{\Phi}_g^{-1} \otimes \left( \mathbf{T}_g' \mathbf{D}_g^{-1} \mathbf{T}_g \right) \right] \left[ \operatorname{vec}(\boldsymbol{\xi}_{ng}) - \operatorname{vec}(\mathbf{M}_g) \right] \\ & \qquad + \operatorname{tr}\left( \boldsymbol{\Phi}_g^{-1} \boldsymbol{\Delta}_{ng} \right) \operatorname{tr}\left( \mathbf{T}_g' \mathbf{D}_g^{-1} \mathbf{T}_g\, \boldsymbol{\kappa}_{ng} \right) \Big\} + \frac{t}{2} \log |\boldsymbol{\Delta}_{ng}| + \frac{r}{2} \log |\boldsymbol{\kappa}_{ng}| + \frac{rt}{2}. \end{aligned}$$
Thus, the complete-data log-likelihood for the mixtures of MVPLN distributions can be written as
$$l_c(\boldsymbol{\Theta}) = \sum_{g=1}^{G} \sum_{n=1}^{N} z_{ng} \log \pi_g + \sum_{g=1}^{G} \sum_{n=1}^{N} z_{ng} \log f(\mathbf{Y}_n \mid \mathbf{M}_g, \boldsymbol{\Phi}_g, \mathbf{T}_g, \mathbf{D}_g) = \sum_{g=1}^{G} \sum_{n=1}^{N} z_{ng} \log \pi_g + \sum_{g=1}^{G} \sum_{n=1}^{N} z_{ng} \left[ F(q_{ng}, \mathbf{Y}_n) + D_{KL}(q_{ng} \,\|\, f_{ng}) \right],$$
where $D_{KL}(q_{ng} \,\|\, f_{ng}) = \int q(\mathbf{X}_{ng}) \log \frac{q(\mathbf{X}_{ng})}{f(\mathbf{X}_n \mid \mathbf{Y}_n, Z_{ng} = 1)}\, d\mathbf{X}_{ng}$ is the KL divergence between $f(\mathbf{X}_n \mid \mathbf{Y}_n, Z_{ng} = 1)$ and the approximating distribution $q(\mathbf{X}_{ng})$. As the variational parameters that maximize the ELBO also minimize the KL divergence, estimates of the model parameters are obtained by maximizing the variational approximation of the complete-data log-likelihood via the ELBO, i.e.,
$$l_c(\boldsymbol{\Theta}) \approx \sum_{g=1}^{G} \sum_{n=1}^{N} z_{ng} \log \pi_g + \sum_{g=1}^{G} \sum_{n=1}^{N} z_{ng}\, F(q_{ng}, \mathbf{Y}_n).$$
Similar to [31], an iterative EM-type algorithm is utilized for parameter estimation. At the $(k+1)$th iteration:
  • Conditional on the variational parameters $\boldsymbol{\xi}_{ng}$, $\boldsymbol{\Delta}_{ng}$, and $\boldsymbol{\kappa}_{ng}$ and on $\mathbf{M}_g$, $\boldsymbol{\Phi}_g$, $\mathbf{T}_g$, and $\mathbf{D}_g$, the expectation $E(Z_{ng} \mid \mathbf{Y}_n)$ is given by
    $$E(Z_{ng} \mid \mathbf{Y}_n) = \frac{ \pi_g f(\mathbf{Y}_n \mid \mathbf{M}_g, \boldsymbol{\Phi}_g, \mathbf{T}_g, \mathbf{D}_g) }{ \sum_{h=1}^{G} \pi_h f(\mathbf{Y}_n \mid \mathbf{M}_h, \boldsymbol{\Phi}_h, \mathbf{T}_h, \mathbf{D}_h) }.$$
    As the marginal distribution of $\mathbf{Y}_n$ is difficult to compute, we approximate $E(Z_{ng})$ using the ELBO such that
    $$\hat{Z}_{ng}^{(k+1)} \stackrel{\text{def}}{=} \frac{ \pi_g \exp\left\{ F(q_{ng}, \mathbf{Y}_n) \right\} }{ \sum_{h=1}^{G} \pi_h \exp\left\{ F(q_{nh}, \mathbf{Y}_n) \right\} }$$
    (a numerically stable sketch of this computation is given after this list).
  • Given $\hat{Z}_{ng}^{(k+1)}$, the variational parameters $\boldsymbol{\xi}_{ng}$, $\boldsymbol{\Delta}_{ng}$, and $\boldsymbol{\kappa}_{ng}$ are updated conditional on $\mathbf{M}_g^{(k)}$, $\boldsymbol{\Phi}_g^{(k)}$, $\mathbf{T}_g^{(k)}$, and $\mathbf{D}_g^{(k)}$:
    (a) A fixed-point method is used to update $\boldsymbol{\Delta}_{ng}$:
    $$\boldsymbol{\Delta}_{ng}^{(k+1)} = t \left[ \mathbf{I}_{r \times r} \odot \left\{ \exp\left( \boldsymbol{\xi}_{ng}^{(k)} + \log \mathbf{C} + \tfrac{1}{2}\, \operatorname{diag}(\boldsymbol{\Delta}_{ng}^{(k)})\, \operatorname{diag}(\boldsymbol{\kappa}_{ng}^{(k)})' \right) \operatorname{diag}(\boldsymbol{\kappa}_{ng}^{(k)})\, \mathbf{1}_r' \right\} + \boldsymbol{\Phi}_g^{-1(k)} \operatorname{tr}\left( \mathbf{T}_g^{(k)\prime} \mathbf{D}_g^{-1(k)} \mathbf{T}_g^{(k)} \boldsymbol{\kappa}_{ng}^{(k)} \right) \right]^{-1},$$
    where $\exp(\cdot)$ applied to a matrix or vector denotes the elementwise exponential, $\operatorname{diag}(\boldsymbol{\kappa}) = (\kappa_{11}, \ldots, \kappa_{tt})'$ puts the diagonal elements of the $t \times t$ matrix $\boldsymbol{\kappa}$ into a $t$-dimensional vector (and similarly for $\operatorname{diag}(\boldsymbol{\Delta})$), $\mathbf{1}_r$ is an $r$-dimensional vector of ones, and $\odot$ is the Hadamard (elementwise) product.
    (b) A fixed-point method is used to update $\boldsymbol{\kappa}_{ng}$:
    $$\boldsymbol{\kappa}_{ng}^{(k+1)} = r \left[ \mathbf{I}_{t \times t} \odot \left\{ \exp\left( \boldsymbol{\xi}_{ng}^{(k)} + \log \mathbf{C} + \tfrac{1}{2}\, \operatorname{diag}(\boldsymbol{\Delta}_{ng}^{(k+1)})\, \operatorname{diag}(\boldsymbol{\kappa}_{ng}^{(k)})' \right)' \operatorname{diag}(\boldsymbol{\Delta}_{ng}^{(k+1)})\, \mathbf{1}_t' \right\} + \boldsymbol{\Omega}_g^{-1(k)} \operatorname{tr}\left( \boldsymbol{\Phi}_g^{-1(k)} \boldsymbol{\Delta}_{ng}^{(k+1)} \right) \right]^{-1},$$
    where $\mathbf{1}_t$ is a $t$-dimensional vector of ones and $\boldsymbol{\Omega}_g^{-1(k)} = \mathbf{T}_g^{(k)\prime} \mathbf{D}_g^{-1(k)} \mathbf{T}_g^{(k)}$.
    (c) Newton's method is used to update $\boldsymbol{\xi}_{ng}$:
    $$\operatorname{vec}(\boldsymbol{\xi}_{ng}^{(k+1)}) = \operatorname{vec}(\boldsymbol{\xi}_{ng}^{(k)}) + \boldsymbol{\Psi}_{ng}^{(k+1)} \left\{ \operatorname{vec}(\mathbf{Y}_n) - \exp\left[ \log \operatorname{vec}(\mathbf{C}) + \operatorname{vec}(\boldsymbol{\xi}_{ng}^{(k)}) + \tfrac{1}{2} \operatorname{diag}(\boldsymbol{\Psi}_{ng}^{(k+1)}) \right] - \boldsymbol{\Psi}_{ng}^{-1(k+1)} \left[ \operatorname{vec}(\boldsymbol{\xi}_{ng}^{(k)}) - \operatorname{vec}(\mathbf{M}_g^{(k)}) \right] \right\},$$
    where $\boldsymbol{\Psi}_{ng}^{(k+1)} = \boldsymbol{\Delta}_{ng}^{(k+1)} \otimes \boldsymbol{\kappa}_{ng}^{(k+1)}$.
  • Given $\hat{Z}_{ng}^{(k+1)}$ and the variational parameters $\boldsymbol{\xi}_{ng}^{(k+1)}$, $\boldsymbol{\Delta}_{ng}^{(k+1)}$, and $\boldsymbol{\kappa}_{ng}^{(k+1)}$, the updates of the model parameters $\pi_g$, $\mathbf{M}_g$, and $\boldsymbol{\Phi}_g$ are
    $$\pi_g^{(k+1)} = \frac{\sum_{n=1}^{N} \hat{Z}_{ng}^{(k+1)}}{N}, \qquad \mathbf{M}_g^{(k+1)} = \frac{\sum_{n=1}^{N} \hat{Z}_{ng}^{(k+1)} \boldsymbol{\xi}_{ng}^{(k+1)}}{\sum_{n=1}^{N} \hat{Z}_{ng}^{(k+1)}},$$
    $$\boldsymbol{\Phi}_g^{(k+1)} = \frac{ \sum_{n=1}^{N} \hat{Z}_{ng}^{(k+1)} \left( \boldsymbol{\xi}_{ng}^{(k+1)} - \mathbf{M}_g^{(k+1)} \right) \mathbf{T}_g^{(k)\prime} \mathbf{D}_g^{-1(k)} \mathbf{T}_g^{(k)} \left( \boldsymbol{\xi}_{ng}^{(k+1)} - \mathbf{M}_g^{(k+1)} \right)' + \sum_{n=1}^{N} \hat{Z}_{ng}^{(k+1)} \boldsymbol{\Delta}_{ng}^{(k+1)} \operatorname{tr}\left( \mathbf{T}_g^{(k)\prime} \mathbf{D}_g^{-1(k)} \mathbf{T}_g^{(k)} \boldsymbol{\kappa}_{ng}^{(k+1)} \right) }{ t \sum_{n=1}^{N} \hat{Z}_{ng}^{(k+1)} }.$$
    Estimates of the model parameters $\mathbf{T}_g$ and $\mathbf{D}_g$ can be obtained by maximizing the approximation of the log-likelihood with respect to these parameters. Defining
    $$\mathbf{S}_g^{(k+1)} = \frac{ \sum_{n=1}^{N} \hat{Z}_{ng}^{(k+1)} \left( \boldsymbol{\xi}_{ng}^{(k+1)} - \mathbf{M}_g^{(k+1)} \right)' \boldsymbol{\Phi}_g^{-1(k+1)} \left( \boldsymbol{\xi}_{ng}^{(k+1)} - \mathbf{M}_g^{(k+1)} \right) + \sum_{n=1}^{N} \hat{Z}_{ng}^{(k+1)} \boldsymbol{\kappa}_{ng}^{(k+1)} \operatorname{tr}\left( \boldsymbol{\Phi}_g^{-1(k+1)} \boldsymbol{\Delta}_{ng}^{(k+1)} \right) }{ r \sum_{n=1}^{N} \hat{Z}_{ng}^{(k+1)} }, \qquad (5)$$
    the estimates of $\mathbf{T}_g$ and $\mathbf{D}_g$ are analogous to those in [7]. The elements of $\mathbf{T}_g$ can be estimated as
    $$\begin{pmatrix} \hat{\rho}_{j1}^{(g)} \\ \hat{\rho}_{j2}^{(g)} \\ \vdots \\ \hat{\rho}_{j,j-1}^{(g)} \end{pmatrix} = \begin{pmatrix} S_{11}^{(g)} & S_{12}^{(g)} & \cdots & S_{1,j-1}^{(g)} \\ S_{21}^{(g)} & S_{22}^{(g)} & \cdots & S_{2,j-1}^{(g)} \\ \vdots & \vdots & \ddots & \vdots \\ S_{j-1,1}^{(g)} & S_{j-1,2}^{(g)} & \cdots & S_{j-1,j-1}^{(g)} \end{pmatrix}^{-1} \begin{pmatrix} S_{j1}^{(g)} \\ S_{j2}^{(g)} \\ \vdots \\ S_{j,j-1}^{(g)} \end{pmatrix},$$
    where $j = 2, \ldots, t$, and the update for $\mathbf{D}_g$ can be obtained as
    $$\hat{\mathbf{D}}_g = \hat{\mathbf{T}}_g \mathbf{S}_g \hat{\mathbf{T}}_g'.$$
    Similarly, with $\mathbf{S}_g$ as defined in Equation (5), the updates of $\mathbf{T}_g$ and $\mathbf{D}_g$ are analogous to those in [7] for all eight models with their various constraints; thus, the R package longclust [35] is utilized for the covariance decomposition.
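For the E-step approximation of $E(Z_{ng})$ above, the ratio of exponentiated ELBOs is prone to numerical underflow. A minimal, numerically stable sketch of that single step (with made-up ELBO values; this is not the paper's implementation):

```python
import numpy as np
from scipy.special import logsumexp

def responsibilities(log_pi, elbo):
    # z_hat_ng = pi_g exp{F(q_ng, Y_n)} / sum_h pi_h exp{F(q_nh, Y_n)}, computed on the log scale
    logw = log_pi + elbo                     # N x G matrix of log(pi_g) + F(q_ng, Y_n)
    return np.exp(logw - logsumexp(logw, axis=1, keepdims=True))

log_pi = np.log([0.5, 0.5])
F = np.array([[-10.0, -14.0],                # toy ELBO values for N = 3 observations, G = 2 groups
              [-12.0, -11.0],
              [-9.0, -9.5]])
print(responsibilities(log_pi, F).round(3))
```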
Convergence of the iterative EM-type approach was assessed using the Aitken acceleration-based criterion [36], which computes an asymptotic estimate of the log-likelihood at each iteration and declares convergence when the successive difference in the asymptotic estimates is less than $\varepsilon$ [37]. Here, we used $\varepsilon = 0.05$.
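A minimal sketch of this stopping rule (our own helper, assuming the usual Aitken formulas $a^{(k)} = (l^{(k+1)} - l^{(k)})/(l^{(k)} - l^{(k-1)})$ and $l_\infty^{(k+1)} = l^{(k)} + (l^{(k+1)} - l^{(k)})/(1 - a^{(k)})$):

```python
def aitken_converged(ll, eps=0.05):
    # ll: sequence of (approximate) log-likelihood values, one per iteration
    if len(ll) < 4:
        return False
    def l_inf(l0, l1, l2):
        a = (l2 - l1) / (l1 - l0)            # Aitken acceleration factor
        return l1 + (l2 - l1) / (1.0 - a)    # asymptotic log-likelihood estimate
    # stop when two successive asymptotic estimates differ by less than eps
    return abs(l_inf(*ll[-3:]) - l_inf(*ll[-4:-1])) < eps

ll = [-500.0, -450.0, -430.0, -422.0, -418.8]
print([aitken_converged(ll[:k]) for k in range(3, len(ll) + 1)])
```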

2.3. Model Selection and Performance Assessment

In clustering, the true number of components is typically unknown. Additionally, the best constraint for the covariance structure is also unknown. It is common practice to fit all candidate models for a range of values of $G$ and select the best-fitting model a posteriori using a model selection criterion. The Bayesian information criterion (BIC; [38]) remains the most widely used criterion. In our case, similar to [30], we use an approximation of the BIC defined as
$$\text{BIC} = 2 \log L - p \log N,$$
where $p$ is the number of free parameters, $N$ is the sample size, and $\log L$ is the ELBO-based approximation of the maximized log-likelihood. This approximation is used because the marginal probability mass function has no closed form, so the maximized log-likelihood cannot be computed exactly. When the true cluster memberships are known, clustering performance can be assessed using the adjusted Rand index (ARI; [39]). The ARI measures the agreement between the true and predicted labels, adjusted for agreement by chance; it equals 1 for perfect agreement, and its expected value under random classification is 0.
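A minimal sketch of both quantities (generic helpers with toy values; adjusted_rand_score is from scikit-learn):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def bic(log_lik, n_params, n_obs):
    # BIC on the 'larger is better' scale used here: 2 log L - p log N,
    # with log L replaced by the ELBO-based approximation
    return 2.0 * log_lik - n_params * np.log(n_obs)

print(bic(log_lik=-4200.0, n_params=35, n_obs=1000))      # toy values

truth = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]                                 # same partition, relabelled
print(adjusted_rand_score(truth, pred))                   # 1.0
```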

3. Results

3.1. Simulation Studies

Simulation studies were conducted to demonstrate the clustering performance of the proposed family of models and to assess parameter recovery.

3.1.1. Scenario 1

In the first scenario, 25 datasets, each of size $n = 1000$, were generated from a two-component fully unconstrained “VVA” model. Each observation in a dataset is a $3 \times 4$ (i.e., $r = 3$ and $t = 4$) matrix. The parameters used to generate the datasets are summarized in Table 2.
All eight models with $G = 1$ to $G = 5$ were fitted, and the BIC was used for model selection. In all 25 datasets, the approach selected a two-component “VVA” model, with an ARI of $1.00 \pm 0.00$. The means of the estimated parameters along with their standard deviations are also summarized in Table 2; the estimates are close to the true parameter values.

3.1.2. Scenario 2

In the second scenario, 25 datasets, each of size $n = 1000$, were generated from a two-component model with the same parameters as in Scenario 1 but with the constraint $\mathbf{D}_1 = \cdots = \mathbf{D}_G = \delta \mathbf{I}$ on $\mathbf{D}$ (i.e., a “VEI” model). Again, each observation is a $3 \times 4$ (i.e., $r = 3$ and $t = 4$) matrix. All eight models with $G = 1$ to $G = 5$ were fitted, and the BIC was used for model selection. In all 25 datasets, the approach selected a two-component model, with an average ARI of $1.00 \pm 0.00$. A “VEI” model was selected in 11 of the 25 datasets, and a “VVI” model was selected in the remaining 14. Note that the “VVI” model also imposes an isotropic constraint on $\mathbf{D}_g$, i.e., $\mathbf{D}_g = \delta_g \mathbf{I}$, but allows $\delta_g$ to vary across groups. The means of the estimated parameters along with their standard deviations from the datasets where a two-component “VEI” model was selected are summarized in Table 3; the estimates are close to the true parameter values.

3.1.3. Comparison with Other Approaches

The performance of the proposed models was compared to that of other mixtures of discrete distributions. Since no other approaches for matrix variate discrete data were available, the matrix variate data were first converted to multivariate data before comparison. Two model-based clustering techniques for RNA-seq data were used: HTSCluster [9,40,41], a mixture of Poisson distributions, and MBCluster.Seq [8,42], a mixture of negative binomial distributions. The comparison of the performance of the proposed approach with these two competing approaches on the simulated datasets from both scenarios is summarized in Table 4. Both HTSCluster and MBCluster.Seq failed to recover the underlying cluster structure in both scenarios. This could be partly because both approaches are mixtures of independent univariate distributions, so their performance suffers in the presence of covariance; this is in line with the findings of [30]. Through simulation studies, ref. [30] previously showed that when data are generated from mixtures of independent Poisson distributions, HTSCluster can recover the underlying cluster structure; however, in the presence of over-dispersion (i.e., when the data are generated from a model such as the multivariate Poisson log-normal or negative binomial distribution), the performance of HTSCluster suffers.

3.2. Transcriptomics Data Analysis

The proposed approach was used to cluster the transcriptomics dataset fission from the R package fission, available through Bioconductor. The dataset was originally presented by [43]. The study consists of a time course RNA-seq experiment on fission yeast in response to oxidative stress (1 M sorbitol treatment) at 0, 15, 30, 60, 120, and 180 min in two types of yeast: wild type and mutant (atf21Δ strain). Thus, the measurements for each observation can be written in matrix notation such that the two types of yeast are treated as rows (i.e., $r = 2$) and the time points are treated as columns (i.e., $t = 6$). We treat the time points as longitudinal.
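As an illustration of this matrix arrangement, the sketch below stacks a hypothetical wide count table (column names and values are made up; the actual fission object is a Bioconductor SummarizedExperiment) into $N$ matrices of dimension $2 \times 6$:

```python
import numpy as np
import pandas as pd

times = [0, 15, 30, 60, 120, 180]
cols = [f"{s}_{m}" for s in ("wt", "mut") for m in times]   # hypothetical sample labels
rng = np.random.default_rng(7)
counts = pd.DataFrame(rng.poisson(50, size=(5, 12)), columns=cols)  # 5 toy genes

# Each gene's 12 counts become an r x t matrix: rows = strains (r = 2), columns = time points (t = 6)
Y = counts.to_numpy().reshape(len(counts), 2, 6)
print(Y[0])
```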
For the cluster analysis, we focused on the subset of differentially expressed genes provided in the Supplementary Material of [43]. Genes were considered differentially expressed if their mean expression level differed in at least one time point relative to the unstressed reference, with multiple testing correction applied to keep the overall FDR below 5%. A total of 3169 genes (out of 5957) were differentially expressed in the wild type yeast, and 3044 genes were differentially expressed in the atf21Δ strain. For our analysis, a gene was included if it was differentially expressed in both the wild type and the atf21Δ strain; thus, a total of 2476 genes were included.
All eight models were fitted to the dataset for $G = 1$ to $G = 20$, and the best-fitting model was selected by the BIC. A $G = 17$ component “EVA” model, with constrained $\mathbf{T}_g$ (i.e., $\mathbf{T}_1 = \mathbf{T}_2 = \cdots = \mathbf{T}_{17}$) and unconstrained anisotropic $\mathbf{D}_g$, was selected. The constrained $\mathbf{T}_g$ suggests that the correlation structure (i.e., the autoregressive relationship) among the time points is the same for all groups. However, the unconstrained anisotropic $\mathbf{D}_g$ suggests that the variances at the different time points differ and vary from cluster to cluster. The log-transformed expression values of the genes in each group, along with the mean-expression trends, are shown in Figure 1. As seen in Figure 1, the clusters have distinctive mean-expression trends.

4. Conclusions

A novel family of matrix variate Poisson log-normal mixture models was developed for clustering longitudinal transcriptomics data. The approach utilizes a modified Cholesky decomposition of a component of the covariance structure of the latent variable, and constraints imposed on the elements of this decomposition yield a family of eight models. The performance of the proposed approach was illustrated using both simulated and real datasets, where it showed good clustering performance.
One limitation of the proposed approach is the assumption that measurements are taken at the same fixed intervals for all observations, which can be restrictive. Future work will focus on extending these models to allow for varying interval lengths between observations. Furthermore, time is continuous, and discretizing it can result in a loss of information; work will therefore also focus on developing a modelling framework that treats time as a continuous variable.

Author Contributions

S.S. developed the method, conducted the analysis, and wrote the manuscript. The author has read and agreed to the published version of the manuscript.

Funding

This work was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canada Research Chairs program.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in the manuscript is publicly available through the R package fission.

Acknowledgments

This research was enabled in part by support provided by Research Computing Services (https://carleton.ca/rcs) at Carleton University.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ARI      adjusted Rand index
BIC      Bayesian information criterion
ELBO     evidence lower bound
EM       expectation-maximization
KL       Kullback-Leibler
MCMC-EM  Markov chain Monte Carlo expectation-maximization
MPLN     multivariate Poisson log-normal
MVPLN    matrix variate Poisson log-normal

References

  1. Spellman, P.T.; Sherlock, G.; Zhang, M.Q.; Iyer, V.R.; Anders, K.; Eisen, M.B.; Brown, P.O.; Botstein, D.; Futcher, B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 1998, 9, 3273–3297.
  2. Lee, C.W.; Stabile, E.; Kinnaird, T.; Shou, M.; Devaney, J.M.; Epstein, S.E.; Burnett, M.S. Temporal patterns of gene expression after acute hindlimb ischemia in mice: Insights into the genomic program for collateral vessel development. J. Am. Coll. Cardiol. 2004, 43, 474–482.
  3. Louis, E.; Raue, U.; Yang, Y.; Jemiolo, B.; Trappe, S. Time course of proteolytic, cytokine, and myostatin gene expression after acute exercise in human skeletal muscle. J. Appl. Physiol. 2007, 103, 1744–1751.
  4. Li, Y.; Li, M.; Tan, L.; Huang, S.; Zhao, L.; Tang, T.; Liu, J.; Zhao, Z. Analysis of time-course gene expression profiles of a periodontal ligament tissue model under compression. Arch. Oral Biol. 2013, 58, 511–522.
  5. McLachlan, G.J.; Bean, R.W.; Peel, D. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 2002, 18, 413–422.
  6. Inoue, L.Y.; Neira, M.; Nelson, C.; Gleave, M.; Etzioni, R. Cluster-based network model for time-course gene expression data. Biostatistics 2007, 8, 507–525.
  7. McNicholas, P.D.; Murphy, T.B. Model-based clustering of longitudinal data. Can. J. Stat. 2010, 38, 153–168.
  8. Si, Y.; Liu, P.; Li, P.; Brutnell, T.P. Model-based clustering for RNA-seq data. Bioinformatics 2014, 30, 197–205.
  9. Rau, A.; Maugis-Rabusseau, C.; Martin-Magniette, M.L.; Celeux, G. Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics 2015, 31, 1420–1427.
  10. Silva, A.; Rothstein, S.J.; McNicholas, P.D.; Subedi, S. A multivariate Poisson-log normal mixture model for clustering transcriptome sequencing data. BMC Bioinform. 2019, 20, 394.
  11. Holmes, I.; Harris, K.; Quince, C. Dirichlet multinomial mixtures: Generative models for microbial metagenomics. PLoS ONE 2012, 7, e30126.
  12. Subedi, S.; Neish, D.; Bak, S.; Feng, Z. Cluster analysis of microbiome data by using mixtures of Dirichlet–multinomial regression models. J. R. Stat. Soc. Ser. C (Appl. Stat.) 2020, 69, 1163–1187.
  13. Lo, K.; Brinkman, R.R.; Gottardo, R. Automated gating of flow cytometry data via robust model-based clustering. Cytom. Part A 2008, 73, 321–332.
  14. Chan, C.; Feng, F.; Ottinger, J.; Foster, D.; West, M.; Kepler, T.B. Statistical mixture modeling for cell subtype identification in flow cytometry. Cytom. Part A 2008, 73, 693–701.
  15. Shen, R.; Mo, Q.; Schultz, N.; Seshan, V.E.; Olshen, A.B.; Huse, J.; Ladanyi, M.; Sander, C. Integrative subtype discovery in glioblastoma using iCluster. PLoS ONE 2012, 7, e35236.
  16. Higgins, J.P.; Shinghal, R.; Gill, H.; Reese, J.H.; Terris, M.; Cohen, R.J.; Fero, M.; Pollack, J.R.; Van de Rijn, M.; Brooks, J.D. Gene expression patterns in renal cell carcinoma assessed by complementary DNA microarray. Am. J. Pathol. 2003, 162, 925–932.
  17. Ma, X.J.; Salunga, R.; Tuggle, J.T.; Gaudet, J.; Enright, E.; McQuary, P.; Payette, T.; Pistone, M.; Stecker, K.; Zhang, B.M.; et al. Gene expression profiles of human breast cancer progression. Proc. Natl. Acad. Sci. USA 2003, 100, 5974–5979.
  18. Haqq, C.; Nosrati, M.; Sudilovsky, D.; Crothers, J.; Khodabakhsh, D.; Pulliam, B.L.; Federman, S.; Miller III, J.R.; Allen, R.E.; Singer, M.I.; et al. The gene expression signatures of melanoma progression. Proc. Natl. Acad. Sci. USA 2005, 102, 6092–6097.
  19. Humbert, S.; Subedi, S.; Cohn, J.; Zeng, B.; Bi, Y.M.; Chen, X.; Zhu, T.; McNicholas, P.D.; Rothstein, S.J. Genome-wide expression profiling of maize in response to individual and combined water and nitrogen stresses. BMC Genom. 2013, 14, 3.
  20. Misyura, M.; Guevara, D.; Subedi, S.; Hudson, D.; McNicholas, P.D.; Colasanti, J.; Rothstein, S.J. Nitrogen limitation and high density responses in rice suggest a role for ethylene under high density stress. BMC Genom. 2014, 15, 681.
  21. Wolfe, J.H. Pattern clustering by multivariate mixture analysis. Multivar. Behav. Res. 1970, 5, 329–350.
  22. Luan, Y.; Li, H. Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics 2003, 19, 474–482.
  23. McNicholas, P.D.; Subedi, S. Clustering gene expression time course data using mixtures of multivariate t-distributions. J. Stat. Plan. Inference 2012, 142, 1114–1127.
  24. Coffey, N.; Hinde, J.; Holian, E. Clustering longitudinal profiles using P-splines and mixed effects models applied to time-course gene expression data. Comput. Stat. Data Anal. 2014, 71, 14–29.
  25. Koestler, D.C.; Marsit, C.J.; Christensen, B.C.; Kelsey, K.T.; Houseman, E.A. A recursively partitioned mixture model for clustering time-course gene expression data. Transl. Cancer Res. 2014, 3, 217.
  26. Love, M.I.; Huber, W.; Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014, 15, 550.
  27. Dong, K.; Zhao, H.; Tong, T.; Wan, X. NBLDA: Negative binomial linear discriminant analysis for RNA-Seq data. BMC Bioinform. 2016, 17, 369.
  28. Doss, D. Definition and characterization of multivariate negative binomial distribution. J. Multivar. Anal. 1979, 9, 460–464.
  29. Brijs, T.; Karlis, D.; Swinnen, G.; Vanhoof, K.; Wets, G.; Manchanda, P. A multivariate Poisson mixture model for marketing applications. Stat. Neerl. 2004, 58, 322–348.
  30. Subedi, S.; Browne, R.P. A family of parsimonious mixtures of multivariate Poisson-lognormal distributions for clustering multivariate count data. Stat 2020, 9, e310.
  31. Silva, A.; Rothstein, S.J.; McNicholas, P.D.; Subedi, S. Finite mixtures of matrix-variate Poisson-log normal distributions for three-way count data. arXiv 2018, arXiv:1807.08380.
  32. Pourahmadi, M. Joint mean-covariance models with applications to longitudinal data: Unconstrained parameterisation. Biometrika 1999, 86, 677–690.
  33. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 1977, 39, 1–22.
  34. Wainwright, M.J.; Jordan, M.I. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 2008, 1, 1–305.
  35. McNicholas, P.D.; Jampani, K.R.; Subedi, S. longclust: Model-Based Clustering and Classification for Longitudinal Data; R package version 1.2.3; 2019.
  36. Aitken, A.C. A series formula for the roots of algebraic and transcendental equations. Proc. R. Soc. Edinb. 1926, 45, 14–22.
  37. Böhning, D.; Dietz, E.; Schaub, R.; Schlattmann, P.; Lindsay, B. The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Ann. Inst. Stat. Math. 1994, 46, 373–388.
  38. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
  39. Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218.
  40. Rau, A.; Celeux, G.; Martin-Magniette, M.; Maugis-Rabusseau, C. Clustering High-Throughput Sequencing Data with Poisson Mixture Models; Technical Report RR-7786; INRIA: Saclay, France, 2011.
  41. Rau, A.; Celeux, G.; Martin-Magniette, M.L.; Maugis-Rabusseau, C. HTSCluster: Clustering High-Throughput Transcriptome Sequencing (HTS) Data; R package version 2.0.4; 2016.
  42. Si, Y. MBCluster.Seq: Model-Based Clustering for RNA-Seq Data; R package version 1.0; 2012.
  43. Leong, H.S.; Dawson, K.; Wirth, C.; Li, Y.; Connolly, Y.; Smith, D.L.; Wilkinson, C.R.; Miller, C.J. A global non-coding RNA system modulates fission yeast protein levels in response to stress. Nat. Commun. 2014, 5, 3947.
Figure 1. Visualization of the log-transformed expression values along with the mean-expression values for the two yeast types (solid black line for the mutant type and dashed black line for the wild type) for all 17 clusters of the transcriptomics dataset.
Table 1. The family of eight models obtained by imposing various constraints on the components of $\boldsymbol{\Omega}_g$.

| Model | $\mathbf{T}_g$ across groups | $\mathbf{D}_g$ across groups | $\mathbf{D}_g$ diagonal | Total parameters in $\boldsymbol{\Omega}_1, \ldots, \boldsymbol{\Omega}_G$ |
|-------|----------|----------|-------------|-----------------------|
| "VVA" | Variable | Variable | Anisotropic | $Gt(t-1)/2 + Gt$ |
| "EVA" | Equal    | Variable | Anisotropic | $t(t-1)/2 + Gt$ |
| "VEA" | Variable | Equal    | Anisotropic | $Gt(t-1)/2 + t$ |
| "EEA" | Equal    | Equal    | Anisotropic | $t(t-1)/2 + t$ |
| "VVI" | Variable | Variable | Isotropic   | $Gt(t-1)/2 + G$ |
| "EVI" | Equal    | Variable | Isotropic   | $t(t-1)/2 + G$ |
| "VEI" | Variable | Equal    | Isotropic   | $Gt(t-1)/2 + 1$ |
| "EEI" | Equal    | Equal    | Isotropic   | $t(t-1)/2 + 1$ |
Table 2. True parameters along with the estimated means and standard deviations (SD) of the model parameters for Scenario 1 from all 25 datasets. Matrix entries are listed row-wise, with semicolons separating rows; for the lower triangular $\mathbf{T}_g$ only the lower triangle is shown, and the diagonal $\mathbf{D}_g$ is given by its diagonal entries.

$\mathbf{M}_1$: True [6.20 6.20 6.20 6.20; 6.20 6.20 6.20 6.20; 6.20 6.20 6.20 6.20]; Mean [6.19 6.21 6.20 6.19; 6.21 6.20 6.19 6.20; 6.19 6.21 6.20 6.21]; SD [0.05 0.05 0.07 0.07; 0.06 0.07 0.05 0.08; 0.05 0.07 0.06 0.05]
$\mathbf{M}_2$: True [1.50 1.50 1.50 1.50; 1.50 1.50 1.50 1.50; 1.50 1.50 1.50 1.50]; Mean [1.82 1.68 1.63 1.63; 1.62 1.78 1.66 1.64; 1.65 1.62 1.78 1.67]; SD [0.13 0.08 0.08 0.07; 0.08 0.13 0.07 0.08; 0.09 0.05 0.12 0.09]
$\boldsymbol{\Phi}_1$: True [1.00 0.00 0.00; 0.00 1.46 0.00; 0.00 0.00 1.44]; Mean [1.00 0.01 0.01; 0.01 1.46 0.00; 0.01 0.00 1.46]; SD [0.00 0.03 0.02; 0.03 0.05 0.03; 0.02 0.03 0.05]
$\boldsymbol{\Phi}_2$: True [1.00 0.00 0.00; 0.00 0.82 0.00; 0.00 0.00 0.90]; Mean [1.00 0.00 0.00; 0.00 0.82 0.01; 0.00 0.01 0.89]; SD [0.00 0.03 0.04; 0.03 0.06 0.03; 0.04 0.03 0.06]
$\mathbf{T}_1$: True [1.00; 0.75 1.00; 0.25 0.75 1.00; 0.05 0.25 0.75 1.00]; Mean [1.00; 0.75 1.00; 0.25 0.75 1.00; 0.05 0.25 0.76 1.00]; SD [0.00; 0.03 0.00; 0.02 0.03 0.00; 0.02 0.02 0.02 0.00]
$\mathbf{T}_2$: True [1.00; 0.20 1.00; 0.05 0.30 1.00; 0.01 0.02 0.35 1.00]; Mean [1.00; 0.24 1.00; 0.08 0.39 1.00; 0.03 0.08 0.44 1.00]; SD [0.00; 0.02 0.00; 0.02 0.05 0.00; 0.02 0.04 0.06 0.00]
$\mathbf{D}_1$: True diag(0.53, 0.74, 1.15, 1.82); Mean diag(0.53, 0.75, 1.15, 1.85); SD (0.02, 0.04, 0.06, 0.08)
$\mathbf{D}_2$: True diag(0.10, 0.45, 0.47, 0.33); Mean diag(0.14, 0.62, 0.67, 0.48); SD (0.01, 0.06, 0.06, 0.03)
Table 3. True parameters along with the estimated means and standard deviations (SD) of the model parameters for Scenario 2 from all 25 datasets. The layout follows Table 2.

$\mathbf{M}_1$: True [6.20 6.20 6.20 6.20; 6.20 6.20 6.20 6.20; 6.20 6.20 6.20 6.20]; Mean [6.20 6.22 6.23 6.17; 6.19 6.23 6.15 6.18; 6.22 6.20 6.21 6.23]; SD [0.05 0.07 0.11 0.09; 0.05 0.07 0.11 0.14; 0.04 0.08 0.07 0.11]
$\mathbf{M}_2$: True [1.50 1.50 1.50 1.50; 1.50 1.50 1.50 1.50; 1.50 1.50 1.50 1.50]; Mean [1.60 1.63 1.62 1.59; 1.62 1.59 1.60 1.61; 1.61 1.61 1.60 1.62]; SD [0.08 0.07 0.08 0.07; 0.09 0.07 0.06 0.08; 0.09 0.05 0.04 0.10]
$\boldsymbol{\Phi}_1$: True [1.00 0.00 0.00; 0.00 1.46 0.00; 0.00 0.00 1.44]; Mean [1.00 0.01 0.00; 0.01 1.47 0.01; 0.00 0.01 1.49]; SD [0.00 0.02 0.02; 0.02 0.06 0.03; 0.02 0.03 0.07]
$\boldsymbol{\Phi}_2$: True [1.00 0.00 0.00; 0.00 0.82 0.00; 0.00 0.00 0.90]; Mean [1.00 0.00 0.00; 0.00 0.77 0.00; 0.00 0.00 0.84]; SD [0.00 0.02 0.05; 0.02 0.03 0.02; 0.05 0.02 0.03]
$\mathbf{T}_1$: True [1.00; 0.75 1.00; 0.25 0.75 1.00; 0.05 0.25 0.75 1.00]; Mean [1.00; 0.76 1.00; 0.25 0.74 1.00; 0.05 0.25 0.76 1.00]; SD [0.00; 0.02 0.00; 0.03 0.03 0.00; 0.04 0.03 0.03 0.00]
$\mathbf{T}_2$: True [1.00; 0.20 1.00; 0.05 0.30 1.00; 0.01 0.02 0.35 1.00]; Mean [1.00; 0.24 1.00; 0.07 0.37 1.00; 0.04 0.07 0.41 1.00]; SD [0.00; 0.04 0.00; 0.05 0.04 0.00; 0.03 0.05 0.05 0.00]
$\mathbf{D}_1 = \mathbf{D}_2$: True diag(0.45, 0.45, 0.45, 0.45); Mean diag(0.50, 0.50, 0.50, 0.50); SD (0.02, 0.02, 0.02, 0.02)
Table 4. Comparison of the performance of the proposed approach (longitudinal MVPLN) with HTSCluster and MBCluster.Seq on the simulated datasets from Scenario 1 and Scenario 2.

Simulation Scenario 1

| Approach | G Selected (# of Times) | Average ARI (SD) | Time in Minutes, Average (SD) |
|----------|-------------------------|------------------|-------------------------------|
| Long. MVPLN | 2 (25) | 1.000 (0.000) | 76.590 (20.970) |
| HTSCluster | 5 (25) | 0.002 (0.005) | 0.057 (0.010) |
| MBCluster.Seq | 4 (1), 5 (24) | 0.000 (0.002) | 0.237 (0.004) |

Simulation Scenario 2

| Approach | G Selected (# of Times) | Average ARI (SD) | Time in Minutes, Average (SD) |
|----------|-------------------------|------------------|-------------------------------|
| Long. MVPLN | 2 (25) | 1.000 (0.000) | 78.928 (7.886) |
| HTSCluster | 5 (25) | −0.011 (0.012) | 0.047 (0.007) |
| MBCluster.Seq | 5 (25) | −0.000 (0.004) | 0.237 (0.003) |