Article

Mixture Complexity and Its Application to Gradual Clustering Change Detection

Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
*
Author to whom correspondence should be addressed.
Entropy 2022, 24(10), 1407; https://doi.org/10.3390/e24101407
Submission received: 18 August 2022 / Revised: 27 September 2022 / Accepted: 28 September 2022 / Published: 1 October 2022

Abstract

We consider measuring the number of clusters (cluster size) in finite mixture models in order to interpret their structures. Many existing information criteria have been applied to this issue by regarding the cluster size as identical to the number of mixture components (mixture size); however, this may not be valid in the presence of overlaps or weight biases. In this study, we argue that the cluster size should be measured as a continuous value and propose a new criterion called mixture complexity (MC) to formulate it. It is formally defined from the viewpoint of information theory and can be seen as a natural extension of the cluster size that accounts for overlap and weight bias. Subsequently, we apply MC to the issue of gradual clustering change detection. Conventionally, clustering changes have been regarded as abrupt, induced by changes in the mixture size or cluster size. In contrast, we consider clustering changes to be gradual in terms of MC; this has the benefits of detecting changes earlier and of discriminating between significant and insignificant changes. We further demonstrate that MC can be decomposed according to the hierarchical structure of a mixture model, which helps us analyze the details of its substructures.

1. Introduction

1.1. Motivation

Finite mixture models are widely used for model-based clustering (for overviews and references, see McLachlan and Peel [1] and Fraley and Raftery [2]). In this field, determining the number of components is a typical issue. This number refers to two distinct quantities: the number of elements used to represent the density distribution and the number of clusters used to group the data (referred to as the mixture size and the cluster size, respectively). In this study, we consider the problem of interpreting the cluster size when the mixture size is given. Many existing information criteria have been applied to this issue by treating the cluster size as identical to the mixture size; however, this may not be valid when the components overlap or have biased weights. Therefore, we need to reconsider the definition and meaning of the cluster size.
For instance, let us observe the three cases of the Gaussian mixture model shown in Figure 1. Although the mixture size is two in every case, the situations differ. In case (a), the two components are distinct from each other and their weights are not biased; therefore, it is sound to believe that the cluster size is two as well. Meanwhile, in case (b), although their weights are not biased, the two components are very close to each other; then, as proposed in the work of Hennig [3], we may need to regard them as one cluster by merging them. In case (c), although the two components are distinct from each other, their weights are biased; as proposed in Jiang et al. [4] and He et al. [5], we may need to regard the small component as a set of outliers rather than a cluster. Overall, in cases (b) and (c), it is harder to say that the cluster size is exactly two than in case (a). This observation gives rise to the problem of formally defining a complexity of clustering structures that reflects overlaps and weight biases.
This paper introduces a novel concept of mixture complexity (MC) to resolve this problem. It is related to the logarithm of the cluster size. For example, the exponentials of the MC are 2.00, 1.39, and 1.21 for cases (a), (b), and (c), respectively. In other words, given the mixture size, MC estimates the cluster size continuously rather than discretely.
There are two reasons why MC is needed. First, it theoretically evaluates the cluster size in a finite mixture model considering the overlap and imbalance between the components. Although their impacts on the cluster size have been discussed independently, we present a unified framework that interprets the cluster size with a continuous index. It presents a new perspective on model-based clustering and can be practically applied to cluster merging or clustering-based outlier detection. The second reason is the application of MC to the issue of gradual clustering change detection. Conventionally, clustering changes have been considered to be abrupt, induced by changes in the mixture size or cluster size. In reality, however, there are cases where the mechanisms generating the data change gradually (or incrementally in the context of concept drifts [6]). We thereby present a new methodology for tracking such changes by observing changes in MC.
We further show that MC can be used to quantify the cluster size in hierarchical mixture models. We demonstrate that the MC of a hierarchical mixture model can be decomposed into the sum of MCs for local mixture models. It enables us to evaluate the complexity of the substructures as well as the entire structure.
The concept of MC has been applied to the cluster merging problem in [7]. This study further investigates the theoretical properties of MC and proposes a new application to the issue of gradual clustering change detection.

1.2. Significance and Novelty

The significance and novelty of this paper are summarized below.

1.2.1. Mixture Complexity for Finite Mixture Models

We introduce a novel concept of MC to continuously measure the cluster size in a mixture model. It is formally defined from the viewpoint of information theory and can be interpreted as a natural extension of the cluster size considering the overlaps and weight biases among the components. We further demonstrate that MC can be decomposed into a sum of MCs according to the mixture hierarchies; it helps us in analyzing MC in a decomposed manner.

1.2.2. Applications of MC to Gradual Clustering Change Detection

We apply MC to the issue of monitoring gradual changes in clustering structures. We propose methods to monitor changes in MC instead of the mixture size or cluster size. Because MC takes a real value, it is more suitable for observing gradual changes. We empirically demonstrate that MC elucidates the clustering structures and their changes more effectively than the mixture size or cluster size.
The remainder of this paper is organized as follows. Section 2 discusses related work. In Section 3, we introduce the concept of MC and present some examples. Theoretical properties of MC are shown in Section 4. Section 5 discusses the application of MC to clustering change detection problems, and Section 6 describes the experimental results. Finally, Section 7 concludes this paper. Proofs of the propositions and theorems are given in the Appendices. Programs for the experiments are available at https://github.com/ShunkiKyoya/MixtureComplexity, accessed on 17 August 2022.

2. Related Work

The issue of determining the best mixture size or cluster size (often referred to as model selection) has been studied extensively. For example, AIC [8], BIC [9], and MDL [10] have been used to select the mixture size; ICL [11] and MDL-based clustering criteria [12,13] have been invented to select the cluster size. These methods have conventionally treated the cluster size as identical to the mixture size by regarding each mixture component as one independent cluster. See also a recent review by McLachlan and Rathnayake [14] focusing on the number of components in a Gaussian mixture model.
Differences between the mixture size and the cluster size have also been widely discussed. For example, McLachlan and Peel [1] pointed out that there are cases in which more than one Gaussian component is needed to describe a single skewed cluster; Biernacki et al. [11] argued that, in many situations, the mixture size estimated by BIC is too large to be regarded as the cluster size. The problem of estimating the cluster size under a given mixture size has also been investigated by Hennig [3], who proposed methods to identify the cluster structure by merging heavily overlapping mixture components. MC differs from his approach in that it interprets the clustering structure by only measuring the overlap rate rather than deciding whether to merge based on a certain threshold.
The degree of overlap or closeness between components has been evaluated using various measures, such as the classification error rate or the Bhattacharyya distance [15]. Wang and Sun [16] and Sun and Wang [17] formulated the overlap rate of Gaussian distributions from their geometric properties. All of the works above are limited to the case of two components. In contrast, MC considers the overlap among any number of components.
Deciding whether a small component is a cluster or a set of outliers is also a significant matter. For example, clustering algorithms such as DBSCAN [18] and constrained k-means [19] avoided generating small components to obtain a better clustering structure. Jiang et al. [4] and He et al. [5] associated the small components with outlier detection problems. MC evaluates the small components by continuously measuring the impacts on the cluster size.
Some other notions have been proposed to quantify the clustering structure. Fuzzy clustering [20] is also a method used to estimate clustering structures with cluster overlap; however, MC is more suitable for consistent estimation in that it assumes underlying mixture distributions. Rusch et al. [21] evaluated the crowdedness of data under the concept of “clusteredness”; however, its relation to the cluster size is indirect. Recently, descriptive dimensionality (Ddim) [22] was proposed to define model dimensionality continuously. It can be used to estimate the clustering structure under the assumption of model fusion, that is, models with different numbers of components are probabilistically mixed. MC differs from Ddim in that it evaluates the overlap and weight bias in a single model without model fusion.
Clustering under data streams has been discussed with various objectives [23,24,25]. We consider the problem of detecting changes in the cluster structure; dynamic model selection (DMS) [26,27,28] addressed this problem by observing changes in the models (corresponding to the mixture size or cluster size in this paper). Because the models take discrete values, the detected changes have been considered to be abrupt. Refer also to the notions of tracking best experts [29], evolution graphs [30], and switching distributions [31], which are similar to DMS.
Furthermore, the issue of gradual changes has been discussed to investigate the transition periods of abrupt changes. The MDL change statistics [32] and differential MDL change statistics [33] were proposed to measure the degree of gradual change. The notions of structural entropy [34] and graph entropy [35] were proposed to measure the degree of model uncertainty in the changes. This study quantifies the degree of gradual change using fluctuations in MC and presents a new methodology to detect them.
MC is based on the mutual information between the observed and latent variables, which has been considered in the clustering field. For example, Still et al. [36] regarded clustering as data compression and applied mutual information to measure its degree. In this paper, we present a novel interpretation of mutual information as a continuous number of clusters. Furthermore, we also present its novel applications to interpreting clusterings and to clustering change detection.

3. Mixture Complexity

In this section, we formally introduce the mixture complexity and describe its properties using some examples and theories.

3.1. Definitions

Given the data $\{x_n\}_{n=1}^{N}$ and the finite mixture model $f$ that generated them, we consider interpreting the cluster size of $f$. The distribution $f$ is written as
$$ f(x) := \sum_{k=1}^{K} \rho_k\, g_k(x), $$
where $K$ denotes the mixture size, $\{\rho_k\}_{k=1}^{K}$ denote the proportions of the components (summing to one), and $\{g_k\}_{k=1}^{K}$ denote the component probability distributions. The random variable $X$ following the distribution $f$ is called an observed variable because it can be observed as a datum. We also define the latent variable $Z \in \{1, \ldots, K\}$ as the index of the component from which the observed variable $X$ originated. The pair $(X, Z)$ is called a complete variable. The distribution of the latent variable $P(Z)$ and the conditional distribution of the observed variable $P(X \mid Z)$ are given by
$$ P(Z = k) = \rho_k, \qquad P(X \mid Z = k) = g_k(X). $$
To investigate the clustering structures in $f$, we consider the following quantity:
$$ I(Z; X) := H(Z) - H(Z \mid X), $$
where $H(Z)$ and $H(Z \mid X)$ denote the entropy and conditional entropy, respectively, of the latent variable $Z$, defined as
$$ H(Z) := -\sum_{k=1}^{K} P(Z = k)\log P(Z = k) = -\sum_{k=1}^{K} \rho_k \log \rho_k, \qquad H(Z \mid X) := \mathbb{E}_X\!\left[-\sum_{k=1}^{K} P(Z = k \mid X)\log P(Z = k \mid X)\right] = \mathbb{E}_X\!\left[-\sum_{k=1}^{K} \gamma_k(X)\log \gamma_k(X)\right], $$
where
$$ \gamma_k(X) := P(Z = k \mid X). $$
The quantity $I(Z;X)$ is well known as the mutual information between the observed and latent variables; it is also known as the (generalized) Jensen–Shannon divergence [37]. We can interpret $I(Z;X)$ as a measure of the volume of the cluster structure as follows. Because $I(Z;X)$ is the difference between the entropy of the latent variable without and with knowledge of the observed variable, it represents the amount of information about the latent variable carried by the observed data. Thus, its exponential $\exp(I(Z;X))$ denotes the number of latent states distinguished by the observed variable; it can be interpreted as a continuous extension of the cluster size. For more information about entropy and mutual information, see the book by Cover and Thomas [38].
However, $I(Z;X)$ cannot be calculated analytically even if $f$ is known. Thus, noting that $\rho_k = \mathbb{E}_X[\gamma_k(X)]$, we approximate $I(Z;X)$ using the data $\{x_n\}_{n=1}^{N}$ as follows:
$$ I(Z; X) \approx \tilde{H}(Z) - \tilde{H}(Z \mid X), $$
where
$$ \tilde{H}(Z) := -\sum_{k=1}^{K} \tilde{\rho}_k \log \tilde{\rho}_k, \qquad \tilde{H}(Z \mid X) := -\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} \gamma_k(x_n)\log \gamma_k(x_n), \qquad \tilde{\rho}_k := \frac{1}{N}\sum_{n=1}^{N} \gamma_k(x_n). $$
We call this the MC of the mixture model f.
Definition 1.
Given the posterior probabilities $\{\gamma_k(x_n)\}_{k,n}$, we define the mixture complexity (MC) as
$$ \mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}\bigr) := -\sum_{k=1}^{K} \tilde{\rho}_k \log \tilde{\rho}_k + \frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} \gamma_k(x_n)\log \gamma_k(x_n), $$
where
$$ \tilde{\rho}_k := \frac{1}{N}\sum_{n=1}^{N} \gamma_k(x_n). $$
If the data have weights $\{w_n\}_n$, we define the MC as
$$ \mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}; \{w_n\}_n\bigr) := -\sum_{k=1}^{K} \tilde{\rho}_k \log \tilde{\rho}_k + \frac{1}{\sum_n w_n}\sum_{n=1}^{N} w_n \sum_{k=1}^{K} \gamma_k(x_n)\log \gamma_k(x_n), $$
where
$$ \tilde{\rho}_k := \frac{1}{\sum_n w_n}\sum_{n=1}^{N} w_n\, \gamma_k(x_n). $$
The weighted version of MC is defined for later use.
Note that there are other ways to approximate $I(Z;X)$; we adopt the form of Definition 1 because it has the decomposition property shown in Section 4.2. See also the methods for approximating the entropy of a mixture model [39,40], which can also be applied to approximate $I(Z;X)$.
In practice, only the data $\{x_n\}_{n=1}^{N}$ are available, without the underlying distribution $f$. We then estimate the posterior probabilities $\{\hat{\gamma}_k(x_n)\}_{k,n}$ from the data $\{x_n\}_{n=1}^{N}$ and estimate the MC as
$$ \mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}\bigr) \approx \mathrm{MC}\bigl(\{\hat{\gamma}_k(x_n)\}_{k,n}\bigr). $$
This can be computed even when the underlying model $f$ itself is unknown.
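For concreteness, the following is a minimal NumPy sketch of Definition 1 (the function name, the optional weights argument, and the small smoothing constant eps are our own choices, not part of the paper). It takes the $N \times K$ matrix of posterior probabilities and returns the MC; its exponential can be read as a continuous cluster size.

```python
import numpy as np

def mixture_complexity(gamma, weights=None, eps=1e-12):
    """Mixture complexity (Definition 1) from an (N, K) matrix of posterior
    probabilities gamma[n, k] = P(Z = k | X = x_n), with optional data weights."""
    gamma = np.asarray(gamma, dtype=float)
    N, K = gamma.shape
    weights = np.ones(N) if weights is None else np.asarray(weights, dtype=float)
    w_sum = weights.sum()

    # Approximate mixture proportions: rho_k = (1 / sum_n w_n) sum_n w_n gamma[n, k]
    rho = (weights[:, None] * gamma).sum(axis=0) / w_sum
    h_z = -np.sum(rho * np.log(rho + eps))                                         # H~(Z)
    h_z_given_x = -np.sum(weights[:, None] * gamma * np.log(gamma + eps)) / w_sum  # H~(Z|X)
    return h_z - h_z_given_x
```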

3.2. Examples

In this subsection, we present some examples of MC to illustrate its behavior.

3.2.1. MC with Different Overlaps

First, we set $N = 600$ and generated the data $x_1, \ldots, x_{600} \in \mathbb{R}^2$ as follows:
$$ x_n \sim \begin{cases} \mathcal{N}\bigl(x_n \mid \mu = [0, 0],\ \Sigma = I_2\bigr) & (1 \le n \le 300), \\ \mathcal{N}\bigl(x_n \mid \mu = [\alpha, 0],\ \Sigma = I_2\bigr) & (301 \le n \le 600), \end{cases} $$
where $\mathcal{N}(x \mid \mu, \Sigma)$ denotes a multivariate normal distribution with mean $\mu$ and covariance $\Sigma$, $I_d$ denotes the $d$-dimensional identity matrix, and $\alpha \in \mathbb{R}$ is the parameter that determines the degree of overlap between the two components.
By varying the value of $\alpha$ over $0, 0.6, \ldots, 6.0$, we generated the data and measured the MC by setting $\rho_1 = \rho_2 = 1/2$ and $g_1, g_2$ to the true component distributions. The exponential of the MC for each $\alpha$ is plotted in Figure 2a. It is evident from the figure that the exponential of the MC increases smoothly from 1.0 to 2.0 as the two components become separated.
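As an illustration, the following is a hedged sketch of this experiment built on the mixture_complexity function sketched in Section 3.1 (the random seed and helper name are arbitrary choices of ours). It draws the data from the two true components, computes the true posteriors with $\rho_1 = \rho_2 = 1/2$, and prints $\exp(\mathrm{MC})$ for each $\alpha$.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

def exp_mc_for_offset(alpha, n_per_comp=300):
    # Draw 300 points from each of the two true components
    x1 = rng.multivariate_normal([0.0, 0.0], np.eye(2), n_per_comp)
    x2 = rng.multivariate_normal([alpha, 0.0], np.eye(2), n_per_comp)
    x = np.vstack([x1, x2])

    # True posterior probabilities under the mixture with rho_1 = rho_2 = 1/2
    d1 = multivariate_normal([0.0, 0.0], np.eye(2)).pdf(x)
    d2 = multivariate_normal([alpha, 0.0], np.eye(2)).pdf(x)
    gamma = np.stack([0.5 * d1, 0.5 * d2], axis=1)
    gamma /= gamma.sum(axis=1, keepdims=True)

    return np.exp(mixture_complexity(gamma))  # mixture_complexity from Section 3.1

for alpha in np.arange(0.0, 6.1, 0.6):
    print(f"alpha = {alpha:.1f}: exp(MC) = {exp_mc_for_offset(alpha):.2f}")
```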

3.2.2. MC with Different Mixture Biases

Next, we set $N = 600$ and generated the data $x_1, \ldots, x_{600} \in \mathbb{R}^2$ as follows:
$$ x_n \sim \begin{cases} \mathcal{N}\bigl(x_n \mid \mu = [0, 0],\ \Sigma = I_2\bigr) & (1 \le n \le 300 + \alpha), \\ \mathcal{N}\bigl(x_n \mid \mu = [6, 0],\ \Sigma = I_2\bigr) & (301 + \alpha \le n \le 600), \end{cases} $$
where $\alpha \in \{0, \ldots, 300\}$ is the parameter that determines the degree of bias between the proportions of the two components.
By varying $\alpha$ over $0, 30, \ldots, 300$, we generated the data and measured the MC by setting $\rho_1 = (300 + \alpha)/600$, $\rho_2 = (300 - \alpha)/600$ and $g_1, g_2$ to the true component distributions. The exponential of the MC for each $\alpha$ is plotted in Figure 2b. It is evident from the figure that the exponential of the MC decreases smoothly from 2.0 to 1.0 as the balance becomes biased.

4. Theoretical Properties

In this section, we discuss the theoretical properties of MC.

4.1. Basic Properties

We discuss the basic properties of MC. The proofs are described in Appendix A.
First, we discuss the minimum and maximum of MC. We show that MC attains its minimum when the components entirely overlap and its maximum when they are entirely separate.
Proposition 1.
If the components entirely overlap, i.e., there exist $\gamma_1, \ldots, \gamma_K$ such that $\gamma_k(x_n) = \gamma_k$ for all $k$ and $n$, then
$$ \mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}; \{w_n\}_n\bigr) = 0. $$
Proposition 2.
If the components are entirely separate, i.e., for every $x_n$ there is a unique index $k_n$ that satisfies
$$ \gamma_l(x_n) = \begin{cases} 1 & (l = k_n), \\ 0 & (l \ne k_n), \end{cases} $$
then
$$ \mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}; \{w_n\}_n\bigr) = \tilde{H}(Z). $$
In particular, if the components are also entirely balanced, i.e., $\tilde{\rho}_1 = \cdots = \tilde{\rho}_K = 1/K$, then
$$ \mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}; \{w_n\}_n\bigr) = \log K. $$
Proposition 3.
For all $\{\gamma_k(x_n)\}_{k,n}$, MC satisfies
$$ 0 \le \mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}; \{w_n\}_n\bigr) \le \log K. $$
Moreover, MC equals 0 only if the components entirely overlap, as stated in Proposition 1, and equals $\log K$ only if the components are entirely separate and balanced, as stated in Proposition 2.
Next, we show that the value of MC is invariant to how the mixture distribution is represented. For example, consider the following three mixture distributions:
$$ f_1(x) = \tfrac{1}{2} g_1(x) + \tfrac{1}{2} g_2(x), \qquad f_2(x) = \tfrac{1}{2} g_1(x) + \tfrac{1}{4} g_2(x) + \tfrac{1}{4} g_2(x), \qquad f_3(x) = \tfrac{1}{2} g_1(x) + \tfrac{1}{4} g_2(x) + \tfrac{1}{4} g_2(x) + 0 \cdot g_3(x). $$
For $f_2$ and $f_3$, we would need to manually remove the redundant components and regard the mixture size as two [1]. In contrast, the following property indicates that the MCs of $f_1$, $f_2$, and $f_3$ are identical; thus, we need not worry about such differences in representation when evaluating MC.
Proposition 4.
If there exist a partition $\{I_1, \ldots, I_L, I_{\emptyset}\}$ of $\{1, \ldots, K\}$ and distributions $g_1^0, \ldots, g_L^0$ such that
$$ g_k = g_l^0\ \ (k \in I_l,\ l = 1, \ldots, L), \qquad \sum_{k \in I_{\emptyset}} \rho_k = 0, $$
then
$$ \mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}; \{w_n\}_n\bigr) = \mathrm{MC}\bigl(\{\gamma_l^0(x_n)\}_{l,n}; \{w_n\}_n\bigr), $$
where
$$ \gamma_l^0(x) = \frac{\sum_{k \in I_l} \rho_k\, g_l^0(x)}{f^0(x)}, \qquad f^0(x) = \sum_{l=1}^{L} \sum_{k \in I_l} \rho_k\, g_l^0(x). $$

4.2. Decomposition Property

In this section, we discuss a method to decompose MC along the hierarchies in mixture models; this can help us in analyzing the structures in more detail.
Consider a mixture distribution $f$ with a two-stage hierarchy, as shown in Figure 3. It has $K$ components $\{g_k\}_{k=1}^{K}$ on the lower side and $L$ components $\{h_l\}_{l=1}^{L}$ on the upper side, where $\{g_k\}_{k=1}^{K}$ denote probability distributions and $\{h_l\}_{l=1}^{L}$ denote their mixture distributions, respectively. We construct the hierarchy as follows. First, we estimate the distribution $f = \sum_{k=1}^{K} \rho_k g_k$. Then, we obtain $\{h_l\}_{l=1}^{L}$ by partitioning (or clustering) the lower components into $L$ groups. Formally, we denote by $Q_k(l) \in \mathbb{R}_{\ge 0}$ the proportion of the lower component $k \in \{1, \ldots, K\}$ that belongs to the upper component $l$, which satisfies $\sum_{l=1}^{L} Q_k(l) = 1$ for all $k$. Then, we derive $\{h_l\}_{l=1}^{L}$ by rewriting $f = \sum_{k=1}^{K} \rho_k g_k$ as
$$ f(x) = \sum_{k=1}^{K} \rho_k\, g_k(x) = \sum_{k=1}^{K} \sum_{l=1}^{L} Q_k(l)\, \rho_k\, g_k(x) = \sum_{l=1}^{L} \tau_l\, h_l(x), $$
where
$$ \tau_l := \sum_{k=1}^{K} Q_k(l)\, \rho_k, \qquad h_l(x) := \sum_{k=1}^{K} \rho_k^{(l)}\, g_k(x), \qquad \rho_k^{(l)} := \frac{Q_k(l)\, \rho_k}{\sum_{k'} Q_{k'}(l)\, \rho_{k'}}. $$
According to the hierarchy, we can decompose the MC.
Theorem 1.
We can decompose the MC as follows:
$$ \mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}; \{w_n\}_n\bigr) = \mathrm{MC}\Bigl(\Bigl\{\sum_{k=1}^{K} Q_k(l)\, \gamma_k(x_n)\Bigr\}_{l,n}; \{w_n\}_n\Bigr) + \sum_{l=1}^{L} W_l \cdot \mathrm{MC}\bigl(\{\gamma_k^{(l)}(x_n)\}_{k,n}; \{w_n^{(l)}\}_n\bigr), $$
where
$$ W_l = \frac{\sum_n w_n \sum_k Q_k(l)\, \gamma_k(x_n)}{\sum_n w_n} = \sum_{k=1}^{K} Q_k(l)\, \tilde{\rho}_k, \qquad w_n^{(l)} = w_n \sum_{k=1}^{K} Q_k(l)\, \gamma_k(x_n), \qquad \gamma_k^{(l)}(x_n) = \frac{Q_k(l)\, \gamma_k(x_n)}{\sum_{k'} Q_{k'}(l)\, \gamma_{k'}(x_n)}. $$
The proof is described in Appendix B. For notational simplicity, we will use the following terms:
$$ \mathrm{MC(total)} := \mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}; \{w_n\}_n\bigr), \qquad \mathrm{MC(interaction)} := \mathrm{MC}\Bigl(\Bigl\{\sum_k Q_k(l)\, \gamma_k(x_n)\Bigr\}_{l,n}; \{w_n\}_n\Bigr), $$
$$ \mathrm{Contribution(component}\ l) := W_l \cdot \mathrm{MC}\bigl(\{\gamma_k^{(l)}(x_n)\}_{k,n}; \{w_n^{(l)}\}_n\bigr), \qquad W(\mathrm{component}\ l) := W_l, \qquad \mathrm{MC(component}\ l) := \mathrm{MC}\bigl(\{\gamma_k^{(l)}(x_n)\}_{k,n}; \{w_n^{(l)}\}_n\bigr). $$
Then, we can rewrite Theorem 1 as
$$ \mathrm{MC(total)} = \mathrm{MC(interaction)} + \sum_{l=1}^{L} \mathrm{Contribution(component}\ l), \qquad \mathrm{Contribution(component}\ l) = W(\mathrm{component}\ l) \cdot \mathrm{MC(component}\ l). $$
In Theorem 1, the MC of the entire structure (MC(total)) is decomposed into the sum of the MC among the upper components (MC(interaction)) and their respective contributions (Contribution(component $l$)). Contribution(component $l$) is further decomposed into the product of the weight (W(component $l$)) and the complexity (MC(component $l$)) of the component. Because $w_n^{(l)}$ denotes the weight of $x_n$ that belongs to component $l$, the (normalized) sum W(component $l$) represents the total weight of the data contained in it. Additionally, MC(component $l$) quantifies the clustering structure within component $l$, considering the data weights.
An example of the decomposition is illustrated in Figure 4 and Table 1. In this example, there are K = 4 lower components generated from a Gaussian mixture model; additionally, there are L = 2 upper components on the left and right sides. By decomposing MC(total), we can evaluate the complexities in the local structures as well as those in the entire structure.
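To make the decomposition concrete, here is a hedged NumPy sketch of Theorem 1 built on the mixture_complexity function sketched in Section 3.1 (the function name and the tolerance used in the sanity check are our own choices). It takes the lower-level posteriors and a partition matrix $Q$ and returns MC(total), MC(interaction), and the per-component weights and MCs.

```python
import numpy as np

def mc_decomposition(gamma, Q, weights=None, eps=1e-12):
    """Decompose MC along a two-stage hierarchy (Theorem 1).
    gamma: (N, K) posteriors of the lower components.
    Q:     (K, L) partition matrix with rows summing to one (Q[k, l] = Q_k(l))."""
    gamma, Q = np.asarray(gamma, float), np.asarray(Q, float)
    N, K = gamma.shape
    L = Q.shape[1]
    weights = np.ones(N) if weights is None else np.asarray(weights, float)

    upper = gamma @ Q                     # upper-level posteriors: sum_k Q_k(l) gamma_k(x_n)
    mc_total = mixture_complexity(gamma, weights)
    mc_interaction = mixture_complexity(upper, weights)

    W, mc_local = [], []
    for l in range(L):
        w_l = weights * upper[:, l]                              # w_n^(l)
        gamma_l = (gamma * Q[:, l]) / (upper[:, l:l + 1] + eps)  # gamma_k^(l)(x_n)
        W.append(w_l.sum() / weights.sum())
        mc_local.append(mixture_complexity(gamma_l, w_l))

    # Sanity check of Theorem 1: MC(total) = MC(interaction) + sum_l W_l * MC(component l)
    assert np.isclose(mc_total, mc_interaction + np.dot(W, mc_local), atol=1e-6)
    return mc_total, mc_interaction, W, mc_local
```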

4.3. Consistency

In this subsection, we discuss the consistency of MC: as the estimated distribution becomes close to the true distribution, the estimated MC converges to the true value. Formally, we define the set of $K$-component mixture models as
$$ \mathcal{F}_K := \left\{ f(x) = \sum_{k=1}^{K} \rho_k\, g\bigl(x \mid \theta_k\bigr) \;\middle|\; \rho_1, \ldots, \rho_K \ge 0,\ \sum_{k=1}^{K} \rho_k = 1,\ \theta_1, \ldots, \theta_K \in \Theta \right\}. $$
We assume that the space $\mathcal{F}_K$ is weakly identifiable, that is,
$$ \sum_{k=1}^{K} \rho_k^0\, g\bigl(\cdot \mid \theta_k^0\bigr) = \sum_{k=1}^{K} \rho_k^1\, g\bigl(\cdot \mid \theta_k^1\bigr) \;\Longrightarrow\; \sum_{k=1}^{K} \rho_k^0\, \delta_\Theta\bigl(\cdot = \theta_k^0\bigr) = \sum_{k=1}^{K} \rho_k^1\, \delta_\Theta\bigl(\cdot = \theta_k^1\bigr), $$
where $\delta_\Theta$ is the Kronecker delta function on $\Theta$. This condition states that identical distributions must have identical mixtures of parameters. See Teicher [41] and Yakowitz and Spragins [42] for sufficient conditions for this kind of identifiability; in their work, it has been shown that it is satisfied by Gaussian and gamma mixtures.
We also assume that a true mixture distribution written as
$$ f_*(x) = \sum_{k=1}^{K_*} \rho_k^*\, g\bigl(x \mid \theta_k^*\bigr), \qquad \rho_1^*, \ldots, \rho_{K_*}^* > 0, \qquad \theta_1^*, \ldots, \theta_{K_*}^* \in \Theta, \qquad \theta_i^* \ne \theta_j^*\ (i \ne j) $$
generates the data $x^N$. We consider estimating the true mixture complexity, written as $\mathrm{MC}(\{\gamma_k^*(x_n)\}_{k,n})$, by substituting an estimated distribution $f \in \mathcal{F}_K$ for $f_*$. We restrict our analysis to the case $K \ge K_*$, so that $\mathcal{F}_K$ contains distributions that are equivalent to $f_*$. We then show that $\mathrm{MC}(\{\gamma_k(x_n)\}_{k,n})$ converges to $\mathrm{MC}(\{\gamma_k^*(x_n)\}_{k,n})$ as $f$ and $f_*$ become closer.
To analyze the convergence, we re-parametrize the estimated parameters using the method proposed by Liu and Shao [43]. They note that if $f = f_*$, there exist integers $0 = i_0 < \cdots < i_{K_*} \le K$ such that the following holds under some permutation of the components:
$$ \theta_l = \theta_k^*\ \ (l \in I_k,\ k = 1, \ldots, K_*), \qquad \rho_k = 0\ \ (k \in I_{\emptyset}), $$
where
$$ I_k := \{ i_{k-1} + 1, \ldots, i_k \}\ \ (k = 1, \ldots, K_*), \qquad I_{\emptyset} := \{ i_{K_*} + 1, \ldots, K \}. $$
Then, they parametrize the parameters in $f$ using two kinds of parameters defined as
$$ \phi := \Bigl( \{\theta_k\}_{k=1}^{i_{K_*}},\ \{r_l\}_{l=1}^{K_*},\ \{\rho_k\}_{k=i_{K_*}+1}^{K} \Bigr), \qquad r_l := \sum_{k \in I_l} \rho_k, \qquad \psi := \Bigl( \{s_k\}_{k=1}^{i_{K_*}},\ \{\theta_k\}_{k=i_{K_*}+1}^{K} \Bigr), \qquad s_k := \frac{\rho_k}{r_l}\ \ (k \in I_l), $$
and rewrite $f$ as
$$ f(x) = \sum_{k=1}^{K} \rho_k\, g\bigl(x \mid \theta_k\bigr) = \sum_{l=1}^{K_*} r_l\, h_l(x) + \sum_{k=i_{K_*}+1}^{K} \rho_k\, g_k(x), \qquad h_l(x) = \sum_{k \in I_l} s_k\, g_k(x). $$
In this parametrization, $f = f_*$ is equivalent to
$$ \phi = \phi^* := \bigl( \theta_1^*, \ldots, \theta_1^*, \ldots, \theta_{K_*}^*, \ldots, \theta_{K_*}^*,\ \rho_1^*, \ldots, \rho_{K_*}^*,\ 0, \ldots, 0 \bigr); $$
the parameter $\psi$ has nothing to do with this equivalence. This parametrization represents two types of convergence in mixture models. The first overlaps the components onto the true components, which is realized by
$$ \{\theta_k\}_{k=1}^{i_{K_*}} \to \bigl( \theta_1^*, \ldots, \theta_1^*, \ldots, \theta_{K_*}^*, \ldots, \theta_{K_*}^* \bigr), \qquad \{r_l\}_{l=1}^{K_*} \to \bigl( \rho_1^*, \ldots, \rho_{K_*}^* \bigr). $$
The other shrinks the weights of the redundant components to zero, which is realized by
$$ \{\rho_k\}_{k=i_{K_*}+1}^{K} \to (0, \ldots, 0). $$
We use the following conditions for our proof:
(C1)
$g(\cdot \mid \theta)$ is differentiable with respect to $\theta$, and for every $k = 1, \ldots, K_*$ there exists $\epsilon > 0$ such that
$$ \mathbb{E}_{\theta_k^*}\left[ \sup_{\theta : \|\theta - \theta_k^*\| \le \epsilon} \left\| \frac{\nabla_\theta\, g(\cdot \mid \theta)}{g(\cdot \mid \theta)} \right\| \right] < \infty. $$
(C2)
As $N \to \infty$, the estimated parameter $\phi$ satisfies
$$ \| \phi - \phi^* \| = o_P(1). $$
(C3)
Let us define the approximations of the mixture proportions as
$$ \tilde{r}_l := \frac{1}{N} \sum_{n=1}^{N} \sum_{k \in I_l} \gamma_k(x_n) \quad (l = 1, \ldots, K_*), \qquad \tilde{\rho} := \frac{1}{N} \sum_{n=1}^{N} \sum_{k = i_{K_*}+1}^{K} \gamma_k(x_n). $$
Then, as $N \to \infty$, they satisfy
$$ \bigl|\tilde{r}_l - \rho_l^*\bigr| = o_P(1) \quad (l = 1, \ldots, K_*), \qquad \tilde{\rho} = o_P(1). $$
Condition (C1) is a usual differentiability condition, and (C2) and (C3) require consistency of the parameters. It is known that consistent estimation is possible by, for example, penalized maximum likelihood estimation [44,45] or Bayesian estimation [46]. The consistency of MC is then shown in the following theorem.
Theorem 2.
Under assumptions (C1), (C2), and (C3), the following holds as $N \to \infty$:
$$ \Bigl| \mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}\bigr) - \mathrm{MC}\bigl(\{\gamma_k^*(x_n)\}_{k,n}\bigr) \Bigr| = O_P\!\left( \|\phi - \phi^*\| + \sum_{l=1}^{K_*} \bigl|\tilde{r}_l - \rho_l^*\bigr| + \bigl|\tilde{\rho}\log\tilde{\rho}\bigr| + \frac{1}{\sqrt{N}} \right). $$
The proof is described in Appendix C. Theorem 2 shows the convergence rate of the estimation error of the MC. It is interesting that this holds even when $K > K_*$. Therefore, MC can be regarded as a fundamental quantity representing the cluster structure of a mixture model, overcoming differences in the mixture size.
We give an overview of the proof below. First, applying Theorem 1 repeatedly, we decompose the entire MC into the following four terms:
(a)
The interaction between $\sum_{l=1}^{K_*} r_l h_l$ and $\sum_{k=i_{K_*}+1}^{K} \rho_k g_k$.
(b)
The contribution from $\sum_{k=i_{K_*}+1}^{K} \rho_k g_k$.
(c)
The interaction among $h_1, \ldots, h_{K_*}$.
(d)
The contributions from $h_1, \ldots, h_{K_*}$, respectively.
The procedure of the decomposition is also illustrated in Figure 5. Then, we show that
(a)
tends to 0 because $\tilde{\rho} \to 0$;
(b)
tends to 0 because $\tilde{\rho} \to 0$;
(c)
tends to $\mathrm{MC}(\{\gamma_k^*(x_n)\}_{k,n})$ because $h_1, \ldots, h_{K_*}$ tend to the true components $g(\cdot \mid \theta_1^*), \ldots, g(\cdot \mid \theta_{K_*}^*)$;
(d)
tends to 0 because, for each $l$, all components in $h_l$ tend to $g(\cdot \mid \theta_l^*)$.
The proofs are mainly based on the mean-value theorem. However, the derivative of $\log f$ with respect to $\rho_k$ ($k \in I_{\emptyset}$) may be infinite; additional treatment is needed to avoid this.

5. Applications

In this section, we propose methods to apply MC to clustering change detection problems. Formally speaking, given the dataset $\mathcal{X} := \{\{x_{n,t}\}_{n=1}^{N} \mid t \in \{1, \ldots, T\}\}$, where $t$ denotes the time and $\{x_{n,t}\}_{n=1}^{N}$ denote the data generated at each $t$, we consider the problem of monitoring the changes in the clustering structure over $t = 1, \ldots, T$.
First, we briefly summarize the method named sequential dynamic model selection (SDMS) [28] that addresses this problem. Then, we introduce our ideas and discuss the differences from SDMS.
Hereafter, we assume that the data points $x_{n,t}$ are $d$-dimensional vectors and consider a Gaussian mixture model
$$ f_t(x) = \sum_{k=1}^{K_t} \rho_{k,t}\, \mathcal{N}\bigl(x \mid \mu_{k,t}, \Sigma_{k,t}\bigr) $$
for each $t$.

5.1. Sequential Dynamic Model Selection

SDMS is an algorithm used to sequentially estimate models and find changes. In clustering change detection problems, it sequentially estimates the mixture sizes $\hat{K}_t$ and parameters $\eta_{\hat{K}_t} := \{\hat{\rho}_{k,t}, \hat{\mu}_{k,t}, \hat{\Sigma}_{k,t}\}_{k=1}^{\hat{K}_t}$ and finds model changes as changes in $\hat{K}_t$.
The estimation procedure is explained below. First, depending on the mixture size estimated at the previous time point, $\hat{K}_{t-1}$, we set the candidates for $K_t$. Then, for each candidate $K_t$, we estimate the parameters $\eta_{K_t}$ from the data $\{x_{n,t}\}_{n=1}^{N}$ and calculate a cost function $L_{\mathrm{SDMS}}\bigl(\{x_{n,t}\}_{n=1}^{N}; K_t, \eta_{K_t}, \hat{K}_{t-1}\bigr)$. Finally, we select $K_t$ as the mixture size that minimizes the cost. The candidates for $K_t$ are set as
$$ \{1, \ldots, K_{\max}\} $$
at $t = 1$, and
$$ \{\hat{K}_{t-1} - 1,\ \hat{K}_{t-1},\ \hat{K}_{t-1} + 1\} \cap \{1, \ldots, K_{\max}\} $$
at $t \ge 2$, where $K_{\max}$ is a pre-defined parameter. The cost function is the sum of the code-length functions of the model and the model change, given by
$$ L_{\mathrm{SDMS}}\bigl(\{x_{n,t}\}_{n=1}^{N}; K_t, \eta_{K_t}, \hat{K}_{t-1}\bigr) = L_{\mathrm{model}}\bigl(\{x_{n,t}\}_{n=1}^{N}; K_t, \eta_{K_t}\bigr) + L_{\mathrm{change}}\bigl(K_t \mid \hat{K}_{t-1}\bigr). $$

Code Length of the Model

The score $L_{\mathrm{model}}\bigl(\{x_n\}_{n=1}^{N}; K, \eta_K\bigr)$ is the sum of a negative log-likelihood and a penalty term corresponding to the complexity of the model. In this study, we consider two likelihood functions and four penalty terms. For the (negative log-) likelihood functions, we consider the observed likelihood $L\bigl(\{x_n\}_{n=1}^{N}; \eta_K\bigr)$ and the complete likelihood $L\bigl(\{x_n, z_n\}_{n=1}^{N}; \eta_K\bigr)$, given by
$$ L\bigl(\{x_n\}_{n=1}^{N}; \eta_K\bigr) := -\sum_{n=1}^{N} \log P(X = x_n) = -\sum_{n=1}^{N} \log \sum_{k=1}^{K} \rho_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k), $$
$$ L\bigl(\{x_n, z_n\}_{n=1}^{N}; \eta_K\bigr) := -\sum_{n=1}^{N} \log P(X = x_n, Z = z_n) = -\sum_{n=1}^{N} \log \bigl( \rho_{z_n}\, \mathcal{N}(x_n \mid \mu_{z_n}, \Sigma_{z_n}) \bigr), $$
where $\{z_n\}_{n=1}^{N}$ are the latent variables for the data, estimated by
$$ z_n := \operatorname*{argmax}_{z \in \{1, \ldots, K\}} P(Z = z \mid X = x_n). $$
They correspond to the likelihoods of the observed data and the complete data, respectively; the former is used to determine the mixture size, and the latter is used to determine the cluster size under the assumption that it is equal to the mixture size. For the penalty terms, we consider AIC [8], BIC [9], NML [13], and DNML [47,48]. By combining the negative log-likelihoods and the penalty terms, we consider the following six scores:
  • AIC with observed likelihood (AIC):
    $L\bigl(\{x_n\}_{n=1}^{N}; \eta_K\bigr) + D$,
  • AIC with complete likelihood (AIC+comp):
    $L\bigl(\{x_n, z_n\}_{n=1}^{N}; \eta_K\bigr) + D$,
  • BIC with observed likelihood (BIC):
    $L\bigl(\{x_n\}_{n=1}^{N}; \eta_K\bigr) + \frac{D}{2}\log N$,
  • BIC with complete likelihood (BIC+comp):
    $L\bigl(\{x_n, z_n\}_{n=1}^{N}; \eta_K\bigr) + \frac{D}{2}\log N$,
  • NML:
    $L\bigl(\{x_n, z_n\}_{n=1}^{N}; \eta_K\bigr) + \log \mathrm{PC}_{\mathrm{NML}}(N, K)$,
  • DNML:
    $L\bigl(\{x_n, z_n\}_{n=1}^{N}; \eta_K\bigr) + \log \mathrm{PC}_{\mathrm{DNML}}\bigl(N, \{z_n\}_{n=1}^{N}, K\bigr)$,
where $D := (K-1) + K\,d(d+3)/2$ denotes the number of free parameters required to represent a Gaussian mixture model, and $\mathrm{PC}_{\mathrm{NML}}(N, K)$ and $\mathrm{PC}_{\mathrm{DNML}}(N, \{z_n\}_{n=1}^{N}, K)$ denote the parametric complexities. In our experiments, we estimated the parameter $\eta_K$ by running the EM algorithm [49] implemented in the Scikit-learn package [50] ten times and selecting the parameter that minimized each score. Note that for NML and DNML we considered only the complete likelihood functions, because methods to calculate their parametric complexities are known only for that case.
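To illustrate how two of these scores can be computed in practice, the following sketch uses scikit-learn's GaussianMixture (the helper name bic_scores and the numerical guard added inside the logarithm are our own, and the free-parameter count assumes full covariance matrices). It evaluates BIC with the observed likelihood and BIC with the complete likelihood for a given $K$.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bic_scores(X, K, n_init=10, seed=0):
    """BIC and BIC+comp scores for a K-component Gaussian mixture:
    negative log-likelihood (observed or complete) plus (D / 2) * log N."""
    N, d = X.shape
    gm = GaussianMixture(n_components=K, covariance_type="full",
                         n_init=n_init, random_state=seed).fit(X)

    # Observed negative log-likelihood: -sum_n log f(x_n)
    nll_obs = -gm.score_samples(X).sum()

    # Complete negative log-likelihood: -sum_n log(rho_{z_n} * N(x_n | mu_{z_n}, Sigma_{z_n}))
    resp = gm.predict_proba(X)                       # posteriors gamma_k(x_n)
    z = resp.argmax(axis=1)                          # hard assignments z_n
    log_joint = gm.score_samples(X) + np.log(resp[np.arange(N), z] + 1e-300)
    nll_comp = -log_joint.sum()

    D = (K - 1) + K * d * (d + 3) / 2                # free parameters of the GMM
    penalty = D / 2 * np.log(N)
    return nll_obs + penalty, nll_comp + penalty
```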

5.2. Track MC

In SDMS, clustering changes are detected as the changes of the mixture size or cluster size K; because it is discrete, the changes have been considered to be abrupt. Then, we propose to track MC instead of K while estimating the parameters using SDMS. Because MC takes a real value, monitoring it is more suitable for observing gradual changes than monitoring K. The algorithm for tracking MC is explained in Algorithm 1.
Algorithm 1 Tracking MC
Require: A dataset $\mathcal{X} = \{\{x_{n,t}\}_{n=1}^{N} \mid t \in \{1, \ldots, T\}\}$.
  • 1: for $t = 1$ to $t = T$ do
  • 2:       Estimate $\hat{K}_t$ and $\{\hat{g}_{k,t}\}_{k=1}^{\hat{K}_t}$ from the data $\{x_{n,t}\}_{n=1}^{N}$ using SDMS.
  • 3:       Calculate $\mathrm{MC}_t := \mathrm{MC}\bigl(\{\hat{\gamma}_k(x_n)\}_{k,n}\bigr)$.
  • 4: end for
  • 5: return $\{\mathrm{MC}_t\}_{t=1}^{T}$.
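A simplified sketch of Algorithm 1 is given below; it is hedged in that each time step selects $K$ by plain BIC over $1, \ldots, K_{\max}$ with scikit-learn rather than by the full SDMS cost, and it reuses the mixture_complexity function sketched in Section 3.1.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def track_mc(X_per_t, k_max=10, seed=0):
    """Track MC_t over time. X_per_t is a list of (N, d) arrays, one per time step."""
    mc_trace = []
    for X_t in X_per_t:
        # Model selection: fit K = 1, ..., k_max and keep the model with the smallest BIC
        best = min(
            (GaussianMixture(n_components=k, random_state=seed).fit(X_t)
             for k in range(1, k_max + 1)),
            key=lambda gm: gm.bic(X_t),
        )
        gamma = best.predict_proba(X_t)              # estimated posteriors
        mc_trace.append(mixture_complexity(gamma))   # MC_t
    return np.array(mc_trace)
```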

5.3. Track MC with Its Decomposition

In addition to monitoring the MC of the entire structure, we also propose an algorithm to track its decomposition. To accomplish this, we must estimate the upper $L$ components and their corresponding partitions $Q_{k,t}(l)$ for each $t$.
Here, we assume that the upper $L$ components are common to every $t$ and estimate the partition $Q_{k,t}(l)$ after estimating the lower components at each time. Specifically, we regard $\mu_{k,t}$ as a point with weight $\rho_{k,t}$ for each $k$ and $t$ and cluster these points. As the clustering algorithm, we modified fuzzy c-means [20] to handle weighted points. Formally, we estimate the centers of the upper $L$ components $\tilde{\mu}_l$ and their corresponding partitions $Q_{k,t}(l)$ by minimizing the loss function
$$ \sum_{t,k} \rho_{k,t} \sum_{l=1}^{L} Q_{k,t}(l)^m\, \bigl\| \mu_{k,t} - \tilde{\mu}_l \bigr\|^2, $$
where $m > 1$ is a parameter that determines the fuzziness of the partition.
We estimate $\tilde{\mu}_l$ and $Q_{k,t}(l)$ by alternately minimizing the loss with respect to one while fixing the other. The iteration can be formulated as follows (a code sketch is given below):
$$ \tilde{\mu}_l = \frac{\sum_{k,t} \rho_{k,t}\, Q_{k,t}(l)^m\, \mu_{k,t}}{\sum_{k,t} \rho_{k,t}\, Q_{k,t}(l)^m}, \qquad Q_{k,t}(l) = \frac{\bigl\| \mu_{k,t} - \tilde{\mu}_l \bigr\|^{-2/(m-1)}}{\sum_{l'=1}^{L} \bigl\| \mu_{k,t} - \tilde{\mu}_{l'} \bigr\|^{-2/(m-1)}}. $$
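The following is a hedged sketch of this weighted fuzzy c-means step (the initialization scheme, the fixed iteration count, and the eps guard are our own choices). It clusters the stacked lower-component means, weighted by their proportions, into $L$ upper components.

```python
import numpy as np

def weighted_fuzzy_cmeans(points, weights, L, m=1.5, n_iter=100, seed=0, eps=1e-12):
    """Weighted fuzzy c-means for the partition step above.
    points:  (M, d) array of lower-component means mu_{k,t} stacked over k and t.
    weights: (M,) array of the corresponding proportions rho_{k,t}.
    Returns centers (L, d) and memberships Q (M, L)."""
    rng = np.random.default_rng(seed)
    M, _ = points.shape
    centers = points[rng.choice(M, size=L, replace=False)]    # initialize from the points

    for _ in range(n_iter):
        # Membership update: Q proportional to ||mu - center||^(-2 / (m - 1))
        dist2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + eps
        Q = dist2 ** (-1.0 / (m - 1))
        Q /= Q.sum(axis=1, keepdims=True)

        # Center update: weighted mean with weights rho * Q^m
        w = weights[:, None] * Q ** m
        centers = (w.T @ points) / (w.sum(axis=0)[:, None] + eps)

    return centers, Q
```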
Finally, we present an algorithm to track the MC and its decomposition in Algorithm 2. We can analyze the structural changes in more detail by evaluating the decomposed values.
Algorithm 2 Tracking MC with its decomposition
Require: A dataset $\mathcal{X} = \{\{x_{n,t}\}_{n=1}^{N} \mid t \in \{1, \ldots, T\}\}$, parameters $m$ and $L$.
  • 1:
  • 2: # Step 1: Estimate lower components.
  • 3: for $t = 1$ to $t = T$ do
  • 4:       Estimate $\hat{K}_t$ and $\{\hat{g}_{k,t}\}_{k=1}^{\hat{K}_t}$ from the data $\{x_{n,t}\}_{n=1}^{N}$ using SDMS.
  • 5:       Calculate $\mathrm{MC(total)}_t := \mathrm{MC}\bigl(\{\hat{\gamma}_k(x_n)\}_{k,n}\bigr)$.
  • 6: end for
  • 7:
  • 8: # Step 2: Estimate upper components and partition.
  • 9: Estimate the centers $\tilde{\mu}_l$ and the partition $Q_{k,t}(l)$ using fuzzy c-means.
  • 10:
  • 11: # Step 3: Calculate the decomposition of MC.
  • 12: for $t = 1$ to $t = T$ do
  • 13:       Calculate $\mathrm{MC(interaction)}_t$ defined in Section 4.2.
  • 14:       for $l = 1$ to $l = L$ do
  • 15:           Calculate $W(\mathrm{component}\ l)_t$ defined in Section 4.2.
  • 16:           Calculate $\mathrm{MC(component}\ l)_t$ defined in Section 4.2.
  • 17:     end for
  • 18: end for
  • 19: return $\{\mathrm{MC(total)}_t\}_{t=1}^{T}$, $\{\mathrm{MC(interaction)}_t\}_{t=1}^{T}$, $\{\{W(\mathrm{component}\ l)_t\}_{l=1}^{L}\}_{t=1}^{T}$, $\{\{\mathrm{MC(component}\ l)_t\}_{l=1}^{L}\}_{t=1}^{T}$.

6. Experimental Results

In this section, we present the experimental results that demonstrate the MC’s ability to monitor the clustering changes. We compare our methods to the monitoring of K.

6.1. Analysis of Artificial Data

To reveal the behavior of MC, we conducted experiments with two artificial datasets, called the move Gaussian dataset and the imbalance Gaussian dataset. Their experimental designs are discussed below. First, we generated artificial datasets $\mathcal{X} = \{\{x_{n,t}\}_{n=1}^{N} \mid t \in \{1, \ldots, T\}\}$ by setting $T = 150$ and $N = 1000$. The datasets have one transition period, $t = 51, \ldots, 100$, in which the data change their clustering structure gradually. Then, we estimated the MC and $K$ using the methods in Section 5.1 and Section 5.2 with $K_{\max} = 10$. To compare them, we first created a simple algorithm to detect changes from the sequence of MC or $K$. Then, we compared the abilities of this algorithm in terms of the speed and accuracy of detecting the change points. Moreover, to evaluate the ability to find changes in the opposite direction, we performed experiments with the same datasets in reverse order.
Given a sequence of MC or $K$ values written as $y_1, \ldots, y_{150}$, we constructed an algorithm to detect the change points as follows. For $t = 10, \ldots, 150$, we raised a change alert if
$$ \bigl| \mathrm{median}(y_{t-9}, \ldots, y_{t-5}) - \mathrm{median}(y_{t-4}, \ldots, y_{t}) \bigr| > \epsilon $$
in the case of MC, and
$$ \mathrm{median}(y_{t-9}, \ldots, y_{t-5}) \ne \mathrm{median}(y_{t-4}, \ldots, y_{t}) $$
in the case of $K$, where $\epsilon$ is the threshold to raise an alert for MC. It should be somewhat large to avoid too many false alerts, and smaller than 1 to find changes earlier than by monitoring $K$. In this section, we set $\epsilon$ to 0.01 so as not to raise alerts from $t = 1$ to $10$, assuming that we know there are no changes in this period. We calculated the medians instead of the means of the subsequences for robustness. However, to avoid redundant alerts, we ignored an alert when the difference between $t$ and the latest alert was less than 5, even if the conditions were satisfied.
To evaluate the quality of the algorithm, we calculated the Delay and the false alarm rate (FAR), defined as
$$ \mathrm{Delay} := \min\bigl( t^* - 51,\ 50 \bigr), \qquad \mathrm{FAR} := \frac{\#\{ t \in [10, 150] \mid t \notin \mathrm{ACCEPT},\ t \in \mathrm{ALERT} \}}{\#\{ t \in [10, 150] \mid t \notin \mathrm{ACCEPT} \}}, $$
where $t^*$ denotes the first time point in the transition period at which the algorithm generated an alert, ACCEPT denotes the set of time points at which alerts are acceptable, defined as $\{ t \mid \{t-9, \ldots, t\} \cap [51, 100] \ne \emptyset \} = [51, 109]$, and ALERT denotes the set of time points at which the algorithm generated alerts.
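A hedged sketch of this detection and evaluation procedure is given below (the function names and the 1-based time bookkeeping are our own; the transition and ACCEPT ranges follow the definitions above).

```python
import numpy as np

def detect_alerts(y, eps=0.01, discrete=False, refractory=5):
    """Median-based alert rule. y is the sequence of MC_t (or K_t with discrete=True)."""
    alerts, last_alert = [], -np.inf
    for t in range(9, len(y)):                        # 0-based index; corresponds to t = 10, ..., T
        m_old = np.median(y[t - 9: t - 4])            # y_{t-9}, ..., y_{t-5}
        m_new = np.median(y[t - 4: t + 1])            # y_{t-4}, ..., y_{t}
        changed = (m_old != m_new) if discrete else (abs(m_old - m_new) > eps)
        if changed and t - last_alert >= refractory:  # suppress redundant alerts
            alerts.append(t + 1)                      # report in 1-based time
            last_alert = t
    return alerts

def delay_and_far(alerts, transition=range(51, 101), accept=range(51, 110), horizon=150):
    """Delay and false alarm rate (FAR) as defined above."""
    hits = [t for t in alerts if t in transition]
    delay = min(hits[0] - transition.start, 50) if hits else 50
    outside = [t for t in range(10, horizon + 1) if t not in accept]
    far = sum(t in alerts for t in outside) / len(outside)
    return delay, far
```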

6.1.1. Move Gaussian Dataset

The move Gaussian dataset is a set of three-dimensional Gaussian mixture distributions whose means move gradually in the transition period. Formally, for each $t$, we generated the data $\{x_{n,t}\}_{n=1}^{1000}$ as follows:
$$ x_{n,t} \sim \begin{cases} \mathcal{N}\bigl(x \mid \mu = [0, 0, 0],\ \Sigma = I_3\bigr) & (1 \le n \le 333), \\ \mathcal{N}\bigl(x \mid \mu = [10, 0, 0],\ \Sigma = I_3\bigr) & (334 \le n \le 666), \\ \mathcal{N}\bigl(x \mid \mu = [10 + \alpha(t), 0, 0],\ \Sigma = I_3\bigr) & (667 \le n \le 1000), \end{cases} $$
where
$$ \alpha(t) = \begin{cases} 0 & (1 \le t \le 50), \\ 0.12\,(t - 50) & (51 \le t \le 100), \\ 6 & (101 \le t \le 150). \end{cases} $$
The first and second dimensions of some data are visualized in Figure 6. In the direction $t = 1 \to 150$, the number of clusters increases from two to three as two of the clusters separate; in the direction $t = 150 \to 1$, it decreases from three to two as two of the clusters merge.
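For reference, a sketch of this data-generation process is shown below (the random seed is arbitrary; the per-component sample counts follow the definition above).

```python
import numpy as np

def move_gaussian_dataset(T=150, seed=0):
    """Generate the move Gaussian dataset: a list of T arrays of shape (1000, 3)."""
    rng = np.random.default_rng(seed)
    data = []
    for t in range(1, T + 1):
        alpha = 0.0 if t <= 50 else (0.12 * (t - 50) if t <= 100 else 6.0)
        means = np.array([[0, 0, 0]] * 333 + [[10, 0, 0]] * 333
                         + [[10 + alpha, 0, 0]] * 334, dtype=float)
        data.append(means + rng.standard_normal((1000, 3)))   # unit covariance I_3
    return data
```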
The experiments were performed ten times by randomly generating the datasets; accordingly, the average performance scores were calculated. The differences in the scores between MC and $K$ for each criterion are presented in Table 2; the estimated MC and $K$ in one trial are plotted in Figure 7. This figure illustrates the result of BIC as an example.
With respect to the speed of finding changes, in every criterion, MC performed as well as $K$ in the direction $t = 1 \to 150$; however, it performed significantly better than $K$ in the direction $t = 150 \to 1$. The reason for the differing performance is discussed below. In the direction $t = 1 \to 150$, the model selection algorithms underestimated the number of components at the beginning of the transition period. At such time points, they ignored the overlap of the two components and considered them as one cluster. Thus, MC, being based on such model selection methods, was unable to find the changes earlier than $K$. However, in the direction $t = 150 \to 1$, the overlap between the components was correctly estimated at some time points before $K$ changed. In this case, MC changed smoothly according to the overlap and found the changes earlier than $K$.
With respect to the accuracy of finding changes, MC performed as well as K in terms of FAR. Additionally, it is evident from Figure 7 that MC stably estimated the clustering structures.

6.1.2. Imbalance Gaussian Dataset

The imbalance Gaussian dataset is a set of three-dimensional Gaussian mixture distributions whose balances change gradually in the transition period. Formally, for each $t$, we generated the data $\{x_{n,t}\}_{n=1}^{1000}$ as follows:
$$ x_{n,t} \sim \begin{cases} \mathcal{N}\bigl(x \mid \mu = [0, 0, 0],\ \Sigma = I_3\bigr) & (1 \le n \le 250), \\ \mathcal{N}\bigl(x \mid \mu = [10, 0, 0],\ \Sigma = I_3\bigr) & (251 \le n \le 500), \\ \mathcal{N}\bigl(x \mid \mu = [20, 0, 0],\ \Sigma = I_3\bigr) & (501 \le n \le 750 + \alpha(t)), \\ \mathcal{N}\bigl(x \mid \mu = [30, 0, 0],\ \Sigma = I_3\bigr) & (751 + \alpha(t) \le n \le 1000), \end{cases} $$
where
$$ \alpha(t) = \begin{cases} 0 & (1 \le t \le 50), \\ 5\,(t - 51) & (51 \le t \le 100), \\ 250 & (101 \le t \le 150). \end{cases} $$
The first and second dimensions of some data are visualized in Figure 8. In the direction $t = 1 \to 150$, the number of clusters decreases from four to three as the edge cluster disappears. In the direction $t = 150 \to 1$, it increases from three to four as the edge cluster emerges.
The experiments were performed ten times by randomly generating datasets; accordingly, the average performance scores were calculated. The differences in the scores between MC and $K$ for each criterion are listed in Table 3. The estimated MC and $K$ in one trial are plotted in Figure 9. This figure illustrates the result of BIC as an example.
In terms of the speed of finding changes, for every model selection method, MC performed significantly better than $K$ in the direction $t = 1 \to 150$; however, MC performed only as well as $K$ in the direction $t = 150 \to 1$. The reason for the differing performance is discussed below. In the transition period, all model selection methods counted the minor components as independent clusters. Then, in the direction $t = 1 \to 150$, MC changed smoothly according to the imbalance and detected the changes earlier than $K$. In the direction $t = 150 \to 1$, $K$ increased at a very early point in the transition period. Then, MC increased along with $K$ and detected the changes simultaneously.
In terms of the accuracy of finding changes, MC performed as well as K in terms of FAR. Additionally, it is evident from Figure 9 that MC stably estimated the clustering structures.

6.1.3. Scalability

To discuss scalability to large datasets, we explored the increase in computation time with the data size. First, we set the mixture distribution $f$ as
$$ f(x) = 0.5 \times \mathcal{N}\bigl(x \mid \mu = [0, \ldots, 0],\ \Sigma = I_d\bigr) + 0.5 \times \mathcal{N}\bigl(x \mid \mu = [1, \ldots, 1],\ \Sigma = I_d\bigr) $$
and sampled $N$ points from $f$. Then, we recorded the time to calculate $\{\gamma_{k,n}\}$ from $f$ and the time to calculate the MC from $\{\gamma_{k,n}\}$. We repeatedly measured the computation times while increasing $N$ and $d$. For each $N$ and $d$, we measured them ten times and took the averages.
The increase in the computation times is illustrated in Figure 10. In (a), although both computation times increased linearly as $N$ grew, calculating the MC was faster than calculating $\{\gamma_{k,n}\}$. In (b), although the time to calculate $\{\gamma_{k,n}\}$ increased as $d$ grew, the computation time for the MC was almost constant because $K$ and $N$ were unchanged. Overall, the cost of computing the MC is much smaller than that of computing or estimating $\{\gamma_{k,n}\}$.

6.2. Analysis of Real Data

We analyzed two types of real data, named the beer dataset and the house dataset, which are summarized in Table 4. In the following subsections, we discuss the details of the datasets and the results of the experiments.

6.2.1. Beer Dataset

We discuss the results for the beer dataset, obtained from Hakuhodo, Inc. and M-CUBE, Inc. This dataset has also been analyzed in [28,34]. It comprises records of customers' beer purchases from 1 November 2010 to 31 January 2011. The dataset $\mathcal{X}$ is constructed as follows. The time unit is a day. For each day $t \in \{\tau, \ldots, T\}$, $x_{n,t} \in \mathbb{R}^d$ denotes the $n$-th customer's beer consumption from time $t - \tau + 1$ to $t$, where we set $\tau = 14$. The dimension $d$ of the vector is 16, and the dimensions correspond to the consumption of the following drinks:
  • beer (A), …, beer (F): beer with brand name A, …, F.
  • beer (other): beer with other brands.
  • beerlike (A), …, beerlike (H): beer-like drink with brand name A, …, H.
  • beerlike (other): beer-like drink with other brands.
First, we compare the plots of the estimated MC and $K$ in Figure 11. The results of BIC and NML are illustrated as examples. Note that we omit the results of AIC because it chose $K_{\max}$ for $\hat{K}_t$ at many $t$. With every method, the score was high at the end and beginning of the year, reflecting the increased activity in transactions. However, because critical changes in the clustering structure and changes due to ineffective components were mixed, the sequence of $K$ had many change points; as a result, it was difficult to interpret their meaning. In contrast, MC identified the clustering structure by discounting the effects of the ineffective components. As a result, the sequence of MC highlighted the significant changes at the end and beginning of the year. It is also worth noting that the differences in the scores between the model selection methods were much smaller for MC than for $K$; this indicates that BIC and NML estimate similar clustering structures under the concept of MC even though the numbers of components differ significantly.
Next, we discuss the results of the decomposition of MC. We present the results of BIC and NML with $L = 4$ and $m = 1.5$. The centers of the upper components are listed in Table 5 and Table 6, respectively, and the plots of each decomposed value are illustrated in Figure 12 and Figure 13, respectively. The indices of the upper components are manually rearranged so that they correspond with each other; it can then be observed that the results are also similar to each other. The structures can be evaluated in detail by analyzing the decomposed values. For instance, let us analyze the decomposed values at the end and beginning of the year. As evident from the tables, the components had different characteristics. It can be observed from the figures that the contributions increased in all components, indicating that all of them were related to the increase in MC(total). The weight decreased in component 1 and increased in components 2 and 3, indicating that customers moved from component 1 to components 2 and 3. Additionally, MC(component l) increased in all components, indicating that the complexity or diversity increased within them.

6.2.2. House Dataset

We discuss the results for the house dataset, obtained from the UCI Machine Learning Repository [51]. The dataset comprises records of electricity consumption in a house every five minutes from 16 December 2006 to 26 November 2010. The dataset $\mathcal{X}$ is constructed as follows. The time unit is 15 min, from 00:00–00:15 to 23:45–24:00. For each $t$, the data $\{x_{n,t}\}_{n=1}^{N}$ denote the set of records on the various days included in the $t$-th time unit. The dimension $d$ of the vector is 3, corresponding to the metering of the following three points:
  • metering(A): a kitchen.
  • metering(B): a laundry room.
  • metering(C): a water-heater and an air-conditioner.
First, we compare the plots of the estimated $K$ and the corresponding MC in Figure 14. The results of BIC and NML are illustrated as examples. Note that we omit the results of AIC because it chose $K_{\max}$ for $\hat{K}_t$ at many $t$. It can be observed from the figure that MC smoothly connected the discrete changes in $K$; therefore, MC expressed the gradual changes in the dataset more effectively than $K$. Additionally, the MCs under BIC and NML were more similar to each other than the corresponding values of $K$, as in the beer dataset. The values of MC started increasing from around 7:00; after slight fluctuations, the value reached its peak around 21:00. Therefore, MC appears to represent the amount of activity in this house.
Next, we discuss the results of the decomposition of MC. We present the results of BIC and NML with $L = 4$ and $m = 1.5$. The centers of the upper components are listed in Table 7 and Table 8, respectively, and the plots of each decomposed value are illustrated in Figure 15 and Figure 16, respectively. The indices of the upper components are manually rearranged so that they correspond with each other; it can then be observed that the results are also similar to each other. The structures can be evaluated in detail by analyzing the decomposed values. For instance, let us analyze the decomposed values in component 3. It can be observed from the tables that the value of metering(C) was especially high in this component. Looking at Contribution(component 3), there were two peaks, around 9:00 and 21:00, representing increased activity in this component. However, the relative roles of the weight and the MC differed between the two peaks. W(component 3) was especially high around 9:00, indicating that the first peak was due to an increase in the weight of the component, whereas MC(component 3) was especially high around 21:00, indicating that the second peak was due to an increase in the complexity within the component.

7. Conclusions

We proposed the concept of MC to measure the cluster size continuously in a mixture model. We first pointed out that the cluster size may not equal the mixture size when the mixture model has overlaps or weight biases; we then introduced MC as an extended concept of the cluster size that accounts for these effects. We also presented a method to decompose MC according to the mixture hierarchy, which helps us analyze the substructures in detail. Subsequently, we applied MC and its decomposition to gradual clustering change detection problems. We conducted experiments to verify that MC effectively elucidates clustering changes. In the artificial data experiments, MC found the clustering changes significantly earlier in the cases where the overlap or weight bias was correctly estimated. In the real data experiments, MC expressed the gradual changes better than $K$ because it discriminated between significant and insignificant changes and smoothly connected the discrete changes in $K$. We also found that MC took similar values across the model selection methods; this indicates that the estimated clustering structures are alike under the concept of MC. Moreover, its decomposition enabled us to evaluate the contents of the changes.
Remaining issues concerning MC will be tackled in future work. For example, it does not capture the clustering structure well when the number of components is underestimated; thus, we need to explore model selection methods that are more compatible with MC. We also need to further study its theoretical aspects, such as convergence and methods for approximating the mutual information. Furthermore, we need to consider extending the concept of MC to other clustering approaches, e.g., to co-clustering, by relating the off-diagonal blocks in co-clustering to cluster overlaps in finite mixture models.

Author Contributions

Conceptualization, S.K. and K.Y.; methodology, S.K. and K.Y.; software, S.K.; validation, S.K. and K.Y.; formal analysis, S.K.; investigation, S.K. and K.Y; resources, S.K.; data curation, S.K.; writing—original draft preparation, S.K. and K.Y.; writing—review and editing, S.K. and K.Y.; visualization, S.K.; supervision, K.Y.; project administration, K.Y.; funding acquisition, K.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by JST KAKENHI JP19H01114.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of the Basic Properties

We present the proofs of Propositions 1–4.

Appendix A.1. Proof of Proposition 1

We can directly calculate as follows:
$$ \mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}; \{w_n\}_n\bigr) = -\sum_{k=1}^{K} \frac{\sum_n w_n \gamma_k}{\sum_n w_n} \log \frac{\sum_n w_n \gamma_k}{\sum_n w_n} + \frac{1}{\sum_n w_n} \sum_{n=1}^{N} w_n \sum_{k=1}^{K} \gamma_k \log \gamma_k = -\sum_{k=1}^{K} \gamma_k \log \gamma_k + \sum_{k=1}^{K} \gamma_k \log \gamma_k = 0. $$

Appendix A.2. Proof of Proposition 2

Because $x \log x = 0$ when $x = 0$ or $x = 1$, $\tilde{H}(Z \mid X)$ becomes 0 when the components are entirely separate. Thus, the proposition holds.

Appendix A.3. Proof of Proposition 3

It is immediate that
$$ \mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}; \{w_n\}_n\bigr) = \tilde{H}(Z) - \tilde{H}(Z \mid X) \overset{\mathrm{(a)}}{\le} \tilde{H}(Z) \overset{\mathrm{(b)}}{\le} \log K. $$
Equality in (a) holds only if the components are entirely separate, and equality in (b) holds only if the components are balanced. Thus, MC equals $\log K$ only if the components are entirely separate and balanced.
Also, by applying Jensen's inequality to $x \mapsto x \log x$, we obtain
$$ -\sum_{k=1}^{K} \frac{\sum_n w_n \gamma_k(x_n)}{\sum_n w_n} \log \frac{\sum_n w_n \gamma_k(x_n)}{\sum_n w_n} \ge -\sum_{k=1}^{K} \frac{1}{\sum_n w_n} \sum_{n=1}^{N} w_n\, \gamma_k(x_n) \log \gamma_k(x_n), $$
which is equivalent to $\mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}; \{w_n\}_n\bigr) \ge 0$. Equality holds only if the components entirely overlap.

Appendix A.4. Proof of Proposition 4

By applying Theorem 1 so as to partition $I_1 \cup \cdots \cup I_L$ into the individual sets $I_1, \ldots, I_L$, we can calculate that
$$
\begin{aligned}
\mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}; \{w_n\}_n\bigr)
&= \mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k \in I_1 \cup \cdots \cup I_L,\, n}; \{w_n\}_n\bigr) \\
&= \mathrm{MC}\Bigl(\Bigl\{\sum_{k \in I_l} \gamma_k(x_n)\Bigr\}_{l = 1, \ldots, L,\, n}; \{w_n\}_n\Bigr)
+ \sum_{l=1}^{L} \Bigl(\sum_{k \in I_l} \tilde{\rho}_k\Bigr)\, \mathrm{MC}\Bigl(\Bigl\{\frac{\gamma_k(x_n)}{\sum_{k' \in I_l} \gamma_{k'}(x_n)}\Bigr\}_{k \in I_l,\, n};\ \Bigl\{w_n \sum_{k \in I_l} \gamma_k(x_n)\Bigr\}_n\Bigr) \\
&= \mathrm{MC}\bigl(\{\gamma_l^0(x_n)\}_{l,n}; \{w_n\}_n\bigr)
+ \sum_{l=1}^{L} \Bigl(\sum_{k \in I_l} \tilde{\rho}_k\Bigr)\, \mathrm{MC}\Bigl(\Bigl\{\frac{\rho_k}{\sum_{k' \in I_l} \rho_{k'}}\Bigr\}_{k \in I_l,\, n};\ \Bigl\{w_n \sum_{k \in I_l} \gamma_k(x_n)\Bigr\}_n\Bigr) \\
&= \mathrm{MC}\bigl(\{\gamma_l^0(x_n)\}_{l,n}; \{w_n\}_n\bigr) \qquad (\text{by Proposition 1}),
\end{aligned}
$$
where the first equality holds because the components in $I_{\emptyset}$ have zero weight and hence zero posterior probability; the third holds because all components in $I_l$ share the same distribution $g_l^0$, so that $\sum_{k \in I_l} \gamma_k(x_n) = \gamma_l^0(x_n)$ and $\gamma_k(x_n) / \sum_{k' \in I_l} \gamma_{k'}(x_n) = \rho_k / \sum_{k' \in I_l} \rho_{k'}$ does not depend on $x_n$; and the last follows from Proposition 1.

Appendix B. Proof of the Decomposition Property

We present a proof of Theorem 1. Let
$$ \tilde{\rho}_k^{(l)} = \frac{\sum_n w_n^{(l)}\, \gamma_k^{(l)}(x_n)}{\sum_n w_n^{(l)}} = \frac{\sum_n w_n\, Q_k(l)\, \gamma_k(x_n)}{\sum_n w_n \sum_{k'} Q_{k'}(l)\, \gamma_{k'}(x_n)}. $$
Then, we can calculate as follows:
$$
\begin{aligned}
&\mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}; \{w_n\}_n\bigr) - \mathrm{MC}\Bigl(\Bigl\{\sum_{k=1}^{K} Q_k(l)\, \gamma_k(x_n)\Bigr\}_{l,n}; \{w_n\}_n\Bigr) \\
&\quad= \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{l=1}^{L} \frac{w_n\, Q_k(l)\, \gamma_k(x_n)}{\sum_n w_n} \log\!\left( \frac{\gamma_k(x_n)}{\frac{\sum_n w_n\, \gamma_k(x_n)}{\sum_n w_n}} \cdot \frac{\frac{\sum_n w_n \sum_k Q_k(l)\, \gamma_k(x_n)}{\sum_n w_n}}{\sum_k Q_k(l)\, \gamma_k(x_n)} \right) \\
&\quad= \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{l=1}^{L} \frac{w_n\, Q_k(l)\, \gamma_k(x_n)}{\sum_n w_n} \log\!\left( \frac{Q_k(l)\, \gamma_k(x_n)}{\sum_k Q_k(l)\, \gamma_k(x_n)} \cdot \frac{\sum_n w_n \sum_k Q_k(l)\, \gamma_k(x_n)}{\sum_n w_n\, Q_k(l)\, \gamma_k(x_n)} \right) \\
&\quad= \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{l=1}^{L} W_l \cdot \frac{w_n\, Q_k(l)\, \gamma_k(x_n)}{\sum_n w_n \sum_k Q_k(l)\, \gamma_k(x_n)} \log \frac{\gamma_k^{(l)}(x_n)}{\tilde{\rho}_k^{(l)}} \\
&\quad= \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{l=1}^{L} W_l \cdot \frac{w_n^{(l)}\, \gamma_k^{(l)}(x_n)}{\sum_n w_n^{(l)}} \log \frac{\gamma_k^{(l)}(x_n)}{\tilde{\rho}_k^{(l)}} \\
&\quad= \sum_{l=1}^{L} W_l \cdot \mathrm{MC}\bigl(\{\gamma_k^{(l)}(x_n)\}_{k,n}; \{w_n^{(l)}\}_n\bigr).
\end{aligned}
$$

Appendix C. Proof of the Consistency

We present a proof of Theorem 2. We use the following notations for convenience:
$$ \bigl( \gamma_1(x_n) \,\big|\, \cdots \,\big|\, \gamma_K(x_n) \bigr) := \{\gamma_k(x_n)\}_{k,n}, \qquad I_0 := I_1 \cup \cdots \cup I_{K_*}. $$
First, applying Theorem 1 repeatedly, we decompose the entire MC as follows:
$$
\begin{aligned}
\mathrm{MC}\bigl(\{\gamma_k(x_n)\}_{k,n}\bigr)
&= \mathrm{MC}\Bigl( \sum_{k \in I_0} \gamma_k(x_n) \,\Big|\, \sum_{k \in I_{\emptyset}} \gamma_k(x_n) \Bigr)
+ \tilde{\rho}\, \mathrm{MC}\Bigl( \Bigl\{ \frac{\gamma_k(x_n)}{\sum_{k' \in I_{\emptyset}} \gamma_{k'}(x_n)} \Bigr\}_{k \in I_{\emptyset},\, n};\ \Bigl\{ \sum_{k \in I_{\emptyset}} \gamma_k(x_n) \Bigr\}_n \Bigr)
+ (1 - \tilde{\rho})\, \mathrm{MC}\Bigl( \Bigl\{ \frac{\gamma_k(x_n)}{\sum_{k' \in I_0} \gamma_{k'}(x_n)} \Bigr\}_{k \in I_0,\, n};\ \Bigl\{ \sum_{k \in I_0} \gamma_k(x_n) \Bigr\}_n \Bigr) \\
&= \mathrm{MC}\Bigl( \sum_{k \in I_0} \gamma_k(x_n) \,\Big|\, \sum_{k \in I_{\emptyset}} \gamma_k(x_n) \Bigr)
+ \tilde{\rho}\, \mathrm{MC}\Bigl( \Bigl\{ \frac{\gamma_k(x_n)}{\sum_{k' \in I_{\emptyset}} \gamma_{k'}(x_n)} \Bigr\}_{k \in I_{\emptyset},\, n};\ \Bigl\{ \sum_{k \in I_{\emptyset}} \gamma_k(x_n) \Bigr\}_n \Bigr) \\
&\qquad + (1 - \tilde{\rho})\, \mathrm{MC}\Bigl( \Bigl\{ \frac{\sum_{k \in I_l} \gamma_k(x_n)}{\sum_{k' \in I_0} \gamma_{k'}(x_n)} \Bigr\}_{l = 1, \ldots, K_*,\, n};\ \Bigl\{ \sum_{k \in I_0} \gamma_k(x_n) \Bigr\}_n \Bigr)
+ \sum_{l=1}^{K_*} \Bigl( \sum_{k \in I_l} \tilde{\rho}_k \Bigr)\, \mathrm{MC}\Bigl( \Bigl\{ \frac{\gamma_k(x_n)}{\sum_{k' \in I_l} \gamma_{k'}(x_n)} \Bigr\}_{k \in I_l,\, n};\ \Bigl\{ \sum_{k \in I_l} \gamma_k(x_n) \Bigr\}_n \Bigr) \\
&= \mathrm{MC}\Bigl( \frac{\sum_{l=1}^{K_*} r_l h_l(x_n)}{f(x_n)} \,\Big|\, \frac{\sum_{k \in I_{\emptyset}} \rho_k g_k(x_n)}{f(x_n)} \Bigr)
+ \tilde{\rho}\, \mathrm{MC}\Bigl( \Bigl\{ \frac{\rho_k\, g_k(x_n)}{\sum_{k' \in I_{\emptyset}} \rho_{k'}\, g_{k'}(x_n)} \Bigr\}_{k \in I_{\emptyset},\, n};\ \Bigl\{ \sum_{k \in I_{\emptyset}} \gamma_k(x_n) \Bigr\}_n \Bigr) \\
&\qquad + (1 - \tilde{\rho})\, \mathrm{MC}\Bigl( \Bigl\{ \frac{r_l h_l(x_n)}{\sum_{l'=1}^{K_*} r_{l'} h_{l'}(x_n)} \Bigr\}_{l = 1, \ldots, K_*,\, n};\ \Bigl\{ \sum_{k \in I_0} \gamma_k(x_n) \Bigr\}_n \Bigr)
+ \sum_{l=1}^{K_*} \Bigl( \sum_{k \in I_l} \tilde{\rho}_k \Bigr)\, \mathrm{MC}\Bigl( \Bigl\{ \frac{s_k\, g_k(x_n)}{h_l(x_n)} \Bigr\}_{k \in I_l,\, n};\ \Bigl\{ \sum_{k \in I_l} \gamma_k(x_n) \Bigr\}_n \Bigr).
\end{aligned}
\tag{A1}
$$
The terms in (A1) correspond to the four quantities (a), …, (d) referred to in Section 4.3. Then, it is sufficient to show that
$$
\begin{aligned}
&\mathrm{MC}\Bigl( \frac{\sum_{l=1}^{K_*} r_l h_l(x_n)}{f(x_n)} \,\Big|\, \frac{\sum_{k \in I_{\emptyset}} \rho_k g_k(x_n)}{f(x_n)} \Bigr) \to 0, \qquad
\tilde{\rho}\, \mathrm{MC}\Bigl( \Bigl\{ \frac{\rho_k\, g_k(x_n)}{\sum_{k' \in I_{\emptyset}} \rho_{k'}\, g_{k'}(x_n)} \Bigr\}_{k \in I_{\emptyset},\, n};\ \Bigl\{ \sum_{k \in I_{\emptyset}} \gamma_k(x_n) \Bigr\}_n \Bigr) \to 0, \\
&(1 - \tilde{\rho})\, \mathrm{MC}\Bigl( \Bigl\{ \frac{r_l h_l(x_n)}{\sum_{l'=1}^{K_*} r_{l'} h_{l'}(x_n)} \Bigr\}_{l,\, n};\ \Bigl\{ \sum_{k \in I_0} \gamma_k(x_n) \Bigr\}_n \Bigr) \to \mathrm{MC}\bigl(\{\gamma_k^*(x_n)\}_{k,n}\bigr), \qquad
\sum_{l=1}^{K_*} \Bigl( \sum_{k \in I_l} \tilde{\rho}_k \Bigr)\, \mathrm{MC}\Bigl( \Bigl\{ \frac{s_k\, g_k(x_n)}{h_l(x_n)} \Bigr\}_{k \in I_l,\, n};\ \Bigl\{ \sum_{k \in I_l} \gamma_k(x_n) \Bigr\}_n \Bigr) \to 0.
\end{aligned}
$$
  • Step 1
First, we show that
$$ \mathrm{MC}\Bigl( \frac{\sum_{l=1}^{K_*} r_l h_l(x_n)}{f(x_n)} \,\Big|\, \frac{\sum_{k \in I_{\emptyset}} \rho_k g_k(x_n)}{f(x_n)} \Bigr) = O_P\bigl( |\tilde{\rho}\log\tilde{\rho}| \bigr). $$
Using Proposition 3, this is easily shown as follows:
$$ \mathrm{MC}\Bigl( \frac{\sum_{l=1}^{K_*} r_l h_l(x_n)}{f(x_n)} \,\Big|\, \frac{\sum_{k \in I_{\emptyset}} \rho_k g_k(x_n)}{f(x_n)} \Bigr) \le -\tilde{\rho}\log\tilde{\rho} - (1 - \tilde{\rho})\log(1 - \tilde{\rho}) = O_P\bigl( |\tilde{\rho}\log\tilde{\rho}| \bigr). $$
  • Step 2
Second, we show that
$$ \tilde{\rho}\, \mathrm{MC}\Bigl( \Bigl\{ \frac{\rho_k\, g_k(x_n)}{\sum_{k' \in I_{\emptyset}} \rho_{k'}\, g_{k'}(x_n)} \Bigr\}_{k \in I_{\emptyset},\, n};\ \Bigl\{ \sum_{k \in I_{\emptyset}} \gamma_k(x_n) \Bigr\}_n \Bigr) = O_P(\tilde{\rho}). $$
This is also evident from Proposition 3:
$$ \tilde{\rho}\, \mathrm{MC}\Bigl( \Bigl\{ \frac{\rho_k\, g_k(x_n)}{\sum_{k' \in I_{\emptyset}} \rho_{k'}\, g_{k'}(x_n)} \Bigr\}_{k \in I_{\emptyset},\, n};\ \Bigl\{ \sum_{k \in I_{\emptyset}} \gamma_k(x_n) \Bigr\}_n \Bigr) \le \tilde{\rho}\log\bigl( K - i_{K_*} \bigr) = O_P(\tilde{\rho}). $$
  • Step 3
Third, we show that
1 ρ ˜ MC r l h l x n l = 1 K r l h l x n l = 1 , , K ; k I 0 γ k x n = MC γ k x n k , n + O P ϕ ϕ + l = 1 K r ˜ l ρ l + ρ ˜ .
To this end, we further decompose the left hand as
1 ρ ˜ MC r l h l x n l = 1 K r l h l x n l = 1 , , K ; k I 0 γ k x n = 1 ρ ˜ l = 1 K r ˜ l 1 ρ ˜ log r ˜ l 1 ρ ˜ + 1 N n = 1 N l = 1 K k I 0 γ k x n r l h l x n l = 1 K r l h l x n log r l h l x n l = 1 K r l h l x n ;
they correspond to the unconditional and conditional entropy of the latent variables, respectively. On the other hand, the true MC is defined as
l = 1 K ρ ˜ l log ρ ˜ l + 1 N n = 1 N l = 1 K γ l x n log γ l x n .
Then, it is sufficient to show that
1 ρ ˜ l = 1 K r ˜ l 1 ρ ˜ log r ˜ l 1 ρ ˜ l = 1 K ρ ˜ l log ρ ˜ l , 1 N n = 1 N l = 1 K k I 0 γ k x n r l h l x n l = 1 K r l h l x n log r l h l x n l = 1 K r l h l x n 1 N n = 1 N l = 1 K γ l x n log γ l x n .
First, we show that
1 ρ ˜ l = 1 K r ˜ l 1 ρ ˜ log r ˜ l 1 ρ ˜ = l = 1 K ρ ˜ l log ρ ˜ l + O P l = 1 K r ˜ l ρ l + ρ ˜ + 1 N .
Indeed, by the mean-value theorem, there exist r 1 m , , r K m between r ˜ 1 , , r ˜ K and ρ 1 , , ρ K and ρ m between 0 and ρ ˜ such that
l = 1 K r ˜ l log r ˜ l = l = 1 K ρ l log ρ l + l = 1 K 1 + log r l m r ˜ l ρ l l = 1 , , K ,
1 ρ ˜ log 1 ρ ˜ = 1 log 1 ρ m ρ ˜ .
Also, from assumption (C3), $\log r_l^m$ and $\log(1-\rho^m)$ are finite for sufficiently large $N$, because $r_l^m$ and $\rho^m$ become arbitrarily close to $\rho_l\,(>0)$ and $0$, respectively. Similarly, there exist $\rho_1^m,\dots,\rho_{K^\ast}^m$ between $\tilde\rho_1,\dots,\tilde\rho_{K^\ast}$ and $\rho_1,\dots,\rho_{K^\ast}$ such that
\[
\sum_{l=1}^{K^\ast}\tilde\rho_l\log\tilde\rho_l=\sum_{l=1}^{K^\ast}\rho_l\log\rho_l+\sum_{l=1}^{K^\ast}(1+\log\rho_l^m)(\tilde\rho_l-\rho_l).\qquad\text{(A4)}
\]
Also, by the central limit theorem, $\tilde\rho_l$ converges to $\rho_l$ at the rate $O_P(1/\sqrt N)$.
Using (A2)–(A4) and the fact that $\sum_{l=1}^{K^\ast}\tilde r_l=1-\tilde\rho$, we can calculate as follows:
\[
\begin{aligned}
-(1-\tilde\rho)\sum_{l=1}^{K^\ast}\frac{\tilde r_l}{1-\tilde\rho}\log\frac{\tilde r_l}{1-\tilde\rho}
&=-\sum_{l=1}^{K^\ast}\tilde r_l\log\tilde r_l+(1-\tilde\rho)\log(1-\tilde\rho)\\
&=-\sum_{l=1}^{K^\ast}\rho_l\log\rho_l+O_P\Bigl(\sum_{l=1}^{K^\ast}|\tilde r_l-\rho_l|+\tilde\rho\Bigr)\\
&=-\sum_{l=1}^{K^\ast}\tilde\rho_l\log\tilde\rho_l+O_P\Bigl(\sum_{l=1}^{K^\ast}|\tilde r_l-\rho_l|+\tilde\rho+\frac{1}{\sqrt N}\Bigr).
\end{aligned}
\]
Next, we show that
\[
\frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{K^\ast}\Bigl(\sum_{k\in I_0}\gamma_k(x_n)\Bigr)\frac{r_lh_l(x_n)}{\sum_{l'=1}^{K^\ast}r_{l'}h_{l'}(x_n)}\log\frac{r_lh_l(x_n)}{\sum_{l'=1}^{K^\ast}r_{l'}h_{l'}(x_n)}
=\frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{K^\ast}\gamma_l(x_n)\log\gamma_l(x_n)+O_P\bigl(\|\varphi-\varphi^\ast\|+\tilde\rho\bigr).
\]
We first define the following functions for $l=1,\dots,K^\ast$:
\[
F_l(\varphi,\psi,x):=\frac{r_lh_l(x)}{\sum_{l'=1}^{K^\ast}r_{l'}h_{l'}(x)}\log\frac{r_lh_l(x)}{\sum_{l'=1}^{K^\ast}r_{l'}h_{l'}(x)}.
\]
When $\varphi=\varphi^\ast$, $F_l(\varphi^\ast,\psi,x)$ equals $\gamma_l(x)\log\gamma_l(x)$ and is independent of $\psi$; in this case, we omit $\psi$ and write $F_l(\varphi^\ast,\psi,x)$ as $F_l(\varphi^\ast,x)$. Applying the mean-value theorem to this function, there exists a function $0<t(x)<1$ such that
\[
F_l(\varphi,\psi,x)=F_l(\varphi^\ast,x)+\frac{\partial F_l}{\partial\varphi}\bigl(\varphi^\ast+t(x)(\varphi-\varphi^\ast),\psi,x\bigr)^\top(\varphi-\varphi^\ast).\qquad\text{(A5)}
\]
Moreover, it can be shown that $\partial F_l(\varphi^\ast+t(x)(\varphi-\varphi^\ast),\psi,x)/\partial\varphi=O_P(1)$ uniformly in $x$, $t(x)$, and $\psi$. Indeed, letting
\[
\varphi^m:=\bigl(\{\theta_k^m\}_{k=1}^{i_K},\{r_l^m\}_{l=1}^{K^\ast},\{\rho_k^m\}_{k=i_K+1}^{K}\bigr):=\varphi^\ast+t(x)(\varphi-\varphi^\ast),\qquad
h_l^m(x):=\sum_{k\in I_l}s_kg_k(x\mid\theta_k^m),\qquad
\gamma_l^m(x):=\frac{r_l^mh_l^m(x)}{\sum_{l'=1}^{K^\ast}r_{l'}^mh_{l'}^m(x)},
\]
the derivative with respect to each parameter in $\varphi$ is bounded as follows:
\[
\begin{aligned}
\Bigl|\frac{\partial F_l(\varphi^m,\psi,x)}{\partial r_l}\Bigr|
&=\Biggl|\Bigl(\frac{h_l^m(x)}{\sum_{l'}r_{l'}^mh_{l'}^m(x)}-\frac{r_l^mh_l^m(x)^2}{\bigl(\sum_{l'}r_{l'}^mh_{l'}^m(x)\bigr)^2}\Bigr)\Bigl(1+\log\frac{r_l^mh_l^m(x)}{\sum_{l'}r_{l'}^mh_{l'}^m(x)}\Bigr)\Biggr|\\
&\le\frac{\gamma_l^m(x)}{r_l^m}+\frac{\gamma_l^m(x)^2}{r_l^m}+\frac{1}{r_l^m}\bigl|\gamma_l^m(x)\log\gamma_l^m(x)\bigr|+\frac{1}{r_l^m}\bigl|\gamma_l^m(x)^2\log\gamma_l^m(x)\bigr|\le\frac{4}{r_l^m},\\
\Bigl|\frac{\partial F_l(\varphi^m,\psi,x)}{\partial r_m}\Bigr|\ (m\neq l)
&=\Biggl|\frac{r_l^mh_l^m(x)h_m^m(x)}{\bigl(\sum_{l'}r_{l'}^mh_{l'}^m(x)\bigr)^2}\Bigl(1+\log\frac{r_l^mh_l^m(x)}{\sum_{l'}r_{l'}^mh_{l'}^m(x)}\Bigr)\Biggr|
\le\frac{\gamma_m^m(x)}{r_m^m}\bigl|\gamma_l^m(x)\bigl(1+\log\gamma_l^m(x)\bigr)\bigr|\le\frac{1}{r_m^m},\\
\Bigl|\frac{\partial F_l(\varphi^m,\psi,x)}{\partial\theta_k}\Bigr|\ (k\in I_l)
&=\Biggl|\Bigl(\frac{r_l^m}{\sum_{l'}r_{l'}^mh_{l'}^m(x)}\frac{\partial h_l^m(x)}{\partial\theta_k}-\frac{(r_l^m)^2h_l^m(x)}{\bigl(\sum_{l'}r_{l'}^mh_{l'}^m(x)\bigr)^2}\frac{\partial h_l^m(x)}{\partial\theta_k}\Bigr)\Bigl(1+\log\frac{r_l^mh_l^m(x)}{\sum_{l'}r_{l'}^mh_{l'}^m(x)}\Bigr)\Biggr|\\
&\le\Bigl(\gamma_l^m(x)+\gamma_l^m(x)^2+\bigl|\gamma_l^m(x)\log\gamma_l^m(x)\bigr|+\bigl|\gamma_l^m(x)^2\log\gamma_l^m(x)\bigr|\Bigr)\frac{1}{h_l^m(x)}\Bigl|\frac{\partial h_l^m(x)}{\partial\theta_k}\Bigr|
\le4\sup_{\theta\in\Theta_\epsilon}\frac{1}{g(x\mid\theta)}\Bigl|\frac{\partial g(x\mid\theta)}{\partial\theta_k}\Bigr|,\\
\Bigl|\frac{\partial F_l(\varphi^m,\psi,x)}{\partial\theta_k}\Bigr|\ (k\in I_m,\,m\neq l)
&=\Biggl|\frac{r_l^mr_m^mh_l^m(x)}{\bigl(\sum_{l'}r_{l'}^mh_{l'}^m(x)\bigr)^2}\frac{\partial h_m^m(x)}{\partial\theta_k}\Bigl(1+\log\frac{r_l^mh_l^m(x)}{\sum_{l'}r_{l'}^mh_{l'}^m(x)}\Bigr)\Biggr|\\
&\le\bigl|\gamma_m^m(x)\gamma_l^m(x)+\gamma_m^m(x)\gamma_l^m(x)\log\gamma_l^m(x)\bigr|\,\frac{1}{h_m^m(x)}\Bigl|\frac{\partial h_m^m(x)}{\partial\theta_k}\Bigr|
\le2\sup_{\theta\in\Theta_\epsilon}\frac{1}{g(x\mid\theta)}\Bigl|\frac{\partial g(x\mid\theta)}{\partial\theta_k}\Bigr|,\\
\frac{\partial F_l(\varphi^m,\psi,x)}{\partial\rho_k}&=0\quad(k\in I),
\end{aligned}
\]
where
\[
\Theta_\epsilon:=\bigl\{\theta\;\big|\;\exists\,l\in\{1,\dots,K^\ast\},\ \|\theta-\theta_l^\ast\|<\|\varphi-\varphi^\ast\|\bigr\}.
\]
These are all finite because $r_l^m$ becomes arbitrarily close to $\rho_l$ and $\Theta_\epsilon$ becomes arbitrarily small as $N\to\infty$; condition (C1) is also employed.
Using (A5), we can calculate as follows:
\[
\begin{aligned}
&\frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{K^\ast}\Bigl(\sum_{k\in I_0}\gamma_k(x_n)\Bigr)\frac{r_lh_l(x_n)}{\sum_{l'=1}^{K^\ast}r_{l'}h_{l'}(x_n)}\log\frac{r_lh_l(x_n)}{\sum_{l'=1}^{K^\ast}r_{l'}h_{l'}(x_n)}\\
&\quad=\frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{K^\ast}\Bigl(1-\sum_{k\in I}\gamma_k(x_n)\Bigr)\Bigl(F_l(\varphi^\ast,x_n)+\frac{\partial F_l}{\partial\varphi}(\varphi^m,\psi,x_n)^\top(\varphi-\varphi^\ast)\Bigr)\\
&\quad=\frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{K^\ast}\Bigl(1-\sum_{k\in I}\gamma_k(x_n)\Bigr)\Bigl(\gamma_l(x_n)\log\gamma_l(x_n)+\frac{\partial F_l}{\partial\varphi}(\varphi^m,\psi,x_n)^\top(\varphi-\varphi^\ast)\Bigr).
\end{aligned}
\]
Therefore,
\[
\begin{aligned}
&\Biggl|\frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{K^\ast}\Bigl(\sum_{k\in I_0}\gamma_k(x_n)\Bigr)\frac{r_lh_l(x_n)}{\sum_{l'}r_{l'}h_{l'}(x_n)}\log\frac{r_lh_l(x_n)}{\sum_{l'}r_{l'}h_{l'}(x_n)}
-\frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{K^\ast}\gamma_l(x_n)\log\gamma_l(x_n)\Biggr|\\
&\quad\le\frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{K^\ast}\Bigl|\frac{\partial F_l}{\partial\varphi}(\varphi^m,\psi,x_n)^\top(\varphi-\varphi^\ast)\Bigr|
+\frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{K^\ast}\Bigl(\sum_{k\in I}\gamma_k(x_n)\Bigr)\bigl|\gamma_l(x_n)\log\gamma_l(x_n)\bigr|\\
&\quad\le\sum_{l=1}^{K^\ast}\sup_{x_n,\psi}\Bigl\|\frac{\partial F_l}{\partial\varphi}(\varphi^m,\psi,x_n)\Bigr\|\,\|\varphi-\varphi^\ast\|+3K^\ast\tilde\rho
=O_P\bigl(\|\varphi-\varphi^\ast\|+\tilde\rho\bigr).
\end{aligned}
\]
  • Step 4
Finally, we show that
\[
\sum_{l=1}^{K^\ast}\sum_{k\in I_l}\tilde\rho_k\,\mathrm{MC}\Bigl(\Bigl\{\frac{s_kg_k(x_n)}{h_l(x_n)}\Bigr\}_{k\in I_l};\Bigl\{\sum_{k\in I_l}\gamma_k(x_n)\Bigr\}_n\Bigr)=O_P\bigl(\|\varphi-\varphi^\ast\|\bigr),
\]
for which it suffices to show that each summand is $O_P(\|\varphi-\varphi^\ast\|)$ for every $l=1,\dots,K^\ast$.
To this end, we write the $l$-th summand as $G(\{\theta_k\}_{k\in I_l})$ and consider it as a function of $\{\theta_k\}_{k\in I_l}$. Note that $G(\{\theta_k^\ast\}_{k\in I_l})=0$ regardless of the values of the other parameters. Also, the derivative of $G$ with respect to each $\theta_m$ ($m\in I_l$) is $O_P(1)$ as $N\to\infty$. Indeed, $G$ can be rewritten as
\[
G\bigl(\{\theta_k\}_{k\in I_l}\bigr)
=\frac{1}{N}\sum_{n=1}^{N}\sum_{k\in I_l}\frac{r_ls_kg_k(x_n)}{f(x_n)}\log\frac{s_kg_k(x_n)}{h_l(x_n)}
-\frac{1}{N}\sum_{k\in I_l}\Bigl(\sum_{n=1}^{N}\frac{r_ls_kg_k(x_n)}{f(x_n)}\Bigr)\log\Bigl(\sum_{n=1}^{N}\frac{r_ls_kg_k(x_n)}{f(x_n)}\Bigr)
+\frac{1}{N}\sum_{k\in I_l}\Bigl(\sum_{n=1}^{N}\frac{r_ls_kg_k(x_n)}{f(x_n)}\Bigr)\log\Bigl(\sum_{n=1}^{N}\frac{r_lh_l(x_n)}{f(x_n)}\Bigr).
\]
Also, we define the posterior probabilities within h l as
\[
\gamma_k^{(l)}(x):=\frac{s_kg_k(x)}{h_l(x)},\qquad k\in I_l.
\]
Then, the derivatives are bounded as
\[
\begin{aligned}
&\Biggl|\frac{\partial G(\{\theta_k\}_{k\in I_l})}{\partial\theta_m}\Biggr|\ (m\in I_l)\\
&=\Biggl|\frac{1}{N}\sum_{n=1}^{N}\sum_{k\in I_l}\Bigl(\frac{r_ls_k}{f(x_n)}\frac{\partial g_k(x_n)}{\partial\theta_m}-\frac{r_l^2s_ks_mg_k(x_n)}{f(x_n)^2}\frac{\partial g_m(x_n)}{\partial\theta_m}\Bigr)
\Bigl(\log\frac{s_kg_k(x_n)}{h_l(x_n)}-\log\sum_{n=1}^{N}\frac{r_ls_kg_k(x_n)}{f(x_n)}+\log\sum_{n=1}^{N}\frac{r_lh_l(x_n)}{f(x_n)}\Bigr)\\
&\qquad+\frac{1}{N}\sum_{n=1}^{N}\sum_{k\in I_l}\Bigl(\frac{r_ls_k}{f(x_n)}\frac{\partial g_k(x_n)}{\partial\theta_m}-\frac{r_ls_ks_mg_k(x_n)}{f(x_n)h_l(x_n)}\frac{\partial g_m(x_n)}{\partial\theta_m}\Bigr)
-\frac{1}{N}\sum_{k\in I_l}\sum_{n=1}^{N}\Bigl(\frac{r_ls_k}{f(x_n)}\frac{\partial g_k(x_n)}{\partial\theta_m}-\frac{r_l^2s_ks_mg_k(x_n)}{f(x_n)^2}\frac{\partial g_m(x_n)}{\partial\theta_m}\Bigr)\\
&\qquad+\frac{1}{N}\sum_{k\in I_l}\frac{\sum_{n=1}^{N}r_ls_kg_k(x_n)/f(x_n)}{\sum_{n=1}^{N}r_lh_l(x_n)/f(x_n)}\sum_{n=1}^{N}\Bigl(\frac{r_ls_m}{f(x_n)}\frac{\partial g_m(x_n)}{\partial\theta_m}-\frac{r_l^2s_mh_l(x_n)}{f(x_n)^2}\frac{\partial g_m(x_n)}{\partial\theta_m}\Bigr)\Biggr|\\
&\le\frac{1}{N}\sum_{n=1}^{N}\sum_{k\in I_l}\bigl|\gamma_k^{(l)}(x_n)\log\gamma_k^{(l)}(x_n)\bigr|\frac{1}{g_k(x_n)}\Bigl|\frac{\partial g_k(x_n)}{\partial\theta_m}\Bigr|
+\frac{1}{N}\sum_{n=1}^{N}\sum_{k\in I_l}\gamma_k^{(l)}(x_n)\gamma_m^{(l)}(x_n)\bigl|\log\gamma_k^{(l)}(x_n)\bigr|\frac{1}{g_m(x_n)}\Bigl|\frac{\partial g_m(x_n)}{\partial\theta_m}\Bigr|\\
&\qquad+\sum_{k\in I_l}\frac{2}{N}\sum_{n=1}^{N}\frac{r_lh_l(x_n)}{f(x_n)}\sup_{\theta\in\Theta_\epsilon}\frac{1}{g(x_n\mid\theta)}\Bigl|\frac{\partial g(x_n\mid\theta)}{\partial\theta}\Bigr|
\cdot\Biggl|\frac{\sum_{n=1}^{N}r_ls_kg_k(x_n)/f(x_n)}{\sum_{n=1}^{N}r_lh_l(x_n)/f(x_n)}\log\frac{\sum_{n=1}^{N}r_ls_kg_k(x_n)/f(x_n)}{\sum_{n=1}^{N}r_lh_l(x_n)/f(x_n)}\Biggr|\\
&\qquad+\frac{2}{N}\sum_{n=1}^{N}\sum_{k\in I_l}\gamma_k^{(l)}(x_n)\frac{1}{g_k(x_n)}\Bigl|\frac{\partial g_k(x_n)}{\partial\theta_m}\Bigr|
+\frac{2}{N}\sum_{n=1}^{N}\sum_{k\in I_l}\gamma_k^{(l)}(x_n)\gamma_m^{(l)}(x_n)\frac{1}{g_m(x_n)}\Bigl|\frac{\partial g_m(x_n)}{\partial\theta_m}\Bigr|\\
&\le8K\sup_{\theta\in\Theta_\epsilon}\frac{1}{g(x_n\mid\theta)}\Bigl|\frac{\partial g(x_n\mid\theta)}{\partial\theta}\Bigr|=O_P(1),
\end{aligned}
\]
where it is assumed that $\{\theta_k\}_{k\in I_l}$ are sufficiently close to $\{\theta_k^\ast\}_{k\in I_l}$; this holds for sufficiently large $N$ because of condition (C2).
Therefore, by the mean-value theorem, there exist $\{\theta_k^m\}_{k\in I_l}$ such that
\[
G\bigl(\{\theta_k\}_{k\in I_l}\bigr)=\sum_{k\in I_l}\frac{\partial G}{\partial\theta_k}\bigl(\{\theta_k^m\}_{k\in I_l}\bigr)^\top(\theta_k-\theta_k^\ast)=O_P\bigl(\|\varphi-\varphi^\ast\|\bigr),
\]
which concludes the proof.
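As an illustration of how the plug-in MC whose consistency was just established can be evaluated in practice, the sketch below computes it from the posterior responsibilities of a fitted Gaussian mixture using scikit-learn [50]. The helper `mc_from_posteriors` and the synthetic two-cluster data are assumptions made for this example only; the exponential of MC is printed because that is the quantity plotted in Figures 7 and 9.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mc_from_posteriors(gamma):
    """Plug-in MC: -sum_k rho_k log rho_k + (1/N) sum_{n,k} gamma_nk log gamma_nk,
    where rho_k is the average responsibility of component k."""
    rho = gamma.mean(axis=0)
    upper = -np.sum(rho * np.log(rho + 1e-300))
    lower = -np.mean(np.sum(gamma * np.log(gamma + 1e-300), axis=1))
    return upper - lower

# Two well-separated, balanced Gaussian clusters: exp(MC) should be close to 2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
               rng.normal(8.0, 1.0, size=(500, 2))])
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
gamma = gmm.predict_proba(X)              # responsibilities gamma_k(x_n)
print(np.exp(mc_from_posteriors(gamma)))  # approximately 2
```

For overlapping or heavily imbalanced components, the same computation yields a value strictly between 1 and 2, which is the behavior the criterion is designed to capture.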

References

1. McLachlan, G.J.; Peel, D. Finite Mixture Models; Wiley Series in Probability and Statistics: New York, NY, USA, 2000.
2. Fraley, C.; Raftery, A.E. How Many Clusters? Which Clustering Method? Answers via Model-based Cluster Analysis. Comput. J. 1998, 41, 578–588.
3. Hennig, C. Methods for Merging Gaussian Mixture Components. Adv. Data Anal. Classif. 2010, 4, 3–34.
4. Jiang, M.F.; Tseng, S.S.; Su, C.M. Two-phase Clustering Process for Outliers Detection. Pattern Recognit. Lett. 2001, 22, 691–700.
5. He, Z.; Xu, X.; Deng, S. Discovering Cluster-based Local Outliers. Pattern Recognit. Lett. 2003, 24, 1641–1650.
6. Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A Survey on Concept Drift Adaptation. ACM Comput. Surv. 2014, 46, 1–37.
7. Kyoya, S.; Yamanishi, K. Summarizing Finite Mixture Model with Overlapping Quantification. Entropy 2021, 23, 1503.
8. Akaike, H. A New Look at the Statistical Model Identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
9. Schwarz, G. Estimating the Dimension of a Model. Ann. Stat. 1978, 6, 461–464.
10. Rissanen, J. Modeling by Shortest Data Description. Automatica 1978, 14, 465–471.
11. Biernacki, C.; Celeux, G.; Govaert, G. Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 719–725.
12. Kontkanen, P.; Myllymäki, P.; Buntine, W.; Rissanen, J.; Tirri, H. An MDL Framework for Data Clustering. In Advances in Minimum Description Length; MIT Press: Cambridge, MA, USA, 2005; pp. 323–353.
13. Hirai, S.; Yamanishi, K. Efficient Computation of Normalized Maximum Likelihood Codes for Gaussian Mixture Models with Its Applications to Clustering. IEEE Trans. Inf. Theory 2013, 59, 7718–7727; Erratum in IEEE Trans. Inf. Theory 2019, 65, 6827–6828.
14. McLachlan, G.J.; Rathnayake, S. On the Number of Components in a Gaussian Mixture Model. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2014, 4, 341–355.
15. Fukunaga, K. Introduction to Statistical Pattern Recognition, 2nd ed.; Academic Press Professional: San Diego, CA, USA, 1990.
16. Wang, S.; Sun, H. Measuring Overlap-Rate for Cluster Merging in a Hierarchical Approach to Color Image Segmentation. Int. J. Fuzzy Syst. 2004, 6, 147–156.
17. Sun, H.; Wang, S. Measuring the Component Overlapping in the Gaussian Mixture Model. Data Min. Knowl. Discov. 2011, 23, 479–502.
18. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 226–231.
19. Bradley, P.S.; Bennett, K.P.; Demiriz, A. Constrained K-Means Clustering; Technical Report MSR-TR-2000-65; Microsoft Research: Redmond, WA, USA, 2000.
20. Bezdek, J.C.; Ehrlich, R.; Full, W. FCM: The Fuzzy c-Means Clustering Algorithm. Comput. Geosci. 1984, 10, 191–203.
21. Rusch, T.; Hornik, K.; Mair, P. Assessing and Quantifying Clusteredness: The OPTICS Cordillera. J. Comput. Graph. Stat. 2018, 27, 220–233.
22. Yamanishi, K. Descriptive Dimensionality and Its Characterization of MDL-based Learning and Change Detection. arXiv 2019, arXiv:1910.11540.
23. Guha, S.; Mishra, N.; Motwani, R.; O'Callaghan, L. Clustering Data Streams. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, Redondo Beach, CA, USA, 12–14 November 2000; pp. 359–366.
24. Song, M.; Wang, H. Highly Efficient Incremental Estimation of Gaussian Mixture Models for Online Data Stream Clustering. In Proceedings of Intelligent Computing: Theory and Applications III, Orlando, FL, USA, 28 March–1 April 2005; pp. 174–183.
25. Chakrabarti, D.; Kumar, R.; Tomkins, A. Evolutionary Clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; pp. 554–560.
26. Yamanishi, K.; Maruyama, Y. Dynamic Syslog Mining for Network Failure Monitoring. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, IL, USA, 21–24 August 2005; pp. 499–508.
27. Yamanishi, K.; Maruyama, Y. Dynamic Model Selection with Its Applications to Novelty Detection. IEEE Trans. Inf. Theory 2007, 53, 2180–2189.
28. Hirai, S.; Yamanishi, K. Detecting Changes of Clustering Structures Using Normalized Maximum Likelihood Coding. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, China, 12–16 August 2012; pp. 343–351.
29. Herbster, M.; Warmuth, M.K. Tracking the Best Expert. Mach. Learn. 1998, 32, 151–178.
30. Ntoutsi, E.; Spiliopoulou, M.; Theodoridis, Y. FINGERPRINT: Summarizing Cluster Evolution in Dynamic Environments. Int. J. Data Warehous. Min. 2012, 8, 27–44.
31. van Erven, T.; Grünwald, P.; de Rooij, S. Catching Up Faster by Switching Sooner: A Predictive Approach to Adaptive Estimation with an Application to the AIC-BIC Dilemma. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2012, 74, 367–417.
32. Yamanishi, K.; Miyaguchi, K. Detecting Gradual Changes from Data Stream Using MDL-change Statistics. In Proceedings of the 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 5–8 December 2016; pp. 156–163.
33. Yamanishi, K.; Xu, L.; Yuki, R.; Fukushima, S.; Lin, C.H. Change Sign Detection with Differential MDL Change Statistics and Its Applications to COVID-19 Pandemic Analysis. Sci. Rep. 2021, 11, 19795.
34. Hirai, S.; Yamanishi, K. Detecting Latent Structure Uncertainty with Structural Entropy. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 26–35.
35. Ohsawa, Y. Graph-Based Entropy for Detecting Explanatory Signs of Changes in Market. Rev. Socionetwork Strateg. 2018, 12, 183–203.
36. Still, S.; Bialek, W.; Bottou, L. Geometric Clustering Using the Information Bottleneck Method. In Proceedings of Advances in Neural Information Processing Systems 16, Vancouver, BC, Canada, 8–13 December 2003.
37. Lin, J. Divergence Measures Based on the Shannon Entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151.
38. Cover, T.M.; Thomas, J.A. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing); Wiley-Interscience: New York, NY, USA, 2006.
39. Huber, M.F.; Bailey, T.; Durrant-Whyte, H.; Hanebeck, U.D. On Entropy Approximation for Gaussian Mixture Random Vectors. In Proceedings of the IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, Seoul, Korea, 20–22 August 2008; pp. 181–188.
40. Kolchinsky, A.; Tracey, B.D. Estimating Mixture Entropy with Pairwise Distances. Entropy 2017, 19, 361.
41. Teicher, H. Identifiability of Finite Mixtures. Ann. Math. Stat. 1963, 34, 1265–1269.
42. Yakowitz, S.J.; Spragins, J.D. On the Identifiability of Finite Mixtures. Ann. Math. Stat. 1968, 39, 209–214.
43. Liu, X.; Shao, Y. Asymptotics for Likelihood Ratio Tests under Loss of Identifiability. Ann. Stat. 2003, 31, 807–832.
44. Dacunha-Castelle, D.; Gassiat, E. Testing in Locally Conic Models and Application to Mixture Models. ESAIM Probab. Stat. 1997, 1, 285–317.
45. Keribin, C. Consistent Estimation of the Order of Mixture Models. Sankhyā Indian J. Stat. Ser. A 2000, 62, 49–66.
46. Ghosal, S.; van der Vaart, A.W. Entropies and Rates of Convergence for Maximum Likelihood and Bayes Estimation for Mixtures of Normal Densities. Ann. Stat. 2001, 29, 1233–1263.
47. Wu, T.; Sugawara, S.; Yamanishi, K. Decomposed Normalized Maximum Likelihood Codelength Criterion for Selecting Hierarchical Latent Variable Models. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 1165–1174.
48. Yamanishi, K.; Wu, T.; Sugawara, S.; Okada, M. The Decomposed Normalized Maximum Likelihood Code-length Criterion for Selecting Hierarchical Latent Variable Models. Data Min. Knowl. Discov. 2019, 33, 1017–1058.
49. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–38.
50. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
51. Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: http://archive.ics.uci.edu/ml (accessed on 17 August 2022).
Figure 1. Examples of MC with Gaussian mixture models with a mixture size of two.
Figure 2. Relation between the parameter α and the exponential of the MC.
Figure 3. Hierarchy in the mixture model.
Figure 4. Example of the decomposition of MC. The data's color in (b) and thickness in (c,d) correspond to the data weights $w_n^{(l)}$.
Figure 5. Decomposition of the MC to prove Theorem 2.
Figure 6. Scatter plots of the first and second dimensions of the data at t = 1, 75, 101 in the move Gaussian dataset.
Figure 7. Plots of the exponential of MC and K for the move Gaussian dataset. The filled area represents the transition period. The markers on the plot represent the alerts in each method.
Figure 8. Scatter plots of the first and second dimensions of the data at t = 1, 80, 101 in the imbalance Gaussian dataset.
Figure 9. Plots of the exponential of MC and K for the imbalance Gaussian dataset. The filled area represents the transition period. The markers on the plot represent the alerts in each method.
Figure 10. Relationships between the computation time and N and d. In (a), we fixed d = 10 and varied N from 10 to 10^6. In (b), we fixed N = 10,000 and varied d from 10 to 10^4.
Figure 11. Plots of the sequences of the MC and K in the beer dataset.
Figure 12. Plots of the decomposition of MC with BIC in the beer dataset.
Figure 13. Plots of the decomposition of MC with NML in the beer dataset.
Figure 14. Plots of the sequences of the MC and K in the house dataset.
Figure 15. Plots of the decomposition of MC with BIC in the house dataset.
Figure 16. Plots of the decomposition of MC with NML in the house dataset.
Table 1. Quantities in the example of the decomposition.

                              Component 1    Component 2
MC (total)                    1.076
MC (interaction)              0.643
Contribution (component l)    0.277          0.157
W (component l)               0.496          0.504
MC (component l)              0.558          0.311
Table 2. Difference in the average performance score between MC and K for the move Gaussian dataset.

(Score of MC) − (Score of K)
               t = 1 → 150            t = 150 → 1
Criterion      Delay     FAR          Delay     FAR
AIC            0.0       0.000        −20.6     0.000
AIC+comp       0.0       0.000        −10.9     0.000
BIC            0.0       0.000        −17.5     0.000
BIC+comp       0.0       0.000        −8.9      0.000
NML            0.0       0.000        −7.9      0.000
DNML           0.0       0.000        −7.7      0.000
Table 3. Differences in the average performance score between MC and K for the imbalance Gaussian dataset.

(Score of MC) − (Score of K)
               t = 1 → 150            t = 150 → 1
Criterion      Delay     FAR          Delay     FAR
AIC            −30.2     0.010        −4.6      0.000
AIC+comp       −34.0     0.000        0.0       0.000
BIC            −34.0     0.000        0.0       0.000
BIC+comp       −34.0     0.000        0.0       0.000
NML            −34.0     0.000        0.0       0.000
DNML           −34.0     0.000        0.0       0.000
Table 4. Summary of the dataset.

Dataset    T     N_t     d     Description
beer       92    3185    16    purchase data of beer.
house      96    4326    3     electricity consumption data in a house.
Table 5. Centers of the upper components estimated by BIC in the beer dataset. For each dimension, the maximum value is denoted in boldface.

                   Component 1    Component 2    Component 3    Component 4
beer(A)            0.09           0.44           1.93           0.16
beer(B)            0.07           0.23           0.96           0.06
beer(C)            0.07           0.20           0.83           0.07
beer(D)            0.05           0.20           0.58           0.05
beer(E)            0.03           0.06           0.35           0.03
beer(F)            0.03           0.06           0.35           0.02
beer(other)        0.04           0.12           0.69           0.10
beerlike(A)        0.02           5.85           0.23           0.07
beerlike(B)        0.09           0.57           0.80           0.21
beerlike(C)        0.10           0.63           0.83           0.22
beerlike(D)        0.07           0.40           0.57           0.18
beerlike(E)        0.04           0.12           0.51           0.06
beerlike(F)        0.04           0.20           0.34           0.13
beerlike(G)        0.05           0.10           0.40           0.06
beerlike(H)        0.03           0.09           0.26           0.04
beerlike(other)    0.09           1.27           1.11           6.78
Table 6. Centers of the upper components estimated by NML in the beer dataset. For each dimension, the maximum value is denoted in boldface.

                   Component 1    Component 2    Component 3    Component 4
beer(A)            0.08           0.48           1.90           0.12
beer(B)            0.04           0.30           1.04           0.07
beer(C)            0.05           0.20           0.95           0.04
beer(D)            0.04           0.19           0.64           0.09
beer(E)            0.02           0.06           0.38           0.02
beer(F)            0.02           0.07           0.40           0.01
beer(other)        0.03           0.11           0.68           0.19
beerlike(A)        0.02           5.79           0.21           0.07
beerlike(B)        0.10           0.52           0.73           0.18
beerlike(C)        0.11           0.61           0.70           0.21
beerlike(D)        0.06           0.49           0.52           0.24
beerlike(E)        0.04           0.12           0.47           0.07
beerlike(F)        0.04           0.18           0.30           0.24
beerlike(G)        0.04           0.11           0.44           0.07
beerlike(H)        0.02           0.10           0.23           0.09
beerlike(other)    0.08           1.42           1.08           6.54
Table 7. Centers of the upper components estimated by BIC in the house dataset. For each dimension, the maximum value is denoted in boldface.

               Component 1    Component 2    Component 3    Component 4
metering(A)    0.04           4.47           0.13           0.41
metering(B)    0.53           0.89           0.56           4.40
metering(C)    0.75           3.34           4.37           2.96
Table 8. Centers of the upper components estimated by NML in the house dataset. For each dimension, the maximum value is denoted in boldface.

               Component 1    Component 2    Component 3    Component 4
metering(A)    0.04           4.24           0.11           0.35
metering(B)    0.53           1.00           0.57           4.48
metering(C)    0.76           3.37           4.38           2.93
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
