Article

A Unified Formulation of k-Means, Fuzzy c-Means and Gaussian Mixture Model by the Kolmogorov–Nagumo Average

1 Department of Computer and Information Science, Seikei University, 3-3-1 Kichijoji-Kitamachi, Musashino-shi, Tokyo 180-8633, Japan
2 The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan
* Author to whom correspondence should be addressed.
Entropy 2021, 23(5), 518; https://doi.org/10.3390/e23050518
Submission received: 22 March 2021 / Revised: 17 April 2021 / Accepted: 19 April 2021 / Published: 24 April 2021

Abstract
Clustering is a major unsupervised learning technique and is widely applied in data mining and statistical data analyses. Typical examples include k-means, fuzzy c-means, and Gaussian mixture models, which are categorized into hard, soft, and model-based clusterings, respectively. We propose a new clustering method, called Pareto clustering, based on the Kolmogorov–Nagumo average, which is defined by a survival function of the Pareto distribution. The proposed algorithm incorporates all the aforementioned clusterings plus maximum-entropy clustering. We introduce a probabilistic framework for the proposed method, in which the underlying distribution that guarantees consistency is discussed. We build a minorize-maximization algorithm to estimate the parameters in Pareto clustering. We compare its performance with existing methods in simulation studies and benchmark dataset analyses to demonstrate its practical utility.

1. Introduction

In data analysis and data mining, there are two fundamental types of methodologies: clustering and classification [1]. Clustering, which is categorized as an exploratory paradigm, detects the underlying structure behind the data and provides a rough picture before proceeding to more intensive and comprehensive data analysis [2,3]. On the other hand, classification predicts unknown class labels of test data based on models constructed from training data with known class labels. The latter is called supervised learning, while the former is called unsupervised learning in pattern recognition [4].
Clustering algorithms fall roughly into three categories: hierarchical, partitioning, and mixture model-based algorithms [5]. In hierarchical clustering, each observation is considered as one cluster in the initial setting; clusters are then merged recursively based on a similarity matrix defined beforehand, and the resultant clusters are expressed as a dendrogram. A partitioning algorithm starts with a fixed number of clusters and searches for all cluster centers simultaneously so as to minimize an objective function such as the sum of squared distances between the centers and the observations. A model-based algorithm assumes a mixture of probability distributions that generates the observations and assigns each observation to one of the mixture components. A Gaussian-mixture-based approach is widely used in this context.
In this paper, we propose a new clustering method, called Pareto clustering, in the framework of quasi-linear modeling [6,7,8]. It combines the cluster components by the Kolmogorov–Nagumo average [9] in a flexible way. We consider a generalized energy function as the objective function for estimating the cluster parameters, which extends the energy function proposed by [10]. The objective function is built from a survival function of the Pareto distribution, which is widely used in extreme value theory [11]. We investigate the consistency of the parameters, which leads to the underlying probability distribution associated with the generalized energy function. We find that k-means [12,13] and fuzzy c-means [14] have underlying probability distributions with singular points at the cluster centers. This fact shows a clear difference from model-based clustering such as Gaussian-mixture modeling. Moreover, we show that the quasi-linear modeling based on the Kolmogorov–Nagumo average connects k-means, fuzzy c-means, and Gaussian-mixture modeling through the hyperparameters of the generalized energy function. See [15,16] for a discussion of the relation between k-means and fuzzy c-means.
The paper is organized as follows. In Section 2, we introduce the generalized energy function as the objective function of Pareto clustering and discuss the consistency of the parameters. Moreover, we show that k-means, fuzzy c-means, and the Gaussian mixture are all derived from the generalized energy function as special cases. As a consequence, the parameters can be estimated in a unified manner by the minorize-maximization (MM) algorithm [17], for which the monotone decrease of the generalized energy function is guaranteed. In Section 3, we demonstrate the performance of Pareto clustering in simulation studies and on benchmark datasets and show its practical utility. Finally, we summarize the results of Pareto clustering and discuss extensions and applications in various scientific fields.

2. Materials and Methods

2.1. Generalized Energy Function

Let $T$ be a non-negative random variable with probability density function $f(t)$. The survival function of $T$ is defined as
$$ S(t) = P(T > t), \quad t \ge 0. $$
Then, for $d$-dimensional random variables $x_1, \ldots, x_n$, we define a generalized energy function to be minimized with respect to a parameter $\mu$ for clustering:
$$ L_S(\mu) = \frac{1}{\tau} \sum_{i=1}^{n} S^{-1}\!\left( \frac{1}{K} \sum_{k=1}^{K} S\!\left(\tau\, \|x_i - \mu_k\|^2\right) \right), $$
where $\mu = (\mu_1, \ldots, \mu_K)$ is the set of cluster centers and $\tau > 0$ is a shape parameter. If we take $S(t) = \exp(-t)$, the function reduces to the energy function proposed by [10], where $\tau$ can be interpreted as the temperature in physics. The formulation in (2) is called the Kolmogorov–Nagumo average [9,18] and is widely applied in bioinformatics, ecology, fisheries, etc. [6,8,19].
In Equation (2), the quantity $\frac{1}{K} \sum_{k=1}^{K} S(\tau \|x_i - \mu_k\|^2)$ expresses the average, over all $K$ clusters, of the probabilities that $x_i$ belongs to the $k$th cluster, where $\|x_i - \mu_k\|^2$ is the energy of $x_i$ associated with $\mu_k$. Hence $S^{-1}\!\left( \frac{1}{K} \sum_{k=1}^{K} S(\tau \|x_i - \mu_k\|^2) \right)$ is viewed as the Kolmogorov–Nagumo average of the energy of $x_i$, with a probabilistic meaning. In effect, the generalized energy function is the sum of these Kolmogorov–Nagumo averages over the observations $\{x_1, \ldots, x_n\}$.
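As an illustration, the following minimal R sketch (our own; the function and variable names are not from the paper) evaluates the generalized energy function in (2) for a user-supplied survival function `S` and its inverse `Sinv`:

```r
# Minimal sketch of the generalized energy function L_S(mu) in Equation (2).
# S and Sinv are the survival function and its inverse; X is an n x d data
# matrix; mu is a K x d matrix of cluster centers; tau > 0 is the shape parameter.
generalized_energy <- function(X, mu, tau, S, Sinv) {
  n <- nrow(X)
  total <- 0
  for (i in seq_len(n)) {
    # squared Euclidean distances from x_i to all K centers
    d2 <- rowSums(sweep(mu, 2, X[i, ])^2)
    # Kolmogorov-Nagumo average of the energies of x_i
    total <- total + Sinv(mean(S(tau * d2)))
  }
  total / tau
}

# Example with S(t) = exp(-t), which recovers the energy function of Rose et al. [10].
X  <- matrix(rnorm(200), ncol = 2)
mu <- matrix(c(0, 0, 2, 2), ncol = 2, byrow = TRUE)
generalized_energy(X, mu, tau = 0.5, S = function(t) exp(-t),
                   Sinv = function(u) -log(u))
```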
Remark 1.
The generalized energy function (2) is related to the Archimedean copula defined by
$$ 1 - S\!\left( \sum_{k=1}^{K} S^{-1}(1 - u_k) \right) $$
for $\{u_k\}_{k=1}^{K}$ in $(0, 1)$; cf. [20] for an introductory discussion. In essence, the generalized energy function maps a vector of $K$ cluster energy functions to an integrated energy function, whereas the Archimedean copula maps $K$ marginal cumulative distribution functions to the joint cumulative distribution function. In this way, the generalized energy function expresses an interactive relation among the cluster energy functions, analogous to the way the Archimedean copula expresses the dependence among variables.
We consider an estimator defined by the generalized energy function as
$$ \hat{\mu} = \operatorname*{argmin}_{\mu}\, L_S(\mu). $$
If we assume that $x_i$ ($i = 1, \ldots, n$) is distributed according to a probability density function $p(x, \mu^*)$, the expected generalized energy function is given by
$$ L_S(\mu) = \frac{1}{\tau} \int S^{-1}\!\left( \frac{1}{K} \sum_{k=1}^{K} S\!\left(\tau\, \|x - \mu_k\|^2\right) \right) p(x, \mu^*)\, dx. $$
Here we define a function for a set of cluster centers as
$$ E_{\mu}(x) = \frac{1}{K} \sum_{k=1}^{K} S\!\left(\tau\, \|x - \mu_k\|^2\right). $$
Thus, we note that
$$ \int E_{\mu}(x)\, dx = v_d\, E\!\left(T^{\frac{d}{2}}\right), $$
where $v_d = 2\pi^{d/2} / \{\tau^{d/2}\, d\, \Gamma(d/2)\}$ is a volume constant, using the fact that $E(T) = \int_0^{\infty} S(t)\, dt$ (Appendix A). This property is a key idea in the following discussion.
Lemma 1.
Assume that the survival function $S(t)$ in (1) is convex in $t$. We define a function $G$ of $(\mu^*, \mu)$ as
$$ G(\mu^*, \mu) = \int S^{-1}(E_{\mu}(x))\, f\!\left(S^{-1}(E_{\mu^*}(x))\right) dx. $$
Then, for any $\mu$ and $\mu^*$,
$$ G(\mu^*, \mu) \ge G(\mu^*, \mu^*), $$
with equality if and only if $\mu = \mu^*$.
Proof. 
We observe that $S^{-1}(t)$ is a decreasing function of $t$, with
$$ \frac{\partial S^{-1}(t)}{\partial t} = -\frac{1}{f(S^{-1}(t))}, $$
because $S(S^{-1}(t)) = t$ and $(\partial/\partial t)\, S(t) = -f(t)$. Similarly,
$$ \frac{\partial^2 S^{-1}(t)}{\partial t^2} = -\frac{f'(S^{-1}(t))}{\{f(S^{-1}(t))\}^3}, $$
which is positive for all $t \ge 0$ because $(\partial^2/\partial t^2)\, S(t) = -f'(t) > 0$ by the convexity assumption on $S(t)$. Therefore, $S^{-1}(t)$ is also convex in $t \in (0, 1)$. This leads to
$$ G(\mu^*, \mu) - G(\mu^*, \mu^*) = \int \left\{ S^{-1}(E_{\mu}(x)) - S^{-1}(E_{\mu^*}(x)) \right\} f\!\left(S^{-1}(E_{\mu^*}(x))\right) dx $$
$$ \ge \int \left\{ E_{\mu}(x) - E_{\mu^*}(x) \right\} \frac{\partial S^{-1}(t)}{\partial t}\Big|_{t = E_{\mu^*}(x)}\, f\!\left(S^{-1}(E_{\mu^*}(x))\right) dx $$
$$ = -\int \left\{ E_{\mu}(x) - E_{\mu^*}(x) \right\} dx $$
$$ = 0. $$
Here the equality in (13) holds if and only if $\mu = \mu^*$ by the convexity of $S^{-1}$. The equality in (14) follows from
$$ f(S^{-1}(t))\, \frac{\partial S^{-1}(t)}{\partial t} = -1 $$
for any $t \ge 0$, as seen in (10). The equality in (15) holds due to (7).    □
Theorem 1.
If $p(x, \mu^*)$ has the form
$$ p(x, \mu^*) = Z(\mu^*)\, f\!\left(S^{-1}(E_{\mu^*}(x))\right), $$
where $Z(\mu^*) > 0$ is a normalizing constant, then we have
$$ L_S(\mu) \ge L_S(\mu^*). $$
Proof. 
Note that
$$ L_S(\mu) - L_S(\mu^*) = \frac{Z(\mu^*)}{\tau} \left\{ G(\mu^*, \mu) - G(\mu^*, \mu^*) \right\}, $$
which concludes (18) from Lemma 1.    □
We note that $\hat{\mu}$ is asymptotically consistent for the true parameter $\mu^*$ if the probability density function has the form in (17).

2.1.1. Pareto Distribution

Let us consider a generalized Pareto distribution, whose survival function and its inverse are defined by
$$ S(t) = (1 + \beta t)^{-\frac{1}{\beta}} \quad \text{and} \quad S^{-1}(t) = \frac{t^{-\beta} - 1}{\beta}, $$
where $\beta > 0$ denotes the shape parameter. Then the generalized energy function is
$$ L_{\tau,\beta}(\mu) = \frac{1}{\tau\beta} \sum_{i=1}^{n} \left[ \left\{ \sum_{k=1}^{K} \frac{1}{K} \left( 1 + \tau\beta\, \|x_i - \mu_k\|^2 \right)^{-\frac{1}{\beta}} \right\}^{-\beta} - 1 \right]. $$
If we let $\beta \to 0$, then
$$ \lim_{\beta \to 0} L_{\tau,\beta}(\mu) = -\frac{1}{\tau} \sum_{i=1}^{n} \log\left\{ \sum_{k=1}^{K} \frac{1}{K} \exp\left( -\tau\, \|x_i - \mu_k\|^2 \right) \right\}, $$
which reduces to the energy function proposed by [10]. Hence, Rose’s clustering (maximum-entropy clustering) is generated by the survival function of an exponential distribution. Then we have
$$ \lim_{\tau \to \infty} L_{\tau,\beta}(\mu) = \lim_{\tau \to \infty} \frac{1}{\tau\beta} \sum_{i=1}^{n} \left\{ \sum_{k=1}^{K} \frac{1}{K} \left( 1 + \tau\beta\, \|x_i - \mu_k\|^2 \right)^{-\frac{1}{\beta}} \right\}^{-\beta} = \sum_{i=1}^{n} \left\{ \sum_{k=1}^{K} \frac{1}{K} \left( \|x_i - \mu_k\|^2 \right)^{-\frac{1}{\beta}} \right\}^{-\beta}. $$
The gradient with respect to $\mu_k$ is given by
$$ -2 \sum_{i=1}^{n} \frac{1}{K} \left( \|x_i - \mu_k\|^2 \right)^{-\frac{1}{\beta}-1} \left\{ \sum_{\ell=1}^{K} \frac{1}{K} \left( \|x_i - \mu_\ell\|^2 \right)^{-\frac{1}{\beta}} \right\}^{-(1+\beta)} (x_i - \mu_k), $$
which leads exactly to the estimating equations of fuzzy c-means if we take $\beta = m - 1$ [14]. Furthermore, we have
$$ \lim_{\tau \to \infty,\, \beta \to 0} L_{\tau,\beta}(\mu) = \sum_{i=1}^{n} \min_{1 \le k \le K} \|x_i - \mu_k\|^2, $$
which is the loss function of k-means. The corresponding survival function is $\lim_{\beta \to 0} (1 + \beta t)^{-\frac{1}{\beta}} = \exp(-t)$. Note that this loss function is directly derived from (2) as
$$ \lim_{\tau \to \infty} \frac{1}{\tau} \sum_{i=1}^{n} S^{-1}\!\left( \frac{1}{K} \sum_{k=1}^{K} S\!\left(\tau\, \|x_i - \mu_k\|^2\right) \right) = \sum_{i=1}^{n} \min_{1 \le k \le K} \|x_i - \mu_k\|^2. $$
In addition, we have
$$ \lim_{\tau \to 0} \frac{1}{\tau} \sum_{i=1}^{n} S^{-1}\!\left( \frac{1}{K} \sum_{k=1}^{K} S\!\left(\tau\, \|x_i - \mu_k\|^2\right) \right) = \sum_{i=1}^{n} \frac{1}{K} \sum_{k=1}^{K} \|x_i - \mu_k\|^2 $$
because $S(0) = 1$.
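As a concrete illustration of the Pareto case, the following R sketch (our own, not the authors' supplementary code; all names are hypothetical) implements $S$, $S^{-1}$ and $L_{\tau,\beta}(\mu)$, and checks numerically that as $\beta \to 0$ the objective is nearly equal to the maximum-entropy energy of Rose et al. [10]:

```r
# Pareto survival function and its inverse for shape parameter beta > 0
# (a minimal sketch; names are ours, not from the paper's code).
S_pareto    <- function(t, beta) (1 + beta * t)^(-1 / beta)
Sinv_pareto <- function(u, beta) (u^(-beta) - 1) / beta

# Generalized energy function L_{tau,beta}(mu) for an n x d data matrix X
# and a K x d matrix of centers mu.
pareto_energy <- function(X, mu, tau, beta) {
  vals <- apply(X, 1, function(x) {
    d2 <- rowSums(sweep(mu, 2, x)^2)          # squared distances to the K centers
    Sinv_pareto(mean(S_pareto(tau * d2, beta)), beta)
  })
  sum(vals) / tau
}

# For beta close to 0 the objective is close to the maximum-entropy energy
# -(1/tau) * sum_i log( (1/K) * sum_k exp(-tau * ||x_i - mu_k||^2) ).
set.seed(1)
X  <- matrix(rnorm(100), ncol = 2)
mu <- matrix(c(-1, -1, 1, 1), ncol = 2, byrow = TRUE)
rose_energy <- -sum(apply(X, 1, function(x) {
  log(mean(exp(-0.5 * rowSums(sweep(mu, 2, x)^2))))
})) / 0.5
c(pareto = pareto_energy(X, mu, tau = 0.5, beta = 1e-6), rose = rose_energy)
```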

2.1.2. Fréchet Distribution

Next, we consider the Fréchet distribution with the survival function defined as
$$ S(t) = 1 - \exp\left(-t^{\gamma}\right), $$
where $\gamma < 0$ is the shape parameter. The generalized energy function is given by
$$ L_{\gamma,\tau}(\mu) = \sum_{i=1}^{n} \left[ -\frac{1}{\tau^{\gamma}} \log\left\{ \frac{1}{K} \sum_{k=1}^{K} \exp\left( -\tau^{\gamma}\, \|x_i - \mu_k\|^{2\gamma} \right) \right\} \right]^{\frac{1}{\gamma}}. $$
We find that
$$ \lim_{\tau \to 0} L_{\gamma,\tau}(\mu) = \sum_{i=1}^{n} \lim_{\tau \to 0} \left[ -\frac{1}{\tau^{\gamma}} \log\left\{ \frac{1}{K} \sum_{k=1}^{K} \exp\left( -\tau^{\gamma}\, \|x_i - \mu_k\|^{2\gamma} \right) \right\} \right]^{\frac{1}{\gamma}} = \sum_{i=1}^{n} \left\{ \min_{1 \le k \le K} \|x_i - \mu_k\|^{2\gamma} \right\}^{\frac{1}{\gamma}} = \sum_{i=1}^{n} \min_{1 \le k \le K} \|x_i - \mu_k\|^2. $$
Hence, this energy function is reduced to that of the k-means algorithm, as shown in the Pareto distribution case. The estimating equation is given by
$$ \frac{\partial}{\partial \mu_k} L_{\gamma,\tau}(\mu) = \sum_{i=1}^{n} \omega_k(x_i, \tau, \gamma)\, (\mu_k - x_i) = 0, $$
where
$$ \omega_k(x_i, \tau, \gamma) = \left[ -\frac{1}{\tau^{\gamma}} \log\left\{ \frac{1}{K} \sum_{\ell=1}^{K} \exp\left(-\tau^{\gamma}\, \|x_i - \mu_\ell\|^{2\gamma}\right) \right\} \right]^{\frac{1}{\gamma}-1} \times \frac{\exp\left(-\tau^{\gamma}\, \|x_i - \mu_k\|^{2\gamma}\right)}{\sum_{\ell=1}^{K} \exp\left(-\tau^{\gamma}\, \|x_i - \mu_\ell\|^{2\gamma}\right)}\, \|x_i - \mu_k\|^{2\gamma - 2}. $$
When we assume the unbiasedness of the estimating function in (30), that is,
$$ E\left\{ \omega_k(X, \tau, \gamma)\, (\mu_k - X) \right\} = 0, $$
the underlying distribution has a density function proportional to
$$ \left\{ -\frac{1}{\tau^{\gamma}} \log M(x, \mu) \right\}^{\frac{\gamma-1}{\gamma}} M(x, \mu), $$
where
$$ M(x, \mu) = \frac{1}{K} \sum_{\ell=1}^{K} \exp\left(-\tau^{\gamma}\, \|x - \mu_\ell\|^{2\gamma}\right). $$
We confirm that
$$ \omega_k(x_i, \tau, \gamma) = \begin{cases} 1 & \text{if } \|x_i - \mu_k\|^2 = \min_{1 \le \ell \le K} \|x_i - \mu_\ell\|^2 \\ 0 & \text{otherwise} \end{cases} $$
as $\tau$ goes to 0. Then we consider the limit $\tau \to \infty$, which provides
$$ \lim_{\tau \to \infty} L_{\gamma,\tau}(\mu) = \sum_{i=1}^{n} \lim_{\tau \to \infty} \left[ -\frac{1}{\tau^{\gamma}} \log\left\{ \frac{1}{K} \sum_{k=1}^{K} \exp\left(-\tau^{\gamma}\, \|x_i - \mu_k\|^{2\gamma}\right) \right\} \right]^{\frac{1}{\gamma}} = \sum_{i=1}^{n} \left\{ \frac{1}{K} \sum_{k=1}^{K} \|x_i - \mu_k\|^{2\gamma} \right\}^{\frac{1}{\gamma}}, $$
which is equal to (22). This also leads to fuzzy c-means if we take $\gamma = 1/(1 - m)$ [14].
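The same kind of check can be sketched for the Fréchet case. The code below is our own illustration (γ = −1 and the grid of data are chosen purely for the example) and confirms numerically that the objective approaches the stated power-mean form as τ grows:

```r
# Frechet-type survival function S(t) = 1 - exp(-t^gamma), gamma < 0,
# and the corresponding generalized energy (a sketch; names are ours).
frechet_energy <- function(X, mu, tau, gamma) {
  vals <- apply(X, 1, function(x) {
    d2 <- rowSums(sweep(mu, 2, x)^2)
    A  <- -log(mean(exp(-tau^gamma * d2^gamma))) / tau^gamma
    A^(1 / gamma)
  })
  sum(vals)
}

# For large tau the objective approaches the power mean
# sum_i { (1/K) sum_k ||x_i - mu_k||^(2*gamma) }^(1/gamma).
set.seed(2)
X  <- matrix(rnorm(60), ncol = 2)
mu <- matrix(c(-1, 0, 1, 0), ncol = 2, byrow = TRUE)
power_mean <- sum(apply(X, 1, function(x) {
  mean(rowSums(sweep(mu, 2, x)^2)^(-1))^(-1)   # gamma = -1
}))
c(frechet = frechet_energy(X, mu, tau = 1e4, gamma = -1), limit = power_mean)
```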

2.2. Estimation of Variances and Mixing Proportions in Clusters

Instead of the Euclidean distance $\|x_i - \mu_k\|^2$, we consider the Mahalanobis-type distance $\|x_i - \mu_k\|_{\Sigma_k^{-1}}^2 = (x_i - \mu_k)^{\top} \Sigma_k^{-1} (x_i - \mu_k)$ to incorporate the variance structure around $\mu_k$. Bezdek et al. [14] considered a common variance structure $\Sigma_k = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top}$, where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, for $k = 1, \ldots, K$. On the other hand, we estimate a distinct $\Sigma_k$ for each $\mu_k$.
For this purpose, we modify the generalized energy function in (2) to allow for variances $\Sigma_1, \ldots, \Sigma_K$ and mixing proportions $\pi_1, \ldots, \pi_K$ ($\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k \ge 0$ for $k = 1, \ldots, K$) as
$$ L_S(\theta) = \frac{1}{\tau} \sum_{i=1}^{n} S^{-1}\!\left( \sum_{k=1}^{K} \pi_k\, |\Sigma_k|^{-\frac{1}{2}}\, S\!\left(\tau\, \|x_i - \mu_k\|_{\Sigma_k^{-1}}^2\right) \right), $$
where $\theta = (\mu_k, \Sigma_k, \pi_k)_{k=1}^{K}$. We assume that $S(t)$ is convex so that the domain of $S^{-1}(t)$ can be extended from $[0, 1]$ to $[0, \infty)$ to accommodate the factor $|\Sigma_k|^{-\frac{1}{2}}$. The estimator based on this modified generalized energy function is given by
$$ \hat{\theta} = \operatorname*{argmin}_{\theta}\, L_S(\theta). $$
The expected generalized energy function is given by
$$ L_S(\theta) = \frac{1}{\tau} \int S^{-1}\!\left( \sum_{k=1}^{K} \pi_k\, |\Sigma_k|^{-\frac{1}{2}}\, S\!\left(\tau\, \|x - \mu_k\|_{\Sigma_k^{-1}}^2\right) \right) p(x, \theta^*)\, dx, $$
where $p(x, \theta^*)$ is the underlying probability density function.
For a cumulative distribution function $F(t) = 1 - S(t)$, we have $L_S(\theta) = L_F(\theta)$ if and only if $\sum_{k=1}^{K} \pi_k |\Sigma_k|^{-1/2} = 1$. On the other hand, it always holds that $L_S(\mu) = L_F(\mu)$ for the original generalized energy function in (2).
Similarly to (6), we define
$$ E_{\theta}(x) = \sum_{k=1}^{K} \pi_k\, |\Sigma_k|^{-\frac{1}{2}}\, S\!\left(\tau\, \|x - \mu_k\|_{\Sigma_k^{-1}}^2\right), $$
and we notice that $\int E_{\theta}(x)\, dx$ is also independent of $\mu$:
$$ \int E_{\theta}(x)\, dx = v_d\, E\!\left(T^{\frac{d}{2}}\right). $$
Lemma 2.
Assume that the survival function $S(t)$ in (1) is convex in $t$. We define a function $G$ of $(\theta^*, \theta)$ as
$$ G(\theta^*, \theta) = \int S^{-1}(E_{\theta}(x))\, f\!\left(S^{-1}(E_{\theta^*}(x))\right) dx. $$
Then, for any $\theta$ and $\theta^*$,
$$ G(\theta^*, \theta) \ge G(\theta^*, \theta^*), $$
with equality if and only if $\theta = \theta^*$.
Proof. 
It is obvious from Lemma 1 and the fact that $\int E_{\theta}(x)\, dx$ is independent of $\theta$.    □
From Lemma 2, we can easily show the following theorem regarding $L_S(\theta)$.
Theorem 2.
If $p(x, \theta^*)$ has the form
$$ p(x, \theta^*) = Z(\theta^*)\, f\!\left(S^{-1}(E_{\theta^*}(x))\right), $$
where $Z(\theta^*) > 0$ is a normalizing constant, then we have
$$ L_S(\theta) \ge L_S(\theta^*). $$
For the Pareto distribution, we have from (37)
$$ L_{\tau,\beta}(\theta) = \frac{1}{\tau\beta} \sum_{i=1}^{n} \left[ \left\{ \sum_{k=1}^{K} \pi_k\, |\Sigma_k|^{-\frac{1}{2}} \left( 1 + \tau\beta\, \|x_i - \mu_k\|_{\Sigma_k^{-1}}^2 \right)^{-\frac{1}{\beta}} \right\}^{-\beta} - 1 \right] = \frac{1}{\tau} \sum_{i=1}^{n} \phi\!\left( \sum_{k=1}^{K} \pi_k\, w(x_i, \mu_k, \Sigma_k) \right), $$
where
$$ w(x_i, \mu_k, \Sigma_k) = |\Sigma_k|^{-\frac{1}{2}} \left\{ 1 + \tau\beta\, \|x_i - \mu_k\|_{\Sigma_k^{-1}}^2 \right\}^{-\frac{1}{\beta}}, \qquad \phi(t) = \frac{t^{-\beta} - 1}{\beta}. $$
From (44), the underlying probability density function is
$$ p_{\tau,\beta}(x, \theta^*) = Z_{\tau,\beta}(\theta^*) \left\{ \sum_{k=1}^{K} \pi_k^*\, w(x, \mu_k^*, \Sigma_k^*) \right\}^{1+\beta}, $$
where $Z_{\tau,\beta}(\theta^*)$ is a normalizing constant. When $\beta \to 0$, we have
$$ \lim_{\beta \to 0} L_{\tau,\beta}(\theta) = -\frac{1}{\tau} \sum_{i=1}^{n} \log\left\{ \sum_{k=1}^{K} \pi_k\, |\Sigma_k|^{-\frac{1}{2}} \exp\left(-\tau\, \|x_i - \mu_k\|_{\Sigma_k^{-1}}^2\right) \right\}, $$
which is the negative log-likelihood function of the normal mixture distribution, apart from a constant term $(2\pi)^{d/2}$, when $\tau = 1/2$.
Similarly, we obtain the estimating equations of fuzzy c-means allowing for the Mahalanobis distance when $\tau \to \infty$. Moreover, we obtain k-means with the Mahalanobis distance when $\tau \to \infty$ and $\beta \to 0$. For the other extreme cases, we observe that both $\lim_{\beta \to \infty} L_{\tau,\beta}(\theta)$ and $\lim_{\tau \to 0} L_{\tau,\beta}(\theta)$ diverge or converge to 0 depending on the values of $\pi_k$ and $\Sigma_k$ ($k = 1, \ldots, K$). Hence we choose large values of $\tau$ and small values of $\beta$ in the subsequent data analysis.

2.3. Estimating Algorithm

The direct optimization of $L_{\tau,\beta}(\theta)$ in (46) is difficult due to the mixture structure. Thus, we employ the ideas of the expectation-maximization (EM) algorithm [21] and the minorize-maximization (MM) algorithm [17], similarly to [19]. Our proposed clustering method (Pareto clustering) is given as Algorithm 1.
Algorithm 1: Pareto clustering
1. Set initial values $(\mu_k^{(0)}, \Sigma_k^{(0)}, \pi_k^{(0)})$ for $k = 1, \ldots, K$.
2. Repeat the following steps for $t = 0, \ldots, T-1$ and $k = 1, \ldots, K$ until convergence.
3.
$$ q_k^{(t)}(x_i) = \frac{\pi_k^{(t)}\, w(x_i, \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{\ell=1}^{K} \pi_\ell^{(t)}\, w(x_i, \mu_\ell^{(t)}, \Sigma_\ell^{(t)})} $$
$$ \mu_k^{(t+1)} = \frac{\sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}\, x_i}{\sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}} $$
$$ \Sigma_k^{(t+1)} = \tau (2 - d\beta)\, \frac{\sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}\, (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^{\top}}{\sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}} $$
$$ \pi_k^{(t+1)} = \frac{\left[ \sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}\, w(x_i, \mu_k^{(t+1)}, \Sigma_k^{(t+1)})^{-\beta} \right]^{\frac{1}{1+\beta}}}{\sum_{\ell=1}^{K} \left[ \sum_{i=1}^{n} \{q_\ell^{(t)}(x_i)\}^{1+\beta}\, w(x_i, \mu_\ell^{(t+1)}, \Sigma_\ell^{(t+1)})^{-\beta} \right]^{\frac{1}{1+\beta}}} $$
4. Output $(\hat{\mu}_k, \hat{\Sigma}_k, \hat{\pi}_k) = (\mu_k^{(T)}, \Sigma_k^{(T)}, \pi_k^{(T)})$ for $k = 1, \ldots, K$.
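For readers who want to experiment, the following R sketch implements the MM updates of Algorithm 1. It is our own minimal reimplementation (function and variable names are ours, not the authors' supplementary code): it initializes from hierarchical clustering as described in the text just below, uses the overall covariance as the initial $\Sigma_k^{(0)}$, and omits the ridge regularization of Remark 5.

```r
# A compact R sketch of Algorithm 1 (Pareto clustering).
# X: n x d data matrix; K: number of clusters; tau, beta: tuning parameters.
pareto_clustering <- function(X, K, tau, beta, max_iter = 100, tol = 1e-6) {
  n <- nrow(X); d <- ncol(X)
  stopifnot(d * beta < 2)          # the covariance update needs d*beta < 2 (Appendix B)
  # initial values from hierarchical clustering
  lab <- cutree(hclust(dist(X)), K)
  mu  <- t(sapply(1:K, function(k) colMeans(X[lab == k, , drop = FALSE])))
  Sig <- rep(list(cov(X)), K)
  pi_ <- as.numeric(table(factor(lab, levels = 1:K))) / n

  # pi_k * w(x_i, mu_k, Sigma_k), with w = |Sigma_k|^(-1/2) {1 + tau*beta*maha}^(-1/beta)
  pw_mat <- function(mu, Sig, pi_) {
    sapply(1:K, function(k) {
      maha <- mahalanobis(X, mu[k, ], Sig[[k]])
      pi_[k] * det(Sig[[k]])^(-1/2) * (1 + tau * beta * maha)^(-1/beta)
    })
  }

  for (iter in seq_len(max_iter)) {
    q  <- pw_mat(mu, Sig, pi_)
    q  <- q / rowSums(q)                          # memberships q_k(x_i)
    qb <- q^(1 + beta)
    mu_new  <- t(sapply(1:K, function(k) colSums(qb[, k] * X) / sum(qb[, k])))
    Sig_new <- lapply(1:K, function(k) {
      Xc <- sweep(X, 2, mu_new[k, ])
      tau * (2 - d * beta) * crossprod(Xc * qb[, k], Xc) / sum(qb[, k])
    })
    # unnormalized pi_k, using w^(-beta) = |Sigma_k|^(beta/2) * (1 + tau*beta*maha)
    un <- sapply(1:K, function(k) {
      maha <- mahalanobis(X, mu_new[k, ], Sig_new[[k]])
      sum(qb[, k] * det(Sig_new[[k]])^(beta / 2) * (1 + tau * beta * maha))^(1 / (1 + beta))
    })
    pi_new <- un / sum(un)
    done <- max(abs(mu_new - mu)) < tol
    mu <- mu_new; Sig <- Sig_new; pi_ <- pi_new
    if (done) break
  }
  q <- pw_mat(mu, Sig, pi_); q <- q / rowSums(q)
  list(mu = mu, Sigma = Sig, pi = pi_, cluster = max.col(q))
}
```

A call such as `pareto_clustering(X, K = 3, tau = 0.5, beta = 0.5)` returns the estimated centers, covariances, mixing proportions, and the hard cluster labels obtained from the final memberships $q_k^{(T)}(x_i)$.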
The initial values $(\mu_k^{(0)}, \Sigma_k^{(0)}, \pi_k^{(0)})$ are determined by hierarchical clustering, in a similar way to the algorithm of [22]. The derivation of the estimating algorithm is as follows. First, we have
$$ L_{\tau,\beta}(\theta) = \frac{1}{\tau} \sum_{i=1}^{n} \phi\!\left( \sum_{k=1}^{K} \pi_k\, w(x_i, \mu_k, \Sigma_k) \right) = \frac{1}{\tau} \sum_{i=1}^{n} \phi\!\left( \sum_{k=1}^{K} q_k(x_i)\, \frac{\pi_k\, w(x_i, \mu_k, \Sigma_k)}{q_k(x_i)} \right) \le \frac{1}{\tau} \sum_{i=1}^{n} \sum_{k=1}^{K} q_k(x_i)\, \phi\!\left( \frac{\pi_k\, w(x_i, \mu_k, \Sigma_k)}{q_k(x_i)} \right), $$
where $q_k(x_i)$ is a positive weight such that $\sum_{k=1}^{K} q_k(x_i) = 1$ and $\phi(t)$ is the convex function defined in (49). The equality holds if and only if
$$ \frac{\pi_1\, w(x_i, \mu_1, \Sigma_1)}{q_1(x_i)} = \cdots = \frac{\pi_K\, w(x_i, \mu_K, \Sigma_K)}{q_K(x_i)}, $$
which is equivalent to
$$ q_k(x_i) = \frac{\pi_k\, w(x_i, \mu_k, \Sigma_k)}{\sum_{\ell=1}^{K} \pi_\ell\, w(x_i, \mu_\ell, \Sigma_\ell)}, \quad (k = 1, \ldots, K). $$
Based on $q_k^{(t)}(x_i)$ in (52), we define
$$ Q(\theta \mid \theta^{(t)}) = \frac{1}{\tau} \sum_{i=1}^{n} \sum_{k=1}^{K} q_k^{(t)}(x_i)\, \frac{\left\{ \pi_k\, w(x_i, \mu_k, \Sigma_k) / q_k^{(t)}(x_i) \right\}^{-\beta} - 1}{\beta} $$
$$ = \frac{1}{\tau\beta} \sum_{i=1}^{n} \sum_{k=1}^{K} \left[ \{q_k^{(t)}(x_i)\}^{1+\beta}\, \pi_k^{-\beta}\, |\Sigma_k|^{\frac{\beta}{2}} \left\{ 1 + \tau\beta\, (x_i - \mu_k)^{\top} \Sigma_k^{-1} (x_i - \mu_k) \right\} - q_k^{(t)}(x_i) \right] $$
$$ = \frac{1}{\tau\beta} \sum_{i=1}^{n} \sum_{k=1}^{K} \left[ \{q_k^{(t)}(x_i)\}^{1+\beta}\, \pi_k^{-\beta} \left\{ |V_k^{-1}|^{-\frac{\beta}{2 - d\beta}} + \tau\beta\, (x_i - \mu_k)^{\top} V_k^{-1} (x_i - \mu_k) \right\} - q_k^{(t)}(x_i) \right], $$
where $V_k^{-1} = |\Sigma_k|^{\frac{\beta}{2}}\, \Sigma_k^{-1}$. Then we have
$$ \frac{\partial}{\partial \mu_k} Q(\theta \mid \theta^{(t)}) = -2\, \pi_k^{-\beta}\, V_k^{-1} \sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta} (x_i - \mu_k) = 0, $$
which means that
$$ \mu_k^{(t+1)} = \frac{\sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}\, x_i}{\sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}}. $$
Similarly, we have
$$ \frac{\partial}{\partial V_k^{-1}} Q(\theta \mid \theta^{(t)}) \Big|_{\mu_k = \mu_k^{(t+1)}} = \frac{1}{\tau\beta} \sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}\, \pi_k^{-\beta} \left\{ -\frac{\beta}{2 - d\beta}\, |V_k^{-1}|^{-\frac{\beta}{2 - d\beta}}\, V_k + \tau\beta\, (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^{\top} \right\} $$
$$ = \frac{1}{\tau\beta} \sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}\, \pi_k^{-\beta} \left\{ -\frac{\beta}{2 - d\beta}\, \Sigma_k + \tau\beta\, (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^{\top} \right\} = 0, $$
which means that
$$ \Sigma_k^{(t+1)} = \tau (2 - d\beta)\, \frac{\sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}\, (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^{\top}}{\sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}}. $$
Next we consider
$$ R(\theta \mid \theta^{(t)}) = Q(\theta \mid \theta^{(t)}) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right), $$
where $\lambda$ is a Lagrange multiplier. Then
$$ \frac{\partial}{\partial \pi_k} R(\theta \mid \theta^{(t)}) \Big|_{\mu_k = \mu_k^{(t+1)},\, \Sigma_k = \Sigma_k^{(t+1)}} = -\frac{1}{\tau}\, \pi_k^{-\beta-1} \sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}\, w(x_i, \mu_k^{(t+1)}, \Sigma_k^{(t+1)})^{-\beta} + \lambda = 0, $$
which means
$$ \pi_k^{(t+1)} = \frac{\left[ \sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}\, w(x_i, \mu_k^{(t+1)}, \Sigma_k^{(t+1)})^{-\beta} \right]^{\frac{1}{1+\beta}}}{\sum_{\ell=1}^{K} \left[ \sum_{i=1}^{n} \{q_\ell^{(t)}(x_i)\}^{1+\beta}\, w(x_i, \mu_\ell^{(t+1)}, \Sigma_\ell^{(t+1)})^{-\beta} \right]^{\frac{1}{1+\beta}}}. $$
Remark 2.
The generalized energy function in (46) decreases monotonically along the estimating algorithm. That is, we have
$$ L_{\tau,\beta}(\theta^{(t+1)}) \le Q(\theta^{(t+1)} \mid \theta^{(t)}) \le Q(\theta^{(t)} \mid \theta^{(t)}) = L_{\tau,\beta}(\theta^{(t)}). $$
See Appendix B for more details.
Remark 3.
The estimating algorithm of fuzzy c-means by [14] is given as
$$ u_{ik}^{(t)} = \left[ \sum_{\ell=1}^{K} \left( \frac{\|x_i - \mu_k^{(t)}\|^2}{\|x_i - \mu_\ell^{(t)}\|^2} \right)^{\frac{1}{m-1}} \right]^{-m}, \qquad \mu_k^{(t)} = \frac{\sum_{i=1}^{n} u_{ik}^{(t)}\, x_i}{\sum_{i=1}^{n} u_{ik}^{(t)}}, $$
where $\{u_{ik}^{(t)}\}^{1/m}$ is called the membership function of $x_i$ in cluster $k$ at iteration step $t$. These are special cases of (52) and (53) with $\tau \to \infty$, $\Sigma_k = I$, $\pi_k = 1/K$ and $\beta = m - 1$. Hence we observe that the original algorithm of fuzzy c-means can be interpreted as an EM algorithm.
Remark 4.
In analogy with the membership function of fuzzy c-means by [14], we regard $q_k^{(t)}(x_i)$ in (52) as the membership of $x_i$ in cluster $k$ at iteration step $t$ in Pareto clustering. Hence we estimate cluster $C_k$ as
$$ C_k = \left\{ x_i \;:\; q_k^{(T)}(x_i) \ge q_\ell^{(T)}(x_i),\ \ell = 1, \ldots, K,\ i = 1, \ldots, n \right\}, $$
where $\bigcup_{k=1}^{K} C_k = \{x_1, \ldots, x_n\}$.
Remark 5.
In a high-dimensional setting ($p \gg 1$), we consider the ridge regularization of $\Sigma_k^{(t+1)}$ as in [23]:
$$ \Sigma_k^{(t+1)}(\alpha) = \alpha\, \Sigma_k^{(t+1)} + (1 - \alpha)\, \hat{\sigma}_k^2\, I, $$
where $\alpha = 0.95$ and $\hat{\sigma}_k^2$ is a scalar variance, estimated as the maximum of the diagonal elements of $\Sigma_k^{(t+1)}$. Moreover, we take $\beta \le 1$ to keep $\Sigma_k^{(t+1)}$ positive definite.
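This regularization is a one-liner in R; the sketch below (our own helper, with $\alpha = 0.95$ as in the text) shows the shrinkage step:

```r
# Ridge regularization of an estimated covariance matrix (Remark 5).
# sigma2 is the largest diagonal element, used as the scalar variance.
regularize_sigma <- function(Sigma, alpha = 0.95) {
  sigma2 <- max(diag(Sigma))
  alpha * Sigma + (1 - alpha) * sigma2 * diag(nrow(Sigma))
}
```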

2.4. Evaluation of Clustering Methods

We compare the performance of k-means, fuzzy c-means, Gaussian mixture modeling (Gaussian), partitioning around medoids (PAM), and Pareto clustering. To implement these methods, we use the kmeans function in the stats package [24], the cmeans function in the e1071 package [14], the Mclust function in the mclust package [22] and the pam function in the cluster package [25] in the statistical software R, with the default settings for each function. In Pareto clustering, $\tau = 0.5$ and $\beta = 1$ are used as the default settings. We assume that the number of clusters $K$ is known and compare the performance as in [26].
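For reference, the comparison can be reproduced along the following lines. This is a hedged sketch: the iris columns stand in for a generic data matrix, `pareto_clustering` is the illustrative helper from Section 2.3 rather than the authors' released code, and $\beta = 0.5$ is used in that call only because our sketch requires $d\beta < 2$ for the covariance update.

```r
# Illustrative calls with default settings for the packaged methods.
library(e1071); library(mclust); library(cluster)
X <- as.matrix(iris[, 1:2]); K <- 3
fit_km  <- kmeans(X, centers = K)              # k-means (stats)
fit_fcm <- cmeans(X, centers = K)              # fuzzy c-means (e1071)
fit_gmm <- Mclust(X, G = K)                    # Gaussian mixture (mclust)
fit_pam <- pam(X, k = K)                       # partitioning around medoids (cluster)
fit_par <- pareto_clustering(X, K, tau = 0.5, beta = 0.5)  # sketch from Section 2.3
```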

2.4.1. Metrics

Cluster $C_k$ ($k = 1, \ldots, K$) estimated by a clustering method is evaluated against a predefined set of reference classes $D_\ell$ ($\ell = 1, \ldots, L$) as
$$ \mathrm{Precision}(C_k, D_\ell) = \frac{|C_k \cap D_\ell|}{|C_k|}, \qquad \mathrm{Recall}(C_k, D_\ell) = \frac{|C_k \cap D_\ell|}{|D_\ell|}, $$
where $\mathrm{Recall}(C_k, D_\ell) = \mathrm{Precision}(D_\ell, C_k)$. $\mathrm{Precision}(C_k, D_\ell)$ is the proportion of data points in cluster $C_k$ that belong to class $\ell$; hence $\max_{\ell} \mathrm{Precision}(C_k, D_\ell)$ represents the purity of cluster $C_k$ with respect to the classes. Taking the weighted average, we have
$$ \mathrm{Purity} = \sum_{k=1}^{K} \frac{|C_k|}{n}\, \max_{\ell}\, \mathrm{Precision}(C_k, D_\ell), $$
where $n$ is the sample size. $\mathrm{Recall}(C_k, D_\ell)$ is the proportion of data points in class $D_\ell$ that are assigned to cluster $C_k$. Precision and recall correspond to the positive predictive value and sensitivity, respectively [27].
A metric combining precision and recall was proposed by [28]:
$$ \text{F-value} = \sum_{k=1}^{K} \frac{|D_k|}{n}\, \max_{\ell}\, F(D_k, C_\ell), $$
where
$$ F(D_k, C_\ell) = \frac{2 \times \mathrm{Recall}(D_k, C_\ell) \times \mathrm{Precision}(D_k, C_\ell)}{\mathrm{Recall}(D_k, C_\ell) + \mathrm{Precision}(D_k, C_\ell)}, $$
which is the harmonic mean of $\mathrm{Precision}(D_k, C_\ell)$ and $\mathrm{Recall}(D_k, C_\ell)$ and is called the F-measure [29].
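Both metrics are straightforward to compute from a cross-tabulation of estimated and reference labels; the sketch below (our own helper, not from the paper) follows the definitions above:

```r
# Purity and F-value for a vector of estimated cluster labels and a vector of
# reference class labels.
purity_fvalue <- function(cluster, class) {
  n   <- length(class)
  tab <- table(cluster, class)                    # |C_k intersect D_l|
  precision <- tab / rowSums(tab)                 # Precision(C_k, D_l)
  recall    <- sweep(tab, 2, colSums(tab), "/")   # Recall(C_k, D_l)
  purity <- sum(rowSums(tab) / n * apply(precision, 1, max))
  # F(D_l, C_k): harmonic mean of precision and recall, maximized over clusters
  f <- 2 * precision * recall / (precision + recall)
  f[is.nan(f)] <- 0
  fvalue <- sum(colSums(tab) / n * apply(f, 2, max))
  c(Purity = purity, F.value = fvalue)
}
```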
The cluster-level similarity between the estimated centers $\hat{\mu} = (\hat{\mu}_1, \ldots, \hat{\mu}_K)$ and the reference (ground-truth) centers $\mu^* = (\mu_1^*, \ldots, \mu_K^*)$ is measured by the centroid index (CI) proposed by [26],
$$ \mathrm{CI}(\hat{\mu}, \mu^*) = \max\left( \mathrm{CI}_1(\hat{\mu}, \mu^*),\ \mathrm{CI}_1(\mu^*, \hat{\mu}) \right), $$
where the one-sided index is
$$ \mathrm{CI}_1(\hat{\mu}, \mu^*) = \sum_{k=1}^{K} \mathrm{orphan}(\mu_k^*), \qquad \mathrm{orphan}(\mu_k^*) = \begin{cases} 1 & \text{if } q_\ell \ne k \text{ for all } \ell \\ 0 & \text{otherwise} \end{cases}, \qquad q_\ell = \operatorname*{argmin}_{1 \le k \le K} \|\hat{\mu}_\ell - \mu_k^*\|^2. $$
Here, $q_\ell$ is the index of the reference center in $\mu^*$ nearest to $\hat{\mu}_\ell$, and $\mathrm{orphan}(\mu_k^*)$ indicates whether $\mu_k^*$ is an isolated element (orphan), that is, not the nearest reference center for any element of $\hat{\mu}$. Hence $\mathrm{CI}_1(\hat{\mu}, \mu^*)$ indicates the dissimilarity between $\hat{\mu}$ and $\mu^*$. Due to the asymmetry of $\mathrm{CI}_1(\hat{\mu}, \mu^*)$ with respect to $\hat{\mu}$ and $\mu^*$, we take the maximum of $\mathrm{CI}_1(\hat{\mu}, \mu^*)$ and $\mathrm{CI}_1(\mu^*, \hat{\mu})$. Hence $\mathrm{CI}(\hat{\mu}, \mu^*)$ measures how many clusters are differently located between $\hat{\mu}$ and $\mu^*$.
Another metric measuring the similarity between $\hat{\mu}$ and $\mu^*$ is the mean squared error (MSE) over the $K$ clusters,
$$ \mathrm{MSE} = \frac{1}{K} \sum_{k=1}^{K} \|\hat{\mu}_k - \mu_k^*\|^2. $$
Differently from Purity and the F-value, MSE can be calculated from the estimated and reference centers $\hat{\mu}$ and $\mu^*$ alone. This property is useful in situations where the reference class sets $D_1, \ldots, D_K$ are difficult to determine but $\mu^*$ is easily identified. We use MSE in the simulation studies to evaluate the accuracy of $\hat{\mu}$, and Purity and the F-value in the analysis of the benchmark datasets.
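The center-based metrics admit an equally short implementation. The sketch below is again our own; note that the MSE line assumes the estimated centers have already been matched to the reference ordering.

```r
# Centroid index (CI) and MSE between estimated centers mu_hat and reference
# centers mu_star, both given as K x d matrices.
ci_one_sided <- function(mu_from, mu_to) {
  # for each row of mu_from, find the nearest row of mu_to ...
  nearest <- apply(mu_from, 1, function(m) which.min(rowSums(sweep(mu_to, 2, m)^2)))
  # ... and count rows of mu_to that are nobody's nearest neighbor (orphans)
  sum(!(seq_len(nrow(mu_to)) %in% nearest))
}
centroid_index <- function(mu_hat, mu_star) {
  max(ci_one_sided(mu_hat, mu_star), ci_one_sided(mu_star, mu_hat))
}
mse_centers <- function(mu_hat, mu_star) {
  mean(rowSums((mu_hat - mu_star)^2))   # assumes rows are matched by cluster
}
```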

2.4.2. Simulation Studies

We generate samples according to the density function $p_{\tau,\beta}(x, \theta^*)$ in (50) using the Metropolis–Hastings algorithm [30,31],
$$ x \sim Z_{\tau,\beta}(\theta^*) \left\{ \sum_{k=1}^{K} \pi_k^*\, w(x, \mu_k^*, \Sigma_k^*) \right\}^{1+\beta}, $$
where $\mu_1^* = (0, 0)^{\top}$, $\mu_2^* = (5, 5)^{\top}$, $\mu_3^* = (-5, -5)^{\top}$, $\pi_1^* = 0.5$, $\pi_2^* = 0.2$, $\pi_3^* = 0.3$ and
$$ \Sigma_1^* = \begin{pmatrix} 2 & 0.5 \\ 0.5 & 1 \end{pmatrix}, \quad \Sigma_2^* = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \Sigma_3^* = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. $$
Figure 1 illustrates the perspective plots and contour plots for $(\tau, \beta) = (0.5, 1)$, $(0.5, 0)$ and $(10, 1)$. The shape of $p_{\tau,\beta}(x, \theta^*)$ varies according to the values of $\tau$ and $\beta$. The Gaussian mixture distribution corresponds to $\tau = 0.5$ and $\beta = 0$ in panel (b). When $\beta = 1$, the variance of each component increases and the contours connect with each other. On the other hand, for the large value $\tau = 10$, the distribution shows high peaks around the centers. This indicates that $p_{\tau,\beta}(x, \theta^*)$, including the fuzzy c-means case ($\tau \to \infty$), has a quite different shape from the Gaussian mixture distribution. Other versions of the shapes are illustrated in Appendix C. The performance of each method is evaluated by MSE based on 100 simulated samples with sample size $n = 3000$.
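One simple way to draw such samples is a random-walk Metropolis–Hastings sampler on the unnormalized density. The sketch below is our own (the proposal scale `step` and burn-in length `burn` are illustrative tuning choices, not values from the paper):

```r
# Random-walk Metropolis-Hastings sketch for drawing from the (unnormalized)
# target density {sum_k pi_k * w(x, mu_k, Sigma_k)}^(1+beta).
target_log <- function(x, mu, Sigma, pi_, tau, beta) {
  w <- sapply(seq_along(pi_), function(k) {
    maha <- mahalanobis(matrix(x, nrow = 1), mu[k, ], Sigma[[k]])
    pi_[k] * det(Sigma[[k]])^(-1/2) * (1 + tau * beta * maha)^(-1/beta)
  })
  (1 + beta) * log(sum(w))
}

rmh <- function(n, mu, Sigma, pi_, tau, beta, step = 1, burn = 1000) {
  d <- ncol(mu); x <- mu[1, ]; out <- matrix(NA, n, d)
  lp <- target_log(x, mu, Sigma, pi_, tau, beta)
  for (t in seq_len(n + burn)) {
    prop    <- x + rnorm(d, sd = step)            # random-walk proposal
    lp_prop <- target_log(prop, mu, Sigma, pi_, tau, beta)
    if (log(runif(1)) < lp_prop - lp) { x <- prop; lp <- lp_prop }
    if (t > burn) out[t - burn, ] <- x
  }
  out
}
# e.g. samples <- rmh(3000, mu_star, Sigma_star, pi_star, tau = 0.5, beta = 1)
```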

2.4.3. Benchmark Data Analysis

The performance of our proposed method is evaluated using benchmark datasets prepared by [32]. It includes a variety of datasets with low and high cluster overlap, various sample sizes, low and high dimensionalities and unbalanced cluster sizes. Hence, these datasets are suitable for clarifying the statistical performance of the clustering methods. In this setting, we compare the performance of k-means, fuzzy c-means, Gaussian, PAM, and Pareto clustering as well as the variants of Pareto clustering with several values of ( τ , β ) as explained in Table S1. The characteristics of the benchmark datasets such as the sample sizes, the number of clusters, and dimensionality are summarized in Table S2.

3. Results

Figure 2 illustrates the results of MSE in the simulation studies. Pareto clustering provides the best performance in panel (a), where the samples are generated by the underlying distribution $p_{0.5, 1}(\theta^*)$ of Pareto clustering. The shape of the distribution is similar to a Gaussian mixture; however, the variance of each component becomes larger and the contour lines connect with each other, as in panel (a) of Figure 1. On the other hand, in panel (c), the variance of each component becomes smaller and the contour lines are completely separated. In both cases, the performance of the Gaussian mixture is clearly degraded. In the case of panel (b), in which the data are generated from the Gaussian mixture, the performances are comparable to each other, suggesting that k-means, fuzzy c-means, PAM, and Pareto clustering are robust to the underlying distribution to some extent.
In the benchmark data analysis, the metrics Purity, F-value, and CI are evaluated in Tables S3–S5, where the variances $\Sigma_k$ and mixing proportions $\pi_k$ ($k = 1, \ldots, K$) in Pareto clustering are estimated. For the two-dimensional shape datasets such as Flame, Compound, D31, Aggregation, Jain, Pathbased, and Spiral in the upper rows of Table S3, existing methods such as k-means, fuzzy c-means, PAM, and the Gaussian mixture outperform our proposed methods. In the high-dimensional data with $d = 1024$ (Dim1024), k-means and the Gaussian mixture do not work well; the other methods achieve the maximum value (1) of Purity. For datasets with a large number of clusters, D31 ($K = 31$) and A3 ($K = 50$), PAM performs best. For datasets with a large sample size ($n \ge 5000$) and a moderate number of clusters ($K = 8, 15$), our proposed method performs best. As for the effect of $\tau$, it barely affects the performance of our proposed method. On the other hand, $\beta$ slightly affects the performance, with the result that the intermediates among the Gaussian mixture, Pareto clustering, k-means, and fuzzy c-means, such as GP, GPKF$_1$, GPKF$_{10}$, and GPKF$_{100}$, show relatively good performance as a whole. We observe similar tendencies for the F-value (Table S4).
As for the CI, the values are relatively small for all methods, suggesting that the cluster locations are properly estimated. However, some methods do not work for some datasets: the Gaussian mixture for A3 (CI = 18), Birch1 (CI = 34) and Birch2 (CI = 49); k-means for D31 (CI = 7), Dim1024 (CI = 4), A3 (CI = 6), Birch1 (CI = 12) and Birch2 (CI = 23); and fuzzy c-means for D31 (CI = 5), A3 (CI = 7), Birch1 (CI = 18) and Birch2 (CI = 25). On the other hand, PAM and our proposed methods show stable results. The results in which $\Sigma_k$ and $\pi_k$ are not estimated but fixed at $\Sigma_k = I$ and $\pi_k = 1/K$ in Pareto clustering are shown in Tables S6–S8.

4. Discussion

We propose a new clustering method based on the generalized energy function derived from the Kolmogorov–Nagumo average. The survival function used in the generalized energy function plays an important role in ensuring the consistency of the parameter estimates, which is shown in Lemma 1 using the divergence property of $G(\mu^*, \mu)$. We consider two examples of the survival function, based on the Pareto and Fréchet distributions, and show a connection among k-means, fuzzy c-means, and the Gaussian mixture, leading to new methods that are intermediates among them. For the underlying distribution of our method in (50), we observe that k-means and fuzzy c-means do not have probabilistic interpretations because the corresponding underlying distributions become singular. We also propose an estimating algorithm for the cluster locations, variances, and mixing proportions using the MM algorithm.
Simulation studies and benchmark data analysis show that intermediates among k-means, fuzzy c-means, and the Gaussian mixture perform well. This observation suggests that our proposed method has a wide range of applications in which k-means, fuzzy c-means, and the Gaussian mixture are used. For example, simultaneous deep learning and clustering [33] in which a deep neural network and k-means are jointly used, image segmentation using fuzzy c-means in a deep neural network [34], an application of fuzzy c-means in classification problems [35] and a parallel computation for large datasets by fuzzy c-means [36] can be investigated in the framework of the generalized energy function by the Pareto distribution.
As for the tuning parameters τ and β , we consider an approach using the leave-one-out cross validation in the Supplementary Materials in order to improve the clustering performance. The objective function in the leave-one-out cross validation is derived from an anchor loss as in [37] to estimate the optimal values of τ and β properly. The benchmark data analysis suggests that the performance is insensitive to the values of τ but is sensitive to the values of β . Hence, this approach should be useful to determine the optimal value of β .
Banerjee et al. [38] proposed a clustering method based on Bregman divergences and clarified the relationship between the exponential families and the corresponding Bregman divergences. They consider hard and soft clustering separately; the former corresponds to k-means-style clustering and the latter to mixture-model clustering. In our proposed model, the tuning parameters $\tau$ and $\beta$ bridge the gap between them, and the performance is investigated in simulation studies and benchmark datasets. The extension of our method obtained by replacing the squared distance $\|x_i - \mu_k\|_{\Sigma_k^{-1}}^2$ with a Bregman divergence should improve its practical flexibility and utility. When the $\beta$- or $\gamma$-divergence is used, the clustering method should be robust to contamination in the observations, as suggested by [39,40].
It is well known that the MM and EM algorithms converge to a local optimum and that the resultant clusters are sensitive to the initial values [41]. One way to circumvent this difficulty is to prepare several sets of initial values and select the best one among them, as in the global k-means algorithm [42]. Another approach is to combine the MM algorithm with a genetic algorithm (GA) to explore the search space for the optimal solution more thoroughly [41,43]. Both approaches can be incorporated into Pareto clustering to make it robust to the initial values and to escape from locally optimal solutions.

Supplementary Materials

The following are available at https://www.mdpi.com/article/10.3390/e23050518/s1. A: Notations of methods and characteristics of benchmark datasets, Figure S1: Summary of clustering methods, Table S2: Sample sizes, number of clusters and dimensions of the benchmark datasets. B: Results of benchmark data with $\Sigma_k \ne I$ in Pareto clustering, Table S3: The result of Purity ($\Sigma_k \ne I$), Table S4: The result of F-values ($\Sigma_k \ne I$), Table S5: The result of the Centroid index ($\Sigma_k \ne I$). C: Results of benchmark data with $\Sigma_k = I$ in Pareto clustering, Table S6: The result of Purity ($\Sigma_k = I$), Table S7: The result of F-values ($\Sigma_k = I$), Table S8: The result of the Centroid index ($\Sigma_k = I$). D: Tuning of parameters $\tau$ and $\beta$. E: R code of the Pareto clustering.

Author Contributions

Conceptualization, S.E.; methodology, S.E. and O.K.; formal analysis, O.K.; writing—original draft preparation, O.K.; All authors have read and agreed to the published version of the manuscript.

Funding

Financial support was provided by the Japan Society for the Promotion of Science KAKENHI Grant Numbers JP18K11190 and JP18H03211.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Benchmark datasets are available at http://cs.uef.fi/sipu/datasets/.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Derivation of the Volume Constant $v_d$

We define
$$ E_{\mu}(x) = \frac{1}{K} \sum_{k=1}^{K} S\!\left(\tau\, \|x - \mu_k\|^2\right). $$
Then we have
$$ \int S\!\left(\tau\, \|x - \mu_k\|^2\right) dx = \tau^{-\frac{d}{2}} \int S(y^{\top} y)\, dy, \qquad \left(y = \tau^{\frac{1}{2}}(x - \mu_k)\right) $$
$$ = \tau^{-\frac{d}{2}} \int_0^{\infty} S(r^2)\, \frac{2 \pi^{\frac{d}{2}}}{\Gamma(\frac{d}{2})}\, r^{d-1}\, dr \qquad \text{(polar coordinates)} $$
$$ = \frac{\pi^{\frac{d}{2}}}{\Gamma(\frac{d}{2})}\, \tau^{-\frac{d}{2}} \int_0^{\infty} S(t)\, t^{\frac{d-2}{2}}\, dt, \qquad (t = r^2) $$
$$ = \frac{\pi^{\frac{d}{2}}}{\Gamma(\frac{d}{2})}\, \tau^{-\frac{d}{2}} \left[ \frac{2}{d}\, t^{\frac{d}{2}}\, S(t) \Big|_0^{\infty} + \frac{2}{d} \int_0^{\infty} t^{\frac{d}{2}} f(t)\, dt \right], \qquad (S'(t) = -f(t)) $$
$$ = \frac{2 \pi^{\frac{d}{2}}}{\Gamma(\frac{d}{2})\, \tau^{\frac{d}{2}}\, d}\, E\!\left[T^{\frac{d}{2}}\right]. $$
The last equality holds under the condition that $\lim_{t \to \infty} t^{\frac{d}{2}} S(t) = 0$. Hence we have
$$ \int E_{\mu}(x)\, dx = \frac{2 \pi^{\frac{d}{2}}}{\Gamma(\frac{d}{2})\, \tau^{\frac{d}{2}}\, d}\, E\!\left[T^{\frac{d}{2}}\right]. $$
When $d = 2$, we have
$$ \int E_{\mu}(x)\, dx = \frac{\pi}{\tau}\, E[T]. $$
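As a sanity check of this constant in the $d = 2$, $S(t) = e^{-t}$ case (where $E[T] = 1$), the following R snippet (ours) approximates $\int E_{\mu}(x)\, dx$ on a grid and compares it with $\pi/\tau$:

```r
# Quick numerical check of the d = 2 identity for S(t) = exp(-t), where
# E[T] = 1: the integral of E_mu(x) over the plane should equal pi / tau.
tau <- 0.7; mu <- matrix(c(0, 0, 3, 1), ncol = 2, byrow = TRUE)
E_mu <- function(x1, x2) {
  X <- cbind(x1, x2)
  rowMeans(sapply(1:nrow(mu), function(k) exp(-tau * rowSums(sweep(X, 2, mu[k, ])^2))))
}
g <- seq(-20, 20, length.out = 801); h <- diff(g)[1]
sum(outer(g, g, E_mu)) * h^2      # approximately pi / tau = 4.4879...
```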

Appendix B. Monotone Decrease of the Generalized Energy Function

The Q-function at iteration $t$ in the estimating algorithm is defined as
$$ Q(\theta \mid \theta^{(t)}) = \frac{1}{\tau\beta} \sum_{i=1}^{n} \sum_{k=1}^{K} \left[ \{q_k^{(t)}(x_i)\}^{1+\beta}\, \pi_k^{-\beta}\, |\Sigma_k|^{\frac{\beta}{2}} \left\{ 1 + \tau\beta\, (x_i - \mu_k)^{\top} \Sigma_k^{-1} (x_i - \mu_k) \right\} - q_k^{(t)}(x_i) \right], $$
where
$$ q_k^{(t)}(x_i) = \frac{\pi_k^{(t)}\, w(x_i, \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{\ell=1}^{K} \pi_\ell^{(t)}\, w(x_i, \mu_\ell^{(t)}, \Sigma_\ell^{(t)})}, \quad (k = 1, \ldots, K). $$
The estimate of $\mu_k$ is given as
$$ \mu_k^{(t+1)} = \frac{\sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}\, x_i}{\sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}} = \sum_{i=1}^{n} v_k^{(t)}(x_i)\, x_i, \quad (k = 1, \ldots, K), $$
where
$$ v_k^{(t)}(x_i) = \frac{\{q_k^{(t)}(x_i)\}^{1+\beta}}{\sum_{j=1}^{n} \{q_k^{(t)}(x_j)\}^{1+\beta}}. $$
Here we observe that
$$ \sum_{i=1}^{n} v_k^{(t)}(x_i)\, (x_i - \mu_k)^{\top} \Sigma_k^{-1} (x_i - \mu_k) = \sum_{i=1}^{n} v_k^{(t)}(x_i)\, (x_i - \mu_k^{(t+1)})^{\top} \Sigma_k^{-1} (x_i - \mu_k^{(t+1)}) + \sum_{i=1}^{n} v_k^{(t)}(x_i)\, (\mu_k^{(t+1)} - \mu_k)^{\top} \Sigma_k^{-1} (\mu_k^{(t+1)} - \mu_k). $$
Hence we have
$$ Q(\theta \mid \theta^{(t)}) - Q(\theta \mid \theta^{(t)})\big|_{\mu_k = \mu_k^{(t+1)}} $$
$$ = \frac{1}{\tau\beta}\, \pi_k^{-\beta}\, |\Sigma_k|^{\frac{\beta}{2}} \sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta} \left\{ 1 + \tau\beta\, (x_i - \mu_k)^{\top} \Sigma_k^{-1} (x_i - \mu_k) \right\} - \frac{1}{\tau\beta}\, \pi_k^{-\beta}\, |\Sigma_k|^{\frac{\beta}{2}} \sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta} \left\{ 1 + \tau\beta\, (x_i - \mu_k^{(t+1)})^{\top} \Sigma_k^{-1} (x_i - \mu_k^{(t+1)}) \right\} $$
$$ = \pi_k^{-\beta}\, |\Sigma_k|^{\frac{\beta}{2}}\, c_q^{(t)} \sum_{i=1}^{n} v_k^{(t)}(x_i) \left\{ (x_i - \mu_k)^{\top} \Sigma_k^{-1} (x_i - \mu_k) - (x_i - \mu_k^{(t+1)})^{\top} \Sigma_k^{-1} (x_i - \mu_k^{(t+1)}) \right\} $$
$$ = \pi_k^{-\beta}\, |\Sigma_k|^{\frac{\beta}{2}}\, c_q^{(t)} \sum_{i=1}^{n} v_k^{(t)}(x_i)\, (\mu_k^{(t+1)} - \mu_k)^{\top} \Sigma_k^{-1} (\mu_k^{(t+1)} - \mu_k) \;\ge\; 0, $$
where
$$ c_q^{(t)} = \sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}. $$
The estimate of $\Sigma_k$ is given by
$$ \Sigma_k^{(t+1)} = \tau (2 - d\beta) \sum_{i=1}^{n} v_k^{(t)}(x_i)\, (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^{\top}, $$
where $d\beta < 2$. Hence we have
$$ Q(\theta \mid \theta^{(t)})\big|_{\mu_k = \mu_k^{(t+1)}} - Q(\theta \mid \theta^{(t)})\big|_{\mu_k = \mu_k^{(t+1)},\, \Sigma_k = \Sigma_k^{(t+1)}} $$
$$ = \frac{1}{\tau\beta}\, \pi_k^{-\beta}\, |\Sigma_k|^{\frac{\beta}{2}} \sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta} \left\{ 1 + \tau\beta\, (x_i - \mu_k^{(t+1)})^{\top} \Sigma_k^{-1} (x_i - \mu_k^{(t+1)}) \right\} - \frac{1}{\tau\beta}\, \pi_k^{-\beta}\, |\Sigma_k^{(t+1)}|^{\frac{\beta}{2}} \sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta} \left\{ 1 + \tau\beta\, (x_i - \mu_k^{(t+1)})^{\top} (\Sigma_k^{(t+1)})^{-1} (x_i - \mu_k^{(t+1)}) \right\} $$
$$ = \frac{1}{\tau\beta}\, \pi_k^{-\beta}\, c_q^{(t)} \left[ |\Sigma_k|^{\frac{\beta}{2}} \left\{ 1 + \tau\beta\, \mathrm{trace}\!\left( \Sigma_k^{-1} \sum_{i=1}^{n} v_k^{(t)}(x_i)\, (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^{\top} \right) \right\} - |\Sigma_k^{(t+1)}|^{\frac{\beta}{2}} \left\{ 1 + \tau\beta\, \mathrm{trace}\!\left( (\Sigma_k^{(t+1)})^{-1} \sum_{i=1}^{n} v_k^{(t)}(x_i)\, (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^{\top} \right) \right\} \right] $$
$$ = \frac{1}{\tau\beta}\, \pi_k^{-\beta}\, c_q^{(t)} \left[ |\Sigma_k|^{\frac{\beta}{2}} \left\{ 1 + \tau\beta\, \mathrm{trace}\!\left( \Sigma_k^{-1}\, \frac{1}{\tau(2 - d\beta)}\, \Sigma_k^{(t+1)} \right) \right\} - |\Sigma_k^{(t+1)}|^{\frac{\beta}{2}} \left( 1 + \frac{d\beta}{2 - d\beta} \right) \right]. $$
Here we notice that
$$ Q(\theta \mid \theta^{(t)})\big|_{\mu_k = \mu_k^{(t+1)}} - Q(\theta \mid \theta^{(t)})\big|_{\mu_k = \mu_k^{(t+1)},\, \Sigma_k = \Sigma_k^{(t+1)}} \;\ge\; 0 $$
$$ \Longleftrightarrow\quad |\Sigma_k^{(t+1)}|^{\frac{\beta}{2}}\, \frac{2}{2 - d\beta} \;\le\; |\Sigma_k|^{\frac{\beta}{2}} \left\{ 1 + \frac{\beta}{2 - d\beta}\, \mathrm{trace}\!\left( \Sigma_k^{-1} \Sigma_k^{(t+1)} \right) \right\} $$
$$ \Longleftrightarrow\quad \frac{2}{2 - d\beta}\, |\Sigma_k^{-1} \Sigma_k^{(t+1)}|^{\frac{\beta}{2}} \;\le\; 1 + \frac{\beta}{2 - d\beta}\, \mathrm{trace}\!\left( \Sigma_k^{-1} \Sigma_k^{(t+1)} \right), $$
where $\mathrm{trace}\{\Sigma_k^{-1} \Sigma_k^{(t+1)}\} = \mathrm{trace}\{\Sigma_k^{-1/2} \Sigma_k^{(t+1)} \Sigma_k^{-1/2}\} = \lambda_1 + \cdots + \lambda_d \ge 0$, with $\lambda_j$ denoting the non-negative eigenvalues of $\Sigma_k^{-1/2} \Sigma_k^{(t+1)} \Sigma_k^{-1/2}$. Here we have
$$ |\Sigma_k^{-1} \Sigma_k^{(t+1)}|^{\frac{\beta}{2}} \left\{ 1 + \frac{\beta}{2 - d\beta}\, \mathrm{trace}\!\left( \Sigma_k^{-1} \Sigma_k^{(t+1)} \right) \right\}^{-1} = (\lambda_1 \cdots \lambda_d)^{\frac{\beta}{2}} \left\{ 1 + \frac{\beta}{2 - d\beta} (\lambda_1 + \cdots + \lambda_d) \right\}^{-1} $$
$$ \le \left( \frac{\lambda_1 + \cdots + \lambda_d}{d} \right)^{\frac{d\beta}{2}} \left\{ 1 + \frac{\beta}{2 - d\beta} (\lambda_1 + \cdots + \lambda_d) \right\}^{-1} \quad (\beta > 0) $$
$$ = \frac{\Lambda^{\frac{d\beta}{2}}}{d^{\frac{d\beta}{2}}} \left\{ 1 + \frac{\beta}{2 - d\beta}\, \Lambda \right\}^{-1}, $$
where $\Lambda = \lambda_1 + \cdots + \lambda_d \ge 0$. The last term attains its maximum, $\frac{2 - d\beta}{2}$, at $\Lambda = d$, leading to
$$ \frac{2}{2 - d\beta}\, |\Sigma_k^{-1} \Sigma_k^{(t+1)}|^{\frac{\beta}{2}} \;\le\; 1 + \frac{\beta}{2 - d\beta}\, \mathrm{trace}\!\left( \Sigma_k^{-1} \Sigma_k^{(t+1)} \right). $$
As for $\pi_k$, we define
$$ R(\theta \mid \theta^{(t)}) = Q(\theta \mid \theta^{(t)}) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right) $$
and we have
$$ \frac{\partial R(\theta \mid \theta^{(t)})}{\partial \pi_k} \Big|_{\mu_k = \mu_k^{(t+1)},\, \Sigma_k = \Sigma_k^{(t+1)},\, \pi_k = \pi_k^{(t+1)}} = 0, $$
$$ \frac{\partial^2 R(\theta \mid \theta^{(t)})}{\partial \pi_k^2} \Big|_{\mu_k = \mu_k^{(t+1)},\, \Sigma_k = \Sigma_k^{(t+1)}} = \frac{1+\beta}{\tau}\, \pi_k^{-\beta-2} \sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}\, |\Sigma_k^{(t+1)}|^{\frac{\beta}{2}} \left\{ 1 + \tau\beta\, (x_i - \mu_k^{(t+1)})^{\top} (\Sigma_k^{(t+1)})^{-1} (x_i - \mu_k^{(t+1)}) \right\} \;\ge\; 0, $$
where
$$ \pi_k^{(t+1)} = \frac{\left[ \sum_{i=1}^{n} \{q_k^{(t)}(x_i)\}^{1+\beta}\, w(x_i, \mu_k^{(t+1)}, \Sigma_k^{(t+1)})^{-\beta} \right]^{\frac{1}{1+\beta}}}{\sum_{\ell=1}^{K} \left[ \sum_{i=1}^{n} \{q_\ell^{(t)}(x_i)\}^{1+\beta}\, w(x_i, \mu_\ell^{(t+1)}, \Sigma_\ell^{(t+1)})^{-\beta} \right]^{\frac{1}{1+\beta}}}. $$
Hence, we have
$$ Q(\theta \mid \theta^{(t)})\big|_{\mu_k = \mu_k^{(t+1)},\, \Sigma_k = \Sigma_k^{(t+1)}} - Q(\theta \mid \theta^{(t)})\big|_{\mu_k = \mu_k^{(t+1)},\, \Sigma_k = \Sigma_k^{(t+1)},\, \pi_k = \pi_k^{(t+1)}} \;\ge\; 0. $$
Inequalities (A18), (A35) and (A40) hold for $k = 1, \ldots, K$. As a result, we have
$$ Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t+1)} \mid \theta^{(t)}) \;\ge\; 0 \quad \text{for all } \theta. $$
Hence we have
$$ L_{\tau,\beta}(\theta^{(t+1)}) \;\le\; Q(\theta^{(t+1)} \mid \theta^{(t)}) \;\le\; Q(\theta^{(t)} \mid \theta^{(t)}) = L_{\tau,\beta}(\theta^{(t)}). $$

Appendix C. Perspective Plots and Contour Plots for $p_{\tau,\beta}(\theta^*)$

Figure A1. Perspective plots (left panels) and contour plots (right panels) for $p_{\tau,\beta}(\theta^*)$.

References

  1. Rokach, L.; Maimon, O. Clustering Methods. In Data Mining and Knowledge Discovery Handbook; Maimon, O., Rokach, L., Eds.; Springer: Boston, MA, USA, 2005. [Google Scholar]
  2. Tukey, J.W. We need both exploratory and confirmatory. Am. Stat. 1980, 314, 23–25. [Google Scholar]
  3. Dubes, R.; Jain, A.K. Clustering methodologies in exploratory data analysis. Adv. Comput. 1980, 19, 113–228. [Google Scholar]
  4. Jain, A.K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010, 31, 651–666. [Google Scholar] [CrossRef]
  5. Ghosh, S.; Dubey, S. Comparative analysis of k-means and fuzzy c-means algorithms. Int. J. Adv. Comput. Sci. Appl. 2013, 4, 35–39. [Google Scholar] [CrossRef] [Green Version]
  6. Komori, O.; Eguchi, S.; Ikeda, S.; Okamura, H.; Ichinokawa, M.; Nakayama, S. An asymmetric logistic regression model for ecological data. Methods Ecol. Evol. 2016, 7, 249–260. [Google Scholar] [CrossRef]
  7. Komori, O.; Eguchi, S.; Saigusa, Y.; Okamura, H.; Ichinokawa, M. Robust bias correction model for estimation of global trend in marine populations. Ecosphere 2017, 8, 1–9. [Google Scholar] [CrossRef]
  8. Omae, K.; Komori, O.; Eguchi, S. Quasi-linear score for capturing heterogeneous structure in biomarkers. BMC Bioinform. 2017, 18, 308. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  9. Naudts, J. Generalised Thermostatistics; Springer: London, UK, 2011. [Google Scholar]
  10. Rose, K.; Gurewitz, E.; Fox, G.C. Statistical mechanics and phase transitions in clustering. Phys. Rev. Lett. 1990, 65, 945–948. [Google Scholar] [CrossRef]
  11. Beirlant, J.; Goegebeur, Y.; Segers, J.; Teugels, J.L.; Waal, D.D.; Ferro, C. Statistics of Extremes: Theory and Applications; Wiley: Hoboken, NJ, USA, 2004. [Google Scholar]
  12. Cox, D.R. Note on grouping. J. Am. Stat. Assoc. 1957, 52, 543–547. [Google Scholar] [CrossRef]
  13. MacQueen, J. Some methods of classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; Cam, L.M.L., Neyman, J., Eds.; University of California Press: Berkeley, CA, USA, 1967; pp. 281–297. [Google Scholar]
  14. Bezdek, J.C.; Ehrlich, R.; Full, W. FCM: The fuzzy c-means clustering algorithm. Comput. Geosci. 1984, 10, 191–203. [Google Scholar] [CrossRef]
  15. Hathaway, R.J.; Bezdek, J.C. Optimization of clustering criteria by reformulation. IEEE Trans. Fuzzy Syst. 1995, 3, 241–245. [Google Scholar] [CrossRef]
  16. Yu, J. General C-means clustering model. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1197–1211. [Google Scholar] [PubMed]
  17. Hunter, D.R.; Lange, K. A tutorial on MM algorithms. Am. Stat. 2004, 58, 30–37. [Google Scholar] [CrossRef]
  18. Eguchi, S.; Komori, O. Path Connectedness on a Space of Probability Density Functions. In Geometric Science of Information: Second International Conference, GSI 2015; Nielsen, F., Barbaresco, F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; p. 615. [Google Scholar]
  19. Komori, O.; Eguchi, S.; Saigusa, Y.; Kusumoto, B.; Kubota, Y. Sampling bias correction in species distribution models by quasi-linear Poisson point process. Ecol. Inform. 2020, 55, 1–11. [Google Scholar] [CrossRef]
  20. Nelsen, R.B. An Introduction to Copulas; Springer: New York, NY, USA, 2006. [Google Scholar]
  21. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. 1977, 39, 1–38. [Google Scholar]
  22. Scrucca, L.; Fop, M.; Murphy, T.B.; Raftery, A.E. mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. R J. 2016, 8, 289–317. [Google Scholar] [CrossRef] [Green Version]
  23. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009. [Google Scholar]
  24. Hartigan, J.A.; Wong, M.A. A k-means clustering algorithm. J. R. Stat. Soc. Ser. 1979, 28, 100–108. [Google Scholar]
  25. Reynolds, A.P.; Richards, G.; de la Iglesia, B.; Rayward-Smith, V.J. Clustering rules: A comparison of partitioning and hierarchical clustering algorithms. J. Math. Model. Algorithms 2006, 5, 475–504. [Google Scholar] [CrossRef]
  26. Fränti, P.; Rezaei, M.; Zhao, Q. Centroid index: Cluster level similarity measure. Pattern Recognit. 2014, 47, 3034–3045. [Google Scholar] [CrossRef]
  27. Sofaer, H.R.; Hoeting, J.A.; Jarnevich, C.S. The area under the precision-recall curve as a performance metric for rare binary events. Methods Ecol. Evol. 2019, 10, 565–577. [Google Scholar] [CrossRef]
  28. Amigó, E.; Gonzalo, J.; Artiles, J.; Verdejo, F. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf. Retr. 2009, 12, 461–486. [Google Scholar] [CrossRef] [Green Version]
  29. Van Rijsbergen, C. Foundation of evaluation. J. Doc. 1974, 30, 365–373. [Google Scholar] [CrossRef]
  30. Hastings, W.K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 1970, 57, 97–109. [Google Scholar] [CrossRef]
  31. Chib, S.; Greenberg, E. Understanding the Metropolis-Hastings algorithm. Am. Stat. 1995, 49, 327–335. [Google Scholar]
  32. Fränti, P.; Sieranoja, S. K-means properties on six clustering benchmark datasets. Appl. Intell. 2018, 48, 4743–4759. [Google Scholar] [CrossRef]
  33. Yang, B.; Fu, X.; Sidiropoulos, N.D.; Hong, M. Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 7–9 August 2017; Precup, D., Teh, Y.W., Eds.; 2017; Volume 70, pp. 3861–3870. [Google Scholar]
  34. Mohsen, H.; El-Dahshan, E.S.A.; El-Horbaty, E.S.M.; Salem, A.B.M. Classification using deep learning neural networks for brain tumors. Future Comput. Inform. J. 2018, 3, 68–71. [Google Scholar] [CrossRef]
  35. Gorsevski, P.V.; Gessler, P.E.; Jankowski, P. Integrating a fuzzy k-means classification and a Bayesian approach for spatial prediction of landslide hazard. J. Geogr. Syst. 2003, 5, 223–251. [Google Scholar] [CrossRef]
  36. Kwok, T.; Smith, K.; Lozano, S.; Taniar, D. Parallel Fuzzy c- Means Clustering for Large Data Sets. In Euro-Par 2002 Parallel Processing; Monien, B., Feldmann, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2002; pp. 365–374. [Google Scholar]
  37. Mollah, M.N.H.; Eguchi, S.; Minami, M. Robust Prewhitening for ICA by Minimizing β-Divergence and Its Application to FastICA. Neural Process. Lett. 2007, 25, 91–110. [Google Scholar] [CrossRef]
  38. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
  39. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081. [Google Scholar] [CrossRef] [Green Version]
  40. Notsu, A.; Eguchi, S. Robust clustering method in the presence of scattered observations. Neural Comput. 2016, 28, 1141–1162. [Google Scholar] [CrossRef] [PubMed]
  41. Pernkopf, F.; Bouchaffra, D. Genetic-based EM algorithm for learning Gaussian mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1344–1348. [Google Scholar] [CrossRef] [PubMed]
  42. Likas, A.; Vlassis, N.; Verbeek, J.J. The global k-means clustering algorithm. Pattern Recognit. 2003, 36, 451–461. [Google Scholar] [CrossRef] [Green Version]
  43. Krishna, K.; Murty, M.N. Genetic K-means algorithm. IEEE Trans. Syst. Man Cybern. Part (Cybern.) 1999, 29, 433–439. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Figure 1. Perspective plots (left panels) and contour plots for $p_{\tau,\beta}(\theta^*)$ with boundaries marked in red for (a) $\tau = 0.5$, $\beta = 1$, (b) $\tau = 0.5$, $\beta = 0$ and (c) $\tau = 10$, $\beta = 1$.
Figure 2. Mean squared errors (MSE) on the log scale based on 100 random samples for each method. The samples are generated based on $p_{\tau,\beta}(\theta^*)$ with (a) $\tau = 0.5$, $\beta = 1$, (b) $\tau = 0.5$, $\beta = 0$ and (c) $\tau = 10$, $\beta = 1$.