Article

Stochastic Approximate Algorithms for Uncertain Constrained K-Means Problem

Jianguang Lu, Juan Tang, Bin Xing and Xianghong Tang
1 State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
2 Chongqing Innovation Center of Industrial Big-Data Co., Ltd., Chongqing 400707, China
3 School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou 510006, China
4 Institute of Computing Science and Technology, Guangzhou University, Guangzhou 510006, China
5 National Engineering Laboratory for Industrial Big-Data Application Technology, Chongqing 400707, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(1), 144; https://doi.org/10.3390/math10010144
Submission received: 25 November 2021 / Revised: 24 December 2021 / Accepted: 27 December 2021 / Published: 4 January 2022

Abstract

The k-means problem has received much attention in many applications. In this paper, we define the uncertain constrained k-means problem and propose a $(1+\epsilon)$-approximate algorithm for it. First, a general mathematical model of the uncertain constrained k-means problem is proposed. Second, the random sampling properties of the uncertain constrained k-means problem are studied. We mainly study the gap between the center obtained by random sampling and the true center, which should be controlled within a given range with large probability; this yields the sampling properties needed to solve this kind of problem. Finally, using mathematical induction, we assume that the first $j-1$ cluster centers have been obtained, so that only the $j$-th center remains to be found. The algorithm has elapsed time $O((\frac{1891ek}{\epsilon^2})^{8k/\epsilon}nd)$ and outputs a collection of size $O((\frac{1891ek}{\epsilon^2})^{8k/\epsilon}n)$ of candidate sets containing approximate centers.

1. Introduction

The k-means problem has received much attention in the past several decades. The k-means problem consists of partitioning a set $P$ of points in the d-dimensional space $\mathbb{R}^d$ into $k$ subsets $P_1,\dots,P_k$ such that $\sum_{i=1}^{k}\sum_{p\in P_i}\|p-c_i\|^2$ is minimized, where $c_i$ is the center of $P_i$ and $\|p-q\|$ is the distance between two points $p$ and $q$. The k-means problem is one of the classical NP-hard problems and has been studied extensively in the literature [1,2,3].
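To make the objective concrete, here is a minimal sketch (our own illustration, not from the paper) that evaluates the k-means cost of a given partition with NumPy:

```python
import numpy as np

def kmeans_cost(points, labels, centers):
    """Sum over clusters P_i of sum_{p in P_i} ||p - c_i||^2."""
    return sum(np.sum((points[labels == i] - c) ** 2)
               for i, c in enumerate(centers))
```

For a fixed partition, this cost is minimized by taking each $c_i$ to be the centroid of $P_i$, which is why centroids of candidate subsets are the natural search space in what follows.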
For many applications, each cluster of the point set may have to satisfy some additional constraints, such as chromatic clustering [4], r-capacity clustering [5], r-gather clustering [6], fault-tolerant clustering [7], uncertain data clustering [8], semi-supervised clustering [9], and l-diversity clustering [10]. The constrained clustering problem was studied by Ding and Xu, who presented the first unified framework in [11]. Given a point set $P\subseteq\mathbb{R}^d$, a positive integer $k$, and a list of constraints $\mathbb{L}$, the constrained k-means problem is to partition $P$ into $k$ clusters $\mathcal{P}=\{P_1,\dots,P_k\}$ such that all constraints in $\mathbb{L}$ are satisfied and $\sum_{P_i\in\mathcal{P}}\sum_{x\in P_i}\|x-c(P_i)\|^2$ is minimized, where $c(P_i)=\frac{1}{|P_i|}\sum_{x\in P_i}x$ denotes the centroid of $P_i$.
In recent years, particular research attention has been focused on the constrained k-means problem. Ding and Xu [11] gave the first polynomial-time approximation scheme, with running time $O(2^{\mathrm{poly}(k/\epsilon)}(\log n)^{k}nd)$, for the constrained k-means problem, and obtained a collection of size $O(2^{\mathrm{poly}(k/\epsilon)}(\log n)^{k+1})$ of candidate approximate centers. The fastest existing approximation schemes for the constrained k-means problem take $O(2^{O(k/\epsilon)}nd)$ time [12,13]; such a scheme was first given by Bhattacharya, Jaiswal, and Kumar [12], whose algorithm outputs a collection of size $O(2^{O(k/\epsilon)})$ of candidate approximate centers. In this paper, we propose the uncertain constrained k-means problem, which assumes that all points are random variables with probabilistic distributions, and we present a stochastic approximate algorithm for it. The uncertain constrained k-means problem can be regarded as a generalization of the constrained k-means problem. We prove the random sampling properties of the uncertain constrained k-means problem, which are fundamental for our proposed algorithm. By applying random sampling and mathematical induction, we obtain a stochastic approximate algorithm with lower complexity for the uncertain constrained k-means problem.
This paper is organized as follows. Some basic notations are given in Section 2. Section 3 provides an overview of the new algorithm for the uncertain constrained k-means problem. In Section 4, we discuss the detailed algorithm for the uncertain constrained k-means problem. In Section 5, we investigate the correctness, success probability, and running time analysis of the algorithm. Section 6 concludes this paper and gives possible directions for future research.

2. Preliminaries

Definition 1
(Uncertain constrained k-means problem). Given a set $\mathcal{X}$ of random variables over $\mathbb{R}^d$, the probability density function $f_X(s)$ of every random variable $X\in\mathcal{X}$, a list of constraints $\mathbb{L}$, and a positive integer $k$, the uncertain constrained k-means problem is to partition $\mathcal{X}$ into $k$ clusters $\mathcal{X}=\{X_1,\dots,X_k\}$ such that all constraints in $\mathbb{L}$ are satisfied and $\sum_{X_i\in\mathcal{X}}\sum_{X\in X_i}\int_{\mathbb{R}^d}\|s-c(X_i)\|^2 f_X(s)\,ds$ is minimized, where $c(X_i)=\frac{1}{|X_i|}\sum_{X\in X_i}\int_{\mathbb{R}^d}s\,f_X(s)\,ds$ denotes the centroid of $X_i$.
Definition 2
([13]). Let $\mathcal{X}$ be a set of random variables over $\mathbb{R}^d$, $f_X(s)$ be the probability density function of every random variable $X\in\mathcal{X}$, $q\in\mathbb{R}^d$, and $P$ be a set of points in $\mathbb{R}^d$ with $p\in P$.
  • Define $f^2(q,\mathcal{X})=\sum_{X\in\mathcal{X}}\int_{\mathbb{R}^d}\|s-q\|^2 f_X(s)\,ds$.
  • Define $c(\mathcal{X})=\frac{1}{|\mathcal{X}|}\sum_{X\in\mathcal{X}}\int_{\mathbb{R}^d}s\,f_X(s)\,ds$.
  • Define $dist(X,P)=\min_{p\in P}\int_{\mathbb{R}^d}\|s-p\|\,f_X(s)\,ds$.
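Since $f^2$, $c(\mathcal{X})$, and $dist$ are integrals against the densities $f_X$, they can be estimated by Monte Carlo when each uncertain point is represented by samples from its distribution. The sketch below is only our illustration of the definitions; the sample-based representation and the Gaussian inputs are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each uncertain point X is represented by an array of i.i.d. samples from f_X.
def f2(q, xs):
    # f^2(q, X) = sum_X integral ||s - q||^2 f_X(s) ds
    return sum(np.mean(np.sum((S - q) ** 2, axis=1)) for S in xs)

def center(xs):
    # c(X) = (1/|X|) sum_X integral s f_X(s) ds, i.e., the mean of the means
    return np.mean([S.mean(axis=0) for S in xs], axis=0)

def dist(S, P):
    # dist(X, P) = min_{p in P} integral ||s - p|| f_X(s) ds
    return min(np.mean(np.linalg.norm(S - p, axis=1)) for p in P)

# Two Gaussian uncertain points in R^2 (assumed for illustration).
xs = [rng.normal(loc=m, scale=0.1, size=(5000, 2)) for m in ([0, 0], [1, 1])]
print(center(xs), f2(center(xs), xs))
```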
Definition 3
([13]). Let $\mathcal{X}$ be a set of random variables over $\mathbb{R}^d$, $f_X(s)$ be the probability density function of every random variable $X\in\mathcal{X}$, and $X_1,\dots,X_k$ be a partition of $\mathcal{X}$.
  • Define $m_j=c(X_j)$.
  • Define $\beta_j=\frac{|X_j|}{|\mathcal{X}|}$.
  • Define $\sigma_j^2=\frac{f^2(m_j,X_j)}{|X_j|}$.
  • Define
    $$OPT_k(\mathcal{X})=\sum_{j=1}^{k}\sum_{X\in X_j}\int_{\mathbb{R}^d}\|s-c(X_j)\|^2 f_X(s)\,ds=\sum_{j=1}^{k}f^2(m_j,X_j).$$
  • Define $\sigma_{opt}^2=\frac{OPT_k(\mathcal{X})}{|\mathcal{X}|}=\sum_{i=1}^{k}\beta_i\sigma_i^2$.
Lemma 1.
For any point $x\in\mathbb{R}^d$ and any set $\mathcal{X}$ of random variables over $\mathbb{R}^d$, $f^2(x,\mathcal{X})=f^2(c(\mathcal{X}),\mathcal{X})+|\mathcal{X}|\,\|c(\mathcal{X})-x\|^2$.
Proof. 
Let $f_X(s)$ be the probability density function of every random variable $X\in\mathcal{X}$. Then:

$$\begin{aligned} f^2(x,\mathcal{X}) &= \sum_{X\in\mathcal{X}}\int_{\mathbb{R}^d}\|s-x\|^2 f_X(s)\,ds \\ &= \sum_{X\in\mathcal{X}}\int_{\mathbb{R}^d}\|s-c(\mathcal{X})+c(\mathcal{X})-x\|^2 f_X(s)\,ds \\ &= \sum_{X\in\mathcal{X}}\int_{\mathbb{R}^d}\|s-c(\mathcal{X})\|^2 f_X(s)\,ds+\sum_{X\in\mathcal{X}}\int_{\mathbb{R}^d}\|c(\mathcal{X})-x\|^2 f_X(s)\,ds \\ &= f^2(c(\mathcal{X}),\mathcal{X})+\|c(\mathcal{X})-x\|^2\sum_{X\in\mathcal{X}}\int_{\mathbb{R}^d}f_X(s)\,ds \\ &= f^2(c(\mathcal{X}),\mathcal{X})+|\mathcal{X}|\,\|c(\mathcal{X})-x\|^2. \end{aligned}$$

The third equality follows from the fact that $\sum_{X\in\mathcal{X}}\int_{\mathbb{R}^d}(s-c(\mathcal{X}))f_X(s)\,ds=0$, so the cross term vanishes.    □
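Deterministic points are a special case of random variables (point-mass densities), so Lemma 1 can be sanity-checked directly on a finite point set; this little check is our own illustration, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))   # |X| = 100 point-mass "random variables" in R^3
c = X.mean(axis=0)              # c(X)
x = rng.normal(size=3)          # an arbitrary point

lhs = np.sum((X - x) ** 2)                                   # f^2(x, X)
rhs = np.sum((X - c) ** 2) + len(X) * np.sum((c - x) ** 2)   # Lemma 1's right side
assert np.isclose(lhs, rhs)     # the identity holds exactly
```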
Lemma 2.
Let $\mathcal{X}$ be a set of random variables over $\mathbb{R}^d$ and $f_X(s)$ be the probability density function of every random variable $X\in\mathcal{X}$. Assume that $T$ is a set of random variables obtained by sampling random variables from $\mathcal{X}$ uniformly and independently. For any $\delta>0$, we have:

$$Pr\Big(\|c(T)-c(\mathcal{X})\|^2>\frac{\sigma^2}{\delta|T|}\Big)<\delta,$$

where $\sigma^2=\frac{1}{|\mathcal{X}|}\sum_{X\in\mathcal{X}}\int_{\mathbb{R}^d}\|s-c(\mathcal{X})\|^2 f_X(s)\,ds$.
Proof. 
First, observe that

$$E(c(T))=c(\mathcal{X}),\qquad E\big(\|c(T)-c(\mathcal{X})\|^2\big)=\frac{\sigma^2}{|T|},$$

where $\sigma^2=\frac{1}{|\mathcal{X}|}\sum_{X\in\mathcal{X}}\int_{\mathbb{R}^d}\|s-c(\mathcal{X})\|^2 f_X(s)\,ds$. Then apply the Markov inequality to obtain

$$Pr\Big(\|c(T)-c(\mathcal{X})\|^2>\frac{\sigma^2}{\delta|T|}\Big)<\delta.$$
   □
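The bound is easy to probe empirically in the point-mass case: the failure event $\|c(T)-c(\mathcal{X})\|^2>\sigma^2/(\delta|T|)$ should occur with frequency below $\delta$. The following small experiment is our own illustration (sampling with replacement):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10000, 2))                 # point-mass random variables
c = X.mean(axis=0)
sigma2 = np.mean(np.sum((X - c) ** 2, axis=1))  # sigma^2 from Lemma 2

t, delta, trials, fails = 50, 0.2, 5000, 0
for _ in range(trials):
    T = X[rng.integers(0, len(X), size=t)]      # uniform, independent sample
    if np.sum((T.mean(axis=0) - c) ** 2) > sigma2 / (delta * t):
        fails += 1
print(fails / trials, "<", delta)               # empirical rate stays below delta
```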
Lemma 3.
Let $Q$ be a set of random variables over $\mathbb{R}^d$, $f_X(s)$ be the probability density function of every random variable $X\in Q$, and $Q_1$ be an arbitrary subset of $Q$ with $\alpha|Q|$ random variables for some $0<\alpha\le 1$. Then $\|c(Q)-c(Q_1)\|\le\sqrt{\frac{1-\alpha}{\alpha}}\,\sigma$, where $\sigma^2=\frac{1}{|Q|}\sum_{X\in Q}\int_{\mathbb{R}^d}\|s-c(Q)\|^2 f_X(s)\,ds$.
Proof. 
Let $Q_2=Q\setminus Q_1$. By Lemma 1, we have the following two equalities:

$$f^2(c(Q),Q_1)=f^2(c(Q_1),Q_1)+|Q_1|\,\|c(Q_1)-c(Q)\|^2,$$
$$f^2(c(Q),Q_2)=f^2(c(Q_2),Q_2)+|Q_2|\,\|c(Q_2)-c(Q)\|^2.$$

Let $L=\|c(Q_1)-c(Q_2)\|$. By the definition of the mean point, we have:

$$c(Q)=\frac{1}{|Q|}\sum_{X\in Q}\int_{\mathbb{R}^d}s\,f_X(s)\,ds=\frac{1}{|Q|}\big(|Q_1|c(Q_1)+|Q_2|c(Q_2)\big).$$

Thus, the three points $c(Q)$, $c(Q_1)$, $c(Q_2)$ are collinear, with $\|c(Q_1)-c(Q)\|=(1-\alpha)L$ and $\|c(Q_2)-c(Q)\|=\alpha L$. Meanwhile, by the definition of $\sigma$, we have $\sigma^2=\frac{1}{|Q|}\big(\sum_{X\in Q_1}\int_{\mathbb{R}^d}\|s-c(Q)\|^2 f_X(s)\,ds+\sum_{X\in Q_2}\int_{\mathbb{R}^d}\|s-c(Q)\|^2 f_X(s)\,ds\big)$. Combining this with the two equalities above, we have:

$$\begin{aligned} \sigma^2 &\ge \frac{1}{|Q|}\big(|Q_1|\,\|c(Q_1)-c(Q)\|^2+|Q_2|\,\|c(Q_2)-c(Q)\|^2\big) \\ &= \alpha\big((1-\alpha)L\big)^2+(1-\alpha)(\alpha L)^2 \\ &= \alpha(1-\alpha)L^2. \end{aligned}$$

Thus, we have $L\le\frac{\sigma}{\sqrt{\alpha(1-\alpha)}}$, which means that $\|c(Q)-c(Q_1)\|=(1-\alpha)L\le\sqrt{\frac{1-\alpha}{\alpha}}\,\sigma$.    □
Lemma 4
([12]). For any $x,y,z\in\mathbb{R}^d$, $\|x-z\|^2\le 2\|x-y\|^2+2\|y-z\|^2$.
Theorem 1
([14]). Let $X_1,\dots,X_s$ be $s$ independent 0–1 random variables, where $X_i$ takes the value 1 with probability at least $p$ for $i=1,\dots,s$. Let $X=\sum_{i=1}^{s}X_i$. Then, for any $\delta>0$, $Pr\big(X<(1-\delta)ps\big)<e^{-\frac{1}{2}\delta^2 ps}$.
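For intuition, the tail bound of Theorem 1 can be compared against a simulation (the parameter values below are arbitrary; this is our own illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
p, s, delta, trials = 0.3, 200, 0.5, 100000
sums = (rng.random((trials, s)) < p).sum(axis=1)   # s independent 0-1 variables
empirical = np.mean(sums < (1 - delta) * p * s)    # observed tail frequency
bound = np.exp(-0.5 * delta**2 * p * s)            # Theorem 1's guarantee
print(empirical, "<=", bound)
```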

3. Overview of Our Method

In this section, we first introduce the main idea of our methodology to solve the uncertain constrained k-means problem.
Consider the optimal partition $\mathcal{X}=\{X_1,\dots,X_k\}$ (with $|X_1|\ge\cdots\ge|X_k|$) of $\mathcal{X}$. Since $|X_1|/|\mathcal{X}|\ge 1/k$, if we sample a set $S$ of size $O(k/\epsilon)$ from $\mathcal{X}$ uniformly and independently, then at least $O(1/\epsilon)$ random variables in $S$ are from $X_1$ with a certain probability. All subsets of $S$ of size $O(1/\epsilon)$ can then be enumerated to discover the approximate center of $X_1$.
We assume that $C_{j-1}=\{c_1,\dots,c_{j-1}\}$ is the set of approximate centers of $X_1,\dots,X_{j-1}$. Let $B_j=\{X\in\mathcal{X}\mid dist(X,C_{j-1})=\min_{c\in C_{j-1}}\int_{\mathbb{R}^d}\|s-c\|\,f_X(s)\,ds\le r_j\}$, where $r_j=\sqrt{\frac{\epsilon}{40\beta_j k}}\,\sigma_{opt}$. The set $X_j$ is divided into two parts, $X_j^{out}$ and $X_j^{in}$, where $X_j^{out}=X_j\setminus B_j$ and $X_j^{in}=X_j\cap B_j$. For each random variable $X$, let $\tilde{X}$ be the nearest point (a particular random variable) in $C_{j-1}$ to $X$. Let $\tilde{X}_j^{in}=\{\tilde{X}\mid X\in X_j^{in}\}$ and $\tilde{X}_j=\tilde{X}_j^{in}\cup X_j^{out}$.
If most of the random variables of $X_j$ are in $X_j^{in}$, our idea is to use the center of $\tilde{X}_j^{in}$ to approximate the center of $X_j$; the center of $\tilde{X}_j^{in}$ is found based on $C_{j-1}$. If most of the random variables of $X_j$ are in $X_j^{out}$, our idea is to replace the center of $X_j$ with the center of $\tilde{X}_j$. To seek out the approximate center of $\tilde{X}_j$, we need a subset $S^*$ obtained by uniform sampling that comes from $X_j^{out}$. However, the set $X_j^{out}$ is unknown. We apply a branching strategy to find a set $Q$ such that $\mathcal{X}\setminus B_j\subseteq Q$ and $|Q|<2|\mathcal{X}\setminus B_j|$. Then, a random variable set $S$ is obtained by sampling random variables from $Q$ independently and uniformly, and a subset $S^*$ of $S$ whose random variables all come from $X_j^{out}$ can be extracted. Based on $S^*$ and $\tilde{X}_j^{in}$, the approximate center of $\tilde{X}_j$ can be obtained. Therefore, the algorithm presented in this paper outputs a collection of size $O((\frac{1891ek}{\epsilon^2})^{8k/\epsilon}n)$ of candidate sets containing approximate centers, and has running time $O((\frac{1891ek}{\epsilon^2})^{8k/\epsilon}nd)$.
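Both cases reduce to the same primitive: draw a uniform sample, enumerate its small subsets, and take subset centroids as candidate centers. A minimal sketch of this primitive (deterministic points for brevity; the sample size and subset size below are illustrative toy values, not the paper's $N$ and $M$):

```python
import numpy as np
from itertools import combinations

def candidate_centers(points, k, eps, rng):
    """Centroids of all small subsets of a uniform sample; with constant
    probability one of them is close to the center of the largest cluster."""
    sample = points[rng.integers(0, len(points), size=int(4 * k / eps))]
    m = max(1, int(1 / eps))    # subset size O(1/eps)
    return [sample[list(idx)].mean(axis=0)
            for idx in combinations(range(len(sample)), m)]
```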

4. Our Algorithm cMeans

Given an instance $(\mathcal{X},k,\mathbb{L})$ of the uncertain constrained k-means problem, let $\mathcal{X}=\{X_1,\dots,X_k\}$ denote an optimal partition of $(\mathcal{X},k,\mathbb{L})$. There are six parameters $(\epsilon, Q, g, k, C, U)$ in our cMeans, where $\epsilon\in(0,1]$ is the approximation factor, $Q$ is the input random variable set, $g$ is the number of centers still to be found, $k$ is the number of clusters, $C$ is the set of approximate cluster centers found so far, and $U$ is a collection of candidate sets containing the approximate centers. Let $M=\frac{6}{\epsilon}$ and $N=\frac{79380k}{\epsilon^3}$, where $M$ is the size of the enumerated subsets of the sampling set and $N$ is the size of the sampling set. Without loss of generality, assume that $M$ and $N$ are integers.
We use the branching strategy to seek out the approximate centers of the clusters in $\mathcal{X}$. There are two branches in our algorithm cMeans, as shown in Figure 1. On the first branch, a set $S_1$ of size $N$ is obtained by sampling from $Q$ uniformly and independently, and $S_2$ is constructed from $S_1$ together with $M$ copies of each point in $C$. Then, for each subset $S$ of size $M$ of $S_2$, the centroid $c$ of $S$ is computed as a candidate approximate center of $X_{k-g+1}$, and cMeans($\epsilon$, $Q$, $g-1$, $k$, $C\cup\{c\}$, $U$) is called to obtain the remaining $g-1$ cluster centers.
On the second branch, for each random variable $X\in Q$, we first calculate the distance between $X$ and $C$. Let $H$ denote the multi-set of all distances of random variables in $Q$ to $C$. We take the median value $m$ of $H$, i.e., the $\lceil|H|/2\rceil$-th element when the values in $H$ are sorted. Based on $m$, $Q$ is divided into two parts $Q'$ and $Q''$ such that $dist(X',C)\ge dist(X'',C)$ for all $X'\in Q'$ and $X''\in Q''$, where $|Q'|=\lceil|Q|/2\rceil$ and $|Q''|=\lfloor|Q|/2\rfloor$. The subroutine cMeans($\epsilon$, $Q'$, $g$, $k$, $C$, $U$) is then called to obtain the remaining $g$ cluster centers. The specific procedure for seeking out a collection of candidate sets is presented in Algorithm 1.
Algorithm 1: cMeans($\epsilon$, $Q$, $g$, $k$, $C$, $U$)
[The pseudocode of Algorithm 1 is given as an image in the original publication; its two branches follow the description above.]
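Since the pseudocode itself survives only as an image, the following Python sketch is our reading of Algorithm 1's control flow, reconstructed from the description above. The helper functions, the sample-based representation of random variables, and details such as which half of $Q$ the second branch keeps are assumptions rather than the authors' exact code, and with the true $N$ and $M$ the subset enumeration is astronomically large, so only toy parameters are practical:

```python
import numpy as np
from itertools import combinations

def ev(X):
    """Expected value of an uncertain point given as an array of samples."""
    return X.mean(axis=0) if X.ndim == 2 else X

def dist(X, C):
    """dist(X, C): expected distance of X to its nearest center in C."""
    S = X if X.ndim == 2 else X[None, :]
    return min(np.mean(np.linalg.norm(S - c, axis=1)) for c in C)

def cmeans(eps, Q, g, k, C, U, rng, N, M):
    if g == 0:                      # k centers chosen: record a candidate set
        U.append(list(C))
        return
    # Branch 1: uniform sample of size N, padded with M copies of each chosen
    # center; every size-M subset centroid is tried as the next center.
    S1 = [Q[i] for i in rng.integers(0, len(Q), size=N)]
    S2 = S1 + [c for c in C for _ in range(M)]
    for idx in combinations(range(len(S2)), M):
        c = np.mean([ev(S2[i]) for i in idx], axis=0)
        cmeans(eps, Q, g - 1, k, C + [c], U, rng, N, M)
    # Branch 2: drop the half of Q nearest to C and search the far half.
    if len(Q) >= 2 and C:
        far = sorted(Q, key=lambda X: dist(X, C), reverse=True)
        cmeans(eps, far[:(len(Q) + 1) // 2], g, k, C, U, rng, N, M)
```

For example, cmeans(0.5, Q, k, k, [], U, rng, N=8, M=2) on a small $Q$ exercises both branches; the analysis in Section 5 concerns the true parameter values $N=79380k/\epsilon^3$ and $M=6/\epsilon$.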

5. Analysis of Our Algorithm cMeans

We investigate the success probability, correctness, and time complexity analysis of the algorithm cMeans in this section.
Lemma 5.
With a probability of at least $1/12^k$, there exists a candidate set $C_k=\{c_1,\dots,c_k\}$ of approximate centers in $U$ satisfying $\|m_j-c_j\|^2\le\frac{9}{10}\epsilon\sigma_j^2+\frac{\epsilon}{10\beta_j k}\sigma_{opt}^2$ for $1\le j\le k$.
Lemmas 6 to 16 below are used to prove Lemma 5, which we prove via induction on $j$. For $j=1$, we easily obtain $\beta_1\ge 1/k$, and we prove the success probability first.
Lemma 6.
In the process of finding $c_1$ in our algorithm cMeans, by sampling a set $S_1$ of $\frac{79380k}{\epsilon^3}$ random variables from $\mathcal{X}$ independently and uniformly, the probability that at least $\frac{6}{\epsilon}$ random variables in $S_2$ are from $X_1$ is at least $1/2$.
Proof. 
In our algorithm cMeans, we assume that $S_1=\{S_1',\dots,S_N'\}$, where $N=\frac{79380k}{\epsilon^3}$. Let $x_1,\dots,x_N$ be the corresponding indicator variables of the elements in $S_1$: if $S_i'\in X_1$, then $x_i=1$; otherwise $x_i=0$. It is easy to see that $Pr[S_i'\in X_1]\ge\frac{1}{k}$. Let $x=\sum_{i=1}^{N}x_i$ and $u=\sum_{i=1}^{N}E(x_i)$, so that $u\ge\frac{79380}{\epsilon^3}$. Then,

$$\begin{aligned} Pr\Big[x>\frac{6}{\epsilon}\Big] &= 1-Pr\Big[x\le\frac{6}{\epsilon}\Big] \\ &= 1-Pr\Big[x\le\frac{6\epsilon^2}{79380}\cdot\frac{79380}{\epsilon^3}\Big] \\ &\ge 1-Pr\Big[x\le\frac{\epsilon^2}{13230}\,u\Big] \\ &\ge 1-e^{-\frac{1}{2}(1-\frac{\epsilon^2}{13230})^2 u} \\ &\ge 1-e^{-\frac{1}{2}(1-\frac{\epsilon^2}{13230})^2\cdot\frac{79380}{\epsilon^3}} \\ &\ge 1-e^{-\frac{1}{2}(1-\frac{1}{13230})^2\cdot 79380} \\ &\ge \frac{1}{2}. \end{aligned}$$

□
From Lemma 6, a subset $S^*$ of $S_2$ of size $\frac{6}{\epsilon}$ can be obtained such that all random variables in $S^*$ are from $X_1$ with a probability of at least $1/2$. Let $c_1$ denote the centroid of $S^*$, and let $\delta=5/6$. Since $|S^*|=\frac{6}{\epsilon}$, by Lemma 2 we conclude that $\|m_1-c_1\|^2\le\frac{1}{5}\epsilon\sigma_1^2$ holds with a probability of at least $1/6$. Then, the probability that a subset $S^*$ of size $\frac{6}{\epsilon}$ of $S_2$ can be found such that $\|m_1-c_1\|^2\le\frac{1}{5}\epsilon\sigma_1^2\le\frac{9}{10}\epsilon\sigma_1^2+\frac{\epsilon}{10\beta_1 k}\sigma_{opt}^2$ holds is at least $1/12$. Therefore, we conclude that Lemma 5 holds for $j=1$.
Now assume that for all $j\le j_0$ ($1\le j_0<k$), Lemma 5 holds with a probability of at least $1/12^{j}$. Considering the case $j=j_0+1$, we prove Lemma 5 in the following two cases: (1) $|X_j^{out}|\le\frac{\epsilon}{49}\beta_j n$; (2) $|X_j^{out}|>\frac{\epsilon}{49}\beta_j n$.

5.1. Analysis for Case 1: $|X_j^{out}|\le\frac{\epsilon}{49}\beta_j n$

Since $|X_j^{out}|\le\frac{\epsilon}{49}\beta_j n$, most of the random variables of $X_j$ are in $B_j$. Our idea is to replace the center of $X_j$ with the center of $\tilde{X}_j^{in}$. Thus, we need to find the approximate center $c_j$ of $\tilde{X}_j^{in}$ and bound the distance $\|m_j-c_j\|$. We split this distance into three parts: $\|m_j-m_j^{in}\|$, $\|m_j^{in}-\tilde{m}_j^{in}\|$, and $\|\tilde{m}_j^{in}-c_j\|$, where $m_j^{in}=c(X_j^{in})$ and $\tilde{m}_j^{in}=c(\tilde{X}_j^{in})$. We first bound the distance between $m_j$ and $m_j^{in}$.
Lemma 7.
$\|m_j-m_j^{in}\|\le\sqrt{\frac{\epsilon}{48}}\,\sigma_j$.
Proof. 
Since $|X_j|=\beta_j n$ and $|X_j^{out}|\le\frac{\epsilon}{49}\beta_j n$, the proportion of $X_j^{in}$ in $X_j$ is at least $1-\frac{\epsilon}{49}$. By Lemma 3, $\|m_j-m_j^{in}\|\le\sqrt{\frac{\epsilon/49}{1-\epsilon/49}}\,\sigma_j\le\sqrt{\frac{\epsilon}{48}}\,\sigma_j$. □
Lemma 8.
$\|m_j^{in}-\tilde{m}_j^{in}\|\le r_j$.
Proof. 
Since $m_j^{in}=\frac{1}{|X_j^{in}|}\sum_{X\in X_j^{in}}\int_{\mathbb{R}^d}s\,f_X(s)\,ds$ and $\tilde{m}_j^{in}=\frac{1}{|X_j^{in}|}\sum_{X\in X_j^{in}}\tilde{X}$, we obtain:

$$\begin{aligned} \|m_j^{in}-\tilde{m}_j^{in}\| &= \Big\|\frac{1}{|X_j^{in}|}\sum_{X\in X_j^{in}}\int_{\mathbb{R}^d}s\,f_X(s)\,ds-\frac{1}{|X_j^{in}|}\sum_{X\in X_j^{in}}\tilde{X}\Big\| \\ &= \frac{1}{|X_j^{in}|}\Big\|\sum_{X\in X_j^{in}}\int_{\mathbb{R}^d}(s-\tilde{X})f_X(s)\,ds\Big\| \\ &\le \frac{1}{|X_j^{in}|}\sum_{X\in X_j^{in}}\int_{\mathbb{R}^d}\|s-\tilde{X}\|\,f_X(s)\,ds \\ &\le \frac{1}{|X_j^{in}|}\sum_{X\in X_j^{in}}r_j \\ &= r_j. \end{aligned}$$

□
Lemma 9.
$f^2(\tilde{m}_j^{in},\tilde{X}_j^{in})\le 2|X_j^{in}|r_j^2+2f^2(m_j,X_j^{in})-|X_j^{in}|\,\|m_j-\tilde{m}_j^{in}\|^2$.
Proof. 
Since $|\tilde{X}_j^{in}|=|X_j^{in}|$, by Lemma 1 we have $f^2(m_j,\tilde{X}_j^{in})=f^2(\tilde{m}_j^{in},\tilde{X}_j^{in})+|X_j^{in}|\,\|\tilde{m}_j^{in}-m_j\|^2$. Then,

$$\begin{aligned} f^2(\tilde{m}_j^{in},\tilde{X}_j^{in}) &= f^2(m_j,\tilde{X}_j^{in})-|X_j^{in}|\,\|\tilde{m}_j^{in}-m_j\|^2 \\ &= \sum_{X\in X_j^{in}}\|\tilde{X}-m_j\|^2-|X_j^{in}|\,\|m_j-\tilde{m}_j^{in}\|^2 \\ &= \sum_{X\in X_j^{in}}\int_{\mathbb{R}^d}\|\tilde{X}-m_j\|^2 f_X(s)\,ds-|X_j^{in}|\,\|m_j-\tilde{m}_j^{in}\|^2 \\ &= \sum_{X\in X_j^{in}}\int_{\mathbb{R}^d}\|\tilde{X}-s+s-m_j\|^2 f_X(s)\,ds-|X_j^{in}|\,\|m_j-\tilde{m}_j^{in}\|^2 \\ &\le \sum_{X\in X_j^{in}}\int_{\mathbb{R}^d}\big(2\|\tilde{X}-s\|^2+2\|s-m_j\|^2\big)f_X(s)\,ds-|X_j^{in}|\,\|m_j-\tilde{m}_j^{in}\|^2 \\ &\le 2|X_j^{in}|r_j^2+2\sum_{X\in X_j^{in}}\int_{\mathbb{R}^d}\|s-m_j\|^2 f_X(s)\,ds-|X_j^{in}|\,\|m_j-\tilde{m}_j^{in}\|^2 \\ &= 2|X_j^{in}|r_j^2+2f^2(m_j,X_j^{in})-|X_j^{in}|\,\|m_j-\tilde{m}_j^{in}\|^2. \end{aligned}$$

□
Lemma 10.
In the process of finding $c_j$ in our algorithm cMeans, for the set $S_2$ in step 5, a subset $S^*$ of size $\frac{6}{\epsilon}$ of $S_2$ can be obtained such that all random variables in $S^*$ are from $\tilde{X}_j^{in}$. Let $c_j$ be the centroid of $S^*$. Then, the inequality $\|\tilde{m}_j^{in}-c_j\|^2\le\frac{2}{5}\epsilon r_j^2+\frac{49}{120}\epsilon\sigma_j^2-\frac{1}{5}\epsilon\|m_j-\tilde{m}_j^{in}\|^2$ holds with a probability of at least $1/6$.
Proof. 
For each point $p\in C_{j-1}$, $\frac{6}{\epsilon}$ copies of $p$ are added to $S_2$ in step 9 of our algorithm cMeans. Thus, a subset $S^*$ of size $\frac{6}{\epsilon}$ of $S_2$ can be obtained such that all random variables in $S^*$ are from $\tilde{X}_j^{in}$. Let $\delta=5/6$. Since $|S^*|=\frac{6}{\epsilon}$, by Lemma 2, $\|\tilde{m}_j^{in}-c_j\|^2\le\frac{\epsilon}{5}\cdot\frac{f^2(\tilde{m}_j^{in},\tilde{X}_j^{in})}{|X_j^{in}|}$ holds with a probability of at least $1/6$. Assume that this inequality holds. Then,

$$\begin{aligned} \|\tilde{m}_j^{in}-c_j\|^2 &\le \frac{\epsilon}{5}\cdot\frac{f^2(\tilde{m}_j^{in},\tilde{X}_j^{in})}{|X_j^{in}|} \\ &\le \frac{\epsilon}{5}\cdot\frac{2|X_j^{in}|r_j^2+2f^2(m_j,X_j^{in})-|X_j^{in}|\,\|m_j-\tilde{m}_j^{in}\|^2}{|X_j^{in}|} \\ &= \frac{2}{5}\epsilon r_j^2+\frac{2}{5}\epsilon\cdot\frac{f^2(m_j,X_j^{in})}{|X_j^{in}|}-\frac{1}{5}\epsilon\|m_j-\tilde{m}_j^{in}\|^2 \\ &\le \frac{2}{5}\epsilon r_j^2+\frac{2}{5}\epsilon\cdot\frac{f^2(m_j,X_j)}{|X_j|-|X_j^{out}|}-\frac{1}{5}\epsilon\|m_j-\tilde{m}_j^{in}\|^2 \\ &\le \frac{2}{5}\epsilon r_j^2+\frac{2}{5}\epsilon\cdot\frac{\beta_j n\sigma_j^2}{(1-\epsilon/49)\beta_j n}-\frac{1}{5}\epsilon\|m_j-\tilde{m}_j^{in}\|^2 \\ &\le \frac{2}{5}\epsilon r_j^2+\frac{49}{120}\epsilon\sigma_j^2-\frac{1}{5}\epsilon\|m_j-\tilde{m}_j^{in}\|^2. \end{aligned}$$

□
Lemma 11.
If $c_j$ satisfies $\|\tilde{m}_j^{in}-c_j\|^2\le\frac{2}{5}\epsilon r_j^2+\frac{49}{120}\epsilon\sigma_j^2-\frac{1}{5}\epsilon\|m_j-\tilde{m}_j^{in}\|^2$, then $\|m_j-c_j\|^2\le\frac{9}{10}\epsilon\sigma_j^2+\frac{\epsilon}{10\beta_j k}\sigma_{opt}^2$.
Proof. 
Assume that $c_j$ satisfies $\|\tilde{m}_j^{in}-c_j\|^2\le\frac{2}{5}\epsilon r_j^2+\frac{49}{120}\epsilon\sigma_j^2-\frac{1}{5}\epsilon\|m_j-\tilde{m}_j^{in}\|^2$. Then,

$$\begin{aligned} \|m_j-c_j\|^2 &= \|m_j-\tilde{m}_j^{in}+\tilde{m}_j^{in}-c_j\|^2 \\ &\le 2\|m_j-\tilde{m}_j^{in}\|^2+2\|\tilde{m}_j^{in}-c_j\|^2 \\ &\le \Big(2-\frac{2}{5}\epsilon\Big)\|m_j-\tilde{m}_j^{in}\|^2+\frac{4}{5}\epsilon r_j^2+\frac{49}{60}\epsilon\sigma_j^2 \\ &= \Big(2-\frac{2}{5}\epsilon\Big)\|m_j-m_j^{in}+m_j^{in}-\tilde{m}_j^{in}\|^2+\frac{4}{5}\epsilon r_j^2+\frac{49}{60}\epsilon\sigma_j^2 \\ &\le \Big(2-\frac{2}{5}\epsilon\Big)\big(2\|m_j-m_j^{in}\|^2+2\|m_j^{in}-\tilde{m}_j^{in}\|^2\big)+\frac{4}{5}\epsilon r_j^2+\frac{49}{60}\epsilon\sigma_j^2 \\ &\le \Big(2-\frac{2}{5}\epsilon\Big)\Big(\frac{1}{24}\epsilon\sigma_j^2+2r_j^2\Big)+\frac{4}{5}\epsilon r_j^2+\frac{49}{60}\epsilon\sigma_j^2 \\ &\le \frac{9}{10}\epsilon\sigma_j^2+4r_j^2 \\ &= \frac{9}{10}\epsilon\sigma_j^2+\frac{\epsilon}{10\beta_j k}\sigma_{opt}^2. \end{aligned}$$

□

5.2. Analysis for Case 2: $|X_j^{out}|>\frac{\epsilon}{49}\beta_j n$

Let $\tilde{X}_j=\tilde{X}_j^{in}\cup X_j^{out}$, and let $\tilde{m}_j$ denote the centroid of $\tilde{X}_j$. Our idea is to replace the center of $X_j$ with the center of $\tilde{X}_j$. However, it is difficult to seek out the center of $\tilde{X}_j$ exactly, so we instead find an approximate center $c_j$ of $\tilde{X}_j$.
Lemma 12.
$\frac{|X_j^{out}|}{|\mathcal{X}\setminus B_j|}\ge\frac{\epsilon^2}{3969k}$.
Proof. 
$$\begin{aligned} \frac{|X_j^{out}|}{|\mathcal{X}\setminus B_j|} &= \frac{|X_j^{out}|}{\sum_{i=1}^{j-1}|X_i\setminus B_j|+|X_j^{out}|+\sum_{i=j+1}^{k}|X_i\setminus B_j|} \\ &\ge \frac{|X_j^{out}|}{\sum_{i=1}^{j-1}\frac{f^2(c_i,X_i)}{r_j^2}+|X_j^{out}|+\sum_{i=j+1}^{k}|X_i|} \\ &= \frac{|X_j^{out}|}{\sum_{i=1}^{j-1}\frac{f^2(m_i,X_i)+|X_i|\,\|m_i-c_i\|^2}{r_j^2}+|X_j^{out}|+\sum_{i=j+1}^{k}|X_i|} \\ &\ge \frac{|X_j^{out}|}{\frac{(1+\epsilon)n\sigma_{opt}^2}{r_j^2}+|X_j^{out}|+\sum_{i=j+1}^{k}|X_i|} \\ &\ge \frac{|X_j^{out}|}{\frac{40(1+\epsilon)k\beta_j n}{\epsilon}+|X_j^{out}|+(k-j)\beta_j n} \\ &\ge \frac{\frac{\epsilon}{49}\beta_j n}{\frac{40(1+\epsilon)k\beta_j n}{\epsilon}+\frac{\epsilon}{49}\beta_j n+(k-j)\beta_j n} \\ &= \frac{\epsilon^2}{49\cdot 40(1+\epsilon)k+\epsilon^2+49(k-j)\epsilon} \\ &\ge \frac{\epsilon^2}{3969k}. \end{aligned}$$

Here, the second line holds because every $X\in X_i\setminus B_j$ ($i<j$) satisfies $dist(X,C_{j-1})>r_j$, so $|X_i\setminus B_j|\le\frac{f^2(c_i,X_i)}{r_j^2}$; the third applies Lemma 1; the fourth uses the induction hypothesis on $c_1,\dots,c_{j-1}$; and the last inequality uses $\epsilon\le 1$ and $j\ge 1$, so that $49\cdot 40(1+\epsilon)k\le 3920k$ and $\epsilon^2+49(k-j)\epsilon\le 49k$. □
Lemma 13.
$\|m_j-\tilde{m}_j\|\le r_j$.
Proof. 
$$\begin{aligned} \|m_j-\tilde{m}_j\| &= \Big\|\frac{1}{|X_j|}\sum_{X\in X_j}\int_{\mathbb{R}^d}s\,f_X(s)\,ds-\frac{1}{|X_j|}\Big(\sum_{X\in X_j^{in}}\tilde{X}+\sum_{X\in X_j^{out}}\int_{\mathbb{R}^d}s\,f_X(s)\,ds\Big)\Big\| \\ &= \frac{1}{|X_j|}\Big\|\sum_{X\in X_j^{in}}\int_{\mathbb{R}^d}(s-\tilde{X})f_X(s)\,ds\Big\| \\ &\le \frac{1}{|X_j|}\sum_{X\in X_j^{in}}\int_{\mathbb{R}^d}\|s-\tilde{X}\|\,f_X(s)\,ds \\ &\le \frac{1}{|X_j|}\sum_{X\in X_j^{in}}r_j \\ &= \frac{|X_j^{in}|}{|X_j|}\,r_j \\ &\le r_j. \end{aligned}$$

□
Lemma 14.
$f^2(\tilde{m}_j,\tilde{X}_j)\le 2f^2(m_j,X_j)+4\beta_j n r_j^2$.
Proof. 
$$\begin{aligned} f^2(\tilde{m}_j,\tilde{X}_j) &= \sum_{X\in X_j^{in}}\|\tilde{X}-\tilde{m}_j\|^2+\sum_{X\in X_j^{out}}\int_{\mathbb{R}^d}\|s-\tilde{m}_j\|^2 f_X(s)\,ds \\ &= \sum_{X\in X_j^{in}}\int_{\mathbb{R}^d}\|\tilde{X}-\tilde{m}_j\|^2 f_X(s)\,ds+\sum_{X\in X_j^{out}}\int_{\mathbb{R}^d}\|s-\tilde{m}_j\|^2 f_X(s)\,ds \\ &= \sum_{X\in X_j^{in}}\int_{\mathbb{R}^d}\|\tilde{X}-s+s-\tilde{m}_j\|^2 f_X(s)\,ds+\sum_{X\in X_j^{out}}\int_{\mathbb{R}^d}\|s-\tilde{m}_j\|^2 f_X(s)\,ds \\ &\le \sum_{X\in X_j^{in}}\int_{\mathbb{R}^d}\big(2\|\tilde{X}-s\|^2+2\|s-\tilde{m}_j\|^2\big)f_X(s)\,ds+\sum_{X\in X_j^{out}}\int_{\mathbb{R}^d}\|s-\tilde{m}_j\|^2 f_X(s)\,ds \\ &\le 2\sum_{X\in X_j^{in}}\int_{\mathbb{R}^d}\|\tilde{X}-s\|^2 f_X(s)\,ds+2\sum_{X\in X_j}\int_{\mathbb{R}^d}\|s-\tilde{m}_j\|^2 f_X(s)\,ds \\ &\le 2|X_j^{in}|r_j^2+2f^2(\tilde{m}_j,X_j) \\ &= 2|X_j^{in}|r_j^2+2f^2(m_j,X_j)+2|X_j|\,\|m_j-\tilde{m}_j\|^2 \\ &\le 2f^2(m_j,X_j)+4\beta_j n r_j^2. \end{aligned}$$

The second-to-last equality follows from Lemma 1, and the last inequality uses Lemma 13 together with $|X_j^{in}|\le|X_j|=\beta_j n$. □
Lemma 15.
In the process of finding $c_j$ in our algorithm cMeans, we assume that $Q$ satisfies $\mathcal{X}\setminus B_j\subseteq Q$ and $|Q|<2|\mathcal{X}\setminus B_j|$. For the set $S_2$ in step 5, a subset $S^*$ of size $\frac{6}{\epsilon}$ of $S_2$ can be obtained such that all random variables in $S^*$ are from $\tilde{X}_j$, with a probability of at least $1/2$. Let $c_j$ denote the centroid of $S^*$. Then, the inequality $\|\tilde{m}_j-c_j\|^2\le\frac{4}{5}\epsilon r_j^2+\frac{2}{5}\epsilon\sigma_j^2$ holds with a probability of at least $1/6$.
Proof. 
In our algorithm cMeans, we assume that $S_1=\{S_1',\dots,S_N'\}$, where $N=\frac{79380k}{\epsilon^3}$. Let $x_1,\dots,x_N$ be the corresponding indicator variables of the elements in $S_1$: if $S_i'\in X_j^{out}$, then $x_i=1$; otherwise $x_i=0$. By Lemma 12 and $|Q|<2|\mathcal{X}\setminus B_j|$, we have $Pr[S_i'\in X_j^{out}]\ge\frac{\epsilon^2}{7938k}$. Let $x=\sum_{i=1}^{N}x_i$ and $u=\sum_{i=1}^{N}E(x_i)$, so that $u\ge\frac{10}{\epsilon}$, and

$$\begin{aligned} Pr\Big[x>\frac{6}{\epsilon}\Big] &= 1-Pr\Big[x\le\frac{6}{\epsilon}\Big] \\ &\ge 1-Pr\Big[x\le\frac{3}{5}\,u\Big] \\ &\ge 1-e^{-\frac{1}{2}(1-\frac{3}{5})^2 u} \\ &\ge 1-e^{-\frac{1}{2}(1-\frac{3}{5})^2\cdot\frac{10}{\epsilon}} \\ &\ge 1-e^{-\frac{4}{5}} \\ &\ge \frac{1}{2}. \end{aligned}$$

Then, the probability that at least $\frac{6}{\epsilon}$ random variables in $S_1$ are from $X_j^{out}$ is at least $1/2$. Since $S_2=S_1\cup\{\frac{6}{\epsilon}$ copies of each point in $C\}$, a subset $S^*$ of size $\frac{6}{\epsilon}$ of $S_2$ can be obtained such that all its random variables are from $\tilde{X}_j$, with a probability of at least $1/2$. Let $c_j$ denote the centroid of $S^*$ and let $\delta=5/6$. Since $|S^*|=\frac{6}{\epsilon}$ and $|\tilde{X}_j|=|X_j|$, by Lemma 2, $\|\tilde{m}_j-c_j\|^2\le\frac{\epsilon}{5}\cdot\frac{f^2(\tilde{m}_j,\tilde{X}_j)}{|\tilde{X}_j|}=\frac{\epsilon}{5}\cdot\frac{f^2(\tilde{m}_j,\tilde{X}_j)}{|X_j|}$ holds with a probability of at least $1/6$. Assume this inequality holds. Then,

$$\|\tilde{m}_j-c_j\|^2\le\frac{\epsilon}{5}\cdot\frac{f^2(\tilde{m}_j,\tilde{X}_j)}{|X_j|}\le\frac{\epsilon}{5}\cdot\frac{2f^2(m_j,X_j)+4\beta_j n r_j^2}{|X_j|}\le\frac{2}{5}\epsilon\sigma_j^2+\frac{4}{5}\epsilon r_j^2.$$

□
Lemma 16.
If $c_j$ satisfies $\|\tilde{m}_j-c_j\|^2\le\frac{4}{5}\epsilon r_j^2+\frac{2}{5}\epsilon\sigma_j^2$, then $\|m_j-c_j\|^2\le\frac{9}{10}\epsilon\sigma_j^2+\frac{\epsilon}{10\beta_j k}\sigma_{opt}^2$.
Proof. 
Assume that $c_j$ satisfies $\|\tilde{m}_j-c_j\|^2\le\frac{4}{5}\epsilon r_j^2+\frac{2}{5}\epsilon\sigma_j^2$. Then,

$$\begin{aligned} \|m_j-c_j\|^2 &= \|m_j-\tilde{m}_j+\tilde{m}_j-c_j\|^2 \\ &\le 2\|m_j-\tilde{m}_j\|^2+2\|\tilde{m}_j-c_j\|^2 \\ &\le 2r_j^2+\frac{8}{5}\epsilon r_j^2+\frac{4}{5}\epsilon\sigma_j^2 \\ &= \frac{4}{5}\epsilon\sigma_j^2+\Big(2+\frac{8}{5}\epsilon\Big)r_j^2 \\ &\le \frac{9}{10}\epsilon\sigma_j^2+\frac{\epsilon}{10\beta_j k}\sigma_{opt}^2. \end{aligned}$$

□
Lemma 17.
Given an instance $(\mathcal{X},k,\mathbb{L})$ of the uncertain constrained k-means problem, where the size of $\mathcal{X}$ is $n$, for $\epsilon\in(0,1]$ and $k\ge 2$, assume that our algorithm cMeans($\epsilon$, $\mathcal{X}$, $k$, $k$, $C$, $U$) (with $C$ and $U$ initialized as empty sets) outputs a collection $U$ of candidate sets of approximate centers. If there exists a set $C_k=\{c_1,\dots,c_k\}$ in $U$ satisfying $\|m_j-c_j\|^2\le\frac{9}{10}\epsilon\sigma_j^2+\frac{\epsilon}{10\beta_j k}\sigma_{opt}^2$ for $1\le j\le k$, then $C_k$ is a $(1+\epsilon)$-approximation for the uncertain constrained k-means problem.
Proof. 
Assume that $C_k=\{c_1,\dots,c_k\}$ is a set in $U$ satisfying $\|m_j-c_j\|^2\le\frac{9}{10}\epsilon\sigma_j^2+\frac{\epsilon}{10\beta_j k}\sigma_{opt}^2$ for $1\le j\le k$. Then,

$$\begin{aligned} \sum_{j=1}^{k}f^2(c_j,X_j) &= \sum_{j=1}^{k}\big(f^2(m_j,X_j)+|X_j|\,\|m_j-c_j\|^2\big) \\ &\le \sum_{j=1}^{k}\Big(f^2(m_j,X_j)+\beta_j n\Big(\frac{9}{10}\epsilon\sigma_j^2+\frac{\epsilon}{10\beta_j k}\sigma_{opt}^2\Big)\Big) \\ &= \sum_{j=1}^{k}f^2(m_j,X_j)+\frac{9}{10}\epsilon n\sum_{j=1}^{k}\beta_j\sigma_j^2+\frac{1}{10}\epsilon n\sigma_{opt}^2 \\ &\le \sum_{j=1}^{k}f^2(m_j,X_j)+\frac{9}{10}\epsilon n\sigma_{opt}^2+\frac{1}{10}\epsilon n\sigma_{opt}^2 \\ &= (1+\epsilon)\cdot OPT_k(\mathcal{X}). \end{aligned}$$

□

5.3. Time Complexity Analysis

We analyze the time complexity for our algorithm cMeans in this section.
Lemma 18.
The time complexity of our algorithm cMeans is $O\big(4^k\big(\frac{13231ek}{\epsilon^2}\big)^{6k/\epsilon}\frac{1}{\epsilon}nd\big)$.
Proof. 
Let $a=\binom{N+kM}{M}$, where $N=\frac{79380k}{\epsilon^3}$ and $M=\frac{6}{\epsilon}$. By the Stirling formula,

$$\binom{N+kM}{M}\le\frac{(N+kM)^M}{M!}\le O\Big(\Big(\frac{e(N+kM)}{M}\Big)^M\Big)=O\Big(\Big(\frac{13231ek}{\epsilon^2}\Big)^{6/\epsilon}\Big).$$
In our algorithm cMeans, steps 5–9 run in $O(k/\epsilon^3)$ time, step 11 runs in $O(d/\epsilon)$ time, and steps 13–16 run in $O(knd)$ time. Let $T(n,g)$ denote the time complexity of algorithm cMeans, where $g$ is the number of cluster centers still to be found and $n$ is the size of $Q$.
If $g=0$, then $T(n,0)=O(1)$. When $n=1$, $T(1,g)=a\big(T(1,g-1)+O(d/\epsilon)\big)+O(k/\epsilon^3)$. Because $a>k/\epsilon^3$, $T(1,g)=a\big(T(1,g-1)+O(d/\epsilon)\big)\le a^g\cdot T(1,0)+g\cdot a^g\cdot O(d/\epsilon)=O(g\cdot a^g\cdot d/\epsilon)$. Therefore, $T(1,g)\le O\big(4^g\big(\frac{13231ek}{\epsilon^2}\big)^{6g/\epsilon}\frac{1}{\epsilon}d\big)$, where $e\approx 2.7183$.
For n 2 and g 1 , the recurrence of T ( n , g ) could be obtained as follows:
$$T(n,g)=a\cdot T(n,g-1)+T\Big(\frac{n}{2},g\Big)+a\cdot O\Big(\frac{d}{\epsilon}\Big)+O\Big(\frac{k}{\epsilon^3}\Big)+O(knd).$$
Because a > k / ϵ 3 , two constants b 1 and b 2 with b 1 1 and b 2 1 could be obtained to arrive at the following recurrence.
$$T(n,g)\le a\cdot T(n,g-1)+T\Big(\frac{n}{2},g\Big)+a\cdot b_1\cdot\frac{d}{\epsilon}+b_2\cdot knd.$$
Now we claim that $T(n,g)\le b_1\cdot b_2\cdot\frac{1}{\epsilon}\cdot a^g\cdot 2^{2g}\cdot nd-b_1\cdot\frac{d}{\epsilon}$. If $g=0$, then $T(n,0)=O(1)$. If $g\ge 1$ and $n=1$, then $T(1,g)\le O\big(4^g\big(\frac{13231ek}{\epsilon^2}\big)^{6g/\epsilon}\frac{1}{\epsilon}d\big)$, and the claim holds. Suppose that the claim holds for $T(n_1,g_1)$ whenever $n_1\ge 0$ and $g_1<g$, and for $T(n_2,g_2)$ whenever $0<n_2<n$ and $g_2\le g$. We need to prove that:
$$b_1 b_2\frac{1}{\epsilon}a^g 2^{2g}nd-b_1\frac{d}{\epsilon}\ \ge\ a\Big(b_1 b_2\frac{1}{\epsilon}a^{g-1}2^{2(g-1)}nd-b_1\frac{d}{\epsilon}\Big)+b_1 b_2\frac{1}{\epsilon}a^g 2^{2g}\frac{n}{2}d-b_1\frac{d}{\epsilon}+a\cdot b_1\frac{d}{\epsilon}+b_2 knd.$$
The above inequality can be simplified to $\frac{1}{4\epsilon}\cdot b_1\cdot a^g 2^{2g}\ge k$, which holds for $g\ge 1$. For $a=O\big(\big(\frac{13231ek}{\epsilon^2}\big)^{6/\epsilon}\big)$, we obtain $T(n,k)=O\big(4^k\big(\frac{13231ek}{\epsilon^2}\big)^{6k/\epsilon}\frac{1}{\epsilon}nd\big)$. □
Thus, we can obtain the following Theorem 2.
Theorem 2.
Given an instance $(\mathcal{X},k,\mathbb{L})$ of the uncertain constrained k-means problem, where the size of $\mathcal{X}$ is $n$, for $\epsilon\in(0,1]$ and $k\ge 2$, our algorithm cMeans($\epsilon$, $\mathcal{X}$, $k$, $k$, $C$, $U$) outputs a collection $U$ of candidate sets of approximate centers such that, with a probability of at least $1/12^k$, $U$ contains at least one candidate set that is a $(1+\epsilon)$-approximation for the uncertain constrained k-means problem. The time complexity of cMeans is $O\big(4^k\big(\frac{13231ek}{\epsilon^2}\big)^{6k/\epsilon}\frac{1}{\epsilon}nd\big)$.

6. Conclusions

In this paper, we first defined the uncertain constrained k-means problem and then presented a stochastic approximate algorithm for it in detail. We proposed a general mathematical model of the uncertain constrained k-means problem and studied its random sampling properties, which are essential for dealing with this problem. By applying a random sampling technique, we obtained a $(1+\epsilon)$-approximate algorithm for the problem. We then investigated the correctness, success probability, and time complexity of our algorithm cMeans, whose running time is $O\big(4^k\big(\frac{13231ek}{\epsilon^2}\big)^{6k/\epsilon}\frac{1}{\epsilon}nd\big)$. However, there still exists a big gap between the current algorithms for the uncertain constrained k-means problem and truly practical algorithms for it, as was similarly noted in [13].
We will try to explore a more practical algorithm for the uncertain constrained k-means problem in the future. It is known that the 2-means problem is the smallest version of the k-means problem and remains NP-hard, and approximation schemes for the 2-means problem can be generalized to solve the k-means problem. Due to the particularity of the uncertain constrained 2-means problem, we will study approximation schemes for it and use them to reduce the complexity of approximation schemes for the uncertain constrained k-means problem. Additionally, we will apply the proposed algorithm to some practical problems in the future.

Author Contributions

J.L. and J.T. contributed to supervision, methodology, validation and project administration. B.X. and X.T. contributed to review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the Science and Technology Foundation of Guizhou Province ([2021]015), in part by the Open Fund of Guizhou Provincial Public Big Data Key Laboratory (2017BDKFJJ019), in part by the Guizhou University Foundation for the introduction of talent ((2016) No. 13), in part by the GuangDong Basic and Applied Basic Research Foundation (No. 2020A1515110554), and in part by the Science and Technology Program of Guangzhou (No. 202002030138), China.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Feldman, D.; Monemizadeh, M.; Sohler, C. A PTAS for k-means clustering based on weak coresets. In Proceedings of the 23rd ACM Symposium on Computational Geometry, SoCG, Gyeongju, Korea, 6–8 June 2007; pp. 11–18.
  2. Ostrovsky, R.; Rabani, Y.; Schulman, L.J.; Swamy, C. The effectiveness of Lloyd-type methods for the k-means problem. J. ACM 2012, 59, 28:1–28:22.
  3. Jaiswal, R.; Kumar, A.; Sen, S. A simple D²-sampling based PTAS for k-means and other clustering problems. Algorithmica 2014, 71, 22–46.
  4. Arkin, E.M.; Diaz-Banez, J.M.; Hurtado, F.; Kumar, P.; Mitchell, J.S.; Palop, B.; Perez-Lantero, P.; Saumell, M.; Silveira, R.I. Bichromatic 2-center of pairs of points. Comput. Geom. 2015, 48, 94–107.
  5. Khuller, S.; Sussmann, Y.J. The capacitated k-center problem. SIAM J. Discrete Math. 2000, 13, 403–418.
  6. Har-Peled, S.; Raichel, B. Net and prune: A linear time algorithm for Euclidean distance problems. J. ACM 2015, 62, 4401–4435.
  7. Swamy, C.; Shmoys, D.B. Fault-tolerant facility location. ACM Trans. Algorithms 2008, 4, 1–27.
  8. Xu, G.; Xu, J. Efficient approximation algorithms for clustering point-sets. Comput. Geom. 2010, 43, 59–66.
  9. Valls, A.; Batet, M.; Lopez, E.M. Using expert's rules as background knowledge in the ClusDM methodology. Eur. J. Oper. Res. 2009, 195, 864–875.
  10. Li, J.; Yi, K.; Zhang, Q. Clustering with diversity. In Proceedings of the 37th International Colloquium on Automata, Languages and Programming, ICALP, Bordeaux, France, 6–10 July 2010; pp. 188–200.
  11. Ding, H.; Xu, J. A unified framework for clustering constrained data without locality property. In Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, San Diego, CA, USA, 4–6 January 2015; pp. 1471–1490.
  12. Bhattacharya, A.; Jaiswal, R.; Kumar, A. Faster algorithms for the constrained k-means problem. Theory Comput. Syst. 2018, 62, 93–115.
  13. Feng, Q.; Hu, J.; Huang, N.; Wang, J. Improved PTAS for the constrained k-means problem. J. Comb. Optim. 2019, 37, 1091–1110.
  14. Hoeffding, W. Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 1963, 58, 13–30.
Figure 1. Flow chart of our algorithm cMeans.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
