# Improved Constrained k-Means Algorithm for Clustering with Domain Knowledge

## Abstract

## 1. Introduction

- **Must-link**: constraints specifying that two sample points must be in the same cluster.
- **Cannot-link**: constraints specifying a set of sample points, no two of which may be placed in the same cluster, i.e., each point of the set must be placed in a distinct cluster.
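To make the two constraint types concrete, the following minimal sketch checks whether a given cluster labelling respects them. The function name `satisfies_constraints`, the dictionary-based labelling, and the point identifiers are assumptions made for illustration only; they do not come from the paper.

```python
def satisfies_constraints(labels, must_links, cannot_link_sets):
    """Return True if the labelling respects every constraint.

    labels: dict mapping a point id to its cluster index (hypothetical layout)
    must_links: iterable of (p, q) pairs that must share a cluster
    cannot_link_sets: iterable of sets whose members must all lie
        in pairwise distinct clusters
    """
    # Must-link: both endpoints carry the same cluster index.
    for p, q in must_links:
        if labels[p] != labels[q]:
            return False
    # Cannot-link: no cluster index may repeat within a set.
    for group in cannot_link_sets:
        used = [labels[p] for p in group]
        if len(set(used)) != len(used):
            return False
    return True


labels = {"x1": 0, "x2": 0, "x3": 1, "x4": 2}
ok = satisfies_constraints(labels, [("x1", "x2")], [{"x2", "x3", "x4"}])
bad = satisfies_constraints(labels, [("x1", "x3")], [])
```

In this toy labelling, `ok` holds because `x1` and `x2` share cluster 0 and the cannot-link set occupies three different clusters, while `bad` fails because `x1` and `x3` lie in different clusters.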

**Definition 1.**

#### 1.1. Related Work

#### 1.2. Our Results

- Propose a framework to incorporate must-link and cannot-link constraints into the k-means++ algorithm;
- Devise a method that clusters the points of each cannot-link set by employing minimum weight matching, and that merges the data points confined by must-links into a single point;
- Carry out experiments on the UCI datasets to evaluate the practical performance of the proposed algorithms, demonstrating that our algorithms outperform the previous algorithm at a rate of 65% in terms of accuracy.
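The idea of merging must-linked points into a single point can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it assumes must-links are given as pairs, treats them as transitive, groups them with a small union-find structure, and represents each group by its centroid. The name `merge_must_links` and the data layout are hypothetical.

```python
def merge_must_links(points, must_links):
    """Group points connected by must-link pairs and replace each group
    by its centroid.

    points: dict mapping a point id to a tuple of coordinates
    must_links: iterable of (p, q) id pairs
    Returns a list of (member_ids, centroid); unconstrained points
    appear as singleton groups.
    """
    parent = {p: p for p in points}

    def find(p):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Union the two endpoints of every must-link pair.
    for p, q in must_links:
        parent[find(p)] = find(q)

    # Collect the members of each connected component.
    groups = {}
    for p in points:
        groups.setdefault(find(p), []).append(p)

    merged = []
    for members in groups.values():
        coords = [points[m] for m in members]
        centroid = tuple(sum(axis) / len(coords) for axis in zip(*coords))
        merged.append((sorted(members), centroid))
    return merged


points = {"a": (0.0, 0.0), "b": (2.0, 0.0), "c": (4.0, 6.0), "d": (10.0, 10.0)}
merged = merge_must_links(points, [("a", "b"), ("b", "c")])
```

Here the pairs `(a, b)` and `(b, c)` collapse into one group `{a, b, c}` with centroid `(2.0, 2.0)`, while `d` stays a singleton.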

#### 1.3. Organization

## 2. Preliminaries and Problem Statement

**Definition 2.**

**Lemma 1.**

## 3. Constrained k-Means Clustering Algorithm with Incidental Information

**Algorithm 1.** Constrained k-means clustering using domain information.

**Input:** A data set $V=\{x_1, x_2, \dots, x_n\}$, must-link constraints $Con_{=}\subseteq V\times V$, cannot-link constraints $Con_{\ne}\subseteq V\times V$, and a positive integer $k$.

**Output:** A collection of $k$ clusters $\mathcal{C}=\{C_1, C_2, \dots, C_k\}$.

1. Use the k-center algorithm to select $c_1, c_2, \dots, c_k$ as the initial cluster centers;
2. Set $C_i := \varnothing$ for $1\le i\le k$;
3. **While** the cluster centers change **do**
4. **For** each set of must-link constraints **do**
5. Compute the mass center of the set;
6. Assign all samples of the set to the cluster whose center $c_i$ is nearest to the mass center; /* Adding all samples of the set to $C_i$ */
7. **EndFor**
8. **For** $i=1,2,\dots,n$ and $j=1,2,\dots,k$ **do**
9. Compute the distance $d_{ij}$ between point $x_i$ and cluster center $c_j$;
10. **EndFor**
11. Assign the remaining points to their nearest $c_i$, except those incident to cannot-links;
12. **For** each set of cannot-link constraints **do**
13. Assign each point to the appropriate cluster via minimum weight matching;
14. **EndFor**
15. Compute the cluster center $c_i$ of each $C_i$;
16. **EndWhile**
17. Return $\mathcal{C}=\{C_1, C_2, \dots, C_k\}$.
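The loop of Algorithm 1 can be sketched in plain Python as below. This is a simplified illustration under several assumptions that are not taken from the paper: the initial centers are passed in by the caller instead of being chosen by the k-center algorithm, must-link and cannot-link constraints are given as disjoint lists of point indices, squared Euclidean distance is used as the matching weight, and the minimum weight matching is found by brute-force enumeration, which is only practical for small sets. All function names are hypothetical.

```python
from itertools import permutations


def dist2(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(pts):
    """Coordinate-wise mean (mass center) of a non-empty list of points."""
    return tuple(sum(axis) / len(pts) for axis in zip(*pts))


def nearest(p, centers):
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda j: dist2(p, centers[j]))


def min_weight_matching(group, points, centers):
    """Assign each point of a cannot-link set to a distinct center so that
    the total squared distance is minimal (brute force, assumption)."""
    if len(group) > len(centers):
        raise ValueError("cannot-link set is larger than the number of clusters")
    best_cost, best_perm = None, None
    for perm in permutations(range(len(centers)), len(group)):
        cost = sum(dist2(points[i], centers[j]) for i, j in zip(group, perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return dict(zip(group, best_perm))


def constrained_kmeans(points, must_sets, cannot_sets, centers, max_iter=100):
    centers = [tuple(c) for c in centers]
    labels = [None] * len(points)
    for _ in range(max_iter):
        constrained = set()
        # Steps 4-7: send each must-link set to the center nearest its mass center.
        for group in must_sets:
            j = nearest(mean([points[i] for i in group]), centers)
            for i in group:
                labels[i] = j
            constrained.update(group)
        # Steps 12-14: spread each cannot-link set over distinct clusters.
        for group in cannot_sets:
            for i, j in min_weight_matching(group, points, centers).items():
                labels[i] = j
            constrained.update(group)
        # Steps 8-11: unconstrained points go to their nearest center.
        for i, p in enumerate(points):
            if i not in constrained:
                labels[i] = nearest(p, centers)
        # Step 15: recompute centers; an empty cluster keeps its old center.
        new_centers = []
        for j in range(len(centers)):
            members = [points[i] for i in range(len(points)) if labels[i] == j]
            new_centers.append(mean(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0), (4.0, 1.0), (4.5, 1.0)]
labels, centers = constrained_kmeans(
    pts, must_sets=[[0, 1]], cannot_sets=[[4, 5]],
    centers=[(0.0, 0.0), (10.0, 0.0)],
)
```

In this toy run, points 4 and 5 are both closer to the first center, yet the cannot-link set forces them into different clusters, giving the labels `[0, 0, 1, 1, 0, 1]`. For larger cannot-link sets, the enumeration in `min_weight_matching` could be replaced by a polynomial-time assignment solver such as the Hungarian method.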

## 4. Experimental Evaluation

#### 4.1. Evaluation Approaches

#### 4.2. Experimental Dataset and Statistics Information

#### 4.3. Comparison of Practical Performance

#### 4.4. Comparison of Runtime

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

**Figure 1.** Examples of processing must-links and cannot-links. (**a**) Must-link examples: (**a1**) the original graph with the must-link set $\{x_1, x_2, x_3\}$ and the cluster centers $\mu_1$ and $\mu_2$; (**a2**) a possible solution produced by [20]; (**a3**) the solution of our algorithm. (**b**) A cannot-link example: (**b1**) a cannot-link set $\{x_1, x_2, x_3, x_4\}$ with the clustering centers $\{\mu_1, \mu_2, \mu_3, \mu_4, \mu_5\}$; (**b2**) an assignment of the points in the cannot-link set to the clusters.

**Figure 2.** An example for Definition 2, containing two types of sample point sets: $\{\mu_1, \mu_2, \mu_3\}$ and $\{x_1, x_2, x_3, x_4\}$.

**Figure 3.** An example of executing Algorithm 1: (**a**) the set of data points; (**b**) execution for must-links, where $c$ represents the must-link set $\{x_1, x_2, x_3\}$ as in (**c**); (**d**) the two centers selected; (**e**) the clustering results; (**f**) the calculated means according to the clustering.

The Number of Constraints | 100 | 200 | 300 | 400 | 500 |
---|---|---|---|---|---|
CM | 0.68669 | 0.97119 | 0.99399 | 1.00000 | 1.00000 |
TM | 0.73023 | 0.73023 | 0.73023 | 0.73023 | 0.73023 |
ICM | 0.78485 | 0.97860 | 0.99699 | 1.00000 | 1.00000 |

The Number of Constraints | 100 | 200 | 300 | 400 | 500 |
---|---|---|---|---|---|
CM | 0.44799 | 0.62021 | 0.87489 | 0.94606 | 0.99817 |
TM | 0.17284 | 0.17284 | 0.17284 | 0.17284 | 0.17284 |
ICM | 0.50131 | 0.69528 | 0.92172 | 0.96657 | 0.99908 |

The Number of Constraints | 100 | 300 | 500 | 700 | 900 |
---|---|---|---|---|---|
CM | 0.24879 | 0.35256 | 0.72374 | 0.88657 | 0.94483 |
TM | 0.13234 | 0.13234 | 0.13234 | 0.13234 | 0.13234 |
ICM | 0.26650 | 0.50679 | 0.73070 | 0.88045 | 0.95854 |

The Number of Constraints | 100 | 300 | 500 | 700 | 900 |
---|---|---|---|---|---|
CM | 0.05636 | 0.15859 | 0.73693 | 0.97761 | 0.99873 |
TM | −0.00268 | −0.00268 | −0.00268 | −0.00268 | −0.00268 |
ICM | 0.04795 | 0.42613 | 0.78812 | 0.98097 | 0.99970 |

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

