# Improving Clustering Accuracy of K-Means and Random Swap by an Evolutionary Technique Based on Careful Seeding


## Abstract


## 1. Introduction

- A more complete description of the evolutionary approach, which is the basis of the proposed clustering algorithms, is provided.
- PB-KM now includes a mutation operation in the second step of recombination.
- More details about the Java implementations are furnished.
- All previous experiments were re-executed, and new challenging case studies were added to the experimental framework, exploiting both synthetic (benchmark) and real-world datasets.

## 2. Related Work

#### 2.1. Lloyd’s K-Means

Algorithm 1. Lloyd's K-Means.

Input: the dataset $X$ and the number of clusters $K$.
Output: final centroids and corresponding partitions.

1. Initialization. Use some seeding method (e.g., uniform random) to choose $K$ data points of $X$ as initial centroids.
2. Partitioning. Assign the data points of $X$ to clusters according to the nearest-centroid rule.
3. Update. Redefine centroids as the mean points of the clusters resulting from step 2.
4. Check termination. If the termination condition does not hold, repeat from step 2.
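As a concrete illustration of the four steps above, here is a minimal, self-contained Java sketch of Lloyd's K-Means on a toy 2-D dataset. All names (`LloydSketch`, `assign`, `update`) are illustrative, not taken from the paper's implementation, and random seeding is replaced by two fixed seeds for reproducibility:

```java
import java.util.Arrays;

public class LloydSketch {

    // Euclidean squared distance between two points.
    static double sq(double[] a, double[] b) {
        double s = 0.0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    // Step 2 (Partitioning): nearest-centroid rule.
    static int[] assign(double[][] X, double[][] mu) {
        int[] label = new int[X.length];
        for (int i = 0; i < X.length; i++) {
            double best = Double.MAX_VALUE;
            for (int k = 0; k < mu.length; k++) {
                double d = sq(X[i], mu[k]);
                if (d < best) { best = d; label[i] = k; }
            }
        }
        return label;
    }

    // Step 3 (Update): centroids become the mean points of their clusters.
    static double[][] update(double[][] X, int[] label, int K) {
        int D = X[0].length;
        double[][] mu = new double[K][D];
        int[] cnt = new int[K];
        for (int i = 0; i < X.length; i++) {
            cnt[label[i]]++;
            for (int d = 0; d < D; d++) mu[label[i]][d] += X[i][d];
        }
        for (int k = 0; k < K; k++)
            for (int d = 0; d < D; d++)
                if (cnt[k] > 0) mu[k][d] /= cnt[k];
        return mu;
    }

    public static void main(String[] args) {
        // Two well-separated groups of 2-D points.
        double[][] X = {{0, 0}, {0, 1}, {1, 0}, {9, 9}, {9, 10}, {10, 9}};
        // Step 1 (Initialization): two fixed seeds stand in for random seeding.
        double[][] mu = {X[0].clone(), X[3].clone()};
        int[] label = assign(X, mu);
        // Step 4 (Check termination): stop when the partition no longer changes.
        while (true) {
            mu = update(X, label, 2);
            int[] next = assign(X, mu);
            if (Arrays.equals(next, label)) break;
            label = next;
        }
        System.out.println(label[0] != label[3]); // the two groups end up separated
    }
}
```

The termination test used here (an unchanged partition) is the strictest one; looser criteria, such as a maximum number of iterations, are also common.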

#### 2.2. The Random Swap Clustering Algorithm

#### 2.3. Centroids Initialization Methods

Algorithm 2. The K-Means++ seeding method.

1. Establish the first centroid through a uniform random selection:

$\mu_{1}\leftarrow x_{j},\quad j\leftarrow unif\_rand(1..N),\quad L\leftarrow 1$

2. For each point $x_{i}$, define the probability $\pi(x_{i})$ of being chosen as the next centroid as:

$\pi(x_{i})=\frac{D(x_{i})^{2}}{\sum_{j=1}^{N}D(x_{j})^{2}}$

Use a random switch (roulette-wheel selection) based on the newly computed values $\{\pi(x_{i})\}_{i=1}^{N}$ to choose a point $x^{*}\in X$, not previously selected, as the next centroid:

$L\leftarrow L+1,\quad \mu_{L}\leftarrow x^{*}$

3. If $L<K$, repeat from step 2.
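The two steps above can be sketched in Java as follows, with the random switch implemented by inverse-CDF (roulette-wheel) sampling over the $D(x_{i})^{2}$ weights. Class and method names, and the toy dataset, are illustrative assumptions:

```java
import java.util.Random;

public class KMppSeed {

    static double sq(double[] a, double[] b) {
        double s = 0.0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    // D(x)^2: squared distance of x to the nearest of the first L centroids.
    static double D2(double[] x, double[][] mu, int L) {
        double best = Double.MAX_VALUE;
        for (int k = 0; k < L; k++) best = Math.min(best, sq(x, mu[k]));
        return best;
    }

    static double[][] seed(double[][] X, int K, Random r) {
        double[][] mu = new double[K][];
        mu[0] = X[r.nextInt(X.length)]; // step 1: uniform first centroid
        for (int L = 1; L < K; L++) {
            // step 2: pi(x_i) proportional to D(x_i)^2;
            // already-chosen points get weight 0, so they cannot be re-selected.
            double[] w = new double[X.length];
            double tot = 0.0;
            for (int i = 0; i < X.length; i++) { w[i] = D2(X[i], mu, L); tot += w[i]; }
            // random switch: pick the smallest j whose cumulative weight reaches u
            double u = r.nextDouble() * tot;
            int j = 0;
            for (double acc = w[0]; acc < u && j < X.length - 1; acc += w[++j]) ;
            mu[L] = X[j];
        }
        return mu;
    }

    public static void main(String[] args) {
        double[][] X = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] mu = seed(X, 2, new Random(7));
        // The second seed differs from the first, since chosen points have weight 0.
        System.out.println(sq(mu[0], mu[1]) > 0.0);
    }
}
```

Far-away points dominate the weights, so the seeds tend to spread across the natural groups of the data, which is the whole point of careful seeding.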

Algorithm 3. The Greedy_K-Means++ (GKM++) seeding method.

    μ1 ← xj, j ← unif_rand(1..N), L ← 1
    do {
        costBest ← ∞
        candBest ← ?
        repeat S times {
            select a point x* ∈ X as candidate centroid, using the K-Means++ method
            partition X according to {μ1, μ2, …, μL, x*}, that is,
                assign points to clusters according to the nc(.) function
            cost ← SSE()
            if (cost < costBest) {
                candBest ← x*
                costBest ← cost
            }
        }
        L ← L + 1
        μL ← candBest
    } while (L < K)
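A compact Java sketch of the greedy variant follows, assuming the same roulette-wheel draw as in K-Means++ (all names and the toy dataset are illustrative): each of the $S$ candidate draws is scored by the SSE of the tentative partition, and the best-scoring candidate becomes the next centroid.

```java
import java.util.Random;

public class GreedyKMpp {

    static double sq(double[] a, double[] b) {
        double s = 0.0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    // Squared distance to the nearest of {mu_1..mu_L} plus an optional candidate.
    static double d2(double[] x, double[][] mu, int L, double[] cand) {
        double best = Double.MAX_VALUE;
        for (int k = 0; k < L; k++) best = Math.min(best, sq(x, mu[k]));
        if (cand != null) best = Math.min(best, sq(x, cand));
        return best;
    }

    // SSE of X partitioned by the nearest-centroid rule.
    static double sse(double[][] X, double[][] mu, int L, double[] cand) {
        double s = 0.0;
        for (double[] x : X) s += d2(x, mu, L, cand);
        return s;
    }

    // One K-Means++ draw (roulette wheel over the D(x)^2 weights).
    static double[] draw(double[][] X, double[][] mu, int L, Random r) {
        double[] w = new double[X.length];
        double tot = 0.0;
        for (int i = 0; i < X.length; i++) { w[i] = d2(X[i], mu, L, null); tot += w[i]; }
        double u = r.nextDouble() * tot;
        int j = 0;
        for (double acc = w[0]; acc < u && j < X.length - 1; acc += w[++j]) ;
        return X[j];
    }

    static double[][] seed(double[][] X, int K, int S, Random r) {
        double[][] mu = new double[K][];
        mu[0] = X[r.nextInt(X.length)];
        for (int L = 1; L < K; L++) {
            double costBest = Double.MAX_VALUE;
            double[] candBest = null;
            for (int s = 0; s < S; s++) { // S greedy attempts per centroid
                double[] cand = draw(X, mu, L, r);
                double cost = sse(X, mu, L, cand); // cost of the tentative partition
                if (cost < costBest) { costBest = cost; candBest = cand; }
            }
            mu[L] = candBest;
        }
        return mu;
    }

    public static void main(String[] args) {
        // Three well-separated pairs; K = 3, S = 5 attempts per centroid.
        double[][] X = {{0, 0}, {0, 1}, {10, 10}, {10, 11}, {20, 0}, {20, 1}};
        double[][] mu = seed(X, 3, 5, new Random(1));
        // Greedy selection leaves at most one point per pair unmatched.
        System.out.println(sse(X, mu, 3, null) <= 3.0);
    }
}
```

The extra cost with respect to plain K-Means++ is the $S$ SSE evaluations per centroid, which is why GKM++ is markedly slower yet usually yields better seeds.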

#### 2.4. Evolutionary Algorithm Concepts

#### 2.4.1. GA-K-Means

#### 2.4.2. Concepts of Recombinator K-Means

#### 2.5. External Measures of Clustering Accuracy

## 3. Population-Based Clustering Algorithms

#### 3.1. PB-KM

Algorithm 4. The PB-KM operation.

    1. Setup
    ℘ ← ∅
    repeat J times {
        costBest ← ∞, candBest ← ?
        repeat R1 times {
            cand ← run(K-Means, GKM++, X)
            cost ← SSE(cand, X)
            if (cost < costBest) {
                costBest ← cost
                candBest ← cand
            }
        }
        ℘ ← ℘ ∪ {candBest}
    }
    2. Recombination
    costBest ← ∞, candBest ← ?
    repeat R2 times {
        cand ← run(K-Means, GKM++, ℘)
        cost ← SSE(cand, X)
        if (cost < costBest) {
            costBest ← cost
            candBest ← cand
            replace in ℘ the GKM++ selected centroids by the cand centroids
        }
    }
    check candBest accuracy by clustering indexes
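The core recombination idea of step 2, seeding from the pooled population ℘ and refining the seeds on $X$, can be sketched in Java as follows. Step 1 is replaced here by two hand-made solutions, and all names and parameter values ($J=2$, $R_{2}=10$, 3 Lloyd iterations) are illustrative assumptions, not the paper's settings:

```java
import java.util.Random;

public class PBKMSketch {

    static double sq(double[] a, double[] b) {
        double s = 0.0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    static double sse(double[][] X, double[][] mu) {
        double s = 0.0;
        for (double[] x : X) {
            double best = Double.MAX_VALUE;
            for (double[] m : mu) best = Math.min(best, sq(x, m));
            s += best;
        }
        return s;
    }

    // Draw K seeds from the pooled population with K-Means++ probabilities.
    static double[][] seedFromPool(double[][] pool, int K, Random r) {
        double[][] mu = new double[K][];
        mu[0] = pool[r.nextInt(pool.length)].clone();
        for (int L = 1; L < K; L++) {
            double[] w = new double[pool.length];
            double tot = 0.0;
            for (int i = 0; i < pool.length; i++) {
                double best = Double.MAX_VALUE;
                for (int k = 0; k < L; k++) best = Math.min(best, sq(pool[i], mu[k]));
                w[i] = best; tot += best;
            }
            double u = r.nextDouble() * tot;
            int j = 0;
            for (double acc = w[0]; acc < u && j < pool.length - 1; acc += w[++j]) ;
            mu[L] = pool[j].clone();
        }
        return mu;
    }

    // A few Lloyd iterations on X to refine the recombined seeds.
    static double[][] refine(double[][] X, double[][] mu, int iters) {
        int D = X[0].length;
        for (int t = 0; t < iters; t++) {
            double[][] sum = new double[mu.length][D];
            int[] cnt = new int[mu.length];
            for (double[] x : X) {
                int best = 0;
                double bd = Double.MAX_VALUE;
                for (int k = 0; k < mu.length; k++) {
                    double d = sq(x, mu[k]);
                    if (d < bd) { bd = d; best = k; }
                }
                cnt[best]++;
                for (int d = 0; d < D; d++) sum[best][d] += x[d];
            }
            for (int k = 0; k < mu.length; k++)
                if (cnt[k] > 0)
                    for (int d = 0; d < D; d++) mu[k][d] = sum[k][d] / cnt[k];
        }
        return mu;
    }

    public static void main(String[] args) {
        double[][] X = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        // Population P: J = 2 solutions of K = 2 centroids each, pooled together
        // (step 1 of Algorithm 4 is replaced by these hand-made solutions).
        double[][] pool = {{0, 0.5}, {10, 10.5}, {0, 1}, {10, 10}};
        Random r = new Random(3);
        double costBest = Double.MAX_VALUE;
        for (int rep = 0; rep < 10; rep++) { // R2 recombinations
            double[][] cand = refine(X, seedFromPool(pool, 2, r), 3);
            double cost = sse(X, cand);
            if (cost < costBest) costBest = cost;
        }
        System.out.println(costBest <= 1.0); // the optimal SSE for this X is 1.0
    }
}
```

Because ℘ only contains centroids of already-good solutions, seeding from the pool concentrates the search on promising regions, which is what makes the recombination step cheap yet effective.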

#### 3.2. PB-RS

Algorithm 5. The PB-RS recombination step.

    cand ← GKM++(℘)
    partition the X data points according to cand
    cost ← SSE(X)
    repeat T times {
        save cand
        cand' ← swap(cand), that is:
            c_s ← p_j, with p_j ∈ ℘, s ← unif_rand(1..K), j ← unif_rand(1..J*K)
        refine cand' by a few K-Means iterations (e.g., 5)
        new_cost ← SSE(cand', X)
        if (new_cost < cost) {
            accept cand': cand ← cand', cost ← new_cost
        } else {
            restore the saved cand and its previous partitioning
        }
    }
    check the accuracy of cand by further clustering indexes
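A conceptual Java sketch of the swap loop, under the same illustrative conventions as before: a random centroid $c_s$ is replaced by a random population centroid $p_j$, the candidate is refined by a few Lloyd iterations, and only improving swaps are accepted. The toy data, pool, and parameter values are assumptions made for the example:

```java
import java.util.Random;

public class PBRSSketch {

    static double sq(double[] a, double[] b) {
        double s = 0.0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    static double sse(double[][] X, double[][] mu) {
        double s = 0.0;
        for (double[] x : X) {
            double best = Double.MAX_VALUE;
            for (double[] m : mu) best = Math.min(best, sq(x, m));
            s += best;
        }
        return s;
    }

    static double[][] copy(double[][] a) {
        double[][] c = new double[a.length][];
        for (int i = 0; i < a.length; i++) c[i] = a[i].clone();
        return c;
    }

    // A few Lloyd iterations ("refine cand' by a few K-Means iterations").
    static double[][] refine(double[][] X, double[][] mu, int iters) {
        int D = X[0].length;
        for (int t = 0; t < iters; t++) {
            double[][] sum = new double[mu.length][D];
            int[] cnt = new int[mu.length];
            for (double[] x : X) {
                int best = 0;
                double bd = Double.MAX_VALUE;
                for (int k = 0; k < mu.length; k++) {
                    double d = sq(x, mu[k]);
                    if (d < bd) { bd = d; best = k; }
                }
                cnt[best]++;
                for (int d = 0; d < D; d++) sum[best][d] += x[d];
            }
            for (int k = 0; k < mu.length; k++)
                if (cnt[k] > 0)
                    for (int d = 0; d < D; d++) mu[k][d] = sum[k][d] / cnt[k];
        }
        return mu;
    }

    public static void main(String[] args) {
        double[][] X = {{0, 0}, {0, 1}, {10, 10}, {10, 11}, {20, 0}, {20, 1}};
        // Pooled population P of J*K centroids from earlier good solutions.
        double[][] pool = {{0, 0.5}, {10, 10.5}, {20, 0.5}, {0, 1}, {10, 10}, {20, 1}};
        Random r = new Random(5);
        // Deliberately poor initial candidate: all centroids near one group.
        double[][] cand = {{0, 0}, {0, 1}, {0, 0.5}};
        double cost0 = sse(X, cand);
        double cost = cost0;
        for (int t = 0; t < 30; t++) { // T swap trials
            double[][] trial = copy(cand);
            // swap: replace a random centroid c_s with a random population centroid p_j
            trial[r.nextInt(trial.length)] = pool[r.nextInt(pool.length)].clone();
            trial = refine(X, trial, 3);
            double newCost = sse(X, trial);
            if (newCost < cost) { cand = trial; cost = newCost; } // accept improving swaps only
        }
        System.out.println(cost < cost0); // swaps drawn from P escape the poor start
    }
}
```

Drawing the swapped-in centroid from ℘, instead of from the raw dataset as in classic Random Swap, biases each trial toward locations that some good solution has already validated.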

## 4. Java Implementation Notes

Algorithm 6. Code fragment of K-Means++/Greedy_K-Means++ operating on a source of data points.

    …
    final int l = L; // turn L into a final variable
    Stream<DataPoint> pStream = (PARALLEL) ?
        Arrays.stream(source).parallel() : Arrays.stream(source);
    DataPoint ssd = pStream // sum of squared distances
        .map(p -> {
            p.setDist(Double.MAX_VALUE);
            for (int k = 0; k < l; ++k) { // existing centroids
                double d = p.distance(centroids[k]);
                if (d < p.getDist()) p.setDist(d);
            }
            return p;
        })
        .reduce(new DataPoint(), DataPoint::add2Dist, DataPoint::add2DistCombiner);
    double denP = ssd.getDist(); // common denominator of point probabilities
    … // random switch
    …

Algorithm 7. Java function which calculates the $SSE$ cost of a given partitioning.

    Stream<DataPoint> pStream = (PARALLEL) ?
        Stream.of(dataset).parallel() : Stream.of(dataset);
    DataPoint s = pStream
        .map(p -> {
            int k = p.getCID(); // retrieve partition label (centroid index) of p
            double d = p.distance(centroids[k]);
            p.setDist(d * d); // store locally to p the squared distance of p to its (nearest) centroid
            return p;
        })
        .reduce(new DataPoint(), (p1, p2) -> {
            DataPoint ps = new DataPoint();
            ps.setDist(p1.getDist() + p2.getDist());
            return ps;
        });
    return s.getDist();

## 5. Experimental Framework

Birch1 and Birch2 contain $10^{5}$ 2-dimensional points distributed into 100 clusters. In particular, Birch1 places its clusters on a 10 × 10 grid, whereas Birch2 puts the clusters on a sine curve. Both datasets have spherical clusters of the same size.

#### 5.1. Clustering the A3 Dataset

Repeated K-Means was executed with the uniform random ($\mathrm{RKM}^{\mathrm{Unif}}$), K-Means++ ($\mathrm{RKM}^{\mathrm{KM++}}$), and Greedy K-Means++ ($\mathrm{RKM}^{\mathrm{GKM++}}$) seeding procedures.

$10^{4}$ repetitions of K-Means were executed and the following quantities monitored: (a) the minimal value of the SSE cost ($\mathrm{SSE}_{\mathrm{min}}$); (b) the corresponding Centroid Index (CI) value (see Section 2.5) ($\mathrm{CI}_{\mathrm{min(SSE)}}$); (c) the minimal value of the observed CI ($\mathrm{CI}_{\mathrm{min}}$) and the corresponding value of the SSE cost ($\mathrm{SSE}_{\mathrm{min(CI)}}$); (d) the emerging average CI value (avg_CI); and (e) the success_rate, that is, the number of runs which ended with $\mathrm{CI}=0$, divided by $10^{4}$. In addition, the Parallel Execution Time (PET), in seconds, needed by Repeated K-Means to complete its runs was also observed. Table 6 collects all the achieved results.

#### 5.2. First Group of Synthetic Datasets (Table 2)

#### 5.3. Second Group of Real-World Datasets (Table 3)

#### 5.4. Third Group of Synthetic Datasets (Table 4)

#### 5.5. Fourth Group of Real-World Datasets (Table 5)

#### 5.6. Time Efficiency of PB-KM

## 6. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Bell, J. Machine Learning: Hands-On for Developers and Technical Professionals; John Wiley & Sons: Hoboken, NJ, USA, 2020.
2. Lloyd, S.P. Least squares quantization in PCM. IEEE Trans. Inf. Theory **1982**, 28, 129–137.
3. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability; University of California: Berkeley, CA, USA, 1967; pp. 281–297.
4. Jain, A.K. Data clustering: 50 years beyond k-means. Pattern Recognit. Lett. **2010**, 31, 651–666.
5. Fränti, P.; Sieranoja, S. K-means properties on six clustering benchmark datasets. Appl. Intell. **2018**, 48, 4743–4759.
6. Fränti, P.; Sieranoja, S. How much can k-means be improved by using better initialization and repeats? Pattern Recognit. **2019**, 93, 95–112.
7. Vouros, A.; Langdell, S.; Croucher, M.; Vasilaki, E. An empirical comparison between stochastic and deterministic centroid initialization for K-means variations. Mach. Learn. **2021**, 110, 1975–2003.
8. Fränti, P. Efficiency of random swap algorithm. J. Big Data **2018**, 5, 1–29.
9. Nigro, L.; Cicirelli, F.; Fränti, P. Parallel random swap: An efficient and reliable clustering algorithm in Java. Simul. Model. Pract. Theory **2023**, 124, 102712.
10. Fränti, P. Genetic algorithm with deterministic crossover for vector quantization. Pattern Recognit. Lett. **2000**, 21, 61–68.
11. Baldassi, C. Recombinator K-Means: A population based algorithm that exploits k-means++ for recombination. arXiv **2020**, arXiv:1905.00531v3.
12. Baldassi, C. Recombinator K-Means: An evolutionary algorithm that exploits k-means++ for recombination. IEEE Trans. Evol. Comput. **2022**, 26, 991–1003.
13. Hruschka, E.R.; Campello, R.J.; Freitas, A.A. A survey of evolutionary algorithms for clustering. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) **2009**, 39, 133–155.
14. Celebi, M.E.; Kingravi, H.A.; Vela, P.A. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst. Appl. **2013**, 40, 200–210.
15. Nigro, L. Performance of parallel K-means algorithms in Java. Algorithms **2022**, 15, 117.
16. Urma, R.G.; Fusco, M.; Mycroft, A. Modern Java in Action; Manning: Shelter Island, NY, USA, 2018.
17. Nigro, L.; Cicirelli, F. Performance of a K-Means algorithm driven by careful seeding. In Proceedings of the 13th International Conference on Simulation and Modeling Methodologies, Technologies and Applications (SIMULTECH), Rome, Italy, 12–14 July 2023; pp. 27–36. ISBN 978-989-758-668-2.
18. Arthur, D.; Vassilvitskii, S. K-Means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2007); pp. 1027–1035.
19. Goldberg, D.E. Genetic Algorithms in Search, Optimization and Machine Learning; Addison-Wesley: Boston, MA, USA, 1989.
20. Nigro, L.; Fränti, P. Two medoid-based algorithms for clustering sets. Algorithms **2023**, 16, 349.
21. Rezaei, M.; Fränti, P. Set matching measures for external cluster validity. IEEE Trans. Knowl. Data Eng. **2016**, 28, 2173–2186.
22. Fränti, P.; Rezaei, M.; Zhao, Q. Centroid index: Cluster level similarity measure. Pattern Recognit. **2014**, 47, 3034–3045.
23. Fränti, P.; Rezaei, M. Generalized centroid index to different clustering models. In Proceedings of the Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Mérida, Mexico, 29 November–2 December 2016; LNCS 10029, pp. 285–296.
24. Fränti, P. Repository of Datasets. Available online: http://cs.uef.fi/sipu/datasets/ (accessed on 1 August 2023).
25. UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ml (accessed on 1 August 2023).
26. Sieranoja, S.; Fränti, P. Fast and general density peaks clustering. Pattern Recognit. Lett. **2019**, 128, 551–558.
27. Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science **2014**, 344, 1492–1496.
28. Baldassi, C. UrbanGB Dataset. Available online: https://github.com/carlobaldassi/UrbanGB-dataset (accessed on 1 August 2023).
29. Rezaei, M.; Fränti, P. K-sets and k-swaps algorithms for clustering sets. Pattern Recognit. **2023**, 139, 109454.
30. Nigro, L. Parallel Theatre: A Java actor-framework for high-performance computing. Simul. Model. Pract. Theory **2021**, 106, 102189.
31. Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A k-means clustering algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) **1979**, 28, 100–108.
32. Slonim, N.; Aharoni, E.; Crammer, K. Hartigan's k-means versus Lloyd's k-means: Is it time for a change? In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI 2013), Beijing, China, 3–9 August 2013; pp. 1677–1684.
33. Bagirov, A.M.; Aliguliyev, R.M.; Sultanova, N. Finding compact and well separated clusters: Clustering using silhouette coefficients. Pattern Recognit. **2023**, 135, 109144.
34. Frey, B.J.; Dueck, D. Clustering by passing messages between data points. Science **2007**, 315, 972–976.
35. Moustafa, S.S.; Abdalzaher, M.S.; Khan, F.; Metwaly, M.; Elawadi, E.A.; Al-Arifi, N.S. A quantitative site-specific classification approach based on affinity propagation clustering. IEEE Access **2021**, 9, 155297–155313.
36. Lovisolo, L.; Da Silva, E.A.B. Uniform distribution of points on a hyper-sphere with applications to vector bit-plane encoding. IEE Proc. Vis. Image Signal Process. **2001**, 148, 187–193.

**Figure 4.** SSE cost vs. time for the Birch3 dataset.

**Figure 5.** Centroid Index (CI) vs. time for the Birch3 dataset.

**Figure 6.** SSE cost vs. time for the Worms_2d dataset.

**Figure 7.** (Generalized) Centroid Index (CI) vs. time for the Worms_2d dataset.

**Figure 8.** SSE cost vs. time for the Worms_64d dataset.

**Figure 9.** (Generalized) Centroid Index (CI) vs. time for the Worms_64d dataset.

**Figure 10.** SSE vs. time for the Bridge dataset.

**Figure 11.** SSE vs. time for the MissAmerica dataset.

**Figure 12.** SSE vs. time for the House dataset.

**Figure 13.** SSE vs. time for the Olivetti dataset.

**Figure 14.** CI vs. time for the Olivetti dataset.

**Figure 15.** SSE vs. time for the UrbanGB dataset.

**Figure 16.** CI vs. time for the UrbanGB dataset.

**Table 1.** List of main symbols.

Symbol | Description
---|---
$N$ | number of data points (vectors) $x_{i}$ in the dataset $X$
$D$ | number of dimensions (coordinates or features) of each data point
$K$ | number of clusters/centroids
$d(x_{i},x_{j})$ | Euclidean distance between data points $x_{i}$ and $x_{j}$
$C_{1}\dots C_{K}$ | partition clusters
$\mu_{1}\dots \mu_{K}$ | representative centroids of clusters
$nc(x_{i})$ | nearest centroid to data point $x_{i}$
$L$ | number of currently defined centroids in a seeding method
$D(x_{i})$ | minimal distance of $x_{i}$ to the currently existing centroids
$SSE$ | Sum-of-Squared-Errors objective function
$nMSE$ | normalized mean of $SSE$, also referred to as distortion
$Unif$ | uniform random seeding method
$KM$++ | K-Means++ seeding method
$GKM$++ | Greedy K-Means++ seeding method
$S$ | number of attempts in GKM++ for identifying the next centroid
$CI$ | Centroid Index, an external measure of clustering accuracy
<${C}^{j},{P}^{j}$> | a solution of a clustering algorithm, i.e., a pair of a centroid vector and the corresponding partition labels of the data points belonging to the clusters
PB-KM | the proposed Population-Based K-Means clustering algorithm
PB-RS | the proposed Population-Based Random Swap clustering algorithm
℘ | population of $J*K$ centroids in the PB-KM/PB-RS algorithms
$J$ | number of "best" solutions initially put in ℘
$R_{1}$ | number of repetitions of K-Means in the 1st step of PB-KM
$R_{2}$ | number of repetitions of K-Means in the 2nd step of PB-KM
$T$ | number of iterations of Random Swap in the 1st step of PB-RS, for defining each of the $J$ candidate population solutions; also the number of iterations of Random Swap in the 2nd step of PB-RS for achieving a careful solution

**Table 2.** The first group of synthetic datasets [24].

Dataset | N | D | K
---|---|---|---
A3 | 7500 | 2 | 50
S3 | 5000 | 2 | 15
Dim1024 | 1024 | 1024 | 16
Unbalance | 6500 | 2 | 8
Birch1/2 | 100,000 | 2 | 100

**Table 3.** The second group of real-world datasets.

Dataset | N | D | K
---|---|---|---
Musk | 6598 | 166 | 2
MiniBooNE | 130,064 | 50 | 2

**Table 4.** The third group of synthetic datasets [24].

Dataset | N | D | K
---|---|---|---
Birch3 | 100,000 | 2 | 100
Worms_2d | 105,600 | 2 | 35
Worms_64d | 105,000 | 64 | 25

**Table 5.** The fourth group of real-world datasets.

Dataset | N | D | K
---|---|---|---
Bridge | 4096 | 16 | 256
House | 34,112 | 3 | 256
MissAmerica | 6480 | 16 | 256
Olivetti | 400 | 4096 | 40
UrbanGB | 360,177 | 2 | 469

**Table 6.** Repeated K-Means results on the A3 dataset.

  | $\mathrm{RKM}^{\mathrm{Unif}}$ | $\mathrm{RKM}^{\mathrm{KM++}}$ | $\mathrm{RKM}^{\mathrm{GKM++}}$
---|---|---|---
$\mathrm{SSE}_{\mathrm{min}}$ | 7.44 | 6.74 | 6.74
$\mathrm{CI}_{\mathrm{min(SSE)}}$ | 1 | 0 | 0
$\mathrm{CI}_{\mathrm{min}}$ | 1 | 0 | 0
$\mathrm{SSE}_{\mathrm{min(CI)}}$ | 7.44 | 6.74 | 6.74
avg_CI | 6.58 | 4.17 | 1.62
success_rate | 0% | 0.01% | 5.8%
PET (s) | 103 | 154 | 990

**Table 7.** PB-KM results on the A3 dataset.

PB-KM ($J=25$, $R_{1}=3$, $R_{2}=40$) | 
---|---
$\mathrm{PET}_{1}$ (s) | 6.4
$\mathrm{SSE}_{\mathrm{min}}$ | 6.74
$\mathrm{CI}_{\mathrm{min(SSE)}}$ | 0
$\mathrm{CI}_{\mathrm{min}}$ | 0
$\mathrm{SSE}_{\mathrm{min(CI)}}$ | 6.74
avg_CI | 0
success_rate | 100%
$\mathrm{PET}_{2}$ (s) | 2.4

**Table 8.** PB-KM results on the synthetic datasets of Table 2.

Dataset | min(SSE) | $\mathrm{CI}_{\mathrm{min(SSE)}}$ | avg_CI | success_rate | $\mathrm{PET}_{1}$ (s) | $\mathrm{PET}_{2}$ (s)
---|---|---|---|---|---|---
A3 | 6.74 | 0 | 0 | 100% | 6.4 | 2.4
S3 | 18.82 | 0 | 0 | 100% | 1.1 | 0.5
Dim1024 | 5.39 | 0 | 0 | 100% | 9.4 | 4.0
Unbalance | 0.65 | 0 | 0 | 100% | 0.6 | 0.3
Birch1 | 92.77 | 0 | 0 | 100% | 277.3 | 96.6
Birch2 | 0.46 | 0 | 0 | 100% | 242.2 | 99.0

**Table 9.** PB-KM results on the real-world datasets of Table 3.

Dataset | min(SSE) | $\mathrm{CI}_{\mathrm{min(SSE)}}$ | avg_CI | success_rate | $\mathrm{PET}_{1}$ (s) | $\mathrm{PET}_{2}$ (s)
---|---|---|---|---|---|---
Musk | 36,373 | 0 | 0 | 100% | 0.5 | 0.1
MiniBooNE | 2802 | 0 | 0 | 100% | 5.3 | 0.8

**Table 10.** Sequential and parallel execution of the PB-KM recombination step on the Worms_64d dataset (8 physical cores).

Worms_64d | PB-KM, 2nd step, $J=40$, $R_{2}=100$
---|---
$\mathrm{tET}^{S}$ (ms) | 4,405,325
$\mathrm{tIT}^{S}$ | 15,049
$\mathrm{tET}^{P}$ (ms) | 650,622
$\mathrm{tIT}^{P}$ | 14,887


## Share and Cite

**MDPI and ACS Style**

Nigro, L.; Cicirelli, F.
Improving Clustering Accuracy of K-Means and Random Swap by an Evolutionary Technique Based on Careful Seeding. *Algorithms* **2023**, *16*, 572.
https://doi.org/10.3390/a16120572
