# Performance Analysis and Architecture of a Clustering Hybrid Algorithm Called FA+GA-DBSCAN Using Artificial Datasets

## Abstract

## 1. Introduction

## 2. Data Preprocessing

#### 2.1. Data Collection Method

#### 2.2. Data Normalization

#### 2.3. Dimensionality Reduction Technique

Algorithm 1: Dimensionality reduction using Factor Analysis.
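
For orientation, the standard factor analysis model that this kind of reduction rests on can be written as below. The symbols are assumptions introduced here, not notation from this paper: $x$ is one observed sample with $m$ variables, $\mu$ its mean, $\Lambda$ the $m\times k$ loading matrix, $f$ the $k$ common factors, and $\Psi$ a diagonal matrix of specific variances. Normality is the usual assumption for maximum-likelihood estimation and may differ from the estimator used in Algorithm 1.

$$
x=\mu +\Lambda f+\varepsilon ,\qquad f\sim \mathcal{N}\left(0,{I}_{k}\right),\quad \varepsilon \sim \mathcal{N}\left(0,\Psi \right),\qquad \mathrm{Cov}\left(x\right)=\Lambda {\Lambda }^{\top }+\Psi
$$

Under this reading, projecting a dataset ${D}_{1000\times 6}$ onto ${D}_{1000\times 2}$ corresponds to $m=6$ observed variables and $k=2$ common factors, with each row replaced by its estimated factor scores.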

## 3. DBSCAN Algorithm

- Eps-neighborhood of a point: The Eps-neighborhood of a point p, denoted by ${N}_{\mathrm{Eps}}\left(p\right)$, is defined by ${N}_{\mathrm{Eps}}\left(p\right)=\{q\in D\mid \mathrm{dist}\left(p,q\right)\le \mathrm{Eps}\}$.
- Directly density-reachable: A point p is directly density-reachable from a point q if:
  - $p\in {N}_{\mathrm{Eps}}\left(q\right)$, and
  - the core point condition holds, i.e., $\left|{N}_{\mathrm{Eps}}\left(q\right)\right|\ge \mathrm{MinPts}$.

- Density-reachable: A point p is density-reachable from a point q if there is a sequence of points ${p}_{1},\dots ,{p}_{n}$, with ${p}_{1}=q$ and ${p}_{n}=p$, such that ${p}_{i+1}$ is directly density-reachable from ${p}_{i}$ for every $i=1,\dots ,n-1$.
- Density-connected: A point p is density-connected to a point q if there is a point o such that p and q are density-reachable from o.
- Cluster: Let D be a dataset. A cluster C is a non-empty subset of D that meets the following criteria:
  - Maximality: $\forall p,q$, if $p\in C$ and q is density-reachable from p, then $q\in C$.
  - Connectivity: $\forall p,q\in C$, p is density-connected to q.

- Noise: Let ${C}_{1},\dots ,{C}_{k}$ be the clusters of dataset D. Noise is defined as the set of points in the dataset D not belonging to any cluster ${C}_{i}$, that is, $\{p\in D\mid \forall i:p\notin {C}_{i}\}$.
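
The first two definitions above translate almost directly into code. The following Python sketch is illustrative only: the function names are hypothetical, points are referred to by their index in a list of coordinate tuples, and the Euclidean distance is assumed for dist.

```python
from math import dist  # Euclidean distance (assumed metric), Python 3.8+


def eps_neighborhood(data, p, eps):
    """Indices q with dist(data[p], data[q]) <= eps; p itself is included."""
    return [q for q, point in enumerate(data) if dist(data[p], point) <= eps]


def is_core(data, q, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighborhood(data, q, eps)) >= min_pts


def directly_density_reachable(data, p, q, eps, min_pts):
    """p is directly density-reachable from q: p in N_Eps(q) and q is a core point."""
    return p in eps_neighborhood(data, q, eps) and is_core(data, q, eps, min_pts)


# Three nearby points and one isolated point (toy data, not from the paper).
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
neighbors_of_first = eps_neighborhood(toy, 0, eps=1.5)
```

Note that the relation is not symmetric: a border point can be directly density-reachable from a core point without the converse holding.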

Algorithm 2: Clustering using DBSCAN algorithm.
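
As a rough illustration of how the definitions in this section combine into a clustering procedure, here is a minimal textbook-style DBSCAN sketch in Python. It is not a transcription of Algorithm 2: the function name, the brute-force neighbor search, the Euclidean metric, and the `-1` noise label are all assumptions made for the example.

```python
from math import dist  # Euclidean distance (assumed metric), Python 3.8+

NOISE = -1  # label for points that end up in no cluster (assumed convention)


def dbscan(data, eps, min_pts):
    """Return one cluster label per point; NOISE marks noise points."""
    labels = [None] * len(data)  # None means "not visited yet"

    def region(i):
        # Brute-force Eps-neighborhood of point i (includes i itself).
        return [j for j, q in enumerate(data) if dist(data[i], q) <= eps]

    cluster = 0
    for i in range(len(data)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                seeds.extend(j_neighbors)
        cluster += 1
    return labels


# Two compact groups and one isolated point (toy data, not from the paper).
toy = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
toy_labels = dbscan(toy, eps=1.5, min_pts=3)
```

The brute-force neighbor search keeps the sketch short at a cost of quadratic time; a spatial index would be the usual replacement for larger datasets.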

#### 3.1. Definition of Parameter MinPts

#### 3.2. Definition of Parameter Eps Using a Fitness Proportionate Selection

Algorithm 3: Selection of DBSCAN parameters using a genetic algorithm.
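
Fitness proportionate (roulette-wheel) selection draws each individual with probability equal to its share of the total fitness. The Python sketch below shows only this selection step. The function name is hypothetical, and the fitness values are supplied by the caller, because the fitness function of the genetic algorithm is not reproduced in this section.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def roulette_select(population, fitness, k, rng=None):
    """Draw k individuals, each with probability fitness[i] / sum(fitness)."""
    rng = rng or random.Random()
    if not population or len(population) != len(fitness):
        raise ValueError("population and fitness must be non-empty and equal in length")
    if any(f < 0 for f in fitness):
        raise ValueError("fitness values must be non-negative")
    cumulative = list(accumulate(fitness))  # running totals: the wheel's slot edges
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("at least one fitness value must be positive")
    last = len(population) - 1
    selected = []
    for _ in range(k):
        spin = rng.random() * total  # where the wheel stops
        # Capping the search at `last` guards against floating-point round-off.
        index = bisect_right(cumulative, spin, 0, last)
        selected.append(population[index])
    return selected


# Placeholder individuals and fitness values (not taken from the paper).
picks = roulette_select(["a", "b", "c"], [1.0, 3.0, 0.0], 2000, random.Random(1))
```

An individual with zero fitness is never drawn, and one with three times the fitness of another is drawn about three times as often.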

## 4. Results

#### 4.1. Performance Evaluation Metrics

#### 4.1.1. Precision

#### 4.1.2. Entropy

#### 4.1.3. Calinski–Harabasz Clustering Evaluation Method

#### 4.2. Clustering Performance Analysis

## 5. Case Studies

#### 5.1. Aircraft Engine Degradation

#### 5.2. Lidar Dataset

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

**Figure 1.** Example of a dimensionality reduction of a dataset D. (**a**) Three-dimensional scatter-plot of dataset ${D}_{1000\times 6}$. (**b**) Scatter-plot of artificial dataset projection ${D}_{1000\times 2}$.

**Figure 2.** Clustering results using the FA+GA-DBSCAN algorithm with artificial datasets. Noise points are marked by “+”.

**Figure 3.** Clustering results using the K-means algorithm with artificial datasets.

**Figure 4.** Clustering results, precision, entropy, and information gain using the hybrid algorithm FA+GA-DBSCAN.

**Figure 6.** Scree plot of the aircraft engine operational conditions and degradation dataset. Two common factors are sufficient to represent a large quantity of the original information.

**Figure 7.** Clustering analysis of an aircraft engine considering different operational conditions and one mode of degradation.

**Table 1.** A scheme of a chromosome belonging to the initial population with two alleles; one is the point p, and the other is the radius r.

Allele 1: x | Allele 1: y | Allele 2: Radius r |
---|---|---|
51.606 | 12.783 | 1.036 |
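
The chromosome layout in Table 1 can be pictured as a small record type. The Python sketch below is a hypothetical encoding: the class name, the helper that builds an initial population, and the choices to sample p from the dataset and r uniformly from a caller-supplied range are assumptions, since the initialization procedure is not described in this section.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Chromosome:
    """Two alleles: the point p = (x, y) and the radius r."""
    x: float
    y: float
    radius: float


def random_population(points, r_min, r_max, size, rng):
    """Build `size` chromosomes (assumed scheme: p from data, r uniform in range)."""
    population = []
    for _ in range(size):
        x, y = rng.choice(points)            # allele 1: a point of the dataset
        radius = rng.uniform(r_min, r_max)   # allele 2: a candidate radius
        population.append(Chromosome(x, y, radius))
    return population


# The example row of Table 1 expressed as a chromosome.
example = Chromosome(x=51.606, y=12.783, radius=1.036)

# Toy initial population; the second point and the radius range are made up.
toy_points = [(51.606, 12.783), (3.0, 4.0)]
toy_population = random_population(toy_points, 0.5, 2.0, 10, random.Random(42))
```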

**Table 2.** Automatic definition of FA+GA-DBSCAN’s parameters Eps and MinPts; values are represented using their mean and standard deviation after 30 runs of the algorithm.

Dataset Name | Eps | MinPts |
---|---|---|
Aggregation | $1.130\pm 2.52\times {10}^{-5}$ | $5.498\pm 0.00$ |
Compound | $2.413\pm 2.60\times {10}^{-4}$ | $7.710\pm 0.00$ |
Jain | $2.550\pm 3.89\times {10}^{-4}$ | $5.228\pm 0.00$ |
Dim064 | $0.142\pm 1.95\times {10}^{-6}$ | $5.898\pm 0.00$ |
Wine | $0.694\pm 9.90\times {10}^{-6}$ | $23.042\pm 0.00$ |
MDCgen | $0.872\pm 3.71\times {10}^{-5}$ | $26.055\pm 0.00$ |

**Table 3.** A comparative study of clustering performance using the Calinski–Harabasz clustering evaluation method and FA+GA-DBSCAN; C denotes the number of clusters.

Dataset Name | Classes | Calinski–Harabasz Optimal C | C Defined by FA+GA-DBSCAN |
---|---|---|---|
Jain | 2 | 9 | 2 |
Aggregation | 7 | 6 | 7 |
Compound | 2 | 2 | 3 |
MDCgen | 3 | 5 | 3 |
Dim064 | 16 | 16 | 16 |
Wine | 3 | 3 | 3 |
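
For readers unfamiliar with the criterion behind the "Calinski–Harabasz Optimal C" column, the index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom; the candidate number of clusters with the highest value is taken as optimal. The following Python sketch computes the index for a given labeling. The function name is hypothetical, and how noise points are treated is left to the caller (an assumption; they would typically be removed before scoring).

```python
def calinski_harabasz(data, labels):
    """CH = [B / (k - 1)] / [W / (n - k)] for n points in k clusters."""
    n = len(data)
    clusters = {}
    for point, label in zip(data, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the index needs 2 <= k < n")

    def centroid(points):
        return tuple(sum(coord) / len(points) for coord in zip(*points))

    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    overall = centroid(data)
    between = 0.0  # B: size-weighted spread of cluster centroids
    within = 0.0   # W: spread of points around their own centroid
    for members in clusters.values():
        center = centroid(members)
        between += len(members) * squared_distance(center, overall)
        within += sum(squared_distance(p, center) for p in members)
    if within == 0.0:
        return float("inf")  # every point coincides with its centroid
    return (between / (k - 1)) / (within / (n - k))


# Toy data (not from the paper): two well-separated pairs of points.
toy = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good_score = calinski_harabasz(toy, [0, 0, 1, 1])
poor_score = calinski_harabasz(toy, [0, 1, 0, 1])
```

On the toy data the labeling that matches the visible grouping scores far higher than the one that splits each pair, which is the behavior the criterion relies on when it picks an optimal number of clusters.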

