Article

DPS Clustering: New Results

by Sergey M. Agayan 1, Shamil R. Bogoutdinov 1,2, Boris A. Dzeboev 1,*, Boris V. Dzeranov 1, Dmitriy A. Kamaev 3 and Maxim O. Osipov 4
1 Geophysical Center of the Russian Academy of Sciences, 119296 Moscow, Russia
2 Schmidt Institute of Physics of the Earth of the Russian Academy of Sciences, 123995 Moscow, Russia
3 Research and Production Association “Typhoon”, 249038 Obninsk, Russia
4 Obninsk Institute for Nuclear Power Engineering, National Research Nuclear University Moscow Engineering Physics Institute (MEPhI), 249040 Obninsk, Russia
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(18), 9335; https://doi.org/10.3390/app12189335
Submission received: 29 July 2022 / Revised: 12 September 2022 / Accepted: 13 September 2022 / Published: 17 September 2022
(This article belongs to the Special Issue Data Clustering: Algorithms and Applications)

Abstract

The results presented in this paper were obtained as part of the continued development and study of clustering algorithms based on discrete mathematical analysis. The article briefly describes the theory of Discrete Perfect Sets (DPS-sets), which is the basis for constructing DPS-clustering algorithms. The main task of the previously constructed DPS-algorithms is to search for clusters in multidimensional arrays with noise. DPS-algorithms have two stages: the first recognizes the maximum perfect set of a given density level in the initial array; the second partitions the result of the first stage into connected components, which are considered to be clusters. A study of the properties of DPS-algorithms showed that, in a number of situations, the result of the first stage does not include all clusters of practical interest, and that the partitioning into connected components at the second stage can produce unnecessarily small clusters. Simply varying the parameters of DPS-algorithms does not eliminate these drawbacks. The present paper constructs, on the basis of DPS-algorithms, new versions that are largely free of these drawbacks.

1. Introduction

Cluster analysis, or clustering, is one of the most interesting and widely used approaches to multidimensional data analysis. Currently, there are many clustering algorithms. Despite significant differences between them, they all rely on the initial postulate of compactness: in the space of objects, all “close” objects must belong to the same cluster, and all different objects, respectively, must be in different clusters. The concept of “proximity” is interpreted differently in different clustering algorithms.
Methods of so-called DPS-clustering are being developed within the framework of Discrete Mathematical Analysis (DMA), an original approach to data analysis that uses fuzzy mathematics and fuzzy logic [1]. The present study is devoted to DPS-clustering and continues the series of papers on this problem [2,3,4,5,6,7].
The initial notion of DPS-clustering is a fuzzy model of the fundamental mathematical property ‘‘limit’’. It is called the density and represents a non-negative function depending on an arbitrary subset and any point in the initial space in which clustering is supposed.
The value of density can be understood as the strength of the connection between a subset and a point, as the degree of influence of a subset on a point, or dually, as the degree of limiting a point to a subset. This view of density automatically requires its monotonicity over a subset: the larger the subset, the stronger its impact on the point, and it is more limiting for it.
Nontrivial densities always exist in a finite metric space (FMS). Each density on the FMS gives a new look at it and a new program of its study, so that the density is a new mathematical concept.
Fixing the density level α and interpreting it as a limit level, we can introduce the concept of discrete perfection with level α: a subset is called discretely perfect with level α (α-DPS-set or simply DPS-set) if it consists of exactly all points of the original space that are α-limit to it. A rigorous theory of DPS-sets has been constructed within the framework of DMA. In particular, it has been shown that DPS-sets have the properties of clusters. This, as well as the comparison of DPS-clustering with modern cluster analysis algorithms and its applications, is described in detail in [5].
The DPS-clustering algorithms created to date operate in finite metric spaces, depend on three parameters (density P, density level α, and local coverage radius r) and have two stages.
At the first stage, topological filtering of the original space is carried out: it is cleaned of noise. The DPS-algorithm iteratively cuts out of the original space (Figure 1a) the maximum α-perfect subset (Figure 1b).
At the second stage, the DPS-algorithm splits the result of the first stage into r-connected components, which it considers to be clusters (Figure 1c).
For the array in Figure 1a and for the SDPS-algorithm, the best known of the DPS-algorithms (Section 2.2.2), the clusters are shown in different colors in Figure 1b. The clusters obtained on the same array by the well-known cluster analysis algorithms DBSCAN [8] and OPTICS [9] practically coincide with each other and are shown in Figure 1c. This allows us to conclude that DPS-algorithms, like the DBSCAN and OPTICS algorithms, represent a new stage in cluster analysis, since they not only split the initial space into homogeneous parts, but also preliminarily clear it of noise (filter it).
Studies show that there are situations where the result of the first stage does not include all noteworthy clusters. Decreasing the limit level α lowers the quality of cluster recognition and therefore does not resolve the problem.
Furthermore, because the radius is local, the partitioning of the maximum α-perfect subset into r-connected components at the second stage is often too fine and detailed, and the components need to be enlarged, as can be seen in Figure 1b.
The present work is devoted to correcting these disadvantages.

2. Materials and Methods

This section defines DPS-sets and describes DPS-clustering algorithms.

2.1. Discrete Perfect Sets

DMA has a rigorous theory of discrete perfect sets (DPS-sets) in finite spaces, which is summarized in [5]. A complete rigorous justification can be found in [3].
Let $X$ be a finite set, and let $A, B, \ldots$ and $x, y, \ldots$ be subsets and points in it, respectively.
Definition 1.
A density $P$ on a set $X$ is a mapping of the product $2^X \times X$ to the non-negative numbers $\mathbb{R}^+$ that is increasing in the first argument and trivial on the empty set:
$$P(A, x) = P_A(x); \qquad \forall x \in X:\ A \subseteq B \Rightarrow P_A(x) \le P_B(x), \qquad P_\emptyset(x) = 0.$$
For a fixed x, the function P A ( x ) is a non-normalized fuzzy measure on X, so the density P is a family of such measures parameterized by X itself.
By fixing a density level α and interpreting it as a limit level, one can define any topological notions in X, and discrete perfection with level α particularly.
Definition 2.
A set $A$ consisting of exactly all points of the original space $X$ that are $\alpha$-limit to it is called an $\alpha$-discrete perfect (simply perfect, DPS-) set in $X$:
$$A \text{ is a DPS-set in } X \iff A = \{x \in X : P_A(x) \ge \alpha\}. \tag{2}$$
Numerous studies and examples below show that DPS-sets are clusters in X and are closely related to clustering in X.
In the works [2,3,5], a construction is given that builds a subset $A(\alpha) \subseteq X$ from a subset $A$ of the space $X$ and the level $\alpha$ of the density $P$. Under the condition of nontriviality, it is an $\alpha$-DPS-set in $X$. It does not have to lie in $A$, but if $A$ itself is an $\alpha$-DPS-set in $X$, then the set $A(\alpha)$ constructed for it coincides with it. Thus, the construction $A \to A(\alpha)$ depends on the space $X$, the subset $A$, the density $P$, and the level $\alpha$. To emphasize this fact, we introduce the notation
$$A(\alpha) = A_P(\alpha \mid X).$$
The properties of the construction $A \to A_P(\alpha \mid X)$ are formulated in Statement 1, which is proved in [3].
Statement 1. 
The dependences on $A$, $P$ and $X$ are increasing, and the dependence on $\alpha$ is decreasing:
(1) If $A \subseteq B$, then $A(\alpha) \subseteq B(\alpha)$;
(2) If $P$, $Q$ are densities on $X$ and $P_A(x) \le Q_A(x)$ $\forall x \in X$, then $A_P(\alpha) \subseteq A_Q(\alpha)$;
(3) If $\alpha < \beta$, then $A(\beta) \subseteq A(\alpha)$;
(4) If $A \subseteq X \subseteq Y$ and the density $P$ is given on $Y$, then $A(\alpha \mid X) \subseteq A(\alpha \mid Y)$.

2.2. DPS-Algorithms

In the case of the entire space $X$, the transition $X \to X(\alpha)$ is an iterative cutting:
$$X(\alpha) = \bigcap_{i} X^{i}(\alpha), \qquad X^{i+1}(\alpha) = \{x \in X : P_{X^{i}(\alpha)}(x) \ge \alpha\}, \qquad X^{0}(\alpha) = X.$$
If there is a metric on the space $X$ and the density $P$ is compatible with it, the property of $\alpha$-perfectness is inherited by the connected components of the set $X(\alpha)$. It is precisely these components that, in the first approximation, correspond most closely to empirical clusters.
In what follows, a metric $d$ is given on the space $X$, so that $(X, d)$ is an FMS.
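The iterative cutting above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `alpha_envelope`, the toy density `count_within_one` (number of points of the subset within distance 1 on the real line) and the sample data are hypothetical and are not taken from the paper. The loop relies on the monotonicity of the density from Definition 1, which guarantees that the sets shrink until a fixed point is reached.

```python
from typing import Callable, FrozenSet, Hashable, Iterable

# A density takes a subset A and a point x and returns P_A(x) >= 0.
Density = Callable[[FrozenSet[Hashable], Hashable], float]


def alpha_envelope(space: Iterable[Hashable], density: Density,
                   alpha: float) -> FrozenSet[Hashable]:
    """Cut X down to a set A with A = {x in X : P_A(x) >= alpha}."""
    full = frozenset(space)
    current = full
    # For a monotone density the sets only shrink, so at most |X| + 1
    # passes are needed to reach the fixed point.
    for _ in range(len(full) + 1):
        nxt = frozenset(x for x in full if density(current, x) >= alpha)
        if nxt == current:
            break
        current = nxt
    return current


def count_within_one(subset, x):
    """Toy density (assumption): points of `subset` within distance 1 of x."""
    return sum(1 for y in subset if abs(x - y) <= 1.0)


points = [0.0, 0.5, 1.0, 1.5, 10.0]
print(sorted(alpha_envelope(points, count_within_one, alpha=2)))
# The isolated point 10.0 is filtered out as noise.
```

With a level that no point can sustain (for example `alpha=4` on the same data), the cutting ends in the empty set, which corresponds to a trivial result.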

2.2.1. Topological Digression

Two points $x$ and $y$ in $A$ are called $r$-connected if there exists a chain of points $x_0, \ldots, x_n$ in $A$ that are $r$-close to each other, with the beginning $x_0 = x$, the end $x_n = y$, and distances $d(x_i, x_{i+1}) \le r$, $i = 0, \ldots, n-1$. The $r$-connectivity relation is an equivalence that splits the set $A$ into $r$-connectivity components, which, depending on the context, will be denoted $c_A(k) = c_rA(k)$, $k = 1, \ldots, k^*$, $k^* = k^*(A, r)$. Let us denote their collection by $C(A) = C_r(A) = \{c_A = c_rA(k) \mid k = 1, \ldots, k^*\}$. Thus:
$$A = \bigvee_{c_A \in C(A)} c_A = \bigvee_{k=1}^{k^*} c_A(k), \tag{5}$$
where the sign ∨ denotes a disjoint union of sets.
For $x \in X$, $A \subseteq X$, we denote by $D_A(x, r)$ the closed ball in $A$ centered at $x$ with radius $r$: $D_A(x, r) = \{a \in A : d(x, a) \le r\}$.
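The partition into $r$-connectivity components can be sketched as a plain graph search. The function name `r_components` and the sample data are hypothetical; the points are assumed to be sortable so that the output is deterministic.

```python
def r_components(points, dist, r):
    """Split `points` into r-connected components (lists of points)."""
    points = list(points)
    unvisited = set(range(len(points)))
    components = []
    while unvisited:
        seed = unvisited.pop()
        comp, stack = [seed], [seed]
        while stack:
            i = stack.pop()
            # All not-yet-assigned points within distance r of point i.
            near = [j for j in unvisited if dist(points[i], points[j]) <= r]
            for j in near:
                unvisited.remove(j)
            comp.extend(near)
            stack.extend(near)
        components.append(sorted(points[i] for i in comp))
    return sorted(components)


line = [0, 1, 2, 5, 6, 10]
print(r_components(line, lambda a, b: abs(a - b), r=1))
# [[0, 1, 2], [5, 6], [10]]
```

Increasing `r` merges components; for `r=3` the first two groups above become one, because the gap between 2 and 5 is then bridged.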
Definition 3.
Let $P$ be a density on $X$ and $r > 0$. The density $P$ is called $r$-local if
$$\forall x \in X,\ \forall A \subseteq X: \quad P_A(x) = P_{D_A(x, r)}(x).$$
Statement 2.
If the density $P$ is $r$-local, then every $r$-connected component of the set $X(\alpha)$ is $\alpha$-perfect.
Proof. 
According to (2), for any $k = 1, \ldots, k^*$, it is necessary to establish the equality of the component $c_rX(\alpha)(k)$ with the set $\{x \in X : P_{c_rX(\alpha)(k)}(x) \ge \alpha\}$, which we denote by $c_rX(\alpha)(k)^1$. For brevity, write $c(k) = c_rX(\alpha)(k)$ and $c(k)^1 = c_rX(\alpha)(k)^1$.
The inclusion $c(k) \subseteq c(k)^1$. If $x \in c(k)$, then $x \in X(\alpha)$ and hence $P_{X(\alpha)}(x) \ge \alpha$. Furthermore, $D_{X(\alpha)}(x, r) = D_{c(k)}(x, r)$, since $d(x, c(\bar{k})) > r$ for $\bar{k} \ne k$. Due to the $r$-locality of $P$,
$$P_{c(k)}(x) = P_{D_{c(k)}(x, r)}(x) = P_{D_{X(\alpha)}(x, r)}(x) = P_{X(\alpha)}(x) \ge \alpha \ \Rightarrow\ x \in c(k)^1.$$
The inclusion $c(k) \supseteq c(k)^1$. Let $x \notin c(k)$.
First case: $(x \notin c(k)) \wedge (x \notin X(\alpha))$. Then $P_{c(k)}(x) \le P_{X(\alpha)}(x) < \alpha$, so $x \notin c(k)^1$.
Second case: $(x \notin c(k)) \wedge (x \in X(\alpha))$. Then $x \in c(\bar{k})$ for some $\bar{k} \ne k$ and $d(x, c(k)) > r$, and therefore $D_{c(k)}(x, r) = \emptyset$.
Taking into account the normalization of the density $P$ and its $r$-locality, $P_{c(k)}(x) = 0$. Thus, in this case, too, $x \notin c(k)^1$. □
Definition 4. 
The DPS-algorithm depends on three parameters: the radius $r$, an $r$-local density $P$, and the level $\alpha$: $\mathrm{DPS} = \mathrm{DPS}(P, \alpha, r)$. It has two stages:
1. The first stage of DPS is the construction, for the FMS $(X, d)$ and on the basis of the $r$-local density $P$, of its $\alpha$-envelope $X(\alpha)$:
$$\mathrm{DPS}(P, \alpha, r):\ X \to X(\alpha).$$
2. The second stage of DPS is the partitioning of the $\alpha$-envelope $X(\alpha)$ into $r$-connected components:
$$\mathrm{DPS}_c(P, \alpha, r):\ X \to C(X(\alpha)) \in 2^{2^X}.$$
Further in the text, DPS ( P , α , r ) denotes, depending on the context, either the algorithm itself or its first stage.
Let us present a flowchart of the DPS-algorithm (Figure 2).
Let us summarize the above from the perspective of cluster analysis: the first stage, $\mathrm{DPS}(P, \alpha, r)$, carves from the original space $X$ the maximal subset $X(\alpha)$ that is dense against the general background; the second stage, $\mathrm{DPS}_c(P, \alpha, r)$, partitions $X(\alpha)$ into $r$-connectivity components $c_{X(\alpha)}$, each of which combines density against the background with connectivity, i.e., formally expresses empirical clustering [5].
Methods for choosing the parameters $\alpha$ and $r$ have been developed for DPS-clustering and are described in detail in [5] (see also Section 4.3.1). In particular, the density level $\alpha$ is determined through its extremality level $\beta(\alpha)$, which answers, on the scale $[-1, 1]$, the question: “To what extent can $\alpha$ be considered large against the background of all values of the density $P$ on $X$?”.
The construction of $\beta(\alpha)$ uses a fuzzy comparison $n$ of non-negative numbers, which expresses on the scale $[-1, 1]$ the degree of superiority of the larger of them over the smaller one (the left part of (7)):
$$\beta(\alpha) = \frac{\sum_{x \in X} n(P_X(x), \alpha)}{|X|}, \qquad \frac{\sum_{x \in X} n(P_X(x), \alpha(\beta))}{|X|} = \beta. \tag{7}$$
Due to the properties of $n$, the correspondence $\alpha \to \beta(\alpha)$ is one-to-one, and the inverse dependence $\beta \to \alpha(\beta)$ is given implicitly by the right part of (7). Thus, the DPS algorithm has a second parameterization, which will be used below:
$$X(\beta) \equiv X(\alpha(\beta)), \qquad \mathrm{DPS}(P, \beta, r) \equiv \mathrm{DPS}(P, \alpha(\beta), r), \qquad \mathrm{DPS}_c(P, \beta, r) \equiv \mathrm{DPS}_c(P, \alpha(\beta), r). \tag{8}$$
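The passage between the two parameterizations can be sketched numerically. The sketch assumes the fuzzy comparison $n(t, s) = (s - t)/(s + t)$, which the paper uses later in Example 1; the function names `fuzzy_compare`, `extremality` and `alpha_for`, the bisection scheme and the sample densities are hypothetical illustrations, not the authors' implementation.

```python
def fuzzy_compare(t, s):
    """n(t, s) in [-1, 1]: degree to which s exceeds t (0 if both are 0)."""
    return 0.0 if s + t == 0 else (s - t) / (s + t)


def extremality(densities, alpha):
    """beta(alpha): mean fuzzy comparison of alpha with all density values."""
    return sum(fuzzy_compare(p, alpha) for p in densities) / len(densities)


def alpha_for(densities, beta, steps=200):
    """Invert beta(alpha) by bisection; assumes -1 < beta < 1."""
    lo, hi = 0.0, max(max(densities), 1.0)
    while extremality(densities, hi) < beta:   # widen until the root is bracketed
        hi *= 2.0
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if extremality(densities, mid) < beta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


values = [1.0, 2.0, 3.0, 4.0]          # hypothetical values P_X(x)
beta = extremality(values, 2.0)        # level 2.0 seen on the [-1, 1] scale
print(beta, alpha_for(values, beta))   # the round trip recovers alpha = 2.0
```

Bisection is sufficient here because, for this choice of $n$, $\beta(\alpha)$ increases with $\alpha$.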

2.2.2. SDPS-Algorithm

Historically, the first in the series of DPS-algorithms was the set-theoretic SDPS. It is based on the density $S$, called “Number of points” (“Amount of space”) [2,3,6], which conveys the degree of concentration of the space $X$ around each of its points $x$ (the most natural understanding of the density of $X$ at $x$).
The density $S_A(x)$ depends on the localization radius $r$ and a non-negative parameter $p$, which takes into account the distance to $x$ within the ball $D_A(x, r)$:
$$S_A(x) = S_A(x \mid p) = \sum_{y \in D_A(x, r)} \left(1 - \frac{d(x, y)}{r}\right)^{p}.$$
With $p = 0$, we obtain the usual number of points, which explains the name of $S$:
$$S_A(x \mid 0) = |D_A(x, r)|.$$
The density $S$ is $r$-local, and the SDPS-algorithm is the DPS scheme run on $S$: $\mathrm{SDPS} = \mathrm{DPS}(S, \beta, r)$.
The results of SDPS are condensations in $X$: sets that locally contain “much of $X$”. Formally, they are the most consistent with empirical clusters. By varying the SDPS parameters, one can obtain a fairly complete idea of the hierarchy of clusters in $X$.
The SDPS-algorithm will serve as a testing ground where new results for the DPS-series algorithms will be tested and shown.
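As an illustration, the density $S$ can be computed directly from its definition. The function name `s_density` and the sample points are hypothetical; the convention $0^0 = 1$ (which Python's float power follows) is assumed for points lying exactly on the boundary of the ball when $p = 0$.

```python
def s_density(subset, x, dist, r, p=0.0):
    """S_A(x | p): weighted count of the points of A in the ball D_A(x, r)."""
    total = 0.0
    for y in subset:
        d = dist(x, y)
        if d <= r:                       # only the ball D_A(x, r) contributes
            total += (1.0 - d / r) ** p  # weight decays with distance when p > 0
    return total


def absdiff(a, b):
    return abs(a - b)


A = [0.0, 0.5, 1.0, 3.0]
print(s_density(A, 0.0, absdiff, r=1.0, p=0.0))  # 3.0: plain number of points
print(s_density(A, 0.0, absdiff, r=1.0, p=1.0))  # 1.5: nearer points weigh more
```

Because only points within distance `r` of `x` contribute, restricting `A` to the ball around `x` does not change the value, which is exactly the $r$-locality property.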

3. Results: Iterative DPS

The first stage of the $\mathrm{DPS}(P, \beta, r)$ algorithm carves from the space $X$ a maximal perfect subset $X(\beta)$ that is $\beta$-extremely $P$-dense against the background of $X$ at each of its points.
Let us turn to Figure 3b: it shows in red the result $X(\beta)$ of the DPS algorithm on the array $X$ (Figure 3a) for $\beta = 0.02$. It is easy to see that not all noteworthy condensations of $X$ were included. The reason is a contradiction between the $r$-local character of the view of $X$ and the global way in which the level $\alpha(\beta)$ of the density $P$ is determined from its extremality level $\beta$, using the whole image $P_X(X)$ (see (7) and (8)): the density level of the worthy condensations in the complement $\overline{X(\beta)}$ turns out to be below $\alpha(\beta)$.
To include them in the final result, we can lower the extremality level, $\beta \to \bar{\beta} < \beta$, and switch from the $\mathrm{DPS}(P, \beta, r)$ to the $\mathrm{DPS}(P, \bar{\beta}, r)$ algorithm. In the control example, the first level $\bar{\beta}$ for which the result $X(\bar{\beta})$ includes all worthy condensations is $\bar{\beta} = -0.25$. The result is shown in red in Figure 3c: along with $X(\beta)$ and the worthy condensations from $\overline{X(\beta)}$, the result $X(\bar{\beta})$ of the algorithm $\mathrm{DPS}(P, \bar{\beta}, r)$ also includes weak points of the $r$-halo of the set $X(\beta)$, which were admitted with the help of $X(\beta)$ itself.
Alternatively, one can keep the extremality level $\beta$ but change the original space, passing from $X$ to $\overline{X(\beta)}$. The result of $\mathrm{DPS}(P, \beta, r)$ on $\overline{X(\beta)}$ is shown in green in Figure 3b. One more transition of the same kind yields the blue cluster in Figure 3b.
This way of using the DPS scheme is called the “iterative DPS” algorithm.

3.1. Iterations by Extremality

Recall that the level $\alpha(\beta)$ of the density $P$ corresponding to the extremality level $\beta$ is determined in the algorithm $\mathrm{DPS}(P, \beta, r) = \mathrm{DPS}(P, \alpha(\beta), r)$ from Equation (8). Let us introduce the partition $X = X(\beta) \vee \overline{X(\beta)}$. According to (7) and (8),
$$\beta = n(P_X(X), \alpha(\beta)) = \frac{\sum_{x \in X(\beta)} n(P_X(x), \alpha(\beta)) + \sum_{x \in \overline{X(\beta)}} n(P_X(x), \alpha(\beta))}{|X|} = \frac{|X(\beta)|}{|X|} \cdot \frac{\sum_{x \in X(\beta)} n(P_X(x), \alpha(\beta))}{|X(\beta)|} + \frac{|\overline{X(\beta)}|}{|X|} \cdot \frac{\sum_{x \in \overline{X(\beta)}} n(P_X(x), \alpha(\beta))}{|\overline{X(\beta)}|}.$$
By convexity, this implies the inequality
$$\max\left(\frac{\sum_{x \in X(\beta)} n(P_X(x), \alpha(\beta))}{|X(\beta)|},\ \frac{\sum_{x \in \overline{X(\beta)}} n(P_X(x), \alpha(\beta))}{|\overline{X(\beta)}|}\right) \ge \beta.$$
However, the left mean is $\le 0$, because $\forall x \in X(\beta)$: $P_X(x) \ge P_{X(\beta)}(x) \ge \alpha(\beta)$. Hence, if $\beta \ge 0$, the right mean must be $\ge \beta$:
$$\frac{\sum_{x \in \overline{X(\beta)}} n(P_X(x), \alpha(\beta))}{|\overline{X(\beta)}|} \ge \beta.$$
Then, from the inequalities $P_{\overline{X(\beta)}}(x) \le P_X(x)$, $\forall x \in X$, and the properties of the fuzzy comparison $n$, we have:
$$n\left(P_{\overline{X(\beta)}}(\overline{X(\beta)}), \alpha(\beta)\right) = \frac{\sum_{x \in \overline{X(\beta)}} n(P_{\overline{X(\beta)}}(x), \alpha(\beta))}{|\overline{X(\beta)}|} \ge \frac{\sum_{x \in \overline{X(\beta)}} n(P_X(x), \alpha(\beta))}{|\overline{X(\beta)}|} \ge \beta.$$
It follows that the density level $\alpha_1(\beta)$ required for the operation of the $\mathrm{DPS}(P, \beta, r)$ algorithm on the space $\overline{X(\beta)}$,
$$n\left(P_{\overline{X(\beta)}}(\overline{X(\beta)}), \alpha_1(\beta)\right) = \beta,$$
does not exceed the level $\alpha(\beta)$, which we regard as the zero level, $\alpha(\beta) = \alpha_0(\beta)$, and can be strictly less than it. This, in turn, means that the result of this operation, i.e., the set $\overline{X(\beta)}(\beta)$, may be nontrivial. Let us denote it by $X_1(\beta)$.
If $X_1(\beta)$ is nontrivial, one can define the first iteration $X(\beta, 1)$ of the set $X(\beta)$ with respect to extremality, where $X(\beta)$ itself is naturally regarded as the zero iteration: $X(\beta, 0) = X(\beta)$.
Definition 5.
Let $X_1(\beta)$ be nontrivial. The first iteration of the space $X$ with respect to the extremality level $\beta$ is the disjoint union
$$X(\beta, 1) = X(\beta, 0) \vee X_1(\beta).$$
Note 1.
Given this definition, the above can be understood as the first precondition for the existence of the first iteration by extremality when $\beta$ is non-negative. In this connection, the zero level of $\beta$ appears to be the most productive and interesting: at it, the second precondition (nontriviality of the result) is the weakest.
A direct check shows that $P_{X(\beta, 1)}(x) \ge \alpha_1(\beta)$ $\forall x \in X(\beta, 1)$. Therefore, repeating the above deduction with the replacements $\alpha_0(\beta) \to \alpha_1(\beta)$, $X(\beta, 0) \to X(\beta, 1)$, we obtain the level $\alpha_2(\beta) \le \alpha_1(\beta)$ of the $\mathrm{DPS}(P, \beta, r)$ algorithm on the complement $\overline{X(\beta, 1)}$ and the possible nontriviality of its result $X_2(\beta) = \overline{X(\beta, 1)}(\beta)$.
If this can be continued up to and including the $i$-th step, i.e., the sets $X_1(\beta), \ldots, X_i(\beta)$ are all nontrivial, then their union with $X(\beta, 0) = X(\beta)$ is called the $i$-th iteration of $X$ by the extremality level $\beta$:
$$X(\beta, i) = X(\beta, 0) \vee X_1(\beta) \vee \cdots \vee X_i(\beta). \tag{10}$$
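The iteration by extremality can be sketched as a loop that repeatedly runs the first DPS stage on what remains of the space. Everything below is a simplified illustration under assumptions: the fuzzy comparison is taken as $n(t, s) = (s - t)/(s + t)$ (as in Example 1 of the paper), the level $\alpha_i(\beta)$ is found by bisection, and the names `level_for`, `envelope`, `iterate_dps` and the toy counting density are hypothetical.

```python
def n(t, s):
    """Fuzzy comparison of non-negative numbers (assumed form)."""
    return 0.0 if s + t == 0 else (s - t) / (s + t)


def level_for(values, beta):
    """Solve mean(n(P, alpha)) = beta for alpha by bisection (-1 < beta < 1)."""
    mean = lambda a: sum(n(p, a) for p in values) / len(values)
    lo, hi = 0.0, max(max(values), 1.0)
    while mean(hi) < beta:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mean(mid) < beta else (lo, mid)
    return hi


def envelope(space, density, alpha):
    """First DPS stage: iterative cutting down to the alpha-perfect subset."""
    current = space
    for _ in range(len(space) + 1):
        nxt = frozenset(x for x in space if density(current, x) >= alpha)
        if nxt == current:
            break
        current = nxt
    return current


def iterate_dps(space, density, beta, max_iter):
    """Return the layers X(beta), X_1(beta), ..., at most max_iter + 1 of them."""
    remaining, layers = frozenset(space), []
    for _ in range(max_iter + 1):
        if not remaining:
            break
        values = [density(remaining, x) for x in remaining]
        layer = envelope(remaining, density, level_for(values, beta))
        if not layer:            # trivial result: the iterations stop here
            break
        layers.append(layer)
        remaining = remaining - layer
    return layers


count = lambda subset, x: sum(1 for y in subset if abs(x - y) <= 1.0)
dense = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
sparse = [10.0, 10.5, 11.0]
noise = [20.0, 30.0]
layers = iterate_dps(dense + sparse + noise, count, beta=0.0, max_iter=1)
print([sorted(layer) for layer in layers])
```

On this toy array the zero iteration keeps only the dense group, and the first iteration, run on the complement with a lower level, recovers the sparser group.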

3.2. Iterations of the DPS Algorithm

Considering the algorithms $\mathrm{DPS}(P, \beta, r)$ and $\mathrm{DPS}_c(P, \beta, r)$ as their zero iterations, $\mathrm{DPS}^0(P, \beta, r) = \mathrm{DPS}(P, \beta, r)$, $\mathrm{DPS}_c^0(P, \beta, r) = \mathrm{DPS}_c(P, \beta, r)$, we define their $i$-th iterations as the processes of building, for the space $X$, its $i$-th iteration $X(\beta, i)$ and then breaking it into $r$-connectivity components:
$$\mathrm{DPS}^i(P, \beta, r):\ X \to X(\beta, i), \qquad \mathrm{DPS}_c^i(P, \beta, r):\ X \to C_r(X(\beta, i)).$$
Because the decomposition (10) is disjoint, the iteration index $\operatorname{ind}$ is well defined on the $i$-th iteration $X(\beta, i)$:
$$x \in X(\beta, i) \ \Rightarrow\ \operatorname{ind}(x) = \begin{cases} 0, & \text{if } x \in X(\beta), \\ \bar{i} \in [1, i], & \text{if } x \in X_{\bar{i}}(\beta). \end{cases}$$
The index $\operatorname{ind}$ carries over from points to connectivity components $c = c_{X(\beta, i)}$, thus becoming multivalued:
$$\operatorname{ind}(c) = \{\operatorname{ind}(x) : x \in c\}.$$
This, in turn, makes it possible to define, for any subset $I \subseteq [0, i]$, a conditional version $\mathrm{DPS}^I$ of the algorithm (with $X_0(\beta) = X(\beta)$):
$$\mathrm{DPS}^I(P, \beta, r) = \bigvee_{\bar{i} \in I} X_{\bar{i}}(\beta), \qquad \mathrm{DPS}_c^I(P, \beta, r) = \{c \in C(X(\beta, i)) : \operatorname{ind}(c) \subseteq I\}.$$
Example 1.
Figure 4b–d show the sets $X_i(\beta)$, $i = 0, 1, 2$, for the array $X$ shown in Figure 4a, obtained by the SDPS-algorithm at $\beta = 0.09$ with the fuzzy comparison $n(t, s) = (s - t)/(s + t)$ of non-negative numbers $s$ and $t$ [1].
Figure 5 shows both stages of the $\mathrm{SDPS}^1$-algorithm. Its result $\mathrm{SDPS}_c^1(X)$ indicates that, in the general case, the second stage of DPS cannot be considered final: many of the $r$-connectivity components in Figure 5b need to be joined further. This is done by the third stage of DPS; its results on the given array $X$ for the given SDPS-algorithm are shown in Example 4.

4. Results: Third Stage of DPS

The first stage of the algorithm, $\mathrm{DPS} \equiv \mathrm{DPS}(P, \alpha, r)$, cuts out of the original FMS $X$ the maximum $\alpha$-perfect subset $X(\alpha)$: $\mathrm{DPS}(X) = X(\alpha)$. The second stage, $\mathrm{DPS}_c \equiv \mathrm{DPS}_c(P, \alpha, r)$, has so far been considered the final result; it is the set of all $r$-connectivity components of the set $X(\alpha)$: $\mathrm{DPS}_c(X) = C_rX(\alpha)$. Each component $c_rX(\alpha)$ is considered to be a DPS-cluster because, the density $P$ being $r$-local by assumption, the component is a DPS-set in $X$ (Statement 2) and, in particular, is $\alpha$-dense at each of its points.
In what follows, the result $\mathrm{DPS}_c(X)$ is taken as given and is not subject to further transformation. The radius $r$ is assumed to be infinitely small, and the level $\alpha$ of $P$ infinitely large. As a consequence, each component $c_{X(\alpha)}$ is considered to be a single, indivisible spot (a large point).
The spots $c_{X(\alpha)}$ are interpreted as fragmentary manifestations (outcrops) of global anomalous entities in $X$. To understand the true scale of these entities, the spots have to be joined further where possible. This is the third and final stage of the DPS-algorithm.

4.1. Logic and Action Plan

A collection of spots $C \subseteq \mathrm{DPS}_c(X)$ is considered by the expert as a whole if the advantage in closeness of the internal transitions (paths) between any of its spots over the external transitions from $C$ to $\bar{C}$ allows the expert to conclude that $C$ is non-random.
Thus, the spots of the cluster $C$ must be connected to each other more tightly than is typical against the general background of all spots from $\mathrm{DPS}_c(X)$. There are no other constraints on $C$, in particular on its shape.
Because the spots $c_{X(\alpha)}(k)$ are $\alpha$-perfect, it is natural to take the minimum distance between their closest points as the distance between them:
$$d(c_{X(\alpha)}(k), c_{X(\alpha)}(\bar{k})) = \min\{d(y, \bar{y}) : y \in c_{X(\alpha)}(k),\ \bar{y} \in c_{X(\alpha)}(\bar{k})\}. \tag{13}$$
It is no longer a metric, since the triangle inequality fails for $d$ (13), as the example below shows:
Example 2.
Consider subsets of the real line: $A = [1, 1]$, $B = [2, 3]$, $C = [4, 4]$. Obviously, $d(A, B) + d(C, B) < d(A, C)$, since $d(A, B) = |2 - 1| = 1$, $d(B, C) = |4 - 3| = 1$, and $d(A, C) = |4 - 1| = 3$.
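The failure of the triangle inequality in Example 2 is easy to check numerically. In this sketch the interval $B$ is represented by a few sample points including its endpoints, which is enough because the minimum is attained at the endpoints; the name `set_distance` is hypothetical.

```python
def set_distance(a, b, dist):
    """Minimum distance between the closest points of two finite sets."""
    return min(dist(x, y) for x in a for y in b)


absdiff = lambda x, y: abs(x - y)
A = [1.0]                 # the degenerate interval [1, 1]
B = [2.0, 2.5, 3.0]       # sample of the interval [2, 3], endpoints included
C = [4.0]                 # the degenerate interval [4, 4]

d_ab = set_distance(A, B, absdiff)
d_bc = set_distance(B, C, absdiff)
d_ac = set_distance(A, C, absdiff)
print(d_ab, d_bc, d_ac)   # 1.0 1.0 3.0
print(d_ab + d_bc < d_ac) # True: the triangle inequality is violated
```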
The expert's logic for joining spots is formalized on the basis of the distance $d$: in each cluster of spots $C$, any two spots can be connected by a chain of spots in which the distance $d$ between neighboring links is smaller than the distance $d$ from $C$ to $\bar{C}$. This is the $d$-clustering condition, which is necessary for $C$. Its formal analysis is the first part of the program for algorithmizing the third stage of DPS. Note that $d$-clustering and DPS-clustering are different interpretations of clustering: the first is based on connectivity, and the second on limiting.
The second part of this program consists in constructing, from the initial distance $d$ in the original space $X$, measures of proximity and remoteness between spots that express the expert's view. Based on these measures, only those $d$-clusters of spots are selected in which the advantage of the closeness of internal transitions over the remoteness of external ones is, in the expert's opinion, non-random.

4.2. First Part: The Theory of d-Clusters

Initial data and notation: $X$ is a finite set, and $d$ is a quasi-metric on $X$ (so that $(X, d)$ is a quasi-metric space):
$$\forall x \in X:\ d(x, x) = 0; \qquad \forall x, y \in X:\ d(x, y) = d(y, x) \ge 0.$$
Definition 6.
The enumeration $X(x)$ of the space $X$ with origin at the point $x$ is a sequence $x = x_0, \ldots, x_{|X|-1}$ such that, for any $i = 0, \ldots, |X| - 2$,
$$d(x_{i+1}, X_i(x)) = d\left(X_i(x), \overline{X_i(x)}\right), \tag{14}$$
where $X_i(x) = \{x_0, \ldots, x_i\}$.

4.2.1. The Notation

$d(x)$ is the numerical sequence of distances $d_i(x) = d\left(X_{i-1}(x), \overline{X_{i-1}(x)}\right)$:
$$d(x) = \{d_1(x), \ldots, d_{|X|-1}(x)\}. \tag{15}$$
Definition 7.
An eigen $d$-cluster $C = C_k(x)$, $1 \le k \le |X| - 2$, with center (origin) at $x$ is a proper (eigen) segment $X_k(x)$ of the enumeration $X(x)$ for which $\max_{i=1,\ldots,k} d_i(x) < d_{k+1}(x)$:
$$C = C_k(x) \text{ is an eigen } d\text{-cluster} \iff C = X_k(x) \ \text{and}\ \max_{i=1,\ldots,k} d_i(x) < d_{k+1}(x). \tag{16}$$
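Definitions 6 and 7 can be sketched as a nearest-neighbor growth procedure (similar in spirit to Prim's algorithm for minimum spanning trees, which is our analogy rather than the paper's). The function names `enumerate_from` and `eigen_d_clusters` and the sample points are hypothetical; the points are assumed to be distinct and hashable, and ties are broken by input order.

```python
def enumerate_from(points, dist, start):
    """Return (order, gaps): the enumeration X(x) and the sequence d(x).

    order[i] is x_i, and gaps[i - 1] is d_i(x), the distance from the
    segment X_{i-1}(x) to its complement at the moment x_i is attached.
    """
    order = [start]
    rest = [p for p in points if p != start]
    nearest = {p: dist(start, p) for p in rest}  # distance to the current segment
    gaps = []
    while rest:
        nxt = min(rest, key=nearest.__getitem__)
        gaps.append(nearest[nxt])
        order.append(nxt)
        rest.remove(nxt)
        for p in rest:
            nearest[p] = min(nearest[p], dist(nxt, p))
    return order, gaps


def eigen_d_clusters(order, gaps):
    """Segments X_k(x), 1 <= k <= |X| - 2, with max_{i<=k} d_i < d_{k+1}."""
    clusters = []
    for k in range(1, len(order) - 1):
        if max(gaps[:k]) < gaps[k]:
            clusters.append(order[:k + 1])
    return clusters


pts = [0, 1, 2, 10, 11, 30]
order, gaps = enumerate_from(pts, lambda a, b: abs(a - b), start=0)
print(order, gaps)                    # [0, 1, 2, 10, 11, 30] [1, 1, 8, 1, 19]
print(eigen_d_clusters(order, gaps))  # [[0, 1, 2], [0, 1, 2, 10, 11]]
```

A jump in the sequence of gaps that exceeds everything seen before marks the boundary of an eigen $d$-cluster, which is why both `[0, 1, 2]` and `[0, 1, 2, 10, 11]` are reported for this origin.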
Statement 3.
The cluster $C_k(x)$ is independent of the choice of origin within itself:
$$C_k(x) = C_k(y) \qquad \forall y \in C_k(x) \setminus x.$$
In other words, in the case of $d$-clustering, there is a set-theoretic equality $X_k(x) = X_k(y)$; that is, for any $y \in C_k(x)$, the first $k + 1$ terms in the enumeration $X(y)$ lie in $C_k(x)$.
Proof. 
Induction on the number $i$ of steps $y_{i-1} \to y_i$, $1 \le i \le k$, in the enumeration $X(y)$. Let $y = x_j$, $j = 1, \ldots, k$.
$i = 1$. We have to show that $y_1 \in C_k(x)$. Since $y \in C_k(x)$, we have
$$d\left(y, \overline{C_k(x)}\right) \ge d\left(C_k(x), \overline{C_k(x)}\right) = d_{k+1}(x).$$
The element $y = x_j$ in $C_k(x) = X_k(x)$ has two neighbors, $x_{j-1}$ and $x_{j+1}$, if $1 \le j < k$, and one neighbor, $x_{k-1}$, if $j = k$ and $y = x_k$. The distance from $y$ to any of these neighbors is less than $d_{k+1}(x)$, so the next element after $y = y_0$ in the sequence $X(y)$ necessarily lies in $C_k(x)$.
Suppose that the first $i$ steps $y_1, \ldots, y_i$, $i \le k - 1$, of the sequence $X(y)$ lie in $C_k(x)$. Let us show that the element $y_{i+1}$ also lies in $C_k(x)$.
Under our assumption, $X_i(y) \subseteq C_k(x)$, therefore
$$d\left(X_i(y), \overline{C_k(x)}\right) \ge d\left(C_k(x), \overline{C_k(x)}\right) = d_{k+1}(x).$$
Let $y_m = x_{s(m)}$, $m = 0, \ldots, i$. If $s_0 < \cdots < s_i$ is the increasing ordering of the array $s(m)|_0^i$, then two cases are possible:
1st: the array $s_p|_0^i$ has no gaps, that is, it is a proper (eigen) segment in $[0, k]$. In this case, $C_k(x)$ necessarily contains either an element $x_{s_0 - 1}$ or an element $x_{s_i + 1}$ whose distance from $X_i(y)$ is nontrivial and less than $d_{k+1}(x)$. Hence, the term $y_{i+1}$ of the sequence $X(y)$ necessarily lies in $C_k(x)$.
2nd: the array $s_p|_0^i$ has gaps. Let $p^*$ be the first number for which $s_{p^*+1} - s_{p^*} > 1$. In this case, the array $s_p|_0^i$ does not contain the number $s_{p^*} + 1$, and the set $X_i(y)$ does not contain the element $x_{s_{p^*}+1}$, whose distance from $x_{s_{p^*}}$ is less than $d_{k+1}(x)$. Hence, in this case too, the term $y_{i+1}$ of the sequence $X(y)$ lies in $C_k(x)$. □
Consequence 1.
Eigen d-clusters behave like non-Archimedean balls: either one contains the other or they do not intersect. In particular, two d-clusters of the same order either coincide or do not intersect.
Proof. 
Let $C = C_k(x)$, $\bar{C} = C_{\bar{k}}(\bar{x})$ and $y \in C \cap \bar{C}$. According to Statement 3, $C = X_k(y)$, $\bar{C} = X_{\bar{k}}(y)$. If $k \le \bar{k}$, then $C \subseteq \bar{C}$. □
The $d$-clusters $C_k(x)$ are proper (eigen) subsets of $X$ consisting of more than one point. Let us remove this restriction on $k$ in Definition 7 using the notion of a connectivity index, which develops and continues the topic of finite connectivity (Section 2.2.1).
Definition 8. 
1. The connectivity index $r(A)$ of a subset $A$ is the minimum $r$ for which $A$ is $r$-connected:
$$r(A) = \min\{r : A \text{ is } r\text{-connected}\}.$$
2. The isolation index $e(A)$ of a subset $A$ is the distance to its complement $\bar{A}$:
$$e(A) = d(A, \bar{A}).$$
Statement 4.
$$r(X_k(x)) = \max_{i=1,\ldots,k} d_i(x).$$
Proof.
1. Proof of the inequality $r(X_k(x)) \le \max_{i=1,\ldots,k} d_i(x)$. Induction on $k$, $1 \le k \le |X| - 2$.
  • $k = 1$: $X_1(x) = \{x_0, x_1\}$ and $d(x_0, x_1) = d_1(x)$ according to (14) and (15).
  • Suppose that $r(X_{k-1}(x)) \le \max_{i=1,\ldots,k-1} d_i(x)$, that is, any two points in $X_{k-1}(x)$ can be connected by a path whose transitions have length $\le \max_{i=1,\ldots,k-1} d_i(x)$. Let $x^*$ be the point in $X_{k-1}(x)$ closest to $x_k$. According to (15), $d_k(x) = d(x^*, x_k)$. Through $x^*$, any point in $X_{k-1}(x)$ can be connected to $x_k$ by a path whose transitions have length $\le \max_{i=1,\ldots,k} d_i(x)$.
2. Proof of the inequality $r(X_k(x)) \ge \max_{i=1,\ldots,k} d_i(x)$: let $d_{i^*}(x) = \max_{i=1,\ldots,k} d_i(x)$; then $d_{i^*}(x) = d\left(X_{i^*-1}(x), \overline{X_{i^*-1}(x)}\right)$, and any path with ends in $X_{i^*-1}(x)$ and $\overline{X_{i^*-1}(x)}$ has at least one jump between them, whose length is at least $d_{i^*}(x)$. Therefore, $r(X_k(x)) \ge d_{i^*}(x)$. □
For the eigen $d$-cluster $C$ (Definition 7), the proved statement means that the necessary condition
$$r(C) < e(C) \tag{17}$$
is fulfilled.
This condition depends neither on the origin of the enumeration nor on the enumeration itself, which has so far been required to define $C$. It is therefore natural to check whether this condition is also sufficient for $d$-clustering.
Statement 5.
If (17) is satisfied for a proper (eigen) subset $A$ of $X$, then $A$ is a $d$-cluster and $A = C_{|A|-1}(x)$ $\forall x \in A$.
Proof.
There exists $r_1$ with $r(A) \le r_1 < d(A, \bar{A})$ for which the set $A$ is $r_1$-connected. Hence, in the enumeration $X(x)$ of the space $X$ from any point $x \in A$, the elements of $\bar{A}$ appear only at the $|A|$-th step. □
Definition 8, Statement 4 and Statement 5 together yield a new, invariant and more complete definition of a $d$-cluster.
Definition 9.
A $d$-cluster in $X$ is a subset $C$ for which the inequality $r(C) < d(C, \bar{C})$ holds.
To the eigenclusters $C$ ($1 < |C| < |X|$) are thus added all points $x \in X$, $x = C_0(x)$, and the whole space $X$: $X = C_{|X|-1}(x)$ $\forall x \in X$.
Thus, the restrictions on $k$ in Definition 7 ($1 \le k \le |X| - 2$) are removed:
$x$ is a $d$-cluster, since $0 = r(x) < e(x)$ $\forall x \in X$;
$X$ is a $d$-cluster, since $r(X) < e(X) = \infty$: the absence of an exterior for $X$ means that the exterior is at infinity.
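Definition 9 gives a test that needs no enumeration of the whole space. A minimal sketch, assuming distinct hashable points: the connectivity index is computed as the largest jump made while growing the subset by nearest neighbors (Statement 4 applied to the subset as a space of its own), and an empty complement is treated as lying at infinity. The function names and sample data are hypothetical.

```python
import math


def connectivity_index(a, dist):
    """r(A): the smallest r for which A is r-connected."""
    a = list(a)
    if len(a) <= 1:
        return 0.0
    nearest = {p: dist(a[0], p) for p in a[1:]}
    worst = 0.0
    while nearest:
        nxt = min(nearest, key=nearest.get)
        worst = max(worst, nearest.pop(nxt))   # largest jump needed so far
        for p in nearest:
            nearest[p] = min(nearest[p], dist(nxt, p))
    return worst


def isolation_index(a, space, dist):
    """e(A) = d(A, complement of A); infinity if the complement is empty."""
    a = set(a)
    outside = [p for p in space if p not in a]
    if not outside:
        return math.inf
    return min(dist(x, y) for x in a for y in outside)


def is_d_cluster(a, space, dist):
    """Definition 9: r(A) < e(A)."""
    return connectivity_index(a, dist) < isolation_index(a, space, dist)


space = [0, 1, 2, 10, 11, 30]
absdiff = lambda x, y: abs(x - y)
print(is_d_cluster([0, 1, 2], space, absdiff))   # True:  r = 1 < e = 8
print(is_d_cluster([2, 10], space, absdiff))     # False: r = 8 > e = 1
print(is_d_cluster(space, space, absdiff))       # True:  e(X) is infinite
```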

4.2.2. Conclusions

The collection of d-clusters C ( X , d ) forms a hierarchy of sets on X based on the notion of discrete connectivity and different from traditional hierarchical cluster analysis hierarchies, usually binary [10].
C ( X , d ) is the first part of the formalization of the third stage of DPS.

4.3. Second Part: Final Selection of d-Clusters

Thanks to (14)–(16), the search for $d$-clusters in the space of $r$-connectivity components $\mathrm{DPS}_c(X)$ with respect to the distance (13), that is, the formation of the space $C(\mathrm{DPS}_c(X), d)$, is constructive. Each $d$-cluster $C \in C(\mathrm{DPS}_c(X), d)$ has two characteristics: the internal $r(C)$ and the external $e(C)$ (17). Based on this information, the expert $E$ must decide whether or not to combine the spots from $C$ into a single whole.
Let $\mathrm{DPS}_f(X) = \mathrm{DPS}_f(X \mid E)$ denote the set of maximal $d$-clusters in $\mathrm{DPS}_c(X)$ obtained after the joining of spots by expert $E$. According to Consequence 1, they are pairwise disjoint.
If $\mathrm{DPS}_f(X)$ is nontrivial, a partition of the set $X(\alpha)$ appears that is, in general, coarser than the partition (5) into $r$-connectivity components:
$$X(\alpha) = \bigvee\{C : C \in \mathrm{DPS}_f(X)\}\ \vee\ \bigvee\{c_{X(\alpha)} : \forall C \in \mathrm{DPS}_f(X)\ \ c_{X(\alpha)} \notin C\}.$$
We also denote this partition by $\mathrm{DPS}_f(X)$ and consider it the third and final stage of the DPS algorithm. The complete scheme is presented in Figure 6.
Thus, for each component $c_r X(\alpha)$, one of three things can happen:
  • $c_r X(\alpha) \in \mathrm{DPS}_f(X)$: the component $c_r X(\alpha)$ is sufficiently isolated from the rest to be of interest to the expert; it is the sole output on $X$ of the global entity behind it;
  • $c_r X(\alpha) \notin \mathrm{DPS}_f(X)$, but $c_r X(\alpha) \in C \in \mathrm{DPS}_f(X)$: the component $c_r X(\alpha)$ is part of a d-cluster $C$ that is of interest to the expert and that, taken together, represents a global entity in $X$;
  • $c_r X(\alpha) \notin C$ $\forall C \in \mathrm{DPS}_f(X)$: no d-cluster containing $c_r X(\alpha)$ is of interest to the expert, because it is either not dense enough internally or not separated enough externally. In the expert's view, if $c_r X(\alpha)$ is a fragment of the output of a global entity on $X$, then it is not clear enough.
Example 3 below shows that all three options for $c_r X(\alpha)$ are possible.
At stage 3.2 (Figure 6), the expert can act in different ways. We present the simplest, Boolean variant of actions for forming $\mathrm{DPS}_f(X)$.

4.3.1. Boolean Variant

Expert $E$ decides whether a d-cluster $C$ is included in the final result $\mathrm{DPS}_f(X)$ by comparing $r(C)$ and $e(C)$ with the proximity and distance thresholds $r_E$ and $e_E$: $C \in \mathrm{DPS}_f(X) \Leftrightarrow r(C) \le r_E$ and $e(C) \ge e_E$.
Expert $E$ considers the parameter $r$ of the DPS-algorithm to be very small (infinitesimal), much less than $r_E$, which, in turn, according to $E$, is much less than $e_E$:
$$r \ll r_E \ll e_E.$$
The threshold $r_E$, like the radius $r$ in the DPS-algorithm, is constructed as the Kolmogorov mean of the nontrivial distances of the FMS $X$ [5]:
$$r_E = \left( \frac{\sum_{x \ne y \in X} d(x, y)^{q}}{|X|\,(|X| - 1)} \right)^{1/q}, \qquad q = q(r_E).$$
For the parameter $r$, numerous applications of DPS-series algorithms have established that the choice $q(r) \in [-3, -2]$ can be considered optimal. The studies carried out within the framework of the present paper show that $q(r_E) \in [-2.5, -1.5]$. The overlap of the ranges of $q$ for $r$ and $r_E$ is explained both by the expert's fuzzy perception of proximity and by the diversity in the arrangement of an arbitrary FMS.
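The Kolmogorov averaging above can be sketched as a power mean of pairwise distances. The fragment below is an illustration under stated assumptions, not the authors' code: the metric is Euclidean, all points are distinct (so every distance is nontrivial), the exponent $q$ is nonzero and typically negative, and the function name is hypothetical. Averaging over unordered pairs gives the same value as the sum over ordered pairs $x \ne y$ divided by $|X|(|X| - 1)$.

```python
import math
from itertools import combinations


def kolmogorov_mean_distance(points, q):
    """Power mean of order q (q != 0) of all pairwise distances.

    Assumes distinct points: a zero distance combined with a negative q
    would raise ZeroDivisionError.
    """
    dists = [math.dist(a, b) for a, b in combinations(points, 2)]
    return (sum(d ** q for d in dists) / len(dists)) ** (1.0 / q)


# Toy data: pairwise distances are 1, 2 and 3.
points = [(0, 0), (1, 0), (3, 0)]
r_small = kolmogorov_mean_distance(points, -3.0)   # more negative q, smaller radius
r_large = kolmogorov_mean_distance(points, -2.0)
```

Because a power mean is non-decreasing in its order, choosing a more negative $q$ for $r$ than for $r_E$ is one way to obtain a radius that does not exceed the expert threshold.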
The threshold $e_E$ is obtained from $r_E$ by formalizing the expert judgment "$r_E \ll e_E$" using fuzzy comparison. For the comparison given in Example 1, this would be the inequality $e_E / r_E \ge 5/3$.
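The Boolean variant can be sketched as a simple filter. The fragment below assumes that a d-cluster is accepted when $r(C) \le r_E$ and $e(C) \ge e_E$ (the direction and non-strictness of the inequalities are an assumed reading), and that $\mathrm{DPS}_f(X)$ keeps only the accepted d-clusters that are maximal by inclusion. The data layout and the function name are hypothetical; component labels and the values of $r(C)$ and $e(C)$ are made up for illustration.

```python
def boolean_expert(d_clusters, r_E, e_E):
    """Return the maximal accepted d-clusters.

    d_clusters: list of (members, r, e), where members is a frozenset of
    component labels, r the internal and e the external characteristic.
    """
    accepted = [m for (m, r, e) in d_clusters if r <= r_E and e >= e_E]
    # Keep only clusters that are not a proper subset of another accepted one.
    return [m for m in accepted if not any(m < other for other in accepted)]


# Hypothetical thresholds with the ratio e_E / r_E = 5/3.
r_E = 1.5
e_E = r_E * 5 / 3

# Hypothetical hierarchy over four components a, b, c, d.
d_clusters = [
    (frozenset("a"), 0.0, 1.0),
    (frozenset("b"), 0.0, 1.0),
    (frozenset("ab"), 1.0, 4.0),   # a and b pass only together
    (frozenset("c"), 0.0, 4.0),    # c passes on its own
    (frozenset("d"), 0.0, 2.0),    # d is not separated enough
]
final = boolean_expert(d_clusters, r_E, e_E)
```

In this toy hierarchy, component c is accepted on its own, a and b are accepted only as a combined d-cluster, and d is not accepted at all, so the three outcomes for a component all occur.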
Example 3.
Array $X$ in Figure 7a has already been seen above: it was a testing ground for the DPS, DBSCAN, and OPTICS algorithms. Figure 7b–d show the complete DPS scenario with parameters $\beta = 0.25$, $q(r) = -2.7$, $q(r_E) = -2.3$, $e_E / r_E = 5/3$.
The result of cutting at the first stage (Figure 7b) was divided into eight r-connectivity components at the second stage (Figure 7c). Two components (blue and light brown) passed the third stage on their own, and two more passed it together as the green combination (Figure 7d). The remaining four brown components did not pass the third stage.
Example 4.
Under the conditions, notation and parameters of Example 1, Figure 8a–c show the original array and the two stages of the SDPS algorithm on it. Figure 8d–f show the complete scenario for its second iteration $\mathrm{SDPS}_2$.
A comparison of Figure 8c,f clearly illustrates the effect of the work done in the paper: the full version of the second iteration $\mathrm{SDPS}_2$ with the same parameters performs better than the incomplete zero version (the original SDPS algorithm) that existed before this article.

5. Discussion

The density $P$ on a finite space $X$ expresses, in the language of fuzzy mathematics, the property of being a limit point in $X$. From a formal point of view, $P$ is a fuzzy structure on the direct product $2^X \times X$, monotonic in the first argument (1).
Fixing a level $\alpha \in [0, 1]$ for $P$, we can define a closure operator on $2^X$ and, through it, a usual topology $\tau_\alpha$ on $X$. Thus, an increasing family of topologies $\{\tau_\alpha, \alpha \in [0, 1]\}$ arises on $X$, starting at zero with the trivially inseparable minimal topology, in which all points stick together, and ending with the trivially separable maximal topology of all subsets.
Each density on the space $X$ offers its own view of the space and its own program for studying it. This concept lies outside classical mathematics.
In addition to the density $S$ (Section 2.2.2), other nontrivial densities always exist in a finite metric space. The most important densities and their significance for data analysis are described in [5,11].
The study of the space $X$ with the help of a density $P$ defined on it began with the study of perfect sets in the topology $\tau_\alpha$. The explanation is as follows: setting $P$ on $X$ defines a clustering in $X$, where $K$ is a cluster ($P$-cluster) in $X$ if it consists of exactly those points that are $P$-limit to it. If $\alpha$ is the limit threshold, then the formal expression of this statement coincides with $\alpha$-perfection: $K = \{x \in X : P_K(x) \ge \alpha\}$.
The DPS-algorithm builds all such clusters, but there are too many of them for meaningful cluster analysis in $X$: first of all, we need the "connected components" of the maximal $\alpha$-perfect subset $X(\alpha)$ in $X$.
All of this is available when $X$ is a finite metric space (FMS) and the density $P$ is local on it. It is in this situation that the DPS-algorithm operates. It depends on the construction of the density $P$, its level $\alpha$ and the localization radius $r$: $\mathrm{DPS} = \mathrm{DPS}(P, \alpha, r)$. DPS operates in two stages: first, it cuts out the subset $X(\alpha)$ in $X$ (first stage), and then it splits $X(\alpha)$ into r-connected components, which it considers to be DPS-clusters in $X$ (second stage).
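The two stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the cut is obtained by repeatedly discarding points whose density with respect to the current set falls below $\alpha$, which for a density monotonic in its first argument ends in an $\alpha$-perfect set; two points are taken to be r-linked when their Euclidean distance is at most $r$; and `count_density` is a hypothetical local density, not the density $S$ of the paper.

```python
import math


def cut(points, density, alpha):
    """First-stage sketch: iterate K <- {x in K : P_K(x) >= alpha} until stable."""
    K = set(range(len(points)))
    while True:
        kept = {i for i in K if density(points, K, i) >= alpha}
        if kept == K:
            return K
        K = kept


def r_components(points, K, r):
    """Second-stage sketch: r-connected components of K (links of length <= r)."""
    unseen, components = set(K), []
    while unseen:
        start = unseen.pop()
        component, stack = {start}, [start]
        while stack:
            i = stack.pop()
            near = {j for j in unseen if math.dist(points[i], points[j]) <= r}
            unseen -= near
            component |= near
            stack.extend(near)
        components.append(component)
    return components


def count_density(r, norm):
    """Hypothetical local density: neighbours of x in K within r, scaled and capped at 1."""
    def P(points, K, i):
        n = sum(1 for j in K if j != i and math.dist(points[i], points[j]) <= r)
        return min(1.0, n / norm)
    return P


# Two unit squares plus two isolated noise points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 5), (20, 20)]
dense = cut(points, count_density(r=1.5, norm=3), alpha=0.5)
clusters = r_components(points, dense, r=1.5)
```

On this toy array the cut discards the two isolated points and the second stage returns the two squares as separate components.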
Despite generally good results and efficient applications [4,5,11,12,13], the DPS-algorithm in its current version has drawbacks. At the first stage, the cutting $X \to X(\alpha)$ is not always thorough and of high quality. At the second stage, the partition of $X(\alpha)$ into r-connectivity components is too fine-grained because the radius $r$ is local. A proper partition of $X(\alpha)$ must come from the global view of $X(\alpha)$ induced by the entire space $X$.
In the present paper, we have attempted to correct or mitigate these drawbacks. For the first stage, our answer is the iterative cutting scheme $\mathrm{DPS}_i(X)$; for the second, it is a third stage $\mathrm{DPS}_f(X)$ that connects r-connectivity components.
The iterations allow the DPS-algorithm to achieve completeness of its result $\mathrm{DPS}(X)$ at the first stage while maintaining a high level of extremality of cutting.
Until now, the result $\mathrm{DPS}_C(X)$ of the second stage of the DPS-algorithm has been considered its final result. The additional joining of components from $\mathrm{DPS}_C(X)$ proposed in the paper takes place in two steps. The first step, which is necessary in character, searches for d-clusters in $\mathrm{DPS}_C(X)$. According to the results of the article, this search is constructive. Being a d-cluster is a necessary condition for joining components from $\mathrm{DPS}_C(X)$ into a single whole. The set of d-clusters $\mathcal{C}(\mathrm{DPS}_C(X), d)$ serves as the basis for the second step, which supplies the sufficient condition. The key role here belongs to the expert, whose criteria determine the final choice of $\mathrm{DPS}_f(X)$ within $\mathcal{C}(\mathrm{DPS}_C(X), d)$ and, at the same time, the final partition of $\mathrm{DPS}_C(X)$. There are various formal variants of this choice in DMA. The simplest of them is given in Section 4.3.1.
DPS-series algorithms are actively applied in many geological and geophysical studies (analysis of seismic catalogs, search for signals in geophysical records, the problem of radioactive waste disposal, etc.) [1,4,5,7,11,12,13]. The new version of DPS developed in the present paper is expected to improve these applications.
In the authors' view, the addition of the third stage makes the architecture of DPS sufficient. Further development of DPS-algorithms should proceed through their parameters: new constructions of densities and their connection with fuzzy logic will give (and already give [5]) the possibility of a deep study of finite metric spaces. Furthermore, the following circumstance is fundamental: in contrast to Euclidean space, the FMS $X$ is, as a rule, arranged differently in the vicinity of each of its points $x$. Therefore, the parameters $\alpha$, $r$ of the DPS-algorithm, which are local in nature, must depend on $x$: $r = r(x)$, $\alpha = \alpha(x)$. The iterative cutting scheme only mitigates this issue.

Author Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by S.R.B., B.A.D., B.V.D., D.A.K. and M.O.O. The first draft of the manuscript was written by S.M.A., and all authors commented on previous versions of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was conducted in the framework of budgetary funding of the Geophysical Center of RAS, adopted by the Ministry of Science and Higher Education of the Russian Federation.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Agayan, S.M.; Bogoutdinov, S.R.; Krasnoperov, R.I. Short introduction into DMA. Russ. J. Earth Sci. 2018, 18, 1–10.
  2. Agayan, S.M.; Bogoutdinov, S.R.; Dobrovolsky, M.N. On one algorithm for searching the dense areas and its geophysical applications. In Proceedings of the Materials of 15th Russian National Workshop “Mathematical Methods of Pattern Recognition, MMRO-15”, Petrozavodsk, Russia, 11–17 September 2011; pp. 543–546.
  3. Agayan, S.M.; Bogoutdinov, S.R.; Dobrovolsky, M.N. Discrete Perfect Sets and Their Application in Cluster Analysis. Cybern. Syst. Anal. 2014, 50, 176–190.
  4. Agayan, S.M.; Tatarinov, V.N.; Gvishiani, A.D.; Bogoutdinov, S.R.; Belov, I.O. FDPS algorithm in stability assessment of the Earth’s crust structural tectonic blocks. Russ. J. Earth Sci. 2020, 20, 1–14.
  5. Agayan, S.; Bogoutdinov, S.; Kamaev, D.; Kaftan, V.; Osipov, M.; Tatarinov, V. Theoretical Framework for Determination of Linear Structures in Multidimensional Geodynamic Data Arrays. Appl. Sci. 2021, 11, 11606.
  6. Dzeboev, B.A.; Gvishiani, A.D.; Agayan, S.M.; Belov, I.O.; Karapetyan, J.K.; Dzeranov, B.V.; Barykina, Y.V. System-Analytical Method of Earthquake-Prone Areas Recognition. Appl. Sci. 2021, 11, 7972.
  7. Agayan, S.M.; Losev, I.V.; Belov, I.O.; Tatarinov, V.N.; Manevich, A.I.; Pasishnichenko, M.A. Dynamic Activity Index for Feature Engineering of Geodynamic Data for Safe Underground Isolation of High-Level Radioactive Waste. Appl. Sci. 2022, 12, 2010.
  8. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, USA, 2–4 August 1996; AAAI Press: Palo Alto, CA, USA, 1996; pp. 226–231.
  9. Ankerst, M.; Breunig, M.; Kriegel, H.-P.; Sander, J. OPTICS: Ordering Points To Identify the Clustering Structure. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, USA, 1–3 June 1999; ACM Press: New York, NY, USA, 1999; pp. 49–60.
  10. Jambu, M. Hierarchical Cluster Analysis and Correspondences; Finansy i Statistica: Moscow, Russia, 1988.
  11. Agayan, S.; Bogoutdinov, S.; Soloviev, A.; Sidorov, R. The Study of Time Series Using the DMA Methods and Geophysical Applications. Data Sci. J. 2016, 15, 16.
  12. Dzeboev, B.A.; Karapetyan, J.K.; Aronov, G.A.; Dzeranov, B.V.; Kudin, D.V.; Karapetyan, R.K.; Vavilin, E.V. FCAZ-recognition based on declustered earthquake catalogs. Russ. J. Earth Sci. 2020, 20, 1–9.
  13. Gvishiani, A.D.; Agayan, S.M.; Losev, I.V.; Tatarinov, V.N. Geodynamic hazard assessment of a structural block holding an underground radioactive waste disposal facility. Min. Inf. Anal. Bull. 2021, 12, 5–18.
Figure 1. Comparison of algorithms: (a) initial array; (b) result of clustering by the SDPS-algorithm; (c) result of clustering by the DBSCAN-algorithm.
Figure 2. Flowchart of the DPS-algorithm.
Figure 3. Dependence of the DPS-algorithm result on the level of extremeness: (a) initial array; (b) red shows the result $X(\beta)$ of the DPS-algorithm, green shows the result of the algorithm on the complement $\overline{X(\beta)}$, $\beta = 0.02$, and the blue cluster is the result of one more transition to the complement, $\beta = 0.02$; (c) red shows the result at $\bar{\beta} = 0.25$.
Figure 4. DPS-algorithm iterations: (a) initial array; (b) the set $X_0(0.09)$; (c) the set $X_1(0.09)$; (d) the set $X_2(0.09)$.
Figure 5. $\mathrm{SDPS}_1$-algorithm: (a) the first stage; (b) the second stage.
Figure 6. Complete block-diagram of the DPS-algorithm.
Figure 7. The complete scheme of the SDPS algorithm: (a) initial array; (b) the first stage; (c) the second stage; (d) the third stage.
Figure 8. Comparison of the SDPS and $\mathrm{SDPS}_2$ algorithms: (a) initial array; (b) the first stage of SDPS; (c) the second stage of SDPS; (d) the first stage of $\mathrm{SDPS}_2$; (e) the second stage of $\mathrm{SDPS}_2$; (f) the third stage of $\mathrm{SDPS}_2$.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Agayan, S.M.; Bogoutdinov, S.R.; Dzeboev, B.A.; Dzeranov, B.V.; Kamaev, D.A.; Osipov, M.O. DPS Clustering: New Results. Appl. Sci. 2022, 12, 9335. https://doi.org/10.3390/app12189335
