Article

Topological Regularization for Representation Learning via Persistent Homology

1 School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
2 School of Automation and Electrical Engineering, Shenyang Ligong University, Shenyang 110159, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(4), 1008; https://doi.org/10.3390/math11041008
Submission received: 2 January 2023 / Revised: 9 February 2023 / Accepted: 14 February 2023 / Published: 16 February 2023
(This article belongs to the Special Issue Advances in Computer Vision and Machine Learning)

Abstract:
Generalization is challenging in small-sample-size regimes with over-parameterized deep neural networks, and a better representation is generally beneficial for generalization. In this paper, we present a novel method for controlling the internal representation of deep neural networks from a topological perspective. Leveraging the power of topological data analysis (TDA), we study the push-forward probability measure induced by the feature extractor, and we formulate a notion of “separation” to characterize a property of this measure in terms of persistent homology for the first time. Moreover, we perform a theoretical analysis of this property and prove that enforcing this property leads to better generalization. To impose this property, we propose a novel weight function to extract topological information, and we introduce a new regularizer including three items to guide the representation learning in a topology-aware manner. Experimental results in the point cloud optimization task show that our method is effective and powerful. Furthermore, results in the image classification task show that our method outperforms the previous methods by a significant margin.

1. Introduction

Although over-parameterized deep neural networks generalize well in practice when sufficient data are provided, in small-sample-size regimes, generalization is more difficult and requires careful consideration. Since the ability to learn task-specific representations is beneficial for generalization, a lot of effort has been dedicated to imposing structure on the latent space or representation space via additional regularizers [1,2], to guide the mapping from input space into internal space, or control properties of the internal representations [3]. However, internal representations are high-dimensional, discrete, sparse, incomplete and noisy; extraction of information from this kind of data is rather challenging.
In order to explore and control internal representations, several families of methods are available: (1) For algebraic methods based on vector spaces [4], coordinates are not natural, the power of linear transformations is limited, and low-dimensional visualizations cannot faithfully characterize the data. (2) For statistical methods [5], a small sample size limits the power of analysis, inference and computation; asymptotic statistics cannot be used; and the results may exhibit large variance. (3) For geometric methods based on distances [6] or manifold assumptions [7], it is difficult to capture the global picture, and the metrics are not theoretically justified [8] (for neural networks, notions of distance are constructed by the feature extractor, which is hard to understand). Individual parameter choices may significantly influence the results, and constraints based on geometric information, such as pairwise distances, may be too strict to respect reality. (4) Methods based on calculus [9,10] can capture local information in a small neighborhood of each point, but their performance is questionable when high-dimensional data are sparse and the sample size is small.
Besides the above traditional methods, there is a fundamentally different and largely unexplored perspective: topological data analysis (TDA). The advantages of TDA methods [8,11,12,13] are as follows: (1) TDA studies the global “shape” of data and explores the underlying topological and geometric structures of point clouds. As a complement to localized and generally more rigid geometric features, topological features are suitable for capturing multi-scale, global and intrinsic properties of data. (2) Topological methods study geometric features in a way that is less sensitive to the choice of metric, and this insensitivity is beneficial when the metric is not well understood or is only determined in a coarse way [8], as in neural networks. (3) Topological methods are coordinate-free, and they only focus on the intrinsic geometric properties of the objects under study. (4) Instead of fixing a single spatial scale at which to understand and control the data, persistent homology collects information over the whole domain of parameter values and creates a summary in which features that persist over a wide range of spatial scales are considered more likely to represent true features of the underlying space rather than artifacts of sampling, noise or a particular choice of parameters.

1.1. Related Works

Previous work related to our work can be divided into two categories. The first category focuses on regularization using statistical information of internal representations. Cogswell et al. [1] proposed a regularizer to encourage diverse or non-redundant representations by minimizing the cross-covariance of internal representation. Choi et al. [2] designed two class-wise regularizers to enforce the desired characteristic for each class; one focused on reducing the covariance of the representations for samples from the same class, and the other used variance instead of covariance to improve compactness. The second category studies deep neural networks using tools from algebraic topology, in particular, persistent homology. Brüel-Gabrielsson et al. [14] presented a differentiable topology layer to extract topological features, which can be used to promote topological structure or incorporate a topological prior via regularization. Kim et al. [15] proposed a topological layer for generative deep models to feed critical topological information into subsequent layers and provided an adaptation for the distance-to-measure (DTM) function-based filtration. Hajij et al. [16] defined and studied the classification problem in machine learning in a topological setting and showed when the classification problem is possible or not possible in the context of neural networks. Li et al. [17] proposed an active learning algorithm to characterize the decision boundaries using their homology. Chen et al. [18] proposed measuring the complexity of the classification boundary via persistent homology, and the topological complexity was used to control the decision boundary via regularization. Vandaele et al. [19] introduced a novel set of topological losses to topologically regularize data embeddings in unsupervised feature learning, which can efficiently incorporate a topological prior. Hofer et al. [20] considered the problem of representation learning, treated each mini-batch as a point cloud, and controlled the connectivity of latent space via a novel topological loss. Moor et al. [3] extended this work and proposed a loss term to harmonize the topological features of the input space with the topological features of the latent space. This approach also acts on the level of mini-batches, computes persistence diagrams for both input space and latent space, and encourages these two persistence diagrams (PD) to be similar by the regularization item. Wu et al. [21] explored the rich spatial behavior of data in the latent space, proposed a topological filter to filter out noisy labels, and theoretically proved that the method is guaranteed to collect clean data with high probability. These works show empirically or theoretically that enforcing a certain topological structure on representation space can be beneficial for learning tasks.
Hofer et al. [22] proposed an approach to regularize the internal representation to control the topological properties of the internal space, and proved that this approach would enforce mass concentration effects which are beneficial for generalization. However, the authors based their work on the assumption that a loss function that yields a large margin in the representation space should be selected and, therefore, that the mass concentration effect is only beneficial if the reference set is located sufficiently far away from the decision boundary, but this large margin assumption may be violated in practice.

1.2. Contribution

In this paper, we apply the TDA method, in particular, persistent homology from algebraic topology, to analyze and control the global topology of the internal representations of the training points, which reveals the intrinsic structure of the representation space.
When TDA is combined with statistics, data are deemed to be generated from some unknown distribution instead of some underlying manifold. TDA methods are used to infer topological features of the underlying distribution, especially the support of the distribution. Inspired by [22], by combining statistics with topological data analysis, our work focuses on the probability measure induced by the feature extractor and treats the representations of the training points in a mini-batch as point cloud data from which topological information is extracted; we then compute the persistence diagram of its persistent homology. Specifically, we consider the topological properties of the samples from the product measure of two classes to enforce intra-class mass concentration and separation between classes simultaneously. We extend the definitions and techniques in [22] to formalize the separation between two classes via persistent homology. We argue that if this separation property is encouraged, then both mass concentration and separation will be enforced, and we propose a novel weight function and construct a novel loss to control the topological properties of the representation space.
In summary, our contributions are as follows:
(1) We characterize a separation property between two classes in representation space in terms of persistent homology (Section 3.2).
(2) We prove that a topological constraint on the samples of the push-forward probability measure in the representation space leads to mass separation (Section 3.2).
(3) We propose a novel weight function based on DTM. Using our weight function, the weighted Rips filtration can be built on top of training samples from class pairs in a mini-batch. The stability of the persistence diagram with respect to the proposed weight function is presented (Section 3.4).
(4) We propose three regularization items, including a birth loss, a margin loss and a length loss, which operate on a persistence diagram obtained via persistent homology computations on mini-batches, to encourage mass separation (Section 3.4).
The remainder of this paper is structured as follows: In Section 2, we present some topological preliminaries relevant to our work. Section 3 gives our main results, including the separation property, the weight function and the regularization method. Section 4 shows the experimental results on synthetic data and benchmark datasets. Finally, Section 5 gives the conclusion.

2. Topological Preliminaries

Generally, in topological data analysis, the point clouds are thought to be finite samples taken from an underlying geometric object. To extract topological and geometric information, a natural way is to “connect” data points that are close to each other to build a global continuous shape on top of the data. This section contains a brief introduction to the relevant topological notions. More details can be found in several excellent introductions and surveys [8,11,12,13,23].

2.1. Simplicial Complex, Persistent Homology and Persistence Diagrams

A simplicial complex K is a discrete structure built over a finite set of samples to provide a topological approximation of the underlying topology or geometry. The Čech complex and the Vietoris-Rips complex are widely used in TDA. Below, for any $x \in \mathcal{X}$ and $r > 0$, let $B(x; r)$ denote the open ball of radius $r$ centered at $x$.
Definition 1 
(Čech complex [11]). Let $X \subset \mathcal{X}$ be finite and $r > 0$. The Čech complex $C(X; r)$ is the simplicial complex defined by
$\{x_0, \ldots, x_k\} \in C(X; r) \iff \bigcap_{i=0}^{k} \bar{B}(x_i; r) \neq \emptyset$  (1)
Definition 2 
(Vietoris-Rips complex [11]). Let $X \subset \mathcal{X}$ be finite and $r > 0$. The Vietoris-Rips complex $R(X; r)$ is the simplicial complex defined by
$\{x_0, \ldots, x_k\} \in R(X; r) \iff \bar{B}(x_i; r) \cap \bar{B}(x_j; r) \neq \emptyset \text{ for any } i, j \in \{0, \ldots, k\} \iff d(x_i, x_j) \leq 2r \text{ for any } i, j \in \{0, \ldots, k\}$  (2)
For a simplicial complex K, the k-th homology group of K, denoted $H_k(K)$, is used to characterize the k-dimensional topological features of K. The k-th Betti number of K is the dimension $\beta_k(K) = \dim H_k(K)$ of the vector space $H_k(K)$; it counts the number of k-dimensional features of K. For example, $\dim H_0(K)$ counts the number of connected components, $\dim H_1(K)$ counts the number of holes, and so on.
For Definitions 1 and 2, it is difficult to choose a proper r without prior domain knowledge. The main insight of persistent homology is to compute topological features of a space at different spatial resolutions. In general, the assumption is that features that persist for a wide range of parameters are “true” features. Features persisting for only a narrow range of parameters are presumed to be noise.
A filtration of a simplicial complex K is a collection of subcomplexes approximating the data points at different spatial resolutions, formally defined as follows:
Definition 3 
(Filtration [11]). Let K be a simplicial complex and $T \subseteq \mathbb{R}$. A family of subcomplexes $(K_t)_{t \in T}$ of K is said to be a filtration of K if it satisfies
(1) $K_s \subseteq K_t$ for $s \leq t$;
(2) $\bigcup_{t \in T} K_t = K$.
Given $\varepsilon \geq 0$, two filtrations $(V_t)_{t \in T}$ and $(W_t)_{t \in T}$ of $E = \mathbb{R}^d$ are $\varepsilon$-interleaved [24] if, for every $t \in T$, $V_t \subseteq W_{t+\varepsilon}$ and $W_t \subseteq V_{t+\varepsilon}$. The interleaving pseudo-distance between $(V_t)_{t \in T}$ and $(W_t)_{t \in T}$ is defined as the infimum of such $\varepsilon$:
$d_i\big( (V_t)_{t \in T}, (W_t)_{t \in T} \big) = \inf \{ \varepsilon : (V_t) \text{ and } (W_t) \text{ are } \varepsilon\text{-interleaved} \}$  (3)
Let $X$ be a finite point set in $\mathcal{X} = \mathbb{R}^d$ and $t \in \mathbb{R}$. The family $\{C(X; t)\}_{t \geq 0}$ forms the Čech filtration, and the family $\{R(X; t)\}_{t \geq 0}$ forms the Rips filtration. Since the Čech complex is expensive to compute, the Rips filtration, which is cheaper, is frequently used to investigate the topology of the point set $X$.
In the construction of Čech filtrations, the radii of the balls increase uniformly. We can also make the radii increase non-uniformly. Let $f : \mathbb{R}^d \to \mathbb{R}_{\geq 0}$ be a continuous function and $p \in [1, \infty]$. For $p < \infty$, we define a function $r_f$ on $\mathbb{R}^d \times \mathbb{R}$ by
$r_f(x, t) = \begin{cases} -\infty, & t < f(x) \\ \left( t^p - f(x)^p \right)^{1/p}, & \text{otherwise} \end{cases}$  (4)
By modifying the definition of $C(X; t)$, we define a simplicial complex $C_f(X; t)$ by
$\{x_0, \ldots, x_k\} \in C_f(X; t) \iff \bigcap_{i=0}^{k} \bar{B}\big( x_i; r_f(x_i, t) \big) \neq \emptyset$  (5)
For each fixed $t$, $C_f(X; t)$ is a Čech complex. The family $\{C_f(X; t)\}_{t \in \mathbb{R}}$ forms a filtration, which is called the weighted Čech filtration. The weighted Rips filtration can be constructed in a similar way.
For a filtration and each non-negative integer $k$, we keep track of when k-dimensional homological features appear and disappear in the filtration. If a homological feature $\alpha_i$ appears at $b_i$ and disappears at $d_i$, we say $\alpha_i$ is born at $b_i$ and dies at $d_i$. By plotting these pairs $(b_i, d_i)$ as points in the plane, we obtain the persistence diagram.
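As a minimal illustration of these notions, the following Python sketch (assuming the gudhi library is available; the helper `sample_circle` is our own) builds a Vietoris-Rips filtration on a noisy circle and extracts its 0- and 1-dimensional persistence intervals. Note that gudhi indexes the Rips filtration by edge length, i.e., by $2r$ in the convention of Definition 2.

```python
import numpy as np
import gudhi

# Hypothetical helper: sample points from a noisy circle.
def sample_circle(n=100, noise=0.05, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    pts = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    return pts + noise * rng.standard_normal(pts.shape)

X = sample_circle()

# Vietoris-Rips filtration over the point cloud (Definition 2);
# gudhi uses edge lengths (= 2r) as filtration values.
rips = gudhi.RipsComplex(points=X, max_edge_length=2.0)
st = rips.create_simplex_tree(max_dimension=2)

# Persistence intervals (birth, death) per homology dimension.
st.compute_persistence()
dgm0 = st.persistence_intervals_in_dimension(0)  # connected components
dgm1 = st.persistence_intervals_in_dimension(1)  # loops

# A single long-lived 1-dimensional interval reflects the circle's loop.
print("H0 intervals:", len(dgm0), "H1 intervals:", len(dgm1))
```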
Definition 4 
(Bottleneck distance [15]). Given two persistence diagrams $D$ and $D'$, their bottleneck distance $d_b$ is defined by
$d_b(D, D') = \inf_{\gamma \in \Gamma} \sup_{p \in \bar{D}} \| p - \gamma(p) \|_{\infty}$  (6)
where $\| \cdot \|_{\infty}$ is the usual $L^{\infty}$-norm, $\bar{D} = D \cup \mathrm{Diag}$ and $\bar{D}' = D' \cup \mathrm{Diag}$, with $\mathrm{Diag}$ being the diagonal $\{(x, x) : x \in \mathbb{R}\} \subset \mathbb{R}^2$ counted with infinite multiplicity, and the set $\Gamma$ consists of all bijections $\gamma : \bar{D} \to \bar{D}'$.
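The bottleneck distance of Definition 4 can be evaluated directly on two diagrams; for example, with gudhi (assumed to be installed) one may write:

```python
import gudhi

# Two toy persistence diagrams as lists of (birth, death) pairs.
diag_a = [(0.0, 1.0), (0.2, 0.5)]
diag_b = [(0.0, 1.1), (0.25, 0.45)]

# Bottleneck distance d_b(D, D') from Definition 4.
print(gudhi.bottleneck_distance(diag_a, diag_b))
```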

2.2. DTM Function

Despite strong stability properties, distance-based constructions in TDA, such as the Čech or Vietoris-Rips filtrations, are sensitive to outliers and noise. To address this issue, [24] introduced an alternative distance function, the distance-to-measure (DTM) function. DTM-based filtrations are studied in detail in [25]; we only list the properties of the DTM that will be used here.
Let $\mu$ be a probability measure over $\mathbb{R}^d$ and $m \in [0, 1)$ a parameter. For every $x \in \mathbb{R}^d$, let $\delta_{\mu, m}$ be the function defined on $\mathbb{R}^d$ by $\delta_{\mu, m}(x) = \inf\{ r \geq 0 : \mu(\bar{B}(x, r)) > m \}$.
Definition 5 
(Distance-to-measure [24]). The distance-to-measure function (DTM) with parameter $m \in [0, 1)$ and power $p$ is the function $d_{\mu, m, p} : \mathbb{R}^d \to \mathbb{R}$ defined by
$d_{\mu, m, p}(x) = \left( \frac{1}{m} \int_0^m \delta_{\mu, t}(x)^p \, dt \right)^{1/p}$  (7)
and if not specified, p = 2 is used as a default and omitted.
From Definition 5, it can be seen that, for every $x$, $d_{\mu, m}(x)$ is not lower than the distance from $x$ to the support of $\mu$.
Proposition 1 
([24]). For every probability measure $\mu$ and every $m \in [0, 1)$, $d_{\mu, m}$ is 1-Lipschitz.
Proposition 2 
([24]). Let $\mu$ and $\nu$ be two probability measures, and $m \in (0, 1)$. Then
$\| d_{\mu, m} - d_{\nu, m} \|_{\infty} \leq m^{-1/2}\, W_2(\mu, \nu)$  (8)
where $W_2(\mu, \nu)$ is the Wasserstein distance between $\mu$ and $\nu$.
In practice, the measure $\mu$ is usually unknown, and we only have a finite set of samples $X = \{x_1, \ldots, x_n\}$. A natural idea for estimating the DTM from $X$ is to plug the empirical measure $\mu_n$ in place of $\mu$ in Definition 5, which yields the “distance to the empirical measure” (DTEM). For $m = k/n$, the DTEM satisfies
$d_{\mu_n, k/n, p}^{\,p}(x) = \frac{1}{k} \sum_{j=1}^{k} \| x - x_n^{(j)} \|^p$  (9)
where $\| x - x_n^{(j)} \|$ denotes the distance between $x$ and its $j$-th nearest neighbor in $\{x_1, \ldots, x_n\}$. This quantity can easily be computed in practice since it only requires the distances between $x$ and the sample points.
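Since the DTEM in Equation (9) is just an average of powers of the $k$ nearest-neighbor distances, it is straightforward to implement; the following NumPy sketch (with our own helper names) is one possible realization:

```python
import numpy as np

def dtem(query, samples, m=0.1, p=2):
    """Distance to the empirical measure (Equation (9)).

    query:   (q, d) points at which the DTEM is evaluated.
    samples: (n, d) points defining the empirical measure mu_n.
    m:       mass parameter; k = ceil(m * n) nearest neighbors are used.
    """
    n = samples.shape[0]
    k = max(1, int(np.ceil(m * n)))
    # Pairwise distances between query points and samples.
    diff = query[:, None, :] - samples[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)              # shape (q, n)
    knn = np.sort(dist, axis=1)[:, :k]                # k smallest distances
    return (np.mean(knn ** p, axis=1)) ** (1.0 / p)   # shape (q,)

# Example: DTEM of the origin w.r.t. a small Gaussian sample.
rng = np.random.default_rng(0)
pts = rng.standard_normal((200, 2))
print(dtem(np.zeros((1, 2)), pts, m=0.1))
```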

3. Topological Regularization

Let $\mathcal{X}$ be the input space, $\mathcal{Y}$ the label space and $\mathcal{Z}$ the internal representation space before the classifier. Assuming there are $C$ classes, we formulate the neural network as a compositional mapping $\eta \circ \varphi : \mathcal{X} \to \mathcal{Y} = [C] = \{1, \ldots, C\}$, where $\varphi : \mathcal{X} \to \mathcal{Z}$ represents a feature extractor and $\eta : \mathcal{Z} \to \mathcal{Y}$ represents a classifier that maps the internal representation to the predicted label. Assume the representation space $\mathcal{Z}$ is equipped with a metric $d$. Let $P$ be the probability measure on $\mathcal{X}$ and $Q$ the push-forward probability measure induced by $\varphi : \mathcal{X} \to \mathcal{Z}$ on the Borel $\sigma$-algebra defined by $d$ on $\mathcal{Z}$.
We focus on the internal representation space; in particular, we study the push-forward probability measure $Q$ induced by the feature extractor $\varphi$ on $\mathcal{Z}$, identify a property of $Q$ that is beneficial for generalization, and propose a regularization method to enforce this property.

3.1. Push-Forward Probability Measure and Generalization

Let $c : \mathrm{supp}(P) \to \mathcal{Y}$ represent the deterministic mapping from the support of $P$ to the label space, and let $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ be a training sample, where $\{x_1, \ldots, x_m\}$ are $m$ i.i.d. draws from $X \sim P$ and $y_i = c(x_i)$.
For a neural network $h : \mathcal{X} \to \mathcal{Y}$ and $X \sim P$, we define the generalization error by $\mathbb{E}_{X \sim P}[\mathbf{1}_{h, c}(X)]$, where
$\mathbf{1}_{h, c}(x) = \begin{cases} 0, & h(x) = c(x) \\ 1, & \text{otherwise} \end{cases}$
To study the properties of $Q$, we consider the class-specific probability measures as in [22] and define the restriction of $Q$ (i.e., the push-forward of $P$ via $\varphi$) to class $k$ by
$Q_k : \sigma \mapsto \frac{Q(\sigma \cap C_k)}{Q(C_k)} \in [0, 1]$
where $\sigma$ ranges over the Borel sets of $\mathcal{Z}$ and $C_k = \varphi(c^{-1}(\{k\}))$ is the representation of class $k$ in $\mathcal{Z}$.
If the probability mass of class $k$'s decision region, measured via $Q_k$, tends towards one, this may lead to better generalization. Reference [22] formalized this intuition by establishing a direct link between $Q_k$ and the generalization error.
Proposition 3 
([22]). For any class $k \in [K]$, let $C_k = \varphi(c^{-1}(\{k\}))$ be its internal representation and $D_k = \eta^{-1}(\{k\})$ its decision region in $\mathcal{Z}$ w.r.t. $\eta$. If, for some $\varepsilon > 0$, $\forall k : 1 - Q_k(D_k) \leq \varepsilon$, then $\mathbb{E}_{X \sim P}[\mathbf{1}_{\eta \circ \varphi, c}(X)] \leq K \varepsilon$.
Proposition 3 links generalization to a condition depending on $Q$. Intuitively, increasing the probability that $\varphi$ maps a sample of class $k$ into the correct decision region can improve generalization.
Based on this observation, [22] introduced the definition of a $\beta$-connected set to characterize connectivity via 0-dimensional (Vietoris-Rips) persistent homology and proved that the corresponding property of the probability measures $Q_k$ is beneficial for generalization.
Definition 6 
($\beta$-connected [22]). Let $\beta > 0$. A set $M \subseteq \mathcal{Z}$ is $\beta$-connected iff all 0-dimensional death-times of its Vietoris-Rips persistent homology lie in the open interval $(0, \beta)$.
However, [22] assumed a large margin in the representation space, an assumption that may be violated in deep neural networks. In the following, we extend their work and identify a property of the probability measure $Q$ that can enhance the separation between classes.
Note that Proposition 3 can also be written in an alternative form: because the probability mass of all classes' decision regions, measured via $Q_k$, sums to one, we have $1 - Q_k(D_k) = \sum_{i \neq k} Q_k(D_i)$, which means that, for each class $k$, the total probability mass of the other classes' decision regions, measured via $Q_k$, should tend towards zero. Intuitively, decreasing the probability that $\varphi$ maps a sample of class $k$ into an incorrect decision region can improve generalization.
Therefore, in order to decrease $Q_j(D_i)$, we take class pairs $(Q_i, Q_j)$ into consideration and formulate a notion of separation in terms of persistent homology as follows.

3.2. Probability Mass Separation

In this section, we show that a certain topological constraint on a $(Q_i, Q_j)$ pair leads to probability mass separation. More precisely, given a reference set $M = B(x_0; r_0) \subseteq \mathcal{Z}$ and letting $M^{l\beta} = B(x_0; r_0 + l\beta)$, our topological constraint provides a non-trivial upper bound on $Q_j(M^{l\beta})$ in terms of $Q_i(M)$.
In order to enforce the separation between two classes, we extend Definition 6 to characterize the separation between two sets:
Definition 7. 
Let $\beta > 0$ and $\gamma > 2\beta$. Consider two sets $M_1, M_2 \subseteq \mathcal{Z}$ and let $M = M_1 \cup M_2$. Denote the death-times of $M$'s 0-dimensional Vietoris-Rips persistent homology by $\{d_i\}_{i \in I_0}$, indexed by decreasing lifetimes, i.e., $d_i \geq d_j$ for $i < j$. Then we say that $M_1$ and $M_2$ are $(\beta, \gamma)$-separated if and only if the following two conditions are satisfied (a numerical check of these conditions is sketched after the definition):
(1) $M_1$ and $M_2$ are both $\beta$-connected;
(2) $d_2$ lies in the open interval $(\gamma, \infty)$.
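The two conditions of Definition 7 can be verified numerically from 0-dimensional death times; the sketch below is one way to do so with gudhi (an assumption of ours), dividing gudhi's edge-length filtration values by two to match the radius convention of Definition 2.

```python
import numpy as np
import gudhi

def zero_dim_deaths(points):
    """Finite 0-dimensional death times of the Vietoris-Rips filtration.
    gudhi reports edge lengths, i.e. 2r in the convention of Definition 2,
    so values are halved; the essential (infinite) class is dropped."""
    st = gudhi.RipsComplex(points=points).create_simplex_tree(max_dimension=1)
    st.compute_persistence()
    deaths = st.persistence_intervals_in_dimension(0)[:, 1] / 2.0
    return np.sort(deaths[np.isfinite(deaths)])[::-1]   # decreasing order

def is_beta_connected(points, beta):
    deaths = zero_dim_deaths(points)
    return deaths.size == 0 or deaths[0] < beta          # all finite deaths < beta

def is_beta_gamma_separated(M1, M2, beta, gamma):
    """Conditions (1) and (2) of Definition 7 for the union M = M1 u M2."""
    if not (is_beta_connected(M1, beta) and is_beta_connected(M2, beta)):
        return False
    deaths = zero_dim_deaths(np.vstack([M1, M2]))
    # d_1 is the essential class (already dropped), so d_2 is the largest
    # finite death time; it must exceed gamma.
    return deaths.size > 0 and deaths[0] > gamma

rng = np.random.default_rng(0)
M1 = rng.normal(loc=(-2.0, 0.0), scale=0.1, size=(20, 2))
M2 = rng.normal(loc=(+2.0, 0.0), scale=0.1, size=(20, 2))
print(is_beta_gamma_separated(M1, M2, beta=0.5, gamma=1.0))
```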
We then use this notion to capture the concentration and separation of a $(b, b)$-sized sample from a $(Q_i, Q_j)$ pair. For $b$-sized i.i.d. samples from $Q$, we denote the product measure by $Q^b$, and for $(b, b)$-sized samples from $(Q_i, Q_j)$ ($b$ i.i.d. draws from each), we denote the product measure by $Q_i^b \otimes Q_j^b$.
Define the indicator function $c_{(b,b)}^{\beta,\gamma} : \mathcal{Z}^{(b,b)} \to \{0, 1\}$ as follows:
$c_{(b,b)}^{\beta,\gamma}(z_{1,1}, \ldots, z_{1,b}, z_{2,1}, \ldots, z_{2,b}) = 1 \iff \{z_{1,1}, \ldots, z_{1,b}\} \text{ and } \{z_{2,1}, \ldots, z_{2,b}\} \text{ are } (\beta, \gamma)\text{-separated}$
We now consider the probability that a $(b, b)$-sized sample from $Q_i^b \otimes Q_j^b$ is $(\beta, \gamma)$-separated.
Definition 8. 
Let $\beta > 0$, $\gamma > 2\beta$, $c_{\beta,\gamma} \in [0, 1]$, and $b \in \mathbb{N}$. We call a $(Q_i, Q_j)$ pair $(b, c_{\beta,\gamma})$-separated if
$Q_i^b \otimes Q_j^b\big( \{ c_{(b,b)}^{\beta,\gamma} = 1 \} \big) \geq c_{\beta,\gamma}$
For two classes C1 and C2, consider the restrictions of $Q$ to C1 and C2, i.e., $Q_1$ and $Q_2$. Assume $(Q_1, Q_2)$ is $(b, c_{\beta,\gamma})$-separated, consider a reference set $M = B(x_0; r_0) \subseteq \mathcal{Z}$ and let $M^{l\beta} = B(x_0; r_0 + l\beta)$, together with the complement set $N = (M^{l\beta})^C$, where $r_0 + l\beta \leq \gamma / 2$. Let $p = Q_1(M)$, $q = Q_1(M^{l\beta})$ and $s = Q_2(M^{l\beta})$. According to [22], when $p$ is fixed, we can lower bound $q$. In the following, we provide an approach to upper bound $s$, which hints at mass separation between different classes.
For a $(b, b)$ sample $(z_{1,1}, \ldots, z_{1,b}, z_{2,1}, \ldots, z_{2,b}) \sim Q_1^b \otimes Q_2^b$, consider how the $z_{i,j}$ are distributed between $M^{l\beta}$ and $N$. Let $n_1$ and $n_2$ be the numbers of $z_{1,i}$'s and $z_{2,i}$'s that fall within $M^{l\beta}$, respectively, i.e., $n_1 = |\{z_{1,i}\} \cap M^{l\beta}|$ and $n_2 = |\{z_{2,i}\} \cap M^{l\beta}|$. Clearly, if the membership assignment satisfies $n_1 \geq 1$ and $n_2 \geq 1$, then $(z_{1,1}, \ldots, z_{1,b}, z_{2,1}, \ldots, z_{2,b})$ cannot be $(\beta, \gamma)$-separated.
Thus, we define the event that $(z_{1,1}, \ldots, z_{1,b}, z_{2,1}, \ldots, z_{2,b})$ cannot be $(\beta, \gamma)$-separated as follows:
$E = \{ (z_{1,1}, \ldots, z_{1,b}, z_{2,1}, \ldots, z_{2,b}) \sim Q_1^b \otimes Q_2^b : n_1 \geq 1, n_2 \geq 1 \}$, and then we have $c_{(b,b)}^{\beta,\gamma}(E) = \{0\}$.
In the following proposition, we compute the probability of the event $E$ and derive some useful properties.
Proposition 4. 
Let $b \in \mathbb{N}$, $q, s \in [0, 1]$, $q = Q_1(M^{l\beta})$ and $s = Q_2(M^{l\beta})$. Denote
$\Phi(q, s; b) = \sum_{n_1=1}^{b} \sum_{n_2=1}^{b} \frac{b!}{n_1!(b-n_1)!} \frac{b!}{n_2!(b-n_2)!}\, q^{n_1}(1-q)^{b-n_1} s^{n_2}(1-s)^{b-n_2}$
Then the probability of $E$ can be expressed in terms of $q$ and $s$ as $Q_1^b \otimes Q_2^b(E) = \Phi(q, s; b)$, and $\Phi(q, s; b)$ satisfies:
(1) $\Phi(q_0, \cdot\,; b)$ is monotonically increasing on $[0, 1]$;
(2) $\Phi(\cdot, s_0; b)$ is monotonically increasing on $[0, 1]$.
Proof. 
$\Phi(q, s; b) = \sum_{n_1=1}^{b} \sum_{n_2=1}^{b} \frac{b!}{n_1!(b-n_1)!} \frac{b!}{n_2!(b-n_2)!}\, q^{n_1}(1-q)^{b-n_1} s^{n_2}(1-s)^{b-n_2} = \left[ \sum_{n_1=1}^{b} \frac{b!}{n_1!(b-n_1)!} q^{n_1}(1-q)^{b-n_1} \right] \left[ \sum_{n_2=1}^{b} \frac{b!}{n_2!(b-n_2)!} s^{n_2}(1-s)^{b-n_2} \right]$
For claim (1), we fix $q = q_0$ and write
$\Phi(q_0, s; b) = \left[ \sum_{n_1=1}^{b} \frac{b!}{n_1!(b-n_1)!} q_0^{n_1}(1-q_0)^{b-n_1} \right] \cdot A(s), \qquad A(s) = \sum_{n_2=1}^{b} \frac{b!}{n_2!(b-n_2)!} s^{n_2}(1-s)^{b-n_2}$
To study the monotonicity of $\Phi(q_0, \cdot\,; b)$, it suffices to consider $A(s)$.
We define two auxiliary functions:
$a_{n_2}(s) = \frac{b!}{n_2!(b-n_2)!} s^{n_2}(1-s)^{b-n_2}$
$c_{n_2}(s) = \frac{b!}{n_2!(b-n_2)!} s^{n_2} (b-n_2)(1-s)^{b-n_2-1}$
Then $A(s) = \sum_{n_2=1}^{b} a_{n_2}(s)$, and
$c_{n_2}(s) = \begin{cases} \dfrac{b!}{n_2!(b-n_2-1)!} s^{n_2}(1-s)^{b-n_2-1}, & 0 \leq n_2 < b \\ 0, & n_2 = b \end{cases}$
$\dfrac{\partial a_{n_2}(s)}{\partial s} = \frac{b!}{n_2!(b-n_2)!} n_2 s^{n_2-1}(1-s)^{b-n_2} - \frac{b!}{n_2!(b-n_2)!} s^{n_2}(b-n_2)(1-s)^{b-n_2-1} = c_{n_2-1}(s) - c_{n_2}(s)$
Hence,
$\dfrac{\partial A(s)}{\partial s} = \sum_{n_2=1}^{b} \big[ c_{n_2-1}(s) - c_{n_2}(s) \big] = c_0(s) - c_b(s) = c_0(s) \geq 0$
Consequently, $A(s)$ is monotonically increasing, and thus so is $\Phi(q_0, \cdot\,; b)$.
For claim (2), the proof is analogous and omitted. □
Now we can derive the main theorem:
Theorem 1. 
Let $b \in \mathbb{N}$, $q, s \in [0, 1]$, $q = Q_1(M^{l\beta})$ and $s = Q_2(M^{l\beta})$. Then it holds that
$1 - c_{\beta,\gamma} \geq \Phi(q, s; b)$  (21)
Proof. 
The left-hand side bounds the probability of all events that violate $(\beta, \gamma)$-separation, while the event $E$ on the right-hand side is only one such event; that is, $\Phi(q, s; b) = Q_1^b \otimes Q_2^b(E) \leq 1 - c_{\beta,\gamma}$. Combining Definition 8 and Proposition 4 completes the proof. □
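Because the double sum in Proposition 4 factorizes, $\Phi(q, s; b) = \big(1 - (1-q)^b\big)\big(1 - (1-s)^b\big)$, i.e., the product of the probabilities that at least one of the $b$ draws from each class lands in $M^{l\beta}$. This makes the bound of Theorem 1 easy to evaluate numerically, as in the following sketch (SciPy assumed):

```python
import numpy as np
from scipy.stats import binom

def phi(q, s, b):
    """Phi(q, s; b) from Proposition 4: probability that at least one of the
    b class-1 draws AND at least one of the b class-2 draws land in M^{l*beta}."""
    return (1.0 - binom.pmf(0, b, q)) * (1.0 - binom.pmf(0, b, s))

b, c = 8, 0.95
ss = np.linspace(0.0, 1.0, 101)
vals = phi(0.3, ss, b)
# Monotonicity in s for fixed q (Proposition 4), up to floating-point rounding.
assert np.all(np.diff(vals) >= -1e-12)
# Largest s on the grid compatible with the bound of Theorem 1.
print("largest s with Phi(0.3, s; 8) <= 1 - c:", ss[vals <= 1.0 - c].max())
```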

3.3. Ramifications of Theorem 1

According to Theorem 1 and Proposition 4, if $M^{l\beta}$ covers a certain mass of $Q_1$, we can upper bound the mass of $Q_2$ that it covers: because $\Phi(q, s; b)$ is bounded above by $1 - c_{\beta,\gamma}$ and is monotonically increasing in $q$ and $s$, if $q = Q_1(M^{l\beta})$ is greater than some $q_0$, then $s = Q_2(M^{l\beta})$ must be less than some $s_0$. This is beneficial for generalization if $M^{l\beta}$ is constructed from the representations of correctly classified training instances so as to include some minimal mass of C1.
Assume that the mass of the reference set, $p = Q_1(M)$, is fixed. Noting that our Definition 7 is stronger than the mass concentration condition in [22], and letting $\Psi(p, q; b, l) = \sum_{(u, v, w) \in I(b, l)} \frac{b!}{u!\,v!\,w!}\, p^u (1-q)^v (q-p)^w$, we have
$q \in \{ q' \in [p, 1] : 1 - c_{\beta,\gamma} \geq \Psi(p, q'; b, l) \} = A_1$
Let $R_{b, c_{\beta,\gamma}}(p, l) = \min A_1$ be the smallest mass in the $l\beta$ extension; then $q \geq R_{b, c_{\beta,\gamma}}(p, l) \geq p$.
By Theorem 1, we have
$s \in \{ s' \in [0, 1] : 1 - c_{\beta,\gamma} \geq \Phi(q, s'; b) \geq \Phi(R_{b, c_{\beta,\gamma}}(p, l), s'; b) \} = A_2$
and thus $A_2$ is non-empty. Now let $T_{b, c_{\beta,\gamma}}(q) = \max A_2$ denote the largest mass in the $l\beta$ extension for which the inequality holds. As $q$ increases, $T_{b, c_{\beta,\gamma}}(q)$ decreases; therefore, $s \leq T_{b, c_{\beta,\gamma}}(q) \leq T_{b, c_{\beta,\gamma}}(R_{b, c_{\beta,\gamma}}(p, l))$.
Let $G_{b, c_{\beta,\gamma}}(p, l) := T_{b, c_{\beta,\gamma}}(R_{b, c_{\beta,\gamma}}(p, l))$; then $s \leq G_{b, c_{\beta,\gamma}}(p, l)$.
As $\Psi$ is monotonically decreasing in $q$, $R_{b, c_{\beta,\gamma}}(p, l)$ is monotonically increasing in $c_{\beta,\gamma}$; furthermore, as $\Phi$ is monotonically increasing in $s$, $T_{b, c_{\beta,\gamma}}(q)$ is monotonically decreasing in $c_{\beta,\gamma}$. These facts motivate our regularization goal of increasing $c_{\beta,\gamma}$: increasing $c_{\beta,\gamma}$ both boosts mass concentration within a class and enforces mass separation between two classes.
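Under the factorized form of $\Phi$ given above, $T_{b, c_{\beta,\gamma}}(q)$ can be obtained in closed form by inverting $\Phi$ in its second argument. The sketch below is our own numerical illustration of this quantity; it mirrors the qualitative behavior discussed for Figures 2–5 but is not the code used to produce them.

```python
import numpy as np

def T_upper_bound(q, b, c):
    """T_{b,c}(q): the largest s with Phi(q, s; b) <= 1 - c (Theorem 1),
    using Phi(q, s; b) = (1 - (1 - q)**b) * (1 - (1 - s)**b)."""
    A = 1.0 - (1.0 - q) ** b
    if A <= 1.0 - c:                       # bound is vacuous: every s is allowed
        return 1.0
    return 1.0 - (1.0 - (1.0 - c) / A) ** (1.0 / b)

b, c = 8, 0.95
for q in (0.2, 0.5, 0.8):
    print(q, T_upper_bound(q, b, c))       # decreases as q grows, as in Figure 3
```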
Suppose that $M$ is constructed mainly from training samples of C1, i.e., we choose $x_0$ and $r_0$ so as to include many training samples from C1; then we can ask the following: how much mass of C1 must $M$ contain, at a minimum, to boost the separation?
We plot $\Phi(q, q; b)$ in Figure 1, where we can see that $q = s$ at the point (0.049, 0.049), i.e., $\Phi(q, q; b) = 1 - c_{\beta,\gamma}$. At this point, the mass separation effect starts to occur. To satisfy Inequality (21), when $q$ increases, $s$ must decrease, which means that as $M^{l\beta}$ covers more mass of $Q_1$, it covers less mass of $Q_2$. In addition, as the batch size $b$ increases, the minimum mass of $Q_1$ that $M^{l\beta}$ must cover decreases.
In Figure 2, we fix the batch size to 8 and visualize $\Phi(q, s; b)$ as a function of $s$ (for different values of $q$); we can see that when $q$ increases, $T_{b, c_{\beta,\gamma}}(q)$ moves towards zero, which indicates a smaller $s$, i.e., $M^{l\beta}$ covers less mass of $Q_2$, and therefore a better separation is obtained.
In Figure 3, we plot $T_{b, c_{\beta,\gamma}}(q)$ as a function of $q$ for different values of $c_{\beta,\gamma}$, where $q = Q_1(M^{l\beta})$. As $c_{\beta,\gamma}$ is increased, the maximal mass of $Q_2$ contained in $M^{l\beta}$, characterized by $T_{b, c_{\beta,\gamma}}(q)$, shifts towards a smaller value, which indicates that a better separation between classes C1 and C2 is achieved.
Figure 4 visualizes $G_{b, c_{\beta,\gamma}}(p, l)$ as a function of $p$ for different values of $c_{\beta,\gamma}$, where $p = Q_1(M)$. It can be seen that as $c_{\beta,\gamma}$ is increased, the maximal mass of $Q_2$ contained in $M^{l\beta}$, characterized by $G_{b, c_{\beta,\gamma}}(p, l)$, also shifts towards a smaller value, which indicates a better separation.
Figure 5 plots $G_{b, c_{\beta,\gamma}}(p, l)$ as a function of $p$ for different values of the batch size $b$, where $p = Q_1(M)$. As $b$ is increased, the maximal mass of $Q_2$ contained in $M^{l\beta}$, characterized by $G_{b, c_{\beta,\gamma}}(p, l)$, also shifts towards a smaller value, which indicates a better separation; moreover, in order to achieve separation, $M$ only needs to cover a small mass of $Q_1$.

3.4. Weighted Rips Filtration and Regularization

In Section 3.2, we showed that a topological constraint on a $(Q_i, Q_j)$ pair leads to probability mass concentration and separation. To impose this constraint, we propose a weight function that is used to construct the filtration; we then compute the 0-dimensional persistence diagram and construct the loss items that regularize the internal representation.
Our method acts on the level of mini-batches; we construct each mini-batch B as a collection of n sub-batches, i.e., B = (B1, …, Bn), as in [22]. Each sub-batch consists of b samples from the same class, so the resulting mini-batch B contains n·b samples. Our regularizer consists of three items and penalizes deviations from a $(\beta, \gamma)$-separated arrangement of $(z_{1,1}, \ldots, z_{1,b}, z_{2,1}, \ldots, z_{2,b})$ for all sub-batch pairs (Bi, Bj).
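A minimal sketch of this sampling scheme (our own helper, not the authors' implementation) draws n sub-batches, each containing b samples from a single randomly chosen class:

```python
import numpy as np

def sample_mini_batch(labels, n_sub=8, b=16, rng=None):
    """Return indices of a mini-batch made of n_sub class-pure sub-batches."""
    rng = rng or np.random.default_rng()
    classes = np.unique(labels)
    batch = []
    for _ in range(n_sub):
        k = rng.choice(classes)                      # pick a class for this sub-batch
        idx = np.flatnonzero(labels == k)
        batch.append(rng.choice(idx, size=b, replace=len(idx) < b))
    return np.stack(batch)                           # shape (n_sub, b)

labels = np.random.randint(0, 10, size=250)
sub_batches = sample_mini_batch(labels)
print(sub_batches.shape)                             # (8, 16) -> 128 samples in total
```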

3.4.1. A Weight Function for Weighted Rips Filtration

To construct a proper filtration for samples from two different classes, we define a function $f$:
$f_{\mu, \nu, m, T}(x) = \ln\big( 1 + \exp\big( (d_{\mu, m}(x) - d_{\nu, m}(x)) / T \big) \big)$  (24)
where $T$ is a temperature that controls the magnitude and $d_{\mu, m}(x)$ is the DTM function defined in Equation (7).
Considering the mass separation for two classes, we denote the data instances of class $k$ by $S_k$; then, for a class pair $(C_i, C_j)$, the training samples can be written as $S_{i,j} = S_i \cup S_j$. Let $Q_i$ and $Q_j$ be the restrictions of $Q$ (i.e., the push-forward of $P$ via $\varphi$) to classes $i$ and $j$, respectively. In order to construct the filtration with Equation (24), we first need to compute $f_{Q_i, Q_j, m, T}(x)$ for $x \in S_{i,j}$. Note that $d_{Q_i, m}$ and $d_{Q_j, m}$ can be computed with Equation (9), the DTEM, where $Q_i$ is approximated by $\varphi(S_i)$ and $Q_j$ by $\varphi(S_j)$. According to Equation (24), for a good classifier, points from class $i$ should have smaller function values than points from class $j$. We then plug $f_{Q_i, Q_j, m, T}(x)$ into Equation (4) to compute the weighted Rips filtration (we set $p = 1$ in Equation (4) in this work) and obtain the 0-dimensional persistence diagram, i.e., the multi-set of intervals for homology in dimension 0, $PD_0(X_{i,j}) = \{(b_k, d_k)\}_{k \in I_0}$. After that, we order the intervals by decreasing lifetimes, as in Definition 7; they will be used to construct the loss items in Section 3.4.3.
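The following Python sketch illustrates this construction for $p = 1$ (gudhi assumed; the helpers are ours, and the default parameters mirror Section 4.1.1). With $p = 1$, a vertex $x$ enters the weighted Rips filtration at $t = f(x)$, and an edge $\{x, y\}$ enters at $t = \max\big(f(x), f(y), (d(x, y) + f(x) + f(y))/2\big)$ under the radius convention $d \leq 2r$. This is only an illustration of the filtration; the authors' training pipeline additionally requires the persistence computation to be differentiable.

```python
import numpy as np
import gudhi

def dtem(query, samples, m):
    """Distance to the empirical measure (Equation (9)), k = ceil(m * n), p = 2."""
    k = max(1, int(np.ceil(m * samples.shape[0])))
    dist = np.linalg.norm(query[:, None, :] - samples[None, :, :], axis=-1)
    return np.sqrt(np.mean(np.sort(dist, axis=1)[:, :k] ** 2, axis=1))

def weight_f(x, S_i, S_j, m=0.1, inv_T=0.05):
    """Weight function of Equation (24): ln(1 + exp((d_{Qi,m}(x) - d_{Qj,m}(x)) / T))."""
    return np.log1p(np.exp((dtem(x, S_i, m) - dtem(x, S_j, m)) * inv_T))

def weighted_rips_pd0(Z_i, Z_j, m=0.1, inv_T=0.05):
    """0-dimensional persistence diagram of the weighted Rips filtration (p = 1)."""
    Z = np.vstack([Z_i, Z_j])
    f = weight_f(Z, Z_i, Z_j, m, inv_T)
    st = gudhi.SimplexTree()
    for a in range(len(Z)):
        st.insert([a], filtration=float(f[a]))          # vertex enters at t = f(x)
    for a in range(len(Z)):
        for b in range(a + 1, len(Z)):
            d = np.linalg.norm(Z[a] - Z[b])
            # Edge enters when both weighted balls exist and intersect.
            st.insert([a, b], filtration=float(max(f[a], f[b], (d + f[a] + f[b]) / 2)))
    st.compute_persistence()
    return st.persistence_intervals_in_dimension(0), f

rng = np.random.default_rng(0)
Z_i = rng.normal((-0.6, 0.0), 0.3, size=(16, 2))
Z_j = rng.normal((+0.6, 0.0), 0.3, size=(16, 2))
pd0, f = weighted_rips_pd0(Z_i, Z_j)
print(pd0)
```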

3.4.2. Stability

In this section, we establish stability results for the weight function in Equation (24). Theorem 2 gives the stability of the weight function, which is later used in Theorem 3 to ensure the stability of the filtration $V[X, f]$ with respect to the weight function $f$. Proposition 5 ensures the stability of the filtration $V[X, f]$ with respect to $X$. According to persistent homology theory, stability results for the filtration translate into stability results for the persistence diagrams. We present our main stability result in Theorem 3.
Theorem 2. 
Let $\mu_1$, $\mu_2$, $\nu_1$ and $\nu_2$ be four probability measures, and $m \in (0, 1)$. Then
$\| f_{\mu_1, \mu_2, m, T} - f_{\nu_1, \nu_2, m, T} \|_{\infty} \leq (1/T)\, m^{-1/2} \big( W_2(\mu_1, \nu_1) + W_2(\mu_2, \nu_2) \big)$
Proof. 
Let $h_{\mu_1, \mu_2, m, T}(x) = (d_{\mu_1, m}(x) - d_{\mu_2, m}(x)) / T$ and $g(x) = \ln(1 + \exp(x))$. Then $f = g \circ h$.
Because $g$ is 1-Lipschitz, i.e., $|g(x) - g(y)| \leq |x - y|$ for all $x$ and $y$, we have
$\| f_{\mu_1, \mu_2, m, T} - f_{\nu_1, \nu_2, m, T} \|_{\infty} = \| g \circ h_{\mu_1, \mu_2, m, T} - g \circ h_{\nu_1, \nu_2, m, T} \|_{\infty} \leq \| h_{\mu_1, \mu_2, m, T} - h_{\nu_1, \nu_2, m, T} \|_{\infty} = (1/T) \| (d_{\mu_1, m} - d_{\mu_2, m}) - (d_{\nu_1, m} - d_{\nu_2, m}) \|_{\infty} \leq (1/T) \big( \| d_{\mu_1, m} - d_{\nu_1, m} \|_{\infty} + \| d_{\mu_2, m} - d_{\nu_2, m} \|_{\infty} \big) \leq (1/T)\, m^{-1/2} \big( W_2(\mu_1, \nu_1) + W_2(\mu_2, \nu_2) \big)$
The last inequality is obtained according to Proposition 2. □
In Proposition 5, we consider the stability of the filtration with respect to X . For brevity, the subscripts of f are omitted.
Proposition 5. 
Suppose that $X$ and $Y$ are compact and that the Hausdorff distance satisfies $d_H(X, Y) \leq \varepsilon$. Then the filtrations $V[X, f]$ and $V[Y, f]$ are $k$-interleaved with $k = \varepsilon(1 + 2/T)$.
Proof. 
It suffices to show that, for every $t \geq 0$, $V_t[X, f] \subseteq V_{t+k}[Y, f]$; the other inclusion follows by symmetry.
For $z \in V_t[X, f]$, there exists $x \in X$ such that $z \in \bar{B}_f(x, t)$, i.e., $\| x - z \| \leq r_x(t)$. From the hypothesis $d_H(X, Y) \leq \varepsilon$, there exists $y \in Y$ such that $\| x - y \| \leq \varepsilon$. We then need to prove that $z \in \bar{B}_f(y, t + k)$, i.e., $\| z - y \| \leq r_y(t + k)$.
By the triangle inequality, $\| z - y \| \leq \| z - x \| + \| x - y \| \leq r_x(t) + \varepsilon$. It therefore suffices to show that $r_x(t) + \varepsilon \leq r_y(t + k)$.
Using Equation (4) with $p = 1$, we have
$r_y(t+k) - r_x(t) = \big( (t+k) - f(y) \big) - \big( t - f(x) \big) = k + \big( f(x) - f(y) \big) = k + \big( g(h(x)) - g(h(y)) \big) \geq k - |h(x) - h(y)| = k - (1/T) \big| \big( d_{\mu_1, m}(x) - d_{\mu_2, m}(x) \big) - \big( d_{\mu_1, m}(y) - d_{\mu_2, m}(y) \big) \big|$
According to Proposition 1, the DTM function is 1-Lipschitz, and then
$\big| \big( d_{\mu_1, m}(x) - d_{\mu_2, m}(x) \big) - \big( d_{\mu_1, m}(y) - d_{\mu_2, m}(y) \big) \big| \leq \big| d_{\mu_1, m}(x) - d_{\mu_1, m}(y) \big| + \big| d_{\mu_2, m}(x) - d_{\mu_2, m}(y) \big| \leq 2 \| x - y \| \leq 2\varepsilon$
Therefore,
$r_y(t+k) - r_x(t) \geq k - (2/T)\varepsilon = \varepsilon(1 + 2/T) - (2/T)\varepsilon = \varepsilon$. □
In the following theorem, we combine the above results to establish the stability of the persistence diagram with respect to X and f .
Theorem 3. 
Consider four measures $\mu_1$, $\mu_2$, $\nu_1$ and $\nu_2$ on $\mathbb{R}^d$ with compact supports $X_1$, $X_2$, $Y_1$ and $Y_2$, respectively. Let $X = X_1 \cup X_2$, $Y = Y_1 \cup Y_2$, $V^1 = V[X, f_{\mu_1, \mu_2, m, T}]$ and $V^2 = V[Y, f_{\nu_1, \nu_2, m, T}]$, and let $d_b$ denote the bottleneck distance between persistence diagrams. Then
$d_b\big( D(V^1), D(V^2) \big) \leq (1/T)\, m^{-1/2} \big( W_2(\mu_1, \nu_1) + W_2(\mu_2, \nu_2) \big) + (1 + 2/T)\, d_H(X, Y)$
Proof. 
Under some regularity conditions, $d_b(D(V^1), D(V^2)) = d_i(V^1, V^2)$, where $d_i$ denotes the interleaving pseudo-distance between two filtrations, as defined in Equation (3).
We use the triangle inequality for the interleaving distance:
$d_i(V^1, V^2) \leq \underbrace{d_i\big( V[X, f_{\mu_1, \mu_2, m, T}], V[Y, f_{\mu_1, \mu_2, m, T}] \big)}_{(1)} + \underbrace{d_i\big( V[Y, f_{\mu_1, \mu_2, m, T}], V[Y, f_{\nu_1, \nu_2, m, T}] \big)}_{(2)}$  (30)
For the first term (1) on the right-hand side of Equation (30), it follows from Proposition 5 that
$d_i\big( V[X, f_{\mu_1, \mu_2, m, T}], V[Y, f_{\mu_1, \mu_2, m, T}] \big) \leq (1 + 2/T)\, d_H(X, Y)$
For the second term (2) on the right-hand side of Equation (30), according to Proposition 3.2 in [25], we have
$d_i\big( V[Y, f_{\mu_1, \mu_2, m, T}], V[Y, f_{\nu_1, \nu_2, m, T}] \big) \leq \| f_{\mu_1, \mu_2, m, T} - f_{\nu_1, \nu_2, m, T} \|_{\infty}$
Using Theorem 2, we have
$\| f_{\mu_1, \mu_2, m, T} - f_{\nu_1, \nu_2, m, T} \|_{\infty} \leq (1/T)\, m^{-1/2} \big( W_2(\mu_1, \nu_1) + W_2(\mu_2, \nu_2) \big)$
By combining part (1) and part (2), we complete the proof. □

3.4.3. Regularization via Persistent Homology

We split the persistence intervals obtained in Section 3.4.1 into two subsets:
$PD_0(X_{i,j}) = \{(b_k, d_k)\}_{k \in I_0} = \{(b_{i,k}, d_{i,k})\}_{k \in I_{0,i}} \cup \{(b_{j,k}, d_{j,k})\}_{k \in I_{0,j}}$
where $\{(b_{i,k}, d_{i,k})\}_{k \in I_{0,i}}$ consists of the intervals whose birth belongs to class $i$, and $\{(b_{j,k}, d_{j,k})\}_{k \in I_{0,j}}$ consists of the intervals whose birth belongs to class $j$.
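One way to perform this split in practice is via the birth simplex of each persistence pair: for 0-dimensional homology, the birth simplex is a single vertex, and its index tells us which class the interval belongs to. The sketch below continues the gudhi-based construction above (`st` is the simplex tree after `compute_persistence()` has been called, and `n_i` is the number of class-i points, both assumptions of ours):

```python
# Split PD_0 by the class of the vertex that creates each component.
# st:  gudhi.SimplexTree built on Z = [Z_i; Z_j] as in Section 3.4.1
# n_i: number of class-i points (they occupy indices 0 .. n_i - 1 of Z)
def split_pd0_by_class(st, n_i):
    intervals_i, intervals_j = [], []
    for birth_simplex, death_simplex in st.persistence_pairs():
        if len(birth_simplex) != 1 or len(death_simplex) == 0:
            continue                                  # keep only finite 0-dim pairs
        birth = st.filtration(birth_simplex)
        death = st.filtration(death_simplex)
        if birth_simplex[0] < n_i:
            intervals_i.append((birth, death))
        else:
            intervals_j.append((birth, death))
    return intervals_i, intervals_j
```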
Now we define three loss items:
  • Birth loss
The birth loss measures intra-class distance; it is designed to meet the first requirement of Definition 7 (i.e., the $\beta$-connectedness of Definition 6) and thus to enforce intra-class mass concentration:
$L_{\mathrm{birth}}(B_i, B_j) = \sum_{k \in I_{0,i}} \left| b_{i,k} - b_0 \right| + \sum_{k \in I_{0,j}} \left| b_{j,k} - b_1 \right|$  (32)
where $b_0$ and $b_1$ are hyperparameters used to control the birth times for each class.
  • Margin loss
The margin loss measures the “distance” between the two classes. There may be connected components that are born from points of class $i$ but die due to points of class $j$; these cases should be penalized. In addition, for class $j$, the component with the longest interval in $\{(b_{j,k}, d_{j,k})\}_{k \in I_{0,j}}$ eventually merges into class $i$'s components.
Let $b_{j,\min} = \min\{ b_{j,k} : k \in I_{0,j} \}$; we define
$L_{\mathrm{margin}}(B_i, B_j) = \sum_{k \in I_{0,i}} \max\big( \gamma + (d_{i,k} - b_{j,\min}),\, 0 \big)$  (33)
which means that we penalize margins $b_{j,\min} - d_{i,k}$ smaller than $\gamma$, where $\gamma$ is a hyperparameter used to control inter-class separation.
  • Length loss
The weighted Rips filtration is not as direct as the plain Rips filtration in controlling distances. Therefore, the length loss is used in combination with the birth loss to penalize large intra-class distances. In addition, we want the two classes to correspond to two connected components that persist over a wide range of parameters until they finally merge when the parameter becomes sufficiently large, and we also want to prevent $Q_k$ from becoming overly dense. To formulate this intuition, we define
$L_{\mathrm{length}}(B_i, B_j) = \sum_{k \in I_{0,i}} \left| (d_{i,k} - b_{i,k}) - \beta \right| + \sum_{k \in I_{0,j}} \left| (d_{j,k} - b_{j,k}) - \beta \right|$  (34)
where $\beta$ is a hyperparameter.
Finally, our regularization item can be written as follows:
$L(B_i, B_j) = \lambda_1 L_{\mathrm{birth}}(B_i, B_j) + \lambda_2 L_{\mathrm{margin}}(B_i, B_j) + \lambda_3 L_{\mathrm{length}}(B_i, B_j)$  (35)
where the weightings $\lambda_1$, $\lambda_2$ and $\lambda_3$ can be set such that the magnitude of the loss is comparable to that of the cross-entropy loss, or can be selected via cross-validation.
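A minimal PyTorch sketch of Equations (32)–(35) is given below. The interval tensors are assumed to be produced by a differentiable persistent homology computation (e.g., a topology layer as in [14]) and to have already been split by class as above, so that gradients can flow back to the feature extractor $\varphi$; the hyperparameter defaults simply mirror the values reported in Section 4.2.

```python
import torch

def topo_regularizer(birth_i, death_i, birth_j, death_j,
                     b0=0.1, b1=2.5, beta=0.3, gamma=1.8,
                     lambdas=(1.0, 1.0, 1.0)):
    """Birth, margin and length losses of Equations (32)-(35).

    birth_i, death_i: 1-D tensors of the 0-dim intervals whose birth belongs to class i.
    birth_j, death_j: the corresponding tensors for class j.
    """
    # Equation (32): pull class-wise birth times towards the targets b0 and b1.
    l_birth = (birth_i - b0).abs().sum() + (birth_j - b1).abs().sum()

    # Equation (33): penalize margins b_{j,min} - d_{i,k} smaller than gamma.
    b_j_min = birth_j.min()
    l_margin = torch.clamp(gamma + (death_i - b_j_min), min=0.0).sum()

    # Equation (34): keep lifetimes close to beta (neither too spread out nor too dense).
    l_length = ((death_i - birth_i) - beta).abs().sum() + \
               ((death_j - birth_j) - beta).abs().sum()

    l1, l2, l3 = lambdas
    return l1 * l_birth + l2 * l_margin + l3 * l_length      # Equation (35)

# Toy usage with fixed interval tensors (in training these would carry gradients).
bi = torch.tensor([0.1, 0.2]); di = torch.tensor([0.4, 0.5])
bj = torch.tensor([2.4, 2.6]); dj = torch.tensor([2.7, 3.0])
print(topo_regularizer(bi, di, bj, dj))
```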

4. Experiments

In this section, we evaluate our approach experimentally. We first consider point cloud optimization to build intuition about the behavior of the loss in Equation (35), and we then evaluate our approach on the image classification task.

4.1. Point Cloud Optimization

To validate our approach, as an illustrative example, we perform point cloud optimization with only the proposed loss in Equation (35), without any other loss items. Point clouds are generated from Gaussian mixture distributions, and we assume that these points come from two different classes, C1 and C2. In the following figures, purple points represent samples from C1, and red points represent samples from C2.

4.1.1. Gaussian Mixture with Two Components

To test the separation effect, we set the centers of the two components to (−0.6, 0) and (0.6, 0), and the covariance matrix of each component to the identity matrix. For the parameters of the weight function in Equation (24), we set $m = 0.1$ and $1/T = 0.05$. For the hyperparameters of the three loss items, we set $b_0 = 0$, $b_1 = 1.0$, $\beta = 0$ and $\gamma = 0.6$. To encourage clustering and speed up optimization, we adopt a dynamic update scheme for $m$, i.e., $m$ is gradually increased during training. The weightings of the three loss items are set to 9, 1 and 0.3, respectively, chosen according to the gradient information in the first epoch.
Figure 6a shows the initial position of the points; the purple points are sampled from C1, and the red points are sampled from C2. Figure 6b shows the final position after 5000 epochs; the mass concentration and separation effects are obvious, and the points from the two classes are well separated.
Since the weight function $f(x)$ may lead to imbalanced point configurations for the two classes, we address this issue by alternating the order of the two point sets when feeding data into the persistent homology computation during training. For the vision datasets, because our regularization is used together with the cross-entropy loss and a stochastic mini-batch sampling scheme, the imbalance is compensated automatically.
Figure 7 compares the persistence diagrams before and after training. Each green point represents a $(b_i, d_i)$ pair; the green point with $y = \inf$ represents the single merged component that remains when the filtration value is sufficiently large, and the green point with $y > 6$ indicates that, at this value, the two components merge, i.e., one component disappears and merges into the other, earlier-born component. In Figure 7, we can see that after 5000 epochs, the two subsets defined in Section 3.4.3, which correspond to the two classes, are well separated, i.e., two connected components can be identified in the persistence diagram. We can also see that points from the same class are concentrated.

4.1.2. Gaussian Mixture with Four Components

As a more challenging example, we consider a Gaussian mixture with four components. We suppose that they represent samples from two different classes, i.e., each class corresponds to two components. For each class, we hope the corresponding two components can merge. To achieve this goal, samples from one class have to travel across samples from the other class, which may cause the loss to increase. Therefore, to obtain optimal results, the optimizer needs to climb the mountain in the loss landscape before it arrives at a valley.
Figure 8 visualizes the points before and after training. Figure 8a shows the initial position of the points. The purple points are sampled from C1, and the red points are sampled from C2. We can see that after 1800 epochs, for each class, the two components merge into a single connected component, as shown in Figure 8b.
Figure 9 visualizes the function values calculated by our weight function (24); these values are used to construct the weighted Rips filtration using Equation (4) to extract topological information, and finally, the topological information is used by the regularizer to guide the optimization. Figure 9a visualizes the function values and the contour lines at epoch 0; it can be seen that larger values are assigned for points from C2. Figure 9b visualizes the function values and the contour lines after 1800 epochs.
Similar to Figure 7, Figure 10 compares the persistence diagrams before and after training. Figure 10b shows that after 1800 epochs, two connected components can be identified, and the mass concentration and separation effect is obvious.

4.1.3. Gaussian Mixture with Nine Components

In Figure 11 and Figure 12, we present the results for a Gaussian mixture with nine components. Figure 11a shows the initial position of the points sampled from two classes, while Figure 11b shows the results after 12,500 epochs. Figure 12 compares the persistence diagrams before and after training; it can be seen that our method achieves an effective performance in separating samples from two different classes.

4.2. Datasets

In this part, we use the same models and settings as [22] and evaluate our method on three vision benchmark datasets: MNIST [26], SVHN [27] and CIFAR10 [28]. For MNIST and SVHN, 250 instances are used for training the model; for CIFAR10, 500 and 1000 instances are used.
The CNN-13 [29] architecture is employed for CIFAR10 and SVHN, while a simpler CNN architecture is employed for MNIST. We use a stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and the cosine annealing learning rate scheduler [30].
With the cross-entropy loss, the weighting of our regularization term is set such that the loss in Equation (35) is comparable to the cross-entropy loss. In our experiments, each batch contains n = 8 sub-batches, and the sub-batch size is set to b = 16; thus, the total batch size is 128.
During training, for each epoch, we dynamically select the 10 most significant channels for each class to perform the topological computation; the criterion for channel selection is similar to that in [31]. To compensate for the imbalance between the two classes induced by the weight function, we use the ratio of the derivatives to weight the two items in the birth loss (Equation (32)). To meet the stability requirements of the topological computation, we use 0.001 as the minimal differentiable distance between points. The weighting of our regularization term is set to 0.001. For the parameters of the weight function (24), we set $m = 0.2$ and $1/T = 0.15$. For the hyperparameters of the three loss items, we set $b_0 = 0.1$, $b_1 = 2.5$, $\beta = 0.3$ and $\gamma = 1.8$. Weight decay on $\varphi$ is fixed to 1 × 10−3, and weight decay on $\eta$ is also fixed to 1 × 10−3, except for CIFAR10-1k, for which it is set to 5 × 10−4. On MNIST, the initial learning rate is fixed to 0.1; on SVHN and CIFAR10, it is fixed to 0.5.
Table 1 compares our method to the Vanilla baseline (which includes batch normalization, dropout and weight decay) and to the regularizers proposed in related works, in particular the regularizers based on statistics of representations [1,2] and the topological regularizer proposed in [22]. In addition, we also provide results for the Jacobian regularizer [9]. We report the average test error (%) and the standard deviation over 10 cross-validation runs. The number attached to a dataset name indicates the number of training instances used. Our method achieves the lowest average error on MNIST-250, CIFAR10-500 and CIFAR10-1k. For SVHN, the mean error is slightly higher than the result reported in [22], but our method achieves a lower variance. In particular, our method outperforms all the regularization methods based on statistical constraints by a significant margin, which demonstrates the advantage of the proposed topology-aware regularizer and supports our claim that mass separation is beneficial.

5. Conclusions

Traditionally, statistical methods are employed to impose constraints on the internal representation space for deep neural networks, while topological methods are generally underexploited. In this paper, we took a fundamentally different perspective to control internal representation with tools from TDA. By utilizing persistent homology, we constrained the push-forward probability measure and enhanced mass separation in the internal representation space. Specifically, we formulated a property of this measure that is beneficial for generalization for the first time, and we proved that a topological constraint in the representation space leads to mass separation. Moreover, we proposed a novel weight function for weighted Rips filtration, proved its stability and introduced a regularizer that operates on the persistence diagram obtained via persistent homology to control the distribution of the internal representations.
We evaluated our approach in the point cloud optimization task and the image classification task. For the point cloud optimization task, experiments showed that our method can separate points from different classes effectively. For the image classification task, experiments showed that our method significantly outperformed the previous relevant regularization methods, especially those methods based on statistical constraints.
In summary, both theoretical analysis and experimental results showed that our method can provide an effective learning signal utilizing topological information to guide internal representation learning. Our work demonstrated that persistent homology may serve as a novel and powerful tool for promoting topological structure in the internal representation space. Areas for future research are the exploration of the potential of 1-dimensional persistent homology and the development of other topology-aware methods for deep neural networks.

Author Contributions

M.C. and D.W. worked on conceptualization, methodology, software and writing—original draft preparation; S.F. and Y.Z. worked on validation and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No.62172086, Grant No.62272092).

Data Availability Statement

The data used in this study are available from the references.

Acknowledgments

The authors would like to express their sincere gratitude to the editor and the reviewers for their valuable comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Cogswell, M.; Ahmed, F.; Girshick, R.; Zitnick, L.; Batra, D. Reducing Overfitting in Deep Networks by Decorrelating Representations. In Proceedings of the International Conference on Learning Representations, San Juan, PR, USA, 2–4 May 2016.
2. Choi, D.; Rhee, W. Utilizing Class Information for Deep Network Representation Shaping. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January 2019.
3. Moor, M.; Horn, M.; Rieck, B.; Borgwardt, K. Topological Autoencoders. In Proceedings of the 37th International Conference on Machine Learning, Online, 13–18 July 2020.
4. Raghu, M.; Gilmer, J.; Yosinski, J.; Sohl-Dickstein, J. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
5. Littwin, E.; Wolf, L. Regularizing by the variance of the activations' sample-variances. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 2–8 December 2018.
6. Pang, T.Y.; Xu, K.; Dong, Y.P.; Du, C.; Chen, N.; Zhu, J. Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26 April 2020.
7. Zhu, W.; Qiu, Q.; Huang, J.; Calderbank, R.; Sapiro, G.; Daubechies, I. LDMNet: Low Dimensional Manifold Regularized Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
8. Carlsson, G. Topology and Data. Bull. Am. Math. Soc. 2009, 46, 255–308.
9. Hoffman, J.; Roberts, D.; Yaida, S. Robust Learning with Jacobian Regularization. arXiv 2019, arXiv:1908.02729.
10. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014.
11. Chazal, F.; Michel, B. An Introduction to Topological Data Analysis: Fundamental and Practical Aspects for Data Scientists. Front. Artif. Intell. 2017, 4, 667963.
12. Edelsbrunner, H.; Harer, J. Computational Topology: An Introduction; American Mathematical Society: Providence, RI, USA, 2010.
13. Hensel, F.; Moor, M.; Rieck, B. A Survey of Topological Machine Learning Methods. Front. Artif. Intell. 2021, 4, 681108.
14. Brüel-Gabrielsson, R.; Nelson, B.J.; Dwaraknath, A.; Skraba, P.; Guibas, L.J.; Carlsson, G. A topology layer for machine learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Online, 26–28 August 2020.
15. Kim, K.; Kim, J.; Zaheer, M.; Kim, J.S.; Chazal, F.; Wasserman, L. PLLay: Efficient Topological Layer Based on Persistence Landscapes. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020.
16. Hajij, M.; Istvan, K. Topological Deep Learning: Classification Neural Networks. arXiv 2021, arXiv:2102.08354.
17. Li, W.; Dasarathy, G.; Ramamurthy, K.N.; Berisha, V. Finding the Homology of Decision Boundaries with Active Learning. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020.
18. Chen, C.; Ni, X.; Bai, Q.; Wang, Y. A Topological Regularizer for Classifiers via Persistent Homology. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Naha, Japan, 16–18 April 2019.
19. Vandaele, R.; Kang, B.; Lijffijt, J.; De Bie, T.; Saeys, Y. Topologically Regularized Data Embeddings. In Proceedings of the International Conference on Learning Representations, Online, 25–29 April 2022.
20. Hofer, C.; Kwitt, R.; Dixit, M.; Niethammer, M. Connectivity-Optimized Representation Learning via Persistent Homology. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019.
21. Wu, P.; Zheng, S.; Goswami, M.; Metaxas, D.; Chen, C. A Topological Filter for Learning with Label Noise. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020.
22. Hofer, C.D.; Graf, F.; Niethammer, M.; Kwitt, R. Topologically Densified Distributions. In Proceedings of the 37th International Conference on Machine Learning, Online, 13–18 July 2020.
23. Sizemore, A.E.; Phillips-Cremins, J.; Ghrist, R.; Bassett, D.S. The Importance of the Whole: Topological Data Analysis for the Network Neuroscientist. Netw. Neurosci. 2019, 3, 656–673.
24. Chazal, F.; Cohen-Steiner, D.; Merigot, Q. Geometric Inference for Probability Measures. Found. Comput. Math. 2011, 11, 733–751.
25. Anai, H.; Chazal, F.; Glisse, M.; Ike, Y.; Inakoshi, H.; Tinarrage, R.; Umeda, Y. DTM-Based Filtrations. In Proceedings of the 35th International Symposium on Computational Geometry, Portland, OR, USA, 18–21 June 2019.
26. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
27. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading Digits in Natural Images with Unsupervised Feature Learning. In Proceedings of the Conference on Neural Information Processing Systems, Granada, Spain, 12–17 December 2011.
28. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
29. Laine, S.; Aila, T. Temporal ensembling for semisupervised learning. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
30. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
31. Huang, Z.; Wang, H.; Xing, E.P.; Huang, D. Self-Challenging Improves Cross-Domain Generalization. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020.
Figure 1. Illustration of when Φ(q, q; b) = 1 − c_{β,γ} holds, i.e., when the mass separation effect starts to occur. When q increases, s should decrease, which means that as M_l^β covers more mass of Q_1, it covers less mass of Q_2. As the batch size b increases, the least mass of Q_1 that M_l^β should cover decreases.
Figure 2. Illustration of Φ(q, s; b) for b = 8 and different values of q. Points at which 1 − c_{β,γ} = Φ(q, s; b) holds are marked by dots. When q increases, T_{b,c_{β,γ}}(q) moves towards zero, which indicates a smaller s, i.e., M_l^β covers less mass of Q_2.
Figure 3. Illustration of T_{b,c_{β,γ}}(q), i.e., the upper bound on s = Q_2(M_l^β), plotted as a function of the mass q = Q_1(M_l^β) (for b = 8 and different values of c_{β,γ}). For a fixed q, as c_{β,γ} increases, the maximal mass of Q_2 contained in M_l^β decreases, i.e., better separation is achieved.
Figure 4. Illustration of G_{b,c_{β,γ}}(p, l), i.e., the upper bound on s = Q_2(M_l^β), plotted as a function of the mass p = Q_1(M) (for b = 8 and different values of c_{β,γ}). For a fixed p, as c_{β,γ} increases, the maximal mass of Q_2 contained in M_l^β decreases.
Figure 5. Illustration of G_{b,c_{β,γ}}(p, l), i.e., the upper bound on s = Q_2(M_l^β), plotted as a function of the mass p = Q_1(M) (for c_{β,γ} = 0.95 and different values of b). For a fixed p, as the batch size increases, the maximal mass of Q_2 contained in M_l^β decreases.
Figure 6. (a) The original points are sampled from a Gaussian mixture with two components; the purple points are sampled from class 1, and the red points are sampled from class 2. (b) Optimized configuration after 5000 epochs; the points from the two classes are well separated.
Figure 7. (a) Persistence diagram obtained via persistent homology at epoch 0; (b) Persistence diagram after 5000 epochs. Two connected components can be identified; the points with x ≈ 0.5 correspond to the first connected component (class 1), and the points with x ≈ 1.0 correspond to the second connected component (class 2).
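To make the toy experiment in Figures 6 and 7 concrete, the sketch below samples a two-component Gaussian mixture and computes a 0-dimensional persistence diagram. It is a minimal illustration only: it uses a plain Vietoris–Rips filtration from the GUDHI library as a stand-in, whereas the diagrams in Figure 7 are built from the paper's weight-function-based filtration, and the means, scales and sample sizes below are arbitrary choices.

```python
# Minimal sketch: sample a two-class Gaussian mixture and compute its
# 0-dimensional persistence diagram with a Vietoris-Rips filtration (GUDHI).
import numpy as np
import gudhi

rng = np.random.default_rng(0)

# Two classes, one Gaussian component each (assumed means and scales).
class1 = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
class2 = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(50, 2))
points = np.vstack([class1, class2])

# Vietoris-Rips filtration, restricted to dimension-0 features (connected components).
rips = gudhi.RipsComplex(points=points, max_edge_length=5.0)
simplex_tree = rips.create_simplex_tree(max_dimension=1)
diagram = simplex_tree.persistence()

# Long-lived 0-dimensional classes indicate well-separated clusters.
h0 = [(birth, death) for dim, (birth, death) in diagram if dim == 0]
print(sorted(h0, key=lambda bd: bd[1] - bd[0], reverse=True)[:5])
```

Two long-lived bars in H0 signal two well-separated clusters, which is the qualitative signature visible in Figure 7b.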
Figure 8. (a) The original points are sampled from a Gaussian mixture with four components, and each class corresponds to two components; the purple points are sampled from class 1, and the red points are sampled from class 2; (b) Optimized configuration after 1800 epochs; the points from the two classes are well separated.
Figure 9. (a) Weight function values and contour lines evaluated on the mesh at epoch 0; the points from class 2 correspond to larger values. (b) Weight function values and contour lines evaluated on the mesh after 1800 epochs.
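Figure 9 visualizes a weight function on a regular mesh together with its contour lines. The sketch below reproduces that kind of plot with a generic distance-to-measure-style weight (the mean distance from a grid point to its k nearest samples, in the spirit of [24,25]) standing in for the paper's proposed weight function; the choice of k, the grid resolution and the plotting details are assumptions.

```python
# Sketch of a Figure 9-style plot: evaluate a DTM-style weight on a mesh and
# draw its contour lines. The weight used here is only a stand-in.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(50, 2)),   # class 1 (purple in the figure)
    rng.normal([2.0, 2.0], 0.3, size=(50, 2)),   # class 2 (red in the figure)
])

# Regular mesh over the region covered by the points.
xs = np.linspace(points[:, 0].min() - 1, points[:, 0].max() + 1, 200)
ys = np.linspace(points[:, 1].min() - 1, points[:, 1].max() + 1, 200)
xx, yy = np.meshgrid(xs, ys)
grid = np.column_stack([xx.ravel(), yy.ravel()])

# DTM-style weight: average distance from each grid point to its k nearest samples.
k = 10
nn = NearestNeighbors(n_neighbors=k).fit(points)
dist, _ = nn.kneighbors(grid)
weight = dist.mean(axis=1).reshape(xx.shape)

# Filled values plus contour lines, as in Figure 9.
plt.contourf(xx, yy, weight, levels=20)
plt.contour(xx, yy, weight, levels=10, colors="black", linewidths=0.5)
plt.scatter(points[:, 0], points[:, 1], s=10, c="white")
plt.colorbar(label="weight")
plt.show()
```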
Figure 10. (a) Persistence diagram obtained via persistent homology at epoch 0. (b) Persistence diagram after 1800 epochs. Two connected components can be identified; the points with x ≈ 0.5 correspond to the first connected component (class 1), and the points with x ≈ 1.0 correspond to the second connected component (class 2).
Figure 11. (a) The original points are sampled from a Gaussian mixture with nine components; the purple points are sampled from class 1, and the red points are sampled from class 2. (b) Optimized configuration after 12,500 epochs; the points from the two classes are well separated.
Figure 12. (a) Persistence diagram obtained via persistent homology at epoch 0. (b) Persistence diagram after 12,500 epochs. Two connected components can be identified; the points with x ≈ 0.5 correspond to the first connected component (class 1), and the points with x ≈ 1.0 correspond to the second connected component (class 2).
Table 1. Comparison with previous regularizers. "Vanilla" includes batch normalization, dropout and weight decay. The average test error (%) and the standard deviation are reported.
Regularization                   MNIST-250    SVHN-250     CIFAR10-500   CIFAR10-1k
Vanilla                          7.1 ± 1.0    30.1 ± 2.9   39.4 ± 1.5    29.5 ± 0.8
+Jac.-Reg [9]                    6.2 ± 0.8    33.1 ± 2.8   39.7 ± 2.0    29.8 ± 1.2
+DeCov [1]                       6.5 ± 1.1    28.9 ± 2.2   38.2 ± 1.5    29.0 ± 0.6
+VR [2]                          6.1 ± 0.5    28.2 ± 2.4   38.6 ± 1.4    29.3 ± 0.7
+cw-CR [2]                       7.0 ± 0.6    28.8 ± 2.9   39.0 ± 1.9    29.1 ± 0.7
+cw-VR [2]                       6.2 ± 0.8    28.4 ± 2.5   38.5 ± 1.6    29.0 ± 0.7
+Sub-batches                     7.1 ± 0.5    27.5 ± 2.6   38.3 ± 3.0    28.9 ± 0.4
+Sub-batches + Top.-Reg [22]     5.6 ± 0.7    22.5 ± 2.0   36.5 ± 1.2    28.5 ± 0.6
+Sub-batches + Top.-Reg [22]     5.9 ± 0.3    23.3 ± 1.1   36.8 ± 0.3    28.8 ± 0.3
+Sub-batches + Top.-Reg (Ours)   4.3 ± 0.3    22.9 ± 1.3   35.2 ± 0.6    27.4 ± 0.6
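In Table 1, MNIST-250, SVHN-250, CIFAR10-500 and CIFAR10-1k denote training sets restricted to 250, 250, 500 and 1000 labeled examples, respectively. The sketch below shows one way such a subset could be drawn with torchvision; the class-balanced sampling, the fixed seed and the batch size are illustrative assumptions rather than the authors' exact protocol (which presumably follows the small-sample setup of [22]).

```python
# Sketch: build a 250-example, class-balanced MNIST training subset ("MNIST-250").
import numpy as np
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

def small_labeled_subset(dataset, n_total, n_classes=10, seed=0):
    """Draw n_total examples, n_total // n_classes per class, without replacement."""
    rng = np.random.default_rng(seed)
    targets = np.asarray(dataset.targets)
    per_class = n_total // n_classes
    indices = []
    for c in range(n_classes):
        candidates = np.flatnonzero(targets == c)
        indices.extend(int(i) for i in rng.choice(candidates, size=per_class, replace=False))
    return Subset(dataset, indices)

mnist = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())
mnist_250 = small_labeled_subset(mnist, n_total=250)   # 25 images per class
loader = DataLoader(mnist_250, batch_size=25, shuffle=True)
```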