Article

Asymmetric Graph Contrastive Learning

by Xinglong Chang, Jianrong Wang, Rui Guo, Yingkui Wang and Weihao Li
1 School of New Media and Communication, Tianjin University, Tianjin 300350, China
2 Qijia Youdao Network Technology (Beijing) Co., Ltd., Beijing 100012, China
3 College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
4 Department of Computer Science and Technology, Tianjin Renai College, Tianjin 301636, China
5 Data61-CSIRO, Black Mountain Laboratories, Canberra, ACT 2601, Australia
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(21), 4505; https://doi.org/10.3390/math11214505
Submission received: 30 September 2023 / Revised: 29 October 2023 / Accepted: 30 October 2023 / Published: 31 October 2023
(This article belongs to the Special Issue Graph Machine Learning for Analyzing Complex Networks)

Abstract
Learning effective graph representations in an unsupervised manner is a popular research topic in graph data analysis. Recently, contrastive learning has shown success in unsupervised graph representation learning. However, how to avoid collapsing solutions in contrastive learning methods remains a critical challenge. In this paper, a simple method is proposed to solve this problem for graph representation learning, which differs from existing commonly used techniques (such as negative samples or a predictor network). The proposed model mainly relies on an asymmetric design consisting of two graph neural networks (GNNs) with unequal depths, which learn node representations from two augmented views, and it defines the contrastive loss based only on positive sample pairs. This simple method has lower computational and memory complexity than existing methods. Furthermore, a theoretical analysis proves that the asymmetric design avoids collapsing solutions when trained together with a stop-gradient operation. Our method is compared to nine state-of-the-art methods on six real-world datasets to demonstrate its validity and superiority. Ablation experiments further validate the essential role of the asymmetric architecture.

1. Introduction

Graph-structured data are ubiquitous in the real world, appearing in social networks [1], e-commerce networks [2], traffic networks [3], chemical graphs [4,5,6], and fractal networks [7,8,9]. Graph theory is extensively utilized to analyze these complex networks, studying aspects such as degree distribution, clustering coefficients, average path length, degree correlations, and community structure. In addition to these analyses, learning low-dimensional dense embeddings for nodes is widely used in graph analysis tasks and has been studied extensively in the past few years. Conventional network representation methods include matrix factorization-based methods [10,11] and random walk-based methods [12,13,14]. Recently, graph neural networks (GNNs) [15,16,17,18], which learn node representations by aggregating neighbor information, have achieved state-of-the-art performance in many graph analysis tasks, such as node classification [19,20], link prediction [21], and graph classification [22].
GNNs leverage graph theory principles to learn and encode complex relationships within graphs. By combining the theoretical foundations of graph theory with the computational capabilities of neural networks, GNNs provide a versatile and effective approach for understanding and analyzing complex relational data represented as graphs. However, most existing GNN models are defined in a supervised manner, which makes them inapplicable to data without costly human-annotated labels.
Inspired by recent success in self-supervised contrastive representation learning in computer vision [23,24] and natural language processing [25,26], some works [27,28,29,30,31] attempt to learn graph representations in an unsupervised manner. These methods use GNNs as graph encoders and learn node-level or graph-level representations by maximizing the agreement between similar nodes or graphs. Earlier works adopt a cross-scale contrastive mechanism (i.e., node–subgraph or node–graph contrasting), contrasting the representations of the global graph and local patches by maximizing the mutual information (MI) [32,33]. More recently, several studies based on a same-scale contrastive mechanism [27,28,29] have been proposed. Rather than contrasting elements at different scales, they directly perform node–node or graph–graph contrasting to learn representations. Generally, these methods first generate two graph views via data augmentation and then learn representations by maximizing the agreement of elements across the two views.
Despite these different contrastive mechanisms, avoiding collapsing solutions remains a fundamental challenge in graph contrastive learning. In a collapsing solution, graph models map all nodes into constant outputs to reduce the distance among them, so the training loss decreases to zero in a trivial way. Existing graph contrastive learning methods mostly focus on designing data augmentation strategies to improve their performance on classification tasks [27,28,29,30,31]. None of them address the issue of collapsing solutions directly; instead, they adopt established techniques from computer vision to circumvent the problem. For example, GRACE [27] uses a large number of negative samples, and BGRL [31] uses a complex momentum-updated encoder together with a predictor network. Recently, SimSiam [34] proposed a hypothesis to investigate which component implicitly works to avoid collapsing solutions.
Building on the above methods, we perform a theoretical analysis and experimental demonstrations. The results show that an asymmetric encoder architecture is a crucial design for avoiding collapsing solutions. Based on this observation, an Asymmetric Graph Contrastive learning method (AGC) is proposed, which employs a pair of GNNs with unequal depths for contrasting. Combined with a stop-gradient operation, AGC can learn node representations without using any negative samples while avoiding collapsing solutions. It does not require complex designs such as a momentum-updated encoder or a predictor network. Our contributions are summarized as follows:
  • We propose a simple asymmetric contrastive learning method for graph representation learning. The proposed AGC method uses a pair of GNN encoders with unequal depths to jointly capture information from different neighbor scales and thus learns more comprehensive node representations;
  • We are the first to analyze the reasons behind collapsing solutions in graph contrastive learning and provide a theoretical analysis of why the asymmetric design, working together with the stop-gradient operation in AGC, successfully avoids this issue;
  • We compare our method with nine baselines on six real-world datasets to demonstrate its superiority over state-of-the-art graph contrastive learning methods. We also empirically demonstrate that batch normalization and residual connections can further improve model performance when adopting a deeper GNN-based encoder.

2. Theoretical Explanation of AGC

In this section, we discuss the reasons behind the collapsing solutions in graph contrastive learning. We explain why our asymmetric model AGC can avoid collapsing solutions even when training without using negative samples.

2.1. Analysis of Symmetric Design

We first show that, even with a stop-gradient operation, a model built from symmetric encoders results in trivial solutions. The node representation $z$ is written as $z(\theta, \tilde{G})$, denoting that $z$ is learned from graph view $\tilde{G}$ via an encoder with parameters $\theta$. The loss objective is written in terms of the L2-norm, which is equivalent to the cosine similarity because the node features are normalized. Consider the loss function:
$$\mathcal{L} = \mathbb{E}\left[\left\| z(\theta_L, \tilde{G}_1) - \mathrm{stopgrad}\big(z(\theta_S, \tilde{G}_2)\big) \right\|_2^2 + \left\| z(\theta_L, \tilde{G}_2) - \mathrm{stopgrad}\big(z(\theta_S, \tilde{G}_1)\big) \right\|_2^2\right], \tag{1}$$
where $\theta_L$ and $\theta_S$ denote the parameters of the two encoders, and $\tilde{G}_1$ and $\tilde{G}_2$ represent the two graph views. For symmetrically designed models, $\theta_L$ and $\theta_S$ are fully shared and can be represented by a single parameter set $\theta$ for simplicity. The loss can therefore be rewritten numerically as:
$$\mathcal{L} = 2 \cdot \mathbb{E}\left[\left\| z(\theta, \tilde{G}_1) - z(\theta, \tilde{G}_2) \right\|_2^2\right], \tag{2}$$
where the two norm terms are merged since they are numerically equal. The terms marked as stopgrad provide no gradient to the model; therefore, the gradient is calculated as:
$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial z(\theta, \tilde{G}_1)} \cdot \frac{\partial z(\theta, \tilde{G}_1)}{\partial \theta} + \frac{\partial \mathcal{L}}{\partial z(\theta, \tilde{G}_2)} \cdot \frac{\partial z(\theta, \tilde{G}_2)}{\partial \theta}. \tag{3}$$
It is easy to see that the optimizer adjusts the parameters to drive the loss to zero, because every term in (3) is directly affected by $\theta$. As for the augmentations in $\mathcal{L}$, $\tilde{G}_1$ and $\tilde{G}_2$ are two different augmented graphs randomly sampled from the same distribution in every epoch. As the parameters are optimized, the model outputs the same constant vector regardless of the augmented input graph, and the loss decreases to zero.
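To make this concrete, the following toy PyTorch sketch (our illustration, not the authors' code; the two-layer MLP encoder, feature-masking augmentation, and all dimensions are assumptions) trains a single shared encoder with the symmetric loss (2): the loss falls toward zero while the embedding variance across nodes shrinks, i.e., all nodes are mapped toward the same constant vector.
```python
# Toy illustration of symmetric collapse: one shared encoder, loss (2).
import torch

torch.manual_seed(0)
N, d, h = 128, 32, 16
X = torch.randn(N, d)                       # toy node features

encoder = torch.nn.Sequential(              # shared (symmetric) encoder
    torch.nn.Linear(d, h), torch.nn.ReLU(), torch.nn.Linear(h, h)
)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-2)

def augment(x, mask_rate=0.3):
    """Feature-masking augmentation: zero out random feature dimensions."""
    mask = (torch.rand(x.shape[1]) > mask_rate).float()
    return x * mask

for step in range(501):
    z1, z2 = encoder(augment(X)), encoder(augment(X))
    loss = 2 * ((z1 - z2) ** 2).sum(dim=1).mean()        # Eq. (2)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 0:
        # std across nodes -> 0 means all nodes map to (nearly) the same vector
        print(f"step {step:3d}  loss {loss.item():.4f}  "
              f"embedding std {z1.std(dim=0).mean().item():.4f}")
```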
The collapsing solution is analogous to overfitting in supervised learning without a regularization term. Both problems arise from focusing too much on reducing the loss value, which leads to degenerate parameters. The difference is that supervised models reduce the loss by seeking complex parameters that fit all samples, whereas in contrastive learning, models prefer simple parameters that output constant embeddings to lower the loss.
Figure 1 illustrates how the symmetric encoders are trained in the latent space. The gradient directions pointing directly at each other lead to the convergence of the two representations in the embedding space and a degradation in learning ability.

2.2. Analysis of Asymmetric Design

Based on the L2-norm loss (1), the gradient of the AGC model with the asymmetric design can be formulated as:
$$\frac{\partial \mathcal{L}}{\partial \theta_L} = \frac{\partial \mathcal{L}}{\partial z(\theta_L, \tilde{G}_1)} \cdot \frac{\partial z(\theta_L, \tilde{G}_1)}{\partial \theta_L} + \frac{\partial \mathcal{L}}{\partial z(\theta_L, \tilde{G}_2)} \cdot \frac{\partial z(\theta_L, \tilde{G}_2)}{\partial \theta_L}. \tag{4}$$
Compared to (3), the loss (1) also contains two terms that depend on the parameters $\theta_S$, which cannot be optimized directly by gradients; instead, $\theta_S$ is updated through a shared-parameters operation:
$$\theta_L \leftarrow \arg\min_{\theta_L} \mathcal{L}, \qquad \theta_S \leftarrow \theta_L. \tag{5}$$
We hypothesize that GNN encoders with unequal depths can learn similar embeddings that are classified into the same category, which is verified by the experiments in Section 4.2. Intuitively, better embeddings can be learned by aggregating different neighbor scales with proper parameters. Therefore, for each node, the embeddings learned by the two encoders are regarded as a positive pair. Because the deeper layers of the long encoder are not shared, the two embeddings in a positive pair never become completely identical, and the loss of AGC does not converge to zero.
The training process in latent space is shown in Figure 1. The gradient of $\theta_L$ points toward the embedding produced by the short encoder on the other view. In this case, the total loss cannot drop to zero because, as the parameters of the long encoder are updated, $z(\theta_S, \tilde{G})$ keeps moving within a similar region (marked by the dotted line) around $z(\theta_L, \tilde{G})$ in every epoch. The training target can therefore be viewed as keeping the two embeddings of the augmented views similar, but not identical, in latent space. The short shared-parameter encoder plays the same role as a regularization term in supervised learning: it prevents the loss from reaching zero and protects the model from collapsing.
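A minimal sketch of how the update rule (5) can be realized (our illustration; the linear layers, depths, and toy inputs are assumptions): the short encoder reuses the first m layers of the long encoder, and the short branch is detached (the stop-gradient), so gradients reach the shared layers only through the long branch.
```python
# Sketch of the asymmetric update in (5): the short encoder is the first m (shared)
# layers of the long encoder, and its branch is detached. Illustrative only.
import torch

d, h, m, L = 32, 16, 1, 3                                  # dims, short depth m, long depth L
layers = torch.nn.ModuleList(
    [torch.nn.Linear(d if i == 0 else h, h) for i in range(L)]
)
opt = torch.optim.SGD(layers.parameters(), lr=0.1)

def long_encoder(x):                                       # f_{theta_L}: all L layers
    for layer in layers:
        x = torch.relu(layer(x))
    return x

def short_encoder(x):                                      # f_{theta_S}: first m shared layers
    for layer in layers[:m]:
        x = torch.relu(layer(x))
    return x

x1, x2 = torch.randn(8, d), torch.randn(8, d)              # two augmented views (toy)
zL = long_encoder(x1)
zS = short_encoder(x2).detach()                            # stopgrad on the short branch
loss = ((zL - zS) ** 2).sum(dim=1).mean()                  # one term of the loss in (1)
opt.zero_grad()
loss.backward()                                            # gradients flow into theta_L only
opt.step()                                                 # theta_L <- argmin; theta_S follows
                                                           # automatically via the shared layers
```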

2.3. Analysis of Predictor-Based Methods

Moreover, we illustrate predictor-based methods in the embedding space. They follow the same principle: preventing the loss from falling to zero by using additional parameters to keep the embeddings similar but not identical. In Figure 1, the model uses a predictor network to push one view's embedding toward the other view in latent space, and a similarity objective measures the distance between one view and the prediction of the other view. The training objective is to enable the predictor to bridge the gap between the two augmentation distributions. In this way, the model avoids directly reducing the distance between embeddings, allowing the encoder to learn discriminative representations without collapsing.
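For contrast, here is a minimal sketch of the predictor-based objective used by SimSiam/BGRL-style methods (our illustration; the toy linear encoder, predictor, and dimensions are assumptions):
```python
# Predictor-based objective (SimSiam/BGRL style, simplified): a small predictor maps
# one view's embedding toward the stop-gradient target of the other view, so the
# encoder itself never needs to collapse to make the two sides agree.
import torch
import torch.nn.functional as F

h = 16
encoder = torch.nn.Linear(32, h)            # shared encoder (toy)
predictor = torch.nn.Linear(h, h)           # extra predictor network

def D(p, z):
    """Negative cosine similarity against a stop-gradient target."""
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

x1, x2 = torch.randn(8, 32), torch.randn(8, 32)   # two augmented views (toy)
z1, z2 = encoder(x1), encoder(x2)
loss = 0.5 * D(predictor(z1), z2) + 0.5 * D(predictor(z2), z1)
```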

2.4. Discussion

Some previous studies adopt a single-sampling strategy that calculates the loss with a pair $(\tilde{G}_1, \tilde{G}_2)$, whereas our proof is based on dense sampling with an additional pair $(\tilde{G}_2, \tilde{G}_1)$. In practice this makes no difference, because the augmentations use the same masking-rate hyperparameters in graph contrastive tasks. Over a long training procedure, the views can be regarded as the same sample augmented according to the same distribution.
In summary, we believe that the asymmetric design changes the optimization goal from making two embeddings coincide to making them similar in latent space, which avoids the collapsing solution. Compared with methods based on a predictor network, AGC provides a simpler and more intuitive approach to graph contrastive learning.

3. Method

In this section, we first explain notations and definitions used throughout the paper. Then, we introduce the Asymmetric Graph Contrastive learning (AGC) in detail including the overall framework and the design of asymmetric encoders.

3.1. Preliminaries

The input graph is defined as $G = (V, E)$, where $V$ is the set of nodes ($|V| = N$) and $E$ is the set of edges [35]. Each node $v_i \in V$ is represented by a feature vector $x_i \in \mathbb{R}^d$, and the matrix $X \in \mathbb{R}^{N \times d}$ collects all node features, with each row being one feature vector. The edge set can also be represented by an adjacency matrix $A \in \mathbb{R}^{N \times N}$, where $A_{i,j} = 1$ if there is an edge between nodes $v_i$ and $v_j$, and $A_{i,j} = 0$ otherwise. A graph can thus also be written as $G = (A, X)$. In unsupervised node-level representation learning, the objective is to train an encoder $f(\cdot)$ without using labels. The encoder learns node representations $Z \in \mathbb{R}^{N \times d'}$ from the topology $A$ and features $X$, i.e., $Z = f(X, A)$. The learned node representations are low-dimensional ($d' \ll d$) and capture both the structural properties and the feature information of the graph; they can be used in downstream tasks such as node classification or link prediction.
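A minimal illustration of this notation with random data (shapes only; the stand-in linear encoder is an assumption, not the AGC encoder):
```python
import torch

N, d, d_prime = 5, 8, 4                     # nodes, input dim, embedding dim (d' << d in practice)
X = torch.randn(N, d)                       # feature matrix X in R^{N x d}
A = torch.randint(0, 2, (N, N)).float()     # adjacency matrix A in R^{N x N}
A = torch.triu(A, 1); A = A + A.T           # symmetric, no self-loops

f = torch.nn.Linear(d, d_prime)             # stand-in encoder f(.)
Z = f(A @ X)                                # Z = f(X, A): aggregate once, then project
print(Z.shape)                              # torch.Size([5, 4])
```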

3.2. The Framework of AGC

AGC is a simple graph contrastive learning method with asymmetric encoders. The new method is designed to learn comprehensive node representations in a simple and effective way, without using negative samples or complex architecture. At each iteration, AGC first generates two views based on the original graph by using data augmentation. The two graph views are subsequently fed into a pair of asymmetric encoders to learn node representations. Finally, for two representations of a node in the two graph views, AGC employs a cosine similarity objective and maximizes the agreement between them. The flow chart of AGC is shown in Figure 2.
Similar to GRACE [27], we generate two augmented graphs from the original graph at both the structure and feature levels. The two augmented graphs provide different node contexts (structure level) and node content (feature level) for AGC to contrast. For structure-level augmentation, we randomly remove edges with probability $\alpha$ and denote the augmented topology by an adjacency matrix $\tilde{A}$. For feature-level augmentation, we first select feature dimensions with probability $\beta$ and then set the values in these dimensions to zero for all nodes; the augmented node features are denoted $\tilde{X}$. In AGC, two augmented graph views are generated: $\tilde{G}_1 = (\tilde{A}_1, \tilde{X}_1)$ and $\tilde{G}_2 = (\tilde{A}_2, \tilde{X}_2)$.
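The two augmentations can be sketched as follows (an illustrative implementation under our own assumption of a dense adjacency representation, not the authors' code):
```python
import torch

def remove_edges(A: torch.Tensor, alpha: float) -> torch.Tensor:
    """Structure-level augmentation: drop each undirected edge with probability alpha."""
    upper = torch.triu(A, diagonal=1)
    upper = upper * (torch.rand_like(upper) > alpha).float()
    return upper + upper.T                  # keep the adjacency symmetric

def mask_features(X: torch.Tensor, beta: float) -> torch.Tensor:
    """Feature-level augmentation: select dimensions with probability beta, zero them for all nodes."""
    mask = (torch.rand(X.shape[1]) > beta).float()
    return X * mask                         # broadcast over the node dimension

# Two augmented views G~1 = (A~1, X~1) and G~2 = (A~2, X~2)
N, d, alpha, beta = 100, 64, 0.6, 0.6
A = (torch.rand(N, N) < 0.05).float(); A = torch.triu(A, 1); A = A + A.T
X = torch.randn(N, d)
A1, X1 = remove_edges(A, alpha), mask_features(X, beta)
A2, X2 = remove_edges(A, alpha), mask_features(X, beta)
```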
With the two augmented graph views as input, a pair of asymmetric GNN encoders is constructed to learn node representations. The two GNN encoders have unequal depths. This asymmetric architecture is a vital component of AGC for avoiding collapsing solutions. It also helps learn more comprehensive node representations by capturing information from different neighbor scales: the long encoder aggregates information from high-order neighbors, while the short encoder gathers information from low-order neighbors. The long GNN encoder is denoted by $f_{\theta_L}(\cdot)$ and the short GNN encoder by $f_{\theta_S}(\cdot)$. Node $v_i$ is then represented as $z_i^L = f_{\theta_L}(\tilde{A}_1, \tilde{X}_1)$ and $z_i^S = f_{\theta_S}(\tilde{A}_2, \tilde{X}_2)$. The negative cosine similarity is minimized to make the two representations similar:
$$\mathcal{D}(z_i^L, z_i^S) = -\frac{z_i^L}{\| z_i^L \|_2} \cdot \frac{z_i^S}{\| z_i^S \|_2}, \tag{6}$$
where $\| \cdot \|_2$ is the L2-norm. We further compute $\bar{z}_i^L = f_{\theta_L}(\tilde{A}_2, \tilde{X}_2)$ and $\bar{z}_i^S = f_{\theta_S}(\tilde{A}_1, \tilde{X}_1)$. The symmetrized loss of AGC is written as an expectation over all nodes:
$$\mathcal{L} = \mathbb{E}_{v_i}\left[\mathcal{D}\big(z_i^L, \mathrm{stopgrad}(z_i^S)\big) + \mathcal{D}\big(\bar{z}_i^L, \mathrm{stopgrad}(\bar{z}_i^S)\big)\right], \tag{7}$$
where $\mathrm{stopgrad}(\cdot)$ is the stop-gradient operation. This means the short encoder $f_{\theta_S}(\cdot)$ receives no gradient: its parameters are not updated by gradients but are shared from $f_{\theta_L}(\cdot)$. The proof that asymmetric encoders together with $\mathrm{stopgrad}(\cdot)$ avoid collapsing solutions is given in Section 2.
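The objective in (6) and (7) translates almost directly into code; a minimal sketch (our illustration, with stopgrad realized as .detach()):
```python
import torch
import torch.nn.functional as F

def D(z_long: torch.Tensor, z_short: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity, Eq. (6), averaged over nodes."""
    return -F.cosine_similarity(z_long, z_short, dim=-1).mean()

def agc_loss(zL, zS, zL_bar, zS_bar):
    """Symmetrized loss, Eq. (7); the short-encoder outputs are stop-gradient targets."""
    return D(zL, zS.detach()) + D(zL_bar, zS_bar.detach())

# Usage sketch: zL = f_long(A1, X1); zS = f_short(A2, X2);
#               zL_bar = f_long(A2, X2); zS_bar = f_short(A1, X1);
#               loss = agc_loss(zL, zS, zL_bar, zS_bar)
```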

3.3. The Design of Asymmetric Encoders

In AGC, node representations are learned by a pair of asymmetric encoders designed to avoid collapsing solutions. With the asymmetric encoders, AGC can learn efficiently without negative samples and follows a simple design with neither a predictor network nor a moving-average parameter update. An objective without dense contrast (i.e., without computing distances to negative samples) imposes lower memory and computational costs on large graphs. Meanwhile, the asymmetric GNN encoders learn node representations from different neighbor scales: the short encoder captures low-order information and the long GNN captures high-order information. Compared with existing graph contrastive learning methods, which employ symmetric encoders, AGC learns more comprehensive information by considering different aggregation scales, an important aspect of graph learning that other models lack. Generally, we formulate the representation of $v_i$ at the $t$-th layer as
$$z_i^{(t)} = \mathrm{COMBINE}^{(t)}\big(z_i^{(t-1)}, a_i^{(t)}\big), \tag{8}$$
$$a_i^{(t)} = \mathrm{AGGREGATE}^{(t)}\big(\{ z_j^{(t-1)} : v_j \in \mathcal{N}(v_i) \}\big), \tag{9}$$
where $\mathcal{N}(v_i)$ is the neighbor set of $v_i$, $z_i^{(0)} = x_i$, and $\mathrm{COMBINE}(\cdot)$ and $\mathrm{AGGREGATE}(\cdot)$ are component functions with learnable parameters. For the short GNN encoder with $m$ layers in AGC, the parameters of all $m$ layers are shared with the first $m$ layers of the long GNN encoder.
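Equations (8) and (9) describe a generic message-passing layer. One common instantiation (mean aggregation and a linear COMBINE over concatenated self/neighbor representations, our choice for illustration rather than necessarily the layer used in the paper) looks like this:
```python
import torch

class MessagePassingLayer(torch.nn.Module):
    """One layer of Eqs. (8)-(9): AGGREGATE = neighbor mean, COMBINE = linear + ReLU."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.combine = torch.nn.Linear(2 * in_dim, out_dim)

    def forward(self, Z: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        a = (A @ Z) / deg                                   # AGGREGATE over N(v_i)
        return torch.relu(self.combine(torch.cat([Z, a], dim=1)))  # COMBINE
```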
We use batch normalization [36] to stabilize training and residual connections [37] in the branch of the long encoder. Results in the experimental section demonstrate that batch normalization and residual connections further improve the performance of our model. The node representation in the long encoder with residual connections can be formulated as
$$z_i^{(t)} = \mathrm{COMBINE}^{(t)}\big(z_i^{(t-1)}, z_i^{(m)}, a_i^{(t)}\big), \quad t > m, \tag{10}$$
where $z_i^{(t)}$ is the representation of $v_i$ at the $t$-th layer, $z_i^{(m)}$ is the representation of $v_i$ at the $m$-th layer, and $m$ is the depth of the short GNN.
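Putting the pieces together, the asymmetric encoder pair with a shared m-layer prefix, batch normalization, and the residual connection of (10) can be sketched as follows (layer internals and the additive form of the residual are our illustrative choices, not necessarily the authors' exact implementation):
```python
import torch

class SimpleGNNLayer(torch.nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = torch.nn.Linear(in_dim, out_dim)
        self.bn = torch.nn.BatchNorm1d(out_dim)             # batch normalization [36]

    def forward(self, Z, A_hat):
        return torch.relu(self.bn(self.lin(A_hat @ Z)))     # aggregate, transform, normalize

class AsymmetricEncoders(torch.nn.Module):
    """Long encoder of depth L; its first m layers double as the short encoder."""

    def __init__(self, in_dim: int, hid_dim: int, m: int = 1, L: int = 3):
        super().__init__()
        dims = [in_dim] + [hid_dim] * L
        self.layers = torch.nn.ModuleList(
            [SimpleGNNLayer(dims[i], dims[i + 1]) for i in range(L)]
        )
        self.m = m

    def forward_short(self, X, A_hat):                      # f_{theta_S}: shared prefix
        Z = X
        for layer in self.layers[: self.m]:
            Z = layer(Z, A_hat)
        return Z

    def forward_long(self, X, A_hat):                       # f_{theta_L}: all L layers
        Z_m = self.forward_short(X, A_hat)                  # intermediate embedding z^(m)
        Z = Z_m
        for layer in self.layers[self.m :]:
            Z = layer(Z, A_hat) + Z_m                       # residual connection, Eq. (10)
        return Z

# Usage sketch: z_L = enc.forward_long(X1, A1_hat); z_S = enc.forward_short(X2, A2_hat)
```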

4. Experiments

In this section, we empirically evaluate the proposed AGC model on the node classification task, and further perform ablation experiments to investigate the role of each component in AGC.

4.1. Experimental Setup

4.1.1. Dataset

To analyze the effectiveness of AGC, we choose six widely used datasets for the node classification task: two citation networks (Cora, CiteSeer), two co-authorship networks (CoauthorCS, CoauthorPhy), and two co-purchase networks (Amazon-Computers, Amazon-Photos). For each dataset, nodes are randomly divided into training, validation, and test sets in a ratio of 80%, 10%, and 10%, respectively.

4.1.2. Evaluation Protocol

AGC is compared with other baselines by following the original evaluation design of DGI [32], where models are trained in a fully unsupervised way and the learned embeddings are used to train a simple L2-regularized logistic regression classifier. The procedure is repeated for twenty runs with different training and test data splits and the average classification accuracy is reported on each dataset for a fair evaluation.
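A sketch of this evaluation protocol (our illustration: the frozen embeddings Z and labels y are assumed to be given as NumPy arrays, and the simple 90/10 probe split per run is a simplification of the 80/10/10 splits described above):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_evaluation(Z: np.ndarray, y: np.ndarray, runs: int = 20) -> float:
    """Train an L2-regularized logistic regression on frozen embeddings over `runs` splits."""
    accs = []
    for seed in range(runs):
        Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, test_size=0.1, random_state=seed)
        clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
        clf.fit(Z_tr, y_tr)
        accs.append(clf.score(Z_te, y_te))
    return float(np.mean(accs))             # average accuracy over the runs
```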

4.2. Performance on Node Classification

Table 1 presents the node classification accuracy on the six datasets. AGC is principally compared with three methods based on negative samples (GRACE and GCA) or a predictor network (BGRL). The performance of two classic unsupervised models (DeepWalk, node2vec), two strong contrastive models (DGI, GMI), and a semi-supervised two-layer GCN model is also reported. The accuracy results are reused from the reports in [27,28].
According to Table 1, we make the following observations. (1) AGC outperforms the baseline methods on five of the six datasets. The main reason lies in the asymmetric design of our model: compared with models using a fixed-depth encoder, AGC learns more comprehensive node representations from information carried by different neighbor scales. (2) Our asymmetry-based model shows considerable improvement over models that depend on negative samples to avoid collapsed solutions. At the same time, our model requires less memory: our 16 GB Tesla P100 GPU could not support the training of GRACE and GCA on CoauthorPhy, while AGC needs only about a quarter of that memory. (3) AGC shows a clear improvement over supervised GCN across all datasets, although its performance still falls behind the current supervised state-of-the-art methods. (4) Both the outputs of the long encoder (AGC-L) and the short encoder (AGC-S) achieve competitive performance among all methods, which supports the hypothesis in Section 2 that node representations obtained by aggregating neighbors at different scales can represent the same semantics.

4.3. Ablation Study

In this section, we perform a series of ablation studies to show which components play an important role in avoiding collapsing solutions and which components contribute to the strong performance.

4.3.1. Asymmetric Architecture

As analyzed in Section 2, our proposed model avoids collapsing solutions through the design of asymmetric encoders. In this section, we run experiments with three variants (summarized in the configuration sketch below): “w/o asymmetric” denotes AGC with symmetric encoders (two GNNs of the same depth), “w/o stop-grad” denotes AGC without the stop-gradient operation, and “w/o share-parameter” denotes that the parameters of the short GNN encoder are not updated via the shared-parameter operation and remain constant.
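The three variants amount to toggling three design choices; the following sketch (illustrative names, not the authors' code) records them as configuration flags:
```python
from dataclasses import dataclass

@dataclass
class AGCConfig:
    asymmetric: bool = True        # long/short encoders of unequal depth
    stop_gradient: bool = True     # detach the short branch
    share_parameters: bool = True  # short encoder reuses the long encoder's first m layers

VARIANTS = {
    "AGC": AGCConfig(),
    "w/o asymmetric": AGCConfig(asymmetric=False),             # two GNNs of the same depth
    "w/o stop-grad": AGCConfig(stop_gradient=False),
    "w/o share-parameter": AGCConfig(share_parameters=False),  # short encoder kept frozen
}
```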
Experimental results are shown in Figure 3. AGC clearly escapes from collapsing solutions, maintaining a non-zero loss throughout the entire training process. Without the stop-gradient operation or the asymmetric architecture, the optimizer quickly finds the collapsing solution and reaches the minimum loss value of 0. Correspondingly, the linear evaluation accuracy of the trained embeddings is reported in Table 2. AGC achieves the top performance of 84.88% accuracy on the Cora dataset. When the stop-gradient operation or the asymmetric encoders are removed individually, the accuracy drops to 50.19% and 30.22%, respectively, showing poor ability to classify most types of nodes. These results imply that both components contribute to preventing the model from collapsing.
In addition, if the shared-parameters operation is removed from AGC, the loss can hardly converge to a normal solution because the contrastive task becomes harder: it is difficult to discover similarities between embeddings produced by two completely unrelated encoders. In fact, the 71.07% accuracy in this setting mostly benefits from the raw features of the graph.

4.3.2. Residual Connections

Table 2 shows the comparison with and without the residual operator, where all hyper-parameters are kept unchanged and the residual connection is the only difference. The results clearly show that the additional residual mechanism improves the representation learning ability of GNNs; when trained without it, accuracy drops by about 2 percentage points on the Cora dataset. Building connections between deeper layers and the intermediate node embedding allows the GNN-based encoder to adopt a deeper structure without the danger of over-smoothing caused by repeated aggregation. With its help, the proposed contrastive method benefits from jointly considering effective high-order and nearby information, achieving better performance than models with a symmetric encoder design.

4.3.3. Batch Normalization

To evaluate the effectiveness of the batch normalization layers as a component of our model, we run experiments with all batch normalization layers removed, which results in a decline of about six percentage points. Normalization methods are designed to make training more stable by giving different features the same distribution. In summary, we observe that the normalization layers are unrelated to avoiding the trivial solution, but when arranged properly they boost the learning process by about six percentage points.

4.3.4. Data Augmentation

Data augmentation plays an important role in most contrastive learning methods by creating different views. The rationale is that different parts of the graph information, involving both topology and features, can express the same semantics. Previous works (GRACE, GCA) that use the same data augmentation strategy as our method go through an exhausting search to find suitable values for the four augmentation hyper-parameters. In our work, we prefer the view that masking over half of the information in each of the two views minimizes the overlapping information between them while still carrying the corresponding semantics, which is the simplest way to define hard positive samples.
By training models with augmentation rates ranging from 0.0 to 0.9, our experiments show that setting the rates to 0.6 is a straightforward and effective choice. First, we fix the rate of removing edges and vary the rate of masking features to see its impact on BGRL and AGC under different settings. The results in Figure 4 show that large dropping rates give the model a stronger ability to distinguish nodes: AGC achieves its best performance at 0.6, while BGRL performs best at 0.9. If we remove attribute masking, AGC reaches its highest performance at 0.8. Different observations on GRACE show that it prefers lower augmentation rates (0.3, 0.4, 0.2, 0.4, as reported). This is caused by the different contrastive mechanism: methods based on negative samples need low augmentation rates to preserve the quality of representations learned from negative samples. A similar conclusion can be drawn from Figure 4, where the feature masking rate has a greater impact on accuracy.

5. Related Works

5.1. Graph Neural Networks

Graph neural networks (GNNs) [15,16,17,18] have been proven to be an effective tool for analyzing graph-structured data. Most GNNs follow the principle of message passing: they learn node representations by aggregating information from neighboring nodes under the guidance of the topological structure. Among them, there are many semi-supervised methods [35]. GCN [19] simplifies spectral graph convolutions by using a localized first-order approximation. GAT [20] uses an attention mechanism to learn the importance between nodes and their neighbors. GIN [38] aims at distinguishing different graph structures. There are also designs for unsupervised GNNs. One simple way is to integrate GNNs into an autoencoder framework, where an encoder employs GNNs to learn representations in latent space and a decoder reconstructs the graph structure [39]. Another popular way is to use DeepWalk-like objectives [12], sampling a portion of node pairs as negative pairs while treating node pairs connected by an edge as positive pairs [40].

5.2. Graph Contrastive Learning

Recently, self-supervised contrastive methods have achieved competitive performance in computer vision (CV) [34,41,42,43]. Benefiting from these advanced and effective ideas, many graph contrastive learning methods have been proposed. DGI [32] is the most representative work following the cross-scale contrastive mechanism first proposed as information maximization (InfoMax [44]). DGI maximizes the mutual information (MI) between patch representations and the overall graph representation, using the Jensen–Shannon divergence as an estimator [45,46]. Several works extend DGI. GMI [47] uses MI to measure the correlation between raw features and learned node representations. MVGRL [48] contrasts different views, such as PPR [49] and the heat kernel [50]. Further, for heterogeneous graphs with multiple types of nodes, DMGI [51] follows the InfoMax principle to learn multiple kinds of representations corresponding to multiple relations and obtains the final representations through meta-path-based aggregation.
A more general contrastive pattern is to maximize the agreement between two same-scale elements in different augmented data views. Different from augmentation methods in CV [52,53], augmentations on graphs are restricted. For the network structure, subgraph sampling and node dropping [29,30] are commonly used for graph-level tasks, while removing edges [27,28] is suitable for node-level representation learning. Compared with removing edges randomly [27], GCA [28] proposes an adaptive augmentation scheme for both edges and features. Besides data augmentation, the contrastive objective is another important component. SimCLR [41] first adopts an NCE-form [54] loss to contrast positive and negative samples in different augmented views. Some works on graphs follow SimCLR and transfer the NCE loss to node-level [27,28] and graph-level [29] representation learning. With the appearance of MoCo [42], a dynamic dictionary is used to preserve more data samples, which enables models to use a larger number of negative samples [55]. GCC [30] follows this idea and designs a graph encoder to capture structural information for pre-training tasks. More recently, some works [34,43] in CV show that a predictor component can effectively avoid collapsing solutions without using negative samples. Similarly, our work aims to avoid collapsing solutions without using negative samples in graph representation learning. Moreover, we consider the properties of graphs and design a simpler and more efficient model than existing methods.
GRACE [27] is a framework for graph representation learning similar to our model. Both generate two graph views at the structure and feature levels by removing edges from the topology and masking node attribute features, respectively. Nevertheless, GRACE applies a single GNN encoder to both graph views to generate low-dimensional node embeddings, whereas our model constructs a pair of asymmetric GNN encoders with unequal depths, which is the key component for avoiding collapsing solutions. Furthermore, while GRACE requires a substantial number of negative samples for training, our model prevents collapsing solutions even when trained without negative samples.

6. Conclusions

In this paper, we analyze why representations in contrastive learning easily fall into collapsing solutions when no negative samples are used. We show that encoders with a symmetric architecture lead to collapsing solutions, while a pair of asymmetric encoders combined with a stop-gradient operation avoids this problem. Following this analysis, we propose a new graph contrastive learning method (AGC). AGC uses two graph neural networks with unequal depths to learn node representations from two augmented views. The asymmetric architecture enables AGC to learn discriminative representations by jointly considering information from different neighbor scales. Based on the proposed AGC, we performed ablation experiments demonstrating the critical roles of both the asymmetric architecture and the stop-gradient operation in avoiding collapsing solutions. Experimental results on the node classification task also show the superiority of the proposed model.

Author Contributions

Conceptualization, X.C. and Y.W.; methodology, J.W.; software, R.G.; validation, Y.W. and W.L.; writing—original draft preparation, X.C.; writing—review and editing, J.W.; visualization, R.G.; supervision, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 62276187).

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: Available online: https://github.com/wyking36/AGC (accessed on 27 August 2023).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Fan, W.; Ma, Y.; Li, Q.; He, Y.; Zhao, E.; Tang, J.; Yin, D. Graph neural networks for social recommendation. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 417–426. [Google Scholar]
  2. Wu, S.; Tang, Y.; Zhu, Y.; Wang, L.; Xie, X.; Tan, T. Session-based recommendation with graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 346–353. [Google Scholar]
  3. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Chang, X.; Zhang, C. Connecting the dots: Multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, USA, 6–10 July 2020; pp. 753–763. [Google Scholar]
  4. Ullah, A.; Zeb, A.; Zaman, S. A new perspective on the modeling and topological characterization of H-Naphtalenic nanosheets with applications. J. Mol. Model. 2022, 28, 211. [Google Scholar] [CrossRef] [PubMed]
  5. Ullah, A.; Zaman, S.; Hamraz, A.; Muzammal, M. On the construction of some bioconjugate networks and their structural modeling via irregularity topological indices. Eur. Phys. J. E 2023, 46, 72. [Google Scholar] [CrossRef] [PubMed]
  6. Ullah, A.; Shamsudin; Zaman, S.; Hamraz, A. Zagreb connection topological descriptors and structural property of the triangular chain structures. Phys. Scr. 2023, 98, 025009. [Google Scholar] [CrossRef]
  7. Liu, J.B.; Bao, Y.; Zheng, W.T. Analyses of some structural properties on a class of hierarchical scale-free networks. Fractals 2022, 30, 2250136. [Google Scholar] [CrossRef]
  8. Liu, J.B.; Zhao, B.Y. Study on environmental efficiency of Anhui province based on SBM-DEA model and fractal theory. Fractals 2023, 31, 2340072. [Google Scholar] [CrossRef]
  9. Liu, F.; Liu, H.; Liu, R.; Xiao, C.; Duan, X.; McClements, D.J.; Liu, X. Delivery of sesamol using polyethylene-glycol-functionalized selenium nanoparticles in human liver cells in culture. J. Agric. Food Chem. 2019, 67, 2991–2998. [Google Scholar] [CrossRef]
  10. Koren, Y.; Bell, R.; Volinsky, C. Matrix factorization techniques for recommender systems. Computer 2009, 42, 30–37. [Google Scholar] [CrossRef]
  11. Kuang, D.; Ding, C.; Park, H. Symmetric nonnegative matrix factorization for graph clustering. In Proceedings of the Annual Meeting of the Society for Industrial and Applied Mathematics (SIAM), Minneapolis, MN, USA, 9–13 July 2012; pp. 106–117. [Google Scholar]
  12. Perozzi, B.; Al-Rfou, R.; Skiena, S. DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 701–710. [Google Scholar]
  13. Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; Mei, Q. LINE: Large-scale Information Network Embedding. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015. [Google Scholar]
  14. Grover, A.; Leskovec, J. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016. [Google Scholar]
  15. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4–24. [Google Scholar] [CrossRef]
  16. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The Graph Neural Network Model. IEEE Trans. Neural Networks 2009, 20, 61–80. [Google Scholar] [CrossRef]
  17. Cai, H.; Zheng, V.W.; Chang, K.C. A Comprehensive Survey of Graph Embedding: Problems, Techniques, and Applications. IEEE Trans. Knowl. Data Eng. 2018, 30, 1616–1637. [Google Scholar] [CrossRef]
  18. Cui, P.; Wang, X.; Pei, J.; Zhu, W. A Survey on Network Embedding. IEEE Trans. Knowl. Data Eng. 2019, 31, 833–852. [Google Scholar] [CrossRef]
  19. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017. [Google Scholar]
  20. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. In Proceedings of the 6th International Conference on Learning Representations. Vancouver Convention Center, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  21. Zhang, M.; Chen, Y. Link prediction based on graph neural networks. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems (NeurIPS2018), Palais des Congrès de Montréal, Montreal, QC, Canada, 2–8 December 2018. [Google Scholar]
  22. Ying, R.; You, J.; Morris, C.; Ren, X.; Hamilton, W.L.; Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems (NeurIPS2018), Palais des Congrès de Montréal, Montreal, QC, Canada, 2–8 December 2018; pp. 4801–4811. [Google Scholar]
  23. Le-Khac, P.H.; Healy, G.; Smeaton, A.F. Contrastive Representation Learning: A Framework and Review. IEEE Access 2020, 8, 193907–193934. [Google Scholar] [CrossRef]
  24. Jaiswal, A.; Babu, A.R.; Zadeh, M.Z.; Banerjee, D.; Makedon, F. A Survey on Contrastive Self-supervised Learning. arXiv 2020, arXiv:2011.00362. [Google Scholar] [CrossRef]
  25. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
  26. Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
  27. Zhu, Y.; Xu, Y.; Yu, F.; Liu, Q.; Wu, S.; Wang, L. Deep Graph Contrastive Representation Learning. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Virtual Event, 13–18 July 2020. [Google Scholar]
  28. Zhu, Y.; Xu, Y.; Yu, F.; Liu, Q.; Wu, S.; Wang, L. Graph Contrastive Learning with Adaptive Augmentation. arXiv 2020, arXiv:2010.14945. [Google Scholar]
  29. You, Y.; Chen, T.; Sui, Y.; Chen, T.; Wang, Z.; Shen, Y. Graph Contrastive Learning with Augmentations. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Online Conference, Canada, 6–12 December 2020. [Google Scholar]
  30. Qiu, J.; Chen, Q.; Dong, Y.; Zhang, J.; Yang, H.; Ding, M.; Wang, K.; Tang, J. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, USA, 6–10 July 2020; pp. 1150–1160. [Google Scholar]
  31. Thakoor, S.; Tallec, C.; Azar, M.G.; Munos, R.; Veličković, P.; Valko, M. Bootstrapped Representation Learning on Graphs. arXiv 2021, arXiv:2102.06514. [Google Scholar]
  32. Velickovic, P.; Fedus, W.; Hamilton, W.L.; Liò, P.; Bengio, Y.; Hjelm, R.D. Deep Graph Infomax. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  33. Sun, F.Y.; Hoffmann, J.; Tang, J. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. arXiv 2020, arXiv:1908.01000. [Google Scholar]
  34. Chen, X.; He, K. Exploring Simple Siamese Representation Learning. arXiv 2020, arXiv:2011.10566. [Google Scholar]
  35. He, D.X.; Guo, R. Inflation improves graph neural networks. In Proceedings of the 31st ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 1466–1474. [Google Scholar]
  36. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  38. Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How powerful are graph neural networks? In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019; pp. 1–17. [Google Scholar]
  39. Kipf, T.N.; Welling, M. Variational Graph Auto-Encoders. arXiv 2016, arXiv:1611.07308. [Google Scholar]
  40. Hamilton, W.L.; Ying, Z.; Leskovec, J. Inductive Representation Learning on Large Graphs. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  41. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G.E. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Virtual Event, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  42. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  43. Grill, J.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Pires, B.Á.; Guo, Z.; Azar, M.G.; et al. Bootstrap Your Own Latent—A New Approach to Self-Supervised Learning. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Online Conference, Canada, 6–12 December 2020. [Google Scholar]
  44. Hjelm, R.D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Trischler, A.; Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv 2019, arXiv:1808.06670. [Google Scholar]
  45. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.C.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
  46. Nowozin, S.; Cseke, B.; Tomioka, R. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  47. Peng, Z.; Huang, W.; Luo, M.; Zheng, Q.; Rong, Y.; Xu, T.; Huang, J. Graph representation learning via graphical mutual information maximization. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; pp. 259–270. [Google Scholar]
  48. Hassani, K.; Ahmadi, A.H.K. Contrastive Multi-View Representation Learning on Graphs. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Virtual Event, 13–18 July 2020. [Google Scholar]
  49. Page, L.; Brin, S.; Motwani, R.; Winograd, T. The PageRank Citation Ranking: Bringing Order to the Web; Technical Report SIDL-WP-1999-0120; Stanford Digital Library Technologies Project; Stanford University: Stanford, CA, USA, 1999. [Google Scholar]
  50. Kondor, R. Diffusion kernels on graphs and other discrete structures. In Proceedings of the 19th International Conference on Machine Learning (ICML-2002), University of New South Wales, Sydney, Australia, 8–12 July 2002. [Google Scholar]
  51. Park, C.; Han, J.; Yu, H. Deep multiplex graph infomax: Attentive multiplex network embedding using global information. Knowl. Based Syst. 2020, 197, 105861. [Google Scholar] [CrossRef]
  52. Gidaris, S.; Singh, P.; Komodakis, N. Unsupervised representation learning by predicting image rotations. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  53. DeVries, T.; Taylor, G.W. Improved regularization of convolutional neural networks with cutout. arXiv 2017, arXiv:1708.04552. [Google Scholar]
  54. Gutmann, M.; Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010. [Google Scholar]
  55. Wu, Z.; Xiong, Y.; Yu, S.; Lin, D. Unsupervised Feature Learning via Non-parametric Instance Discrimination. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3733–3742. [Google Scholar]
Figure 1. Visual expression of the methods analyzed in Section 2 by showing the learning process in latent space. (Left): The symmetry-design models that will result in collapsing solutions. (Middle): Our asymmetry-design AGC. (Right): Predictor-based methods.
Figure 2. The flow chart of AGC. Given a graph G = ( A , X ) , AGC generates two augmented graph views based on the original graph, and then uses a pair of asymmetric encoders to learn node representations. Finally, for two representations of one node, AGC employs a cosine similarity contrastive objective and maximizes it between views.
Figure 3. Loss curves of different variants.
Figure 4. Node classification accuracy(%) with different ratios in augmentation. (Left): Fix the ratios of removing edges and report results under different ratios of masking features. (Right): Fix the ratios of masking features and report results under different ratios of removing edges.
Table 1. Summary of the results of node classification tasks. Performances are measured in terms of classification accuracy along with standard deviations. Available data for each method during training is listed in the second column, where X, A, Y correspond to features, topology, and labels. The best results are in bold, and the second-best results are underlined.

Method        | Training Data | Cora          | Citeseer      | CoauthorCS    | CoauthorPhy   | Am.Photos     | Am.Computers
Raw features  | X             | 64.80 ± 0.78  | 64.60 ± 0.70  | 90.37 ± 0.00  | 93.58 ± 0.00  | 78.53 ± 0.00  | 73.81 ± 0.00
node2vec      | A             | 74.80 ± 0.34  | 52.30 ± 0.42  | 85.08 ± 0.03  | 91.19 ± 0.04  | 89.67 ± 0.12  | 84.39 ± 0.08
DeepWalk      | A             | 75.70 ± 0.68  | 50.50 ± 0.50  | 84.61 ± 0.22  | 91.77 ± 0.15  | 89.44 ± 0.11  | 85.68 ± 0.06
DGI           | X, A          | 82.60 ± 0.40  | 68.80 ± 0.70  | 92.15 ± 0.63  | 94.51 ± 0.52  | 91.61 ± 0.22  | 83.95 ± 0.47
GMI           | X, A          | 83.00 ± 0.30  | 72.40 ± 0.10  | OOM           | OOM           | 90.68 ± 0.17  | 82.21 ± 0.31
GRACE         | X, A          | 83.30 ± 0.40  | 72.10 ± 0.50  | 92.93 ± 0.01  | 95.26 ± 0.02  | 92.15 ± 0.24  | 86.25 ± 0.25
GCA           | X, A          | 80.90 ± 0.41  | 72.14 ± 0.06  | 93.10 ± 0.01  | 95.73 ± 0.03  | 92.53 ± 0.16  | 87.85 ± 0.31
BGRL          | X, A          | 82.77 ± 0.75  | 64.45 ± 0.15  | 93.21 ± 0.18  | 95.56 ± 0.12  | 92.87 ± 0.27  | 89.68 ± 0.31
AGC-L         | X, A          | 83.92 ± 0.79  | 73.06 ± 0.93  | 93.75 ± 0.15  | 95.94 ± 0.03  | 92.80 ± 0.11  | 88.75 ± 0.47
AGC-S         | X, A          | 84.88 ± 0.41  | 73.72 ± 0.24  | 93.76 ± 0.24  | 96.09 ± 0.07  | 93.32 ± 0.10  | 88.94 ± 0.27
GCN           | X, A, Y       | 82.80 ± 0.30  | 72.00 ± 0.50  | 93.03 ± 0.31  | 95.65 ± 0.16  | 92.42 ± 0.22  | 86.51 ± 0.54
Table 2. Classification accuracy of different variants on the Cora dataset.

Variants                 | acc. (%)
AGC                      | 84.88
AGC w/o asymmetric       | 30.22
AGC w/o stop-gradient    | 50.19
AGC w/o share-parameter  | 71.07
AGC w/o bn               | 78.78
AGC w/o residual         | 83.09
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
