1. Introduction
Link prediction helps complete the relations in graph-structured data [1] and has aroused interest from both academia and industry. Existing research mainly focuses on simple graphs, where a link (also known as a relation) associates two entities (also known as nodes), while many real-world relations involve more than two entities, such as chemical reactions [2], co-authorship relations [3], and social networks [4]. As shown in Figure 1, the “Located In” relation contains NYC, New York City, The Big Apple, USA, and The United States.
Thus, the term hyperlink is coined to model such relations, and a graph composed of hyperlinks is defined as a hypergraph [5].
As the relations among entities are sophisticated, constructing a hypergraph is time-consuming and hence expensive, making its incompleteness more severe than that of a simple graph. To mitigate this problem, the hyperlink prediction task was introduced [6]. Similar to link prediction in simple graphs, the task aims to complete the missing hyperlinks in a given hypergraph.
Example 1. Consider the bottom ellipse in green in Figure 1. Given several entities, e.g., NYC, New York City, The Big Apple, USA, and The United States, the target of hyperlink prediction is to determine whether there is a hyperlink and, if it exists, what it is (i.e., “Located In”). Furthermore, the directivity of hyperlinks also matters in some practical applications. Thus, the machine should also acquire the ability to predict the direction of the hyperlink to form the final answer, i.e., {NYC, New York City, The Big Apple} → {USA, The United States}.
To approach this task, current studies mainly fall into two categories: (1) Translation-based models try to generalize the translation constraint in simple graphs to hypergraphs, e.g., m-TransH [7], RAE [8], and NHP [9]. m-TransH directly extends TransH [10] from binary relations to the n-ary case, and RAE further integrates m-TransH with a multi-layer perceptron (MLP) by considering the relatedness of entities. Since they use the sum of projected embeddings as the scoring function, a change to some entities in a hyperlink may hardly be reflected in the score. (2) Neural-network-based models exploit the structural information of hypergraphs, e.g., NaLP [11], HGNN [12], and HyperGCN [13]. These methods design graph neural networks (GNNs) that absorb neighbouring features to improve entities’ representations. As GNNs usually incorporate a large number of parameters, sufficient learning relies on a large amount of training samples.
Albeit attracting attention, hyperlink prediction is still notoriously challenging, since existing studies neglect the core aspects of the task. First, accurately recording facts in a hypergraph sometimes necessitates the direction of hyperlinks. For a directed hyperlink, the entities can be divided into two parts, head and tail, based on the hyperlink’s direction. This mandates that the order of the two parts matters; in contrast, the specific order within each part is insignificant. As shown in Figure 1, without the arrow, we cannot figure out how these entities construct the relation “Located In”. In addition, NYC, New York City, and The Big Apple (i.e., the head) should be in front of USA and The United States (i.e., the tail), but the order inside the head or tail does not affect the determination. Nevertheless, existing methods mainly focus on undirected hyperlinks. The only exception, NHP, averages the entity embeddings generated by a GCN [14] to calculate a score for inferring the hyperlink direction, which is too rudimentary to capture the direction’s features. Second, as a hyperlink contains more than two entities, each entity contributes to the existence prediction. In this light, a good representation model needs to consider the representations of all the individual entities involved in a hyperlink when making a determination. However, current treatments of embeddings tend to apply a simple sum or average strategy, which is insensitive to the number of entities in a hyperlink, so an entity with a dominant representation can overwhelm the expressions of the others. Last but not least, as it is sometimes complicated even for a human being to annotate hyperlinks, training data are scarce and can be insufficient to train a large number of learnable parameters well.
In order to address these challenges, we propose a simple yet effective model, a Two-stage Framework for Directed Hyperlink Prediction, namely, TF-DHP. The model equally considers each entity’s contribution to the formation of hyperlinks and emphasizes not only the fixed order between the two parts but also the randomness inside each part. It conceives a pipeline of two tailored modules: a Tucker decomposition-based module for hyperlink prediction and a BiLSTM-based module for direction inference.
For predicting the existence of hyperlinks, we exploit Tucker decomposition to model hyperlinks, which, to the best of our knowledge, has so far been applied only to simple graphs [15] and not to hypergraphs. In particular, instead of applying third-order Tucker decomposition over simple graphs, we employ high-order Tucker decomposition for hypergraphs. It produces a core tensor, which represents the degree of interaction between entities. Then, we devise a scoring function as the mode product of this tensor with each entity representation, which evaluates the existence of hyperlinks. We theoretically show that the score is invariant to the order of mode products with entities, even though each hyperlink has a direction. In addition, the tensors from Tucker decomposition are usually of very high order, which can bring about high computational complexity. To mitigate the issue, we further introduce Tensor Ring (TR) decomposition [16] to decompose higher-order tensors into mode products of several third-order tensors, which effectively reduces the computational cost.
For inferring directions, we first recall the example in Figure 1. Once USA and The United States are determined as the tail entities, the head entities are implied, and if one of the tail entities changes, the head entities change accordingly. Thus, it is important for the model to pass information between the two parts both forward and backward, which motivates us to design a model that works bidirectionally. In this connection, BiLSTM [17] is utilized as the base model. In addition, the position of entities within the head (or tail) part is insignificant; hence, it is necessary to train the model to attend only to the order of the two parts. For this characteristic, we keep the order of the two parts but randomly shuffle the entities within each part, enforcing the model to be ignorant of entity positions within the head (or tail) part while being attentive to the order between the two parts. In this way, the data scale is increased as a by-product, alleviating the lack of data.
Contribution. In summary, we make the following contributions:
For existence prediction, we are among the first to generalize Tucker decomposition to high dimensions, and we introduce a tensor ring algorithm to reduce the model complexity. We theoretically prove that the mode product for scoring a hyperlink is invariant to the order of the participating entities.
For direction inference, we conceive a BiLSTM-based model that can take information into consideration both forward and backward with respect to a hyperlink. A data shuffling strategy is further incorporated to enforce the model to be ignorant of entity positions within the head (or tail) part while being attentive to the order between the two parts.
The two modules constitute a new model, TF-DHP, for predicting directed hyperlinks. Through experiments on several real-world datasets, we confirm the superiority of TF-DHP over state-of-the-art models.
Organization. The rest of the article is structured as follows. Section 2 introduces related work, and Section 3 provides a detailed account of TF-DHP. Section 4 reports the experimental setup and analyzes the experimental results. Section 5 concludes the paper.
3. Method
This section formalizes the task of directed hyperlink prediction and presents the proposed method, including the framework and module details. Definitions of the notations used in the text are given in Table 1.
3.1. Task Description
A directed hypergraph is an ordered pair $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_l\}$ denotes a set of entities and $l$ is the number of entities. $E$ comprises a set of directed hyperlinks, formally:

$$E = \{e_1, e_2, \ldots\}, \qquad e_i = (h_i, t_i), \qquad h_i, t_i \subseteq V.$$

Each element in $E$ can be divided into two components, where $h$ (resp. $t$) serves as the head (resp. tail), with the direction being from the head to the tail.
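To make the definition concrete, the following minimal Python sketch (our illustrative names, not the paper's code) represents a directed hypergraph whose components are unordered:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DirectedHyperlink:
    head: frozenset   # unordered head entities h
    tail: frozenset   # unordered tail entities t

# The "Located In" example from Figure 1
V = {"NYC", "New York City", "The Big Apple", "USA", "The United States"}
E = {DirectedHyperlink(head=frozenset({"NYC", "New York City", "The Big Apple"}),
                       tail=frozenset({"USA", "The United States"}))}
```

Frozen sets encode that the order within each component is irrelevant, while the (head, tail) pair preserves the direction.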
The directed hyperlink prediction task aims to predict missing hyperlinks, including their existence and associated direction, based on the relevance of the given entities. Take the relational knowledge in Figure 1 as an instance: the entities in each relation build $V$, and their corresponding relations form the directed hyperlinks $E$. Every sample in the dataset contains an uncertain number of entities, and we have to determine whether they can support a piece of relational knowledge and which component each entity belongs to.
3.2. Framework
TF-DHP consists of a Tucker decomposition-based hyperlink prediction module and a BiLSTM-based direction prediction module to predict directed hyperlinks among entity sets in a directed hypergraph. It is optimized by a ranking objective in which the scores of existing hyperlinks are ranked higher than those of non-existing entity subsets, and the scores of positive directions are higher than those of negative directions. The framework is shown in Figure 2.
We generalize TuckER [15] to high dimensions and regard it as a scoring function. After obtaining the embedding vectors of every entity in an entity set, we use the scoring function to evaluate whether a hyperlink exists. If the hyperlink does exist, we divide the entity set into two groups based on the direction label of each entity and then use a BiLSTM model [17] to evaluate the direction between the groups, which is defined as the direction of the hyperlink. Meanwhile, we also randomly shuffle the entities within each group to augment the training data, exploiting the characteristic that the order of entities within each group does not influence the direction.
3.3. Tucker Decomposition-Based Hyperlink Prediction Module
To predict hyperlinks over an entity set, we propose a Tucker decomposition-based scoring function and provide a mathematical proof of its invariance to the order of inputs.
3.3.1. Tucker Decomposition-Based Scoring Function
Tucker decomposition is a tensor decomposition algorithm that decomposes a higher-order tensor into a core tensor and several factor matrices. The core tensor reflects the degree of interaction between the different factor matrices. The formal expression is as follows:

$$\mathcal{X} \approx \mathcal{G} \times_1 A^{(1)} \times_2 A^{(2)} \times_3 \cdots \times_k A^{(k)},$$

where $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_k}$ denotes the original tensor, $\mathcal{G} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_k}$ denotes the core tensor with $R_1, \ldots, R_k$ much smaller than $I_1, \ldots, I_k$, $k$ denotes the order of $\mathcal{X}$, $\{A^{(1)}, \ldots, A^{(k)}\}$ denotes the set of factor matrices, and $\times_k$ denotes the tensor (mode) product along the $k$th mode. Since the dimensions of the core tensor are smaller than those of the original tensor in each order, the core tensor can be regarded as a dimensionality reduction of the original tensor.
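For concreteness, the mode product underlying this definition can be sketched in a few lines of NumPy (our helper names, not from the paper's code): the core tensor is multiplied by one factor matrix along each mode.

```python
import numpy as np

def mode_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` along the given mode (X x_k A)."""
    t = np.moveaxis(tensor, mode, 0)
    shape = t.shape
    t = matrix @ t.reshape(shape[0], -1)   # act on the mode-k fibers
    return np.moveaxis(t.reshape((matrix.shape[0],) + shape[1:]), 0, mode)

rng = np.random.default_rng(0)
G = rng.normal(size=(2, 3, 4))             # small core tensor
A = [rng.normal(size=(5, 2)), rng.normal(size=(6, 3)), rng.normal(size=(7, 4))]
X = G
for k, Ak in enumerate(A):                 # X = G x_1 A1 x_2 A2 x_3 A3
    X = mode_product(X, Ak, k)
print(X.shape)                             # (5, 6, 7)
```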
Based on the Tucker decomposition of the representation tensor, we design a scoring function to score each hyperlink. Specifically, if a hyperlink contains $m$ entities, we first select the corresponding entity and relation embeddings. Then, a parameter tensor is designed as the core tensor, containing learnable parameters shared by entities and relations [15]. Our goal is to optimize these parameters to fully exploit the relevance among entities and the associated relations based on their embeddings. The scoring function can be expressed as below:

$$\phi(r, v_1, \ldots, v_m) = \mathcal{W} \times_1 \mathbf{r} \times_2 \mathbf{v}_1 \times_3 \mathbf{v}_2 \cdots \times_{m+1} \mathbf{v}_m, \tag{3}$$

where $m$ changes with the number of entities contained in the hyperlink, the order of the core tensor $\mathcal{W}$ is equal to one plus the number of entities, $\mathbf{r}$ denotes the relation embedding of the hyperlink to be predicted, and $\mathbf{v}_1, \ldots, \mathbf{v}_m$ are the embeddings of the entities contained in the hyperlink. Since the tensor product of a tensor with a vector reduces the dimension of the corresponding order to 1, repeating the process $m+1$ times yields a real number, which is regarded as the score of the hyperlink.
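As a minimal sketch of Equation (3), assuming a shared embedding dimension for the relation and all entities (the names are ours, for illustration), the score is obtained by contracting the core tensor with one embedding per mode:

```python
import numpy as np

def tucker_score(core, relation, entities):
    """Contract the (m+1)-order core tensor with each embedding in turn."""
    t = core
    for vec in reversed([relation] + entities):
        # mode product with a vector removes one mode of the tensor
        t = np.tensordot(t, vec, axes=(t.ndim - 1, 0))
    return float(t)   # after m+1 contractions a scalar score remains

rng = np.random.default_rng(0)
d, m = 5, 3                               # embedding dimension, number of entities
core = rng.normal(size=(d,) * (m + 1))    # learnable core tensor W
r = rng.normal(size=d)                    # relation embedding
vs = [rng.normal(size=d) for _ in range(m)]
print(tucker_score(core, r, vs))          # real-valued hyperlink score
```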
As every entity in the hyperlink and the relation embedding are computed simultaneously, Equation (3) reduces information loss. Nevertheless, the computational complexity becomes enormous as the number of entities increases because of the inner computation of the high-order tensor product. To address the issue, we use the TR decomposition algorithm [16]. It represents a high-order tensor by a sequence of third-order tensors multiplied circularly, mathematically:

$$T(i_1, i_2, \ldots, i_n) = \mathrm{Tr}\big(Z_1(i_1)\, Z_2(i_2) \cdots Z_n(i_n)\big),$$

where $T$ denotes the original tensor of size $I_1 \times I_2 \times \cdots \times I_n$, $\{Z_1, Z_2, \ldots, Z_n\}$ denotes a set of third-order tensors whose dimensions are $R_k \times I_k \times R_{k+1}$ (with $R_{n+1} = R_1$), $Z_k(i_k)$ denotes the $i_k$-th lateral slice (a matrix) along the second order of the $k$th tensor, and $\mathrm{Tr}(\cdot)$ denotes the trace of the matrix product. Tensor ring decomposition makes the third dimension of the last decomposed tensor the same as the first dimension of the first decomposed tensor. The advantage is that a circular shift of the decomposed tensors does not change the result, owing to the matrix trace operation. By decomposing higher-order tensors into products of third-order tensors, TR decomposition dramatically reduces the computational load of the model when the tensor order is large.
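For intuition, a single entry of a tensor-ring tensor can be evaluated directly from this definition. A short sketch under our own naming:

```python
import numpy as np

def tr_element(cores, index):
    """T[i1, ..., in] = Tr( Z1(i1) @ Z2(i2) @ ... @ Zn(in) )."""
    prod = np.eye(cores[0].shape[0])
    for Z, i in zip(cores, index):
        prod = prod @ Z[:, i, :]   # i-th lateral slice of the k-th core
    return float(np.trace(prod))

rng = np.random.default_rng(0)
R, dims = 3, (4, 5, 6)                              # TR rank, tensor size I1 x I2 x I3
cores = [rng.normal(size=(R, d, R)) for d in dims]  # equal ranks close the ring
print(tr_element(cores, (0, 2, 4)))                 # one reconstructed entry
```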
The computational complexity grows sharply as the order of the core tensor grows, so we apply TR decomposition to the core tensor to decompose the high-order tensor into several third-order tensors multiplied circularly. Based on the definition of TR decomposition, every single parameter in the core tensor can be computed by the trace of a matrix product. This can be expressed in tensor form [16] as:

$$\mathcal{W} = \sum_{\alpha_1, \ldots, \alpha_{m+1}} \mathbf{z}_1(\alpha_1, :, \alpha_2) \circ \mathbf{z}_2(\alpha_2, :, \alpha_3) \circ \cdots \circ \mathbf{z}_{m+1}(\alpha_{m+1}, :, \alpha_1),$$

where $\mathbf{z}_k(\alpha_k, :, \alpha_{k+1})$ denotes the vector of the $k$th tensor corresponding to the indices $\alpha_k$ and $\alpha_{k+1}$, the symbol $\circ$ denotes the outer product of vectors, and $\alpha_k$ and $\alpha_{k+1}$ correspond to the dimensions of the first and third orders of the tensor. We use the simplified form $\Re(Z_1, Z_2, \ldots, Z_{m+1})$ to represent the decomposition of the core tensor. Combining this with Equation (3), we can rewrite the scoring function as:

$$\phi(r, v_1, \ldots, v_m) = \Re(Z_1, \ldots, Z_{m+1}) \times_1 \mathbf{r} \times_2 \mathbf{v}_1 \times_3 \cdots \times_{m+1} \mathbf{v}_m.$$
This scoring function not only considers all the entity and relation information contained in a hyperlink but also keeps the model complexity within an acceptable range. As shown in Table 2, it has fewer parameters than NaLP and is less prone to overfitting on datasets that are not large enough, as concretely shown in Figure 3.
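One way to see the saving is that the score never requires materializing the full core tensor: contracting each TR core with its embedding first yields one small R × R matrix per mode, and the score is the trace of their product. A sketch under our naming assumptions, which also illustrates the circular-shift invariance:

```python
import numpy as np

def tr_score(cores, embeddings):
    """Score = Tr( prod_k (sum_i embeddings[k][i] * cores[k][:, i, :]) )."""
    mats = [np.einsum('rds,d->rs', Z, x) for Z, x in zip(cores, embeddings)]
    prod = mats[0]
    for M in mats[1:]:
        prod = prod @ M
    return float(np.trace(prod))

rng = np.random.default_rng(1)
R, d, n = 3, 5, 4                                 # TR rank, embedding dim, order m+1
cores = [rng.normal(size=(R, d, R)) for _ in range(n)]
embs = [rng.normal(size=d) for _ in range(n)]     # relation first, then entities
s1 = tr_score(cores, embs)
s2 = tr_score(cores[1:] + cores[:1], embs[1:] + embs[:1])
print(np.isclose(s1, s2))                         # True: circular shifts preserve the score
```

With this scheme, the cost grows linearly with the order of the core tensor rather than exponentially.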
Since this model is based on the Tucker decomposition scoring function and needs to fix the order of the core tensor, it cannot process hyperlinks with different numbers of nodes at one time. For datasets with such hyperlinks, we need to group them by size before predicting, which increases the workload to a certain extent.
As the order of the core tensor increases, the number of third-order tensors required by TR decomposition increases accordingly, which increases the amount of computation to a certain extent. The machine used in this paper can deal with the prediction of hyperlinks with up to six nodes.
3.3.2. Proof of Sequence Independence
As illustrated above, Tucker decomposition processes the inputs sequentially, while the order of the entities contained in one hyperlink does not influence the determination, which requires our scoring function to be invariant. We prove that the order of the entities' and relation's embeddings in the tensor product makes no difference to the result. We first rewrite the scoring function in the tensor-wise form:

$$\phi = \mathcal{W} \times_1 \mathbf{x}_1 \times_2 \mathbf{x}_2 \times_3 \cdots \times_n \mathbf{x}_n,$$

where $\mathbf{x}_1 = \mathbf{r}$, $\mathbf{x}_{j+1} = \mathbf{v}_j$ for $j = 1, \ldots, m$, and $n = m + 1$.
In the aforementioned TR decomposition, the matrix trace operation and the identical dimensions of input and output ensure invariance under circular shifting. In the hypergraph setting, the dimensions of entities and relations are set to a fixed value, which extends the invariance from circular shifting to arbitrary order changes between single entities; that is, changing the order of the product does not change the result. So, we just need to prove that the order of the tensor product in the Tucker decomposition has no effect on the result. The element-wise form of the tensor product is as follows:

$$\phi = \sum_{i_1=1}^{d} \sum_{i_2=1}^{d} \cdots \sum_{i_n=1}^{d} \mathcal{W}(i_1, i_2, \ldots, i_n)\, \mathbf{x}_1(i_1)\, \mathbf{x}_2(i_2) \cdots \mathbf{x}_n(i_n), \tag{8}$$

where $d$ is the fixed embedding dimension.
On the right-hand side of the equation, if we regard the indices $i_1, \ldots, i_n$ as a set of independent integer variables ranging from 1 to $d$, then each $\mathbf{x}_j(i_j)$ can be regarded as a function of these independent variables, whose value is the element at the corresponding position of the embedding vector indexed by the variable. We use $f_1, \ldots, f_n$ (as in Equation (9)) to represent these functions. The expression $\mathcal{W}(i_1, \ldots, i_n)$ can be regarded as a multivariate function of the form $g(i_1, \ldots, i_n)$, whose value is the parameter at the corresponding position of the core tensor.
Then, if we let the independent variables take all real values from 1 to $d$ instead of being integers, we can transform Equation (8) into a multiple definite integral:

$$\phi' = \int \cdots \int_{D} g(x_1, x_2, \ldots, x_n)\, f_1(x_1)\, f_2(x_2) \cdots f_n(x_n)\, \mathrm{d}x_1\, \mathrm{d}x_2 \cdots \mathrm{d}x_n. \tag{9}$$

The integration domain $D$ of this multiple integral is the $n$-dimensional region $[1, d]^n$, of the same size as the core tensor in each order. Changing the order of the independent variables of $g$ does not change the corresponding parameter; thus, the order of $x_1, \ldots, x_n$ has no influence on the function $g$.
Since the functions $f_1, \ldots, f_n$ are all unary, the integral can be rewritten as an iterated integral:

$$\phi' = \int_1^d f_1(x_1) \left( \int_1^d f_2(x_2) \cdots \left( \int_1^d g(x_1, \ldots, x_n)\, f_n(x_n)\, \mathrm{d}x_n \right) \cdots \mathrm{d}x_2 \right) \mathrm{d}x_1.$$

For this multiple definite integral, the limits of integration of every order are finite constants, and the order of $x_1, \ldots, x_n$ makes no difference to the integrand, so changing the order of integration does not change the value of the definite integral. Therefore, the whole integral has the invariance property. Because Equation (8) is a special (discrete) case of Equation (9), the scoring function is proven to have the invariance property.
3.4. BiLSTM-Based Direction Prediction Module
In the directed hyperlink prediction problem, the embedding of each entity further determines the existence of a hyperlink and its direction. However, different from existence prediction, the direction of a hyperlink emphasizes the order of the entities. For example, in the relational knowledge “{WDC, Washington D.C.} → {USA, The United States}”, the direction goes from WDC and Washington D.C. (the head entities) to USA and The United States (the tail entities). Once an entity is placed in the wrong component, the relation might not even exist. In addition, the interaction between the two components, e.g., the conservation of materials in a chemical reaction, indicates that the model cannot determine the components individually. Therefore, we apply BiLSTM in our module to encode all entities sequentially so that information passes both forward and backward.
As shown in Figure 4, the BiLSTM consists of several LSTM hidden layers, divided into two groups that run end-to-end in opposite directions. The entities' embeddings in the hyperlink are processed by the hidden layer at the corresponding position one by one: the state of the previous hidden layer is combined in the next hidden layer with the embedding of the entity fed into that layer. After all hidden layers have been computed, an embedding containing all the sequential information is generated. The same process occurs in the backward group of hidden layers, which means we obtain two embeddings of the hyperlink. We concatenate them into one vector and then send it to a Softmax layer to obtain the direction score. The specific expression of the process is as follows:

$$\mathbf{h}_t = \overrightarrow{\mathbf{h}_t} \oplus \overleftarrow{\mathbf{h}_t}, \qquad \overrightarrow{\mathbf{h}_t} = \overrightarrow{\mathrm{LSTM}}\big(\mathbf{x}_t, \overrightarrow{\mathbf{h}_{t-1}}\big), \qquad \overleftarrow{\mathbf{h}_t} = \overleftarrow{\mathrm{LSTM}}\big(\mathbf{x}_t, \overleftarrow{\mathbf{h}_{t+1}}\big),$$

where $\mathbf{h}_t$ denotes the concatenated embedding of the sequential representation, $\overrightarrow{\mathbf{h}_t}$ and $\overleftarrow{\mathbf{h}_t}$ are calculated by the two hidden layers in opposite directions, $\mathbf{x}_t$ denotes the embedding of the $t$th entity, and the symbol $\oplus$ denotes the concatenation operation.
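A minimal PyTorch sketch of the module described above, with our own class and parameter names: the entity sequence is encoded bidirectionally, the final forward and backward states are concatenated, and a Softmax layer yields the direction score.

```python
import torch
import torch.nn as nn

class DirectionScorer(nn.Module):
    """BiLSTM over the entity sequence, Softmax over the concatenated states."""
    def __init__(self, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, 2)   # direction correct vs. wrong

    def forward(self, seq):                       # seq: (batch, n_entities, emb_dim)
        h, _ = self.bilstm(seq)
        fwd = h[:, -1, :self.hidden_dim]          # final state of the forward pass
        bwd = h[:, 0, self.hidden_dim:]           # final state of the backward pass
        return torch.softmax(self.out(torch.cat([fwd, bwd], dim=-1)), dim=-1)

scorer = DirectionScorer(emb_dim=64, hidden_dim=32)
print(scorer(torch.randn(8, 5, 64)).shape)        # torch.Size([8, 2])
```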
As the inner order of the entities in one component does not change the elements, it also has no effect on the direction; e.g., shuffling the head entities of the example above (“WDC, Washington D.C.” vs. “Washington D.C., WDC”) yields the same relational knowledge. However, the two orderings might be regarded as two different instances when fed into a BiLSTM that attends only to the specific sequence. In other words, if the former is annotated as a positive instance, the BiLSTM cannot naturally and directly determine the correctness of the latter without other guidance. Therefore, we guide the BiLSTM to focus on the order of the two components and to ignore the order of entities within the same component through a data shuffling strategy; see the sketch below. Specifically, we maintain the order of the two components and randomly shuffle the entities within each component. The number of generated instances depends on how many entities each component owns: a hyperlink whose head and tail contain $|h|$ and $|t|$ entities yields $|h|! \times |t|!$ different sequences. We then give all generated instances the correct label to enforce the BiLSTM to exploit the features of the direction. This strategy enlarges the data scale without introducing external manual effort, which also contributes to tackling the low-data problem.
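The shuffling strategy itself reduces to a few lines. A sketch with illustrative names, enumerating the label-preserving sequences of a hyperlink:

```python
import itertools
import random

def shuffle_augment(head, tail, k=None):
    """All (or k random) sequences with heads before tails but each part permuted."""
    seqs = [list(h) + list(t)
            for h in itertools.permutations(head)
            for t in itertools.permutations(tail)]
    return random.sample(seqs, k) if k else seqs

# The example above: 2 head entities and 2 tail entities give 2! * 2! = 4 sequences
augmented = shuffle_augment(["WDC", "Washington D.C."],
                            ["USA", "The United States"])
print(len(augmented))   # 4
```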
3.5. Training
TF-DHP is a pipeline model: we predict the hyperlink's existence in the first stage and judge the direction of the hyperlink in the second stage. If we train the first stage separately on undirected hypergraph data, we obtain a model that performs link prediction on undirected hypergraphs. If the whole model is trained on directed hypergraph data, the trained model can predict directed hyperlinks.
TF-DHP is trained in two stages, consistent with the framework. The training goal of the first stage is to give existing hyperlinks higher scores while decreasing the scores of entity sets that cannot comprise a hyperlink. With the initial embeddings of entities and their labels as input, we use the Tucker decomposition-based scoring function to obtain the two kinds of scores, and a binary cross-entropy loss function is designed to maximize their gap.
After the first stage is trained, we acquire the updated core tensor and embeddings and use these embeddings to initialize the second stage. Two kinds of scores are calculated by the BiLSTM: the score of the correct direction and the score of the wrong direction. The specific expression of the loss function is as follows:

$$\mathcal{L} = -\mathrm{mean}\big( \log \sigma(s^{+}) + \log\big(1 - \sigma(s^{-})\big) \big),$$

where $\mathrm{mean}(\cdot)$ denotes an average function, $\sigma(\cdot)$ denotes the sigmoid function, $s^{-}$ denotes the score of each negative hyperlink, and $s^{+}$ denotes the score of each positive hyperlink. Finally, the BiLSTM-based model updates the model parameters and the embeddings of entities and relations based on the loss gradients.
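Under the binary cross-entropy reading above, the second-stage objective can be sketched as follows (our variable names; the scores come from the BiLSTM module):

```python
import torch

def direction_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Push sigmoid(positive) toward 1 and sigmoid(negative) toward 0."""
    eps = 1e-8
    pos = torch.sigmoid(pos_scores)   # scores of correct directions
    neg = torch.sigmoid(neg_scores)   # scores of wrong directions
    return -(torch.log(pos + eps) + torch.log(1.0 - neg + eps)).mean()

print(direction_loss(torch.randn(16), torch.randn(16)).item())
```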