Article

Efficient Reachability Ratio Computation for 2-Hop Labeling Scheme

1
School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
2
School of Computer Science and Technology, Donghua University, Shanghai 201620, China
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(5), 1178; https://doi.org/10.3390/electronics12051178
Submission received: 11 February 2023 / Revised: 27 February 2023 / Accepted: 27 February 2023 / Published: 28 February 2023
(This article belongs to the Special Issue Intelligent Analysis and Security Calculation of Multisource Data)

Abstract

Reachability query processing has been extensively studied during the past decades. Many approaches have followed the line of designing 2-hop labels to ensure acceleration. Considering that the index size of 2-hop labels cannot be bounded, researchers have proposed to use a part of the nodes to construct partial 2-hop labels (p2HLs) to cover as much reachability information as possible. We achieved better query performance using p2HLs with a limited index size and index construction time. However, the adoption of p2HLs was based on intuition, and the number of nodes used to generate p2HLs was fixed in advance blindly, without knowing its applicability. In this paper, we focused on the problem of efficiently computing a reachability ratio (RR) in order to obtain RR-aware p2HLs. Here, RR denotes the ratio of the number of reachable queries that can be answered by p2HLs to the total number of reachable queries involved in a given graph. Based on the RR, users can determine whether p2HLs should be used to answer the reachability queries for a given graph and how many nodes should be chosen to generate p2HLs. We discussed the difficulties of RR computation and proposed an incremental-partition algorithm for RR computation. Our rich experimental results showed that our algorithm could efficiently obtain the RR and the overall effects on query performance of different p2HLs. Based on the experimental results, we provide our findings on the use of p2HLs for a given graph for processing reachability queries.

1. Introduction

Reachability query processing is a fundamental graph operation that has been extensively studied in the literature. Given a directed graph, a reachability query $u \overset{?}{\rightsquigarrow} v$ asks whether there exists a directed path from node $u$ to node $v$. It can be used in the Semantic Web, online social networks, biological networks, ontologies, transportation networks, etc., to answer whether two nodes have a certain connection. It can also be used as a building block for answering structured queries, such as XQuery (https://www.w3.org/TR/2017/REC-xquery-31-20170321, accessed on 1 January 2020) or SPARQL (https://www.w3.org/TR/rdf-sparql-query, accessed on 1 January 2020). For applications where answering reachability queries is intensively involved, any substantial progress in query time could significantly affect the performance of all applications using it. Over the past decades, researchers have proposed many efficient labeling schemes to facilitate processing reachability queries. According to [1,2], the existing approaches can be classified into two categories: Label-Only and Label+G. Label-Only means that the index conveys the complete reachability information, and a given query $u \overset{?}{\rightsquigarrow} v$ can be answered by comparing the labels of $u$ and $v$, without graph traversal. Label+G means that the index covers partial reachability information, and we may need to perform graph traversal to answer a query.
For Label-Only, the state-of-the-art approaches in [3,4,5] are based on the 2-hop labeling scheme [6], and they generate 2-hop labels based on all the nodes to maintain the whole transitive closure (TC). The problem with Label-Only approaches is that the index size cannot be bounded with respect to the size of the input graph and that minimizing the size of the 2-hop labels is NP-hard [6], which makes index construction a difficult task in practice for dense graphs. For example, it was shown in [2] that when the number of nodes was 10 million and the average degree of the input graph was greater than 6, the state-of-the-art Label-Only approaches, such as TF [5], DL [3], and TOL [4], could not construct the index successfully because they exceeded the memory limit. Compared to Label-Only, the main advantage of Label+G approaches is that the index can be easily constructed, which means that Label+G approaches are indispensable when Label-Only approaches are not appropriate for the given graph. Because the index cannot cover all the reachability information, Label+G approaches usually use two types of labels for pruning in order to terminate the graph traversal early. The first is No-Label, which is used to determine whether a query is an unreachable query. The second is Yes-Label, which is used to determine whether a query is a reachable query; it includes the tree interval [7,8,9] and p2HLs, which are constructed based on a portion of, rather than all of, the nodes. In this paper, we refer to the nodes that are used to construct p2HLs as hop-nodes. It was shown in [2] that the state-of-the-art No-Label could correctly answer more than 95% of unreachable queries. However, this is not enough.
Although a workload with completely random queries would be heavily skewed towards unreachable queries [9,10], in real scenarios, it is highly unlikely that most queries are unreachable, as the node pair in a query tends to have a certain connection [11]. For Label+G approaches, therefore, processing reachable queries has been regarded as the worst-case scenario due to the need for graph traversal [7], and the query performance is dominated by the pruning power of Yes-Label as the number of reachable queries increases.
The pruning power of Yes-Label depends on the reachability ratio (RR), i.e., the ratio of the number of reachable queries that can be answered by Yes-Label to the size of the TC. The higher the RR, the larger the probability that a given reachable query can be answered using Yes-Label without graph traversal. Our statistics showed that even using five randomly generated tree intervals, as performed in [10], the RR was less than 10% for most graphs, resulting in poor query performance when the number of reachable queries increased [7]. As a comparison, it was shown in [7] that p2HLs could improve the query performance significantly for many graphs. For example, Figure 1 shows the RR of p2HLs on three graphs, from which we found that even if we constructed p2HLs using 1 hop-node, the RR was greater than 95% on web-uk, meaning the probability that a given reachable query $q$ could be answered by p2HLs without graph traversal was greater than 95%. In practice, however, the RR of p2HLs varies considerably across graphs, and the query performance is not always as efficient as expected [7]. As shown in Figure 1, the RR was close to 0 on patent, meaning the probability that $q$ could be answered by p2HLs was close to 0. In this case, using p2HLs results only in additional costs.
Considering that Label+G approaches are indispensable and that their query performance is mainly affected by the pruning power of the most important Yes-Label, i.e., p2HLs, the key problem to be solved is the following: should we use p2HLs for a given graph? The answer depends on the RR. For example, given the RR in Figure 1, we could decide to use p2HLs on web-uk but not on patent, as with the same number of hop-nodes, the RR is greater than 95% on web-uk and close to 0 on patent. Furthermore, once the RR is determined, we can further determine how many hop-nodes should be chosen to construct the p2HLs. For example, according to Figure 1, we could decide to use four hop-nodes for human, but for web-uk, one hop-node would be sufficient due to its larger RR and smaller index size.
To the best of our knowledge, this is the first work to address RR-aware p2HLs. Computing the RR for given p2HLs is not a simple task, as it involves two operations. One is computing the size of the TC, and the other is computing the exact number of reachable queries that can be answered by p2HLs, which we refer to as the coverage size. Considering that the TC size computation can be efficiently solved by the existing method in [12], the difficulty of RR computation lies in efficiently computing the coverage size. The naive way is to first generate p2HLs based on the selected $k$ hop-nodes and then obtain the coverage size by checking all the node pairs using p2HLs. In this way, the cost of coverage size computation is $O(k|V|^2)$, where $V$ is the set of nodes in the input graph, and it cannot scale to large graphs. Furthermore, if the RR is too small to fulfill the requirement, we may need to increase $k$ and repeat the above operation, which makes the RR computation even more expensive.
We propose computing the coverage size incrementally, so that when the value of $k$ changes, we can avoid the costly coverage size re-computation and support a more efficient RR computation. The basic concept is that, given the coverage size with respect to $k$ hop-nodes, when we compute the coverage size with respect to $k + 1$ hop-nodes, we do not compute it from scratch but only compute the increased coverage size. However, the increased coverage size cannot be easily computed. To obtain the increased coverage size with respect to the $(k + 1)$th hop-node $u$, we first traverse forward from $u$ to obtain the set of nodes $D_u$ that $u$ can reach, and then traverse from $u$ in reverse to obtain the set of nodes $A_u$ that can reach $u$. Given $A_u$ and $D_u$, we need to determine, for each pair of nodes $(a, d)$ with $a \in A_u$ and $d \in D_u$, whether the fact that $a$ can reach $d$ can already be determined by the current p2HLs without $u$. If so, the pair is already covered and should not be considered when computing the increased coverage size with respect to $u$. The cost of processing one hop-node $u$ is as high as $O(k|A_u||D_u|)$. Obviously, as the number of hop-nodes used for p2HLs construction increases, the cost can become unaffordable. For this problem, we propose dividing both $A_u$ and $D_u$ into sets of disjoint subsets based on an equivalence relationship (defined later), so that for each pair of subsets $A_1 \subseteq A_u$ and $D_1 \subseteq D_u$, we only need to test one reachability query, rather than $|A_1| \times |D_1|$ queries. The cost of the RR computation is, therefore, reduced significantly even when processing large graphs. We make the following contributions.
  • To the best of our knowledge, this was the first work to focus on RR-aware p2HLs.
  • We proposed a set of algorithms for RR computation. We showed that according to the properties of 2-hop labels, the two sets of nodes that could reach, and be reached by, a certain hop-node could each be divided into a set of disjoint subsets, so that the computation cost could be reduced significantly. We proved the correctness and efficiency of our approach.
  • We conducted rich experiments on real datasets. The experimental results showed that when compared with the baseline approach, our algorithm operated much more efficiently on RR computations. We also showed the overall query performance was affected by p2HLs with different numbers of hop-nodes, based on which we provided our findings as to whether and how to use p2HLs for a given graph for processing reachability queries.
The remainder of the paper is organized as follows. We discuss the preliminaries and the related work in Section 2. In Section 3, we provide the baseline algorithm for RR computations and propose the first incremental algorithm in Section 4. After that, we propose the optimized incremental algorithm in Section 5. We report our experimental results in Section 6 and conclude our paper in Section 7.

2. Background and Related Work

2.1. Preliminaries

Given a directed graph $\mathcal{G}$, we can construct a directed acyclic graph (DAG) $G$ from $\mathcal{G}$ in linear time [13] by coalescing each strongly connected component (SCC) of $\mathcal{G}$ into a node in $G$. Then, a reachability query on $\mathcal{G}$ can be answered equivalently on $G$. We follow the convention and assume that the input graph is a DAG $G = (V, E)$, where $V$ is the set of nodes and $E$ the set of edges. We define $in(u) = \{v \mid (v, u) \in E\}$ as the set of in-neighbors of $u$ in $G$ and $out(u) = \{v \mid (u, v) \in E\}$ as the set of out-neighbors of $u$. We use $u \rightsquigarrow v$ to denote that node $u$ can reach node $v$ in $G$. The transitive closure (TC) of node $u$ is denoted as $TC(u)$, which is the set of all nodes that $u$ can reach. The TC size of $G$ is $|TC(G)| = \sum_{u \in V} |TC(u)|$. We use $TC^{-1}(u)$ to denote the set of all nodes that can reach $u$.
Given a set of $k$ nodes $S_k \subseteq V$, we use $L_k$ to denote the 2-hop labels constructed based on the nodes of $S_k$, where each node in $S_k$ is called a hop-node. If $v \in TC(u)$ and $L_k$ can correctly affirm that $u \rightsquigarrow v$, we say that $L_k$ (or $S_k$) covers the reachable query $u \rightsquigarrow v$. Let $N_k$ be the number of distinct reachable queries that are covered by $L_k$; the RR $\alpha$ with respect to $S_k$ is defined by Equation (1).
$$\alpha = \frac{N_k}{|TC(G)|} \qquad (1)$$
Problem Statement: 
Given a DAG $G = (V, E)$, its TC size $|TC(G)|$, and a hop-node set $S_k \subseteq V$, return the RR of $S_k$.
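To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not one of the algorithms of this paper. It assumes that a reachable query $(u, v)$ with $u \neq v$ counts as covered when some hop-node lies on a path from $u$ to $v$ (endpoints included), and that $|TC(G)|$ counts ordered reachable pairs of distinct nodes. The function names and the toy graph used in the test are hypothetical.

```python
def reachable_from(adj, src):
    """Return the set of nodes reachable from src in a DAG (src excluded)."""
    seen, stack = set(), [src]
    while stack:
        u = stack.pop()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def brute_force_rr(adj, nodes, hop_nodes):
    """Return (N_k, |TC(G)|, RR) by enumerating every reachable pair.

    Assumption: (u, v) is covered if a hop-node h satisfies
    u == h or u reaches h, and h == v or h reaches v.
    """
    tc = {u: reachable_from(adj, u) for u in nodes}
    tc_size = sum(len(s) for s in tc.values())
    covered = 0
    for u in nodes:
        for v in tc[u]:
            if any((h == u or h in tc[u]) and (h == v or v in tc[h])
                   for h in hop_nodes):
                covered += 1
    return covered, tc_size, covered / tc_size
```

This enumeration materializes the whole transitive closure, so it is only useful as a reference on small graphs with at least one edge.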

2.2. Related Work

As no existing works addressed the RR computation, we only discussed existing works on processing reachability queries and the TC size computation.
Label-Only: 
The Label-Only algorithms [3,4,5,6,14] attempt to compress the TC in order to obtain a smaller index size and facilitate answering queries. Recent research includes TF [5], DL [3], PLL [14], and TOL [4]. The main concept is based on 2-hop labeling [6], where each node $u$ is assigned two labels: an in-label $L_{in}(u)$ and an out-label $L_{out}(u)$. The label $L_{in}(u)$ ($L_{out}(u)$) consists of a set of nodes $v$ that can reach (be reached by) $u$. Based on this labeling scheme, $u \overset{?}{\rightsquigarrow} v$ can be answered by computing $L_{out}(u) \cap L_{in}(v)$. If $L_{out}(u) \cap L_{in}(v) \neq \emptyset$, then $u \rightsquigarrow v$; otherwise, $u \not\rightsquigarrow v$.
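The label test described above can be sketched in a few lines of Python. The dictionary layout (node to set of hop-nodes) and the function name are assumptions made for illustration; with complete 2-hop labels a negative answer means unreachable, whereas with partial labels it only means that the labels cannot decide the query.

```python
def answered_by_labels(l_out, l_in, u, v):
    """True if L_out(u) and L_in(v) share at least one node."""
    return not l_out.get(u, frozenset()).isdisjoint(l_in.get(v, frozenset()))


# Hypothetical labels built from a single hop-node 2 on a toy DAG
# with edges 0->2, 1->2, 2->3, 2->4, 5->4.
l_out = {0: {2}, 1: {2}, 2: {2}}
l_in = {2: {2}, 3: {2}, 4: {2}}
```

Storing each label as a sorted array and merging the two arrays is a common alternative to hash sets when the labels are small.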
Considering that minimizing the 2-hop label size is NP-hard [6], Cohen et al. proposed a $(\log |V|)$-approximate solution. However, the index construction cost was $O(|V||E| \log(|V|^2/|E|))$, which made it difficult to scale to large graphs. Motivated by this, the subsequent works [3,4,5,14] discarded the approximation guarantee and focused on finding better ordering strategies to rank nodes in order to improve the efficiency of 2-hop label construction. In [5], TF folds the given DAG recursively based on the topological level (topo-level). Assuming that all nodes are sorted by certain ranking values, both DL [3] and PLL [14] adopt the same concept to compute 2-hop labels: in each iteration, they take a node $u$ and perform a forward (backward) breadth-first search (BFS) from $u$ to add $u$ to the in-label (out-label) of the nodes that $u$ can reach (that can reach $u$). The TOL [4] framework summarizes TF, DL, and PLL and computes 2-hop labels based on a newly proposed total order.
Recently, researchers proposed computing 2-hop labels in parallel [15,16] to accelerate the construction of 2-hop labels. However, the index size still could not be bounded with respect to the size of the input graph.
Label+G: 
The Label+G methods answer reachability queries using both Yes-Label and No-Label, which cover partial reachability information. The main advantage, when compared with Label-Only, is that the index size can be bounded. Recent Label+G approaches include GRAIL [9,10], FERRARI-G [7], FELINE [8], IP+ [1], and BFL+ [2]. For these approaches, No-Label is used to test whether the given query is an unreachable one, including topo-order [7,8], topo-level [7,8,9], graph interval [7,9,10], i.e., intervals covering all reachable nodes, IP label [1], and Bloom filter label [2]. The Yes-Label is used to test whether the given query is a reachable query, including the tree interval [7,8,9] and p2HLs [7,17].
It was shown by existing research that No-Label could prune most unreachable queries [2], though the performance of Yes-Label could fluctuate unpredictably [7,17]. Our statistics showed that the RR of the tree interval was usually less than 10%. As a comparison, the RR of p2HLs could be much higher even with very few hop-nodes, as shown by Figure 1. Therefore, adopting p2HLs could improve the query performance of the Label+G approaches significantly for some graphs [7,17]. However, for some other graphs, p2HLs may not work as efficiently as expected [7,17]. Therefore, when considering p2HLs for processing reachability queries, it is necessary to obtain the RR quickly in order to correctly determine whether p2HLs should be used and, furthermore, to decide the number of hop-nodes that should be chosen to construct p2HLs.
Other Reachability Approaches: 
Ref. [18] proposed an index-free approach to process reachability queries on dynamic graphs with an approximate answer. Refs. [19,20] discussed processing label-constrained reachability queries on large graphs. Ref. [21] adopted the concept of 2-hop coverage and proposed an index-based approach to answer span-reachability queries in large temporal graphs.
TC-Size Computation: 
Given a DAG $G = (V, E)$, a path $p = (v_0, v_1, \ldots, v_s)$ satisfies $\forall i \in [0, s-1]$, $(v_i, v_{i+1}) \in E$, where $s$ denotes the length of $p$. A path decomposition of $G$ is a partition $S_p = \{p_1, p_2, \ldots, p_k\}$ satisfying $V = \bigcup_{i \in [1, |S_p|]} p_i$ and $\forall i, j \in [1, |S_p|]$, $i \neq j$, $p_i \cap p_j = \emptyset$. PTR [22] proposed a linear greedy algorithm to obtain the path partition, based on which [12] proposed an efficient algorithm, buTC+, for TC size computation with time complexity $O(r|E|)$, where $r = |S_p|$. This approach processes the nodes of each path in a bottom-up manner to achieve high efficiency. It was inspired by the observation that for two nodes $u$ and $v$ in a path, if $(u, v) \in E$, then $TC(v) \subset TC(u)$. Therefore, by processing the nodes of a path bottom-up with the nodes marked carefully, the nodes in $TC(v)$ do not need to be revisited when computing $|TC(u)|$.
In addition, [4,23,24] proposed estimating the approximate TC size for all nodes in linear time $O(|V| + |E|)$.
Note that TC size computation is different from TC computation. The former addresses the size of $TC(v)$, i.e., $|TC(v)|$, while the latter returns all nodes in each $TC(v)$.
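For small graphs, $|TC(u)|$ can be obtained with a much simpler technique than buTC+. The sketch below is not buTC+: it is a plain bottom-up pass over a reverse topological order that stores each $TC(u)$ as a Python integer bitset, relying on the same observation that $TC(v) \subset TC(u)$ whenever $(u, v) \in E$. It materializes the closure and then counts it, so it does not have the $O(r|E|)$ cost stated above. The function name is hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def tc_sizes(adj, nodes):
    """Return {u: |TC(u)|} for a DAG given as node -> out-neighbors."""
    nodes = list(nodes)
    index = {u: i for i, u in enumerate(nodes)}
    # TopologicalSorter expects node -> predecessors. Passing out-neighbors
    # as "predecessors" yields every node after all of its out-neighbors.
    order = TopologicalSorter({u: adj.get(u, ()) for u in nodes}).static_order()
    reach = {}
    for u in order:
        bits = 0
        for w in adj.get(u, ()):
            bits |= reach[w] | (1 << index[w])  # TC(w) and w belong to TC(u)
        reach[u] = bits
    return {u: bin(reach[u]).count("1") for u in nodes}
```

Summing the returned values gives $|TC(G)|$ as defined in Section 2.1.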

3. The Baseline Algorithm

In this section, we first analyze the construction of 2-hop labels and then discuss the RR computation and the baseline algorithm.
Step-1: 2-hop Label Construction. 
To construct 2-hop labels, we follow [3,14] and sort all nodes by $(|in(u)| + 1) \times (|in(u)| + 1)$. The sorting result is $v_1, v_2, \ldots, v_{|V|}$, where the first node has the largest rank value. We select the first $k$ nodes to obtain the hop-node set $S_k = \{v_1, v_2, \ldots, v_k\}$. We have the following relations (Equations (2) and (3)).
$$\emptyset = S_0 \subset S_1 \subset S_2 \subset \cdots \subset S_{|V|} = V \qquad (2)$$
$$S_i \setminus S_{i-1} = \{v_i\}, \text{ where } 0 < i \leq |V| \qquad (3)$$
Given $S_i$, the p2HLs $L_i$ can be generated by processing $v_i$ on top of $L_{i-1}$, according to Equations (2) and (3). Specifically, we perform a forward (backward) BFS from $v_i$ to obtain the set $D_i$ ($A_i$) of nodes that $v_i$ can reach (that can reach $v_i$), as shown in Figure 2a. For each node $a \in A_i$, we add $v_i$ to $a$'s out-label, i.e., $L_{out}^{i}(a) = L_{out}^{i-1}(a) \cup \{v_i\}$, denoting that $a$ can reach $v_i$. For each node $d \in D_i$, we add $v_i$ to $d$'s in-label, i.e., $L_{in}^{i}(d) = L_{in}^{i-1}(d) \cup \{v_i\}$, denoting that $v_i$ can reach $d$. After processing $v_i$, we obtain the 2-hop labels $L_i$. The superscript $i$ in $L_{out}^{i}(a)$ ($L_{in}^{i}(a)$) denotes that both 2-hop labels of node $a$ are subsets of $S_i$. Here, we can use $L_{i-1}$ to reduce the size of both $A_i$ and $D_i$. For example, in Figure 2c, when processing $v_i$, the backward BFS from $v_i$ can be terminated at $v_{i-1}$, because the fact that $v_{i-1}$ reaches $v_i$ can be determined by $L_{i-1}$, and the fact that any $a \in TC^{-1}(v_{i-1})$ reaches $v_i$ through $v_{i-1}$ can also be answered by $L_{i-1}$. Therefore, in practice, $A_i \subseteq TC^{-1}(v_i) \cup \{v_i\}$ and $D_i \subseteq TC(v_i) \cup \{v_i\}$.
Example 1.
Consider $G$ in Figure 3. Assume that the sorting result is $v_1, v_2, v_3, \ldots, v_{15}$. To obtain $L_2$, we first perform a bidirectional BFS from $v_1$ to obtain $A_1 = \{v_1, v_4, v_6, v_{11}\}$ and $D_1 = \{v_1, v_2, v_7, v_9, v_{10}, v_{13}, v_{15}\}$. After that, we add 1 to the out-label of the nodes in $A_1$ and to the in-label of the nodes in $D_1$ in order to obtain $L_1$. The next processed node is $v_2$. Similarly, we obtain $A_2 = \{v_2, v_3, v_5, v_{12}\}$ and $D_2 = \{v_2, v_{10}, v_{13}, v_{15}\}$. Note that all nodes in $A_1$ can reach $v_2$, but some of them are not in $A_2$, because for nodes that are in $A_1$ but not in $A_2$, whether they can reach $v_2$ can be answered by $L_1$. Then, we add 2 to the out-labels of the nodes in $A_2$ and to the in-labels of the nodes in $D_2$, as shown in Table 1.
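Step-1 can be sketched as follows, under stated assumptions. The hop-node order is taken as an input rather than computed from the rank value, the labels of $L_i$ are written only after both BFS traversals of $v_i$ have finished (so every pruning test reads $L_{i-1}$), and all names are hypothetical. The toy graph in the test is not the graph of Figure 3.

```python
from collections import deque


def reverse_graph(out_adj):
    """Build node -> in-neighbors from node -> out-neighbors."""
    in_adj = {}
    for u, ws in out_adj.items():
        for w in ws:
            in_adj.setdefault(w, []).append(u)
    return in_adj


def pruned_bfs(neighbors, start, is_covered):
    """BFS from start; a visited node whose relationship with the hop-node
    is already covered is neither collected nor expanded."""
    found, seen, queue = [], {start}, deque([start])
    while queue:
        v = queue.popleft()
        if is_covered(v):
            continue
        found.append(v)
        for w in neighbors.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return found


def build_p2hl(out_adj, nodes, hop_order):
    """Return (L_out, L_in, [(A_i, D_i), ...]) for the given hop-node order."""
    in_adj = reverse_graph(out_adj)
    l_out = {v: set() for v in nodes}
    l_in = {v: set() for v in nodes}
    sets = []
    for h in hop_order:
        # Both tests use the labels of the previous hop-nodes only.
        d_i = pruned_bfs(out_adj, h, lambda v: not l_out[h].isdisjoint(l_in[v]))
        a_i = pruned_bfs(in_adj, h, lambda v: not l_out[v].isdisjoint(l_in[h]))
        for d in d_i:
            l_in[d].add(h)   # h can reach d
        for a in a_i:
            l_out[a].add(h)  # a can reach h
        sets.append((set(a_i), set(d_i)))
    return l_out, l_in, sets
```

On the toy graph of the test, the backward BFS from the second hop-node stops at node 2, which mirrors the early termination described for Figure 2c.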
Step-2: RR Computation. 
Given $L_k$ with respect to $S_k$, the baseline approach computes the RR as follows. First, it obtains the set $A$ of nodes that can reach at least one of the hop-nodes, as shown by Equation (4). Second, it obtains the set $D$ of nodes that can be reached by at least one of the hop-nodes, as shown by Equation (5). Third, it computes the number of reachable queries that can be answered by $L_k$, as shown by Equation (6). Finally, it returns the RR of $S_k$ based on Equation (1).
$$A = \bigcup_{i \in [1, k]} A_i \qquad (4)$$
$$D = \bigcup_{i \in [1, k]} D_i \qquad (5)$$
$$N_k = |\{(a, d) \mid a \in A, d \in D, a \neq d \wedge L_{out}^{k}(a) \cap L_{in}^{k}(d) \neq \emptyset\}| \qquad (6)$$
Example 2.
Continuing Example 1, to compute the RR of $S_2 = \{v_1, v_2\}$, we first compute $A = A_1 \cup A_2 = \{v_1, v_2, v_3, v_4, v_5, v_6, v_{11}, v_{12}\}$ and $D = D_1 \cup D_2 = \{v_1, v_2, v_7, v_9, v_{10}, v_{13}, v_{15}\}$, according to Equations (4) and (5), respectively. Then, we check for each pair of nodes $a \in A$ and $d \in D$ ($a \neq d$) whether $a$ reaching $d$ can be answered by $L_2$. The number of answered reachable queries according to Equation (6) is 42 for $G$ in Figure 3 and $L_2$ in Table 1. Given $|TC(G)| = 70$, the RR of $S_2$ is $42/70 = 60\%$.
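The counting step of Equation (6) amounts to a double loop over $A$ and $D$. The sketch below assumes the labels are stored as dictionaries from node to set of hop-nodes; the function name and the hard-coded labels (two hop-nodes, 2 and 4, on a toy DAG with edges 0->2, 1->2, 2->3, 2->4, 5->4) are hypothetical and unrelated to Figure 3.

```python
def baseline_count(l_out, l_in, a_nodes, d_nodes):
    """Return N_k: pairs (a, d), a != d, whose labels intersect."""
    return sum(
        1
        for a in a_nodes
        for d in d_nodes
        if a != d and not l_out[a].isdisjoint(l_in[d])
    )


# Labels L_2 of the toy DAG for the hop-node set {2, 4}.
l_out = {0: {2}, 1: {2}, 2: {2}, 3: set(), 4: {4}, 5: {4}}
l_in = {0: set(), 1: set(), 2: {2}, 3: {2}, 4: {2, 4}, 5: set()}
A = {0, 1, 2, 4, 5}   # union of the A_i
D = {2, 3, 4}         # union of the D_i
tc_size = 9           # |TC(G)| of the toy DAG
rr = baseline_count(l_out, l_in, A, D) / tc_size
```

Every pair in $A \times D$ is tested regardless of which hop-node contributed it, which is the source of the $O(k|A||D|)$ term in the analysis of Algorithm 1.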
The Algorithm: 
The baseline algorithm (Algorithm 1) computes the RR in two steps. Step-1 (lines 1–18) first calls the buTC+ algorithm [12] to compute the TC size in line 1, and then constructs the 2-hop labels $L_k$ and obtains the two sets of nodes, $A$ and $D$. Specifically, it first sorts all nodes in line 3 and then selects $k$ hop-nodes in line 4. In lines 5–16, it performs a forward and a backward BFS from each hop-node $v_i$ to construct the 2-hop labels $L_i$. During the process, if the reachability relationship between $v_i$ and the visited node $v$ cannot be answered by the 2-hop labels $L_{i-1}$, it adds $v_i$ to $v$'s in-label (line 9) or out-label (line 14), and adds $v$ to $D_i$ (line 10) or $A_i$ (line 15); otherwise, it stops expanding from $v$, because the reachability relationship has already been covered by $L_{i-1}$. In lines 17–18, it obtains the two sets $A$ and $D$ according to Equations (4) and (5). Step-2 (lines 19–21) computes the number of reachable queries covered by $L_k$ according to Equation (6). Finally, it returns the RR in line 22.
Algorithm 1: blRR($G$, $k$)
1 compute $|TC(G)|$ by the buTC+ algorithm [12]
2 $N_k \leftarrow 0$, $S_k \leftarrow \emptyset$, $A \leftarrow \emptyset$, $D \leftarrow \emptyset$
3 rank all nodes $v$ in $G$ by $(|in(v)| + 1) \times (|in(v)| + 1)$
4 put the first $k$ nodes into $S_k$ as hop-nodes
5 foreach ($v_i \in S_k$) do
6     $A_i \leftarrow \emptyset$; $D_i \leftarrow \emptyset$
7     perform forward BFS from $v_i$, and for each visited $v$
8         if ($L_{out}^{i-1}(v_i) \cap L_{in}^{i-1}(v) = \emptyset$) then    /* $L_{i-1}$ */
9             $L_{in}^{i}(v) \leftarrow L_{in}^{i-1}(v) \cup \{v_i\}$    /* compute $L_i$ */
10            $D_i \leftarrow D_i \cup \{v\}$
11        else stop expansion from $v$
12    perform backward BFS from $v_i$, and for each visited $v$
13        if ($L_{out}^{i-1}(v) \cap L_{in}^{i-1}(v_i) = \emptyset$) then    /* $L_{i-1}$ */
14            $L_{out}^{i}(v) \leftarrow L_{out}^{i-1}(v) \cup \{v_i\}$    /* compute $L_i$ */
15            $A_i \leftarrow A_i \cup \{v\}$
16        else stop expansion from $v$
17 $A = \bigcup_{i \in [1, k]} A_i$
18 $D = \bigcup_{i \in [1, k]} D_i$
19 foreach ($a \in A$, $d \in D$, $a \neq d$) do
20    if ($L_{out}^{k}(a) \cap L_{in}^{k}(d) \neq \emptyset$) then    /* $L_k$ */
21        $N_k \leftarrow N_k + 1$
22 return $\alpha \leftarrow N_k / |TC(G)|$ as RR of $S_k$
Analysis: 
The time cost of line 1 is $O(r|E|)$ (see Section 2.2), and that of line 3 is $O(|V|)$ using counting sort. The time cost of the BFS traversals from each hop-node $v_i$ is $O(|V| + |E|)$ (lines 6–16). During the two BFS traversals, the cost of processing each visited node $v$ is $O(k)$ (lines 8 and 13). Therefore, the time cost of processing $k$ hop-nodes is $O(k^2(|V| + |E|))$. As a result, the time cost of Step-1 is $O(k^2(|V| + |E|) + r|E|)$. For Step-2, the time cost of computing $N_k$ is $O(k|A||D|)$. Therefore, the time complexity of Algorithm 1 is $O(k^2(|V| + |E|) + r|E| + k|A||D|)$.
During the processing, we do not need to actually maintain every $A_i$ and $D_i$; instead, we only need to maintain $A$ and $D$. Furthermore, we need to maintain the 2-hop labels with respect to the $k$ hop-nodes, so the space cost is $O(k|V|)$. As $S_k$, $A$, and $D$ are bounded by $V$, the space complexity of Algorithm 1 is $O(k|V|)$.
In practice, if the RR is too low to meet the requirement, we need to use additional hop-nodes. Algorithm 1 is then called once more to compute the new RR, and all reachability relationships tested for $S_k$ are tested again for the new hop-node set.
Example 3.
Continuing Example 2, given $A = \{v_1, v_2, v_3, v_4, v_5, v_6, v_{11}, v_{12}\}$ and $D = \{v_1, v_2, v_7, v_9, v_{10}, v_{13}, v_{15}\}$, we need to test 56 queries in lines 19–21, due to $|A| = 8$ and $|D| = 7$. By line 22, we know the RR is 60%. If the RR is required to be equal to or greater than 80%, we need to enlarge the hop-node set and re-compute the RR from scratch. As a result, the 56 queries tested for $S_2$ are tested again for the new hop-node set.

4. The Incremental Approach

Considering that Algorithm 1 would be called again when the hop-node set is enlarged, a natural question is: can we compute the RR incrementally? That is, given the RR with respect to $S_{i-1}$, when we decide to compute it with respect to $S_i = S_{i-1} \cup \{v_i\}$, we do not start from scratch; instead, we only compute the number of reachable queries that cannot be covered by $L_{i-1}$ but can be covered by $L_i$.
However, the increased RR cannot be easily computed. On the one hand, by constructing 2-hop labels using hop-node $v_i$, we capture three types of reachability relationships: (1) that $v_i$ can reach every node in $D_i \setminus \{v_i\}$ can be determined by the 2-hop labels with respect to $v_i$, and the number of covered reachable queries is $|D_i| - 1$; (2) that every node in $A_i \setminus \{v_i\}$ can reach $v_i$ can be determined by the 2-hop labels with respect to $v_i$, and the number of covered reachable queries is $|A_i| - 1$; and (3) that each node in $A_i \setminus \{v_i\}$ can reach every node in $D_i \setminus \{v_i\}$ can be determined by the 2-hop labels with respect to $v_i$, and the number of covered reachable queries is $(|A_i| - 1) \times (|D_i| - 1)$. Therefore, the number of queries covered by the 2-hop labels with respect to $v_i$ is $(|A_i| - 1) \times (|D_i| - 1) + (|A_i| - 1) + (|D_i| - 1) = |A_i| \times |D_i| - 1$.
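The closing identity follows from expanding the product. As a check, the second line instantiates it with the sizes $|A_1| = 4$ and $|D_1| = 7$ taken from Example 1.

```latex
\begin{align*}
(|A_i| - 1)(|D_i| - 1) + (|A_i| - 1) + (|D_i| - 1)
  &= |A_i||D_i| - |A_i| - |D_i| + 1 + |A_i| + |D_i| - 2
   = |A_i||D_i| - 1, \\
(4 - 1)(7 - 1) + (4 - 1) + (7 - 1)
  &= 18 + 3 + 6 = 27 = 4 \times 7 - 1 .
\end{align*}
```

The single subtracted pair is $(v_i, v_i)$, the only pair of $A_i \times D_i$ that is not a query between two distinct nodes.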
On the other hand, 2-hop labels with respect to different hop-nodes can cover the same reachable queries. For example, consider Figure 2b. After processing $v_{i-1}$, that every node $a_1 \in A_{i-1}$ can reach every node $d_1 \in D_{i-1}$ is covered by the 2-hop labels with respect to $v_{i-1}$, because $v_{i-1} \in L_{out}^{i-1}(a_1) \cap L_{in}^{i-1}(d_1)$. After processing $v_i$, that $a_1$ can reach $d_1$ is also covered by the 2-hop labels with respect to $v_i$, because $v_i \in L_{out}^{i}(a_1) \cap L_{in}^{i}(d_1)$. Then, the increased number of reachable queries with respect to $v_i$ can be computed by Equations (7) and (8), and the total number of reachable queries $N_k$ covered by $L_k$ can be computed by Equation (9).
$$n_i = |A_i| \times |D_i| - 1 - \lambda \qquad (7)$$
$$\lambda = |\{(a, d) \mid a \in A_i, d \in D_i, a \neq d, L_{out}^{i-1}(a) \cap L_{in}^{i-1}(d) \neq \emptyset\}| \qquad (8)$$
$$N_k = \sum_{i \in [1, k]} n_i \qquad (9)$$
Therefore, the intuitive approach is to first obtain the two sets of nodes, $A_i$ and $D_i$, and then test, for each pair of nodes $a \in A_i$ and $d \in D_i$, whether $a$ reaching $d$ can be answered by $L_{i-1}$. If so, $a \rightsquigarrow d$ has already been covered by $S_{i-1}$; otherwise, it is a newly covered query and needs to be counted, as shown by Algorithm 2.
In Algorithm 2, we compute the RR for each $S_i$ when $v_i$ ($i \in [1, k]$) is added into $S_{i-1}$. In lines 15–18, we compute the number of queries that are already covered by $L_{i-1}$. We obtain the increased number of queries in line 19 according to Equation (7). Then, we obtain the RR of $S_i$ in line 21. We compute $L_i$ based on $L_{i-1}$ in lines 22–25. Finally, we return the RR of $S_k$ in line 26.
Analysis: 
Algorithm 2 computes the two sets $A_i$ and $D_i$ in lines 2–14, and then it computes $L_i$ in lines 22–25. The overall cost of Step-1 is the same as that of Algorithm 1, i.e., $O(k^2(|V| + |E|) + r|E|)$. The cost of Step-2 for each hop-node is $O(i|A_i||D_i|)$. For $k$ hop-nodes, the cost is $O(\sum_{i \in [1, k]} i|A_i||D_i|)$. Therefore, the time complexity of Algorithm 2 is $O(k^2(|V| + |E|) + r|E| + \sum_{i \in [1, k]} i|A_i||D_i|)$. Similar to Algorithm 1, the space complexity of Algorithm 2 is $O(k|V|)$.
Example 4.
Consider $G$ in Figure 3. Assume that we want to construct the p2HLs $L_3$. The first processed node is $v_1$. As $A_1 = \{v_1, v_4, v_6, v_{11}\}$ and $D_1 = \{v_1, v_2, v_7, v_9, v_{10}, v_{13}, v_{15}\}$, we know $N_1 = n_1 = |A_1| \times |D_1| - 1 = 27$. The second processed node is $v_2$. The p2HLs are shown in Table 1, and $A_2 = \{v_2, v_3, v_5, v_{12}\}$, $D_2 = \{v_2, v_{10}, v_{13}, v_{15}\}$. Then, in lines 16–18, we need to test $|A_2| \times |D_2| = 16$ queries. As $\lambda = 0$, $n_2 = |A_2| \times |D_2| - 1 - 0 = 15$, and $N_2 = n_1 + n_2 = 27 + 15 = 42$. The third processed node is $v_3$, and $A_3 = \{v_3, v_4, v_5, v_6, v_{11}\}$, $D_3 = \{v_3, v_7, v_8, v_9, v_{14}\}$. Then, in lines 16–18, we need to test $|A_3| \times |D_3| = 25$ queries. As $\lambda = 6$, $n_3 = |A_3| \times |D_3| - 1 - 6 = 18$, and $N_3 = N_2 + n_3 = 42 + 18 = 60$. After processing $v_3$, we have $L_3$, as shown in Table 1, and the RR is $60/70 = 85.7\%$, obtained by testing $16 + 25 = 41$ queries in Algorithm 2.
As a comparison, for Algorithm 1 with $S_3$, $|A| = 8$ and $|D| = 10$, and we need to test 80 queries to obtain the RR.
Algorithm 2: incRR($G$, $k$)
1 compute $|TC(G)|$ by the buTC+ algorithm [12]
2 $N_0 \leftarrow 0$
3 rank all nodes $v$ in $G$ by $(|in(v)| + 1) \times (|out(v)| + 1)$
4 put the first $k$ nodes into $S_k$ as hop-nodes
5 foreach ($v_i \in S_k$) do
6     $A_i \leftarrow \emptyset$, $D_i \leftarrow \emptyset$
7     perform forward BFS from $v_i$, and for each visited $v$
8       if ($L_{out}^{i-1}(v_i) \cap L_{in}^{i-1}(v) = \emptyset$) then      /* $L_{i-1}$ */
9           $D_i \leftarrow D_i \cup \{v\}$
10       else stop expansion from $v$
11    perform backward BFS from $v_i$, and for each visited $v$
12       if ($L_{out}^{i-1}(v) \cap L_{in}^{i-1}(v_i) = \emptyset$) then      /* $L_{i-1}$ */
13           $A_i \leftarrow A_i \cup \{v\}$
14       else stop expansion from $v$
15    $\lambda \leftarrow 0$
16    foreach ($a \in A_i$, $d \in D_i$, $a \neq d$) do
17       if ($L_{out}^{i-1}(a) \cap L_{in}^{i-1}(d) \neq \emptyset$) then      /* $L_{i-1}$ */
18           $\lambda \leftarrow \lambda + 1$
19    $n_i \leftarrow |A_i| \times |D_i| - 1 - \lambda$
20    $N_i \leftarrow N_{i-1} + n_i$
21    $\alpha \leftarrow N_i / |TC(G)|$                      /* RR of $S_i$ */
22    foreach ($a \in A_i$) do                      /* compute $L_i$ */
23           $L_{out}^{i}(a) \leftarrow L_{out}^{i-1}(a) \cup \{v_i\}$
24    foreach ($d \in D_i$) do                      /* compute $L_i$ */
25           $L_{in}^{i}(d) \leftarrow L_{in}^{i-1}(d) \cup \{v_i\}$
26 return $\alpha$ as RR of $S_k$
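The incremental counting in Algorithm 2 can be sketched in Python as follows. This is a minimal illustration rather than the authors' C++ implementation: the function name `inc_coverage`, the dictionary-based graph representation, and the assumption that the hop-node order is passed in explicitly (instead of being ranked inside the function) are all choices made for this sketch. It returns the numerator $N_k$ only; dividing by $|TC(G)|$ gives the RR.

```python
from collections import deque

def inc_coverage(succ, hop_nodes):
    """Count reachable pairs (a, d), a != d, answerable by p2HLs built from
    hop_nodes (processed in the given order) on a DAG.
    succ maps every node to the list of its out-neighbours (hypothetical layout)."""
    pred = {v: [] for v in succ}
    for u, outs in succ.items():
        for v in outs:
            pred[v].append(u)
    l_out = {v: set() for v in succ}   # out-labels built so far (L_{i-1})
    l_in = {v: set() for v in succ}    # in-labels built so far (L_{i-1})

    def pruned_bfs(start, adj, covered):
        # BFS that neither keeps nor expands nodes already covered by L_{i-1}
        seen, queue, kept = {start}, deque([start]), []
        while queue:
            v = queue.popleft()
            if covered(v):
                continue
            kept.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        return kept

    total = 0
    for h in hop_nodes:
        d_i = pruned_bfs(h, succ, lambda v: l_out[h] & l_in[v])
        a_i = pruned_bfs(h, pred, lambda v: l_out[v] & l_in[h])
        # lambda: pairs of A_i x D_i already answered by L_{i-1}
        lam = sum(1 for a in a_i for d in d_i
                  if a != d and l_out[a] & l_in[d])
        total += len(a_i) * len(d_i) - 1 - lam
        for a in a_i:
            l_out[a].add(h)
        for d in d_i:
            l_in[d].add(h)
    return total
```

On the small DAG `{1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [], 6: []}` with hop-nodes `[4, 2]`, the sketch counts 12 covered pairs out of the 13 reachable pairs in the transitive closure; only the pair (1, 3) is not covered.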

5. The Incremental-Partition Approach

5.1. The Equivalence Relationship

Though Algorithm 2 did not need to compute the RR from scratch when the hop-node set became large, it still needed to test $|A_i| \times |D_i|$ queries in line 16 with cost $O(i|A_i||D_i|)$. Given a large hop-node set, the cost could still become unaffordable.
Definition 1. 
[Equivalence Relationship] Given a hop-node $v_i$, let $A_i$ be its ancestor set and $D_i$ its descendant set. We say that two nodes $a_1, a_2$ ($a_1 \neq a_2$) of $A_i$ are forward-equivalent to each other, denoted as $a_1 \equiv_F a_2$, if they have the same out-labels, i.e., $L_{out}^{i}(a_1) = L_{out}^{i}(a_2)$. Similarly, two nodes $d_1, d_2$ ($d_1 \neq d_2$) of $D_i$ are backward-equivalent to each other, denoted as $d_1 \equiv_B d_2$, if they have the same in-labels, i.e., $L_{in}^{i}(d_1) = L_{in}^{i}(d_2)$.
By Definition 1, we could obtain for $A_i$ a partition $\mathcal{A}(i) = \{A_i^1, A_i^2, \ldots, A_i^m\}$, which consists of a set of $m$ disjoint subsets satisfying that (1) $\forall l, j \in [1, m]$, $l \neq j$, $A_i^l \cap A_i^j = \emptyset$ and $\bigcup_{l \in [1,m]} A_i^l = A_i$; and (2) $\forall a_l, a_j$ belonging to the same subset, $a_l \equiv_F a_j$. For $D_i$, we also had a partition $\mathcal{D}(i) = \{D_i^1, D_i^2, \ldots, D_i^n\}$ satisfying that (1) $\forall l, j \in [1, n]$, $l \neq j$, $D_i^l \cap D_i^j = \emptyset$ and $\bigcup_{j \in [1,n]} D_i^j = D_i$; and (2) $\forall d_l, d_j$ belonging to the same subset, $d_l \equiv_B d_j$. We had the following result.
Theorem 1. 
Given a hop-node $v_i$, its ancestor set $A_i$, and its descendant set $D_i$, the number of tested queries for RR computation was $|\mathcal{A}(i)| \times |\mathcal{D}(i)|$, which was bounded by $\min\{|A_i| \times |D_i|, 4^{i-1}\}$.
Proof. 
Let $\mathcal{A}(i)$ ($\mathcal{D}(i)$) be the partition of $A_i$ ($D_i$) based on the equivalence relationship and $P_A(i)$ ($P_D(i)$) the partition of $V$ with respect to hop-node set $S_i$ and the forward (backward) equivalence relationship. Initially, $\mathcal{A}(1) = \{A_1\}$ ($\mathcal{D}(1) = \{D_1\}$), $P_A(1) = \{A_1, V \setminus A_1\}$ ($P_D(1) = \{D_1, V \setminus D_1\}$).
On the one hand, for each subset $A_i^l \in \mathcal{A}(i)$, testing the reachability relationship from any node of $A_i^l$ to any other node gave the same result; thus, we only needed to select an arbitrary node as the representative node of $A_i^l$ to perform the testing of the reachability relationship. Similarly, for each subset $D_i^j \in \mathcal{D}(i)$, we could also select an arbitrary node as the representative node of $D_i^j$ to test the reachability relationships from any node to the nodes of $D_i^j$. As a result, the number of tested queries from the nodes of $A_i$ to the nodes of $D_i$ was $|\mathcal{A}(i)| \times |\mathcal{D}(i)|$. Since $\mathcal{A}(i)$ ($\mathcal{D}(i)$) was the partition of $A_i$ ($D_i$), we knew that $|\mathcal{A}(i)| \times |\mathcal{D}(i)| \leq |A_i| \times |D_i|$.
On the other hand, given the partition $P_A(i-1)$ ($P_D(i-1)$) of $V$, the size of $P_A(i)$ ($P_D(i)$) was, at most, twice the size of $P_A(i-1)$ ($P_D(i-1)$), because all nodes in each subset of $P_A(i-1)$ ($P_D(i-1)$) could be further divided into, at most, two disjoint subsets. One consisted of nodes that could reach (or be reached by) $v_i$, and the other contained nodes that could not reach (or be reached by) $v_i$. Then, the size of $P_A(i)$ ($P_D(i)$) was bounded by $2^{i}$, and the size of $\mathcal{A}(i)$ ($\mathcal{D}(i)$) was bounded by $2^{i-1}$. Therefore, the number of tested reachability queries was bounded by $2^{i-1} \times 2^{i-1} = 4^{i-1}$.
Hence, the number of tested queries for RR with respect to hop-node $v_i$ was bounded by $\min\{|A_i| \times |D_i|, 4^{i-1}\}$.    □
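As a quick check of this bound on the running example (values taken from Examples 4 and 6; here $\mathcal{A}(3)$ and $\mathcal{D}(3)$ denote the partitions of $A_3$ and $D_3$), the third hop-node $v_3$ has $i = 3$ and $|A_3| = |D_3| = 5$, so the bound evaluates as:

```latex
\min\{|A_3| \times |D_3|,\; 4^{\,i-1}\} = \min\{5 \times 5,\; 4^{2}\} = \min\{25,\, 16\} = 16,
\qquad
|\mathcal{A}(3)| \times |\mathcal{D}(3)| = 2 \times 2 = 4 \leq 16 .
```

In this small example, the number of queries actually tested (4) is well below both terms of the bound.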
According to Theorem 1, we could reduce the number of tested reachability queries when processing hop-node $v_i$. As shown by Equation (10), for each pair of subsets ($A_i^l \in \mathcal{A}(i)$, $D_i^j \in \mathcal{D}(i)$), we only needed to review the reachability relationship between their representative nodes $a \in A_i^l$ and $d \in D_i^j$. To accomplish this, we first determined the partitions of both $A_i$ and $D_i$, according to their equivalence relationship.
$$\lambda = \sum_{\substack{a \in A_i^l \in \mathcal{A}(i),\; d \in D_i^j \in \mathcal{D}(i) \\ L_{out}^{i-1}(a) \cap L_{in}^{i-1}(d) \neq \emptyset}} |A_i^l| \times |D_i^j| \qquad (10)$$
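Equation (10) can be illustrated with a short Python sketch. The function name `covered_pairs` and the representation of each partition as a list of (representative, subset size) pairs are assumptions made here for illustration; the labels in the usage note are derived from the sets $A_1$, $A_2$, $D_1$, and $D_2$ of Example 4.

```python
def covered_pairs(a_parts, d_parts, l_out, l_in):
    """Evaluate lambda as in Equation (10).
    a_parts / d_parts: lists of (representative, subset_size) pairs (hypothetical layout).
    l_out / l_in: dicts mapping a node to its out-/in-label set under L_{i-1}."""
    lam = 0
    for rep_a, size_a in a_parts:
        for rep_d, size_d in d_parts:
            # one label intersection decides all size_a * size_d pairs at once
            if l_out[rep_a] & l_in[rep_d]:
                lam += size_a * size_d
    return lam
```

For the third hop-node of the running example, calling `covered_pairs([("v3", 2), ("v4", 3)], [("v3", 3), ("v7", 2)], {"v3": {"v2"}, "v4": {"v1"}}, {"v3": set(), "v7": {"v1"}})` performs four label intersections and yields $\lambda = 6$.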

5.2. Partitions Computation

To obtain the partitions of $A_i$ and $D_i$, the intuitive approach was to sort all nodes in $A_i$ ($D_i$) by comparing their out-labels (in-labels) in lexicographic order. After the sorting operation, all equivalent nodes were clustered together. As the size of each label was bounded by $i$, the cost of computing the partition $\mathcal{A}(i)$ ($\mathcal{D}(i)$) of $A_i$ ($D_i$) was $O(i \times |A_i| \times \log |A_i|)$ ($O(i \times |D_i| \times \log |D_i|)$).
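The sorting-based approach can be sketched in a few lines of Python. The function name `partition_by_label` and the dictionary-of-sets label layout are assumptions for this illustration only.

```python
from itertools import groupby

def partition_by_label(nodes, label):
    """Cluster nodes whose labels are identical.
    label maps each node to a set of hop-node IDs (hypothetical layout)."""
    def key(v):
        return tuple(sorted(label[v]))       # canonical form of the label
    ordered = sorted(nodes, key=key)         # lexicographic order on labels
    return [list(group) for _, group in groupby(ordered, key=key)]
```

With `label = {"a": {1, 2}, "b": {2}, "c": {2, 1}, "d": {2}}`, the call `partition_by_label(["a", "b", "c", "d"], label)` returns `[["a", "c"], ["b", "d"]]`: equivalent nodes end up adjacent after sorting and are grouped in one pass.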
Let $P_A(i)$ be the partition of $V$ with respect to hop-node set $S_i$ and the forward equivalence relationship, and $P_D(i)$ the partition of $V$ with respect to $S_i$ and the backward equivalence relationship. We obtained the following result.
Theorem 2. 
Given the hop-node $v_i$ and its ancestor (descendant) set $A_i$ ($D_i$), two nodes $v_1, v_2 \in A_i$ ($D_i$) satisfy $v_1 \equiv_F v_2$ ($v_1 \equiv_B v_2$) if they belong to the same subset of $P_A(i-1)$ ($P_D(i-1)$).
Proof. 
We proved this result from two aspects. First, we proved the correctness when both $v_1$ and $v_2$ belonged to $A_i$ (Case-1), and then we proved the correctness when both $v_1$ and $v_2$ belonged to $D_i$ (Case-2).
Case-1, where $v_1, v_2 \in A_i$ and $v_i \in L_{out}^{i}(v_1) \cap L_{out}^{i}(v_2)$.
On the one hand, if $v_1 \equiv_F v_2$, it indicated that $L_{out}^{i}(v_1) = L_{out}^{i}(v_2)$ according to Definition 1. Hence, $L_{out}^{i}(v_1) \setminus \{v_i\} = L_{out}^{i}(v_2) \setminus \{v_i\}$, i.e., they belonged to the same subset of $P_A(i-1)$.
On the other hand, if both $v_1$ and $v_2$ belonged to the same subset of $P_A(i-1)$, it indicated that before processing hop-node $v_i$, $L_{out}^{i-1}(v_1) = L_{out}^{i-1}(v_2)$, according to the definition of $P_A(i-1)$. As $v_1, v_2 \in A_i$, we knew that after processing $v_i$, $v_i \in L_{out}^{i}(v_1) \cap L_{out}^{i}(v_2)$ and $L_{out}^{i}(v_1) = L_{out}^{i}(v_2)$ still held. According to Definition 1, $v_1 \equiv_F v_2$.
Therefore, we determined that $v_1 \equiv_F v_2$ if they belonged to the same subset of $P_A(i-1)$.
Case-2, where $v_1, v_2 \in D_i$ and $v_i \in L_{in}^{i}(v_1) \cap L_{in}^{i}(v_2)$.
Similar to the proof of Case-1, we knew that $v_1 \equiv_B v_2$ if they belonged to the same subset of $P_D(i-1)$.
Therefore, for $v_1, v_2 \in A_i$ ($D_i$), $v_1 \equiv_F v_2$ ($v_1 \equiv_B v_2$) if they belonged to the same subset of $P_A(i-1)$ ($P_D(i-1)$).    □
According to Theorem 2, we assigned each node $v$ two set IDs, denoted as $id_A(v)$ and $id_D(v)$, which were used to determine which subset it belonged to in $P_A(i)$ and $P_D(i)$, respectively. Then, given the ancestor (descendant) set $A_i$ ($D_i$) of $v_i$, we only needed to scan all nodes of $A_i$ ($D_i$) once to know immediately that for two nodes $v_1$ and $v_2$: if $id_A(v_1) = id_A(v_2)$ ($id_D(v_1) = id_D(v_2)$) in $P_A(i-1)$ ($P_D(i-1)$), then $v_1 \equiv_F v_2$ ($v_1 \equiv_B v_2$), and they definitely belonged to the same subset of $P_A(i)$ ($P_D(i)$). Therefore, $P_A(i)$ ($P_D(i)$) was a refinement of $P_A(i-1)$ ($P_D(i-1)$), i.e., each element of $P_A(i)$ ($P_D(i)$) was a subset of a unique element of $P_A(i-1)$ ($P_D(i-1)$).
Recall that when processing the hop-node $v_i$, we first had its ancestor (descendant) set $A_i$ ($D_i$) and then obtained the partition $\mathcal{A}(i)$ ($\mathcal{D}(i)$) of $A_i$ ($D_i$), based on the equivalence relationship. Since $P_A(i)$ ($P_D(i)$) was the partition of $V$ with respect to the equivalence relationship, we knew that $\mathcal{A}(i) \subseteq P_A(i)$ ($\mathcal{D}(i) \subseteq P_D(i)$), and the relationship between $P_A(i)$ ($P_D(i)$), $P_A(i-1)$ ($P_D(i-1)$), and $\mathcal{A}(i)$ ($\mathcal{D}(i)$) is shown in Equations (11) and (12).
$$P_A(i) = \{P \setminus A_i \mid P \in P_A(i-1)\} \cup \mathcal{A}(i) \qquad (11)$$
$$P_D(i) = \{P \setminus D_i \mid P \in P_D(i-1)\} \cup \mathcal{D}(i) \qquad (12)$$
When processing hop-node $v_i$, we used a hash table $H_A$ ($H_D$) to achieve linear-time complexity. Each element of $H_A$ ($H_D$) was a tuple $(id_o, e_n)$, denoting a subset $A_i^l$ ($D_i^l$) of $\mathcal{A}(i)$ ($\mathcal{D}(i)$), where $id_o$ was the old set ID shared by all nodes of $A_i^l$ ($D_i^l$) in $P_A(i-1)$ ($P_D(i-1)$), and $e_n = (id_n, v_1, s)$ was a triple denoting the new set ID for all nodes of $A_i^l$ ($D_i^l$), the representative node of $A_i^l$ ($D_i^l$), and the size of $A_i^l$ ($D_i^l$), respectively.
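The single-scan refinement with a hash table can be sketched as follows. This is an illustration under stated assumptions: the function name `refine`, the use of a Python dict as the hash table, and the choice of the first scanned node as representative are all choices made here, so the representatives may differ from those drawn in Figure 4.

```python
def refine(nodes, set_id, next_id):
    """Partition `nodes` (A_i or D_i) by their old set IDs in one scan.
    set_id is updated in place with the new IDs.
    Returns (table, next_id), where table[old_id] = [new_id, representative, size]."""
    table = {}
    for v in nodes:
        old = set_id[v]
        if old not in table:
            next_id += 1                     # allocate a fresh set ID
            table[old] = [next_id, v, 1]     # v becomes the representative
        else:
            table[old][2] += 1               # one more node in this subset
        set_id[v] = table[old][0]            # move v to its new subset
    return table, next_id
```

Starting with all set IDs equal to 0 and calling `refine` on $A_1 = \{v_1, v_4, v_6, v_{11}\}$, then $A_2 = \{v_2, v_3, v_5, v_{12}\}$, then $A_3 = \{v_3, v_4, v_5, v_6, v_{11}\}$, the third call produces two table entries of sizes 2 and 3, matching the two subsets $\{v_3, v_5\}$ and $\{v_4, v_6, v_{11}\}$.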
Example 5.
Consider $G$ in Figure 3. Before processing $v_1$, $P_A(0) = P_D(0) = \{V\}$, $\mathcal{A}(0) = \mathcal{D}(0) = \emptyset$, and for all nodes $v$, $id_A(v) = id_D(v) = 0$.
For the first node $v_1$, we have $A_1 = \{v_1, v_4, v_6, v_{11}\}$, $D_1 = \{v_1, v_2, v_7, v_9, v_{10}, v_{13}, v_{15}\}$. Since all nodes in $A_1$ ($D_1$) have the same $id_A(v)$ ($id_D(v)$), we know that $\mathcal{A}(1) = \{A_1\}$ and $\mathcal{D}(1) = \{D_1\}$. $P_A(1) = \{A_1, V \setminus A_1\}$, and $P_D(1) = \{D_1, V \setminus D_1\}$. In Table 2, the two columns under $v_1$ denote $P_A(1)$ and $P_D(1)$, where each 1 in the second (third) column corresponds to a node in $A_1$ ($D_1$). Figure 4a shows the two hash tables denoting $\mathcal{A}(1)$ and $\mathcal{D}(1)$, respectively. For $H_A$, there is one (key, value) pair, denoting that $\mathcal{A}(1)$ contains one subset $A_1$, and for all nodes in $A_1$, their set ID is 0 in $P_A(0)$. Therefore, they all belong to the same subset in $\mathcal{A}(1)$, i.e., $\mathcal{A}(1) = \{A_1\}$. By $H_A$ in Figure 4a, we know that all nodes in $A_1$ now have the new set ID 1, the representative node of $A_1$ is $v_4$, and $|A_1| = 4$.
For the second processed node $v_2$, $A_2 = \{v_2, v_3, v_5, v_{12}\}$, $D_2 = \{v_2, v_{10}, v_{13}, v_{14}\}$. As all nodes in $A_2$ have the same set ID 0 in $P_A(1)$, $\mathcal{A}(2) = \{A_2\}$. As shown by Figure 4b, the key is 0, and the triple $(2, v_3, 4)$ denotes that the new set ID for all nodes in $A_2$ is 2, the representative node is $v_3$, and $|A_2| = 4$. Similarly, as all nodes in $D_2$ have the same set ID 1 in $P_D(1)$, $\mathcal{D}(2) = \{D_2\}$, which is denoted by $H_D$ in Figure 4b.
For the third processed node $v_3$, $A_3 = \{v_3, v_4, v_5, v_6, v_{11}\}$, $D_3 = \{v_3, v_7, v_8, v_9, v_{14}\}$. As $v_3$ and $v_5$ have the same set ID 2 in $P_A(2)$, they form the first subset in $\mathcal{A}(3)$. Furthermore, as $v_4, v_6, v_{11}$ have the same set ID 1 in $P_A(2)$, they form the second subset in $\mathcal{A}(3)$. Therefore, $\mathcal{A}(3) = \{\{v_3, v_5\}, \{v_4, v_6, v_{11}\}\}$. Similarly, $\mathcal{D}(3) = \{\{v_3, v_8, v_{14}\}, \{v_7, v_9\}\}$. $\mathcal{A}(3)$ and $\mathcal{D}(3)$ are denoted by $H_A$ and $H_D$ in Figure 4c, respectively.
The Algorithm: 
As shown by Algorithm 3, for each hop-node $v_i$, we first performed forward and backward BFS to obtain $D_i$ (lines 6–15) and $A_i$ (lines 16–25). At the same time, we generated their partitions $\mathcal{D}(i)$ and $\mathcal{A}(i)$, for which each subset was recorded in $H_D$ and $H_A$, respectively. In lines 26–29, we computed $\lambda$ according to Equation (10), which was the number of reachable queries that were covered by $L_{i-1}$. In line 30, we obtained the number of reachable queries that could be covered by $L_i$ but not by $L_{i-1}$. After that, we had the total number of covered reachable queries in line 31 and the RR in line 33. Finally, we generated $L_i$ in lines 34–37 and returned the RR of $S_k$ in line 38.
It was worth noting that for Algorithm 3, the improvement was related not only to the coverage size computation, but also to the TC size computation. In line 32, we obtained an estimated TC size by the approach in [23], which operated in linear time $O(|V| + |E|)$ and was more efficient than the exact approach in [12]. Since the denominator in Equation (1) did not change for $\alpha_i$ ($i \in [1, k]$), we could achieve the same effect using an approximate TC size for the RR computation.
Algorithm 3: incRR+($G = (V, E)$, $k$, $TC(G)$)
1 $N_0 \leftarrow 0$; $n_A \leftarrow 0$; $n_D \leftarrow 0$;
2 rank all nodes $v$ in $G$ by $(|in(v)| + 1) \times (|out(v)| + 1)$
3 put the first $k$ nodes into $S_k$ as hop-nodes
4 foreach ($v_i \in S_k$) do
5     $A_i \leftarrow \emptyset$; $D_i \leftarrow \emptyset$; $H_A \leftarrow \emptyset$; $H_D \leftarrow \emptyset$
6     perform forward BFS from $v_i$, and for each visited $v$
7       if ($L_{out}^{i-1}(v_i) \cap L_{in}^{i-1}(v) = \emptyset$) then      /* $L_{i-1}$ */
8          if ($id_D(v) \notin H_D$) then
9              $n_D \leftarrow n_D + 1$
10              $H_D[id_D(v)] \leftarrow (n_D, v, 1)$
11          else
12              $H_D[id_D(v)].s \leftarrow H_D[id_D(v)].s + 1$
13          $D_i \leftarrow D_i \cup \{v\}$
14          $id_D(v) \leftarrow H_D[id_D(v)].id_n$
15       else stop expansion from $v$
16    perform backward BFS from $v_i$, and for each visited $v$
17       if ($L_{out}^{i-1}(v) \cap L_{in}^{i-1}(v_i) = \emptyset$) then      /* $L_{i-1}$ */
18          if ($id_A(v) \notin H_A$) then
19              $n_A \leftarrow n_A + 1$
20              $H_A[id_A(v)] \leftarrow (n_A, v, 1)$
21          else
22              $H_A[id_A(v)].s \leftarrow H_A[id_A(v)].s + 1$
23          $A_i \leftarrow A_i \cup \{v\}$
24          $id_A(v) \leftarrow H_A[id_A(v)].id_n$
25       else stop expansion from $v$
26    $\lambda \leftarrow 0$
27    foreach ($(id, a, s_A) \in H_A$, $(id, d, s_D) \in H_D$) do
28       if ($L_{out}^{i-1}(a) \cap L_{in}^{i-1}(d) \neq \emptyset$) then      /* $L_{i-1}$ */
29           $\lambda \leftarrow \lambda + s_A \times s_D$
30    $n_i \leftarrow |A_i| \times |D_i| - 1 - \lambda$
31    $N_i \leftarrow N_{i-1} + n_i$
32    estimate the TC size by Formula 3 in [23]
33    $\alpha \leftarrow N_i / |TC(G)|$                      /* RR of $S_i$ */
34    foreach ($a \in A_i$) do                      /* compute $L_i$ */
35        $L_{out}^{i}(a) \leftarrow L_{out}^{i-1}(a) \cup \{v_i\}$
36    foreach ($d \in D_i$) do                      /* compute $L_i$ */
37        $L_{in}^{i}(d) \leftarrow L_{in}^{i-1}(d) \cup \{v_i\}$
38 return $\alpha$ as RR of $S_k$
Example 6.
Consider $G$ in Figure 3. Assume that we want to compute the RR of $S_3 = \{v_1, v_2, v_3\}$.
For $v_1$, as there is no covered reachability relationship, we do not need to test any queries in line 28. As $|A_1| = 4$ and $|D_1| = 7$, $n_1 = |A_1| \times |D_1| - 1 = 27$ in line 30 of Algorithm 3.
For $v_2$, as $|\mathcal{A}(2)| = |\mathcal{D}(2)| = 1$, we only need to test one reachable query, i.e., whether the query asking if $v_3$ can reach $v_{10}$ can be answered by $L_1$. As $L_{out}^{1}(v_3) \cap L_{in}^{1}(v_{10}) = \emptyset$, we know that $n_2 = |A_2| \times |D_2| - 1 = 15$ in line 30 of Algorithm 3.
For $v_3$, as shown by Figure 4c, we know that $\mathcal{A}(3) = \{\{v_3, v_5\}, \{v_4, v_6, v_{11}\}\}$ and $\mathcal{D}(3) = \{\{v_3, v_8, v_{14}\}, \{v_7, v_9\}\}$. In line 28, we only need to test $|\mathcal{A}(3)| \times |\mathcal{D}(3)| = 2 \times 2 = 4$ reachable queries. As whether $v_4$ can reach $v_7$ can be answered by $L_2$, we know that whether all nodes in $\{v_4, v_6, v_{11}\}$ can reach every node in $\{v_7, v_9\}$ can be answered by $L_2$; thus, $\lambda = 6$ for $v_3$. In line 30, we know that $n_3 = |A_3| \times |D_3| - 1 - \lambda = 5 \times 5 - 1 - 6 = 18$.
Then, we know $N_3 = n_1 + n_2 + n_3 = 27 + 15 + 18 = 60$, and the RR is $\alpha = N_3 / |TC(G)| = 60/70 = 85.7\%$. During the process, the total number of reachability queries tested by Algorithm 3 is $0 + 1 + 4 = 5$.
As a comparison, the total number of tested queries is 41 for Algorithm 2 and 80 for Algorithm 1 to obtain the RR.
Analysis: 
As Algorithm 3 computed the approximate TC size in linear time, the time cost of Step-1 was $O(k^2(|V| + |E|))$. The difference was found in Step-2. The cost of Step-2 for each hop-node was $O(i|\mathcal{A}(i)||\mathcal{D}(i)|)$. For $k$ hop-nodes, the cost was $O(\sum_{i \in [1,k]} i|\mathcal{A}(i)||\mathcal{D}(i)|)$ for coverage computation. The time complexity of Algorithm 3 was, therefore, $O(k^2(|V| + |E|) + \sum_{i \in [1,k]} i|\mathcal{A}(i)||\mathcal{D}(i)|)$. The space complexity of Algorithm 3 was $O(k|V|)$.

6. Experiment

In this section, we show the experimental results of the RR computation. The compared algorithms included blRR, incRR, and incRR+. Moreover, we show the impact of p2HLs on processing reachability queries based on the state-of-the-art algorithms FL [8] and BFL+ [2], in terms of index size, index construction time, and query time. We implemented all algorithms in C++ and compiled them using G++ 6.2.0. All experiments were conducted on a desktop computer with an Intel(R) Core(TM) i5-1135G7 @ 2.4 GHz CPU, 16 GB memory, and Ubuntu 18.04.1 Linux OS. For algorithms that ran for more than 3 h or exceeded the memory limit (16 GB), we indicated their results as “–” in the tables.
Datasets: 
Table 3 shows the statistics of 18 real datasets, where the first 6 were small datasets (|V| ≤ 100,000) downloaded from the same web page (https://code.google.com/archive/p/grail/downloads, accessed on 1 January 2020). The following 12 datasets were large (|V| > 100,000). These datasets have commonly been used in recent works on processing reachability queries [1,2,3,4,5,7,8,9,14,23]. Among these datasets, human, anthra, agrocyc, ecoo, and vchocyc were graphs describing the genome and biochemical machinery of E. coli K-12 MG1655. The email dataset (http://snap.stanford.edu/data/index.html, accessed on 1 January 2020) is an email network. LJ is the online social network soc-LiveJournal1 (http://snap.stanford.edu/data/index.html, accessed on 1 January 2020). The web dataset is the web graph web-Google (https://code.google.com/p/ferrari-index/downloads/list, accessed on). In addition, arxiv, 10cit-Patent (http://pan.baidu.com/s/1bpHkFJx, accessed on 1 January 2020), 10citeseerx (http://pan.baidu.com/s/1bpHkFJx, accessed on 1 January 2020), 05cit-Patent (http://pan.baidu.com/s/1bpHkFJx, accessed on 1 January 2020), 05citeseerx (http://pan.baidu.com/s/1bpHkFJx, accessed on 1 January 2020), citeseerx (https://code.google.com/archive/p/grail/downloads, accessed on 1 January 2020), and patent (https://code.google.com/archive/p/grail/downloads, accessed on 1 January 2020) (cit-Patents) are all citation networks. The dbpedia dataset (http://pan.baidu.com/s/1c00Jq5E, accessed on 1 January 2020) is the knowledge graph Dbpedia. The twitter dataset (https://code.google.com/p/ferrari-index/downloads/list, accessed on 1 January 2020) is a DAG transformed from a large-scale social network with 55 million users and 1.96 billion edges [25]. The web-uk dataset (https://code.google.com/p/ferrari-index/downloads/list, accessed on 1 January 2020) is a DAG transformed from a web graph dataset with 133 million nodes and 5.5 billion edges. The statistics in Table 3 are those of the DAGs.

6.1. RR Computation

RR and Index Size: 
We show RR and the index size ratio (ISR) of these datasets in Figure 5, where ISR denotes the ratio of the size of p2HLs over the size of the 2-hop labels, with respect to all nodes. From Figure 5, we observed the following.
First, all datasets could be classified into three categories according to their RRs. The first (D1) included email, LJ, web, citeseerx, dbpedia, twitter, and web-uk, for which the RR was more than 99%, even when $k = 1$, and neither the RR nor the ISR changed significantly with the increase in $k$. The second type (D2) included human, anthra, agrocyc, ecoo, vchocyc, and arxiv, for which both the RR and ISR increased with the increase in $k$. The third type (D3) included 10cit-Patent, 10citeseerx, 05cit-Patent, 05citeseerx, and patent, for which both the RR and ISR were very small or even approached zero, and neither changed significantly with the increase in $k$. The value of $k$, therefore, only affected the second type of dataset, and the RR was more than 80% when $k \geq 16$ for all datasets of the second type, which indicated that reachability query processing could benefit from using p2HLs on datasets of both the first and second types.
Second, the storage space used to maintain p2HLs was small when compared with the RR value. For example, for the first type of dataset, we used about 1/4 of the storage space of the full 2-hop labels (ISR ≈ 25%) to maintain more than 99% (RR > 99%) of the reachability information.
Operational Time: 
Figure 6 shows the comparison of the operational time for the RR computation, from which we could see that incRR+ was much faster than both blRR and incRR on all datasets, and incRR operated faster than blRR on most datasets. For example, incRR+ was faster than blRR by more than two or three orders of magnitude on most datasets, and incRR was ten times faster than blRR on email, LJ, web, citeseerx, and dbpedia. According to the second-to-last column of Table 3, the average number of reachable nodes was usually high. Therefore, the number of reachability queries tested by blRR was significantly high. Although incRR could reduce the number of tested reachability queries, it still needed to test significantly more reachability queries than incRR+. Neither blRR nor incRR could obtain the value of the RR on twitter or web-uk for $k \geq 2$ within the time limit (3 h), due to testing too many reachability queries.
Based on the above results, we knew that our incRR + algorithm could be used to efficiently compute the RR for a given dataset, which allowed us to determine whether we could use p2HLs to facilitate processing reachability queries.

6.2. Processing Reachability Queries

We combined p2HLs with FELINE [8] (abbr. FL) and BFL+ [2] to show the impact of p2HLs on processing reachability queries in Table 4, Table 5 and Table 6, where FL-$k$ (BFL+-$k$) denotes the FL (BFL+) algorithm combined with the p2HLs generated using $k$ hop-nodes by Algorithm 3. Note that we did not set $k = 1, 2, 4, 8$, because when $k = 16$, we only needed to use one integer as a bit-vector for each node $v$ to represent both $L_{out}^{16}(v)$ and $L_{in}^{16}(v)$.
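The bit-vector idea for $k = 16$ can be illustrated as follows. This is a sketch under assumptions, not the layout used in the paper's implementation: the helper names `pack` and `covered`, the numbering of hop-nodes from 0 to 15, and the placement of the out-label in the high 16 bits are all hypothetical choices.

```python
K = 16
MASK = (1 << K) - 1

def pack(out_label, in_label):
    """Pack two sets of hop-node indices (0..15) into one 32-bit integer:
    high 16 bits hold L_out, low 16 bits hold L_in (assumed layout)."""
    out_bits = sum(1 << h for h in out_label)
    in_bits = sum(1 << h for h in in_label)
    return (out_bits << K) | in_bits

def covered(packed_u, packed_v):
    """True if L_out(u) and L_in(v) share a hop-node, i.e., u reaches v through it."""
    return ((packed_u >> K) & (packed_v & MASK)) != 0
```

With this layout, testing whether a query is answered by the p2HLs reduces to one shift and one bitwise AND; for example, `covered(pack({0, 3}, set()), pack(set(), {3, 5}))` is true because hop-node 3 appears in both labels.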
Index Size: 
Table 4 shows the impact of $k$ on index size, from which we found that with an increase in $k$, the index size increased accordingly. For example, for the web-uk dataset, the index size of FL-128 was more than twice that of FL-0. The reason was that the larger the value of $k$, the more space was required to maintain p2HLs.
Index Construction Time: 
Table 5 shows the impact of $k$ on the index construction time, from which we found that with an increase in $k$, we required more time for index construction. The time for generating p2HLs was usually much less than the index construction time for FL, i.e., FL-0, so the increase in index construction time was negligible. By comparison, the index of BFL+, i.e., BFL+-0, could be constructed very efficiently, so the time used for p2HLs construction was usually one to two times longer than the index construction time of BFL+. However, the whole index construction still took less than 5 s for all the tested graphs.
Query Time: 
We reported the query time on an equal workload, which contained 1,000,000 reachability queries for each dataset. The equal workload consisted of 50% reachable queries and 50% unreachable queries. The reason that we used equal workloads was that completely random queries would be heavily skewed towards unreachable queries [9,10], which was highly unlikely for a real workload, as the node pair in a query tended to have a certain connection [11]. Here, unreachable queries were generated by sampling node pairs with the same probability and testing each query using the FL algorithm, until we reached the required number of unreachable queries. For reachable queries, we could not choose them randomly by sampling the TC because, unlike the TC size computation, computing the TC itself suffered from high time and space complexity, and we could not obtain it within the limited time frame and available memory size for large graphs. To address this problem, we randomly selected a node $u$ in each iteration, and then randomly selected an out-neighbor $v$ recursively until $v$ had no out-neighbors available. Then, we had a path $p$ from $u$. Finally, we randomly selected a node $v \neq u$ from $p$ to obtain a reachable query from $u$ to $v$. This operation was continued until we reached the required number of reachable queries.
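The random-walk procedure for generating reachable queries can be sketched as follows. The function names, the dictionary-based graph layout, and the choice to skip start nodes without out-neighbors are assumptions made for this illustration; the input is assumed to be a DAG with at least one edge, otherwise the loops would not terminate.

```python
import random

def sample_reachable_query(succ, nodes, rng):
    """Walk from a random node u along random out-neighbours until a sink,
    then pick a node v != u on the walked path.
    Returns (u, v), or None if u has no out-neighbour."""
    u = rng.choice(nodes)
    path, v = [], u
    while succ[v]:
        v = rng.choice(succ[v])
        path.append(v)
    return (u, rng.choice(path)) if path else None

def reachable_workload(succ, count, seed=0):
    """Collect `count` reachable queries on a DAG given as node -> out-neighbour list."""
    rng = random.Random(seed)
    nodes = list(succ)
    queries = []
    while len(queries) < count:
        q = sample_reachable_query(succ, nodes, rng)
        if q is not None:
            queries.append(q)
    return queries
```

For example, `reachable_workload({1: [2, 3], 2: [4], 3: [4], 4: []}, 50, seed=1)` returns 50 pairs in which the second node is always reachable from the first. Note that this sampling favours pairs on frequently walked paths, so it is not uniform over the TC.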
We show the comparison of query time from FL-0 to FL-128 and from BFL + -0 to BFL + -128 in Table 6, from which we observed the following.
First, FL-16 (BFL+-16) and FL-32 (BFL+-32) usually required the least amount of time on the first type of datasets D1, including email, LJ, web, citeseerx, dbpedia, twitter, and web-uk, where the RR was more than 99% even when $k = 1$. For these datasets, although the index size increased along with the index construction time, as compared to FL-0 (BFL+-0), we used the least amount of resources to achieve significant improvements. For example, when compared with FL-0, FL-16 used about 1.2 times the index size and 1.4 times the index construction time to achieve a more than 30-fold improvement in query time on the citeseerx dataset. Furthermore, when compared with BFL+-0, BFL+-16 consumed 1.1 times the index size and 2.3 times the index construction time to achieve a more than 15-fold improvement in query time on the citeseerx dataset.
Second, FL-128 (BFL+-128) suffered from the largest index size (about 2–3 times larger than FL-0 (BFL+-0)) and longest index construction time (about 1–3 times longer than FL-0 (BFL+-0)), but it usually achieved a better query performance on the second type of datasets D2, including human, anthra, agrocyc, ecoo, vchocyc, and arxiv. On these datasets, the RR increased along with the increase in $k$. From Figure 5, we found that for human, anthra, agrocyc, ecoo, and vchocyc, the RR was greater than 95% when $k \geq 16$, and thus, both BFL+-16 and BFL+-32 usually performed better. For arxiv, when $k = 32$, the RR was still less than 90%, indicating that FL-128 and BFL+-128 performed the best.
Third, for the third type of datasets D3, including 10cit-Patent, 10citeseerx, 05cit-Patent, 05citeseerx, and patent, the RR was very small or even approached zero, and it did not significantly change with an increase in k. For these datasets, FL-0 and BFL + -0 usually performed better, and the use of p2HLs did not yield positive results. For example, the index size of FL-128 (BFL + -128) was 2.6 (2.1)-times larger than that of FL-0 (BFL + -0) on 10cit-Patent, and the index construction time of FL-128 (BFL + -128) was 1.3 (2.5)-times longer than that of FL-0 (BFL + -0) on 10cit-Patent. Such cost, however, resulted in more query time. The query time of FL-128 (BFL + -128) was 1.3 (5.5)-times longer than that of FL-0 (BFL + -0) on 10cit-Patent.
Finally, we selected one dataset from each type and showed the trends of their query times with respect to $k$ in Figure 7. Based on these results, we provide the following suggestions for applying p2HLs: (1) For the first type of datasets D1, we highly recommend using p2HLs with $k = 16$ to process reachability queries, because this could increase the speed of reachability queries significantly with a minimal increase in index size and index construction time. (2) For the second type of datasets D2, we also recommend using p2HLs, because they could increase the speed of reachability queries. However, the choice of $k$ depends on the resources available for the increases in index size and index construction time. In general, the larger the value of $k$, the shorter the query time, but the longer the index construction time and the larger the index size. (3) For the third type of datasets D3, we do not recommend using p2HLs to process reachability queries.

7. Conclusions

Using p2HLs for processing reachability queries could be useful for graphs with a large RR, where p2HLs can answer most queries, but for other graphs with a small RR, p2HLs may not be as powerful as expected and could even lead to query performance degradation. In this paper, we addressed the important problem of using p2HLs for processing reachability queries in a given graph. To solve this problem, we proposed RR-aware p2HLs, formally defined the RR problem, and proposed a set of algorithms for efficient RR computation. Our initial experimental results demonstrated that our optimized algorithm could efficiently compute the RR for a given graph, opening up the possibility for users to determine whether p2HLs should be used. Our subsequent experimental results revealed that the query performance of combining p2HLs with state-of-the-art algorithms varied with the increase in the number of hop-nodes $k$. Based on these results, we provided recommendations to guide the use of p2HLs for processing reachability queries: (1) for datasets with large RR, p2HLs should be used with $k = 16$; (2) for datasets with small RR, we do not recommend using p2HLs; and (3) for the remaining datasets, p2HLs can be used, and users can determine the $k$-value based on their requirements for index size, index construction time, and query time.

Author Contributions

Conceptualization, X.T. and J.Z.; methodology, X.T. and J.Z.; software, X.T.; validation, J.Z. and Y.S.; formal analysis, J.Z.; investigation, X.T.; resources, X.T.; data curation, X.L.; writing (original draft preparation), X.T. and J.Z.; writing (review and editing), X.T., X.L. and L.K.; funding acquisition, X.T. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by grants from the Natural Science Foundation of Shanghai (No. 20ZR1402700) and from the Natural Science Foundation of China (No.: 61472339, 61873337).

Data Availability Statement

All datasets used in this study are publicly available and are discussed in Section 6. They are also available from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wei, H.; Yu, J.X.; Lu, C.; Jin, R. Reachability querying: An independent permutation labeling approach. Proc. VLDB Endow. 2014, 7, 1191–1202.
  2. Su, J.; Zhu, Q.; Wei, H.; Yu, J.X. Reachability querying: Can it be even faster? IEEE Trans. Knowl. Data Eng. 2017, 29, 683–697.
  3. Jin, R.; Wang, G. Simple, fast, and scalable reachability oracle. Proc. VLDB Endow. 2013, 6, 1978–1989.
  4. Zhu, A.D.; Lin, W.; Wang, S.; Xiao, X. Reachability queries on large dynamic graphs: A total order approach. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Snowbird, UT, USA, 22–27 June 2014; pp. 1323–1334.
  5. Cheng, J.; Huang, S.; Wu, H.; Fu, A.W. TF-label: A topological-folding labeling scheme for reachability querying in a large graph. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 22–27 June 2013; pp. 193–204.
  6. Cohen, E.; Halperin, E.; Kaplan, H.; Zwick, U. Reachability and distance queries via 2-hop labels. SIAM J. Comput. 2003, 32, 1338–1355.
  7. Seufert, S.; Anand, A.; Bedathur, S.J.; Weikum, G. FERRARI: Flexible and efficient reachability range assignment for graph indexing. In Proceedings of the 2013 IEEE 29th International Conference on Data Engineering (ICDE), Brisbane, QLD, Australia, 8–12 April 2013; pp. 1009–1020.
  8. Veloso, R.R.; Cerf, L.; Junior, W.M.; Zaki, M.J. Reachability queries in very large graphs: A fast refined online search approach. In Proceedings of the 17th International Conference on Extending Database Technology (EDBT), Athens, Greece, 24–28 March 2014; pp. 511–522.
  9. Yildirim, H.; Chaoji, V.; Zaki, M.J. GRAIL: A scalable index for reachability queries in very large graphs. VLDB J. 2012, 21, 509–534.
  10. Yildirim, H.; Chaoji, V.; Zaki, M.J. GRAIL: Scalable reachability index for large graphs. Proc. VLDB Endow. 2010, 3, 276–284.
  11. Jin, R.; Ruan, N.; Dey, S.; Yu, J.X. SCARAB: Scaling reachability computation on large graphs. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, Scottsdale, AZ, USA, 20–24 May 2012; pp. 169–180.
  12. Tang, X.; Chen, Z.; Li, K.; Liu, X. Efficient computation of the transitive closure size. Clust. Comput. 2019, 22, 6517–6527.
  13. Tarjan, R.E. Depth-first search and linear graph algorithms. SIAM J. Comput. 1972, 1, 146–160.
  14. Yano, Y.; Akiba, T.; Iwata, Y.; Yoshida, Y. Fast and scalable reachability queries on graphs by pruned labeling with landmarks and paths. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, San Francisco, CA, USA, 27 October–1 November 2013; pp. 1601–1606.
  15. Jin, R.; Peng, Z.; Wu, W.; Dragan, F.F.; Agrawal, G.; Ren, B. Parallelizing pruned landmark labeling: Dealing with dependencies in graph algorithms. In Proceedings of the ICS ’20: 2020 International Conference on Supercomputing, Barcelona, Spain, 29 June–2 July 2020; Ayguadé, E., Hwu, W.W., Badia, R.M., Hofstee, H.P., Eds.; ACM: New York, NY, USA, 2020; pp. 11:1–11:13.
  16. Li, W.; Qiao, M.; Qin, L.; Zhang, Y.; Chang, L.; Lin, X. Scaling distance labeling on small-world networks. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, 30 June–5 July 2019; pp. 1060–1077.
  17. Du, M.; Yang, A.; Zhou, J.; Tang, X.; Chen, Z.; Zuo, Y. HT: A novel labeling scheme for k-hop reachability queries on DAGs. IEEE Access 2019, 7, 172110–172122.
  18. Sengupta, N.; Bagchi, A.; Ramanath, M.; Bedathur, S. ARROW: Approximating reachability using random walks over web-scale graphs. In Proceedings of the 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, 8–11 April 2019; pp. 470–481.
  19. Peng, Y.; Lin, X.; Zhang, Y.; Zhang, W.; Qin, L. Answering reachability and k-reach queries on large graphs with label constraints. VLDB J. 2022, 31, 101–127.
  20. Peng, Y.; Zhang, Y.; Lin, X.; Qin, L.; Zhang, W. Answering billion-scale label-constrained reachability queries within microsecond. Proc. VLDB Endow. 2020, 13, 812–825.
  21. Wen, D.; Yang, B.; Zhang, Y.; Qin, L.; Cheng, D.; Zhang, W. Span-reachability querying in large temporal graphs. VLDB J. 2022, 31, 629–647.
  22. Simon, K. An improved algorithm for transitive closure on acyclic digraphs. Theor. Comput. Sci. 1988, 58, 325–346.
  23. Zhou, J.; Zhou, S.; Yu, J.X.; Wei, H.; Chen, Z.; Tang, X. DAG reduction: Fast answering reachability queries. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, 14–19 May 2017; Salihoglu, S., Zhou, W., Chirkova, R., Yang, J., Suciu, D., Eds.; ACM: New York, NY, USA, 2017; pp. 375–390.
  24. Zhou, J.; Yu, J.X.; Li, N.; Wei, H.; Chen, Z.; Tang, X. Accelerating reachability query processing based on DAG reduction. VLDB J. 2018, 27, 271–296.
  25. Cha, M.; Haddadi, H.; Benevenuto, F.; Gummadi, P.K. Measuring user influence in Twitter: The million follower fallacy. In Proceedings of the ICWSM, Washington, DC, USA, 23–26 May 2010.
Figure 1. Reachability ratio (RR) of p2HLs based on k nodes with a large degree.
Figure 2. The relationship between different hop-nodes, where (a) $v_i$ is the first node, (b) $v_{i-1} \not\rightsquigarrow v_i$ and $v_i \not\rightsquigarrow v_{i-1}$, and (c) $v_{i-1} \rightsquigarrow v_i$.
Figure 3. A sample DAG G.
Figure 4. Running status of the two hash tables $H_A$ and $H_D$.
Figure 5. Comparison of the Reachability Ratio (RR) and Index-Size Ratio (ISR), where ISR is the ratio of the index size of p2HLs built from k hop-nodes to the size of the full 2-hop labels built from all nodes.
Figure 6. Operational time of different algorithms for RR computation (ms).
Figure 7. Impacts of k on query time (ms) over different datasets.
Table 1. The p2HLs $L^1$, $L^2$, $L^3$ constructed based on $S_1 = \{v_1\}$, $S_2 = \{v_1, v_2\}$, and $S_3 = \{v_1, v_2, v_3\}$, respectively.
Node | $L_{out}^1(v)$ | $L_{in}^1(v)$ | $L_{out}^2(v)$ | $L_{in}^2(v)$ | $L_{out}^3(v)$ | $L_{in}^3(v)$
v 1 111111
v 2 121, 221, 2
v 3 2 2, 33
v 4 1 1 1, 3
v 5 2 2, 3
v 6 1 1 1, 3
v 7 1 1 1, 3
v 8 3
v 9 1 1 1, 3
v 10 1 1, 2 1, 2
v 11 1 1 1, 3
v 12 2 2
v 13 1 1, 2 1, 2
v 14 3
v 15 1 1, 2 1, 2
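As a minimal sketch of how labels such as those in Table 1 are used, the standard 2-hop test reports that u reaches v when $L_{out}(u)$ and $L_{in}(v)$ share a hop-node. The label values, node names, and the function name `covered_by_labels` below are hypothetical and are not taken from Table 1. With partial labels, an empty intersection is inconclusive rather than a negative answer, so a real system would fall back to another method in that case.

```python
from typing import Dict, Set, Tuple

# Hypothetical partial labels: node -> (L_out, L_in), each a set of hop-node IDs.
Labels = Dict[str, Tuple[Set[int], Set[int]]]


def covered_by_labels(labels: Labels, u: str, v: str) -> bool:
    """Return True if the labels alone prove that u reaches v.

    The test is the usual 2-hop intersection: some hop-node h lies in
    both L_out(u) and L_in(v), meaning u reaches h and h reaches v.
    A False result only means the partial labels do not cover the pair.
    """
    empty: Tuple[Set[int], Set[int]] = (set(), set())
    l_out_u = labels.get(u, empty)[0]
    l_in_v = labels.get(v, empty)[1]
    return not l_out_u.isdisjoint(l_in_v)


# Toy labels (assumed for illustration only).
toy_labels: Labels = {
    "a": ({1}, set()),     # a reaches hop-node 1
    "b": (set(), {1, 2}),  # hop-nodes 1 and 2 reach b
    "c": ({3}, set()),     # c reaches hop-node 3
}

print(covered_by_labels(toy_labels, "a", "b"))
print(covered_by_labels(toy_labels, "c", "b"))
```

In this toy example the pair (a, b) is covered through hop-node 1, while (c, b) is not covered by the labels and would need a fallback check.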
Table 2. The status of set IDs for all nodes.
Node | $id_A(v)$ ($v_1$) | $id_D(v)$ ($v_1$) | $id_A(v)$ ($v_2$) | $id_D(v)$ ($v_2$) | $id_A(v)$ ($v_3$) | $id_D(v)$ ($v_3$)
v 1 111111
v 2 12222
v 3 2 33
v 4 1 1 4
v 5 2 3
v 6 1 1 4
v 7 1 1 4
v 8 3
v 9 1 1 4
v 10 1 2 2
v 11 1 1 4
v 12 2 2
v 13 1 2 2
v 14 3
v 15 1 2 2
Table 3. Statistics of datasets, where $d = 2|E|/|V|$ is the average degree of G, $|TC(\cdot)|$ is the average number of reachable nodes for nodes of G, and $n_t$ is the number of topological levels (the length of the longest path) of G.

| Dataset | $|V|$ | $|E|$ | $d$ | $|TC(\cdot)|$ | $n_t$ |
|---|---|---|---|---|---|
| human | 38,811 | 39,576 | 2.04 | 9 | 18 |
| anthra | 12,499 | 13,104 | 2.10 | 12 | 16 |
| agrocyc | 12,684 | 13,408 | 2.11 | 13 | 16 |
| ecoo | 12,620 | 13,350 | 2.12 | 14 | 22 |
| vchocyc | 9491 | 10,143 | 2.14 | 14 | 21 |
| arxiv | 6000 | 66,707 | 22.24 | 928 | 167 |
| email | 231,000 | 223,004 | 1.93 | 11,698 | 7 |
| LJ | 971,232 | 1,024,140 | 2.11 | 206,907 | 24 |
| web | 371,764 | 517,805 | 2.79 | 55,055 | 34 |
| 10cit-Patent | 1,097,775 | 1,651,894 | 3.01 | 3 | 7 |
| 10citeseerx | 770,539 | 1,501,126 | 3.90 | 70 | 36 |
| 05cit-Patent | 1,671,488 | 3,303,789 | 3.95 | 8 | 12 |
| 05citeseerx | 1,457,057 | 3,002,252 | 4.12 | 116 | 36 |
| citeseerx | 6,540,401 | 15,011,260 | 4.59 | 15,510 | 59 |
| dbpedia | 3,365,623 | 7,989,191 | 4.75 | 83,659 | 146 |
| patent | 3,774,768 | 16,518,947 | 8.75 | 1544 | 32 |
| twitter | 18,121,168 | 18,359,487 | 2.03 | 1,346,820 | 22 |
| web-uk | 22,753,644 | 38,184,039 | 3.36 | 3,417,930 | 2793 |
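The average-degree column of Table 3 follows directly from the formula in the caption, $d = 2|E|/|V|$. The short sketch below recomputes it for three datasets, using the node and edge counts listed in the table; the function name `average_degree` is an illustrative choice, not something defined in the paper.

```python
def average_degree(num_nodes: int, num_edges: int) -> float:
    """Average degree d = 2|E| / |V|, as defined in the Table 3 caption."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return 2 * num_edges / num_nodes


# (|V|, |E|) pairs as listed in Table 3.
stats = {
    "human": (38_811, 39_576),
    "arxiv": (6_000, 66_707),
    "patent": (3_774_768, 16_518_947),
}

for name, (n, m) in stats.items():
    print(f"{name}: d = {average_degree(n, m):.2f}")
```

Rounded to two decimals, the results agree with the d column for these three rows (2.04, 22.24, and 8.75).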
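Table 4. Comparison of the index size (MB).

| Dataset | FL-0 | FL-16 | FL-32 | FL-64 | FL-128 | BFL$^+$-0 | BFL$^+$-16 | BFL$^+$-32 | BFL$^+$-64 | BFL$^+$-128 |
|---|---|---|---|---|---|---|---|---|---|---|
| human | 0.74 | 0.89 | 1.04 | 1.33 | 1.92 | 1.09 | 1.23 | 1.38 | 1.68 | 2.27 |
| anthra | 0.24 | 0.29 | 0.33 | 0.43 | 0.62 | 0.36 | 0.41 | 0.45 | 0.55 | 0.74 |
| agrocyc | 0.24 | 0.29 | 0.34 | 0.43 | 0.63 | 0.36 | 0.41 | 0.46 | 0.56 | 0.75 |
| ecoo | 0.24 | 0.29 | 0.34 | 0.43 | 0.62 | 0.36 | 0.41 | 0.46 | 0.56 | 0.75 |
| vchocyc | 0.18 | 0.22 | 0.25 | 0.32 | 0.47 | 0.27 | 0.31 | 0.35 | 0.42 | 0.56 |
| arxiv | 0.11 | 0.14 | 0.16 | 0.20 | 0.30 | 0.25 | 0.27 | 0.29 | 0.34 | 0.43 |
| email | 4.41 | 5.29 | 6.17 | 7.93 | 11.45 | 6.39 | 7.27 | 8.15 | 9.92 | 13.44 |
| LJ | 18.52 | 22.23 | 25.93 | 33.34 | 48.16 | 27.26 | 30.96 | 34.67 | 42.08 | 56.90 |
| web | 7.09 | 8.51 | 9.93 | 12.76 | 18.43 | 11.52 | 12.94 | 14.35 | 17.19 | 22.86 |
| 10cit-Patent | 20.94 | 25.13 | 29.31 | 37.69 | 54.44 | 32.16 | 36.35 | 40.54 | 48.91 | 65.66 |
| 10citeseerx | 14.70 | 17.64 | 20.58 | 26.45 | 38.21 | 22.34 | 25.28 | 28.22 | 34.10 | 45.85 |
| 05cit-Patent | 31.88 | 38.26 | 44.63 | 57.38 | 82.89 | 51.56 | 57.94 | 64.32 | 77.07 | 102.57 |
| 05citeseerx | 27.79 | 33.35 | 38.91 | 50.02 | 72.25 | 42.11 | 47.67 | 53.23 | 64.35 | 86.58 |
| citeseerx | 124.75 | 149.70 | 174.65 | 224.55 | 324.34 | 185.10 | 210.05 | 235.00 | 284.90 | 384.70 |
| dbpedia | 64.19 | 77.03 | 89.87 | 115.55 | 166.90 | 109.16 | 122.00 | 134.84 | 160.52 | 211.87 |
| patent | 72.00 | 86.40 | 100.80 | 129.60 | 187.19 | 132.91 | 147.31 | 161.71 | 190.51 | 248.11 |
| twitter | 345.63 | 414.76 | 483.89 | 622.14 | 898.65 | 475.78 | 544.91 | 614.03 | 752.29 | 1028.79 |
| web-uk | 433.99 | 520.79 | 607.59 | 781.18 | 1128.38 | 654.25 | 741.05 | 827.84 | 1001.44 | 1348.63 |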
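Table 5. Comparison of the index construction time (ms).

| Dataset | FL-0 | FL-16 | FL-32 | FL-64 | FL-128 | BFL$^+$-0 | BFL$^+$-16 | BFL$^+$-32 | BFL$^+$-64 | BFL$^+$-128 |
|---|---|---|---|---|---|---|---|---|---|---|
| human | 9.01 | 11.63 | 13.12 | 13.27 | 13.29 | 0.92 | 3.52 | 3.64 | 4.47 | 4.54 |
| anthra | 2.89 | 3.53 | 4.20 | 4.26 | 4.39 | 0.31 | 1.08 | 1.16 | 1.18 | 1.21 |
| agrocyc | 2.96 | 3.80 | 4.15 | 4.72 | 4.48 | 0.25 | 1.11 | 1.21 | 1.26 | 1.47 |
| ecoo | 3.08 | 3.78 | 4.42 | 6.56 | 4.51 | 0.25 | 1.11 | 1.24 | 1.31 | 1.61 |
| vchocyc | 2.22 | 2.79 | 3.58 | 3.89 | 4.07 | 0.21 | 0.86 | 0.89 | 0.93 | 0.99 |
| arxiv | 4.71 | 5.93 | 7.65 | 8.57 | 9.76 | 1.05 | 3.45 | 3.66 | 4.32 | 5.00 |
| email | 81.3 | 87.9 | 101.8 | 97.2 | 97.9 | 8.7 | 27.6 | 27.9 | 29.8 | 30.4 |
| LJ | 325.1 | 381.7 | 438.3 | 430.7 | 442.2 | 43.5 | 135.7 | 138.1 | 138.3 | 139.0 |
| web | 178.5 | 215.8 | 241.7 | 240.6 | 245.5 | 33.2 | 90.2 | 93.4 | 91.2 | 96.5 |
| 10cit-Patent | 801.3 | 1026.2 | 1084.9 | 1055.0 | 1085.0 | 189.4 | 448.9 | 451.7 | 454.2 | 466.7 |
| 10citeseerx | 376.3 | 497.7 | 527.9 | 531.4 | 550.0 | 90.2 | 233.7 | 236.4 | 243.0 | 248.6 |
| 05cit-Patent | 1517.8 | 1966.1 | 1989.4 | 2037.9 | 2048.2 | 417.8 | 937.8 | 954.9 | 975.9 | 974.0 |
| 05citeseerx | 775.6 | 1032.4 | 1087.6 | 1076.4 | 1146.7 | 197.5 | 509.2 | 517.3 | 525.6 | 552.4 |
| citeseerx | 4063.9 | 5615.0 | 6061.8 | 6002.4 | 6123.3 | 1371.5 | 3179.8 | 3182.9 | 3221.6 | 3234.7 |
| dbpedia | 2264.3 | 2982.4 | 3200.0 | 3218.6 | 3209.9 | 595.8 | 1391.9 | 1396.7 | 1410.9 | 1436.2 |
| patent | 5022.8 | 7642.4 | 7818.6 | 7890.3 | 7862.5 | 2266.9 | 4725.9 | 4786.5 | 4833.9 | 4984.7 |
| twitter | 6287.6 | 7497.8 | 8287.8 | 8285.5 | 8770.7 | 1014.2 | 2760.4 | 2761.9 | 2791.2 | 2812.9 |
| web-uk | 8689.7 | 9968.6 | 11,138.9 | 11,185.5 | 11,559.9 | 1116.6 | 3204.8 | 3236.5 | 3374.6 | 3375.9 |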
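Table 6. Comparison of the query time (ms).

| Dataset | FL-0 | FL-16 | FL-32 | FL-64 | FL-128 | BFL$^+$-0 | BFL$^+$-16 | BFL$^+$-32 | BFL$^+$-64 | BFL$^+$-128 |
|---|---|---|---|---|---|---|---|---|---|---|
| human | 24.6 | 4.4 | 4.1 | 4.4 | 5.0 | 15.5 | 7.9 | 8.5 | 8.4 | 10.1 |
| anthra | 26.2 | 5.4 | 4.6 | 5.0 | 5.3 | 13.3 | 6.8 | 6.9 | 6.9 | 7.5 |
| agrocyc | 24.5 | 5.5 | 5.3 | 5.0 | 5.6 | 16.3 | 7.7 | 7.5 | 6.9 | 8.2 |
| ecoo | 28.0 | 5.7 | 5.1 | 5.4 | 5.6 | 13.4 | 7.2 | 7.1 | 7.2 | 7.6 |
| vchocyc | 26.9 | 5.0 | 4.8 | 4.8 | 5.5 | 13.3 | 6.9 | 7.7 | 7.0 | 7.5 |
| arxiv | 479.1 | 153.4 | 144.2 | 140.7 | 140.3 | 177.4 | 37.3 | 36.3 | 35.6 | 32.6 |
| email | 82.2 | 12.4 | 13.2 | 13.9 | 20.5 | 47.6 | 23.9 | 23.0 | 23.4 | 27.7 |
| LJ | 108.7 | 27.7 | 25.0 | 27.4 | 33.5 | 58.0 | 31.9 | 30.7 | 34.5 | 37.9 |
| web | 142.9 | 34.2 | 31.7 | 34.9 | 41.0 | 76.5 | 32.7 | 32.3 | 33.3 | 37.0 |
| 10cit-Patent | 155.8 | 169.2 | 175.0 | 189.2 | 196.4 | 96.2 | 112.7 | 135.3 | 147.1 | 145.4 |
| 10citeseerx | 206.2 | 190.3 | 182.7 | 177.3 | 188.9 | 113.2 | 114.7 | 115.0 | 113.8 | 117.8 |
| 05cit-Patent | 410.4 | 381.2 | 367.2 | 352.3 | 363.7 | 197.5 | 199.3 | 192.0 | 185.7 | 184.7 |
| 05citeseerx | 250.8 | 218.8 | 217.2 | 207.0 | 220.1 | 130.9 | 136.1 | 131.2 | 130.3 | 138.1 |
| citeseerx | 2207.8 | 76.3 | 73.5 | 77.4 | 85.7 | 988.1 | 64.2 | 67.4 | 67.1 | 73.7 |
| dbpedia | 346.6 | 55.8 | 54.8 | 57.6 | 62.5 | 205.0 | 49.0 | 49.3 | 48.8 | 52.6 |
| patent | 13,716.5 | 12,584.4 | 12,629.1 | 12,026.3 | 12,021.3 | 3415.1 | 3388.7 | 3248.2 | 3204.3 | 3114.2 |
| twitter | 168.9 | 42.8 | 43.0 | 47.8 | 54.5 | 113.0 | 64.9 | 64.2 | 72.6 | 83.0 |
| web-uk | 256.9 | 115.0 | 127.5 | 136.6 | 157.6 | 126.0 | 86.6 | 91.2 | 87.5 | 100.9 |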
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Tang, X.; Zhou, J.; Shi, Y.; Liu, X.; Kong, L. Efficient Reachability Ratio Computation for 2-Hop Labeling Scheme. Electronics 2023, 12, 1178. https://doi.org/10.3390/electronics12051178
