# A Distributed Indexing Method for Timeline Similarity Query

^{1}

^{2}

^{*}

## Abstract

**:**

## 1. Introduction

^{+}-tree [19,20], RI-tree [21,22], TD-Tree [8], TB-Tree [9], and the SHB

^{+}-tree [23], etc. can be categorized into four groups [8]. Most existing interval index methods are extensions of traditional B

^{+}-tree [24,25] or R-tree [26]. In general, the tree-like index structures are a kind of centralized data structure. They are hierarchical and are not efficient for parallel implementations due to high concurrency control cost [9,27,28]. How to transform the tree-like structure into linear structure effectively and efficiently is one of the key solutions to this problem. Another noteworthy issue is that most space partitioning indexes, such as TD-Tree and TB-Tree, often have a predefined space region for the root node. It limits the scalability to indexing dynamic interval data set.

^{+}-Tree, and its future work is to support Apache Spark and Hadoop for specialized indexing on very large datasets. Furthermore, it refers to indexing NGS data rather than timelines.

- The triangle increment partition strategy (TIPS) is proposed to support the representation and partition of the interval dataset for indexing timeline datasets.
- The DTI-Tree based on TIPS is proposed to effectively and efficiently index timelines. DTI-Tree is a novel indexing method for interval-based discrete timelines similarity query based on Apache Spark.
- The evaluation function of timeline similarity has been presented and the timeline similarity query has been implemented by the supporting of the DTI-Tree.
- An open source timeline test dataset generator named TimelineGenerator is developed. It can generate various timeline test datasets for different conditions, and may provide some benchmark timeline datasets for the future research.

## 2. Timeline’s Representation and Similarity Function

_{id}, I

_{id}, I

_{s}, I

_{e}, I

_{t}, I

_{l}, I

_{d}], I

_{Min}≤ I

_{s}≤ I

_{e}≤ I

_{Max},

_{Max}is the maximum value of the intervals, I

_{Min}is the minimum value of the intervals, T

_{id}is the identifier of the timeline that the interval belongs to, I

_{id}is the identifier of the interval, here it is the order in a timeline, I

_{s}is the starting of interval, I

_{e}is the ending of interval, I

_{t}is the type of the interval, I

_{l}is the label of the interval, and I

_{d}is the attachment data of the interval shown in Table 1.

_{s}= I

_{e}. An interval can be described as a point in a 2D plane, as shown in Figure 4 [9]. The axis S is the starting value of the interval, and axis E is the ending value of the interval. Because I

_{e}is always larger than or equal to I

_{s}, the region below the line NP is always empty in the two-dimensional space. This offers the possibility for reducing the query computation by skipping the empty region. Our previous work [9] described 15 distinct algebra relationships between two intervals. Each of the 15 interval algebra relationships can be represented as a point, line, and region in the 2D space. If the universal interval is [I

_{s}, I

_{e}], and the input interval of a query is [Q

_{s}, Q

_{e}], we can represent the relationships of two interval objects in two-dimensional space, as defined in Table 2.

^{T}== L

^{S}). map (overlaps (I

^{T}, IS)). reduce (_+_)

^{T}(Label of an interval in T) is not equal L

^{S}(Label of an interval in S). The result is also a pair set of intervals, except that the two intervals in a pair have same labels. The function map calls the interval operator overlaps to calculate the overlap value of each pair of intervals. At last, the function reduce adds all of the overlap values to a number and return the result. For example,

_{id}= 1, I

_{id}= 0, I

_{s}= 10, I

_{e}= 12, I

_{t}= IT_CLOSED, I

_{l}= “reading”, I

_{d}= “a book”],

_{id}= 1, I

_{id}= 1, I

_{s}= 22, I

_{e}= 24, I

_{t}= IT_CLOSED, I

_{l}= “writing”, I

_{d}= “a paper”]

_{id}= 2, I

_{id}= 0, I

_{s}= 11, I

_{e}= 12, I

_{t}= IT_CLOSED, I

_{l}= “reading”, I

_{d}= “a book”],

_{id}= 2, I

_{id}= 1, I

_{s}= 20, I

_{e}= 24, I

_{t}= IT_CLOSED, I

_{l}= “writing”, I

_{d}= “a paper”]

## 3. Distributed Triangle Increment Tree (DTI-Tree)

#### 3.1. The Strategy of Partition

Algorithm 1: Tips |

Input: pi: an interval [I_{s}, I_{e}]; root: root node of TI-Tree or T-Tree; Output: root: the new root node { 1. read the root node from disk; 2. if (root._triangle contains pi) return root; 3. if (pi is located at the left of line AB) executes left extension shown in Figure 8a; else executes right extension shown in Figure 8b; 4. while (root. _triangle does not contain pi) call Algorithm 1 recursively; 5. return root; } |

## 4. The Structure of DTI-Tree

_{id}, I

_{id}, I

_{s}, I

_{e}, I

_{t}], where [I

_{s}, I

_{e}] is the spatial location on a 2D plane (Figure 7a), defined by the start value and the end value of the data. Each object belongs to a triangle partition, as shown in Figure 7a. The DTI-Tree is a combination of T-Tree and TI-Trees that are processed by each slave node in the cluster. Let M be the master node and have n slave nodes. Each slave node is recorded as S

_{i}and i $\in $ [0, n − 1]. Each slave node has one or K TI-Trees. They may be recorded as

#### 4.1. The Construction of DTI-Tree

_{t}is a timeline in TD and t $\in $ [0, TN − 1]. The first step of the data partition is to load the timeline data set from HDFS and to call the function flapMap in Table 3 to execute a mapping from timeline data set to interval data set IDS. Then, we get the minimum and maximum values of the interval data set IDS from the interval Spark Resilient Distributed Datasets (RDD) by calling the functions max and min in Table 3. A root triangle of the T-Tree will be set up on the base of the minimum value and maximum value by the algorithm that is presented in [9]. The root triangle will be partitioned into n sub triangles by the partition strategy that is mentioned in Section 3.1. Each sub triangle is related to a slave node. The last step is to call the function groupBy listed in Table 3 to form n groups according to the interval point location in a certain sub triangle. The main construction process is shown in Figure 10. The second phase is the construction of T-Tree according to the sub triangles. Its construction is simple because the T-Tree is an absolute binary search tree. The last phase is the constructions of TI-Trees. The TI-Tree is an extension of the TD-Tree’s [8]. They share almost all the algorithms except the construction methods. The Algorithm 2 shows the process for constructing TI-Trees.

Algorithm 2: TI-Tree construction |

Input: IDS: an interval data set; Output: TI-Tree { 1. initial the root triangle variable root; 2. i = 0; 3. while (i < IDS.size()) if (root._triangle.constains(IDS [i])) call the TD-Tree’s insertion procedure; i++; else call Algorithm 1; 4. return TI-Tree. } |

_{i}. If the standard deviation $\sigma $ is larger than the threshold given, the slave that has the maximum value of $count\left({S}_{i}\right)$ should be partitioned into two parts, reconstruct two TI-Trees in the same slave node, and inform the T-Tree in the master node to adjust tree. The Algorithm 3 shows the construction method of the DTI-Tree.

Algorithm 3: DTI-Tree construction shown in Figure 10 |

Input: TD: a timeline data set; Output: DTI-Tree { 1. IDS= flapMap(TD); //convert timeline set into interval set. 2. initialize and recursively binary split the root triangle by TIPS until the number of the leaf node triangle is equal to the number of all the slaves SN in the cluster; 3. call the function groupBy to split IDS into SN groups according to the leaf nodes’ triangles generated in Step 2; 4. executes load balance evaluation and avoid adjustment after construction; 5. for each group, call Algorithm 2 to construct TI-Tree on the corresponding slave; } |

#### 4.2. The Update of DTI-Tree

Algorithm 4: DTI-Tree insertion |

Input: T: a timeline; Output: DTI-Tree { 1. call the function intervalize listed in Table 3 to get a set of interval points IPS; 2. select the minimum triangle minTri which contains IPS; 3. get the slaves whose root triangle is intersected with minTri; 4. if (the number of slaves is equal 1) let the corresponding TI-Tree insert all the interval points of the timeline else divide the IPS into several parts; let the corresponding TI-Tree insert each part. } |

#### 4.3. The Timeline Similarity Query of DTI-Tree

Algorithm 5: DTI-Tree similarity query shown in Figure 11 |

Input: T: a timeline; Output: the most similar timeline to T { 1. call the function intervalize to get a set of interval points Q = {Q _{0}, Q_{1}, …, Q_{k}, …, Q_{m}}; 2. calculate the overlap query region of Q _{k}, noted as R_{k}; R = R _{0} $\cup $ R_{1}, …, $\cup $ R_{k}, …, $\cup $ R_{m}; 3. find all the slave root triangles which overlaps R by the T-Tree in the master node, the result is {RTi}. RT _{i} is the root triangle of the TI-Tree_{i} on the slave node S_{i}. 4. find all the interval points P _{i} in the rectangle R by the TI-Tree_{i} on the slave S_{i} separately; 5. execute P _{i} .groupBy(timeline ID) on the slave S_{i} separately, and the result is some partial timelines PT_{i}={PT_{ij}}; 6. execute similarity (T, PT _{ij}) and send the result that are some partial similarity values PSV_{i} to the master; 7. combine all the partial similarity values {PSV _{ij}} on the master node, here i is a slave identifier and j is a timeline identifier; 8. return the result of {PSV _{ij}}. reduceBy(j). max (); } |

## 5. Experimental Evaluation

^{®}Core™ i7 processor and 8 GB RAM. The communications among the master node, the slave node, and the client are implemented by the Hadoop Remote Procedure Call (RPC) (https://wiki.apache.org/hadoop/HadoopRpc). We design two protocol interfaces, called MasterProtocol and SlaveProtocol, to support the communications. The operation systems of the cluster are all Ubuntu 16.04. All the algorithms are implemented by java (JDK 1.8.0_144) using IntelliJ IDEA 2017 (community edition).

## 6. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

- Brehmer, M.; Lee, B.; Bach, B.; Riche, N.H.; Munzner, T. Timelines revisited: A design space and considerations for expressive storytelling. IEEE Trans. Vis. Comput. Graph.
**2017**, 23, 2151–2164. [Google Scholar] [CrossRef] [PubMed] - Camerra, A.; Palpanas, T.; Shieh, J.; Keogh, E. Isax 2.0: Indexing and Mining One Billion Time Series. In Proceedings of the 2010 10th IEEE International Conference on Data Mining (ICDM), Sydney, NSW, Australia, 14–17 December 2010; Institute of Electrical and Electronics Engineers Inc.: Sydney, NSW, Australia, 2010; pp. 58–67. [Google Scholar]
- Zoumpatianos, K.; Idreos, S.; Palpanas, T. Ads: The adaptive data series index. VLDB J.
**2016**, 25, 843–866. [Google Scholar] [CrossRef] - Yagoubi, D.E.; Akbarinia, R.; Masseglia, F.; Palpanas, T. Dpisax: Massively distributed partitioned isax. In Proceedings of the 2017 17th IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017; IEEE Computer Society Press: New Orleans, LA, USA, 2017; pp. 1135–1140. [Google Scholar]
- Kondylakis, H.; Dayan, N.; Zoumpatianos, K.; Palpanas, T. Coconut: A scalable bottom-up approach for building data series indexes. Proc. VLDB Endow.
**2018**, 11, 677–690. [Google Scholar] - Kriegel, H.P.; Potke, M.; Seidl, T. Interval sequences: An object-relational approach to manage spatial data. In Proceedings of the 2001 7th International Symposium Advances in Spatial and Temporal Databases (SSTD), Redondo Beach, CA, USA, 12–15 July 2001; Jensen, C.S., Schneider, M., Seeger, B., Tsotras, V., Eds.; Springer: Berlin, Germany, 2001; Volume 2121, pp. 481–501. [Google Scholar]
- Kriegel, H.-P.; Pötke, M.; Seidl, T.; Jensen, C.; Schneider, M.; Seeger, B.; Tsotras, V. Object-relational indexing for general interval relationships. In Advances in Spatial and Temporal Databases; Springer: Berlin/Heidelberg, Germany, 2001; Volume 2121, pp. 522–542. [Google Scholar]
- Stantic, B.; Topor, R.; Terry, J.; Sattar, A. Advanced indexing technique for temporal data. Comput. Sci. Inf. Syst.
**2010**, 7, 679–703. [Google Scholar] [CrossRef] - He, Z.; Kraak, M.-J.; Huisman, O.; Ma, X.; Xiao, J. Parallel indexing technique for spatio-temporal data. ISPRS J. Photogramm. Remote Sens.
**2013**, 78, 116–128. [Google Scholar] [CrossRef] - Ma, X.; Fox, P. Recent progress on geologic time ontologies and considerations for future works. Earth Sci. Inform.
**2013**, 6, 31–46. [Google Scholar] [CrossRef] - Ma, X.; Carranza, E.J.M.; Wu, C.; van der Meer, F.D.; Liu, G. A skos-based multilingual thesaurus of geological time scale for interoperability of online geological maps. Comput. Geosci.
**2011**, 37, 1602–1615. [Google Scholar] [CrossRef] - He, Z.; Wu, C.; Liu, G.; Zheng, Z.; Tian, Y. Decomposition tree: A spatio-temporal indexing method for movement big data. Clust. Comput.
**2015**, 18, 1481–1492. [Google Scholar] [CrossRef] - Xie, X.; Mei, B.; Chen, J.; Du, X.; Jensen, C.S. Elite: An elastic infrastructure for big spatiotemporal trajectories. VLDB J.
**2016**, 25, 473–493. [Google Scholar] [CrossRef] - Elmasri, R.; Wuu, G.; Kim, Y. The Time Index: An Access Structure for Temporal Data. In Proceedings of the 16th International Conference on Very Large Databases, Brisbane, Australia, 13–16 August 1990; pp. 1–12. [Google Scholar]
- Curtis, P.K.; Michael, S. Segment Indexes: Dynamic Indexing Techniques for Multi-Dimensional Interval Data. SIGMOD Rec.
**1991**, 20, 138–147. [Google Scholar] - Chaabouni, M.; Chung, S.M. The point-range tree: A Data Structure for Indexing Intervals. In Proceedings of the 1993 ACM Conference on Computer Science, Washington, DC, USA, 25–28 May 1993; ACM: Indianapolis, IN, USA, 1993; pp. 453–460. [Google Scholar]
- Ang, C.H.; Tan, K.P. The interval B-tree. Inf. Process. Lett.
**1995**, 53, 85–89. [Google Scholar] [CrossRef] - Tsotras, V.J.; Gopinath, B.; Hart, G.W. Efficient management of time-evolving databases. IEEE Trans. Knowl. Data Eng.
**1995**, 7, 591–608. [Google Scholar] [CrossRef] - Lanka, S.; Mays, E. Fully Persistent B
^{+}-Trees. SIGMOD Rec.**1991**, 20, 426–435. [Google Scholar] [CrossRef] - Brodal, G.S.; Brodal, L.; Tsakalidis, K.; Sioutas, S.; Tsichlas, K. Fully Persistent B-Trees. In Proceedings of the 23rd annual ACM-SIAM Symposium on Discrete algorithms, Kyoto, Japan, 17–19 January 2012; Society for Industrial and Applied Mathematics: Kyoto, Japan, 2012; pp. 602–614. [Google Scholar]
- Kriegel, H.-P.; Pötke, M.; Seidl, T. Managing intervals efficiently in object-relational databases. In Proceedings of the 26th International Conference on Very Large Data Bases, San Francisco, CA, USA, 10–14 September 2000; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2000; pp. 407–418. [Google Scholar]
- Enderle, J.; Schneider, N.; Seidl, T. Efficiently processing queries on interval-and-value tuples in relational databases. In Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, 30 August–2 September 2005; VLDB Endowment: Trondheim, Norway, 2005; pp. 385–396. [Google Scholar]
- Wang, M.; Xiao, M. SHB
^{+}-tree: A segmentation hybrid index structure for temporal data. In Proceedings of the 2015 15th International Conference on Algorithms and Architectures for Parallel (ICA3PP), Zhangjiajie, China, 18–20 November 2015; Wang, G., Zomaya, A., Martinez Perez, G., Li, K., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 282–296. [Google Scholar] - Bayer, R.; McCreight, E.M. Organization and maintenance of large ordered indexes. Acta Inform.
**1972**, 1, 173–189. [Google Scholar] [CrossRef] - Douglas, C. Ubiquitous B-tree. ACM Comput. Surv. (CSUR)
**1979**, 11, 121–137. [Google Scholar] - Guttman, A. R-trees: A dynamic index structure for spatial searching. SIGMOD Rec.
**1984**, 14, 47–57. [Google Scholar] [CrossRef] - Kim, J.; Nam, B. Parallel multi-dimensional range query processing with R-trees on GPU. J. Parallel Distrib. Comput.
**2013**, 73, 1195–1207. [Google Scholar] [CrossRef] - Eldawy, A.; Mokbel, M.F. Spatialhadoop: A mapreduce framework for spatial data. In Proceedings of the 2015 31st IEEE International Conference on Data Engineering, Seoul, Korea, 13–17 April 2015; IEEE Computer Society: Seoul, Korea, 2015; pp. 1352–1363. [Google Scholar]
- Dean, J.; Ghemawat, S. Mapreduce: Simplified data processing on large clusters. Commun. ACM
**2008**, 51, 107–113. [Google Scholar] [CrossRef] - Aji, A.; Wang, F.; Vo, H.; Lee, H.; Liu, Q.; Zhang, X.; Saltz, J. Hadoop gis: A high performance spatial data warehousing system over mapreduce. Proc. VLDB Endow.
**2013**, 6, 1009–1020. [Google Scholar] [CrossRef] - Alarabi, L.; Mokbel, M.F.; Musleh, M. St-hadoop: A mapreduce framework for spatio-temporal data. In Proceedings of the 2017 15th International Symposium Advances in Spatial and Temporal Databases (SSTD), Arlington, VA, USA, 21–23 August 2017; Gertz, M., Renz, M., Zhou, X., Hoel, E., Ku, W.-S., Voisard, A., Zhang, C., Chen, H., Tang, L., Huang, Y., et al., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 84–104. [Google Scholar]
- Yu, J.; Wu, J.X.; Sarwat, M. Geospark: A Cluster Computing Framework for Processing Large-Scale Spatial Data; Assoc Computing Machinery: New York, NY, USA, 2015. [Google Scholar]
- Baig, F.; Mehrotra, M.; Vo, H.; Wang, F.; Saltz, J.; Kurc, T. Sparkgis: Efficient comparison and evaluation of algorithm results in tissue image analysis studies. In Proceedings of the Biomedical Data Management and Graph Online Querying: VLDB 2015 Workshops, Big-O(Q) and DMAH, Waikoloa, HI, USA, 31 August–4 September 2015; Wang, F., Luo, G., Weng, C., Khan, A., Mitra, P., Yu, C., Eds.; Revised Selected Papers. Springer International Publishing: Cham, Switzerland, 2016; pp. 134–146. [Google Scholar]
- Wang, H.; Belhassena, A. Parallel trajectory search based on distributed index. Inf. Sci.
**2017**, 388–389, 62–83. [Google Scholar] [CrossRef] - Jalili, V.; Matteucci, M.; Masseroli, M.; Ceri, S. Indexing next-generation sequencing data. Inf. Sci.
**2017**, 384, 90–109. [Google Scholar] [CrossRef]

**Figure 2.**Some Empires of China, Japan, and Korea (https://timelinestoryteller.com/#examples).

**Figure 7.**TD-tree’s partition strategy (TDPS) and triangle binary tree. (

**a**) TD-Tree’s triangle decomposition; (

**b**) Triangle binary tree with codes.

**Figure 9.**DTI-Tree index structure (example of Figure 7a).

Interval Type | Expression |
---|---|

IT_CLOSED | [I_{s}, I_{e}] |

IT_LEFTOPEN | (I_{s}, I_{e}] |

IT_RIGHTOPEN | [I_{s}, I_{e}) |

IT_OPEN | (I_{s}, I_{e}) |

**Table 2.**Interval algebra relationships [9].

Operator | Definition | Result Shape |
---|---|---|

equals | I_{s} = Q_{s} and I_{e} = Q_{e} | point |

covers | I_{s} = Q_{s} and I_{e} = Q_{e} | point, if Q is close and I is open |

coveredBy | I_{s} = Q_{s} and I_{e} = Q_{e} | point, if Q is open and I is close |

Starts | I_{s} = Q_{s} and Q_{s} < I_{e} < Q_{e} | line |

startedBy | I_{s} = Q_{s} and I_{e} > Q_{e} | line |

Meets | I_{s} ≤ I_{e} = Q_{s} ≤ Q_{e} | line |

metBy | Q_{s} ≤ Q_{e} = I_{s} ≤ I_{e} | line |

finishes | Q_{s} < I_{s} < Q_{e} and I_{e} = Q_{e} | line |

finishedBy | I_{s} < Q_{s} and I_{e} = Q_{e} | line |

before | I_{s} ≤ I_{e} < Q_{s} ≤ Q_{e} | region |

After | Q_{s} ≤ Q_{e} < I_{s} ≤ I_{e} | region |

overlaps | I_{s} < Q_{s} and Q_{s} < I_{e} < Q_{e} | region |

overlappedBy | Q_{s} < I_{s} < Q_{e} and I_{e}>Q_{e} | region |

during | Q_{s} < I_{s} ≤ I_{e} < Q_{e} | region |

contains | I_{s} < Q_{s} ≤ Q_{e} < I_{e} | region |

Function | Description |
---|---|

cartesian | returns the cartesian product of this timeline and another one, that is, the result of all pairs of elements (a, b) where a is in this and b is in other. |

count | returns the number of elements in the set. |

filter | returns a new set containing only the elements that satisfy a predicate. |

flatMap | returns a new set by first applying a function to all elements of this set, and then flattening the results. |

foreach | applies a function f to all elements of this set. |

groupBy | returns a set of grouped elements. |

intervalize | convert a timeline into a set of intervals |

min | returns the minimum element from this set as defined by the specified comparator. |

map | returns a new set by applying a function to all elements of this set. |

max | returns the maximum element from this set as defined by the specified comparator. |

reduce | reduces the elements of this set using the specified commutative and associative binary operator. |

similarity | calculate the similarity of two timelines. |

Parameters | Description |
---|---|

maxTimeValue | the maximum time value of all the timelines generated by generator, the default value is 0. |

minTimeValue | the minimum time value of all the timelines generated by generator, the default value is 10,000. |

timelineMinLength | the shortest length of a timeline generated by generator, the default is 10. |

timelineMaxLength | the longest length of a timeline generated by generator, the default is 50. |

intervalMinLength | the shortest length of an interval generated by generator, the default is 2. |

intervalMaxLength | the longest length of an interval generated by generator, the default is 10. |

intervalAttachment | the range of the attachment data size, the default is [1K, 2K] |

labelTypes | the maximum types involved in the dataset generated by generator, the default value is 5. |

numberTimelines | the total number of the timelines generated by generator, the default value is 10,000,000. |

outputFileName | the generated timeline data file. |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

He, Z.; Ma, X.
A Distributed Indexing Method for Timeline Similarity Query. *Algorithms* **2018**, *11*, 41.
https://doi.org/10.3390/a11040041

**AMA Style**

He Z, Ma X.
A Distributed Indexing Method for Timeline Similarity Query. *Algorithms*. 2018; 11(4):41.
https://doi.org/10.3390/a11040041

**Chicago/Turabian Style**

He, Zhenwen, and Xiaogang Ma.
2018. "A Distributed Indexing Method for Timeline Similarity Query" *Algorithms* 11, no. 4: 41.
https://doi.org/10.3390/a11040041