1. Introduction
Extensible Markup Language (XML), which became a World Wide Web Consortium (W3C) Recommendation in 1998, remains one of the main methods of exchanging data over the Internet. It also plays an important role in many aspects of software development, often to simplify data storage and sharing. Thus, efficient storing and querying of XML data are key tasks that have been extensively studied during the past few years [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15].
To retrieve data from XML documents, various query languages such as XPath [16], XPointer [17] and XLink [18] have been designed. However, without a structural summary, query processing can be quite inefficient due to an exhaustive traversal of the XML data. To achieve fast searching, we can preprocess the data subject and construct an index.
Basically, the problem of XML data indexing is to construct a data structure that can efficiently process queries written in XML query languages such as XPath. There are two crucial requirements connected with all indexing methods: first, a small index size; second, very fast query processing, which ideally means that the answers to the queries are found in time linear in the size of the query and independent of the size of the subject in which the queries are located. If these requirements are fulfilled, the index structure allows one to answer a number of queries with low requirements for both time and space complexity.
However, the flexibility of the specifications of XML queries adds to the challenge of indexing methods, and the creation of a universal index that is able to process all possible XML queries efficiently is very challenging. Using only the two most commonly used XPath axes (child axis and descendant-or-self axis), the number of potential queries is exponential (e.g., $\mathcal{O}\left({2.62}^{n}\right)$ for a simple linear XML tree with n nodes [19]). Therefore, there is always a trade-off between the size and the power of an XML index: it either needs to be large to perform well or performs poorly as a consequence of saving space.
In this paper, we propose three indexing methods that are all based on finite state automata. These methods are simple and allow one to efficiently process a small subset of XPath. Therefore, having an XML data structure, our methods can be used efficiently as auxiliary data structures that enable answering a particular set of queries.
All automata presented in this paper support some fragments of linear XPath queries. In particular, we focus on the two common axes (i.e., child and descendant-or-self) with name tests. However, the techniques described here may also be relevant to the general XPath processing problem. First, we believe that a similar approach can be used to build automata that support other XPath axes (e.g., an automaton supporting the parent and ancestor axes). Second, processing linear expressions is a subproblem of processing more complex queries, as we can decompose them into linear fragments. Third, this can be seen as a building block for more powerful processors able to process branching queries. Moreover, it is easy to combine the indexes presented in this paper with other automata-based indexes using standard methods of automata theory.
First, we present the Tree String Paths Automaton (TSPA) and the Tree String Path Subsequences Automaton (TSPSA; introduced in [20]), aimed at assisting in evaluating XPath queries with either child or descendant-or-self axes only. Then, we present the Tree Paths Automaton (TPA; introduced in [21]), which is designed to process XPath queries using any combination of child and descendant-or-self axes.
The rest of this paper is organized as follows. Section 2 discusses state-of-the-art methods for XML data indexing. Section 3 gives the necessary theoretical background including a brief description of both XML and XPath. Next, in Section 4, we introduce our approach to XML data indexing using automata theory. The theoretical time and space complexities of the proposed methods and experimental evaluation are discussed in Section 5 and Section 6, respectively. Finally, we summarize the contributions of our research, discuss our future work directions and conclude the paper in Section 7.
4. Automata Approach to XML Data Indexing
In this section, we introduce three new methods for the problem of XML data indexing using automata theory and show that automata can be used efficiently for the purpose of indexing XML documents. These methods are simple and allow one to efficiently process a small subset of XPath. Therefore, having an XML data structure, our methods can be used efficiently as auxiliary data structures that enable answering a particular set of queries. Given an XML document and an input XPath query, the searching phase finds the answer to the query in time that is linear in the size of the query and does not depend on the size of the original XML document.
This section is organized as follows. First, we provide some common preliminaries. Next, we introduce the Tree String Paths Automaton (TSPA) representing an index for all linear XPath queries using the child axis (i.e., /) only, denoted as $X{P}^{\{/,nametest\}}$. After that, we present the Tree String Path Subsequences Automaton (TSPSA), an index for all $X{P}^{\{//,nametest\}}$ queries using the descendant-or-self axis (i.e., //) only. Finally, we introduce the Tree Paths Automaton (TPA), which is designed to process all XPath queries with any combination of child (i.e., /) and descendant-or-self (i.e., //) axes, denoted as $X{P}^{\{/,//,nametest\}}$.
4.3. Tree String Path Subsequences Automaton
The Tree String Path Subsequences Automaton (TSPSA) is a finite state automaton that efficiently evaluates all linear XPath queries $X{P}^{\{//,nametest\}}$ where only the descendant-or-self axis (i.e., the //-axis) is used. Again, we can represent such a fragment of XPath queries over an XML document D by the context-free grammar as follows:
Definition 6 (Tree string path subsequences automaton).
Let D be an XML document. TSPSA accepts all $X{P}^{\{//,nametest\}}$ queries of D, and for each query Q, it gives a list of elements satisfying Q.
As for the TSPA, the construction of TSPSA is very systematic. The given XML tree model T is preprocessed, and the string path set $P\left(T\right)$ is obtained. However, to satisfy XPath queries with the //-axis, we are interested in subsequences of a string path rather than its prefixes. This is why we construct a subsequence automaton for each string path ${P}_{i}\in P\left(T\right)$ instead of a prefix automaton. The automaton solving the problem of subsequences for both single and multiple strings is also referred to as the Directed Acyclic Subsequence Graph (DASG) and is further studied in [38,39]. Therefore, we propose the XML index problem as another application area of DASG.
There are three algorithms available for building a DASG for a set of strings: right-to-left [40], left-to-right [41] and on-line [39]. However, none of them is based on a subset construction, which gives the sets of positions serving as answers for input queries. Therefore, we propose a direct subset construction of a deterministic subsequence automaton; see Algorithm 3 and Figure 4.
Figure 4 shows transition diagrams of the deterministic subsequence automata constructed by Algorithm 3 for all string paths contained in $P\left(T\right)$, where T is the XML tree model illustrated in Figure 1.
Algorithm 3: Construction of a deterministic subsequence automaton for a single string path.
Data: A string path $P={n}_{1}{n}_{2}\cdots {n}_{|P|}$.
Result: DFSA $M=(Q,A,\delta ,{q}_{0},F)$ accepting all (nonempty) $X{P}^{\{//,nametest\}}$ queries of P.
1. $\forall e\in A\left(P\right)$ compute ${O}_{P}\left(e\right)$.
2. Build the “backbone” of the finite state automaton $M=(Q,A,\delta ,{q}_{0},F)$:
 (a) $Q\leftarrow \{{q}_{0},{q}_{1},\cdots ,{q}_{|P|}\}$, $A\leftarrow \{//a:a\in A(P)\}$, $F\leftarrow Q\backslash \left\{{q}_{0}\right\}$, ${q}_{0}\leftarrow 0$
 (b) $\forall i$, where $i\leftarrow 1,2,\cdots ,|P|$:
  i. set state ${q}_{i}\leftarrow {O}_{P}\left(label\left({n}_{i}\right)\right)$,
  ii. add $\delta ({q}_{i-1},//label\left({n}_{i}\right))\leftarrow {q}_{i}$,
  iii. ${O}_{P}\left(label\left({n}_{i}\right)\right)\leftarrow ButFirst\left({O}_{P}\left(label\left({n}_{i}\right)\right)\right)$.
3. Insert “additional transitions” into the automaton M: $\forall i\in \{0,1,\cdots ,|P|-1\}$, $\forall a\in A\left(P\right)$:
  i. add $\delta ({q}_{i},//a)\leftarrow {q}_{s}$, if there exists such $s>i$ where $\delta ({q}_{s-1},//a)={q}_{s}\wedge \neg \exists r<s:\delta ({q}_{r-1},//a)={q}_{r}$; $\delta ({q}_{i},//a)\leftarrow \varnothing $ otherwise.

Definition 7 (Set of occurrences of an element label in a string path).
Let $P={n}_{1}{n}_{2}\cdots {n}_{|P|}$ be a string path and e be an element label occurring at several positions in P (i.e., $label\left({n}_{i}\right)=e$ for some i). A set of occurrences of the element label e in P is a totally ordered set ${O}_{P}\left(e\right)=\{o\mid o=id\left({n}_{i}\right)\wedge label\left({n}_{i}\right)=e,\ i=1,2,\cdots ,|P|\}$. The ordering is equal to the ordering of element prefix identifiers as natural numbers.
Definition 8 (ButFirst).
Let P and ${O}_{P}\left(e\right)=\{{o}_{1},{o}_{2},\cdots ,{o}_{|{O}_{P}\left(e\right)|}\}$ be a string path and a set of occurrences of an element label e in the string path P, respectively. Then, we define a function $ButFirst\left({O}_{P}\left(e\right)\right)=\{{o}_{2},\cdots ,{o}_{|{O}_{P}\left(e\right)|}\}$.
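The three steps of Algorithm 3, together with the occurrence sets and ButFirst, can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the reference implementation: a string path is assumed to be a root-to-leaf list of `(node_id, label)` pairs, transition symbols are written as bare labels rather than `//label`, a missing dictionary entry plays the role of the $\varnothing$ transition, and the function names and the sample path are hypothetical.

```python
from collections import defaultdict

def build_subsequence_automaton(path):
    """Build (delta, answers) for one string path.

    path: root-to-leaf list of (node_id, label) pairs (assumed layout).
    delta maps (state, label) -> state; answers maps state -> d-subset.
    """
    # Step 1: O_P(e), the ordered occurrences of every label e in the path.
    occ = defaultdict(list)
    for node_id, label in path:
        occ[label].append(node_id)

    # Step 2: backbone q_0 -> q_1 -> ... -> q_|P|.
    answers = {0: frozenset()}
    delta = {}
    for i, (_, label) in enumerate(path, start=1):
        answers[i] = frozenset(occ[label])   # q_i <- O_P(label(n_i))
        delta[(i - 1, label)] = i            # backbone transition
        occ[label] = occ[label][1:]          # ButFirst

    # Step 3: from q_i, each remaining label leads to its closest later state.
    n = len(path)
    for i in range(n):
        for s in range(i + 2, n + 1):
            # setdefault keeps the first (closest) target and never
            # overwrites a backbone transition.
            delta.setdefault((i, path[s - 1][1]), s)
    return delta, answers

def run(delta, answers, labels):
    """Evaluate the query //l1//l2... given as a list of labels."""
    state = 0
    for label in labels:
        state = delta.get((state, label))
        if state is None:                    # no transition: query rejected
            return frozenset()
    return answers[state]

# Hypothetical string path a(1) b(2) a(3) c(4).
delta, answers = build_subsequence_automaton(
    [(1, "a"), (2, "b"), (3, "a"), (4, "c")]
)
print(sorted(run(delta, answers, ["a"])))        # [1, 3]
print(sorted(run(delta, answers, ["b", "a"])))   # [3]
print(sorted(run(delta, answers, ["c", "a"])))   # []
```

Each query is answered by following one transition per name test, so the lookup cost in this sketch depends only on the query length.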
Theorem 1. Given a string path $P={n}_{1}{n}_{2}\cdots {n}_{|P|}$, Algorithm 3 correctly constructs a deterministic finite state automaton M accepting all $X{P}^{\{//,nametest\}}$ queries of P.
Proof. We will prove the following equivalence: M accepts a string X if and only if X is the $X{P}^{\{//,nametest\}}$ query of the string path P.
If M accepts a string X, then X is the $X{P}^{\{//,nametest\}}$ query of P.
If X is the $X{P}^{\{//,nametest\}}$ query of P then M accepts X. Assume to the contrary that $X=//{x}_{1}//{x}_{2}\cdots //{x}_{|X|}$ over the alphabet $A=\{//a:a\in A(P)\}$ is the $X{P}^{\{//,nametest\}}$ query of P and M does not accept X. If this is the case, either M reads the whole input and terminates in a non-final state or M does not read the whole input. In the first case, terminating in a non-final state means stopping in the initial state, contradicting our assumption that X is nonempty. The second case, reading just part of the input, means there exists a symbol $//{x}_{i}$ such that M has no transition labeled by $//{x}_{i}$ leading from the current state.
However, if the automaton reads some symbol, it always goes from the current state to the closest higher state representing an occurrence of that symbol. During Step ii, in Phase 2, all transitions added lead to the neighbor state, and during Step i, in Phase 3, we choose a suitable higher state ${q}_{s}$ using two conditions. First, the state has to correspond to the correct symbol, which is satisfied by the first condition: there exists such $s>i$ where $\delta ({q}_{s-1},//a)={q}_{s}$. Second, we need to ensure that the state is the closest possible, which is satisfied by the second condition: $\neg \exists r<s:\delta ({q}_{r-1},//a)={q}_{r}$. Therefore, no subsequence is missed.
Thus, M reads ${x}_{1}\cdots {x}_{i-1}$, and the current state of M is ${q}_{j}$. Due to Steps 2 and 3, there exists no transition from state ${q}_{a}$ to a state ${q}_{b}$ where $a\ge b$ (i.e., to the “left”); therefore, $j\ge i-1$. Because of (2b)i., each state ${q}_{k}$ of M corresponds to a node ${n}_{k}$ of P. Because of (2b)ii. and 3i., there exists a transition from ${q}_{j}$ for ${x}_{i}$ such that ${x}_{i}$ occurs right of ${x}_{i-1}$, as every transition leads from ${q}_{j}$ to the state with the incoming transition labeled with ${x}_{i}$ (the non-existent part of 3i.). Therefore, ${x}_{1}\cdots {x}_{i}$ is not the $X{P}^{\{//,nametest\}}$ query of P, which is a contradiction. ☐
We can run all subsequence automata “in parallel” using the product construction (see Section 3.1) and obtain the index for all $X{P}^{\{//,nametest\}}$ queries of the particular XML document; see Algorithm 4 and Figure 5. Figure 5 illustrates TSPSA constructed by Algorithm 4 for the XML document D and its XML tree model $T\left(D\right)$ from Example 5 and Figure 1, respectively.
The searching phase of TSPSA evaluates input queries in the same way as TSPA. Again, the answer for the input query is given by the d-subset contained in the terminal state.
Algorithm 4: Construction of TSPSA for an XML document D.
Data: String paths set $P\left(T\right)=\{{P}_{1},{P}_{2},\cdots ,{P}_{k}\}$ of XML tree model $T\left(D\right)$ with k leaves.
Result: DFSA $M=(Q,A(D),\delta ,0,F)$ accepting all $X{P}^{\{//,nametest\}}$ queries of the XML document D.
1. For all ${P}_{i}\in P\left(T\right)$, construct a finite state automaton ${M}_{i}=({Q}_{i},\{//a:a\in A\left({P}_{i}\right)\},{\delta}_{i},0,{F}_{i})$ accepting all $X{P}^{\{//,nametest\}}$ queries of ${P}_{i}$ using Algorithm 3.
2. Construct the deterministic tree string path subsequences automaton $M=(Q,\{//a:a\in A\left(D\right)\},\delta ,0,Q\backslash \{0\})$ accepting all $X{P}^{\{//,nametest\}}$ queries of the XML document D using the product construction (union).
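As a rough illustration of Algorithm 4, the sketch below builds the product (union) automaton over reachable state tuples only. It rests on several assumptions that are not taken from the paper: string paths are lists of `(node_id, label)` pairs, `None` stands for the sink state of a component automaton, the d-subset of a product state is taken to be the union of the component d-subsets, and `subsequence_automaton` is a compact stand-in for Algorithm 3 rather than a literal transcription. The two-leaf tree is hypothetical.

```python
def subsequence_automaton(path):
    """Compact stand-in for Algorithm 3: returns (delta, answers)."""
    n = len(path)
    delta, answers = {}, {0: frozenset()}
    for i in range(n + 1):
        if i > 0:
            label = path[i - 1][1]
            # d-subset: occurrences of this label at position i or later.
            answers[i] = frozenset(
                nid for nid, lab in path[i - 1:] if lab == label
            )
        for s in range(i + 1, n + 1):
            delta.setdefault((i, path[s - 1][1]), s)  # closest later state
    return delta, answers

def build_tspsa(paths):
    """Product construction (union) restricted to reachable state tuples."""
    automata = [subsequence_automaton(p) for p in paths]
    alphabet = sorted({lab for p in paths for _, lab in p})
    start = tuple(0 for _ in automata)
    delta, answers = {}, {start: frozenset()}
    todo = [start]
    while todo:
        state = todo.pop()
        for a in alphabet:
            target = tuple(
                None if q is None else d.get((q, a))
                for q, (d, _) in zip(state, automata)
            )
            if all(q is None for q in target):
                continue                      # every component is in its sink
            delta[(state, a)] = target
            if target not in answers:
                answers[target] = frozenset().union(
                    *(ans[q] for q, (_, ans) in zip(target, automata)
                      if q is not None)
                )
                todo.append(target)
    return start, delta, answers

def query(start, delta, answers, labels):
    state = start
    for a in labels:
        state = delta.get((state, a))
        if state is None:
            return frozenset()
    return answers[state]

# Hypothetical tree: a(1) with children b(2) -> c(3) and c(4).
paths = [[(1, "a"), (2, "b"), (3, "c")], [(1, "a"), (4, "c")]]
start, delta, answers = build_tspsa(paths)
print(sorted(query(start, delta, answers, ["a", "c"])))  # [3, 4]
print(sorted(query(start, delta, answers, ["b", "c"])))  # [3]
```

Exploring only reachable tuples is a design choice of this sketch; it avoids materializing the full Cartesian product of the component state sets.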

5. Discussion of Time and Space Complexities
5.2. Tree String Path Subsequences Automaton
TSPSA efficiently supports the evaluation of all $X{P}^{\{//,nametest\}}$ queries of an XML document D. The run time for a query of length m is $\mathcal{O}\left(m\right)$ and does not depend on the size of the document D. Again, considering also the answering phase, the whole input query Q is evaluated in time $\mathcal{O}(m+k)$, where k is the number of nodes in the XML document D satisfying the query Q. In practice, the number of such nodes is expected to be much smaller than the size of the XML document.
The number of linear XPath queries using the //-axis only is exponential in the number of nodes of the XML tree model. For example, consider just a linear XML tree model T with n nodes. The number of $X{P}^{\{//,nametest\}}$ queries is $\mathcal{O}\left({2}^{n}\right)$, which is determined by the following deduction: There are $\left(\genfrac{}{}{0pt}{}{n}{i}\right)$ combinations of i elements $(1\le i\le n)$. Therefore, the exact number of all possible $X{P}^{\{//,nametest\}}$ queries is given by the following formula:
$\sum_{i=1}^{n}\left(\genfrac{}{}{0pt}{}{n}{i}\right)={2}^{n}-1$
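The count can be checked by brute force for small n. The sketch below is only illustrative: the function name and the sample labels are hypothetical, and a linear tree is assumed to be given simply as its root-to-leaf list of labels. It also shows that the maximum is reached only when all labels are distinct, since repeated labels make different position choices yield the same query.

```python
from itertools import combinations
from math import comb

def count_descendant_queries(labels):
    """Count distinct nonempty //-only queries of a linear tree."""
    queries = set()
    for i in range(1, len(labels) + 1):
        # Every choice of i positions, kept in document order, is a query.
        for chosen in combinations(labels, i):
            queries.add("".join("//" + a for a in chosen))
    return len(queries)

n = 4
distinct = count_descendant_queries(["a", "b", "c", "d"])
print(distinct, sum(comb(n, i) for i in range(1, n + 1)), 2**n - 1)

# With a repeated label, //a can be matched by either node, so only
# the queries //a and //a//a remain.
print(count_descendant_queries(["a", "a"]))
```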
Each state of TSPSA corresponds to an answer of a single query or a collection of queries. Although the number of different queries accepted by TSPSA is exponential, usually many queries are equivalent (i.e., their result sets of elements are equal). Therefore, the equivalence problem of queries is closely related to the problem of the determination of the number of states of TSPSA. That is, if we know the number of unique query answers, we can construct a deterministic automaton answering all queries using exactly this number of states. On the other hand, we can obviously use the TSPSA to decide the equivalence of two queries and even determine equivalence classes.
From another point of view, we can examine the number of states of a TSPSA as the size of a DASG for a set of strings (see [40,41]). For k strings of length h, the number of states can be trivially bounded by $\mathcal{O}\left({h}^{k}\right)$, i.e., the size of a product of k automata with $\mathcal{O}\left(h\right)$ states. Therefore, the number of transitions of TSPSA is bounded by $\mathcal{O}\left(|A\left(D\right)|\cdot {h}^{k}\right)$. The lower bound for $k>2$ strings is not known, while Crochemore and Troníček in [38] showed that $\Omega \left({h}^{2}\right)$ states are required for $k=2$ in the worst case.
However, considering the XML index problem, the set of strings is rather specific. Thanks to the branching tree structure, we can expect common prefixes in the set of strings, i.e., fewer states (and transitions) in the resulting automaton. In the context of the XML index problem, k is the number of leaves in an XML tree model, and h is its height.
When space is more crucial, we do not need to combine the subsequence automata; we can just traverse them simultaneously. Finally, we return the union of the resulting d-subsets of the automata that accept the input query as the answer. Given a query of length m, this approach works in $\mathcal{O}(k\cdot m)$ time and $\mathcal{O}(h\cdot k)$ space. For parallel systems, each subsequence automaton can be handled by a different computing node.
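The space-saving variant can be sketched as follows: the k subsequence automata are kept separate, the query is run on each of them, and the d-subsets of the accepting runs are united. As before, this is a sketch under assumptions: string paths are lists of `(node_id, label)` pairs, `subsequence_automaton` is a compact stand-in for Algorithm 3, and the function names and the sample tree are hypothetical.

```python
def subsequence_automaton(path):
    """Compact stand-in for Algorithm 3: returns (delta, answers)."""
    n = len(path)
    delta, answers = {}, {0: frozenset()}
    for i in range(n + 1):
        if i > 0:
            label = path[i - 1][1]
            answers[i] = frozenset(
                nid for nid, lab in path[i - 1:] if lab == label
            )
        for s in range(i + 1, n + 1):
            delta.setdefault((i, path[s - 1][1]), s)  # closest later state
    return delta, answers

def evaluate_without_product(automata, labels):
    """Run the query on every automaton; unite the accepting d-subsets."""
    result = set()
    for delta, answers in automata:          # k automata ...
        state = 0
        for a in labels:                     # ... m steps each
            state = delta.get((state, a))
            if state is None:
                break                        # this path rejects the query
        else:
            result |= answers[state]
    return result

# Hypothetical tree: a(1) with children b(2) -> c(3) and c(4).
paths = [[(1, "a"), (2, "b"), (3, "c")], [(1, "a"), (4, "c")]]
automata = [subsequence_automaton(p) for p in paths]
print(sorted(evaluate_without_product(automata, ["a", "c"])))  # [3, 4]
print(sorted(evaluate_without_product(automata, ["b", "c"])))  # [3]
```

Because the runs are independent, the outer loop is also the natural unit to distribute across computing nodes.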
For a common XML document (XML with the level property, or l-property), in which nodes with the same label can only appear at the same level of the XML tree model, the asymptotic upper bound of the space complexity is $\mathcal{O}(h\cdot {2}^{k})$. The necessary definitions and formal proof follow.
Definition 9 (Level property).
Let $T=(N,E)$ be a labeled directed rooted tree. Level property (l-property):
Definition 10 (State level).
Let $M=(Q,A,\delta ,{q}_{0},F)$ be an acyclic deterministic finite state automaton. The state level s of a state q is the maximal number of transitions on a path leading from the initial state ${q}_{0}$ to q.
Theorem 4. Let D be an XML document and $T\left(D\right)$ be its XML tree model satisfying the l-property with height h and k leaves. The number of states of deterministic TSPSA constructed for the XML document D by Algorithm 4 is $\mathcal{O}(h\cdot {2}^{k})$.
Proof. There are k string paths in $T\left(D\right)$, for which we construct a set S of k deterministic subsequence automata of no more than h states each (due to the l-property). We can run all automata “in parallel”, by remembering the states of all automata by constructing k-tuples q while reading the input. This is achieved by the product construction. This way we construct TSPSA M for T.
Due to the l-property of T, it holds that: The target state of a transition labeled with $a\in A\left(D\right)$ is either a sink state or its state level is the same in each automaton in S. Hence, the k-tuples $({q}_{1},{q}_{2},\cdots ,{q}_{k})$ are restricted as follows: If the state level of ${q}_{1}$ is s, then each of ${q}_{2},\cdots ,{q}_{k}$ is either a sink state or of state level s. If ${q}_{1}$ is a sink state, then ${q}_{2}$ is arbitrary, but each of ${q}_{3},\cdots ,{q}_{k}$ is either a sink state or of the same state level as ${q}_{2}$. In addition, the k-tuples of Levels 0 and 1 are always $({0}_{1},{0}_{2},\cdots ,{0}_{k})$ and $({1}_{1},{1}_{2},\cdots ,{1}_{k})$, respectively. Therefore, the maximum number of states of M is $2+{2}^{k-1}\cdot (h-1)+{2}^{k-2}\cdot (h-2)$. ☐
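To connect the state count at the end of this proof with the asymptotic bound stated in Theorem 4, the following short derivation may help. The values h = 4 and k = 3 are chosen purely for illustration and do not come from the paper's examples.

```latex
\begin{align*}
2 + 2^{k-1}(h-1) + 2^{k-2}(h-2)
  &\le 2 + h\cdot 2^{k-1} + h\cdot 2^{k-2}
   = 2 + \tfrac{3}{4}\,h\cdot 2^{k}
   \in \mathcal{O}\left(h\cdot 2^{k}\right),\\
h = 4,\ k = 3:\qquad 2 + 2^{2}\cdot 3 + 2^{1}\cdot 2 &= 18 .
\end{align*}
```

For the same values, the trivial product bound ${h}^{k}$ evaluates to 64.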
Theorem 5. Let D be an XML document and $T\left(D\right)$ be its XML tree model satisfying the l-property with height h and k leaves. The number of transitions of deterministic TSPSA constructed for the XML document D by Algorithm 4 is $\mathcal{O}\left(|A\left(D\right)|\cdot h\cdot {2}^{k}\right)$.
Proof. The maximum possible number of transitions leading from each state is $|A\left(D\right)|$. ☐