Article

Optimization of Decision Trees with Hypotheses for Knowledge Representation

1 Department of Computer Science, College of Computer and Information Sciences, Jouf University, Sakaka 72441, Saudi Arabia
2 Intel Corporation, 5000 W Chandler Blvd, Chandler, AZ 85226, USA
3 Computer Science Program, Dhanani School of Science and Engineering, Habib University, Karachi 75290, Pakistan
4 Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
* Author to whom correspondence should be addressed.
Electronics 2021, 10(13), 1580; https://doi.org/10.3390/electronics10131580
Submission received: 26 May 2021 / Revised: 23 June 2021 / Accepted: 28 June 2021 / Published: 30 June 2021
(This article belongs to the Special Issue AI-Based Knowledge Management)

Abstract

In this paper, we consider decision trees that use two types of queries: queries based on one attribute each and queries based on hypotheses about the values of all attributes. Such decision trees are similar to those studied in exact learning, where membership and equivalence queries are allowed. We present dynamic programming algorithms for the minimization of the depth and the number of nodes of the above decision trees and discuss the results of computer experiments on various data sets and randomly generated Boolean functions. Decision trees with hypotheses generally have lower complexity, i.e., they are more understandable and more suitable as a means for knowledge representation.

1. Introduction

Decision trees are used in many areas of computer science as a means for knowledge representation, as classifiers, and as algorithms for various problems of combinatorial optimization, computational geometry, etc. [1,2,3]. They are studied, in particular, in test theory initiated by Chegis and Yablonskii [4], rough set theory initiated by Pawlak [5,6,7], and exact learning initiated by Angluin [8,9]. These theories are closely related: attributes from rough set theory and test theory correspond to membership queries from exact learning. Exact learning additionally studies so-called equivalence queries. The notion of a “minimally adequate teacher” that allows both membership and equivalence queries was discussed by Angluin in Reference [10]. Relations between exact learning and PAC learning proposed by Valiant [11] are discussed in Reference [8].
In this paper, which is an extension of two conference papers [12,13], we add the notion of a hypothesis to the model that has been considered in rough set theory, as well as in test theory. This model allows us to use an analog of equivalence queries. Our goal is to check whether it is possible to reduce the time and space complexity of decision trees if we additionally use hypotheses. Decision trees with lower complexity are more understandable and more suitable as a means for knowledge representation. Note that, to improve understandability, we should try to minimize not only the number of nodes in a decision tree but also its depth, which is an unimprovable upper bound on the number of conditions describing the objects accepted by a path from the root to a terminal node of the tree. In this paper, we concentrate only on the complexity of decision trees and do not study many recent problems considered in machine learning [14,15,16,17].
Let T be a decision table with n conditional attributes f_1, …, f_n having values from the set ω = {0, 1, 2, …}, in which rows are pairwise different, and each row is labeled with a decision from ω. For a given row of T, we should recognize the decision attached to this row. To this end, we can use decision trees based on two types of queries. We can ask about the value of an attribute f_i ∈ {f_1, …, f_n} on the given row. We will obtain an answer of the kind f_i = δ, where δ is the number in the intersection of the given row and the column f_i. We can also ask if a hypothesis f_1 = δ_1, …, f_n = δ_n is true, where δ_1, …, δ_n are numbers from the columns f_1, …, f_n, respectively. Either this hypothesis will be confirmed or we will obtain a counterexample in the form f_i = σ, where f_i ∈ {f_1, …, f_n} and σ is a number from the column f_i different from δ_i. The considered hypothesis is called proper if (δ_1, …, δ_n) is a row of the table T.
In this paper, we study four cost functions that characterize the complexity of decision trees: the depth, the number of realizable nodes relative to T, the number of realizable terminal nodes relative to T, and the number of working nodes. We consider the depth of a decision tree as its time complexity, which is equal to the maximum number of queries in a path from the root to a terminal node of the tree. The remaining three cost functions characterize the space complexity of decision trees. A node is called realizable relative to T if, for a row of T and some choice of counterexamples, the computation in the tree will pass through this node. Note that, in the considered trees, all working nodes are realizable.
Decision trees using hypotheses can be essentially more efficient than decision trees using only attributes. Let us consider an example: the problem of computing the conjunction x_1 ∧ ⋯ ∧ x_n. The minimum depth of a decision tree solving this problem using the attributes x_1, …, x_n is equal to n. The minimum number of realizable nodes in such decision trees is equal to 2n + 1, the minimum number of working nodes is equal to n, and the minimum number of realizable terminal nodes is equal to n + 1. However, the minimum depth of a decision tree solving this problem using proper hypotheses is equal to 1: it is enough to ask only about the hypothesis x_1 = 1, …, x_n = 1. If it is true, then the considered conjunction is equal to 1. Otherwise, it is equal to 0. The obtained decision tree contains one working node and n + 1 realizable terminal nodes, altogether n + 2 realizable nodes.
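This example can be illustrated with a short sketch in code (our own illustration, not part of the paper): it simulates the single proper-hypothesis query for the conjunction; the helper names hypothesis_query and conjunction_by_hypothesis are assumptions made for the illustration.

```python
# A minimal sketch (assumed helper names): computing the conjunction x_1 ∧ ... ∧ x_n
# with one proper-hypothesis query. The oracle either confirms the hypothesis or
# returns a counterexample (i, value) with a value different from the hypothesized one.

def hypothesis_query(row, hypothesis):
    """Return None if the hypothesis matches the row, else a counterexample (i, row[i])."""
    for i, (delta, actual) in enumerate(zip(hypothesis, row)):
        if actual != delta:
            return (i, actual)      # counterexample f_i = actual
    return None                     # hypothesis confirmed

def conjunction_by_hypothesis(row):
    """Depth-1 decision tree: ask the proper hypothesis x_1 = 1, ..., x_n = 1."""
    answer = hypothesis_query(row, (1,) * len(row))
    return 1 if answer is None else 0

assert conjunction_by_hypothesis((1, 1, 1)) == 1
assert conjunction_by_hypothesis((1, 0, 1)) == 0
```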
We study the following five types of decision trees:
  • Decision trees that use only attributes.
  • Decision trees that use only hypotheses.
  • Decision trees that use both attributes and hypotheses.
  • Decision trees that use only proper hypotheses.
  • Decision trees that use both attributes and proper hypotheses.
For each cost function, we propose a dynamic programming algorithm that, for a given decision table and given type of decision trees, finds the minimum cost of a decision tree of the considered type for this table. Note that dynamic programming algorithms for the optimization of decision trees of the type 1 were studied in Reference [18] for decision tables with one-valued decisions and in Reference [19] for decision tables with many-valued decisions. The dynamic programming algorithms for the optimization of decision trees of all five types were studied in References [12,13] for the depth and for the number of realizable nodes.
It is interesting to consider not only specially chosen examples such as the conjunction of n variables. For each cost function, we compute the minimum cost of a decision tree of each of the considered five types for eight decision tables from the UCI ML Repository [20]. We do the same for randomly generated Boolean functions with n variables, where n = 3, …, 6.
From the obtained experimental results, it follows that, generally, the decision trees of the types 3 and 5 have less complexity than the decision trees of the type 1. Therefore, such decision trees can be useful as a means for knowledge representation. Decision trees of the types 2 and 4 have, generally, too many nodes.
Based on the experimental results, we formulate and prove the following hypothesis: for any decision table, we can construct a decision tree with the minimum number of realizable terminal nodes using only attributes.
The motivation for this work is related to the use of decision trees to represent knowledge: we try to reduce the complexity of decision trees (and improve their understandability) by using hypotheses. The main achievements of the work are the following: (i) we have proposed dynamic programming algorithms for optimizing five types of decision trees relative to four cost functions, and (ii) we have shown cases in which the use of hypotheses decreases the complexity of decision trees.
The rest of the paper is organized as follows. In Section 2 and Section 3, we consider the main notions. In Section 4, Section 5, Section 6, Section 7 and Section 8, we describe dynamic programming algorithms for decision tree optimization. In Section 9, we prove the above hypothesis. Section 10 contains the results of computer experiments, and Section 11 gives short conclusions.

2. Decision Tables

A decision table is a table T with n ≥ 1 columns filled with numbers from the set ω = {0, 1, 2, …}. Columns of this table are labeled with conditional attributes f_1, …, f_n. Rows of the table are pairwise different. Each row is labeled with a number from ω that is interpreted as a decision. Rows of the table are interpreted as tuples of values of the conditional attributes.
Each decision table can be represented by a word (sequence) over the alphabet {0, 1, ;, |}: numbers from ω are written in binary representation, the symbol ";" is used to separate two numbers from ω, and the symbol "|" is used to separate two rows (for each row, we add the corresponding decision as the last number in the row). The length of this word is called the size of the decision table.
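As an illustration of this encoding, the following sketch computes the size of a decision table under an assumed representation (a list of (row, decision) pairs with non-negative integer entries); the helper table_size is our own name.

```python
# A sketch of the word encoding and of the size of a decision table, assuming the
# table is given as a list of (row, decision) pairs with non-negative integer entries.

def table_size(table):
    """Numbers in binary, ';' between numbers of a row, '|' between rows; size = length."""
    word = "|".join(
        ";".join(format(x, "b") for x in list(row) + [decision])
        for row, decision in table
    )
    return len(word)

print(table_size([((0, 1), 1), ((2, 3), 0)]))   # "0;1;1|10;11;0" -> 13
```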
A decision table T is called empty if it has no rows. The table T is called degenerate if it is empty or all rows of T are labeled with the same decision.
We denote F(T) = {f_1, …, f_n} and denote by D(T) the set of decisions attached to the rows of T. For any conditional attribute f_i ∈ F(T), we denote by E(T, f_i) the set of values of the attribute f_i in the table T. We denote by E(T) the set of conditional attributes f_i of T for which |E(T, f_i)| ≥ 2.
A system of equations over T is an arbitrary equation system of the kind
{f_{i_1} = δ_1, …, f_{i_m} = δ_m},
where m ∈ ω, f_{i_1}, …, f_{i_m} ∈ F(T), and δ_1 ∈ E(T, f_{i_1}), …, δ_m ∈ E(T, f_{i_m}) (if m = 0, then the considered equation system is empty).
Let T be a nonempty table. A subtable of T is a table obtained from T by the removal of some rows. We correspond to each equation system S over T a subtable TS of the table T. If the system S is empty, then TS = T. Let S be nonempty and S = {f_{i_1} = δ_1, …, f_{i_m} = δ_m}. Then, TS is the subtable of the table T containing the rows from T that, in the intersection with the columns f_{i_1}, …, f_{i_m}, have the numbers δ_1, …, δ_m, respectively. Such nonempty subtables, including the table T, are called separable subtables of T. We denote by SEP(T) the set of separable subtables of the table T.
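The following sketch illustrates how a subtable TS can be obtained, assuming a decision table represented simply as a list of rows (tuples over the attributes); the helper apply_system and the example table are our own illustration.

```python
# A sketch with an assumed representation: a decision table T as a list of rows
# (tuples over the attributes f_1, ..., f_n, indexed from 0); applying an equation
# system S = {f_{i_1} = d_1, ..., f_{i_m} = d_m} keeps the rows satisfying all equations.

def apply_system(rows, system):
    """system is a dict {attribute_index: value}; the result is the subtable TS."""
    return [r for r in rows if all(r[i] == v for i, v in system.items())]

T = [(0, 0, 1), (0, 1, 0), (1, 1, 1)]
print(apply_system(T, {}))            # the empty system: TS = T
print(apply_system(T, {0: 0}))        # T{f_1 = 0} -> [(0, 0, 1), (0, 1, 0)]
print(apply_system(T, {0: 0, 1: 1}))  # T{f_1 = 0, f_2 = 1} -> [(0, 1, 0)]
```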

3. Decision Trees

Let T be a nonempty decision table with n conditional attributes f_1, …, f_n. We consider decision trees with two types of queries. We can choose an attribute f_i ∈ F(T) = {f_1, …, f_n} and ask about its value. This query has the set of answers A(f_i) = {{f_i = δ} : δ ∈ E(T, f_i)}. We can formulate a hypothesis over T in the form H = {f_1 = δ_1, …, f_n = δ_n}, where δ_1 ∈ E(T, f_1), …, δ_n ∈ E(T, f_n), and ask about this hypothesis. This query has the set of answers A(H) = {H, {f_1 = σ_1}, …, {f_n = σ_n} : σ_1 ∈ E(T, f_1) \ {δ_1}, …, σ_n ∈ E(T, f_n) \ {δ_n}}. The answer H means that the hypothesis is true. Other answers are counterexamples. The hypothesis H is called proper for T if (δ_1, …, δ_n) is a row of the table T.
A decision tree over T is a marked finite directed tree with the root in which:
  • Each terminal node is labeled with a number from the set D(T) ∪ {0}.
  • Each node that is not terminal (such nodes are called working) is labeled with an attribute from the set F(T) or with a hypothesis over T.
  • If a working node is labeled with an attribute f_i from F(T), then, for each answer from the set A(f_i), there is exactly one edge labeled with this answer that leaves this node, and there are no other edges leaving this node.
  • If a working node is labeled with a hypothesis H = {f_1 = δ_1, …, f_n = δ_n} over T, then, for each answer from the set A(H), there is exactly one edge labeled with this answer that leaves this node, and there are no other edges leaving this node.
Let Γ be a decision tree over T and let v be a node of Γ. We now define an equation system S(Γ, v) over T associated with the node v. We denote by ξ the directed path from the root of Γ to the node v. If there are no working nodes in ξ, then S(Γ, v) is the empty system. Otherwise, S(Γ, v) is the union of the equation systems attached to the edges of the path ξ.
A decision tree Γ over T is called a decision tree for T if, for any node v of Γ:
  • The node v is terminal if and only if the subtable TS(Γ, v) is degenerate.
  • If v is a terminal node and the subtable TS(Γ, v) is empty, then the node v is labeled with the decision 0.
  • If v is a terminal node and the subtable TS(Γ, v) is nonempty, then the node v is labeled with the decision attached to all rows of TS(Γ, v).
A complete path in Γ is an arbitrary directed path from the root to a terminal node in Γ. As the time complexity of a decision tree, we consider its depth, which is the maximum number of working nodes in a complete path in the tree or, equivalently, the maximum length of a complete path in the tree. We denote by h(Γ) the depth of a decision tree Γ.
As the space complexity of the decision tree Γ, we consider the number of its nodes realizable relative to T. A node v of Γ is called realizable relative to T if and only if the subtable TS(Γ, v) is nonempty. We denote by L(T, Γ) the number of nodes in Γ that are realizable relative to T. We also consider two more cost functions related to the space complexity: L_t(T, Γ), the number of terminal nodes in Γ that are realizable relative to T, and L_w(T, Γ), the number of working nodes in Γ. Note that all working nodes of Γ are realizable relative to T.
We will use the following notation:
  • For k = 1, …, 5, h^(k)(T) is the minimum depth of a decision tree of the type k for T.
  • For k = 1, …, 5, L^(k)(T) is the minimum number of nodes realizable relative to T in a decision tree of the type k for T.
  • For k = 1, …, 5, L_t^(k)(T) is the minimum number of terminal nodes realizable relative to T in a decision tree of the type k for T.
  • For k = 1, …, 5, L_w^(k)(T) is the minimum number of working nodes in a decision tree of the type k for T.

4. Construction of the Directed Acyclic Graph Δ(T)

Let T be a nonempty decision table with n conditional attributes f_1, …, f_n. We now describe an algorithm A_DAG for the construction of a directed acyclic graph (DAG) Δ(T) that will be used for the study of decision trees. The nodes of this graph are separable subtables of the table T. During each iteration, we process one node. We start with the graph that consists of one node T, which is not processed, and finish when all nodes of the graph are processed. This algorithm can be considered as a special case of the algorithm for DAG construction considered in Reference [18].
Algorithm A_DAG (construction of the DAG Δ(T)).
Input: A nonempty decision table T with n conditional attributes f_1, …, f_n.
Output: The directed acyclic graph Δ(T).
  1. Construct the graph that consists of one node T, which is not marked as processed.
  2. If all nodes of the graph are processed, then the algorithm halts and returns the resulting graph as Δ(T). Otherwise, choose a node (table) Θ that has not been processed yet.
  3. If Θ is degenerate, then mark the node Θ as processed and proceed to step 2.
  4. If Θ is not degenerate, then, for each f_i ∈ E(Θ), draw a bundle of edges from the node Θ. Let E(Θ, f_i) = {a_1, …, a_k}. Then, draw k edges from Θ and label these edges with the systems of equations {f_i = a_1}, …, {f_i = a_k}. These edges enter the nodes Θ{f_i = a_1}, …, Θ{f_i = a_k}, respectively. If some of the nodes Θ{f_i = a_1}, …, Θ{f_i = a_k} are not present in the graph, then add these nodes to the graph. Mark the node Θ as processed and return to step 2.
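A minimal sketch of this construction is given below. It assumes a decision table represented as a list of (row, decision) pairs and stores the DAG as a dictionary mapping each node (a frozenset of such pairs) to its labeled outgoing edges; the names are ours and the sketch is not the authors' implementation.

```python
# A sketch of the algorithm A_DAG under an assumed representation: a decision table is a
# list of (row, decision) pairs; nodes of the DAG are separable subtables (stored as
# frozensets of such pairs), and each nondegenerate node Theta gets, for every
# attribute f_i in E(Theta), a bundle of edges labeled {f_i = a} entering Theta{f_i = a}.

def values(table, i):
    return {row[i] for row, _ in table}

def is_degenerate(table):
    return len({d for _, d in table}) <= 1

def build_dag(table):
    """Return a dict {node: {(i, a): child}} describing the DAG Delta(T)."""
    root = frozenset(table)
    dag, pending = {}, [root]
    while pending:
        theta = pending.pop()
        if theta in dag:                       # already processed
            continue
        dag[theta] = {}
        if is_degenerate(theta):               # terminal node of the DAG
            continue
        n = len(next(iter(theta))[0])
        for i in range(n):
            if len(values(theta, i)) < 2:      # f_i is not in E(Theta)
                continue
            for a in values(theta, i):
                child = frozenset((r, d) for r, d in theta if r[i] == a)
                dag[theta][(i, a)] = child
                pending.append(child)
    return dag

T = [((0, 0), 0), ((0, 1), 1), ((1, 1), 0)]
print(len(build_dag(T)))   # number of nodes of Delta(T), i.e., separable subtables reached from T
```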
The following statement about the time complexity of the algorithm A_DAG follows immediately from Proposition 3.3 of [18].
Proposition 1.
The time complexity of the algorithm A_DAG is bounded from above by a polynomial on the size of the input table T and the number |SEP(T)| of different separable subtables of T.
Generally, the time complexity of the algorithm A_DAG is exponential in the size of the input decision table. Note that, in Section 3.4 of the book [18], classes of decision tables are described for each of which the number of separable subtables of the decision tables from the class is bounded from above by a polynomial on the number of columns in the tables. For each of these classes, the time complexity of the algorithm A_DAG is polynomial in the size of the input decision tables.
Note that similar results can be obtained for the space complexity of the considered algorithm.

5. Minimizing the Depth

In this section, we consider some results obtained in Reference [12]. Let T be a nonempty decision table with n conditional attributes f_1, …, f_n. We can use the DAG Δ(T) to compute the values h^(1)(T), …, h^(5)(T). Let k ∈ {1, …, 5}. To find the value h^(k)(T), for each node Θ of the DAG Δ(T), we compute the value h^(k)(Θ). It will be convenient for us to consider not only subtables that are nodes of Δ(T) but also the empty subtable Λ of T and the subtables T_r that contain only one row r of T and are not nodes of Δ(T). We begin with these special subtables and the terminal nodes of Δ(T) (nodes without leaving edges), which are degenerate separable subtables of T, and step by step move to the table T.
Let Θ be a terminal node of Δ(T) or Θ = T_r for some row r of T. Then, h^(k)(Θ) = 0: the decision tree that contains only one node labeled with the decision attached to all rows of Θ is a decision tree for Θ. If Θ = Λ, then h^(k)(Θ) = 0: the decision tree that contains only one node labeled with 0 will be considered as a decision tree for Λ.
Let Θ be a nonterminal node of Δ(T) such that, for each child Θ′ of Θ, we already know the value h^(k)(Θ′). Based on this information, we can find the minimum depth of a decision tree for Θ that uses decision trees of the type k for the subtables corresponding to the children of the root and in which the root is labeled:
  • With an attribute from F(T) (we denote by h_a^(k)(Θ) the minimum depth of such a decision tree).
  • With a hypothesis over T (we denote by h_h^(k)(Θ) the minimum depth of such a decision tree).
  • With a proper hypothesis over T (we denote by h_p^(k)(Θ) the minimum depth of such a decision tree).
Since Θ is nondegenerate, the set E(Θ) is nonempty. We now describe three procedures for computing the values h_a^(k)(Θ), h_h^(k)(Θ), and h_p^(k)(Θ), respectively.
Let us consider a decision tree Γ(f_i) for Θ in which the root is labeled with an attribute f_i ∈ E(Θ). For each δ ∈ E(T, f_i), there is an edge that leaves the root and enters a node v(δ). This edge is labeled with the equation system {f_i = δ}. The node v(δ) is the root of a decision tree of the type k for Θ{f_i = δ} whose depth is equal to h^(k)(Θ{f_i = δ}). It is clear that
h(Γ(f_i)) = 1 + max{h^(k)(Θ{f_i = δ}) : δ ∈ E(T, f_i)}.
Since h^(k)(Θ{f_i = δ}) = h^(k)(Λ) = 0 for any δ ∈ E(T, f_i) \ E(Θ, f_i),
h(Γ(f_i)) = 1 + max{h^(k)(Θ{f_i = δ}) : δ ∈ E(Θ, f_i)}.   (1)
Evidently, for any δ ∈ E(Θ, f_i), the subtable Θ{f_i = δ} is a child of Θ in the DAG Δ(T), i.e., we know the value h^(k)(Θ{f_i = δ}).
One can show that h(Γ(f_i)) is the minimum depth of a decision tree for Θ in which the root is labeled with the attribute f_i and which uses decision trees of the type k for the subtables corresponding to the children of the root.
We should not consider attributes f_i ∈ F(T) \ E(Θ) since, for each such attribute, there is δ ∈ E(T, f_i) with Θ{f_i = δ} = Θ, i.e., based on this attribute, we cannot construct an optimal decision tree for Θ. As a result, we have
h_a^(k)(Θ) = min{h(Γ(f_i)) : f_i ∈ E(Θ)}.   (2)
Computation of h_a^(k)(Θ). Construct the set of attributes E(Θ). For each attribute f_i ∈ E(Θ), compute the value h(Γ(f_i)) using (1). Compute the value h_a^(k)(Θ) using (2).
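The following sketch implements formulas (1) and (2) under assumed names: child_depth[(i, a)] stands for h^(k)(Θ{f_i = a}) for the children of Θ, and e_theta maps each attribute index from E(Θ) to its value set E(Θ, f_i).

```python
# A sketch of formulas (1) and (2) with assumed names: child_depth[(i, a)] holds
# h^(k)(Theta{f_i = a}) for every a in E(Theta, f_i), and e_theta maps each
# attribute index in E(Theta) to its value set E(Theta, f_i).

def depth_with_attribute(child_depth, i, values_of_i):
    """Formula (1): h(Gamma(f_i)) = 1 + max over a in E(Theta, f_i) of h^(k)(Theta{f_i = a})."""
    return 1 + max(child_depth[(i, a)] for a in values_of_i)

def h_a(child_depth, e_theta):
    """Formula (2): h_a^(k)(Theta) = min over f_i in E(Theta) of h(Gamma(f_i))."""
    return min(depth_with_attribute(child_depth, i, vals) for i, vals in e_theta.items())

e_theta = {0: {0, 1}, 2: {0, 1, 2}}
child_depth = {(0, 0): 2, (0, 1): 1, (2, 0): 0, (2, 1): 1, (2, 2): 0}
print(h_a(child_depth, e_theta))   # min(1 + max(2, 1), 1 + max(0, 1, 0)) = 2
```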
Remark 2.
Let Θ be a nonterminal node of the DAG Δ(T) such that, for each child Θ′ of Θ, we already know the value h^(k)(Θ′). Then, the procedure of computation of the value h_a^(k)(Θ) has polynomial time complexity depending on the size of the decision table T.
A hypothesis H = {f_1 = δ_1, …, f_n = δ_n} over T is called admissible for Θ and an attribute f_i ∈ F(T) = {f_1, …, f_n} if, for any σ ∈ E(T, f_i) \ {δ_i}, Θ{f_i = σ} ≠ Θ. The hypothesis H is not admissible for Θ and an attribute f_i ∈ F(T) if and only if |E(Θ, f_i)| = 1 and δ_i ∉ E(Θ, f_i). The hypothesis H is called admissible for Θ if it is admissible for Θ and any attribute f_i ∈ F(T).
Let us consider a decision tree Γ(H) for Θ in which the root is labeled with a hypothesis H = {f_1 = δ_1, …, f_n = δ_n} admissible for Θ. The set of answers for the query corresponding to the hypothesis H is equal to A(H) = {H, {f_1 = σ_1}, …, {f_n = σ_n} : σ_1 ∈ E(T, f_1) \ {δ_1}, …, σ_n ∈ E(T, f_n) \ {δ_n}}. For each S ∈ A(H), there is an edge that leaves the root of Γ(H) and enters a node v(S). This edge is labeled with the equation system S. The node v(S) is the root of a decision tree of the type k for ΘS whose depth is equal to h^(k)(ΘS). It is clear that
h(Γ(H)) = 1 + max{h^(k)(ΘS) : S ∈ A(H)}.
We have ΘH = Λ or ΘH = T_r for some row r of T. Therefore, h^(k)(ΘH) = 0. Since H is admissible for Θ, E(Θ, f_i) \ {δ_i} = ∅ for any attribute f_i ∈ F(T) \ E(Θ). It is clear that Θ{f_i = σ} = Λ and h^(k)(Θ{f_i = σ}) = 0 for any attribute f_i ∈ E(Θ) and any σ ∈ E(T, f_i) \ {δ_i} such that σ ∉ E(Θ, f_i). Therefore,
h(Γ(H)) = 1 + max{h^(k)(Θ{f_i = σ}) : f_i ∈ E(Θ), σ ∈ E(Θ, f_i) \ {δ_i}}.   (3)
It is clear that, for any f_i ∈ E(Θ) and any σ ∈ E(Θ, f_i) \ {δ_i}, the subtable Θ{f_i = σ} is a child of Θ in the DAG Δ(T), i.e., we know the value h^(k)(Θ{f_i = σ}).
One can show that h(Γ(H)) is the minimum depth of a decision tree for Θ in which the root is labeled with the hypothesis H and which uses decision trees of the type k for the subtables corresponding to the children of the root.
We should not consider hypotheses that are not admissible for Θ since, for each such hypothesis H, there is an answer S ∈ A(H) to the corresponding query with ΘS = Θ, i.e., based on this hypothesis, we cannot construct an optimal decision tree for Θ.
Computation of h_h^(k)(Θ). First, we construct a hypothesis
H_Θ = {f_1 = δ_1, …, f_n = δ_n}
for Θ. Let f_i ∈ F(T) \ E(Θ). Then, δ_i is equal to the only number in the set E(Θ, f_i). Let f_i ∈ E(Θ). Then, δ_i is the minimum number from E(Θ, f_i) for which h^(k)(Θ{f_i = δ_i}) = max{h^(k)(Θ{f_i = σ}) : σ ∈ E(Θ, f_i)}. It is clear that H_Θ is admissible for Θ. Compute the value h(Γ(H_Θ)) using (3). A simple analysis of (3) shows that h(Γ(H_Θ)) = h_h^(k)(Θ).
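A sketch of this construction is given below, again with assumed names: child_depth[(i, σ)] stores h^(k)(Θ{f_i = σ}) for σ ∈ E(Θ, f_i), and single_value gives the only value of each constant attribute (an attribute from F(T) \ E(Θ)).

```python
# A sketch of the construction of H_Theta and of formula (3), with assumed names:
# child_depth[(i, s)] holds h^(k)(Theta{f_i = s}) for s in E(Theta, f_i), and
# single_value[i] is the only value of a constant attribute f_i (f_i not in E(Theta)).

def build_h_theta(child_depth, e_theta, single_value):
    delta = dict(single_value)                    # constant attributes keep their value
    for i, vals in e_theta.items():               # for f_i in E(Theta): the minimum value
        worst = max(child_depth[(i, s)] for s in vals)   # attaining the maximum h^(k)
        delta[i] = min(s for s in vals if child_depth[(i, s)] == worst)
    return delta

def h_h(child_depth, e_theta, delta):
    """Formula (3): 1 + max of h^(k)(Theta{f_i = s}) over f_i in E(Theta) and s != delta_i."""
    worst = 0
    for i, vals in e_theta.items():
        for s in vals:
            if s != delta[i]:
                worst = max(worst, child_depth[(i, s)])
    return 1 + worst

e_theta = {0: {0, 1}, 1: {0, 1, 2}}
child_depth = {(0, 0): 2, (0, 1): 1, (1, 0): 1, (1, 1): 1, (1, 2): 0}
delta = build_h_theta(child_depth, e_theta, {})
print(delta, h_h(child_depth, e_theta, delta))   # {0: 0, 1: 0} and 1 + max(1, 1, 0) = 2
```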
Remark 3.
Let Θ be a nonterminal node of the DAG Δ(T) such that, for each child Θ′ of Θ, we already know the value h^(k)(Θ′). Then, the procedure of computation of the value h_h^(k)(Θ) has polynomial time complexity depending on the size of the decision table T.
Computation of h_p^(k)(Θ). For each row r = (δ_1, …, δ_n) of the decision table T, we check if the corresponding proper hypothesis H_r = {f_1 = δ_1, …, f_n = δ_n} is admissible for Θ. For each proper hypothesis H_r admissible for Θ, we compute the value h(Γ(H_r)) using (3). One can show that the minimum among the obtained numbers is equal to h_p^(k)(Θ).
Remark 4.
Let Θ be a nonterminal node of the DAG Δ(T) such that, for each child Θ′ of Θ, we already know the value h^(k)(Θ′). Then, the procedure of computation of the value h_p^(k)(Θ) has polynomial time complexity depending on the size of the decision table T.
We describe an algorithm A_h that, for a given nonempty decision table T and a given k ∈ {1, …, 5}, calculates the value h^(k)(T), which is the minimum depth of a decision tree of the type k for the table T. During the work of this algorithm, we find, for each node Θ of the DAG Δ(T), the value h^(k)(Θ).
Algorithm A_h (computation of h^(k)(T)).
Input: A nonempty decision table T, the directed acyclic graph Δ(T), and a number k ∈ {1, …, 5}.
Output: The value h^(k)(T).
  1. If a number is attached to each node of the DAG Δ(T), then return the number attached to the node T as h^(k)(T) and halt the algorithm. Otherwise, choose a node Θ of the graph Δ(T) without an attached number that is either a terminal node of Δ(T) or a nonterminal node of Δ(T) for which all children have attached numbers.
  2. If Θ is a terminal node, then attach to it the number h^(k)(Θ) = 0 and proceed to step 1.
  3. If Θ is not a terminal node, then, depending on the value of k, do the following:
    • In the case k = 1, compute the value h_a^(1)(Θ) and attach to Θ the value h^(1)(Θ) = h_a^(1)(Θ).
    • In the case k = 2, compute the value h_h^(2)(Θ) and attach to Θ the value h^(2)(Θ) = h_h^(2)(Θ).
    • In the case k = 3, compute the values h_a^(3)(Θ) and h_h^(3)(Θ), and attach to Θ the value h^(3)(Θ) = min{h_a^(3)(Θ), h_h^(3)(Θ)}.
    • In the case k = 4, compute the value h_p^(4)(Θ) and attach to Θ the value h^(4)(Θ) = h_p^(4)(Θ).
    • In the case k = 5, compute the values h_a^(5)(Θ) and h_p^(5)(Θ), and attach to Θ the value h^(5)(Θ) = min{h_a^(5)(Θ), h_p^(5)(Θ)}.
    Proceed to step 1.
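For illustration, the following sketch specializes the algorithm A_h to decision trees of the type 1 (attributes only), replacing the explicit DAG traversal by memoized recursion over separable subtables; the table representation and all names are our own assumptions.

```python
# A sketch of the algorithm A_h specialized to decision trees of the type 1 (only
# attributes), written as memoized recursion over separable subtables; the table
# representation (a tuple of (row, decision) pairs) is an assumption, not the
# authors' implementation.

from functools import lru_cache

def min_depth_type1(table):
    @lru_cache(maxsize=None)
    def h(theta):
        if len({d for _, d in theta}) <= 1:        # degenerate subtable: depth 0
            return 0
        n = len(theta[0][0])
        best = None
        for i in range(n):
            vals = {r[i] for r, _ in theta}
            if len(vals) < 2:                      # f_i is not in E(Theta)
                continue
            worst = max(h(tuple((r, d) for r, d in theta if r[i] == a)) for a in vals)
            best = worst + 1 if best is None else min(best, worst + 1)
        return best
    return h(tuple(table))

# The conjunction of three variables: the minimum depth with attributes only is n = 3.
T = [((a, b, c), a & b & c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
print(min_depth_type1(T))   # 3
```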
Using Remarks 2–4, one can prove the following statement.
Proposition 5.
The time complexity of the algorithm A_h is bounded from above by a polynomial on the size of the input table T and the number |SEP(T)| of different separable subtables of T.
A similar bound can be obtained for the space complexity of the considered algorithm.

6. Minimizing the Number of Realizable Nodes

In this section, we consider some results obtained in Reference [13]. Let T be a nonempty decision table with n conditional attributes f_1, …, f_n. We can use the DAG Δ(T) to compute the values L^(1)(T), …, L^(5)(T). Let k ∈ {1, …, 5}. To find the value L^(k)(T), we compute the value L^(k)(Θ) for each node Θ of the DAG Δ(T). We will consider not only subtables that are nodes of Δ(T) but also the empty subtable Λ of T and the subtables T_r that contain only one row r of T and are not nodes of Δ(T). We begin with these special subtables and the terminal nodes of Δ(T) (nodes without leaving edges), which are degenerate separable subtables of T, and step by step move to the table T.
Let Θ be a terminal node of Δ(T) or Θ = T_r for some row r of T. Then, L^(k)(Θ) = 1: the decision tree that contains only one node labeled with the decision attached to all rows of Θ is a decision tree for Θ. The only node of this tree is realizable relative to Θ. If Θ = Λ, then L^(k)(Θ) = 0: the decision tree that contains only one node labeled with 0 will be considered as a decision tree for Λ. The only node of this tree is not realizable relative to Λ.
Let Θ be a nonterminal node of Δ(T) such that, for each child Θ′ of Θ, we already know the value L^(k)(Θ′). Based on this information, we can find the minimum number of nodes realizable relative to Θ in a decision tree for Θ that uses decision trees of the type k for the subtables corresponding to the children of the root and in which the root is labeled
  • With an attribute from F(T) (we denote by L_a^(k)(Θ) the minimum number of nodes realizable relative to Θ in such a decision tree).
  • With a hypothesis over T (we denote by L_h^(k)(Θ) the minimum number of nodes realizable relative to Θ in such a decision tree).
  • With a proper hypothesis over T (we denote by L_p^(k)(Θ) the minimum number of nodes realizable relative to Θ in such a decision tree).
We now describe three procedures for computing the values L_a^(k)(Θ), L_h^(k)(Θ), and L_p^(k)(Θ), respectively. Since Θ is nondegenerate, the set E(Θ) is nonempty.
Let us consider a decision tree Γ(f_i) for Θ in which the root is labeled with an attribute f_i ∈ E(Θ). For each δ ∈ E(T, f_i), there is an edge that leaves the root and enters a node v(δ). This edge is labeled with the equation system {f_i = δ}. The node v(δ) is the root of a decision tree of the type k for Θ{f_i = δ} for which the number of nodes realizable relative to Θ{f_i = δ} is equal to L^(k)(Θ{f_i = δ}). It is clear that L(Θ, Γ(f_i)) = 1 + Σ_{δ ∈ E(T, f_i)} L^(k)(Θ{f_i = δ}). Since L^(k)(Θ{f_i = δ}) = L^(k)(Λ) = 0 for any δ ∈ E(T, f_i) \ E(Θ, f_i),
L(Θ, Γ(f_i)) = 1 + Σ_{δ ∈ E(Θ, f_i)} L^(k)(Θ{f_i = δ}).   (4)
Evidently, for any δ ∈ E(Θ, f_i), the subtable Θ{f_i = δ} is a child of Θ in the DAG Δ(T), i.e., we know the value L^(k)(Θ{f_i = δ}). One can show that L(Θ, Γ(f_i)) is the minimum number of nodes realizable relative to Θ in a decision tree for Θ that uses decision trees of the type k for the subtables corresponding to the children of the root and in which the root is labeled with the attribute f_i.
We should not consider attributes f_i ∈ F(T) \ E(Θ) since, for each such attribute, there is δ ∈ E(T, f_i) with Θ{f_i = δ} = Θ, i.e., based on this attribute, we cannot construct an optimal decision tree for Θ. As a result, we have
L_a^(k)(Θ) = min{L(Θ, Γ(f_i)) : f_i ∈ E(Θ)}.   (5)
Computation of L_a^(k)(Θ). Construct the set of attributes E(Θ). For each attribute f_i ∈ E(Θ), compute the value L(Θ, Γ(f_i)) using (4). Compute the value L_a^(k)(Θ) using (5).
Let us consider a decision tree Γ(H) for Θ in which the root is labeled with a hypothesis H = {f_1 = δ_1, …, f_n = δ_n} admissible for Θ. For each S ∈ A(H), there is an edge that leaves the root of Γ(H) and enters a node v(S). This edge is labeled with the equation system S. The node v(S) is the root of a decision tree of the type k for ΘS for which the number of nodes realizable relative to ΘS is equal to L^(k)(ΘS). It is clear that L(Θ, Γ(H)) = 1 + Σ_{S ∈ A(H)} L^(k)(ΘS).
Denote r = (δ_1, …, δ_n). It is easy to show that ΘH = Λ if r is not a row of Θ and ΘH = T_r if r is a row of Θ. Therefore,
L^(k)(ΘH) = 0 if r is not a row of Θ, and L^(k)(ΘH) = 1 if r is a row of Θ.   (6)
Since H is admissible for Θ, E(Θ, f_i) \ {δ_i} = ∅ for any attribute f_i ∈ F(T) \ E(Θ). It is clear that Θ{f_i = σ} = Λ and L^(k)(Θ{f_i = σ}) = 0 for any attribute f_i ∈ E(Θ) and any σ ∈ E(T, f_i) \ {δ_i} such that σ ∉ E(Θ, f_i). Therefore,
L(Θ, Γ(H)) = L^(k)(ΘH) + K(Θ, H),   (7)
where
K(Θ, H) = 1 + Σ_{f_i ∈ E(Θ), σ ∈ E(Θ, f_i) \ {δ_i}} L^(k)(Θ{f_i = σ}).   (8)
Evidently, for any f_i ∈ E(Θ) and any σ ∈ E(Θ, f_i) \ {δ_i}, the subtable Θ{f_i = σ} is a child of Θ in the DAG Δ(T), i.e., we know the value L^(k)(Θ{f_i = σ}). It is easy to show that L(Θ, Γ(H)) is the minimum number of nodes realizable relative to Θ in a decision tree for Θ that uses decision trees of the type k for the subtables corresponding to the children of the root and in which the root is labeled with the hypothesis H admissible for Θ.
We should not consider hypotheses that are not admissible for Θ since, for each such hypothesis H, there is an answer S ∈ A(H) to the corresponding query with ΘS = Θ, i.e., based on this hypothesis, we cannot construct an optimal decision tree for Θ. As a result, we have
L_h^(k)(Θ) = min{L(Θ, Γ(H)) : H ∈ Adm(Θ)},   (9)
where Adm(Θ) is the set of hypotheses admissible for Θ.
For each f_i ∈ {f_1, …, f_n}, denote a_i(Θ) = max{L^(k)(Θ{f_i = σ}) : σ ∈ E(Θ, f_i)} and C(Θ, f_i) = {σ ∈ E(Θ, f_i) : L^(k)(Θ{f_i = σ}) = a_i(Θ)}. Set C(Θ) = C(Θ, f_1) × … × C(Θ, f_n). It is clear that, for each δ̄ = (δ_1, …, δ_n) ∈ C(Θ), the hypothesis H_δ̄ = {f_1 = δ_1, …, f_n = δ_n} is admissible for Θ. A simple analysis of (8) shows that the set {H_δ̄ : δ̄ ∈ C(Θ)} coincides with the set of hypotheses admissible for Θ that minimize the value K(Θ, H). Denote K_min = K(Θ, H_δ̄), where δ̄ ∈ C(Θ).
Let there be a tuple δ̄ ∈ C(Θ) that is not a row of Θ. Then, L^(k)(ΘH_δ̄) = 0 and L_h^(k)(Θ) = K_min. Let all tuples from C(Θ) be rows of Θ. We now show that L_h^(k)(Θ) = 1 + K_min. For any δ̄ ∈ C(Θ), we have L(Θ, Γ(H_δ̄)) = 1 + K_min. Therefore, L_h^(k)(Θ) ≤ 1 + K_min. Let us assume that L_h^(k)(Θ) < 1 + K_min. Then, by (9), there exists a hypothesis H = {f_1 = σ_1, …, f_n = σ_n} admissible for Θ for which (σ_1, …, σ_n) ∉ C(Θ) and L(Θ, Γ(H)) < 1 + K_min, but this is impossible since, according to (7), L(Θ, Γ(H)) ≥ K(Θ, H) ≥ K_min + 1.
As a result, we have L_h^(k)(Θ) = K_min if not all tuples from C(Θ) are rows of Θ, and L_h^(k)(Θ) = 1 + K_min if all tuples from C(Θ) are rows of Θ.
Computation of L_h^(k)(Θ). For each f_i ∈ {f_1, …, f_n}, we compute the value a_i(Θ) = max{L^(k)(Θ{f_i = σ}) : σ ∈ E(Θ, f_i)} and construct the set C(Θ, f_i) = {σ ∈ E(Θ, f_i) : L^(k)(Θ{f_i = σ}) = a_i(Θ)}. For a tuple δ̄ ∈ C(Θ) = C(Θ, f_1) × … × C(Θ, f_n), using (8), we compute the value K_min = K(Θ, H_δ̄). Then, we count the number N of rows of Θ that belong to the set C(Θ) and compute the cardinality |C(Θ)| of the set C(Θ), which is equal to |C(Θ, f_1)| · … · |C(Θ, f_n)|. As a result, we have L_h^(k)(Θ) = K_min if N < |C(Θ)| and L_h^(k)(Θ) = 1 + K_min if N = |C(Θ)|.
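The following sketch implements this computation under assumed names: child_count[(i, σ)] stores L^(k)(Θ{f_i = σ}) for σ ∈ E(Θ, f_i), e_theta maps the indices of the attributes from E(Θ) to their value sets, and rows is the list of rows of Θ.

```python
# A sketch of the computation of L_h^(k)(Theta) with assumed names: child_count[(i, s)]
# holds L^(k)(Theta{f_i = s}) for s in E(Theta, f_i), e_theta maps the indices of the
# attributes in E(Theta) to their value sets, and rows is the list of rows of Theta.

def l_h(child_count, e_theta, n, rows):
    # C(Theta, f_i): the values of f_i maximizing L^(k)(Theta{f_i = s});
    # for a constant attribute it is its single value in Theta.
    c = {}
    for i in range(n):
        if i in e_theta:
            a_i = max(child_count[(i, s)] for s in e_theta[i])
            c[i] = [s for s in e_theta[i] if child_count[(i, s)] == a_i]
        else:
            c[i] = sorted({r[i] for r in rows})
    delta = tuple(c[i][0] for i in range(n))       # any tuple of C(Theta) yields K_min
    k_min = 1 + sum(child_count[(i, s)]
                    for i in e_theta for s in e_theta[i] if s != delta[i])
    size_c = 1
    for i in range(n):
        size_c *= len(c[i])
    n_in_c = sum(1 for r in rows if all(r[i] in c[i] for i in range(n)))
    # L_h^(k)(Theta) = K_min if some tuple of C(Theta) is not a row of Theta,
    # and K_min + 1 otherwise.
    return k_min if n_in_c < size_c else 1 + k_min

rows = [(0, 0), (0, 1), (1, 1)]
e_theta = {0: {0, 1}, 1: {0, 1}}
child_count = {(0, 0): 3, (0, 1): 1, (1, 0): 1, (1, 1): 3}
print(l_h(child_count, e_theta, 2, rows))   # C(Theta) = {(0, 1)} is a row of Theta, so 1 + K_min = 1 + 3 = 4
```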
Computation of L_p^(k)(Θ). For each row r = (δ_1, …, δ_n) of the decision table T, we check if the corresponding proper hypothesis H_r = {f_1 = δ_1, …, f_n = δ_n} is admissible for Θ. For each proper hypothesis H_r admissible for Θ, we compute the value L(Θ, Γ(H_r)) using (6), (7), and (8). One can show that the minimum among the obtained numbers is equal to L_p^(k)(Θ).
We now consider an algorithm A_L that, for a given nonempty decision table T and number k ∈ {1, …, 5}, calculates the value L^(k)(T), which is the minimum number of nodes realizable relative to T in a decision tree of the type k for the table T. During the work of this algorithm, we find, for each node Θ of the DAG Δ(T), the value L^(k)(Θ).
The description of the algorithm A_L is similar to the description of the algorithm A_h. Instead of h^(k), we should use L^(k). For each b ∈ {a, h, p}, instead of h_b^(k), we should use L_b^(k). In particular, for each terminal node Θ, L^(k)(Θ) = 1.
One can show that the procedures of computation of the values L_a^(k)(Θ), L_h^(k)(Θ), and L_p^(k)(Θ) have polynomial time complexity depending on the size of the decision table T. Using this fact, one can prove the following statement.
Proposition 6.
The time complexity of the algorithm A_L is bounded from above by a polynomial on the size of the input table T and the number |SEP(T)| of different separable subtables of T.
A similar bound can be obtained for the space complexity of the considered algorithm.

7. Minimizing the Number of Realizable Terminal Nodes

The procedure considered in this section is similar to the procedure for the minimization of the number of realizable nodes. The main difference is that, in decision trees with the minimum number of realizable terminal nodes, it is possible to meet constant attributes and hypotheses that are not admissible. Fortunately, for any decision table and any type of decision trees, there is a decision tree of this type with the minimum number of realizable terminal nodes for the considered table that does not use such attributes and hypotheses. We will omit many details and describe only the main steps.
Let T be a nonempty decision table with n conditional attributes f_1, …, f_n and let k ∈ {1, …, 5}. To find the value L_t^(k)(T), we compute the value L_t^(k)(Θ) for each node Θ of the DAG Δ(T). We begin with the terminal nodes of Δ(T), which are degenerate separable subtables of T, and step by step move to the table T.
Let Θ be a terminal node of Δ(T). Then, L_t^(k)(Θ) = 1: the decision tree that contains only one node labeled with the decision attached to all rows of Θ is a decision tree for Θ. The only node of this tree is a terminal node realizable relative to Θ.
Let Θ be a nonterminal node of Δ(T) such that, for each child Θ′ of Θ, we already know the value L_t^(k)(Θ′). Based on this information, we can find the minimum number of terminal nodes realizable relative to Θ in a decision tree for Θ that uses decision trees of the type k for the subtables corresponding to the children of the root and in which the root is labeled
  • With an attribute from F(T) (we denote by L_{t,a}^(k)(Θ) the minimum number of terminal nodes realizable relative to Θ in such a decision tree).
  • With a hypothesis over T (we denote by L_{t,h}^(k)(Θ) the minimum number of terminal nodes realizable relative to Θ in such a decision tree).
  • With a proper hypothesis over T (we denote by L_{t,p}^(k)(Θ) the minimum number of terminal nodes realizable relative to Θ in such a decision tree).
We now describe three procedures for computing the values L_{t,a}^(k)(Θ), L_{t,h}^(k)(Θ), and L_{t,p}^(k)(Θ), respectively. Since Θ is nondegenerate, the set E(Θ) is nonempty.
Computation of L_{t,a}^(k)(Θ). Construct the set of attributes E(Θ). For each attribute f_i ∈ E(Θ), compute the value
L_{t,a}^(k)(Θ, f_i) = Σ_{δ ∈ E(Θ, f_i)} L_t^(k)(Θ{f_i = δ}).
Then, compute the value
L_{t,a}^(k)(Θ) = min{L_{t,a}^(k)(Θ, f_i) : f_i ∈ E(Θ)}.
Computation of L_{t,h}^(k)(Θ). For each f_i ∈ {f_1, …, f_n}, we compute the value a_i(Θ) = max{L_t^(k)(Θ{f_i = σ}) : σ ∈ E(Θ, f_i)} and construct the set C(Θ, f_i) = {σ ∈ E(Θ, f_i) : L_t^(k)(Θ{f_i = σ}) = a_i(Θ)}. For a tuple (δ_1, …, δ_n) ∈ C(Θ) = C(Θ, f_1) × … × C(Θ, f_n), we compute the value
K_min = Σ_{f_i ∈ E(Θ), σ ∈ E(Θ, f_i) \ {δ_i}} L_t^(k)(Θ{f_i = σ}).
Then, we count the number N of rows of Θ that belong to the set C(Θ) and compute the cardinality |C(Θ)| of the set C(Θ), which is equal to |C(Θ, f_1)| · … · |C(Θ, f_n)|. As a result, we have L_{t,h}^(k)(Θ) = K_min if N < |C(Θ)| and L_{t,h}^(k)(Θ) = 1 + K_min if N = |C(Θ)|.
Computation of L_{t,p}^(k)(Θ). For each row r = (δ_1, …, δ_n) of the decision table T, we check if the corresponding proper hypothesis H_r = {f_1 = δ_1, …, f_n = δ_n} is admissible for Θ. For each proper hypothesis H_r admissible for Θ, we compute the value
L_{t,p}^(k)(Θ, H_r) = 1 + Σ_{f_i ∈ E(Θ), σ ∈ E(Θ, f_i) \ {δ_i}} L_t^(k)(Θ{f_i = σ}).
One can show that the minimum among the obtained numbers is equal to L_{t,p}^(k)(Θ).
We now consider an algorithm A_Lt that, for a given nonempty decision table T and number k ∈ {1, …, 5}, calculates the value L_t^(k)(T), which is the minimum number of terminal nodes realizable relative to T in a decision tree of the type k for the table T. During the work of this algorithm, we find, for each node Θ of the DAG Δ(T), the value L_t^(k)(Θ).
The description of the algorithm A_Lt is similar to the description of the algorithm A_h. Instead of h^(k), we should use L_t^(k). For each b ∈ {a, h, p}, instead of h_b^(k), we should use L_{t,b}^(k). In particular, for each terminal node Θ, L_t^(k)(Θ) = 1.
One can show that the procedures of computation of the values L_{t,a}^(k)(Θ), L_{t,h}^(k)(Θ), and L_{t,p}^(k)(Θ) have polynomial time complexity depending on the size of the decision table T. Using this fact, one can prove the following statement.
Proposition 7.
The time complexity of the algorithm A_Lt is bounded from above by a polynomial on the size of the input table T and the number |SEP(T)| of different separable subtables of T.
A similar bound can be obtained for the space complexity of the considered algorithm.

8. Minimizing the Number of Working Nodes

The procedure considered in this section is similar to the procedure for the minimization of the depth. We will omit many details and describe only the main steps.
Let T be a nonempty decision table with n conditional attributes f_1, …, f_n and let k ∈ {1, …, 5}. To find the value L_w^(k)(T), we compute the value L_w^(k)(Θ) for each node Θ of the DAG Δ(T). We begin with the terminal nodes of Δ(T), which are degenerate separable subtables of T, and step by step move to the table T.
Let Θ be a terminal node of Δ(T). Then, L_w^(k)(Θ) = 0: the decision tree that contains only one node labeled with the decision attached to all rows of Θ is a decision tree for Θ. This tree has no working nodes.
Let Θ be a nonterminal node of Δ(T) such that, for each child Θ′ of Θ, we already know the value L_w^(k)(Θ′). Based on this information, we can find the minimum number of working nodes in a decision tree for Θ that uses decision trees of the type k for the subtables corresponding to the children of the root and in which the root is labeled
  • With an attribute from F(T) (we denote by L_{w,a}^(k)(Θ) the minimum number of working nodes in such a decision tree).
  • With a hypothesis over T (we denote by L_{w,h}^(k)(Θ) the minimum number of working nodes in such a decision tree).
  • With a proper hypothesis over T (we denote by L_{w,p}^(k)(Θ) the minimum number of working nodes in such a decision tree).
We now describe three procedures for computing the values L_{w,a}^(k)(Θ), L_{w,h}^(k)(Θ), and L_{w,p}^(k)(Θ), respectively. Since Θ is nondegenerate, the set E(Θ) is nonempty.
Computation of L_{w,a}^(k)(Θ). Construct the set of attributes E(Θ). For each attribute f_i ∈ E(Θ), compute the value
L_{w,a}^(k)(Θ, f_i) = Σ_{δ ∈ E(Θ, f_i)} L_w^(k)(Θ{f_i = δ}).
Then, compute the value
L_{w,a}^(k)(Θ) = min{L_{w,a}^(k)(Θ, f_i) : f_i ∈ E(Θ)}.
Computation of L_{w,h}^(k)(Θ). First, we construct a hypothesis
H_Θ = {f_1 = δ_1, …, f_n = δ_n}
for Θ. Let f_i ∈ F(T) \ E(Θ). Then, δ_i is equal to the only number in the set E(Θ, f_i). Let f_i ∈ E(Θ). Then, δ_i is the minimum number from E(Θ, f_i) for which L_w^(k)(Θ{f_i = δ_i}) = max{L_w^(k)(Θ{f_i = σ}) : σ ∈ E(Θ, f_i)}. Then,
L_{w,h}^(k)(Θ) = 1 + Σ_{f_i ∈ E(Θ), σ ∈ E(Θ, f_i) \ {δ_i}} L_w^(k)(Θ{f_i = σ}).
Computation of L_{w,p}^(k)(Θ). For each row r = (δ_1, …, δ_n) of the decision table T, we check if the corresponding proper hypothesis H_r = {f_1 = δ_1, …, f_n = δ_n} is admissible for Θ. For each proper hypothesis H_r admissible for Θ, we compute the value
L_{w,p}^(k)(Θ, H_r) = 1 + Σ_{f_i ∈ E(Θ), σ ∈ E(Θ, f_i) \ {δ_i}} L_w^(k)(Θ{f_i = σ}).
One can show that the minimum among the obtained numbers is equal to L_{w,p}^(k)(Θ).
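For illustration, the following sketch implements the attribute and hypothesis cases of this section under assumed names: child_w[(i, σ)] stores L_w^(k)(Θ{f_i = σ}) for σ ∈ E(Θ, f_i), and e_theta maps the attributes from E(Θ) to their value sets.

```python
# A compact sketch of the formulas of this section, with assumed inputs: child_w[(i, s)]
# holds L_w^(k)(Theta{f_i = s}) for s in E(Theta, f_i), and e_theta maps the attribute
# indices of E(Theta) to their value sets E(Theta, f_i).

def l_w_a(child_w, e_theta):
    """min over f_i in E(Theta) of the sum of L_w^(k)(Theta{f_i = delta}) over E(Theta, f_i)."""
    return min(sum(child_w[(i, s)] for s in vals) for i, vals in e_theta.items())

def l_w_hypothesis(child_w, e_theta, delta):
    """1 + sum of L_w^(k)(Theta{f_i = sigma}) over f_i in E(Theta) and sigma != delta_i;
    with delta chosen as in H_Theta, this value equals L_{w,h}^(k)(Theta)."""
    return 1 + sum(child_w[(i, s)]
                   for i, vals in e_theta.items() for s in vals if s != delta[i])

e_theta = {0: {0, 1}, 1: {0, 1, 2}}
child_w = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 2, (1, 2): 1}
print(l_w_a(child_w, e_theta))                         # min(1 + 0, 0 + 2 + 1) = 1
print(l_w_hypothesis(child_w, e_theta, {0: 0, 1: 1}))  # 1 + (0 + 0 + 1) = 2
```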
We now consider an algorithm A_Lw that, for a given nonempty decision table T and k ∈ {1, …, 5}, calculates the value L_w^(k)(T), which is the minimum number of working nodes in a decision tree of the type k for the table T. During the work of this algorithm, we find, for each node Θ of the DAG Δ(T), the value L_w^(k)(Θ).
The description of the algorithm A_Lw is similar to the description of the algorithm A_h. Instead of h^(k), we should use L_w^(k). For each b ∈ {a, h, p}, instead of h_b^(k), we should use L_{w,b}^(k). In particular, for each terminal node Θ, L_w^(k)(Θ) = 0.
One can show that the procedures of computation of the values L_{w,a}^(k)(Θ), L_{w,h}^(k)(Θ), and L_{w,p}^(k)(Θ) have polynomial time complexity depending on the size of the decision table T. Using this fact, one can prove the following statement.
Proposition 8.
The time complexity of the algorithm A_Lw is bounded from above by a polynomial on the size of the input table T and the number |SEP(T)| of different separable subtables of T.
A similar bound can be obtained for the space complexity of the considered algorithm.

9. On the Number of Realizable Terminal Nodes

Based on the results of the experiments, we formulated the following hypothesis: L_t^(1)(T) = L_t^(3)(T) = L_t^(5)(T) for any decision table T. In this section, we prove it. First, we consider a simple lemma.
Lemma 9.
Let T be a decision table and let T′ be a subtable of the table T. Then, L_t^(3)(T′) ≤ L_t^(3)(T).
Proof. 
It is easy to prove the considered inequality if T′ is degenerate. Let T′ be nondegenerate and let Γ be a decision tree of the type 3 for T with the minimum number of terminal nodes realizable relative to T. Then, the root r of Γ is a working node. It is clear that the table T′S(Γ, r) is nondegenerate. For each working node v of Γ such that the table T′S(Γ, v) is degenerate and the table T′S(Γ, v′) is nondegenerate, where v′ is the parent of v, we do the following. We remove all nodes and edges of the subtree of Γ with the root v with the exception of the node v. If T′S(Γ, v) = Λ, then we label the node v with the number 0. If the subtable T′S(Γ, v) is nonempty, then we label the node v with the decision attached to each row of this subtable. We denote by Γ′ the obtained decision tree. One can show that Γ′ is a decision tree of the type 3 for the table T′ and L_t(T′, Γ′) ≤ L_t(T, Γ). Therefore, L_t^(3)(T′) ≤ L_t^(3)(T). □
Proposition 10.
For any decision table T, the following equalities hold:
L_t^(1)(T) = L_t^(3)(T) = L_t^(5)(T).
Proof. 
It is clear that L_t^(3)(T) ≤ L_t^(5)(T) ≤ L_t^(1)(T) for any decision table T. To prove the considered statement, it is enough to show that L_t^(1)(T) ≤ L_t^(3)(T) for any decision table T. We will prove this inequality by induction on the number of attributes in the set E(T).
We now show that L_t^(1)(T) ≤ L_t^(3)(T) for any decision table T with |E(T)| = 0. If |E(T)| = 0, then either the table T is empty or the table T contains one row. Let T be empty. In this case, the decision tree that contains only one node labeled with 0 is considered as a decision tree for T. The only node of this tree is not realizable relative to T. Therefore, L_t^(1)(T) = L_t^(3)(T) = 0. Let T contain one row. In this case, the decision tree that contains only one node labeled with the decision attached to the row of T is a decision tree for T. The only node of this tree is realizable relative to T. Therefore, L_t^(1)(T) = L_t^(3)(T) = 1.
Let n ≥ 1 and let, for any decision table T′ with |E(T′)| ≤ n − 1, the inequality L_t^(1)(T′) ≤ L_t^(3)(T′) hold. Let T be a decision table with |E(T)| = n that has m ≥ n columns labeled with the attributes f_1, …, f_m. Let, for definiteness, E(T) = {f_1, …, f_n}. If T is a degenerate table, then, as it is easy to show, L_t^(1)(T) = L_t^(3)(T) = 1. Let T be nondegenerate.
We denote by Γ a decision tree of the type 3 for the table T for which L_t(T, Γ) = L_t^(3)(T) and Γ has the minimum number of nodes among such decision trees. One can show that the root of Γ is labeled either with an attribute from E(T) or with a hypothesis over T that is admissible for T. We now prove that the tree Γ can be transformed into a decision tree Γ* of the type 1 for the table T such that L_t(T, Γ*) ≤ L_t^(3)(T).
Let the root of Γ be labeled with an attribute f_i ∈ E(T). Then, for each σ ∈ E(T, f_i), the root of Γ has a child v_σ such that TS(Γ, v_σ) = T{f_i = σ}, and the root of Γ has no other children. Since f_i ∈ E(T), |E(T{f_i = σ})| ≤ n − 1. Using the inductive hypothesis, we obtain that there is a decision tree Γ_σ of the type 1 for the table T{f_i = σ} such that L_t(T{f_i = σ}, Γ_σ) ≤ L_t^(3)(T{f_i = σ}). For each child v_σ of the root of Γ, we replace the subtree of Γ with the root v_σ with the tree Γ_σ. As a result, we obtain a decision tree Γ* of the type 1 for the table T such that L_t(T, Γ*) ≤ L_t(T, Γ) = L_t^(3)(T).
Let the root of Γ be labeled with a hypothesis H = {f_1 = δ_1, …, f_m = δ_m} over T that is admissible for T; see Figure 1, which depicts a prefix of the tree Γ. The root of Γ has a child v_0 such that TS(Γ, v_0) = TH = T{f_1 = δ_1, …, f_n = δ_n}. For each f_i ∈ E(T) and each σ_i ∈ E(T, f_i) \ {δ_i}, the root of Γ has a child v_{i,σ_i} such that TS(Γ, v_{i,σ_i}) = T{f_i = σ_i}. The root of Γ has no other children.
We transform the tree Γ into a decision tree Γ* of the type 1 for the table T; see Figure 2, which depicts a prefix of the tree Γ*. For the node u_0 of the considered prefix, TS(Γ*, u_0) = T{f_1 = δ_1, …, f_n = δ_n} = TH. For each f_i ∈ E(T) and each σ_i ∈ E(T, f_i) \ {δ_i}, the node of this prefix labeled with the attribute f_i has a child u_{i,σ_i} such that TS(Γ*, u_{i,σ_i}) = T{f_1 = δ_1, …, f_{i−1} = δ_{i−1}, f_i = σ_i}. It is clear that TS(Γ*, u_{i,σ_i}) is a subtable of TS(Γ, v_{i,σ_i}). By Lemma 9, L_t^(3)(TS(Γ*, u_{i,σ_i})) ≤ L_t^(3)(TS(Γ, v_{i,σ_i})). It is also clear that |E(TS(Γ*, u_{i,σ_i}))| ≤ n − 1. Using the inductive hypothesis, we obtain that there is a decision tree Γ_{i,σ_i} of the type 1 for the table TS(Γ*, u_{i,σ_i}) such that L_t(TS(Γ*, u_{i,σ_i}), Γ_{i,σ_i}) ≤ L_t^(3)(TS(Γ*, u_{i,σ_i})) ≤ L_t^(3)(TS(Γ, v_{i,σ_i})).
We now transform the prefix of the decision tree Γ* depicted in Figure 2 into a decision tree Γ* of the type 1 for the table T. First, we transform the node u_0 into a terminal node labeled with the number 0 if (δ_1, …, δ_n) is not a row of T and labeled with the decision attached to (δ_1, …, δ_n) if this tuple is a row of T. Next, for each f_i ∈ E(T) and each σ_i ∈ E(T, f_i) \ {δ_i}, we replace the node u_{i,σ_i} with the tree Γ_{i,σ_i}. It is clear that the obtained tree Γ* is a decision tree of the type 1 for the decision table T and L_t(T, Γ*) ≤ L_t(T, Γ) = L_t^(3)(T).
We proved that, for any decision table T, L_t^(1)(T) ≤ L_t^(3)(T); hence, L_t^(1)(T) = L_t^(3)(T) = L_t^(5)(T). □

10. Results of Experiments

We conducted experiments with eight decision tables from the UCI ML Repository [20]. Table 1 contains information about each of these decision tables: its name, the number of rows, and the number of attributes. For each of the considered four cost functions, each of the considered five types of decision trees, and each of the considered eight decision tables, we find the minimum cost of a decision tree of the given type for the given table.
For n = 3, …, 6, we randomly generate 100 Boolean functions with n variables. We represent each Boolean function f with n variables x_1, …, x_n as a decision table T_f with n columns labeled with the variables x_1, …, x_n considered as attributes and with 2^n rows that are all possible n-tuples of values of the variables. Each row is labeled with the decision that is the value of the function f on the corresponding n-tuple. We consider decision trees for the table T_f as decision trees computing the function f.
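A sketch of this representation (assumed helpers, not the authors' experimental code) is given below.

```python
# A sketch of building the decision table T_f for a Boolean function f of n variables.

from itertools import product
import random

def boolean_table(f, n):
    """T_f: all 2^n tuples of variable values, each labeled with the value of f on it."""
    return [(bits, f(*bits)) for bits in product((0, 1), repeat=n)]

# A randomly generated Boolean function of n variables given by its truth vector.
n = 4
truth = [random.randint(0, 1) for _ in range(2 ** n)]
f = lambda *bits: truth[int("".join(map(str, bits)), 2)]
print(boolean_table(f, n)[:3])   # the first three rows of T_f
```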
For each of the considered four cost functions, each of the considered five types of decision trees, and each of the generated Boolean functions, using its decision table representation, we find the minimum cost of a decision tree of the given type computing this function.
The following remarks clarify some experimental results considered later.
From Proposition 10, it follows that L_t^(1)(T) = L_t^(3)(T) = L_t^(5)(T) for any decision table T.
Let f be a Boolean function with n ≥ 1 variables. Since each hypothesis over the decision table T_f is proper, the following equalities hold:
h^(2)(T_f) = h^(4)(T_f), h^(3)(T_f) = h^(5)(T_f), L^(2)(T_f) = L^(4)(T_f), L^(3)(T_f) = L^(5)(T_f), L_t^(2)(T_f) = L_t^(4)(T_f), L_t^(3)(T_f) = L_t^(5)(T_f), L_w^(2)(T_f) = L_w^(4)(T_f), L_w^(3)(T_f) = L_w^(5)(T_f).

10.1. Depth

In this section, we consider some results obtained in Reference [12]. The results of experiments with the eight decision tables from Reference [20] and the depth are represented in Table 2. The first column contains the name of the considered decision table T. The last five columns contain the values h^(1)(T), …, h^(5)(T) (the minimum values for each decision table are in bold).
Decision trees with the minimum depth that use only attributes (type 1) are optimal for 5 decision tables, those that use only hypotheses (type 2) for 4 tables, those that use attributes and hypotheses (type 3) for 8 tables, those that use only proper hypotheses (type 4) for 3 tables, and those that use attributes and proper hypotheses (type 5) for 7 tables.
For the decision table soybean-small, we must use attributes to construct an optimal decision tree. For this table, it is enough to use only attributes. For the decision tables breast-cancer and nursery, we must use both attributes and hypotheses to construct optimal decision trees. For these tables, it is enough to use attributes and proper hypotheses. For the decision table tic-tac-toe, we must use both attributes and hypotheses to construct optimal decision trees. For this table, it is not enough to use attributes and proper hypotheses.
The results of experiments with Boolean functions and the depth are represented in Table 3. The first column contains the number of variables in the considered Boolean functions. The last five columns contain information about the values h^(1), …, h^(5) in the format "min / avg / max" (minimum, average, and maximum over the 100 generated functions).
From the obtained results, it follows that, generally, the decision trees of the types 2 and 4 are better than the decision trees of the type 1, and the decision trees of the types 3 and 5 are better than the decision trees of the types 2 and 4.

10.2. Number of Realizable Nodes

In this section, we consider some results obtained in Reference [13]. The results of experiments with the eight decision tables from Reference [20] and the number of realizable nodes are represented in Table 4. The first column contains the name of the considered decision table T. The last five columns contain the values L^(1)(T), …, L^(5)(T) (the minimum values for each decision table are in bold).
Decision trees with the minimum number of realizable nodes that use only attributes (type 1) are optimal for 4 decision tables, those that use only hypotheses (type 2) for 0 tables, those that use attributes and hypotheses (type 3) for 8 tables, those that use only proper hypotheses (type 4) for 0 tables, and those that use attributes and proper hypotheses (type 5) for 8 tables.
Decision trees of the types 3 and 5 can be a bit better than the decision trees of the type 1. Decision trees of the types 2 and 4 are far from the optimal.
For the decision tables hayes-roth-data, soybean-small, tic-tac-toe, and zoo-data, we must use attributes to construct optimal decision trees. For these tables, it is enough to use only attributes. For the rest of the considered decision tables, we must use both attributes and hypotheses to construct optimal decision trees. For these tables, it is enough to use attributes and proper hypotheses.
Results of experiments on the number of realizable nodes for Boolean functions are presented in Table 5. The first column contains the number of variables in the considered Boolean functions. The last five columns contain information about the values L^(1), …, L^(5) in the format min / Avg / max.
From the obtained results, it follows that, generally, decision trees of types 3 and 5 are slightly better than decision trees of type 1, and decision trees of types 2 and 4 are far from optimal.

10.3. Number of Realizable Terminal Nodes

Results of experiments on the number of realizable terminal nodes for eight decision tables from Reference [20] are presented in Table 6. The first column contains the name of the considered decision table T. The last five columns contain the values L_t^(1)(T), …, L_t^(5)(T) (minimum values for each decision table are in bold).
Decision trees of types 1, 3, and 5 are optimal for each of the considered tables. Decision trees of types 2 and 4 are far from optimal.
Results of experiments on the number of realizable terminal nodes for Boolean functions are presented in Table 7. The first column contains the number of variables in the considered Boolean functions. The last five columns contain information about the values L_t^(1), …, L_t^(5) in the format min / Avg / max.
From the obtained results, it follows that, generally, decision trees of types 1, 3, and 5 are optimal, and decision trees of types 2 and 4 are far from optimal.

10.4. Number of Working Nodes

Results of experiments on the number of working nodes for eight decision tables from Reference [20] are presented in Table 8. The first column contains the name of the considered decision table T. The last five columns contain the values L_w^(1)(T), …, L_w^(5)(T) (minimum values for each decision table are in bold).
Decision trees with the minimum number of working nodes using attributes (type 1) are optimal for 2 decision tables, using hypotheses (type 2) for 0 tables, using attributes and hypotheses (type 3) for 8 tables, using proper hypotheses (type 4) for 0 tables, and using attributes and proper hypotheses (type 5) for 7 tables.
Decision trees of types 3 and 5 can be slightly better than decision trees of type 1. Decision trees of types 2 and 4 are far from optimal.
For all decision tables with the exception of soybean-small and zoo-data, we must use both attributes and hypotheses to construct optimal decision trees. Moreover, for tic-tac-toe, it is not enough to use attributes and proper hypotheses. For soybean-small and zoo-data, it is enough to use only attributes to construct optimal decision trees.
Results of experiments on the number of working nodes for Boolean functions are presented in Table 9. The first column contains the number of variables in the considered Boolean functions. The last five columns contain information about the values L_w^(1), …, L_w^(5) in the format min / Avg / max.
From the obtained results, it follows that, generally, decision trees of types 3 and 5 are better than decision trees of type 1, and decision trees of types 2 and 4 are far from optimal.
We can now sum up the results of the experiments. Generally, decision trees of types 3 and 5 are slightly better than decision trees of type 1, while decision trees of types 2 and 4 generally have too many nodes.
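To make the three node counts concrete, the sketch below evaluates a given type 1 decision tree on a table, counting realizable nodes, realizable terminal nodes, and working nodes under the (assumed) convention that a node is realizable if at least one row of the table reaches it. The tree encoding and function names are illustrative only and are not taken from the paper's implementation.

```python
from itertools import product

def count_nodes(tree, rows):
    """Return (realizable nodes, realizable terminal nodes, working nodes)
    of a type 1 tree on the given rows.

    tree is ('leaf', decision) or ('attr', i, {value: subtree});
    rows is a list of (values, decision) pairs."""
    if not rows:
        return 0, 0, 0  # no row reaches this node, so it is not realizable
    if tree[0] == 'leaf':
        return 1, 1, 0
    _, i, children = tree
    total, terminal, working = 1, 0, 1  # this node is a realizable working node
    for value, subtree in children.items():
        sub_rows = [r for r in rows if r[0][i] == value]
        a, b, c = count_nodes(subtree, sub_rows)
        total, terminal, working = total + a, terminal + b, working + c
    return total, terminal, working

# Example: a tree for the conjunction x1 & x2.
rows = [(bits, bits[0] & bits[1]) for bits in product((0, 1), repeat=2)]
tree = ('attr', 0, {0: ('leaf', 0),
                    1: ('attr', 1, {0: ('leaf', 0), 1: ('leaf', 1)})})
print(count_nodes(tree, rows))  # (5, 3, 2)
```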

11. Conclusions

In this paper, we studied modified decision trees that use both queries based on one attribute each and queries based on hypotheses about the values of all attributes. We designed dynamic programming algorithms for the minimization of four cost functions for such decision trees and discussed the results of computer experiments. The main result of the paper is that the use of hypotheses can decrease the complexity of decision trees and make them more suitable for knowledge representation. In the future, we are planning to compare the length and coverage of decision rules derived from the different types of decision trees constructed by the dynamic programming algorithms. As designed, the considered algorithms each optimize a single cost function and cannot be combined to optimize several cost functions simultaneously. In the future, we are also planning to consider two extensions of these algorithms: (i) sequential optimization relative to a number of cost functions and (ii) bi-criteria optimization that allows us to construct, for some pairs of cost functions, the corresponding Pareto front.

Author Contributions

Conceptualization, all authors; methodology, all authors; software, I.C.; validation, I.C., M.A., and S.H.; formal analysis, all authors; investigation, M.A. and S.H.; resources, all authors; data curation, M.A. and S.H.; writing—original draft preparation, M.M.; writing—review and editing, all authors; visualization, M.A. and S.H.; supervision, I.C. and M.M.; project administration, M.M.; funding acquisition, M.M. All authors have read and agreed to the published version of the manuscript.

Funding

Research funded by King Abdullah University of Science and Technology.

Data Availability Statement

Data available in a publicly accessible repository that does not issue DOIs. Publicly available datasets were analyzed in this study. This data can be found here: http://archive.ics.uci.edu/ml (accessed on 12 April 2017).

Acknowledgments

Research reported in this publication was supported by King Abdullah University of Science and Technology (KAUST) including the provision of computing resources. The authors are greatly indebted to the anonymous reviewers for useful comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Chapman and Hall/CRC: Boca Raton, FL, USA, 1984.
2. Moshkov, M. Time complexity of decision trees. In Trans. Rough Sets III; Lecture Notes in Computer Science; Peters, J.F., Skowron, A., Eds.; Springer: Berlin, Germany, 2005; Volume 3400, pp. 244–459.
3. Rokach, L.; Maimon, O. Data Mining with Decision Trees—Theory and Applications; Series in Machine Perception and Artificial Intelligence; World Scientific: Singapore, 2007; Volume 69.
4. Chegis, I.A.; Yablonskii, S.V. Logical methods of control of work of electric schemes. Trudy Mat. Inst. Steklov 1958, 51, 270–360. (In Russian)
5. Pawlak, Z. Rough sets. Int. J. Parallel Program. 1982, 11, 341–356.
6. Pawlak, Z. Rough Sets—Theoretical Aspects of Reasoning about Data; Theory and Decision Library: Series D; Kluwer: Dordrecht, The Netherlands, 1991; Volume 9.
7. Pawlak, Z.; Skowron, A. Rudiments of rough sets. Inf. Sci. 2007, 177, 3–27.
8. Angluin, D. Queries and concept learning. Mach. Learn. 1988, 2, 319–342.
9. Angluin, D. Queries revisited. Theor. Comput. Sci. 2004, 313, 175–194.
10. Angluin, D. Learning regular sets from queries and counterexamples. Inf. Comput. 1987, 75, 87–106.
11. Valiant, L.G. A theory of the learnable. Commun. ACM 1984, 27, 1134–1142.
12. Azad, M.; Chikalov, I.; Hussain, S.; Moshkov, M. Minimizing depth of decision trees with hypotheses (to appear). In Proceedings of the International Joint Conference on Rough Sets (IJCRS 2021), Bratislava, Slovakia, 19–24 September 2021.
13. Azad, M.; Chikalov, I.; Hussain, S.; Moshkov, M. Minimizing number of nodes in decision trees with hypotheses (to appear). In Proceedings of the 25th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES 2021), Szczecin, Poland, 8–10 September 2021.
14. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117.
15. Ruiz-Garcia, A.; Schmidhuber, J.; Palade, V.; Took, C.C.; Mandic, D.P. Deep neural network representation and Generative Adversarial Learning. Neural Netw. 2021, 139, 199–200.
16. Shamshirband, S.; Fathi, M.; Chronopoulos, A.T.; Montieri, A.; Palumbo, F.; Pescapè, A. Computational intelligence intrusion detection techniques in mobile cloud computing environments: Review, taxonomy, and open research issues. J. Inf. Secur. Appl. 2020, 55, 102582.
17. Shamshirband, S.; Fathi, M.; Dehzangi, A.; Chronopoulos, A.T.; Alinejad-Rokny, H. A review on deep learning approaches in healthcare systems: Taxonomies, challenges, and open issues. J. Biomed. Inform. 2021, 113, 103627.
18. AbouEisha, H.; Amin, T.; Chikalov, I.; Hussain, S.; Moshkov, M. Extensions of Dynamic Programming for Combinatorial Optimization and Data Mining; Intelligent Systems Reference Library; Springer: Berlin, Germany, 2019; Volume 146.
19. Alsolami, F.; Azad, M.; Chikalov, I.; Moshkov, M. Decision and Inhibitory Trees and Rules for Decision Tables with Many-Valued Decisions; Intelligent Systems Reference Library; Springer: Berlin, Germany, 2020; Volume 156.
20. Dua, D.; Graff, C. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, 2017. Available online: http://archive.ics.uci.edu/ml (accessed on 12 April 2017).
Figure 1. Prefix of decision tree Γ.
Figure 2. Prefix of decision tree Γ*.
Table 1. Decision tables from Reference [20] used in experiments.

Decision Table      Number of Rows   Number of Attributes
balance-scale            625                5
breast-cancer            266               10
cars                    1728                7
hayes-roth-data           69                5
nursery               12,960                9
soybean-small             47               36
tic-tac-toe              958               10
zoo-data                  59               17
Table 2. Experimental results for decision tables from Reference [20]—depth.

Decision Table T    h^(1)(T)  h^(2)(T)  h^(3)(T)  h^(4)(T)  h^(5)(T)
balance-scale          4         4         4         4         4
breast-cancer          6         6         5         6         5
cars                   6         6         6         6         6
hayes-roth-data        4         4         4         4         4
nursery                8         8         7         8         7
soybean-small          2         4         2         6         2
tic-tac-toe            6         6         5         8         6
zoo-data               4         4         4         5         4
Average                5.00      5.25      4.63      5.88      4.75
Table 3. Experimental results for Boolean functions—depth.

Number of     h^(1)            h^(2)            h^(3)            h^(4)            h^(5)
Variables n   min  Avg   max   min  Avg   max   min  Avg   max   min  Avg   max   min  Avg   max
3             2    2.82  3     1    2.06  3     1    1.89  2     1    2.06  3     1    1.89  2
4             3    3.94  4     2    3.05  4     2    2.97  3     2    3.05  4     2    2.97  3
5             4    4.95  5     4    4.08  5     3    3.99  4     4    4.08  5     3    3.99  4
6             5    5.99  6     5    5.01  6     5    5.00  5     5    5.01  6     5    5.00  5
Table 4. Experimental results for decision tables from Reference [20]—number of realizable nodes.

Decision Table T    L^(1)(T)   L^(2)(T)      L^(3)(T)   L^(4)(T)      L^(5)(T)
balance-scale          501          5234        499          5234        499
breast-cancer          161        18,061        159        24,226        159
cars                   396        65,250        391        65,250        391
hayes-roth-data         52           317         52           338         52
nursery               1066    12,625,955       1061    12,625,955       1061
soybean-small            6          4839          6        21,251          6
tic-tac-toe            244       154,311        244       468,447        244
zoo-data                17          1370         17          2847         17
Average                305     1,609,417        304     1,651,694        304
Table 5. Experimental results for Boolean functions—number of realizable nodes.

Number of     L^(1)               L^(2)                    L^(3)               L^(4)                    L^(5)
Variables n   min  Avg    max     min   Avg      max       min  Avg    max     min   Avg      max       min  Avg    max
3             5    8.41   13      5     12.38    22        5    7.40   12      5     12.38    22        5    7.40   12
4             9    16.26  25      14    43.89    66        8    14.59  25      14    43.89    66        8    14.59  25
5             17   30.42  41      113   201.95   283       17   27.83  39      113   201.95   283       17   27.83  39
6             49   58.94  77      638   1057.16  1406      46   54.13  71      638   1057.16  1406      46   54.13  71
Table 6. Experimental results for decision tables from Reference [20]—number of realizable terminal nodes.

Decision Table T    L_t^(1)(T)  L_t^(2)(T)   L_t^(3)(T)  L_t^(4)(T)   L_t^(5)(T)
balance-scale          401           4369        401           4369        401
breast-cancer           99         15,577         99         21,208         99
cars                   290         50,260        290         50,260        290
hayes-roth-data         36            240         36            259         36
nursery                759      9,643,103        759      9,643,103        759
soybean-small            4           4508          4         19,963          4
tic-tac-toe            155        125,604        155        401,862        155
zoo-data                10           1217         10           2560         10
Average                219      1,230,610        219      1,267,948        219
Table 7. Experimental results for Boolean functions—number of realizable terminal nodes.

Number of     L_t^(1)             L_t^(2)                  L_t^(3)             L_t^(4)                  L_t^(5)
Variables n   min  Avg    max     min   Avg     max        min  Avg    max     min   Avg     max        min  Avg    max
3             3    4.70   7       4     8.83    14         3    4.70   7       4     8.83    14         3    4.70   7
4             5    8.63   13      11    31.53   45         5    8.63   13      11    31.53   45         5    8.63   13
5             9    15.71  21      86    145.68  200        9    15.71  21      86    145.68  200        9    15.71  21
6             25   29.97  39      485   770.52  1005       25   29.97  39      485   770.52  1005       25   29.97  39
Table 8. Experimental results for decision tables from Reference [20]—number of working nodes.

Decision Table T    L_w^(1)(T)  L_w^(2)(T)   L_w^(3)(T)  L_w^(4)(T)   L_w^(5)(T)
balance-scale          100            865         98            865         98
breast-cancer           49           2415         45           2926         45
cars                   104         14,981         99         14,981         99
hayes-roth-data         16             77         14             77         14
nursery                286      2,980,719        281      2,980,719        281
soybean-small            2            330          2           1281          2
tic-tac-toe             88         27,867         85         65,104         86
zoo-data                 7            151          7            284          7
Average                 82        378,426         79        383,280         79
Table 9. Experimental results for Boolean functions—number of working nodes.

Number of     L_w^(1)             L_w^(2)                  L_w^(3)             L_w^(4)                  L_w^(5)
Variables n   min  Avg    max     min   Avg     max        min  Avg    max     min   Avg     max        min  Avg    max
3             2    3.70   6       1     3.55    8          1    2.58   4       1     3.55    8          1    2.58   4
4             4    7.63   12      3     12.36   21         3    5.62   9       3     12.36   21         3    5.62   9
5             8    14.71  20      27    56.25   83         8    11.38  15      27    56.25   83         8    11.38  15
6             24   28.97  38      153   286.52  401        19   22.76  29      153   286.52  401        19   22.76  29
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
