# Consistency of Learning Bayesian Network Structures with Continuous Variables: An Information Theoretic Approach

## Abstract


## 1. Introduction

1. Compute the local scores for the nonempty subsets of $\{X^{(1)},\cdots,X^{(N)}\}$; for example, if $N=3$, the seven quantities $Q_X^n(x^n),\cdots,Q_{XYZ}^n(x^n,y^n,z^n)$ are obtained; and
2. Find a BN structure that maximizes the global score among the $M(N)\,(\le 3^{\binom{N}{2}})$ candidate BN structures; there are at most $3^{\binom{N}{2}}$ DAGs in the case of $N$ variables, since each of the $\binom{N}{2}$ pairs of variables is either unlinked or linked by an edge in one of two directions; for example, if $N=3$, the eleven global scores are computed and a structure attaining the largest is chosen.
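As a minimal sketch of the bookkeeping in step (1), the subset enumeration can be written as follows; `nonempty_subsets` is a hypothetical helper name, and the actual local scores $Q^n$ are the universal measures defined later in the paper:

```python
from itertools import combinations

def nonempty_subsets(variables):
    """Enumerate all nonempty subsets of the variable set.

    For N variables there are 2**N - 1 such subsets, so seven
    local scores are needed when N = 3, as in the text."""
    subs = []
    for r in range(1, len(variables) + 1):
        subs.extend(combinations(variables, r))
    return subs

print(len(nonempty_subsets(["X", "Y", "Z"])))  # 7 local scores for N = 3
```

Each subset then receives one local score, and a global score of a DAG is assembled from the local scores of each variable together with its parent set.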

## 2. Preliminaries

#### 2.1. Learning the Bayesian Structure for Discrete Variables and Its Consistency

**Table 1.** Three-variable case: "+" and "0" denote $D(P^*\|P)>0$ and $D(P^*\|P)=0$, respectively.

|    | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|----|---|---|---|---|---|---|---|---|---|----|----|
| 1  | * | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0  | 0  |
| 2  | + | * | + | + | + | 0 | 0 | + | + | +  | 0  |
| 3  | + | + | * | + | 0 | + | 0 | + | + | +  | 0  |
| 4  | + | + | + | * | 0 | 0 | + | + | + | +  | 0  |
| 5  | + | + | + | + | * | + | + | + | + | +  | 0  |
| 6  | + | + | + | + | + | * | + | + | + | +  | 0  |
| 7  | + | + | + | + | + | + | * | + | + | +  | 0  |
| 8  | + | + | + | + | + | + | + | * | + | +  | 0  |
| 9  | + | + | + | + | + | + | + | + | * | +  | 0  |
| 10 | + | + | + | + | + | + | + | + | + | *  | 0  |
| 11 | + | + | + | + | + | + | + | + | + | +  | *  |
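Each "+" entry in Table 1 asserts that the Kullback–Leibler divergence $D(P^*\|P)$ between the true distribution and the candidate structure's closest distribution is strictly positive. A minimal numerical sketch of this quantity for finite distributions, with a hypothetical function name:

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p||q) for finite distributions
    given as lists of probabilities (assumes q[i] > 0 wherever p[i] > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# D(p||q) = 0 if and only if p == q; otherwise it is strictly positive.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))      # 0.0
print(kl_divergence(p, q) > 0)  # True
```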

#### 2.2. Universal Measures for Continuous Variables

**Figure 2.** Quantization at level $j$: $x^n=(x_1,\cdots,x_n)\mapsto (a_1^{(j)},\cdots,a_n^{(j)})$.
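A concrete instance of such a quantization, assuming the domain $[0,1)$ is split into $2^j$ equal cells with the left endpoint as representative (one possible choice; the construction only requires a refining sequence of partitions $A_0, A_1, \cdots$):

```python
def quantize(xs, j):
    """Quantize each x in [0, 1) at level j: the interval is split into
    2**j equal cells and x is mapped to its cell's representative
    (here, the left endpoint of the cell containing x)."""
    cells = 2 ** j
    return [int(x * cells) / cells for x in xs]

print(quantize([0.13, 0.68, 0.95], 2))  # [0.0, 0.5, 0.75]
```

Raising $j$ by one refines every cell into two, so the level-$j$ partition always refines the level-$(j-1)$ one, as the figure depicts.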

**Proposition 1** ([10]). For any density function $f_X$ such that $D(f_X\|f_j)\to 0$ as $j\to\infty$,

**Proposition 2** ([11]). For any generalized density function $f_Y$,

## 3. Contributions

#### 3.1. The Hannan and Quinn Principle

**Theorem 1.** If $X \perp\!\!\!\perp Y \mid Z$:

**Lemma 1** ([2]). Let $\{U_k\}_{k=1}^n$ be identically distributed random variables with zero mean and unit variance, and let $S_n := \sum_{k=1}^n U_k$. Then, with probability one,

**Theorem 2.** We define $R_Z^n(z^n)$, $R_{XZ}^n(x^n,z^n)$, $R_{YZ}^n(y^n,z^n)$ and $R_{XYZ}^n(x^n,y^n,z^n)$ by:

**Proof.** We note two properties:

- $R_{XZ}^n(x^n,z^n)\,R_{YZ}^n(y^n,z^n) \ge R_{XYZ}^n(x^n,y^n,z^n)\,R_Z^n(z^n)$ is equivalent to (7); and
- $\lim_{n\to\infty}\frac{1}{n}\log\frac{R_{XYZ}^n(x^n,y^n,z^n)\,R_Z^n(z^n)}{R_{XZ}^n(x^n,z^n)\,R_{YZ}^n(y^n,z^n)} = \lim_{n\to\infty}\frac{1}{n}\log\frac{Q_{XYZ}^n(x^n,y^n,z^n)\,Q_Z^n(z^n)}{Q_{XZ}^n(x^n,z^n)\,Q_{YZ}^n(y^n,z^n)} = I(X,Y|Z)$
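The limit appearing in the proof is the conditional mutual information, which for discrete samples can be estimated by the plug-in formula $\sum_{x,y,z} p(x,y,z)\log\frac{p(x,y,z)\,p(z)}{p(x,z)\,p(y,z)}$. A sketch with a hypothetical helper name:

```python
import math
from collections import Counter

def empirical_cmi(triples):
    """Plug-in estimate of the conditional mutual information I(X;Y|Z)
    from a sample of (x, y, z) triples, using empirical frequencies in
    sum p(x,y,z) * log[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ]."""
    n = len(triples)
    pxyz = Counter(triples)
    pz = Counter(z for _, _, z in triples)
    pxz = Counter((x, z) for x, _, z in triples)
    pyz = Counter((y, z) for _, y, z in triples)
    total = 0.0
    for (x, y, z), c in pxyz.items():
        total += (c / n) * math.log((c / n) * (pz[z] / n)
                                    / ((pxz[(x, z)] / n) * (pyz[(y, z)] / n)))
    return total

# The sample factorizes given z (x and y are functions of z), so the
# estimate is zero, matching X independent of Y given Z.
sample = [(0, 0, 0), (0, 0, 0), (1, 1, 1), (1, 1, 1)]
print(abs(empirical_cmi(sample)) < 1e-12)  # True
```

When $X$ and $Y$ remain dependent given $Z$, the estimate is strictly positive, which is the quantity the "+" entries of Table 1 rely on.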

#### 3.2. Consistency for Continuous Variables

**Proposition 3.** For any generalized density function $f$:

**Algorithm 1** Calculating $\hat{g}^n$.

**(A)** Input: $x^n \in A^n$; Output: $\hat{g}_X^n(x^n)$

1. For each $k=1,\cdots,K$, set $\log g_k^n(x^n) := 0$.
2. For each $k=1,\cdots,K$ and each $a \in A_k$, set $c_k(a) := 0$.
3. For each $i=1,\cdots,n$:
   - (a) $A_0 = X$, $a_i^{(0)} = x_i$.
   - (b) For each $k=1,\cdots,K$:
     - i. Find $a_i^{(k)} \in A_k$ from $a_i^{(k-1)} \in A_{k-1}$.
     - ii. $\log g_k^n(x^n) := \log g_k^n(x^n) + \log\frac{c_k(a_i^{(k)}) + 1/2}{i-1+|A_k|/2} - \log \eta_X(a_i^{(k)})$.
     - iii. $c_k(a_i^{(k)}) := c_k(a_i^{(k)}) + 1$.
4. $\hat{g}_X^n(x^n) := \frac{1}{K}\sum_{k=1}^K g_k^n(x^n)$.

**(B)** Input: $x^n \in A^n$ and $y^n \in B^n$; Output: $\hat{g}_{XY}^n(x^n,y^n)$

1. For each $j,k=1,\cdots,K$, set $\log g_{j,k}^n(x^n,y^n) := 0$.
2. For each $j,k=1,\cdots,K$ and each $a \in A_j$ and $b \in B_k$, set $c_{j,k}(a,b) := 0$.
3. For each $i=1,\cdots,n$:
   - (a) $A_0 = X$, $B_0 = Y$, $a_i^{(0)} = x_i$, $b_i^{(0)} = y_i$.
   - (b) For each $j,k=1,\cdots,K$:
     - i. Find $a_i^{(j)} \in A_j$ and $b_i^{(k)} \in B_k$ from $a_i^{(j-1)} \in A_{j-1}$ and $b_i^{(k-1)} \in B_{k-1}$.
     - ii. $\log g_{j,k}^n(x^n,y^n) := \log g_{j,k}^n(x^n,y^n) + \log\frac{c_{j,k}(a_i^{(j)}, b_i^{(k)}) + 1/2}{i-1+|A_j||B_k|/2} - \log\left(\eta_X(a_i^{(j)})\,\eta_Y(b_i^{(k)})\right)$.
     - iii. $c_{j,k}(a_i^{(j)}, b_i^{(k)}) := c_{j,k}(a_i^{(j)}, b_i^{(k)}) + 1$.
4. $\hat{g}_{XY}^n(x^n,y^n) := \frac{1}{K^2}\sum_{j=1}^K\sum_{k=1}^K g_{j,k}^n(x^n,y^n)$.
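The update in step 3(b)ii of Algorithm 1 is the sequential Krichevsky–Trofimov estimator [12]. A single-level sketch with a hypothetical function name (the $\eta$ width correction and the mixture over levels are omitted):

```python
def kt_probability(seq, alphabet_size):
    """Sequential Krichevsky-Trofimov probability of a discrete sequence:
    at step i, symbol a is predicted with probability
    (c(a) + 1/2) / (i - 1 + |A|/2), where c(a) counts earlier
    occurrences of a -- the ratio in step 3(b)ii of Algorithm 1."""
    counts = {}
    prob = 1.0
    for i, a in enumerate(seq, start=1):
        prob *= (counts.get(a, 0) + 0.5) / (i - 1 + alphabet_size / 2)
        counts[a] = counts.get(a, 0) + 1
    return prob

print(kt_probability([0, 0, 1, 1], 2))  # 0.0234375 (= 3/128)
```

Mixing these probabilities over the quantization levels with uniform weight $1/K$ (or $1/K^2$ in the bivariate case) gives the estimators $\hat{g}_X^n$ and $\hat{g}_{XY}^n$ of steps 4.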

**Theorem 3.** With probability one as $n\to\infty$:

**Proof.** From Propositions 2 and 3, applied to two and three variables, and the law of large numbers, we have:

**Theorem 4.** With probability one as $n\to\infty$:

**Theorem 5.** With probability one as $n\to\infty$:

#### 3.3. The Number of Local Scores to be Computed

## 4. Concluding Remarks

## Appendix: Proof of Theorem 1

We use the expansion $(1+t)\log(1+t) = t + t^2/2 - t^3/\{6[1+\delta(t)t]^2\}$ with $0 < \delta(t) < 1$ and:

Let $V = (v_{xy})_{x\in X,\,y\in Y}$, where $u$ and $w$ are the corresponding column vectors. Hereafter, we arbitrarily fix $z \in Z$. Let $U = (u[0], u[1], \cdots, u[\alpha-1])$ with $u[0] = u$ and $W = (w[0], w[1], \cdots, w[\beta-1])$ with $w[0] = w$ be the matrices of eigenvectors, where $E_m$ is the identity matrix of dimension $m$.

Then ${}^tu\,V\,w = 0$, and for $\tilde{U} = (u[1], \cdots, u[\alpha-1])$ and $\tilde{W} = (w[1], \cdots, w[\beta-1])$, we have:

Since $U\,{}^tU = {}^tU\,U = E_\alpha$ and $W\,{}^tW = {}^tW\,W = E_\beta$, we obtain:

Define $r_{ij} := {}^tu[i]\,V\,w[j]$. Then, we can see that, for $r_{ij}$ with $i = 1, \cdots, \alpha-1$ and $j = 1, \cdots, \beta-1$, $E[r_{ij}] = 0$, and:

We evaluate $r_{ij}$ and the expectation of $\sigma_{ij}^2$, so that (15) implies:

Write $u[i] = (u_{x,i})_{x\in X}$ and $w[j] = (w_{y,j})_{y\in Y}$; then we can check that $E[Z_{i,j,k}] = 0$ and $V[Z_{i,j,k}] = 1$, where the expectation $E$ and variance $V$ are taken with respect to the examples $X^n = x^n$ and $Y^n = y^n$, and $I(A)$ takes the value one if the event $A$ is true and zero otherwise. We can easily check:

## References

- Rissanen, J. Modeling by shortest data description. *Automatica* **1978**, *14*, 465–471.
- Billingsley, P. *Probability and Measure*, 3rd ed.; Wiley: New York, NY, USA, 1995.
- Friedman, N.; Linial, M.; Nachman, I.; Pe'er, D. Using Bayesian networks to analyze expression data. *J. Comput. Biol.* **2000**, *7*, 601–620.
- Imoto, S.; Kim, S.; Goto, T.; Aburatani, S.; Tashiro, K.; Kuhara, S.; Miyano, S. Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. *J. Bioinform. Comput. Biol.* **2003**, *1*, 231–252.
- Zhang, K.; Peters, J.; Janzing, D.; Schölkopf, B. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the 2011 Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, 14–17 July 2011; pp. 804–813.
- Silander, T.; Myllymäki, P. A simple approach for finding the globally optimal Bayesian network structure. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, Arlington, VA, USA, 13–16 July 2006; pp. 445–452.
- Hannan, E.J.; Quinn, B.G. The determination of the order of an autoregression. *J. R. Stat. Soc. B* **1979**, *41*, 190–195.
- Suzuki, J. The Hannan–Quinn proposition for linear regression. *Int. J. Stat. Probab.* **2012**, *1*, 2.
- Suzuki, J. On strong consistency of model selection in classification. *IEEE Trans. Inf. Theory* **2006**, *52*, 4767–4774.
- Ryabko, B. Compression-based methods for nonparametric prediction and estimation of some characteristics of time series. *IEEE Trans. Inf. Theory* **2009**, *55*, 4309–4315.
- Suzuki, J. Universal Bayesian measures. In Proceedings of the 2013 IEEE International Symposium on Information Theory, Istanbul, Turkey, 7–12 July 2013; pp. 644–648.
- Krichevsky, R.E.; Trofimov, V.K. The performance of universal encoding. *IEEE Trans. Inf. Theory* **1981**, *27*, 199–207.
- Cover, T.M.; Thomas, J.A. *Elements of Information Theory*, 2nd ed.; Wiley: New York, NY, USA, 2006.
- Suzuki, J. Learning Bayesian network structures when discrete and continuous variables are present. In Proceedings of the 2014 Workshop on Probabilistic Graphical Models, 17–19 September 2014; Lecture Notes in Artificial Intelligence, Volume 8754; Springer; pp. 471–486.
- Suzuki, J. Learning Bayesian belief networks based on the minimum description length principle: An efficient algorithm using the B&B technique. In Proceedings of the 13th International Conference on Machine Learning (ICML'96), Bari, Italy, 3–6 July 1996; pp. 462–470.
- De Campos, C.P.; Ji, Q. Efficient structure learning of Bayesian networks using constraints. *J. Mach. Learn. Res.* **2011**, *12*, 663–689.
- Akaike, H. A new look at the statistical model identification. *IEEE Trans. Autom. Control* **1974**, *19*, 716–723.
- Pearl, J. *Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference*; Morgan Kaufmann: San Mateo, CA, USA, 1988.

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

## Citation

Suzuki, J. Consistency of Learning Bayesian Network Structures with Continuous Variables: An Information Theoretic Approach. *Entropy* **2015**, *17*, 5752–5770. https://doi.org/10.3390/e17085752
