3.1. MKM Algorithm
Before learning a large-scale sparse BN structure, the nodes in the network must first be partitioned. The K-means algorithm is simple and transferable, and it is more efficient than other blocking algorithms. Therefore, in this paper, a mutual-information-based K-means algorithm (MKM algorithm) is proposed on the basis of K-means.
The conventional K-means algorithm is ill-suited to blocking a BN for two reasons. First, K-means uses Euclidean distance to measure the similarity between nodes; for two nodes in a BN, however, Euclidean distance measures the similarity between their states, which does not correspond to the physical meaning of the problem we are studying. Second, during training, conventional K-means takes the mean of all data samples in each cluster as the new cluster center. This center may not exist as a node in the BN, so merging other nodes into the community of such a virtual point has no practical significance.
Given these two shortcomings, this paper improves the K-means algorithm and proposes an improved K-means algorithm (MKM algorithm) to adapt it to the task of blocking a BN.
Mutual information [24] is used as the measure of the dependency relationship between nodes. Mutual information quantifies how much information two variables share under their joint distribution, that is,
$$I(X;Y) = \sum_{x \in X}\sum_{y \in Y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}.$$
This function expresses the intrinsic dependence between the joint distribution $p(x,y)$ of $X$ and $Y$ and the product distribution $p(x)p(y)$ obtained under the assumption that $X$ and $Y$ are independent. Hence, mutual information can be used as an effective measure of the relationship between different variables, and of the correlation between nodes in a BN. If mutual information is to replace Euclidean distance as a distance measure, it must satisfy the following basic properties:
- (1) Non-negativity: $I(X;Y) \ge 0$;
- (2) Identity: $I(X;Y) = 0$ if and only if $X$ and $Y$ are independent;
- (3) Symmetry: $I(X;Y) = I(Y;X)$;
- (4) Directivity: for a Markov chain $X \to Y \to Z$, $I(X;Z) \le I(X;Y)$.
Actually, mutual information satisfies all these properties:
- (a) Non-negativity: mutual information considers the random variables X and Y as a whole and observes the problem in the average sense, so the average mutual information is never negative: $I(X;Y) \ge 0$.
- (b) Identity: $I(X;Y) = 0$ if and only if X and Y are independent random variables. When X and Y are independent, $p(x,y) = p(x)p(y)$, therefore $I(X;Y) = 0$.
- (c) Symmetry: $I(X;Y)$ and $I(Y;X)$ merely stand on different ground; the amount of information about X extracted from Y equals the amount of information about Y extracted from X. From another perspective, $I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = I(Y;X)$.
- (d) Directivity: if a Markov chain $X \to Y \to Z$ is formed, the average mutual information between the input message and the output message tends to decrease as the number of processing stages increases, that is, $I(X;Z) \le I(X;Y)$ and $I(X;Z) \le I(Y;Z)$.
It can be seen that mutual information can measure the dependency relationship between nodes in a BN. Using mutual information as the metric, we can block the network effectively and with practical significance based on the correlation between nodes, while the complexity of MKM is $O(nkt)$, in which $n$, $k$, $t$ represent the number of variables, blocks, and iterations, respectively.
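As a quick check that these properties hold for empirical estimates, the following sketch (ours, not from the paper) computes the plug-in mutual information of two discrete sample vectors and exercises non-negativity, identity, and symmetry on toy data:

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) (in nats) for two discrete
    sample vectors of equal length."""
    mi = 0.0
    for xv in np.unique(x):
        px = np.mean(x == xv)
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * np.mean(y == yv)))
    return mi

rng = np.random.default_rng(0)
x = rng.integers(0, 2, 1000)
flip = rng.random(1000) < 0.1
y = np.where(flip, 1 - x, x)          # noisy copy of x: strongly dependent
z = rng.integers(0, 2, 1000)          # generated independently of x
print(mutual_information(x, y))       # clearly positive
print(mutual_information(x, z))       # close to 0, never negative
print(abs(mutual_information(x, y) - mutual_information(y, x)))  # symmetry: ~0
```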
As we have seen, the new center generated by the K-means algorithm through constant iteration is probably not a real node in the cluster. Moreover, if some anomalous nodes lie relatively far from the center, the recalculated center may deviate from the true center of the cluster. For a BN, such a new center would be virtual, which is not applicable. Therefore, this paper draws on the idea of the K-center node (K-medoids) algorithm and repeatedly replaces central nodes with non-central nodes, trying to find a better center node and thereby improve the quality of clustering. This idea also reduces the influence of anomalous nodes on clustering.
This paper uses a mutual-information cost function to measure the quality of clustering: the total cost of a subset is the sum of the mutual information of all nodes in the subset. During iteration, if the cost of the new subset is greater than the cost of the original subset, the new center node replaces the old center node, and this repeats until the centers no longer change.
According to the above introduction, the flowchart of the mutual information based K-means algorithm (MKM algorithm) is shown in Figure 1.
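To make the MKM procedure concrete, here is a minimal Python sketch of the idea, reusing the `mutual_information` helper above: it clusters the columns (variables) of a data matrix with pairwise MI as the similarity measure and keeps centers restricted to real nodes, in the K-medoids spirit. The function name `mkm` and its interface are our own illustration, not the paper's implementation:

```python
import numpy as np

def mkm(data, k, max_iter=50, seed=0):
    """K-medoids-style blocking of the columns (variables) of `data`,
    with pairwise mutual information as the similarity measure."""
    rng = np.random.default_rng(seed)
    n = data.shape[1]
    # Pairwise MI between all variables (symmetric, non-negative)
    mi = np.array([[mutual_information(data[:, i], data[:, j])
                    for j in range(n)] for i in range(n)])
    centers = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Assign every node to the center it shares the most MI with
        labels = np.argmax(mi[:, centers], axis=1)
        new_centers = centers.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size == 0:
                continue
            # Cost of a candidate center: total MI to its cluster;
            # the real node maximizing it becomes the new medoid
            cost = mi[np.ix_(members, members)].sum(axis=1)
            new_centers[c] = members[np.argmax(cost)]
        if np.array_equal(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Because the centers are always actual variables of the network, every merge decision retains the physical meaning discussed above.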
3.2. BLMKM Algorithm
After finding blocks with the MKM algorithm, consider the sparseness of large-scale networks: variable connections between blocks are sparse, while variable connections within blocks are tight. Therefore, we can first find the undirected edges between blocks through the network skeleton, then assume directions for these edges to obtain all possible graph structures among the blocks, and finally perform exact learning on each in turn to get the optimal BN. This is the research idea of the BLMKM algorithm, which reduces the complexity of learning large-scale networks.
This paper uses the MMPC algorithm [25] to obtain the undirected skeleton of the network. The MMPC algorithm finds each node's neighbors (its set of parents and children, denoted $PC$) from the input data and thereby approximates the network structure. For a node $T$, the potential neighbors are all other nodes; following the max-min heuristic, the node $X$ that maximizes the minimum association with $T$ is selected and added to the $PC$ set of $T$. After the preliminary $PC$ set is obtained, for each node $X$ in it, if $X$ and $T$ are conditionally independent given some subset of the $PC$ set, then $X$ is removed (which proves that $X$ and $T$ are not in a parent–child relationship). The MMPC algorithm finally returns a 0–1 matrix representing the network skeleton, which we denote $S$. For larger networks, the many independence tests may take a lot of time to find a more accurate skeleton. The complexity of MMPC is $O(n \cdot |PC| \cdot t)$ [26], where $n$ is the number of variables, $PC$ represents the set of parent-and-child sets of all the nodes, and $t$ is the number of internal iterations in MMPC.
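The following didactic sketch shows the two phases of an MMPC-style neighbor search for one target node, again reusing the `mutual_information` helper above. A simple conditional-MI threshold `alpha` stands in for the statistical independence test used in practice, and all names are hypothetical:

```python
from itertools import combinations

def assoc(data, x, y, cond=()):
    """Conditional mutual information I(X; Y | cond), estimated by
    averaging plug-in MI over the strata of the conditioning set."""
    if not cond:
        return mutual_information(data[:, x], data[:, y])
    keys = [tuple(row) for row in data[:, list(cond)]]
    total, n = 0.0, len(data)
    for key in set(keys):
        idx = [i for i, k_ in enumerate(keys) if k_ == key]
        total += len(idx) / n * mutual_information(data[idx, x], data[idx, y])
    return total

def mmpc_skeleton(data, target, alpha=0.01):
    """Sketch of MMPC's forward (max-min) and backward phases."""
    n = data.shape[1]
    pc = []
    changed = True
    while changed:                      # forward: grow the PC set
        changed = False
        candidates = [x for x in range(n) if x != target and x not in pc]
        best, best_val = None, alpha
        for x in candidates:
            # minimum association with the target over subsets of pc
            min_assoc = min(assoc(data, target, x, s)
                            for r in range(len(pc) + 1)
                            for s in combinations(pc, r))
            if min_assoc > best_val:    # max-min heuristic
                best, best_val = x, min_assoc
        if best is not None:
            pc.append(best)
            changed = True
    for x in list(pc):                  # backward: remove false positives
        rest = [y for y in pc if y != x]
        if any(assoc(data, target, x, s) < alpha
               for r in range(len(rest) + 1)
               for s in combinations(rest, r)):
            pc.remove(x)
    return pc
```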
After using the MKM algorithm and the MMPC algorithm to obtain the blocks and the network skeleton, respectively, the schematic diagram of the network at this stage is shown in Figure 2.
This undirected structure is the network skeleton obtained by the MMPC algorithm. The three blue dashed circles represent the blocks obtained by the MKM algorithm: each block is an undirected subnetwork with many connecting edges (black edges), so the connections within a block are tight. The connections between the three blocks (red edges) are also determined by the network skeleton, and these between-block connections are sparse.
Next, directions are assumed for the undirected edges between the blocks in order to find all possible graph structures among the blocks. If there are $m$ undirected edges (red edges) between blocks, there are $2^m$ combinations of arrow directions, and hence up to $2^m$ possible graph structures among all the blocks. These possible graph structures are stored in turn for later structure learning. This is what we call the combine function.
In Figure 2, there are two red undirected edges among the blocks, leading to the four possible graph structures shown in Figure 3.
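A minimal sketch of the combine function under these assumptions, with edges given as node pairs; the pairs (2, 5) and (7, 8) follow the example discussed below:

```python
from itertools import product

def combine(between_edges):
    """Enumerate all 2^m orientations of the m undirected edges
    between blocks. Each edge (u, v) becomes either u->v or v->u."""
    structures = []
    for dirs in product((0, 1), repeat=len(between_edges)):
        structures.append([(u, v) if d == 0 else (v, u)
                           for (u, v), d in zip(between_edges, dirs)])
    return structures

# The two red edges of Figure 2 yield four orientation patterns
print(combine([(2, 5), (7, 8)]))
```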
We need to learn all the possible graph structures in Figure 3 in sequence and select the best BN by the score function. This strategy ensures that the resulting network is globally optimal: the candidate BN obtained after the structural learning of each possible graph structure is globally optimal for that structure, so the best BN can then be selected by the score function. As shown in Figure 3, for the first possible graph structure, when performing structural learning on the block composed of nodes 5, 6, and 7, the score of node 5 must consider node 2 as a candidate parent. Similarly, when learning the block composed of nodes 8 and 9, the score of node 8 must consider node 7 as a candidate parent. This ensures that possible parents outside a block are not missed after blocking. The network score is thus calculated globally; the structure learned is globally optimal, and the network finally found is also globally optimal.
Dynamic programming is used for structural learning. This is an exact learning algorithm that divides the BN structure learning process into two parts, the parent graphs and the order graph, and solves the problem recursively, using the MDL score to measure how well a candidate structure matches the data.
The MDL score trades off data fit against network complexity. Let $r_i$ be the number of states of node $X_i$, let $N_{pa_i}$ be the number of occurrences of the parent configuration $pa_i$, and let $N_{x_i,pa_i}$ be the number of simultaneous occurrences of $x_i$ and $pa_i$ in the $N$ data samples. The first term in Equation (5) is based on entropy, and the second term is a penalty term for network complexity. The MDL score of a structure can then be expressed as follows [27]:
$$\mathrm{MDL}(X_i \mid PA_i) = H(X_i \mid PA_i) + \frac{\log N}{2} K(X_i \mid PA_i), \quad (5)$$
where
$$H(X_i \mid PA_i) = -\sum_{x_i, pa_i} N_{x_i, pa_i} \log \frac{N_{x_i, pa_i}}{N_{pa_i}}, \qquad K(X_i \mid PA_i) = (r_i - 1)\prod_{X_l \in PA_i} r_l.$$
The MDL score is decomposable [28]; that is, the score of the entire structure is obtained by summing the local scores over the individual variables: $\mathrm{MDL}(G) = \sum_{i} \mathrm{MDL}(X_i \mid PA_i)$.
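A direct transcription of Equation (5) and its decomposable total into Python, assuming integer-coded discrete data columns and natural logarithms; the helper names are ours:

```python
import numpy as np
from collections import Counter

def mdl_local(data, i, parents):
    """MDL(X_i | PA_i) = H(X_i | PA_i) + (log N / 2) * K(X_i | PA_i),
    computed from the counts defined in the text."""
    N = len(data)
    r = [len(np.unique(data[:, j])) for j in range(data.shape[1])]
    joint = Counter(tuple(row) for row in data[:, [i] + list(parents)])
    pa = Counter(tuple(row) for row in data[:, list(parents)])
    h = 0.0
    for key, n_xp in joint.items():          # entropy term
        n_p = pa[key[1:]] if parents else N
        h -= n_xp * np.log(n_xp / n_p)
    k = (r[i] - 1) * int(np.prod([r[j] for j in parents]))
    return h + 0.5 * np.log(N) * k           # complexity penalty

def mdl_score(data, structure):
    """Decomposable MDL of a whole structure {node: parent list}."""
    return sum(mdl_local(data, i, ps) for i, ps in structure.items())
```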
The dynamic programming algorithm starts with an empty network, recursively adds leaf nodes until all variables have been added, and finally recovers the optimal network.
This paper uses the order graph and parent graphs to show how the dynamic programming algorithm recursively decomposes the problem to find the optimal network.
Figure 4 shows the order graph of four variables. The order graph of $n$ nodes has $2^n$ subsets. Every node in layer $l$ has $l$ precursor nodes in the previous layer, and layer $l$ contains $\binom{n}{l}$ nodes, where $\binom{n}{l}$ is the binomial coefficient.
In the order graph, each node selects the precursor that makes the subnetwork optimal according to the MDL score. Therefore, when evaluating a node $U$ in the order graph, each $X \in U$ is taken as a leaf node, $U \setminus \{X\}$ is taken as a subnetwork, and the score of the new subnetwork is $score(U \setminus \{X\}) + BestScore(X, U \setminus \{X\})$. The parent set and the score of the optimal leaf node are kept for the subnetwork. After the precursors of all nodes in the order graph have been considered, every node contains the optimal subnetwork of its corresponding variable set. The path from the root node to a single leaf node represents the order in which variables are added to the network, which is why it is called the order graph. This order partly determines the network structure; the rest of the structure is determined by the parent set of each node. To find the optimal parent set of each node, the parent graph is used to calculate $BestScore(X, U \setminus \{X\})$.
Figure 5 shows the parent graph of a node $X$. Each node corresponds to one parent graph. Figure 5 contains the subsets of the $n-1$ other nodes, with $2^{n-1}$ subsets in total. Each node in the parent graph represents a mapping from a candidate parent set to the subset of it that optimizes the MDL score. All nodes of the same layer have the same number of possible parent nodes: every node in layer $l$ has $l$ precursor nodes in the previous layer, and layer $l$ contains $\binom{n-1}{l}$ nodes. To evaluate a node with candidate parent set $P$ of node $X$, for each $Y \in P$, consider $BestScore(X, P \setminus \{Y\})$ and $score(X \mid P)$; the best of these scores is stored as $BestScore(X, P)$. By storing the optimal score of each candidate parent set, after all precursors have been considered, the parent graph contains, for every subset of the candidate parents of variable $X$, the candidate parent set with the optimal score.
In this paper, the network skeleton $S$ obtained by MMPC is used to prune Figure 5. Take Figure 6 as an example and assume that the neighbors of a node $X$ found by the MMPC algorithm are only two nodes, so that the $PC$ set of $X$ contains just those two nodes. In this case, the parent graph of $X$ prunes away every candidate set containing any other node. Similarly, the candidate parents in line 16 of Algorithm 1 are selected from the $PC$ set of $X$ instead of from all other nodes except $X$. So we can use $S$ to prune the structure. The original complexity of dynamic programming is $O(n \cdot 2^n)$; with MKM blocking the whole network and the skeleton $S$ obtained by MMPC pruning the structure, this complexity can be reduced to a low level, and may even degenerate from exponential to polynomial. These techniques significantly reduce time complexity compared with non-blocking structure learning algorithms.
The flow of the pruned dynamic programming algorithm is shown as Algorithm 1, which follows the original dynamic programming algorithm in [20] and improves its pruning strategy.
Algorithm 1 Pruned Dynamic Programming Algorithm

procedure EvaluateOrderGraph
1. for each $X \in V$ do
2.  $score(\{X\}) \leftarrow BestScore(X, \emptyset)$
3. end for
4. for each node $U$ in the order graph do
5.  for each $X \in U$ do
6.   $s \leftarrow BestScore(X, U \setminus \{X\})$
7.   $g \leftarrow score(U \setminus \{X\}) + s$
8.   if $g < score(U)$ then
9.    $score(U) \leftarrow g$, $leaf(U) \leftarrow X$
10.   end if
11.   if $U = V$ then $optimal \leftarrow score(V)$
12.  end for
13. end for
14. $order \leftarrow$ the sequence of $leaf(\cdot)$ choices traced back from $V$
15. return $optimal$, $order$
end procedure

procedure EvaluateParentGraphs
16. for each node $X \in V$ do
17.  for each candidate parent set $P \subseteq PC(X)$ and each $Y \in P$ do
18.   if $BestScore(X, P)$ is null then
19.    $BestScore(X, P) \leftarrow score(X \mid P)$
20.   end if
21.   if $BestScore(X, P \setminus \{Y\}) < BestScore(X, P)$ then
22.    $BestScore(X, P) \leftarrow BestScore(X, P \setminus \{Y\})$
23.   end if
24.   if $BestScore(X, P)$ was updated then record the corresponding parent set
25.  end for
26. end for
27. $BestParents(X, P) \leftarrow$ the recorded optimal parent set for each pair
28. return $BestScore$, $BestParents$
end procedure

procedure main
29. for each layer $l = 1, \ldots, n$ do
30.  evaluate layer $l$ of the parent graphs and of the order graph
31. end for
end procedure
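For concreteness, here is a compact Python rendition of the search that Algorithm 1 describes, reusing the hypothetical `mdl_local` helper above and assuming integer variable indices. Parent-set backtracking and the between-block arc constraints are omitted, so this is a sketch of the recursion rather than the paper's implementation:

```python
from itertools import combinations

def exact_learn(variables, data, pc):
    """Pruned DP sketch: parent graphs give BestScore(X, C) for
    C subseteq PC(X); the order graph then computes
    score(U) = min over X in U of score(U\{X}) + BestScore(X, .)."""
    # Parent graphs, pruned to subsets of PC(X) as in Figure 6
    best = {}
    for x in variables:
        for r in range(len(pc[x]) + 1):
            for c in combinations(pc[x], r):
                c = frozenset(c)
                s = mdl_local(data, x, sorted(c))   # score(X | C)
                for y in c:                          # predecessors C \ {y}
                    s = min(s, best[(x, c - {y})])
                best[(x, c)] = s
    # Order graph over all subsets of the variables
    score, leaf = {frozenset(): 0.0}, {}
    for r in range(1, len(variables) + 1):
        for u in combinations(sorted(variables), r):
            u = frozenset(u)
            for x in u:
                c = frozenset(pc[x]) & (u - {x})
                g = score[u - {x}] + best[(x, c)]
                if u not in score or g < score[u]:
                    score[u], leaf[u] = g, x   # best leaf for subset u
    return score[frozenset(variables)], leaf
```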
In summary, this paper proposes the BLMKM algorithm for BN modeling of high-dimensional data, which can be summarized in the following steps. In the first step, the data are blocked by the proposed MKM algorithm; in the second step, the MMPC algorithm is used to determine the network skeleton; in the third step, the combine function assumes directions for the edges between blocks to find all possible graph structures among the blocks; in the fourth step, pruned dynamic programming learning is performed on each possible graph structure. Finally, the optimal network is selected using the BIC score. The BLMKM algorithm is given as Algorithm 2.
Algorithm 2 BLMKM Algorithm
Input: dataset $D$, number of clusters $k$
Output: optimal BN and score
1. $blocks \leftarrow \mathrm{MKM}(D, k)$
2. $S \leftarrow \mathrm{MMPC}(D)$
3. $\{G_1, \ldots, G_m\} \leftarrow \mathrm{combine}(blocks, S)$
4. for $i = 1, \ldots, m$ do
5.  $(BN_i, score_i) \leftarrow \mathrm{PrunedDP}(G_i, S)$
6. end for
7. Optimal BN $\leftarrow$ the $BN_i$ with the best $score_i$
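Finally, a hypothetical end-to-end driver wiring the sketches above together in the order of Algorithm 2. Note one simplification: here a fixed between-block arc only removes the reverse parent candidate rather than forcing the arc to exist, and the search is only practical at toy scale:

```python
def blmkm(data, k):
    """Hypothetical driver mirroring Algorithm 2, built on the sketches
    above (mkm, mmpc_skeleton, combine, exact_learn)."""
    n = data.shape[1]
    labels, _ = mkm(data, k)                             # step 1: blocks
    pc = {x: mmpc_skeleton(data, x) for x in range(n)}   # step 2: skeleton
    # Step 3: skeleton edges that cross block boundaries
    between = [(u, v) for u in range(n) for v in pc[u]
               if u < v and labels[u] != labels[v]]
    best_bn, best_score = None, float("inf")
    for orientation in combine(between):                 # step 4: candidates
        pc_i = {x: set(pc[x]) for x in range(n)}
        for u, v in orientation:      # fixed arc u -> v between blocks:
            pc_i[u].discard(v)        # forbid v as a parent of u
        score, leaf = exact_learn(list(range(n)), data,
                                  {x: sorted(s) for x, s in pc_i.items()})
        if score < best_score:                           # step 5: keep best
            best_bn, best_score = leaf, score
    return best_bn, best_score
```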
The BLMKM algorithm slightly increases the learning time of the entire network because the skeleton must be learned, but blocking the network before learning greatly reduces space consumption and improves the efficiency of the algorithm. Especially for the dynamic programming algorithm, the divide-and-conquer strategy significantly reduces the size of the networks that must be learned, while MMPC helps to prune the parent graphs and avoid unnecessary calculation. The BLMKM algorithm proposed in this paper can therefore improve running speed while preserving accuracy for BN modeling of high-dimensional data.