# Sequential Pattern Mining Algorithm Based on Text Data: Taking the Fault Text Records as an Example


## Abstract


## 1. Introduction

## 2. Literature Review

#### 2.1. Sequential Pattern Mining Algorithm

#### 2.2. Text Mining and Similarity Measurement

#### 2.3. Sequential Pattern Mining for Text Data

## 3. Text Similarity Measurement Based on Word Embedding Distance Model

#### 3.1. Text Data Preprocessing

#### 3.2. Analysis of the Classical Text Similarity Measurement Model

#### 3.2.1. Measurement of Text Similarity Based on the Bag-of-Words Model

The term frequency tf($t_i$, $d_j$) refers to the number of times a word $t_i$ appears in a given text $d_j$. The inverse document frequency (idf$_i$) calculation formula is as follows:

$$\mathrm{idf}_{i}=\log \frac{|D|}{|\{j:{t}_{i}\in {d}_{j}\}|}$$

where $|D|$ is the total number of texts in the corpus and the denominator is the number of texts that contain the word $t_i$ in the corpus. The similarity between two text vectors $A$ and $B$ is then measured by the cosine similarity:

$$\cos (A,B)=\frac{{\sum }_{i}{A}_{i}{B}_{i}}{\sqrt{{\sum }_{i}{A}_{i}^{2}}\sqrt{{\sum }_{i}{B}_{i}^{2}}}$$

where $A_i$ and $B_i$ represent the components of vectors $A$ and $B$, respectively.
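As a concrete illustration, the tf-idf weighting and cosine similarity described above can be sketched in a few lines of Python; the helper names `tfidf_vectors` and `cosine_similarity` are ours, not from the paper:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors for a list of tokenized documents (bag-of-words model)."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # idf_i = log(|D| / n_i), where n_i is the number of texts containing word i
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    idf = {w: math.log(n / df[w]) for w in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)  # tf(t_i, d_j): raw count of word t_i in text d_j
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vectors

def cosine_similarity(a, b):
    """cos(A, B) = sum_i(A_i * B_i) / (|A| * |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

A text sharing no words with another gets similarity 0, which is exactly the short-text weakness discussed below.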

#### 3.2.2. Measurement of Text Similarity Based on the Topic Model

#### 3.3. Word Embedding Distance Model

#### 3.3.1. Model Building

Each word in a text is first mapped to a word embedding vector $w_i$. Then, the similarity between texts can be measured. We want to use the distance between words to measure text similarity. It is obviously unreasonable to simply use the sum of the distances between the corresponding words as the text distance.

For each word $w_i$ in text $S_1$, there is a corresponding word $w_j$ in the text $S_2$: the distance $d_{ij}$ between $w_i$ and $w_j$ is the minimum distance between $w_i$ and all the words in $S_2$. Hence, $w_i$ and $w_j$ are called the minimum distance pair. Practically speaking, one needs to calculate the distance between $w_i$ and all the words in $S_2$, and pick out the minimum distance pair ($w_i$, $w_j$). All minimum distance pairs between texts $S_1$ and $S_2$ need to be determined, and the distances of all pairs are accumulated as the distance between the two texts $S_1$ and $S_2$. We will illustrate this process with an example. Assume that $S_1$ contains the words A, B and C, and $S_2$ contains D and E. The distances between A and D as well as A and E are compared, and the smaller distance $d_1$ is obtained. Similarly, $d_2$ and $d_3$ are obtained for B and C. The sum of $d_1$, $d_2$ and $d_3$ is the distance between $S_1$ and $S_2$ (shown in Figure 1).

In the model, $S_1$ and $S_2$ represent the dictionaries formed after the segmentation of the two texts, $d$ represents the distance between words, $x_{ij}$ is a 0–1 variable and $M$ is a big number.

$D_{Max}$ and $D_{Min}$ represent the maximum and minimum text distances in the data sets, respectively.

- (1) Obtain the word embedding matrices of texts $S_1$ and $S_2$ using the word2vec model, ${X}_{1}\in {\mathbb{R}}^{d\times m}$, ${X}_{2}\in {\mathbb{R}}^{d\times n}$;
- (2) Calculate the Euclidean distance between words to obtain the distance matrix;
- (3) Use this model to obtain the distance and similarity between $S_1$ and $S_2$.
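The steps above can be sketched with numpy. The toy matrices below stand in for word2vec output (real embeddings would come from a trained model), the one-directional minimum-distance-pair sum follows the Figure 1 example, and the min–max form of the similarity normalization with $D_{Max}$ and $D_{Min}$ is an illustrative assumption:

```python
import numpy as np

def text_distance(X1, X2):
    """Word embedding distance between two texts.

    X1 (d x m) and X2 (d x n) hold one word embedding per column.
    For every word in S1, find its minimum-distance pair in S2 and
    accumulate those minimum Euclidean distances, as in Figure 1.
    """
    # D[i, j] = Euclidean distance between word i of S1 and word j of S2
    diff = X1[:, :, None] - X2[:, None, :]
    D = np.sqrt((diff ** 2).sum(axis=0))
    # one minimum distance pair per word of S1, summed over S1
    return D.min(axis=1).sum()

def text_similarity(dist, d_max, d_min):
    """Min-max normalize a text distance into a [0, 1] similarity
    (an assumed reading of the D_Max / D_Min normalization)."""
    return (d_max - dist) / (d_max - d_min)
```

With `X1` holding three words at (0, 0), (1, 0), (2, 0) and `X2` holding (0, 0) and (1, 0), the minimum distances are 0, 0 and 1, so the text distance is 1.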

#### 3.3.2. Model Checking

## 4. Sequential Pattern Mining

#### 4.1. Concepts and Definitions

An event $e_i$, ${e}_{i}\in E$, means that ${e}_{i}$ belongs to the events set E. A sequence ${e}_{i}\to {e}_{j}$ indicates that event ${e}_{i}$ occurs before event ${e}_{j}$. The event window refers to the number of events occurring between ${e}_{i}$ and ${e}_{j}$. The support degree, sup(${e}_{i}\to {e}_{j}$), represents the frequency of the sequence ${e}_{i}\to {e}_{j}$ in the database. Any sequence ${e}_{i}\to {e}_{j}$ satisfying sup(${e}_{i}\to {e}_{j}$) $\ge $ min_sup will be seen as a sequential pattern.

#### 4.2. Algorithm Framework for Sequential Pattern Mining

#### 4.2.1. Similar Events Sets Mining

${X}_{ij}$ represents the similarity between ${e}_{i}$ and ${e}_{j}$. Obviously, when $i = j$, ${X}_{ij} = 1$.
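One way to turn the similarity matrix into similar events sets (SES) is to threshold it with min_sim into a 0–1 matrix and then merge events that are connected through it, as the SES merging tables suggest. The union–find sketch below is an assumed reading of that step, not the paper's exact procedure:

```python
def mine_similar_event_sets(sim, min_sim):
    """Threshold the similarity matrix sim into a 0-1 matrix, then merge
    events connected through it into similar events sets (SES)."""
    n = len(sim)
    parent = list(range(n))

    def find(i):
        # union-find root lookup with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= min_sim:  # X_ij = 1 after thresholding
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Raising min_sim produces smaller, stricter sets; lowering it merges more events into each SES.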

#### 4.2.2. Calculation of Support Degree Based on an Event Window and SES

Suppose $e_1$ belongs to the SES$_A$ [$e_1$, $e_3$] and $e_2$ belongs to SES$_B$ [$e_2$, $e_4$, $e_5$]. If max_win $\ge $ 3, the sequences with the same meaning as ${e}_{1}\to {e}_{2}$ are ${e}_{3}\to {e}_{4}$, ${e}_{3}\to {e}_{5}$, ${e}_{1}\to {e}_{4}$ and ${e}_{1}\to {e}_{5}$ (Figure 3). Therefore, the support degree of ${e}_{1}\to {e}_{2}$ is five.
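Under one consistent reading of the event window (at most max_win other events between the two occurrences), the support calculation can be sketched as follows; the event ordering used in the example call is constructed for illustration:

```python
def ses_support(events, ses_a, ses_b, max_win):
    """Support of the pattern SES_A -> SES_B in a chronologically
    ordered event list.

    An occurrence is counted whenever an event from SES_B appears
    within max_win + 1 positions after an event from SES_A, i.e.,
    with at most max_win other events in between (assumed window
    semantics).
    """
    count = 0
    for i, e in enumerate(events):
        if e in ses_a:
            count += sum(1 for f in events[i + 1 : i + 2 + max_win] if f in ses_b)
    return count
```

For the ordered events e1, e2, e3, e4, e5 with SES$_A$ = {e1, e3} and SES$_B$ = {e2, e4, e5}, a window of max_win = 3 counts exactly the five sequences listed above.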

#### 4.2.3. Sequence Pattern Mining Algorithm Process

## 5. Experimental Validation and Results

#### 5.1. Experimental Data

#### 5.2. Experimental Results

The word2vec model was trained with an initial learning rate of 0.025 and the parameters window = 5, min_count = 5, sg = 0 (the CBOW architecture), hs = 1 (hierarchical softmax) and iter = 5.

#### 5.3. Robustness Evaluation of the Algorithm

#### 5.4. The Effect of Threshold Levels on the Mining Results

## 6. Discussion

#### 6.1. Comparison with Existing Works

#### 6.2. Application in Business Activities and Decision

## 7. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

- Agrawal, R.; Srikant, R. Mining Sequential Patterns. In Proceedings of the 11th International Conference on Data Engineering, Taipei, Taiwan, 6–10 March 1995. [Google Scholar]
- Agrawal, R.; Srikant, R. Fast Algorithms for Mining Association Rules. In Proceedings of the International Conference on Very Large Data Bases, Vienna, Austria, 23–27 September 1995; pp. 487–499. [Google Scholar]
- Srikant, R.; Agrawal, R. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the International Conference on Extending Database Technology, Avignon, France, 25–29 March 1996; pp. 1–17. [Google Scholar]
- Zaki, M.J. Spade: An efficient algorithm for mining frequent sequences. Mach. Learn. **2001**, 42, 31–60. [Google Scholar] [CrossRef]
- Han, J.; Pei, J.; Mortazavi-Asl, B.; Chen, Q.; Dayal, U.; Hsu, M.C. FreeSpan: Frequent pattern-projected sequential pattern mining. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, 20–23 August 2000; pp. 355–359. [Google Scholar]
- Pei, J.; Han, J.; Mortazavi-Asl, B.; Pinto, H.; Chen, Q.; Dayal, U.; Hsu, M.C. Prefixspan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proceedings of the 17th International Conference on Data Engineering, Heidelberg, Germany, 2–6 April 2001; pp. 215–224. [Google Scholar]
- Yan, X.W.; Zhang, J.F.; Xun, Y.L. A parallel algorithm for mining constrained frequent patterns using MapReduce. Soft Comput. **2017**, 9, 2237–2249. [Google Scholar] [CrossRef]
- Xiao, Y.; Zhang, R.; Kaku, I. A new approach of inventory classification based on loss profit. Expert Syst. Appl. **2012**, 38, 9382–9391. [Google Scholar] [CrossRef]
- Xiao, Y.; Zhang, R.; Kaku, I. A new framework of mining association rules with time-window on real-time transaction database. Int. J. Innov. Comput. Inf. Control **2011**, 7, 3239–3253. [Google Scholar]
- Xiao, Y.; Tian, Y.; Zhao, Q. Optimizing frequent time-window selection for association rules mining in a temporal database using a variable neighbourhood search. Comput. Oper. Res. **2014**, 7, 241–250. [Google Scholar] [CrossRef]
- Aloysius, G.; Binu, D. An approach to products placement in supermarkets using PrefixSpan algorithm. J. King Saud Univ. Comput. Inf. Sci. **2013**, 1, 77–87. [Google Scholar] [CrossRef]
- Wright, A.; Wright, T.; Mccoy, A. The use of sequential pattern mining to predict next prescribed medications. Biomed. Inform. **2015**, 53, 73–80. [Google Scholar] [CrossRef] [PubMed]
- Fu, S. Analysis of library users' borrowing behavior based on sequential pattern mining. Inf. Stud. Theory Appl. **2014**, 6, 103–106. [Google Scholar]
- Liu, X. Study of Road Traffic Accident Sequence Pattern and Severity Prediction Based on Data Mining; Beijing Jiaotong University: Beijing, China, 2016. [Google Scholar]
- Wang, G. Fault Prediction of Railway Turnout Based on Text Data; Beijing Jiaotong University: Beijing, China, 2017. [Google Scholar]
- Fan, Z. Research on the Application of Data Mining in Aeronautical Maintenance; Changchun University of Science and Technology: Changchun, China, 2010. [Google Scholar]
- Ma, R.; Wang, L.; Yu, J. Circuit breakers condition evaluation considering the information in historical defect texts. J. Mech. Electr. Eng. **2015**, 10, 1375–1379. [Google Scholar]
- Salton, G. Developments in automatic text retrieval. Science **1991**, 253, 974–980. [Google Scholar] [CrossRef] [PubMed]
- Ontrup, J.; Ritter, H. Text Categorization and Semantic Browsing with Self-Organizing Maps on Non-euclidean Spaces. In Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery; Springer: Berlin/Heidelberg, Germany, 2001. [Google Scholar]
- Greene, D.; Cunningham, P. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 377–384. [Google Scholar]
- Quadrianto, N.; Song, L.; Smola, A.J.; Tuytelaars, T. Kernelized sorting. IEEE Trans. Pattern Anal. Mach. Intell. **2010**, 32, 1809–1821. [Google Scholar] [CrossRef] [PubMed]
- Salton, G.; Buckley, C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. Int. J. **1988**, 24, 513–523. [Google Scholar] [CrossRef]
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. **2003**, 3, 993–1022. [Google Scholar]
- Hinton, G.E. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, Amherst, MA, USA, 15–17 August 1986. [Google Scholar]
- Mikolov, T.; Sutskever, I.; Chen, K. Distributed Representations of Words and Phrases and their Compositionality. Adv. Neural Inform. Proc. Sys. **2013**, 26, 3111–3119. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C. Glove: Global Vectors for Word Representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1188–1196. [Google Scholar]
- Kenter, T.; de Rijke, M. Short Text Similarity with Word Embeddings. In Proceedings of the ACM International Conference on Information and Knowledge Management, Melbourne, Australia, 18–23 October 2015; pp. 1411–1420. [Google Scholar]
- Sakurai, S.; Ueno, K. Analysis of daily business reports based on sequential text mining method. In Proceedings of the IEEE International Conference on Systems, The Hague, The Netherlands, 10–13 October 2004. [Google Scholar]
- Wong, P.C.; Cowley, W.; Foote, H.; Jurrus, E.; Thomas, J. Visualizing Sequential Patterns for Text Mining. In Proceedings of the IEEE Information Visualization, Salt Lake City, UT, USA, 9–10 October 2000; pp. 105–111. [Google Scholar]

**Figure 7.** Relationship between min_sup, max_win and the number of the fault sequence patterns (FSP).

| Segmentation System | Custom Dictionary | Mark Part of Speech | Extract Key Words | Support Traditional Chinese | Support 8-bit Unicode Transformation Format (UTF-8) | Identify New Words |
|---|---|---|---|---|---|---|
| Jieba | √ | √ | √ | √ | √ | √ |
| CAS | √ | √ | × | √ | √ | × |
| smallseg | √ | × | × | √ | × | √ |
| snailseg | × | × | × | × | √ | × |

| No. | Chinese Text |
|---|---|
| 1 | [‘1号发动机’// ‘的’// ‘散热器’// ‘渗油’] (engine 1// 's// radiator// oil seepage) |
| 2 | [‘左发’// ‘滑油散热器’// ‘出现’// ‘漏油’] (left engine// lubricating oil radiator// appear// oil leakage) |

| NO. | Text | Distance |
|---|---|---|
| 1 | The fourth-engine vibrator ZZG-1 indicates abnormal, internal fault | 0.2302 |
| 2 | The No. 1 engine vibrator does not indicate, internal fault | 0.2302 |
| 3 | The fourth torque indicates the maximum pressure, internal fault | 0.2813 |
| 4 | Third vibration amplifier does not indicate, no fault | 0.2991 |
| 5 | s vibration amplifier overload, the indicator light does not shine, no fault | 0.3268 |

| Functions \ Models | Bag-of-Words Model | Topic Model | Word Embedding Distance Model |
|---|---|---|---|
| Word embedding | × | √ | √ |
| Implication semantic relation | × | √ | √ |
| Suitable for short text | × | × | √ |

| Event | Time |
|---|---|
| C | 2 March 2017 |
| B | 11 September 2017 |
| A | 2 May 2017 |
| C | 8 June 2017 |
| $\vdots $ | $\vdots $ |
| B | 6 December 2017 |

| NO. | Event | Time |
|---|---|---|
| 1 | A | |
| 2 | C | |
| 3 | B | |
| 4 | A | |
| $\vdots $ | $\vdots $ | |
| N | C | |

| Event | 1 | 2 | … | n |
|---|---|---|---|---|
| 1 | 1 | ${X}_{12}$ | … | ${X}_{1n}$ |
| 2 | ${X}_{21}$ | 1 | … | ${X}_{2n}$ |
| … | … | … | … | … |
| n | ${X}_{n1}$ | ${X}_{n2}$ | … | 1 |

| Event | 1 | 2 | … | n |
|---|---|---|---|---|
| 1 | 1 | 0 or 1 | … | 0 or 1 |
| 2 | 0 or 1 | 1 | … | 0 or 1 |
| … | … | … | … | … |
| n | 0 or 1 | 0 or 1 | … | 1 |

| No. | Included Events |
|---|---|
| SES_{1} | e_{1}, e_{3}, e_{7}, e_{15}, e_{27} |
| SES_{2} | e_{2}, e_{4} |
| SES_{3} | e_{1}, e_{3}, e_{7}, e_{27}, e_{35} |
| SES_{4} | e_{2}, e_{4} |
| $\vdots $ | $\vdots $ |
| SES_{n} | e_{6}, e_{18}, e_{33} |

| No. | Included Events |
|---|---|
| SES_{1} | e_{1}, e_{3}, e_{7}, e_{15}, e_{27}, e_{35} |
| SES_{2} | e_{2}, e_{4} |
| $\vdots $ | $\vdots $ |

| No | Event | Time |
|---|---|---|
| 1 | A | |
| 2 | B | |
| 3 | A | |
| 4 | B | |
| 5 | B | |

| Number of Max_Win | Included Sequence | Support Degree |
|---|---|---|
| 0 | ①② | 2 |
| 1 | ①②③ | 3 |
| 2 | ①②③④ | 4 |
| 3 | ①②③④⑤ | 5 |
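The support values in the table above can be reproduced with a short sketch over the five events A, B, A, B, B from the preceding table, again reading max_win as the number of events allowed between the two pattern occurrences (an assumed but table-consistent semantics):

```python
def pattern_support(events, a, b, max_win):
    """Support of the pattern a -> b: count each later occurrence of b
    that appears within max_win + 1 positions after an occurrence of a,
    i.e., with at most max_win other events in between."""
    return sum(
        1
        for i, e in enumerate(events)
        if e == a
        for f in events[i + 1 : i + 2 + max_win]
        if f == b
    )
```

For the events A, B, A, B, B this yields supports 2, 3, 4 and 5 as max_win grows from 0 to 3, matching the table.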

| No. | Fault Recordings | Time |
|---|---|---|
| 1 | The motor is not working (with a gasket), the clutch is bad. | 17 February 2014 |
| 2 | Oil leakage of honeycomb structure of fourth slide oil radiator. | 3 March 2014 |
| 3 | 2nd oil radiator honeycomb hole oil leakage. | 18 March 2014 |
| 4 | 3rd oil radiator honeycomb hole oil leakage. | 27 March 2014 |
| 5 | Small turbine voltage is low, flight vibrations cause poor contact. | 9 April 2014 |
| $\vdots $ | $\vdots $ | $\vdots $ |

| No | The Previous Fault | Subsequent Fault | Support Degree |
|---|---|---|---|
| 1 | The voice of the high frequency part of the antenna azimuth transmission mechanism is abnormal | Search cannot track objects, self-tuning circuits are bad | 224 |
| 2 | R116 break, short circuit | No. 5 caller failed | 223 |
| 3 | The panoramic radar display has failed | Antenna azimuth transmission gear deformation | 221 |
| 4 | The channel tuning is continuous, the high frequency component is bad | Search cannot track objects, self-tuning circuits are bad | 215 |
| 5 | Search cannot track objects; self-tuning circuits are bad | Fault of radio frequency relay J1 | 211 |
| $\vdots $ | $\vdots $ | $\vdots $ | $\vdots $ |

| Parameter Combination | Test 1 | Test 2 | Test 3 | Test 4 | Test 5 |
|---|---|---|---|---|---|
| A (Min_sim = 0.9, Max_win = 20, Min_sup = 4) | 100% | 97.80% | 96.90% | 97.30% | 97.40% |
| B (Min_sim = 0.8, Max_win = 20, Min_sup = 4) | 100% | 93.30% | 93.50% | 93.10% | 93.70% |
| C (Min_sim = 0.9, Max_win = 30, Min_sup = 4) | 100% | 89.30% | 89.60% | 89.20% | 89.80% |
| D (Min_sim = 0.9, Max_win = 20, Min_sup = 8) | 100% | 99.10% | 99.30% | 99.10% | 99.20% |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Yuan, X.; Chang, W.; Zhou, S.; Cheng, Y.
Sequential Pattern Mining Algorithm Based on Text Data: Taking the Fault Text Records as an Example. *Sustainability* **2018**, *10*, 4330.
https://doi.org/10.3390/su10114330
