# Improving Code Completion by Solving Data Inconsistencies in the Source Code with a Hierarchical Language Model

## Abstract


## 1. Introduction

- To the best of our knowledge, this is the first study to explicitly discuss the data inconsistency problem of source code and to propose a method that targets it.
- The proposed method uses the hierarchical tree structure of source code to combat the inconsistency of tokens.
- The new framework divides the single decoding process of the original language model into an encoding process and a decoding process. This framework can greatly increase the parameters available to the original model and may inspire other language models.
- A new tree encoding–decoding mechanism is designed and applied to the hierarchical structures of code.
- Both inner-project and cross-project evaluations are conducted to compare the performance of the models, and an average improvement of 7% is achieved.

## 2. Related Work

#### 2.1. Statistical Models for Code Completion

#### 2.2. Deep Learning Models for Code Completion

#### 2.3. Models for Code Synthesis

#### 2.4. Models for Code Classification

#### 2.5. Hierarchical Language Model for Source Code

## 3. Proposed Method

#### 3.1. Preliminary

#### 3.1.1. Abstract Syntax Tree (AST)

**Concepts of an AST:** The formal definitions of some concepts of an AST are described here.

- Sibling nodes: In an AST, sibling nodes have the same parent node. They are also considered to be at the same level, or in the same hierarchy. For example, in Figure 1, node + and node 5 are sibling nodes: node + is the previous sibling of node 5, and node 5 is the next sibling of node +.
- Descendant nodes: For node n, all nodes in the tree rooted at node n, except node n itself, are referred to as descendants of node n. For example, in Figure 1, the descendant nodes of node > are node +, node 5, node a, and node b.
- Ancestor nodes: If node p is a descendant of node q, then node q is referred to as an ancestor of node p.
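
The three relations above can be sketched in a few lines of Python. The `Node` class and helper names are hypothetical, and the tree is a small assumed AST for `a + b > 5`, chosen to agree with the sibling and descendant examples given above; it is not necessarily the complete tree of Figure 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """A minimal AST node; `label` stands in for the token or node type."""
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        """Attach `child` as the last child of this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def siblings(node: Node) -> List[Node]:
    """All other nodes that share this node's parent."""
    if node.parent is None:
        return []
    return [c for c in node.parent.children if c is not node]


def descendants(node: Node) -> List[Node]:
    """Every node in the subtree rooted at `node`, excluding `node` itself."""
    out: List[Node] = []
    for child in node.children:
        out.append(child)
        out.extend(descendants(child))
    return out


def ancestors(node: Node) -> List[Node]:
    """Every node on the way from `node` up to the root, nearest first."""
    out: List[Node] = []
    while node.parent is not None:
        node = node.parent
        out.append(node)
    return out


# Assumed example tree for the expression `a + b > 5`.
root = Node(">")
plus = root.add(Node("+"))
five = root.add(Node("5"))
a = plus.add(Node("a"))
b = plus.add(Node("b"))

print([n.label for n in siblings(five)])     # siblings of node 5
print([n.label for n in descendants(root)])  # descendants of node >
print([n.label for n in ancestors(a)])       # ancestors of node a
```

Storing a `parent` pointer is a convenience of this sketch; the same relations could be derived from the child lists alone.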

#### 3.1.2. State-of-the-Art Code Completion Method

**Traditional Language Model Computation Step for Code Completion.** Traditional methods flatten the AST into a token sequence. Assume that the generated token sequence is ${t}_{1}$, ${t}_{2}$, …, ${t}_{n}$. To predict the next token, ${t}_{n+1}$, from the existing token sequence, traditional methods compute the probability $P\left({t}_{n+1}\mid {t}_{1},{t}_{2},\dots ,{t}_{n}\right)$. Here, we use the simplest recurrent language model to show in detail how this probability is computed. The symbol ${e}_{i}$ denotes the embedding of token ${t}_{i}$: a vector of shape $[1,m]$, that is, a matrix with 1 row and m columns, where the embedding feature size m is chosen by hand. To ease the description, we set n to $i-1$, so that the quantity to compute is $P\left({t}_{i}\mid {t}_{1},{t}_{2},\dots ,{t}_{i-1}\right)$. With the above definitions, the output embedding ${h}_{i}$ for predicting token ${t}_{i}$ is computed as follows.
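
As a rough illustration of this kind of computation, and not the paper's exact formulation, the sketch below assumes a standard Elman-style recurrence, $h_i = \tanh(e_{i-1} W + h_{i-1} U + b)$, followed by a softmax over the vocabulary. The weight names `W`, `U`, `b`, `V`, the embedding table `E`, and all sizes are assumptions made for the example.

```python
import math
import random


def matvec(vec, mat):
    """Row vector of shape [1, m] times matrix of shape [m, k] -> [1, k]."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]


def rnn_step(e_prev, h_prev, W, U, b):
    """One assumed recurrent step: h_i = tanh(e_{i-1} W + h_{i-1} U + b)."""
    from_input = matvec(e_prev, W)
    from_state = matvec(h_prev, U)
    return [math.tanh(x + y + z) for x, y, z in zip(from_input, from_state, b)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def next_token_distribution(token_ids, E, W, U, b, V):
    """Approximate P(t_i | t_1, ..., t_{i-1}) for every vocabulary entry."""
    h = [0.0] * len(b)                 # initial hidden state
    for t in token_ids:                # consume t_1 ... t_{i-1} in order
        h = rnn_step(E[t], h, W, U, b)
    return softmax(matvec(h, V))       # project h_i onto the vocabulary


def random_matrix(rows, cols):
    """Small random weights; a trained model would learn these instead."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(rows)]


random.seed(0)
m, vocab_size = 4, 6                   # assumed embedding size and vocabulary
E = random_matrix(vocab_size, m)       # one [1, m] embedding per token
W, U = random_matrix(m, m), random_matrix(m, m)
V = random_matrix(m, vocab_size)
b = [0.0] * m

probs = next_token_distribution([2, 0, 5], E, W, U, b, V)
print(len(probs), round(sum(probs), 6))
```

With untrained random weights the distribution is meaningless; the point is only the data flow from the embeddings ${e}_{1},\dots ,{e}_{i-1}$ to the output embedding ${h}_{i}$ and then to a probability for each candidate token.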

#### 3.2. Differences and Insights between Existing Models and the Hierarchical Language Model

#### 3.3. Hierarchical Language Model

#### 3.3.1. Encoding Procedure of HLM

#### 3.3.2. Decoding Procedure of HLM

**Decoding Path of HLM:** The *decoding path* of HLM for node n is the transfer path from the root to node n. From a node, a transition can go only to that node's first child or to its next sibling. The candidate transfer paths of the AST in Figure 1 are shown in Figure 4, where a solid arrow represents a transition to the first child and a dotted arrow represents a transition to the next sibling. Under this definition, the transitions form a directed acyclic graph (DAG), and the transfer path from the root to each node is uniquely determined. In detail, starting from the root node, if node n is a descendant of the root, we must transfer from the root to its first child. After reaching a new node, if node n is a descendant of the newly reached node, we must transfer to the first child of that node; otherwise, we must transfer to its next sibling. Continuing in this way, we finally reach node n. There are therefore two kinds of transitions: transfer to the first child and transfer to the next sibling.
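
The rule described above can be sketched as a short procedure. The `Node` class, the function names, and the transition labels `first_child` and `next_sibling` are hypothetical, and the example tree is an assumed AST for `a + b > 5` rather than a reproduction of Figure 1.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    """A minimal ordered tree node used only for this illustration."""

    def __init__(self, label: str, children: Optional[List["Node"]] = None):
        self.label = label
        self.children = children or []


def contains(node: Node, target: Node) -> bool:
    """True if `target` is a strict descendant of `node`."""
    return any(c is target or contains(c, target) for c in node.children)


def decoding_path(root: Node, target: Node) -> List[Tuple[str, str, str]]:
    """Return the transitions (source, kind, destination) from root to target."""
    if target is not root and not contains(root, target):
        raise ValueError("target is not in the tree")

    # Record each node's parent and index so the next sibling can be found.
    position: Dict[int, Tuple[Node, int]] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        for i, child in enumerate(node.children):
            position[id(child)] = (node, i)
            stack.append(child)

    path = []
    current = root
    while current is not target:
        if contains(current, target):
            # The target lies below the current node: go to its first child.
            nxt, kind = current.children[0], "first_child"
        else:
            # Otherwise the target lies to the right: go to the next sibling.
            parent, i = position[id(current)]
            nxt, kind = parent.children[i + 1], "next_sibling"
        path.append((current.label, kind, nxt.label))
        current = nxt
    return path


# Assumed example tree for the expression `a + b > 5`.
a, b, five = Node("a"), Node("b"), Node("5")
plus = Node("+", [a, b])
root = Node(">", [plus, five])

for step in decoding_path(root, b):
    print(step)
```

Because each step is forced by whether the target is a descendant of the current node, the loop never has a choice to make, which is why the path to any node is unique. The repeated `contains` checks keep the sketch short at the cost of efficiency.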

**Transition on Decoding Path of HLM:** As described above, the *decoding path* consists of a sequence of transitions. In Figure 5, the dotted arrows illustrate the path and the transitions from the root to node + (the second child of node =). The transitions between nodes on the path are marked as ${t}_{0}$, ${t}_{1}$, ${t}_{2}$, …, ${t}_{5}$. The information flow of a transition represents the information accumulated over all previous transitions. The information that flows along each transition has a fixed data format, $(\mathit{cell},h)$, where $\mathit{cell}$ and $h$ are two feature vectors of fixed length. The symbols ${\mathit{cell}}_{i}$ and ${h}_{i}$ represent the information on transition ${t}_{i}$. For each transition ${t}_{i}$, the source node of that transition is named ${\mathit{src}}_{{t}_{i}}$, and the target node is named ${\mathit{tgt}}_{{t}_{i}}$. The set of all descendant nodes of ${\mathit{src}}_{{t}_{i}}$ is named ${\mathit{src}}_{{t}_{i}}^{\mathit{descendants}}$.

**Detailed Decoding Step of HLM:** We then iterate over the transitions one by one to compute the accumulated information for predicting node n. First, the information of transition ${t}_{0}$, namely ${\mathit{cell}}_{0}$ and ${h}_{0}$, is set to a fixed default value. Then, for each transition ${t}_{i}$ of the “transfer to the first child” type, we use Equation (6) to compute the information for transition ${t}_{i}$. Let ${\tau}_{i}$ denote the token embedding of the source node of transition ${t}_{i}$. The information of transition ${t}_{i}$ is computed by

## 4. Experiment

#### 4.1. UNK Setting

#### 4.2. Datasets

#### 4.3. Baselines

#### 4.4. Hyperparameters

#### 4.5. Termination Condition

#### 4.6. Platform

#### 4.7. Evaluation

## 5. Discussion

## 6. Conclusions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest


| Dataset | From Projects | Size | Vocabulary |
|---|---|---|---|
| Dataset 1 | apache commons lang main | 2.8 MB | 1273 |
| Dataset 2 | apache maven | 4.4 MB | 5283 |
| Dataset 3 | gocd, apache-incubator-joshua and locationtech-geowave | 7.53 MB | 8030 |

| Dataset | Model | Top-1 | Top-3 | Top-6 | Top-10 | MRR |
|---|---|---|---|---|---|---|
| 1 | RNN | 32.8 | 45.1 | 53.6 | 59.8 | 0.41 |
| 1 | LSTM | 46.8 | 60.7 | 67.2 | 71.3 | 0.55 |
| 1 | HLM | 50.3 | 64.7 | 70.7 | 73.8 | 0.58 |
| 2 | RNN | 44.0 | 56.9 | 63.8 | 69.3 | 0.52 |
| 2 | LSTM | 50.9 | 64.1 | 69.6 | 72.5 | 0.58 |
| 2 | HLM | 55.7 | 69.5 | 73.4 | 75.2 | 0.63 |
| 3 | RNN | 34.8 | 52.2 | 56.4 | 59.3 | 0.43 |
| 3 | LSTM | 48.6 | 61.3 | 68.6 | 70.4 | 0.56 |
| 3 | HLM | 56.8 | 64.9 | 71.3 | 72.5 | 0.62 |
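
For readers unfamiliar with the metrics in the results table, the following sketch shows how Top-k accuracy and mean reciprocal rank (MRR) are conventionally computed. It assumes the standard definitions, with a reciprocal rank of 0 when the true token does not appear in the candidate list; the function names and the toy data are hypothetical and are not taken from the evaluation in this article.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Percentage of cases whose true token is among the first k candidates."""
    hits = sum(1 for ranked, target in zip(ranked_predictions, targets)
               if target in ranked[:k])
    return 100.0 * hits / len(targets)


def mean_reciprocal_rank(ranked_predictions, targets):
    """Average of 1 / rank of the true token (0 if it is not ranked at all)."""
    total = 0.0
    for ranked, target in zip(ranked_predictions, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(targets)


# Toy data: three completion queries, each with a ranked candidate list.
ranked = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"]]
truth = ["a", "y", "s"]

print(top_k_accuracy(ranked, truth, 1))
print(top_k_accuracy(ranked, truth, 3))
print(mean_reciprocal_rank(ranked, truth))
```

In the toy data the true token is ranked first once, second once, and missing once, so the reciprocal ranks are 1, 1/2, and 0.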

| Dataset | Model | Time for One Round | Number of Rounds to Converge | Total Time |
|---|---|---|---|---|
| DS1 | LSTM | 1 min | 18 | 18 min |
| DS1 | HLM | 2 min | 7 | 14 min |
| DS2 | LSTM | 6 min | 29 | 174 min |
| DS2 | HLM | 15 min | 9 | 135 min |
| DS3 | LSTM | 22 min | 35 | 770 min |
| DS3 | HLM | 40 min | 11 | 440 min |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Yang, Y. Improving Code Completion by Solving Data Inconsistencies in the Source Code with a Hierarchical Language Model. *Electronics* **2023**, *12*, 1576.
https://doi.org/10.3390/electronics12071576
