1. Introduction
Programming education has become indispensable for fostering engineers who will be active in today’s advanced information society. One of the ways to develop programming skills is through coding exercises with repetitive problem-solving. Therefore, many e-learning systems have been developed to support programming education (see, for example, [1, 2, 3]). However, programming, which requires knowledge of syntax and logical thinking, is a difficult process, and learners still tend to stumble, especially in debugging. Logic errors in compilable code are particularly troublesome. A logic error is a token in the source code that causes unexpected behavior, which manifests as a runtime error, a wrong output, or an exceeded resource limit. Since programming learners may have little knowledge and experience in programming, it is not easy for them to correct logic errors in source code. This is a problem faced not only by novice programmers but also by advanced programmers and instructors. If they cannot identify the logic errors in the source code, they will spend more time debugging and may lose their motivation in programming. Therefore, we believe that identifying logic errors in the source code can support programming learners.
To correct logic errors in source code, it is necessary to understand both the specifications of programming tasks and programming languages. The learner then needs to iterate over the debugging process until the program meets the specifications of the given programming task. Generally, in debugging work, logic errors are identified, corrected, and tested repeatedly. It is important for the testing to consider whether the expected results are obtained with acceptable time/space computational resources. Although a number of test cases can enable the learner to confirm the existence of logic errors, it is often difficult to identify their location. Therefore, if the location of logic errors in the source code could be automatically identified and the errors then corrected, this would reduce the time required for debugging. Furthermore, additional support can be provided by identifying the true error from the correction candidates and showing the appropriate editing operations.
In this paper, to support programmers, especially learners, we propose a debugging model for automatically correcting multiple logic errors in given source codes. This model realizes the detection of logic errors by learning the structure of solution codes stored in the database (DB) of a certain e-learning system. This system provides the user with many programming tasks, and the user can submit their solution codes. A set of test cases (and a special validator) is prepared for each programming task as judge data, and the automatic assessment system [4] rigorously tests whether the solution code meets the specifications of the corresponding task. Solution codes passed by the judge are considered correct codes, and other codes are considered incorrect codes with logic errors. The learning cycle using this system can be considered as test-driven development because it tests the source code using the test cases that are already prepared in the system. Our proposed model iteratively attempts to detect and correct errors and test the source code. The model consists of two submodels. The first is the Correct Code Model (CCM), which generates a sequence of tokens of correct code by learning the structure of a set of correct codes. By comparing the given source code with the prediction result, CCM can enumerate multiple correction candidates for logic errors. The second is the Editing Operation Predictor (EOP), which indicates the editing operation for the correction candidate obtained by CCM. To train and evaluate the proposed model, we employed solution codes accumulated in an online judge system [5] that provides many programming tasks and automatic assessments. The experimental results show that the correction accuracy of the proposed model is, on average, 58.64 percentage points higher across 32 tasks compared to the conventional model without iterative trials.
The contributions of this paper are as follows.
Development of a debugging model that iteratively detects and corrects errors and tests the source code.
Development of EOP with CCM that can predict the location of the errors, possible alterations as well as editing operations.
Experimental evaluation of the proposed model using real solution codes for programming tasks obtained from an online judge system.
Adaptation to use cases assuming that the user debugs incorrect code using the correction candidates and editing operations predicted by the proposed model. We also discuss the adaptation for both programming education and software engineering.
The rest of this paper is organized as follows.
Section 2 introduces related research on logic errors and explores current issues.
Section 3 describes the proposed model.
Section 4 puts forward an experimental method for verifying the usefulness of the proposed model.
Section 5 describes the experimental results and considerations.
Section 6 concludes this paper.
2. Related Works
In this section, we first review current approaches and methods for identifying and correcting errors, focusing on educational scenes. We then consider related debugging technologies oriented to software engineering and other purposes.
2.1. Approaches and Methods for Educational Scenes
In traditional learning methods in programming and software engineering exercises, learners can use compilers and IDEs to detect errors. However, as mentioned in the Introduction, to correct logic errors, learners need to take steps such as checking the execution results. Therefore, it is a burden not only for learners but also for instructors who have many students in the classroom.
Luxton-Reilly et al. report on existing approaches and future challenges in introductory programming [6]. While many debugging tools have been studied that focus on syntax errors in source code, debugging tools that focus on logic errors have also been increasing. They also reported that although many debugging tools have been developed, the future issue is whether learners can understand the complexity of these debugging tools and use them to debug efficiently.
To help learners who cannot completely debug the source code using compilers and IDEs, many studies have been conducted to enhance the error messages output by them [7, 8]. These studies reported that by analyzing the source code created by the learner and the error messages produced by the compiler, they were able to reduce the number of syntax errors encountered by the learner. However, since logic errors depend on the specifications of the programming task, they do not have the same rules as the syntax of a programming language. Therefore, debugging of logic errors considering the specifications of programming tasks is a major issue.
To support the debugging of logic errors by novice programmers in educational scenes, many studies have been conducted using source code groups created to meet the specifications of a certain task. Yoshizawa et al. proposed a static analysis method that considers the structure of the source code [9]. For this, the correct code group is converted into Abstract Syntax Trees (ASTs). The source code to be debugged is also converted to an AST. By comparing the structure of the converted AST and the prepared ASTs, the position of the logic error and the type of the logic error can be obtained with high accuracy. In machine learning-based approaches, Teshima et al. proposed a CCM based on the LSTM Language Model (LSTM-LM) and correct codes [10]. LSTM-LM can indicate the position of logic errors in a given incorrect code and suggest possible words by learning the structure of the correct code. Rahman et al. improved this model by applying the Attention Mechanism with LSTM (LSTM-AttM) [11, 12]. They also employed a bidirectional LSTM to detect logic errors and to present suggestions for corrections [13]. The models can also be applied to code completion [14]. Although the machine learning model used for error detection is different, logic error detection and correction are realized by learning the structure of the correct code.
These methods can predict the location of logic errors in the source code and their proposed fixes, but they cannot suggest to the learner how to correct the source code using this information. Moreover, even if the source code can be corrected using the predicted results, the corrected source code may still contain logic errors. Therefore, these methods cannot guarantee that the learner will be able to fix all the logic errors in the source code.
Our model can provide the step-by-step information needed to debug logic errors: the location of each logic error, its proposed fix, its editing operation, and whether the source code is correctable. This means that learners and instructors are given step-by-step opportunities to consider what kind of debugging they should carry out, depending on their proficiency in programming skills. Therefore, our model can provide pedagogical support that takes into account the learner’s learning process, unlike the direct and immediate debugging support found in common software development tools.
2.2. Technologies in Debugging
Pre-trained models have been widely used to solve natural language processing tasks [15]. OpenAI proposed Generative Pre-Trained Transformer 3 (GPT-3) as a state-of-the-art language model in 2020 [16]. The model is capable of generating sentences that are comparable to those written by humans. GPT-3 was realized by using a huge number of model parameters and a large amount of training data.
In recent years, machine learning techniques have been utilized to solve complex programming-related problems [17]. First, we should emphasize the recent evolution of IDE extensions. Functions such as error highlighting, completion, and refactoring are essential for software development. Such support can be realized by machine learning approaches with data analysis. For example, the latest technology in Visual Studio IntelliCode [18] has had a major impact on software engineering. In this approach, artificial intelligence uses huge GitHub repositories for learning and provides intelligent completion capabilities that take into account not only listing variables and functions but also other situations.
Many automatic bug-fixing methods have been proposed for quickly fixing software bugs [19]. Drain et al. detect and fix bugs using the standard transformer [20] as a language model trained on source code extracted from GitHub repositories [21]. Ueda et al. proposed a fix method based on mining the editing histories in GitHub [22]. These methods make it possible to provide debugging support by using source codes, bug reports, edit histories, and so on.
Many methods have been proposed to fix source code using the correction candidates predicted by machine learning models. To correct multiple syntax errors in a source code, Gupta et al. proposed DeepFix [23], which iteratively corrects the errors in the source code. In this method, the source code can be modified line-by-line using one correction candidate predicted by the Sequence to Sequence (Seq2Seq) attention network. Hajipour et al. [24] proposed SampleFix, which iteratively corrects errors using a Seq2Seq model. Each time the source code is modified, these methods use a compiler to verify whether an error still exists. It was shown that the accuracy of correcting syntax errors in the source code can be improved by making repeated corrections. However, although source code verification using a compiler can correct syntax errors, it is difficult to find and correct logic errors.
Gupta et al. proposed a method for predicting logic error locations using the tree convolutional neural network and AST [25]. Experimental results showed that the percentage of source code that contains true errors in the lines predicted by this model was less than 30% when the number of candidate lines is one, and 80% when the number of candidate lines is 10. However, a large number of detection candidates can make it difficult for the user to find the true logic error. Therefore, there is a trade-off between accuracy and the number of detection candidates, as too high a number can overload the user with information. Vasic et al. proposed a program modification method leveraging a joint model using a Long Short-Term Memory (LSTM) network and an attention mechanism to solve the variable misuse problem [26]. Publicly available data were used to verify the accuracy of identifying and correcting variable misuse points. However, although variable misuse can be considered as one type of logic error, the method cannot be used to identify other types of logic errors.
There is an approach called Fault Localization [27] that predicts the location of errors in the source code by using tests. The bug position is predicted using execution route information from the test cases. However, although it is possible to predict the location of the error using these methods, they do not provide instructions concerning how to correct the error. Lee et al. presented a method for correcting logic errors based on fault localization to support learners of functional languages [28].
Matsumoto et al. [29] considered a hybrid architecture of static analysis [9] and CCM for logic error detection [10]. They concluded that CCM is advantageous as it can detect more logic errors. Furthermore, they proposed a method for determining the threshold value for controlling the number of candidates indicated by the CCM to reduce the overload [30]. A set of incorrect codes was applied for CCM training to determine the threshold. The authors suggested that a threshold suitable for each individual programming task should be determined as this threshold can fluctuate for different tasks.
In summary, many methods have been proposed to detect errors in source code. Various researchers use machine learning-based models to detect errors in source code and use that information to modify the code. However, although there are many methods for correcting syntax errors, there are few debugging models for logic errors.
3. Proposed Model
This section introduces the proposed model, which uses the LSTM-LM and a Support Vector Machine (SVM). First, we explain the CCM used to identify logic errors in the source code. Next, the proposed EOP is described.
3.1. Overview of Proposed Model
Figure 1 outlines the proposed model for correcting multiple logic errors in a given source code. The model debugs by iteratively identifying and correcting logic errors and testing the modified code. This model consists of CCM and EOP. The EOP predicts the editing operation from the correction candidates obtained by CCM. Based on the obtained operations, EOP seeks to correct the logic error in the source code. Then, the model tests whether the modified source code is correct or not. If the source code is incorrect, it again becomes input data for CCM. In this way, the model can debug source code containing multiple logic errors.
To build the proposed model, a set of source codes oriented to the specification of a programming task is required. CCM is constructed by LSTM-LM, which has learned the structure of a set of correct codes to predict the position of logic errors. EOP can predict the editing operation for the correction candidate by learning the editing operation performed between the incorrect code and the corresponding correct code.
3.2. Correct Code Model (CCM)
3.2.1. Long Short-Term Memory Language Model (LSTM-LM)
The LSTM is a type of recurrent neural network (RNN) that can handle time-series data. LSTMs [31] have the useful characteristic of mitigating vanishing and exploding gradients. The language model [32] enables sentence generation and machine translation based on the structure of the learned data. An LSTM-LM can be constructed by combining the following three layers: an embedding layer, an LSTM layer, and a softmax layer (Figure 2).
The source code is tokenized based on the vocabulary table. Variables, functions, reserved words, characters, and so on are considered vocabulary entries for each programming language. A unique identifier (ID) is assigned to each vocabulary entry. Since the token sequence is still a string, it is converted into a sequence of numerical values $\mathit{x}=[{x}_{1},{x}_{2},\dots ,{x}_{t},\dots ,{x}_{n}]$ based on the vocabulary table.
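The tokenization step described above can be sketched as follows. This is a minimal illustration, not the tokenizer used in the experiment: the regular expression `TOKEN_PATTERN` and the helper names `tokenize`, `build_vocabulary`, and `encode` are hypothetical, and the pattern only covers a small C-like subset.

```python
import re

# Hypothetical token pattern for a small C-like subset: identifiers/keywords,
# integer literals, and single punctuation characters.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source: str) -> list[str]:
    """Split source text into tokens, dropping whitespace."""
    return TOKEN_PATTERN.findall(source)


def build_vocabulary(token_sequences: list[list[str]]) -> dict[str, int]:
    """Assign a unique integer ID to each distinct token, in order of first appearance."""
    vocabulary: dict[str, int] = {}
    for tokens in token_sequences:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def encode(tokens: list[str], vocabulary: dict[str, int]) -> list[int]:
    """Convert a token sequence into the numerical sequence x."""
    return [vocabulary[token] for token in tokens]


tokens = tokenize("int x; x = x * x;")
vocabulary = build_vocabulary([tokens])
x = encode(tokens, vocabulary)
```

In this sketch the vocabulary is built from the data itself; a fixed, language-specific vocabulary table would play the same role.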
The embedding layer embeds a given input sequence $\mathit{x}$ into the vector $\mathit{e}=[{e}_{1},{e}_{2},\dots ,{e}_{t},\dots ,{e}_{n}]$. Then, the LSTM layer uses the given vector $\mathit{e}$ as an input to calculate the hidden output $\mathit{h}=[{h}_{1},{h}_{2},\dots ,{h}_{t},\dots ,{h}_{n}]$. The hidden output ${h}_{t}$ obtained from the LSTM layer is converted into a word-by-word probability distribution $\mathit{p}$ by the softmax layer.
CCM is constructed by learning the internal parameters of LSTM-LM using a set of correct codes. CCM learns the internal parameters by using the sequence $\mathit{x}$ as the learning data. The probability ${\mathit{p}}_{t}$ output by CCM indicates the probability distribution of the next token. By applying the function argmax to the probabilities, CCM can output the predicted sequence $\widehat{\mathit{x}}=[{\widehat{x}}_{1},{\widehat{x}}_{2},\dots ,{\widehat{x}}_{t},\dots ,{\widehat{x}}_{n}]$ from the input sequence $\mathit{x}$.
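The last two steps, softmax followed by argmax, can be illustrated without a trained network. The sketch below assumes that raw scores per time step are already available (here they are made-up numbers); the function names are hypothetical.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(distribution: list[float]) -> int:
    """Return the token ID with the highest probability."""
    return max(range(len(distribution)), key=distribution.__getitem__)


def predicted_sequence(score_rows: list[list[float]]) -> list[int]:
    """Apply softmax and argmax at every time step to obtain x-hat."""
    return [argmax(softmax(row)) for row in score_rows]


# Hypothetical scores for a 3-token vocabulary over two time steps.
x_hat = predicted_sequence([[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]])
```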
3.2.2. Localization of Logic Errors by CCM
Algorithm 1 localizes logic errors in a given source code by comparing the input sequence $\mathit{x}$ with the prediction result obtained by CCM. If the sequence $\mathit{x}$ follows the correct code, the token ${x}_{t+1}$ at time $t+1$ and the token ${\widehat{x}}_{t+1}$ predicted by CCM are equal. On the other hand, if the sequence $\mathit{x}$ is an incorrect code, the token ${x}_{t+1}$ at time $t+1$ and the token ${\widehat{x}}_{t+1}$ predicted by CCM may be different. This means that the token ${x}_{t+1}$ of the sequence is unlikely to appear as a sequence of tokens in the set of correct codes. Therefore, the token detected by Algorithm 1 and its position are likely to be logic errors. The procedure $listCorrection$ in Algorithm 1 shows the positions of three tokens which are correction candidates: ${x}_{t}$, ${x}_{t+1}$, and ${\widehat{x}}_{t+1}$.
Algorithm 1 $listCorrection$ = Localization of logic errors ($\mathit{x}$).
$\mathit{p}\leftarrow CCM\left(\mathit{x}\right)$
$listCorrection\leftarrow \left[\,\right]$
for $t=0,\dots ,\mathit{x}.\mathrm{length}$ do
    ${\widehat{x}}_{t+1}\leftarrow \arg \max \left({\mathit{p}}_{t}\right)$
    if ${x}_{t+1}\ne {\widehat{x}}_{t+1}$ then
        $listCorrection.\mathrm{append}\left(\left[\arg \min \left({\mathit{p}}_{t}\right),{x}_{t},{x}_{t+1},{\widehat{x}}_{t+1},t,t+1,t+2\right]\right)$
$listCorrection$.sort() by probabilities
return $listCorrection$
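A runnable reading of Algorithm 1 might look like the sketch below. Several points are assumptions rather than details given in the text: the CCM output is passed in as a precomputed list `p`, where `p[t]` is the distribution over the token following `x[t]`; the loop stops one step before the end of `x` so that `x[t + 1]` always exists; and candidates are ranked by the probability that the model assigns to the actual next token, least probable first.

```python
def localize_logic_errors(x, p):
    """Return correction candidates for the ID sequence x.

    p[t] is assumed to be the distribution over the token that follows x[t],
    as produced by a trained CCM (not shown here).
    """
    candidates = []
    for t in range(len(x) - 1):
        predicted = max(range(len(p[t])), key=p[t].__getitem__)  # arg max
        actual = x[t + 1]
        if actual != predicted:
            candidates.append({
                "probability": p[t][actual],  # assumed ranking key
                "x_t": x[t],
                "x_next": actual,
                "x_hat_next": predicted,
                "position": t,
            })
    # Assumption: the least probable actual token is the most suspicious.
    candidates.sort(key=lambda c: c["probability"])
    return candidates


# Toy example with a 3-token vocabulary; the distributions are made up.
x = [0, 1, 2, 1]
p = [
    [0.1, 0.8, 0.1],  # after x[0]: predicts 1, actual 1 -> match
    [0.6, 0.3, 0.1],  # after x[1]: predicts 0, actual 2 -> mismatch
    [0.2, 0.3, 0.5],  # after x[2]: predicts 2, actual 1 -> mismatch
]
candidates = localize_logic_errors(x, p)
```

With this toy input, two candidates are returned, and the mismatch after `x[1]` is ranked first because its actual token received the lowest probability.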
3.3. Editing Operation Predictor (EOP)
If the token ${\widehat{x}}_{t+1}$ predicted by Algorithm 1 and the original token ${x}_{t+1}$ do not match, the source code needs to be edited using the predicted token ${\widehat{x}}_{t+1}$ for the position of the token ${x}_{t}$ and the token ${x}_{t+1}$. The editing operations are considered as follows using the three tokens.
 insert
insert the predicted token ${\widehat{x}}_{t+1}$ between token ${x}_{t}$ and the next token ${x}_{t+1}$.
 delete
delete the next token ${x}_{t+1}$ between token ${x}_{t}$ and ${\widehat{x}}_{t+1}$.
 replace
replace the next token ${x}_{t+1}$ with the predicted token ${\widehat{x}}_{t+1}$.
Editing operations can be predicted by using the tokens ${x}_{t}$, ${x}_{t+1}$, and ${\widehat{x}}_{t+1}$. This assumes that programmers can edit the source code if they know the token and position to modify. EOP predicts the editing operation from the structure of the source code $\mathit{x}$ and the three tokens ${x}_{t}$, ${x}_{t+1}$, and ${\widehat{x}}_{t+1}$.
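The three editing operations can be expressed on a token list as in the following sketch. The function name `apply_edit` and the string labels are hypothetical; `tokens[t]` plays the role of ${x}_{t}$ and `tokens[t + 1]` that of ${x}_{t+1}$.

```python
def apply_edit(tokens, t, operation, predicted):
    """Return a new token list after editing around position t.

    tokens[t] plays the role of x_t and tokens[t + 1] that of x_{t+1}.
    """
    edited = list(tokens)
    if operation == "insert":
        edited.insert(t + 1, predicted)   # place x-hat between x_t and x_{t+1}
    elif operation == "delete":
        del edited[t + 1]                 # drop x_{t+1}
    elif operation == "replace":
        edited[t + 1] = predicted         # overwrite x_{t+1} with x-hat
    else:
        raise ValueError(f"unknown operation: {operation}")
    return edited


inserted = apply_edit(["a", "b"], 0, "insert", "+")
deleted = apply_edit(["a", "+", "+", "b"], 1, "delete", None)
replaced = apply_edit(["a", "+", "b"], 0, "replace", "-")
```

The input list is copied rather than modified in place, which keeps the original code available if a trial has to be discarded.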
Figure 3 shows how the model edits the source code based on the correction candidates obtained from Algorithm 1. The correction candidate with the highest possibility of a logic error is selected from the correction candidates list obtained from Algorithm 1. The selected correction candidate is converted into feature vectors using the vocabulary table used for tokenizing. Four feature vectors of the same size as the vocabulary table are created: a vector for the position of token ${x}_{t}$, a vector for the position of token ${x}_{t+1}$, a vector for the position of token ${\widehat{x}}_{t+1}$, and a vector holding the number of occurrences of each vocabulary entry in the source code. The concatenation of these four vectors is used as the EOP input.
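One possible construction of this input is sketched below. The exact encoding of the three token vectors is not specified in the text; the sketch assumes a one-hot encoding over the vocabulary, and all function names are hypothetical.

```python
def one_hot(token_id, vocabulary_size):
    """Vector of vocabulary size with a single 1 at the token's ID (assumed encoding)."""
    vector = [0] * vocabulary_size
    vector[token_id] = 1
    return vector


def token_counts(x, vocabulary_size):
    """Number of occurrences of each vocabulary entry in the ID sequence x."""
    counts = [0] * vocabulary_size
    for token_id in x:
        counts[token_id] += 1
    return counts


def eop_features(x, t, predicted, vocabulary_size):
    """Concatenate vectors for x_t, x_{t+1}, x-hat_{t+1}, and token counts."""
    return (
        one_hot(x[t], vocabulary_size)
        + one_hot(x[t + 1], vocabulary_size)
        + one_hot(predicted, vocabulary_size)
        + token_counts(x, vocabulary_size)
    )


features = eop_features(x=[0, 1, 2, 1], t=1, predicted=0, vocabulary_size=3)
```

The resulting vector has four times the vocabulary size, matching the description of four concatenated vectors.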
Construction of EOP
Editing operations of source code can be classified into three categories: insertion, deletion, or replacement of tokens. Therefore, we considered the prediction of edit operations for source code as a multi-class classification problem. To construct the EOP, we use SVM [33], which is often used to solve multi-class classification problems.
To learn the internal parameters of EOP, feature vectors and teacher labels are extracted using a set of source codes obtained from the DB. The submission logs of all users are extracted from the set of source codes, and a set of pairs, each of which consists of a correct and an incorrect code, are employed as learning data. The extracted pair corresponds to an editing process from an incorrect code to a correct code. By calculating the Shortest Editing Script (SES) that provides the editing operations between two source codes, it is possible to obtain the editing operations for the candidates to be corrected. The SES is calculated from the incorrect and correct codes by dynamic programming. Incorrect code and correct code are converted into an ID sequence in advance using the vocabulary table. By comparing the ID sequence of the incorrect code with the position of the SES, it is possible to label what kind of editing operation the token in the SES can perform on the editing position. EOP learns these feature vectors and labels for editing operations using SVM.
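The labeling step can be illustrated with a standard edit-distance computation. The sketch below uses Levenshtein dynamic programming with a backtrace as a stand-in for the SES computation, since the exact algorithm is not specified in the text; the function name and the tuple layout of the script are assumptions.

```python
def shortest_edit_script(source, target):
    """Return a minimal list of (operation, source_index, token) edits.

    Standard Levenshtein dynamic programming with a backtrace; used here as a
    stand-in for the SES computation, whose exact algorithm is not specified.
    """
    n, m = len(source), len(target)
    # cost[i][j] = edits needed to turn source[:i] into target[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = cost[i - 1][j - 1] + (source[i - 1] != target[j - 1])
            cost[i][j] = min(substitution, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    script = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and source[i - 1] == target[j - 1]
        if same and cost[i][j] == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1                       # tokens match, no edit
        elif i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + 1:
            script.append(("replace", i - 1, target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            script.append(("delete", i - 1, source[i - 1]))
            i -= 1
        else:
            script.append(("insert", i, target[j - 1]))  # insert before index i
            j -= 1
    script.reverse()
    return script


# ID sequences of an incorrect code and a correct code (made-up values).
labels = shortest_edit_script([1, 2, 3], [1, 4, 3])
```

Each entry of the script gives an operation label and the position in the incorrect code to which it applies, which is the kind of training label described above.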
3.4. Iterative Trials
The source code can be automatically corrected by combining CCM and EOP with Algorithm 1. If the source code modified by the model does not meet the specifications of the given programming task, it may need to be corrected again. In this model, the source code is tested using test cases to check whether the modified source code contains logic errors. If the modified source code passes the test, the debugging process is terminated assuming that all logic errors in the source code have been corrected. On the other hand, if the modified source code does not pass the test, the source code will be corrected again. If the target source code is written in a programming language that requires a compiler (e.g., C language), the model uses the compiler to convert it to an executable code before testing.
4. Experiment
To train the proposed model and conduct an experiment to verify its usefulness, we used source codes in C language that were written by programming learners for 32 programming tasks. We used a 64-bit Windows 10 computer with an Intel Core i9-9900K CPU (3.60 GHz), 32 GB RAM, and Nvidia GeForce RTX 2070 SUPER GPU.
The source codes used in the experiment were obtained from Aizu Online Judge (AOJ) [4, 34, 35], which is an e-learning system for programming.
Table 1 shows details of the specifications of the 32 selected programming tasks. AOJ is a system that appeals to people of different abilities, from beginners to experts, and is currently used by more than 85,000 individuals. The system supports around 15 programming languages such as C/C++, Java, and Python. Users can freely select and challenge programming tasks from around 2500 tasks. They can submit their solution code to AOJ, and then it is automatically judged by the AOJ system. This is so the user can know whether the source code they created meets the specifications of the selected programming task. About 5 million source codes submitted by users and their concomitant judgment results have been collected in AOJ’s DB.
To focus on solution codes that include logic errors, we excluded those that could not be properly compiled because of syntax errors or warnings. Moreover, source codes including functions and function macros defined by users were excluded. Comments, tabs, and spaces in the source codes were also deleted to remove unnecessary tokens.
4.1. Training and Training Accuracy
The source code used for training was tokenized to the sequence $\mathit{x}$ based on Table 2. In LSTM-LM, if the sequence $\mathit{x}$ becomes long, memory may run short owing to the increase in internal parameters. Therefore, we applied Hotelling’s theory [36] to the length of the sequence $\mathit{x}$ and statistically determined outliers. When the chi-square value of the length of each sequence exceeds 99% of the ${\chi}^{2}$ distribution with one degree of freedom, the source code is regarded as an abnormal value. Source codes judged to be outliers were rejected from the training data using the significance probability (p-value) ${\chi}_{0.99}^{2}\left(1\right)=0.00016$. This means that source codes were rarely rejected.
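The outlier test on sequence lengths can be sketched with the standard library alone. In this illustration the anomaly score is the squared standardized deviation of each length, and the cutoff is passed in explicitly. The example call uses the upper 1% point of the ${\chi}^{2}$ distribution with one degree of freedom (about 6.63), which is an assumption and not necessarily the cutoff used in the experiment; the lengths are made-up data.

```python
from statistics import NormalDist, fmean, pstdev


def chi2_quantile_1dof(q):
    """Quantile of the chi-square distribution with one degree of freedom.

    Uses the identity chi2(1) = Z**2 for a standard normal Z.
    """
    return NormalDist().inv_cdf((1 + q) / 2) ** 2


def anomaly_scores(lengths):
    """Hotelling-style score: squared standardized deviation of each length."""
    mean = fmean(lengths)
    std = pstdev(lengths)  # assumes the lengths are not all identical
    return [((length - mean) / std) ** 2 for length in lengths]


def reject_outliers(lengths, threshold):
    """Keep the lengths whose anomaly score does not exceed the threshold."""
    scores = anomaly_scores(lengths)
    return [length for length, score in zip(lengths, scores) if score <= threshold]


lengths = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99, 100, 400]  # made-up data
kept = reject_outliers(lengths, threshold=chi2_quantile_1dof(0.99))
```

With these made-up lengths, only the sequence of length 400 exceeds the cutoff and is removed.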
CCM was constructed using TensorFlow (2.4.0) [37]. To train CCM, we set the batch size to 4, the number of hidden neurons to 256, and the number of epochs to 100 as hyperparameters. We selected categorical cross-entropy as the loss function. To prevent overfitting of CCM, we used the Adam optimizer with four parameters based on the recommendation of [38]: learning rate $\alpha =0.001$, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, and $\epsilon ={10}^{-8}$. As another countermeasure against overfitting, we set the dropout rate to 0.5 based on the recommendation of [39].
Table 3 shows the accuracy of the CCM trained using the correct codes of 32 programming tasks. Task ID is the Problem ID of 32 programming tasks in AOJ, respectively. Number of training data is the number of correct codes used for training, excluding duplicate correct codes. Number of training data shows that solution codes accumulated in AOJ are available for each programming task.
EOP was constructed using scikit-learn (0.23.2) [40]. Ten-fold cross-validation was used to evaluate the performance of EOP. The training data were divided into 10 parts, nine of which were used for training and one as test data. From this process, 10 models were constructed and the accuracy of each of them was calculated.
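The partitioning behind ten-fold cross-validation can be sketched in plain Python. This illustration splits sample indices in order without shuffling, which is an assumption; the function names are hypothetical and no classifier is trained here.

```python
def k_fold_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def mean_accuracy(fold_accuracies):
    """Average the accuracies of the k trained models."""
    return sum(fold_accuracies) / len(fold_accuracies)


splits = list(k_fold_splits(25, k=10))
```

Each sample appears in exactly one test fold, so every data point contributes once to the averaged test accuracy.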
Table 4 shows the accuracy of the trained EOP. Here, Task ID is the Problem ID of the 32 programming tasks. Number of training data is the number of features based on the editing information between the incorrect code and the corresponding correct code in the 32 programming tasks. We excluded duplicate data from number of training data. Training accuracy and test accuracy of EOP are averages of 10 models obtained using k-fold cross-validation.
Since the training accuracy and test accuracy are 80% or more in many programming tasks, the EOP is considered to have sufficient predictive performance for editing operations using the editing position and its token. This means that the EOP can likely be effectively used to correct incorrect code.
4.2. Evaluation
To verify the usefulness of the proposed model, we compared it with the conventional model without iterative trials. In the conventional model, only one trial of correction is performed using the correction candidates predicted by CCM. It is necessary to define metrics for evaluating the usefulness of the proposed model. We evaluated the performance of the proposed model by focusing on the correction accuracy of logic errors, the number of trials, and the execution time. We defined the correction accuracy as the ratio of the source codes in which all logic errors are corrected in the experimental data. We defined the number of corrections as the average number of corrections until the correct code is obtained in the experimental data. We defined the execution time as the time until the given source code becomes correct by the proposed model. However, if the model could not correct the given source code, it was not included in this result.
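The three metrics can be computed from per-code results as in the sketch below. The record layout (`corrected`, `corrections`, `seconds`) is hypothetical, and the sketch assumes that both the number of corrections and the execution time are averaged only over the codes that were corrected.

```python
def summarize(results):
    """Compute the evaluation metrics from per-code results.

    results: list of dicts with keys 'corrected', 'corrections', 'seconds'.
    """
    solved = [r for r in results if r["corrected"]]
    accuracy = len(solved) / len(results)
    if not solved:
        return {"accuracy": accuracy, "mean_corrections": None, "mean_seconds": None}
    return {
        "accuracy": accuracy,
        # Uncorrected codes are excluded from the two averages below.
        "mean_corrections": sum(r["corrections"] for r in solved) / len(solved),
        "mean_seconds": sum(r["seconds"] for r in solved) / len(solved),
    }


# Made-up results for four incorrect codes.
results = [
    {"corrected": True, "corrections": 1, "seconds": 0.5},
    {"corrected": True, "corrections": 3, "seconds": 1.5},
    {"corrected": False, "corrections": 30, "seconds": 9.0},
    {"corrected": True, "corrections": 2, "seconds": 1.0},
]
summary = summarize(results)
```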
Table 5 shows the experimental data used to evaluate the proposed model. Here, targets is the number of incorrect codes that contain logic errors for each programming task. We selected the target codes with an edit distance of five or less and iterated until the code was corrected. Moreover, we categorized the source codes by the number of tokens that cause logic errors. The proposed model tries to correct logic errors until the given code is corrected. If the source code cannot be edited using the proposed model, modifications may be repeated infinitely as long as there are correction candidates. To avoid this situation, we set a termination condition in the proposed model. The model terminates its trials when the number of correction candidates indicated by Algorithm 1 becomes 0 or the number of iterations of the model exceeds 30.
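The iterative procedure with this termination condition can be sketched as a simple loop. The four callables are hypothetical stand-ins for CCM-based localization, the EOP, the token-level editor, and the judge; the toy versions below only serve to make the loop executable.

```python
MAX_ITERATIONS = 30  # limit stated in the text


def debug_iteratively(tokens, list_candidates, predict_operation, apply_edit, passes_tests):
    """Repeat detect -> edit -> test until the code passes or a stop condition holds.

    Returns (tokens, corrected_flag, number_of_trials).
    """
    for trial in range(1, MAX_ITERATIONS + 1):
        candidates = list_candidates(tokens)
        if not candidates:                 # no correction candidates remain
            return tokens, False, trial - 1
        best = candidates[0]               # most suspicious candidate first
        operation = predict_operation(tokens, best)
        tokens = apply_edit(tokens, best, operation)
        if passes_tests(tokens):
            return tokens, True, trial
    return tokens, False, MAX_ITERATIONS


# Toy stand-ins: the "model" knows a reference answer and fixes one token per trial.
REFERENCE = ["x", "*", "x", "*", "x"]


def toy_candidates(tokens):
    return [i for i, (a, b) in enumerate(zip(tokens, REFERENCE)) if a != b]


def toy_operation(tokens, position):
    return "replace"


def toy_edit(tokens, position, operation):
    edited = list(tokens)
    edited[position] = REFERENCE[position]
    return edited


def toy_judge(tokens):
    return tokens == REFERENCE


fixed, corrected, trials = debug_iteratively(
    ["x", "+", "x", "+", "x"], toy_candidates, toy_operation, toy_edit, toy_judge
)
```

In the toy run, the code contains two wrong tokens, so the first trial fails the test and the second trial yields the correct code.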
5. Experimental Results and Discussion
5.1. Results
Table 6 shows the correction performance of the proposed model. The proposed model improves the correction accuracy of source code in all programming tasks. To ensure that the result was not due to statistical chance, statistical tests were performed on the correction accuracy of the conventional and proposed models. The calculated p-value is $6.27\times {10}^{-16}$, which satisfies the general significance level of p-value < 0.05, indicating that these results are not chance results. The average correction accuracy of the conventional model for each programming task was 13.90%. On the other hand, the average correction accuracy of the proposed model was 72.55%. The average correction accuracy of the proposed model is 58.64 percentage points higher than that of the conventional model without iterative trials. This shows that the accuracy of correcting logic errors can be improved substantially by using the proposed model that introduces iterative trials. This means that the proposed model was able to correct hidden logic errors that could not be detected in the first trial.
We compared the number of edit operations in the proposed model with the number of edit operations by the user. The number of edit operations by the user is the edit distance between the experimental data created by each user and the corresponding correct code. The average number of edit operations of the proposed model is slightly larger than that of the user. This means that the correction candidates indicated by the proposed model include correction candidates that do not need to be edited.
The experimental results show the average, minimum, and maximum execution time of the proposed model for each programming task. The average execution time for all programming tasks is less than 1.5 s. At the shortest, a given code can be corrected within 0.2 s. At the longest, a given code can be corrected in 10.07 s. These results show that the proposed model can debug logic errors in the source code within a reasonable timeframe.
Table 7 shows the correction accuracy for each number of logic errors in the source code for all programming tasks. The proposed model can correct all logic errors in the source code, even if there are multiple logic errors. This means that any logic errors that could not be detected in a first attempt were corrected in a later attempt. Therefore, this shows that the proposed model, which leverages iterative trials, is suitable as a debugging model for correcting multiple logic errors.
5.2. Limitations
These results show that the correction performance of the proposed model is high. However, focusing on detection performance and the number of trials, there is still room for improvement. We defined detection performance as the percentage of source codes in which a true logic error exists among the correction candidates obtained in the first trial. In the proposed model, the correction candidate most likely to be a logic error is selected and corrected. This means that if the top $k$ correction candidates include true logic errors, it will be easier to correct those errors.
Table 8 shows the detection performance of the proposed model. Task ID and Targets denote the ID of each programming task and the number of experimental data, respectively. The table reports the detection performance when the correction candidates are narrowed down to the top $k$; the case $k=\infty$ corresponds to all the correction candidates enumerated by the proposed model.
The detection performance of the proposed model is 95.54% when $k=\infty$. On the other hand, when the correction candidates are narrowed down to the top one, the true logic error can sometimes be missed. In other words, true logic errors are less likely to be included as the correction candidates are narrowed down to a smaller top $k$. The probabilities obtained from CCM are sufficient as a metric for detecting true logic errors in the source code. However, the probabilities may not be sufficient as a metric for selecting a single correction candidate. The correction candidates indicated by CCM are likely to be logic errors, but they are not always logic errors. To address this issue, one approach could be to narrow down the correction candidates by analyzing what kind of logic errors are likely to occur in each programming task.
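The top-$k$ detection performance defined above can be illustrated with a short sketch. It assumes that each source code comes with a list of candidate positions ranked from most to least suspicious, together with the set of true error positions. The data below are invented for illustration and are not taken from Table 8, and the name `detection_rate` is hypothetical.

```python
import math


def detection_rate(samples, k=math.inf):
    """Share of samples whose top-k candidates contain a true error position.

    Each sample is (ranked_candidates, true_error_positions); candidates are
    assumed to be ordered from most to least suspicious.
    """
    hits = 0
    for ranked, true_positions in samples:
        top = ranked if k == math.inf else ranked[:k]
        if any(pos in true_positions for pos in top):
            hits += 1
    return hits / len(samples)


# Invented toy data (token positions), not results from the experiment.
samples = [
    ([7, 3, 12], {7}),  # true error ranked first
    ([4, 9, 2], {2}),   # true error ranked third
    ([5, 1], {8}),      # true error never proposed
    ([6, 10], {10}),    # true error ranked second
]

for k in (1, 2, math.inf):
    print(k, detection_rate(samples, k))
```

On this toy data the rate grows from 0.25 at $k=1$ to 0.75 at $k=\infty$, mirroring the trade-off discussed above: a smaller $k$ gives fewer candidates to inspect but a higher chance of missing the true logic error.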
5.3. Use Cases
Figure 4 shows an example of logic error correction using the proposed model for a sample programming task. The selected programming task is to find and output the cube of a given integer. An example of incorrect code shown in the upper left of the figure outputs the square of the given integer. The logic error in this source code is “x*x” on the fifth line. Therefore, if “*x” is inserted in the source code, the source code will become correct.
The table on the upper right in Figure 4 lists the prediction results for one application of the proposed model to the incorrect code. The candidate with the lowest probability of occurrence for the next word is the position of “)”, and the proposed revision is “*”. This demonstrates that the model can correctly predict this logic error. From this information, the proposed model predicts “insert * between x and )”. The table at the bottom right of the figure shows the correction candidates when the proposed model is applied a second time to the source code that has already been corrected once. The candidate with the lowest probability of occurrence is again the position of “)”, and the proposed revision is “x”. From the obtained information, the overall prediction by the proposed model is thus “insert x between * and )”. Therefore, it is possible to correct all logic errors in this source code by applying the proposed model.
In contrast, for the conventional model without iterative trials, the source code is modified using only the correction candidates obtained from the first trial. In the first trial, “insert * between x and )” is performed, as in the case of the proposed model. However, since “x” is not present in the correction candidates in this first trial, not all logic errors in the source code can be corrected. Therefore, the conventional model can correct only some of the logic errors in the incorrect code.
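The difference between iterative and single-trial correction can be sketched on this cube example. In the sketch below, `propose_edit` is a hypothetical, hand-scripted stand-in for the learned models (it simply reproduces the two insertions described for Figure 4), and the test step evaluates the expression with Python's `eval` purely for illustration. None of this is the actual implementation of the proposed model.

```python
def passes_tests(tokens, cases=(2, 3, -4)):
    """Test step: does the expression compute the cube for every test input?"""
    expr = "".join(tokens)
    try:
        # eval is acceptable here only because the expression is a fixed toy input.
        return all(eval(expr, {"__builtins__": {}}, {"x": x}) == x ** 3
                   for x in cases)
    except SyntaxError:
        return False  # a half-corrected expression is still incorrect


def propose_edit(tokens):
    """Hypothetical stand-in for the learned models: one insertion per trial."""
    close = tokens.index(")")
    new_token = "*" if tokens[close - 1] == "x" else "x"
    return close, new_token  # insert new_token just before ")"


def repair(tokens, max_trials=5):
    """Repeat identify -> correct -> test until the tests pass or trials run out."""
    tokens = list(tokens)
    for trial in range(1, max_trials + 1):
        position, token = propose_edit(tokens)
        tokens.insert(position, token)
        if passes_tests(tokens):
            return tokens, trial
    return None, max_trials


incorrect = ["(", "x", "*", "x", ")"]
fixed, trials = repair(incorrect)
print("".join(fixed), trials)               # (x*x*x) 2
print(repair(incorrect, max_trials=1)[0])   # None
```

With `max_trials=1` the loop stops after inserting “*” and fails, which corresponds to the behavior described for the conventional model; allowing a second trial inserts “x” and yields code that passes the tests.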
The aim of debugging support in programming education is to provide learners with hints for correcting their source codes. However, giving too many hints to the learner can be problematic. Yi et al. [41] reported that novice programmers do not know how to modify programs efficiently using hints for correcting errors in the source code. This means that the hints must be adjusted according to the skills of the learner. Therefore, we suggest that it is possible to help individuals learn how to debug by showing not only correction candidates but also their editing operations.
In the proposed model, the source code is automatically modified based on the results obtained from each machine learning model. This modification step can instead be left to the user. Gradual hints about the correction candidates and their editing operations, as produced by the proposed model, can give learners opportunities to think. Therefore, the quality of hints can be controlled by displaying information obtained from our model according to the programming proficiency of the learner.
Novice programmers need debugging support in many situations. In some cases, they do not know what or how to debug when they receive the verdict that the source code is incorrect. The proposed model can show novice programmers whether the source code can be corrected by iterative modification. When the source code can be corrected, the correction candidates, the editing operations, and the number of edits obtained in each trial can be presented as hints, which they can then use to correct the source code. On the other hand, debugging support for intermediate programmers can be achieved by giving partial hints obtained from the proposed model. For example, it may be sufficient to disclose only the position of the logic error as a hint, so that the intermediate learner has to think about the editing operation for the given position. Instructors and other expert programmers could, for example, be provided with feedback from the proposed model with all details at the very beginning, and then use it to help students at their discretion, depending on their assessment of individual needs. To apply the proposed model and the obtained feedback in educational settings, we should carefully consider learning efficiency. Generally, the feedback should not be as direct and immediate as the support offered by conventional IDEs, so that learners have the chance to think and try to resolve the problem by themselves. The degree of such support should be controlled by learners or instructors according to their experience and learning modes.
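The idea of adjusting hint granularity can be sketched as a simple filter over the information collected in each trial. The level names, the `Candidate` fields, and the message wording below are all hypothetical choices made for illustration, and the candidates reuse the two insertions from the Figure 4 example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    line: int        # line of the suspected logic error
    operation: str   # editing operation, e.g. "insert"
    token: str       # token involved in the editing operation


def build_hint(candidates, level):
    """Return hint messages whose detail depends on the (hypothetical) level."""
    if level == "intermediate":
        # Position only: the learner works out the editing operation.
        return [f"Check line {c.line}." for c in candidates]
    if level in ("novice", "instructor"):
        # Full detail: position, editing operation, and number of edits.
        hints = [f"Line {c.line}: {c.operation} '{c.token}'." for c in candidates]
        hints.append(f"Edits needed: {len(candidates)}.")
        return hints
    raise ValueError(f"unknown level: {level}")


# One candidate per trial, as in the cube example.
trial_candidates = [Candidate(5, "insert", "*"), Candidate(5, "insert", "x")]
for level in ("novice", "intermediate"):
    print(level, build_hint(trial_candidates, level))
```

In practice, the choice of level could be left to the learner or the instructor, in line with the suggestion that the degree of support should be controllable.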
The proposed model can also be used for software development. Generally, software consists of modules, packages, and subroutines. Implementations of these subroutines carry out numerical calculations and algorithm-level implementations to meet the specifications of a given task. It is necessary to use testing to verify whether the implemented subroutine is correct for the corresponding specification. If there is a programming task with the same or similar specifications as the subroutine implemented in a certain educational system, our model can be employed to modify the source code.
6. Conclusions
In this paper, we proposed a debugging model for correcting logic errors in a given source code. The model can correct multiple logic errors by repeatedly identifying and correcting errors and testing the source code. In the experiment, to verify the advantage of the proposed model, 32 programming tasks and the corresponding solution codes in an online judge system were used. A comparison of the proposed model with a model without iterative trials showed that the correction accuracy of the proposed model was 58.64 percentage points higher on average. In addition, this model can suggest operations for fixing code depending on the features around the detection location in the process of the iterative trials. The proposed model can also control the granularity of hints according to the proficiency of programmers and learners. Therefore, the proposed model takes into account educational effectiveness and can be applied to e-learning systems that support education not only in programming, but also in related subjects.
In future work, to improve the correction performance of our model, we would like to analyze what logic errors are likely to occur in each programming task. By comparing the results of this analysis with the correction candidates, we expect to increase the likelihood that the correction candidates contain true logic errors.
Author Contributions
Conceptualization, T.M., Y.W. and K.N.; methodology, T.M. and Y.W.; software, T.M.; validation, T.M.; formal analysis, T.M.; investigation, T.M. and Y.W.; resources, T.M.; data curation, T.M.; writing—original draft preparation, T.M.; writing—review and editing, T.M., Y.W. and K.N.; visualization, T.M.; supervision, Y.W. and K.N.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the Japan Society for the Promotion of Science (JSPS) under KAKENHI grant number 19K12252.
Data Availability Statement
Conflicts of Interest
The authors declare no conflict of interest.
References
1. Staubitz, T.; Klement, H.; Renz, J.; Teusner, R.; Meinel, C. Towards practical programming exercises and automated assessment in Massive Open Online Courses. In Proceedings of the 2015 IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE), Zhuhai, China, 10–12 December 2015; pp. 23–30.
2. Crow, T.; Luxton-Reilly, A.; Wuensche, B. Intelligent tutoring systems for programming education: A systematic review. In Proceedings of the 20th Australasian Computing Education Conference, Brisbane, Australia, 30 January–2 February 2018; pp. 53–62.
3. Wasik, S.; Antczak, M.; Badura, J.; Laskowski, A.; Sternal, T. A survey on online judge systems and their applications. ACM Comput. Surv. (CSUR) 2018, 51, 1–34.
4. Watanobe, Y.; Intisar, C.; Cortez, R.; Vazhenin, A. Next-Generation Programming Learning Platform: Architecture and Challenges. In SHS Web of Conferences; EDP Sciences: Les Ulis, France, 2020; Volume 77, p. 01004.
5. Watanobe, Y. Aizu Online Judge. Available online: https://onlinejudge.u-aizu.ac.jp/ (accessed on 30 March 2021).
6. Luxton-Reilly, A.; Albluwi, I.; Becker, B.A.; Giannakos, M.; Kumar, A.N.; Ott, L.; Paterson, J.; Scott, M.J.; Sheard, J.; Szabo, C. Introductory programming: A systematic literature review. In Proceedings of the 23rd Annual ACM Conference on Innovation and Technology in Computer Science Education, Larnaca, Cyprus, 2–4 July 2018; pp. 55–106.
7. Becker, B.A.; Goslin, K.; Glanville, G. The effects of enhanced compiler error messages on a syntax error debugging test. In Proceedings of the 49th ACM Technical Symposium on Computer Science Education, Baltimore, MD, USA, 21–24 February 2018; pp. 640–645.
8. Denny, P.; Luxton-Reilly, A.; Carpenter, D. Enhancing syntax error messages appears ineffectual. In Proceedings of the 2014 Conference on Innovation & Technology in Computer Science Education, Uppsala, Sweden, 22–24 June 2014; pp. 273–278.
9. Yoshizawa, Y.; Watanobe, Y. Logic Error Detection System based on Structure Pattern and Error Degree. Adv. Sci. Technol. Eng. Syst. J. 2019, 4, 1–15.
10. Teshima, Y.; Watanobe, Y. Bug detection based on LSTM networks and solution codes. In Proceedings of the 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Miyazaki, Japan, 7–10 October 2018; pp. 3541–3546.
11. Rahman, M.M.; Watanobe, Y.; Nakamura, K. Source Code Assessment and Classification Based on Estimated Error Probability Using Attentive LSTM Language Model and Its Application in Programming Education. Appl. Sci. 2020, 10, 2973.
12. Rahman, M.M.; Watanobe, Y.; Nakamura, K. A Neural Network Based Intelligent Support Model for Program Code Completion. Sci. Program. 2020, 2020.
13. Rahman, M.M.; Watanobe, Y.; Nakamura, K. A Bidirectional LSTM Language Model for Code Evaluation and Repair. Symmetry 2021, 13, 247.
14. Terada, K.; Watanobe, Y. Code Completion for Programming Education based on Recurrent Neural Network. In Proceedings of the 2019 IEEE 11th International Workshop on Computational Intelligence and Applications (IWCIA), Hiroshima, Japan, 9–10 November 2019; pp. 109–114.
15. Qiu, X.; Sun, T.; Xu, Y.; Shao, Y.; Dai, N.; Huang, X. Pre-trained models for natural language processing: A survey. Sci. China Technol. Sci. 2020, 63, 1872–1897.
16. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165.
17. Allamanis, M.; Barr, E.T.; Devanbu, P.; Sutton, C. A survey of machine learning for big code and naturalness. ACM Comput. Surv. (CSUR) 2018, 51, 1–37.
18. Visual Studio IntelliCode. Available online: https://visualstudio.microsoft.com/services/intellicode/ (accessed on 30 March 2021).
19. Cao, H.; Meng, Y.; Shi, J.; Li, L.; Liao, T.; Zhao, C. A Survey on Automatic Bug Fixing. In Proceedings of the 2020 6th International Symposium on System and Software Reliability (ISSSR), Chengdu, China, 24–25 October 2020; pp. 122–131.
20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
21. Drain, D.; Wu, C.; Svyatkovskiy, A.; Sundaresan, N. Generating Bug-Fixes Using Pretrained Transformers. arXiv 2021, arXiv:2104.07896.
22. Ueda, Y.; Ishio, T.; Ihara, A.; Matsumoto, K. Devreplay: Automatic repair with editable fix pattern. arXiv 2020, arXiv:2005.11040.
23. Gupta, R.; Pal, S.; Kanade, A.; Shevade, S. Deepfix: Fixing common c language errors by deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017.
24. Hajipour, H.; Bhattacharya, A.; Fritz, M. SampleFix: Learning to Correct Programs by Sampling Diverse Fixes. arXiv 2019, arXiv:1906.10502.
25. Gupta, R.; Kanade, A.; Shevade, S. Neural Attribution for Semantic Bug-Localization in Student Programs. In Proceedings of the 2019 Conference on Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 11861–11871.
26. Vasic, M.; Kanade, A.; Maniatis, P.; Bieber, D.; Singh, R. Neural program repair by jointly learning to localize and repair. arXiv 2019, arXiv:1904.01720.
27. Wong, W.E.; Gao, R.; Li, Y.; Abreu, R.; Wotawa, F. A survey on software fault localization. IEEE Trans. Softw. Eng. 2016, 42, 707–740.
28. Lee, J.; Song, D.; So, S.; Oh, H. Automatic diagnosis and correction of logical errors for functional programming assignments. Proc. ACM Program. Lang. 2018, 2, 1–30.
29. Matsumoto, T.; Watanobe, Y. Towards hybrid intelligence for logic error detection. In Advancing Technology Industrialization through Intelligent Software Methodologies, Tools and Techniques, Proceedings of the 18th International Conference on New Trends in Intelligent Software Methodologies, Tools and Techniques (SoMeT_19), Kuching, Malaysia, 23–25 September 2019; IOS Press: Amsterdam, The Netherlands, 2019; Volume 318, p. 120.
30. Matsumoto, T.; Watanobe, Y. Logic Error Detection Algorithm Based on RNN with Threshold Selection. In Knowledge Innovation through Intelligent Software Methodologies, Tools and Techniques, Proceedings of the 19th International Conference on New Trends in Intelligent Software Methodologies, Tools and Techniques (SoMeT_20), Kitakyushu, Japan, 22–24 September 2020; IOS Press: Amsterdam, The Netherlands, 2020; Volume 327, p. 76.
31. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
32. Bengio, Y.; Ducharme, R.; Vincent, P.; Jauvin, C. A neural probabilistic language model. J. Mach. Learn. Res. 2003, 3, 1137–1155.
33. Vapnik, V. Pattern recognition using generalized portrait method. Autom. Remote Control 1963, 24, 774–780.
34. Watanobe, Y. Development and Operation of an Online Judge System. IPSJ Mag. 2015, 56, 998–1005.
35. Aizu Online Judge. Developers Site (API). Available online: http://developers.u-aizu.ac.jp/index (accessed on 30 March 2021).
36. Hotelling, H. The Generalization of Student’s Ratio. Ann. Math. Stat. 1931, 2, 360–378.
37. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: https://www.tensorflow.org/ (accessed on 30 March 2021).
38. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
39. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
40. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
41. Yi, J.; Ahmed, U.Z.; Karkare, A.; Tan, S.H.; Roychoudhury, A. A feasibility study of using automated program repair for introductory programming assignments. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, Paderborn, Germany, 4–8 September 2017; pp. 740–751.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).