Article

Predicting GPA of University Students with Supervised Regression Machine Learning Models

Faculty of Management Science and Informatics, University of Zilina, Univerzitna 8215/1, 010 26 Zilina, Slovakia
*
Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(17), 8403; https://doi.org/10.3390/app12178403
Submission received: 13 January 2022 / Revised: 31 January 2022 / Accepted: 2 February 2022 / Published: 23 August 2022
(This article belongs to the Special Issue Data Analytics and Machine Learning in Education)

Abstract

The paper deals with predicting grade point average (GPA) with supervised machine learning models. Based on the literature review, we divide the factors into three groups: psychological, sociological and study factors. Data from the questionnaire are evaluated using statistical analysis. We use confirmatory data analysis, comparing the answers of men and women, of university students coming from grammar schools versus secondary vocational schools, and of students grouped by average grade. The differences between groups are tested with the Shapiro–Wilk test and the Mann–Whitney U-test. We identify the factors influencing the GPA through correlation analysis, using the Pearson test and ANOVA. Based on the performed analysis, factors that show a statistically significant association with the GPA are identified. Subsequently, we implement supervised machine learning models. We create 10 prediction models using linear regression, decision trees and random forests. The models predict the GPA based on independent variables. Based on the MAPE metric on the five validation sets in cross-validation, the best generalization accuracy is achieved by a random forest model, whose average MAPE is 11.13%. Therefore, we recommend a random forest as a starting model for modeling student results.

1. Introduction

The abilities and knowledge of each student are evaluated during their studies with grades or percentages that indicate how well the student is performing. This evaluation gives us the GPA, and according to it we consider students to be good, bad, talented or lazy. Students’ learning outcomes and achievements during their university studies can be influenced by many factors, such as learning patterns, talent, interpersonal relationships, motivation and many others. The student’s results can subsequently affect life after graduation and can even help with finding a job. However, the academic results a student achieved during their studies are not always decisive; for many employers, experience and practical skills are much more important. Not surprisingly, many people with worse academic results or without a university degree are very successful in their lives.
The school system in Slovakia has several levels. The first level is pre-school education and concerns children aged 5. This is followed by elementary school (primary education). After completing 9 years of primary school, students choose a secondary school. There are three types of secondary schools in Slovakia: grammar school, secondary professional school (SOS) and vocational school (SOU). Vocational schools are mostly intended for students who do not achieve good results in primary school. Other students choose between a professional school and a grammar school. If students are clear about their future profession, they can choose a specific field and attend a secondary professional school. Since most 15-year-old students do not have a clear idea of their future profession, they opt for a grammar school. Grammar school is a type of secondary school where general knowledge and skills from various fields, such as languages, mathematics, natural sciences and humanities, are developed. The best students from primary school usually continue in grammar schools. In secondary schools, students are graded 1–5, with 1 being the best and 5 the worst, and students are compared on the basis of their average grade across subjects. After graduating from high school, students can continue at a university in Slovakia or abroad; many Slovak students continue their studies at a university in the Czech Republic. Slovak higher education has three levels: bachelor’s, master’s and doctoral studies. The evaluation system at Slovak universities is based on the ECTS scale, which ranges from grade A (best) to grade FX (unsuccessful). The student receives credits for successful completion of a course (A–E). To obtain a degree, a student needs to obtain a certain number of credits (e.g., 180 credits for a bachelor’s degree).
At universities, the evaluation of students is carried out according to the GPA, which is based on grades A–E as follows: A = 1.0, B = 1.5, C = 2.0, D = 2.5, E = 3.0, FX = 4.0.
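This conversion can be expressed as a small helper function (an illustrative sketch; the dictionary and function names are ours, not part of the study):

```python
# Grade-point values taken from the mapping above (A = 1.0 ... FX = 4.0).
GRADE_POINTS = {"A": 1.0, "B": 1.5, "C": 2.0, "D": 2.5, "E": 3.0, "FX": 4.0}

def gpa(grades):
    """Return the GPA as the arithmetic mean of the grade points."""
    return sum(GRADE_POINTS[g] for g in grades) / len(grades)

print(gpa(["A", "B", "C"]))  # (1.0 + 1.5 + 2.0) / 3 = 1.5
```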
Our research had two main objectives:
  • The first main objective was to identify the factors that influence students’ learning results. With the findings from statistical analysis, schools could identify what makes some students more successful than others. These findings could be used to increase the success level of all students in the faculty. To meet this goal, we defined 40 hypotheses and evaluated them using statistical hypothesis testing.
  • The second main objective was to predict the GPA of university students. The application of these models could be useful for the management of faculties or universities, e.g., in the process of allocating elective courses, in the process of admission to universities or in the process of identifying excellent students.
It is relatively common in the field of education to examine the factors that influence learning outcomes through correlation analysis. However, although Becker et al. [1] point out the importance of predictive analytics in the field of education, predicting learning outcomes is not entirely common in this field. Below are some studies that deal with prediction in the field of education.
Linear regression models are common applications of predictive analytics in the field of education. The study [2], for example, is dedicated to predicting students’ academic success. For this purpose, the authors use a multiple linear regression model, i.e., a model with multiple independent variables. Although the model identified factors that are important in predicting academic success (e.g., stress, time pressure, classroom communication), it was able to explain only 16% of the variability of the dependent variable. Another way to apply linear regression in education is reported by Esmat and Pitts [3]. The authors predict the success of students in an undergraduate exercise science program. To accomplish this, they create a multiple linear regression model quantified using the ordinary least squares method. Based on the results of the regression analysis, they identify the factors that are the best predictors in the required major courses. They see the potential of their procedures in “when examining methods to improve retention of students, progression, minimizing repeat attempts at courses, and improving graduation rates” [3]. Some authors use hierarchical regression models. Huberts et al. [4] present a three-level hierarchical regression model for student grades to predict student success or failure. To estimate the parameters of their model they use Bayesian estimation; more specifically, Markov Chain Monte Carlo (MCMC) methods based on the Gibbs sampling procedure. To evaluate their model, they compare it with a benchmark model, a simple one-level linear regression model. Krumrei-Mancuso et al. [5] construct hierarchical linear regression models to predict first-year college student success via psychological factors. Using their models, they investigate the effect of the CLEI scales on predicting GPA. Tinajero et al. [6] predict the academic success of Spanish university students using hierarchical regression models. As the independent factor they chose perceived social support. The dependent variables were: GPA for the first year at university, GPA for the third year and the change in GPA over time.
Although machine learning applications in this field are also known (e.g., Bir and Ahn [7] use logistic regression models to identify factors that influence students’ persistence and make it possible to predict it), there are not many such studies; most focus on statistical methods, and limited research exists on applying machine learning methods in education. In our research we construct machine learning models to predict GPA.
The paper is divided into five sections. In the first section, we perform an exhaustive analysis of studies and discuss what factors affect the learning outcomes of students. In the second section, we present a short theoretical background on the methods we use later in the paper. Section 3 describes the data and presents the results: we identify factors influencing the GPA and present our prediction models, which are able to predict a student’s GPA. Section 4 discusses the results and Section 5 summarizes the paper.

1.1. Literature Review

One of the big factors influencing a student’s success can be motivation and whether the student has a talent for learning. While some students may not need to make a great effort to achieve good academic results, others need to work hard for them. However, besides these, there are many other factors which influence academic success. In this section we present studies related to the factors which influence academic performance.

1.1.1. Psychological Factors of Academic Success

We have identified several research studies that consider psychological aspects to be significant in student success. The authors [5,8,9] state that the decisive factor influencing a student’s success is motivation and examine whether students are motivated to gain new experiences and to become sufficiently qualified. The authors focus on students who are still motivated to learn and examine the factors that are part of their motivational strategies. The authors claim that motivation, support and access to education are among the most important factors influencing studies. They also study whether students’ academic goals and their satisfaction with life are related.
In their study, Han et al. [10] created two models to find out what affects a student’s grades. One model involved only academic training and the other one also included non-cognitive factors such as motivation and a sense of belonging. The second model described the success of students better. Based on the results, one can claim that non-cognitive factors also affect the student’s success.
Studies by Basto et al. [11] and Oreški et al. [12] examined what causes the academic failure of students. The authors found that reluctance to study, moving, lack of sleep, age or status have a major impact.
Several studies [13,14] have shown that metacognitive strategies (thought processes) have a significant impact on students’ academic performance. Other factors which affect student performance are social interaction with other participants or the online environment. Moreover, setting goals has indirect impact on academic achievement. Authors also confirmed that motivation and student satisfaction have a positive connection with student results.
In a study by Burger et al. [15], successful students were mainly associated with motivated balance and effective study behavior. This study confirmed the importance of understanding and accurately solving the problem of student success.

1.1.2. Study Factors of Academic Success

According to a study by Novaková et al. [16], there is a relationship between the percentage of students’ participation in lectures throughout the semester, the results of the written part of the exam and the overall results of the exam.
The authors Shulruf et al. [17] and Bir and Ahn [7] examined the influence of school factors on student success. They found that organizational factors (the way students are taught) have an impact on success. Other important factors were the type of secondary education, the ability to cope with academic work and satisfaction with academic life. The authors sought to improve students’ persistence and academic performance. The same issue was addressed in the study by Sustekova et al. [18], which examined the impact of the type of secondary school on university results in the subject of computer science. The authors found that the type of high school affects students’ results.
In a study by Esmat et al. [3], the authors studied student success according to how a student progresses in their studies and stays in school. The aim of the study was to examine the admission process for the university program. They examined the students over a period of six years. Finally, the authors found that students who attended preparatory courses before the start of their studies had better prospects of succeeding during their studies than students who had not. A similar issue was investigated in a study by Oppenheimer et al. [19], whose authors found that students who completed summer preparatory programs had higher success rates than those who did not.
Mitra et al. [20] examined a set of factors that are the basis for success. The factors included learning style and learning analytics along with demographic and academic backgrounds. Predicting student success based on these factors had 95% accuracy.
Bou-Sospedra et al. [21] examined in their study aspects of students’ effective learning from the perspective of students, teachers and the family. Every group preferred a different learning style to increase student success. The relationship between different learning strategies and the student’s success in the exam was also addressed in a study by Nettekoven et al. [22]. The authors found that success is influenced by quantitative (number of solved exercises) but also qualitative (used teaching materials) factors.
According to Pechac and Slantcheva-Durst [23], coaching is a promising approach to student support and is also linked to student success. The authors examined specific factors of coaching. The study by Xhomar [24] dealt with the lecturer’s support and individual work of the student and their influence on academic results. The paper states that the individual work of the student also influences their success, while the work of the lecturer does not.
In a study by Moravec et al. [25] the authors investigated the impact of the use of e-learning teaching tools on student outcomes. The results of students who had an e-learning tool available were compared with those who did not. The authors found that the use of such tools improves student outcomes.
In a study by Huberts et al. [4] authors predicted student success based on academic results at high school. They found that the most important factors influencing success were study materials, attendance and education of parents.
Goegan and Daniels [26] examined students’ academic achievement, namely average grades, knowledge and skills and overall satisfaction. They found that students’ academic abilities had an impact on academic averages but not on overall satisfaction.
Gurr et al. [27] dealt with the creation of successful schools. They studied how leaders influence the development of the school. The authors state that the way the school is run is an important factor for the success of students.

1.1.3. Sociological Factors of Academic Success

Veselina et al. [28] state that an important factor, which influences and can increase success, is collaboration and an effective learning environment. They also state that competences that students acquire in the areas of communication and cooperation are important for students.
Oreški et al. [12] state that the academic success, failure and early school leaving of current students and graduates, and the age, status and position of students at enrollment proved to be the most important factors influencing students’ success.
In a study by Ackerman-Barger et al. [29], the authors describe a model of collective influence to increase students’ academic achievement. As part of the study, workshops were organized where students had the opportunity to collaborate with other academic organizations, colleagues and stakeholders. These workshops included active learning exercises, expert lectures, group discussion and structured event planning.
Dam [30] investigated the role of the family in student success. Students with no family problems who are supported by their families were compared with students who have family problems. The authors found that the success of students without family problems is higher. Based on these results, one can claim that the family influences students’ success during their studies.
According to a study by Tinajero et al. [6], social support is a key factor influencing the academic performance of university students. Data were obtained from students during the first and third years of study. The study examined students’ perceptions of social support and their academic results.
A study by Schmidt [31] says that friendship is an important factor in the study. Creating study groups is a good way to strengthen and expand the learning process. The study also discusses creating study rooms and spaces where students would have the opportunity to get to know each other, develop relationships and create a group identity. These results are also confirmed by a study by Bipp et al. [32], where the authors report that antisocial students achieve worse learning outcomes and have a higher risk of dropping out.
According to a study by Skendzic et al. [33] it is possible to determine the influence of social networks such as Facebook on student evaluation and results. The authors state that there is a negative correlation coefficient between the time spent on the social network and the student’s success.
Marbouti et al. [34] describe in their study the factors influencing the success of computer engineering students. The students’ average and factors such as dormitory life, form of study and degree of study were examined. Another important factor was whether the student works while studying. Surprisingly, the authors found that working students might have a better average.
Anderson et al. [35] describe in their study whether a transfer to another school has an effect on a student’s success. The aim was to ensure a successful transition to university study programs. They also found that it was beneficial for students to participate in solving real problems and learn to work in teams.
A study by Gansemer-Topf et al. [36] examined the success of weaker students. The authors found that children for whom at least one parent has a degree, women and people who are socially and academically involved have a better chance of obtaining a degree.
In his study, Aydin [2] also examines, in addition to study factors, emotional, social and cognitive development. He found that students’ success is influenced by stress, time constraints and classroom communication.
The aim of the study by Nunez et al. [37] was to identify the factors contributing to the success of students and analyze their academic results from the first year at the school.

2. Materials and Methods

Machine learning is a research discipline in the field of artificial intelligence. Depending on the method of learning, we distinguish between supervised learning and unsupervised learning. As we focused on supervised learning in this research, we briefly present the supervised methods we later used to create prediction models.

2.1. Multiple Linear Regression

Linear regression is a supervised regression method for predicting a quantitative response. The main goal of linear regression is to describe the dependence between variables [38]. In our study, we used multiple linear regression. Multiple linear regression, which is an extension of simple linear regression, models a dependent variable Y that depends not on one but on several independent variables X. We therefore extended the simple linear regression model by several variables. Let us have p different regressors; then the multiple linear regression model is defined as:
Y = β_0 + β_1 X_1 + β_2 X_2 + … + β_p X_p
where X_j represents the value of the jth predictor and β_j its coefficient. We interpret the value of a coefficient β_j as the expected change in Y for a unit increase in X_j while all other variables remain unchanged. The residual sum of squares (RSS), which is used to optimize the model, is defined as
RSS = Σ_{i=1}^{n} (y_i − ŷ_i)^2
where ŷ_i is the predicted value and y_i the actual value of the dependent variable for the ith observation. We find the values of the coefficients β_0, β_1, …, β_p using the least squares method so that the RSS is minimal [38].
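As a minimal sketch of this least squares estimation (using synthetic data, not the study’s questionnaire data), the coefficients can be recovered with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 observations, p = 3 regressors
beta_true = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ beta_true + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so the first coefficient plays the role of the intercept.
X1 = np.column_stack([np.ones(len(X)), X])

# Least squares minimizes the residual sum of squares.
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat.round(2))
```

The estimated coefficients should be close to the intercept 1.0 and the true coefficients used to generate the data.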

2.2. Decision Trees

A decision tree is a universal method of supervised machine learning. Decision trees can be used for regression as well as classification problems; in other words, they can model both numerical and categorical dependent variables [39]. The principle of this method is a nonlinear division of the data space using splits. The internal nodes of the tree contain a condition that splits the data into further nodes. At the bottom of the tree there are leaf nodes that contain the response, i.e., the value of the dependent variable. There are several algorithms for constructing decision trees, e.g., the CART, C4.5, SPRINT, ID3 or SLIQ algorithms [40]. The decision tree has many advantages, e.g., the selection of significant independent variables is performed automatically, i.e., unimportant variables do not affect the result. Multicollinearity also has no effect on the quality of the output.
One should bear in mind that the decision tree should not be too large; otherwise, overfitting can occur. In other words, the tree would have a high accuracy on the training set but a low accuracy on the test data. We can prevent overfitting by limiting the depth of the decision tree or by pruning it [38].
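The effect of limiting tree depth can be illustrated with a short sketch (scikit-learn on synthetic data; the dataset and parameters are illustrative, not those used in the study):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)   # noisy target

# An unconstrained tree grows until every leaf is pure and memorizes the noise.
deep = DecisionTreeRegressor(random_state=0).fit(X, y)

# Limiting the depth is one way to prevent overfitting.
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# The deep tree's near-perfect training score is a symptom of overfitting,
# not of better generalization.
print(deep.score(X, y), shallow.score(X, y))
```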

2.3. Random Forest

Like decision trees, random forests can be used for classification as well as for regression problems. Random forests eliminate some problems of decision trees, e.g., tree instability. According to Breiman [41], a random forest is defined as a set of trees T_1, T_2, …, T_S, whose regression (or classification) functions can be expressed in the form
{ d(x, Θ_k), k = 1, …, S }
where x is the vector of predictor values, Θ_1, …, Θ_S are independent, identically distributed random vectors and S is the number of trees in the forest [42].
As stated before, random forests contain improvements over decision trees: the training data for each tree are a bootstrap sample drawn from the whole dataset. Observations included in a given bootstrap sample are used to build that tree; observations left out form the so-called out-of-bag set, which provides an estimate of the generalization error (out-of-bag estimates). The accuracy of random forests is increased by letting the trees grow to a great depth without pruning while keeping the variance tolerable by combining the results of the trees. To reduce the correlation between the trees and to avoid overfitting, a random subset of m_0 of the M available predictors is drawn at each split; the tree then looks for the best split only among these m_0 variables. Another advantage of random forests is that they also work on smaller datasets, as well as on sets that contain many predictors. Moreover, random forests are quite easy to learn and to tune. They can be used to solve many problems, such as classification and prediction, measuring the significance of variables and their effect on the prediction, clustering or the detection of outliers [42].
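A random forest regressor evaluated with 5-fold cross-validation and the MAPE metric, the evaluation setup used later in the paper, can be sketched as follows; the data here are synthetic and the hyperparameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 5))
y = 2.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=150)
y = np.abs(y) + 1.0                  # keep targets positive so MAPE is well defined

forest = RandomForestRegressor(n_estimators=200, random_state=0)

# scikit-learn returns negated errors, so MAPE = -score.
scores = cross_val_score(forest, X, y, cv=5,
                         scoring="neg_mean_absolute_percentage_error")
print(f"average MAPE over 5 folds: {-scores.mean():.2%}")
```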

3. Results

3.1. Data

To collect data, a questionnaire was created. The main objective of the questionnaire was to find out what factors influence the success of students. The reason for that was to identify what could help students achieve a better GPA. The respondents for our research were the students of the Faculty of Management Science and Informatics at the University of Zilina. We focused on third-year students in the informatics study program, who already had experience with studying at the faculty. Students answered questions about their learning outcomes, satisfaction with their studies and various questions focused on factors such as the number of hours of sleep, the method of learning, extracurricular activities, etc. The structure of the questionnaire was as follows:
  • Demographic questions: The first part contained questions regarding basic information about each student, e.g., age, gender, in which year they are currently, whether they have completed their studies, field of study, etc.
  • Psychological questions: The second part focused on psychological factors that could affect the GPA of students. These questions mainly concerned motivation, e.g., why the student decided to study and why they chose the given school, faculty and field of study. We also tried to find out whether students enjoy their field of study and whether they are satisfied with their choice.
  • Study questions: We included questions about the students’ average grade at high school and their average grade at college in completed years. This part also included questions such as whether the student studies continuously, what study materials they use to study for an exam or how often they attend lectures and seminars. There were also questions about students’ views on online learning.
  • Sociological questions: This part contained questions related to the student’s socialization, e.g., how many siblings students have, how many members their household has, whether they study with their classmates or alone and whether they prefer working in groups in the seminars.
  • Questions about the faculty: The last part focused on factors related to the faculty. There were questions about what grades the student had from specific subjects, but also questions about their opinion on the study. In addition, this section included questions about watching videos on YouTube to help students in their studies.
Data collection was performed through Google Forms and lasted approximately two weeks in January 2021. Students were addressed via the social network Facebook. The questionnaire was sent to Facebook groups, which included current students of the faculty but also graduates. A total of 79 responses were recorded.

3.2. Exploratory Data Analysis

Most respondents were 23 years old. As for the structure, 71% of respondents were men and 29% were women. Since the questionnaire was intended mainly for students in the field of informatics, 91% of the respondents were students of informatics and only a few were students of management and computer engineering. Surprisingly, 57% of respondents did not live in a dormitory.

3.2.1. Main Findings from the Exploratory Data Analysis

Findings from the psychological part:
-
Most students decided to study at university mainly because of employment and better earnings in future professions.
-
Most students chose their field of study at the faculty due to job application and good earnings.
-
In total, 24% of respondents fully agreed with the statement “My field of study is fun and fulfilling to me” and 44% tended to agree. Altogether, only 9% disagreed.
-
As for the question of whether a student would choose their field of study again, most students (78% in total) would choose the same field, while only 22% would choose another field. The latter may be because the field of study is not enjoyable for them, or they find it too difficult.
Findings from the sociological part:
-
Two-fifths of respondents (40%) spend two to three hours a day on social networks and 23% spend one to two hours a day. In total, 86% of students spend between one and four hours a day on social networks.
-
Among students, the most popular leisure activity is watching movies and TV series (80% of students do it). More than half of the students said they play computer games and read books in their leisure time. Less popular activities include playing board games or engaging in group sports. Surprisingly, 48% of respondents play individual sports. Only 4% of respondents said they did not have any free time.
Findings from the study section:
-
In total, 51% of the surveyed students said they study only before the final exam or midterm exam. Furthermore, 47% stated they study continuously throughout the semester. Only 2% of the surveyed students stated they do not study at all.
-
The study materials that students most often use to obtain the best understanding of the topic are materials from previous years (87%), their own notes or materials from classmates. Students also often use the internet or YouTube videos to study. Only 35% of students use teacher materials and only 20% use scripts and textbooks.
-
The participation of students in seminars is much higher than participation in lectures. While 77% of respondents always attend all seminars, only 23% always attend all lectures. As many as 53% of respondents attend only lectures of important subjects.
Findings from the distance education section:
-
During the pandemic, students and teachers have had to get used to teaching online. The following findings relate to distance learning:
In total, 45% of students found distance learning more comfortable.
In total, 52% of respondents had approximately the same grades during distance learning; 20% improved slightly and 15% slightly worsened.
As many as 90% of students considered the possibility of recording seminars and lectures to be a great advantage of online study.
Most students (82%) lacked social contact with classmates during online studying.
Another big disadvantage was that 62% of respondents were not able to concentrate at home.

3.2.2. Main Findings from the Structured Exploratory Analysis

Subsequently, a structured exploratory analysis was performed. The data were divided into several groups. The first division was the division of dataset to men and women. Next, we divided data into groups according to their average grade. Finally, we performed a division of students between grammar school graduates and secondary vocational school graduates.
Findings from the psychological part:
-
Secondary vocational school students cited the acquisition of new knowledge as a reason for studying at a university more often (81% of respondents) than grammar school students (64% of respondents).
-
In total, 73% of men study due to new knowledge and 86% of men study for better earnings. A greater percentage of women (43%) than men (27%) study because of work experience.
-
In total, 81% of grammar school students and 69% of secondary vocational school students chose their field of study because of future employment; however, most secondary vocational school students (81%) chose their field of study because they enjoy it.
-
More women (83%) than men (75%) chose the field of study due to future employment.
-
In total, 67% of respondents with an average grade A would choose their field of study again. Furthermore, 63% of respondents with an average grade E would probably choose the field of study again.
-
In total, 65% of women and 70% of men state that they totally agree or rather agree with the claim that they enjoy their field of study.
Findings from the sociological part:
-
Students with grades A and C spend the least time on social networks.
-
In total, 39% of men spend one to two hours per day on social networks, while the largest group of women (39%) spends two to three hours a day on social networks.
-
In total, 83% of respondents with an average grade A spend their free time by reading books. On the other hand, 75% of respondents with an average grade E spend their free time playing computer games. The most popular activity among all respondents is watching movies and TV series.
Findings from the study section:
-
More women have an average grade B (26%) compared to only 16% of men.
-
In total, 12% of students from secondary vocational school have grade A compared to only 6% of grammar school students; 25% of grammar school students have a grade B and only 12% of respondents from a secondary vocational school have an average grade B.
-
Grammar school students achieve slightly better results in mathematical subjects, e.g., Mathematical Analysis 1 and Probability and Statistics, than vocational school students. Students from secondary vocational school, on the other hand, are slightly better in other subjects such as Algorithms and Data Structures.
-
Students with an average grade A, unlike other students, more often use teacher materials, scripts and textbooks.
-
Students with grade A had 100% attendance in all seminars. The attendance of other students was also relatively high.
-
As for lectures, most students with an A grade (67%) attend all lectures. Other students participate more in lectures of important subjects.
Findings from the online study section:
-
During distance learning, students with an average grade A still have approximately the same grades.
-
During distance learning, for 55% of men, the grades remained roughly the same. Up to 13% of women and only 2% of men have significantly better grades.

3.3. Confirmatory Data Analysis

Prior to the confirmatory and correlation analyses, we preprocessed all data. The answers to the questions on gender and dormitory accommodation were replaced by the values 0 and 1. For questions whose answers expressed a degree of agreement or satisfaction, the answers were replaced by a scale of 1–5 or 1–4. If a student did not provide a number, the answer was replaced by the mean of all values for that question. If a student wrote or selected a range of values from the options, the arithmetic mean of the range endpoints was used. All answers such as “More than 5” or “5 and more” were replaced by 5.
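The preprocessing rules above can be sketched in code as follows. This is an illustrative sketch only: the paper performs this step in R on its own questionnaire columns, so the column names and answer codings used here (`gender`, `sSocialNetworks`) are hypothetical.

```python
import pandas as pd

def preprocess(df):
    """Illustrative sketch of the preprocessing rules described above.
    Column names and codings are hypothetical assumptions."""
    out = df.copy()
    # Binary questions (e.g., gender) -> 0/1
    out["gender"] = out["gender"].map({"male": 0, "female": 1})

    def to_number(answer):
        # Missing answers become NaN and are mean-imputed below
        if pd.isna(answer):
            return float("nan")
        if isinstance(answer, (int, float)):
            return float(answer)
        text = str(answer).strip().lower()
        if "more" in text:            # "More than 5", "5 and more" -> 5
            return 5.0
        if "-" in text:               # a range such as "2-3" -> arithmetic mean
            lo, hi = (float(part) for part in text.split("-"))
            return (lo + hi) / 2
        return float(text)

    out["sSocialNetworks"] = out["sSocialNetworks"].map(to_number)
    # Missing numeric answers -> mean of all values for that question
    out["sSocialNetworks"] = out["sSocialNetworks"].fillna(
        out["sSocialNetworks"].mean())
    return out
```

The same `to_number` rule would be applied to every open numeric question in the questionnaire.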

Testing for Mean Differences of Two Independent Samples

For selected variables, we tested whether there was a difference in the responses of two different groups. The data of these variables were divided either by gender (women versus men) or by type of secondary school (grammar school versus secondary vocational school graduates). We first used the Shapiro–Wilk test to determine whether the data had a normal distribution. Depending on the result, we used either the parametric t-test or the nonparametric Mann–Whitney U-test. The results are reported in Table 1. The tests were performed in the R language.
We found out that there was a difference in how much time women and men spend per day on social networks. The responses of men and women as to whether they would choose their faculty again did not differ. The answers of grammar school students and secondary vocational school graduates also did not differ. We also found that between grammar school and vocational school students, and between men and women, there was no statistically significant difference between grades of individual subjects except for Informatics 2. In the subject Informatics 2, there was a statistically significant difference in grades between both men and women, as well as between graduates of grammar schools and secondary vocational schools. We did not confirm the hypothesis that grammar school students were better in mathematical subjects. There were no statistically significant differences in the other variables.
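The test-selection procedure described above (Shapiro–Wilk first, then a parametric or nonparametric two-sample test) can be illustrated with a small Python sketch. The paper's implementation is in R, so this is only an equivalent outline, not the authors' code.

```python
from scipy import stats

def compare_groups(x, y, alpha=0.05):
    """Choose between the t-test and the Mann-Whitney U-test based on
    Shapiro-Wilk normality tests, mirroring the procedure described above."""
    normal = (stats.shapiro(x).pvalue > alpha) and (stats.shapiro(y).pvalue > alpha)
    if normal:
        test_name, result = "t-test", stats.ttest_ind(x, y)
    else:
        test_name, result = "Mann-Whitney U", stats.mannwhitneyu(x, y)
    return test_name, result.pvalue
```

In practice this helper would be applied to each variable of interest, split into the two groups being compared.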

3.4. Correlation Data Analysis

In this section, we addressed the first main objective of our research: identifying the factors that affect the GPA variable. We defined 40 hypotheses, which were subsequently tested using statistical hypothesis testing. The hypotheses were defined as follows:
H1: 
There is a significant dependency between GPA and grade of study
H2: 
There is a significant dependency between GPA and the question: “The perspective of my field of study is more important to me than whether I enjoy my field”
H3: 
There is a significant dependency between GPA and whether the student considers their field of study as fun and fulfilling
H4: 
There is a significant dependency between GPA and whether the student would choose the same field of study again
H5: 
There is a significant dependency between GPA and whether the student would choose the same faculty for their university studies again
H6: 
There is a significant dependency between GPA and whether the student studies continuously during the whole semester
H7: 
There is a significant dependency between GPA and how often they attend seminars
H8: 
There is a significant dependency between GPA and how often they attend lectures
H9: 
There is a significant dependency between GPA and whether the student considers online studying to be more comfortable
H10: 
There is a significant dependency between GPA and whether the student’s grades improved after switching to distance learning
H11: 
There is a significant dependency between GPA and grade from the subject Algorithms and Data Structures 1
H12: 
There is a significant dependency between GPA and grade from the subject Database Systems
H13: 
There is a significant dependency between GPA and grade from the subject Informatics 2
H14: 
There is a significant dependency between GPA and grade from the subject Mathematical Analysis 1
H15: 
There is a significant dependency between GPA and grade from the subject Discrete Optimization
H16: 
There is a significant dependency between GPA and grade from the subject Probability and Statistics
H17: 
There is a significant dependency between GPA and how much videos from Probability and Statistics on YouTube helped the student understand the topic
H18: 
There is a significant dependency between GPA and whether the student is male or female
H19: 
There is a significant dependency between GPA and whether they live at dormitory
H20: 
There is a significant dependency between GPA and GPA at high school
H21: 
There is a significant dependency between GPA and type of high school
H22: 
There is a significant dependency between GPA and whether they study with their dormitory roommates
H23: 
There is a significant dependency between GPA and preparatory courses they attended before studying at the faculty
H24: 
There is a significant dependency between GPA and to what extent student uses consultations with the teacher
H25: 
There is a significant dependency between GPA and a type of seminar work that suits the student
H26: 
There is a significant dependency between GPA and part of the day with their best focus
H27: 
There is a significant dependency between GPA and how they rate the course in math practice
H28: 
There is a significant dependency between GPA and how they rate the course in programming practice
H29: 
There is a significant dependency between GPA and number of watched videos where the teacher from the faculty explained topics from Algebra and from Probability and Statistics
H30: 
There is a significant dependency between GPA and age of a respondent
H31: 
There is a significant dependency between GPA and number of siblings of a respondent
H32: 
There is a significant dependency between GPA and number of household members
H33: 
There is a significant dependency between GPA and GPA in the first year of your university studies
H34: 
There is a significant dependency between GPA and GPA in the second year of your university studies
H35: 
There is a significant dependency between GPA and GPA at high school
H36: 
There is a significant dependency between GPA and number of people the student usually studies with
H37: 
There is a significant dependency between GPA and number of hours the student spends on social networks per day
H38: 
There is a significant dependency between GPA and number of cups of coffee the student drinks per day
H39: 
There is a significant dependency between GPA and number of hours of sport during the week
H40: 
There is a significant dependency between GPA and number of hours the student sleeps per day.
To determine which variables affect the dependent variable (student’s GPA), we performed the correlation analysis. We later used this information to construct linear regression models. To determine the correlation between the dependent variable and all independent variables, the data were divided into groups:
  • The first group was categorical ordinal variables, which were variables expressing the degree of agreement and satisfaction or variables that depend on order, such as students’ grades in individual subjects.
  • The second group consisted of variables that were categorical nominal. For these variables, no answer was more valuable than the other, e.g., gender or type of high school.
  • The third group was a group of numerical variables. All variables of numeric type were located here.
For each type of variable, a different statistical test was used to determine the level of dependency. When testing the dependency between the dependent variable and the independent variable, the Shapiro–Wilk test was first used to determine if the dependent variable had a normal distribution. Depending on whether the dependent variable had a normal distribution, we used a parametric test or a nonparametric test. A function was created in the R language that returned the p-value of this test. We accepted the null hypothesis of a normal data distribution if the p-value was higher than the level of significance (in our case, α = 0.05). In all cases, nonparametric tests were used to determine the dependency.

3.4.1. Correlation between Numerical Variable and Ordinal Variable

Pearson’s correlation test was used to determine the correlation between the numerical dependent variable and the categorical ordinal variables. In this test, the null hypothesis stated that there was no dependence between the variables and the alternative hypothesis stated that there was a significant dependence.
A function was created in the R language. The input parameter was a dataset, in which the first variable was dependent, and all the other variables were ordinal independent variables. The function calculated the Pearson’s test between the dependent variable and each independent ordinal variable. The function returned a table with the p-values of the test and the column name of the variable. If the p-value was smaller than the significance level (α = 0.05), we rejected the null hypothesis, and hence, the independent variable statistically significantly affected the students’ learning outcomes. In Table 2 we see that the dependency proved to be statistically significant for the following variables: pFriAgain, sContinuousLearning, sLectures, fAaDS1, fDS, fInf2, fMatA1, fDO, fPS.
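A Python analogue of the described R function might look as follows. The nonstandard choice of Pearson's test on integer-coded ordinal variables follows the paper, and the column names in the usage example are placeholders.

```python
import pandas as pd
from scipy import stats

def pearson_pvalues(df, dependent):
    """Return a table of Pearson-test p-values between the dependent
    variable (first column of interest) and every other column, mirroring
    the R function described above."""
    rows = []
    for col in df.columns:
        if col == dependent:
            continue
        r, p = stats.pearsonr(df[dependent], df[col])
        rows.append({"variable": col, "p_value": p})
    return pd.DataFrame(rows)
```

Variables whose `p_value` falls below the significance level (0.05) would then be flagged as statistically significant, as in Table 2.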

3.4.2. Correlation between Numerical and Categorical Nominal Variable

To determine the correlation between the numerical dependent variable and the categorical nominal variables, we used the ANOVA test. In the R language, we used the aov() function to perform this test. The null hypothesis stated there was no dependency between the variables. If the p-value was less than α = 0.05, there was a dependence between the variables. In Table 3 we see the results: a statistically significant dependence was found between the GPA and the variables fConsultations and fVideosAlgPS.
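A minimal Python equivalent of this aov()-based check might look like the following sketch: it groups the GPA values by the levels of a nominal variable and runs a one-way ANOVA.

```python
from scipy import stats

def anova_pvalue(gpa, groups):
    """One-way ANOVA of GPA across the levels of a nominal variable,
    analogous to R's aov(); returns the p-value of the F-test."""
    samples = {}
    for value, label in zip(gpa, groups):
        samples.setdefault(label, []).append(value)
    return stats.f_oneway(*samples.values()).pvalue
```

A p-value below 0.05 would indicate a dependence between the GPA and the nominal variable, as in Table 3.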

3.4.3. Correlation between Numerical and Numerical Variable

Pearson’s correlation test was most appropriate to determine the correlation between the numerical dependent variable and the numerical independent variables. As seen in Table 4, we rejected the null hypothesis and accepted the alternative hypothesis (there is a significant correlation) for the following variables: sAverageFirst, sAverageSecond, sAverageHighSchool, sGroupSize, sSocialNetworks, and sSleep.

3.5. Supervised Machine Learning Models

The second main objective of our research was to predict the GPA using other variables and factors. To meet this objective, supervised machine learning models were created. We wanted the models to achieve the highest possible accuracy for predicting the GPA. We therefore implemented several regression prediction models based on linear regression, decision tree and random forest. The partial task was to identify variables that statistically significantly affect the GPA through a causal relationship in our models.

3.5.1. Validation

To verify the accuracy of the models we used the ex-post testing as well as cross-validation methodology. The ex-post validation methodology was implemented as follows: we randomly divided the data into a training and test set. The training set contained the data on which the model was trained. On the test set the model predicted the value of a dependent variable without knowing the data.
To ensure greater objectivity, we also performed cross-validation, evaluating our models based on the average results from five different validation sets. In this way, we tried to avoid the subjective evaluation that would be likely if we used only a single validation set. The cross-validation was implemented as follows: the training set was divided into five folds, on which cross-validation was performed. We then recorded the results from every cross-validation iteration and calculated the final error value as the average of all five errors (James et al., 2013). Besides higher objectivity, another purpose of cross-validation was to find optimal values of the hyperparameters of the decision tree and random forest models; for the linear regression models, the purpose was to determine a more objective, general accuracy. For cross-validation, 60 observations were used (selected randomly, with a seed of 1000 in R), i.e., there were 12 observations in each fold. The remaining 19 observations were used for the final ex-post testing on the test set. Finally, a model evaluation was performed.
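The split described above (60 randomly selected training observations in five folds of 12, with 19 held out for final testing) can be sketched as follows. The paper uses R with a seed of 1000, so the Python seed here is only illustrative and will not reproduce the same split.

```python
import random

def five_fold_indices(n_train=60, n_total=79, seed=1000):
    """Sketch of the data split described above: 60 randomly selected
    observations form five folds of 12; the remaining 19 observations
    form the final test set. The seed is illustrative only."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    train, test = indices[:n_train], indices[n_train:]
    folds = [train[i::5] for i in range(5)]  # five folds of 12 each
    return folds, test
```

Each cross-validation iteration then trains on four folds and validates on the fifth, and the five validation errors are averaged.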
As for data manipulation, after dividing the data, we created two datasets. The first dataset contained all variables except for the variables that contained multiple responses. The second dataset included variables that were found in the previous chapter to affect students’ learning outcomes.

3.5.2. Model Evaluation

Although very often regression models (especially linear regression) are evaluated indirectly, by defining the model only by statistically significant variables, we decided to implement direct evaluation of our models. For this purpose, we used error metrics. In other words, the statistical significance of the variables in the created prediction models was not important to us, but the main criterion for evaluating the models was the predictive accuracy. The evaluation of the models therefore consisted of a comparison between the actual value and the value predicted by our models. We used the accuracy metrics based on residual characteristics. We calculated the residuals as follows
e_t = y_t - \hat{y}_t
This residual was defined as the difference between the actual value y_t and the predicted value \hat{y}_t. Using the residuals, we then calculated residual-based accuracy metrics: the mean squared error (MSE) and the mean absolute percentage error (MAPE). We calculated the MSE as follows:
MSE = \frac{1}{n} SSE = \frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^2 = \frac{1}{n} \sum_{t=1}^{n} e_t^2
where \hat{y}_t was the predicted value, y_t was the actual value and n was the number of observations in the defined set of observations. We calculated the MAPE as follows:
MAPE = \frac{1}{n} \sum_{t=1}^{n} \frac{|e_t|}{y_t} \times 100
We always calculated the MSE and MAPE error on both the training set and the test set. During cross-validation, MAPE was a criterion function on the validation set, according to which we optimized our models. We considered the best model to be the model in which the MAPE metric was the lowest on the validation set.
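The two metrics can be written directly from the formulas above. This is a plain sketch, assuming no actual value y_t is zero (MAPE is undefined otherwise).

```python
def mse(actual, predicted):
    """Mean squared error: average of squared residuals e_t = y_t - yhat_t."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return sum(e * e for e in residuals) / len(residuals)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent; assumes no actual is 0."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual) * 100
```

For example, with actual values (2, 4) and predictions (1, 5), the MSE is 1.0 and the MAPE is 37.5%.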

3.5.3. Created Models

We implemented several multi-aspect supervised machine learning models. The aim of these models was to use the widest possible range of variables for the most accurate prediction of the student’s GPA. Since our goal was to make the most accurate predictions of the student’s GPA, we decided to evaluate the models not from a statistical but from a predictive point of view. As stated before, we therefore implemented the so-called direct validation of our models through error metrics.
Hyperparameters and independent model variables were optimized for the best results, with the objective of minimizing the error. For the linear regression models, the optimization consisted of selecting suitable variables (feature selection). For the decision tree and random forest models, a loop was used that gradually changed the value of the minbucket or nodesize hyperparameter during the cross-validation phase. After cross-validation, the final validation was performed with the hyperparameter values for which the lowest average percentage error was recorded.

Linear Regression Models

We created several linear regression models, using different approaches to feature selection. In all models, the dependent variable was the student’s GPA, which we modelled using the other, independent variables. All models were created in the R programming language. As stated before, for higher credibility of our results, the linear regression models were also validated through cross-validation. Finally, each model was trained on the training set and validated on the final validation set.
First, the LIN1 linear regression model was created. This model predicted the dependent variable using the independent variables for which a correlation with the dependent variable was identified in the correlation analysis. The LIN2 model was created in almost the same way; however, multicollinearity between the independent variables was removed. The Variance Inflation Factor (VIF) was used to identify these variables, and all variables with a VIF value greater than 5 were removed; in this way, the variables fVideosAlgPS and sAverageSecond were eliminated. We also used standard feature selection procedures: forward regression and backward regression. In forward regression, variables were added to the model gradually, and the model that was best in terms of the AIC criterion was selected; in backward regression, variables were gradually removed from the model. In Table 5 we can see the results of our linear regression models.
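The VIF-based filtering used for LIN2 can be sketched as follows. This is not the paper's R code but an illustrative implementation that iteratively drops the variable with the highest VIF until all values fall below the threshold of 5, where VIF_j = 1 / (1 - R^2_j) and R^2_j comes from regressing column j on the remaining columns.

```python
import numpy as np

def vif_filter(X, names, threshold=5.0):
    """Iteratively drop the column with the highest variance inflation
    factor until all VIFs are below the threshold; returns kept names."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        vifs = []
        for j in range(X.shape[1]):
            y = X[:, j]
            others = np.delete(X, j, axis=1)
            # Regress column j on the other columns (with an intercept)
            A = np.column_stack([np.ones(len(y)), others])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ coef
            ss_res = resid @ resid
            ss_tot = ((y - y.mean()) ** 2).sum()
            r2 = 1 - ss_res / ss_tot
            vifs.append(1.0 / (1.0 - r2) if r2 < 1 else np.inf)
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return names
```

Applied to the paper's data, this procedure is what removed fVideosAlgPS and sAverageSecond before fitting LIN2.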
As seen in Table 5, all models achieved good results. From the results of cross-validation, on average, model LIN4 (implemented through backward regression) was the most accurate on validation sets.

Decision Tree Models

Regression decision tree (RDT) was also used to predict the dependent variable. Two regression decision tree models were created. The first regression decision tree (RDT1) used all variables as inputs. The second model (RDT2) used as inputs only variables in which we found that they statistically significantly affect the student’s GPA.
In general, in the decision tree model, the accuracy of the prediction is affected by the minbucket parameter, which limits the minimum number of observations in a node. To obtain the best results, we tested the value of minbucket experimentally using cross-validation. We performed 15 experiments with minbucket, testing minbucket values from 1 to 15. Using results from five-fold-cross-validation, we selected the best value for the minbucket parameter (i.e., the value where the average MAPE error on the five cross-validation sets was minimal). Table 6 shows the results of the cross-validation of the RDT1 model—the best results were achieved at minbucket = 3.
In a similar way, the minbucket optimization was performed with the model RDT2. The regressors for RDT2 were only variables correlated with the dependent variable. Based on the results of another five-fold-cross-validation procedure, the optimal value of minbucket was determined to be minbucket = 2. Table 7 shows the results from five-fold-cross-validation of the RDT2 model.
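The hyperparameter search described above is, in essence, a grid search over minbucket values with five-fold cross-validation. A model-agnostic sketch of that loop follows; the paper uses rpart in R, so `fit_predict` and `error_fn` here are hypothetical placeholders for fitting a tree with a given minbucket and computing the MAPE, and each fold is represented as a (training data, validation targets) pair.

```python
def tune_hyperparameter(folds, values, fit_predict, error_fn):
    """Sketch of the tuning loop described above: for each candidate
    hyperparameter value, average the validation error over the folds
    and return the value with the lowest mean error (MAPE in the paper)."""
    best_value, best_error = None, float("inf")
    for value in values:
        fold_errors = []
        for train, valid in folds:
            predictions = fit_predict(value, train, valid)
            fold_errors.append(error_fn(valid, predictions))
        mean_error = sum(fold_errors) / len(fold_errors)
        if mean_error < best_error:
            best_value, best_error = value, mean_error
    return best_value, best_error
```

For RDT1 the candidate values were minbucket = 1 to 15, and the loop selected minbucket = 3; for RDT2 it selected minbucket = 2.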

Random Forest Models

Finally, we created random forest models. To implement random forest models, we used the randomForest library in R. We also defined the number of trees using the ntree parameter. In our experiments, we set the number of trees ntree = 200. In addition, we also tested the nodesize hyperparameter, which had the same function as the minbucket parameter. Based on performed experiments, we determined the most appropriate nodesize value. We constructed two random forest models. Model RF1 used as inputs all independent variables, while the model RF2 used as inputs only variables that were statistically significantly correlated with the dependent variable. Table 8 shows the results of performed five-fold-cross-validation for the RF1 model. The theoretical assumption, that the number of observations in the node does not play a significant role in the random forest, was confirmed.
In addition to the RF1 random forest model, we also created an RF2 random forest. At this time, the inputs contained only variables that were found to significantly affect the student’s GPA. We performed another five-fold-cross-validation on the data of the training set and experimentally found the best value of nodesize parameter. Table 9 shows the results of the cross-validation for RF2 model.
In cross-validation, the lowest percentage error of the RF2 model was achieved with the parameter nodesize = 2. This percentage error in the test set was slightly higher than in model RF1.
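An illustrative Python analogue of this random forest setup is sketched below. The paper uses the randomForest package in R; here scikit-learn's `min_samples_leaf` plays the role of the nodesize hyperparameter, and the random seed is an assumption, not the authors' setting.

```python
from sklearn.ensemble import RandomForestRegressor

def fit_random_forest(X_train, y_train, nodesize=2):
    """Illustrative analogue of the R randomForest setup used for RF1/RF2:
    200 trees, with min_samples_leaf standing in for nodesize."""
    model = RandomForestRegressor(n_estimators=200,
                                  min_samples_leaf=nodesize,
                                  random_state=1000)
    model.fit(X_train, y_train)
    return model
```

As in the paper, varying `nodesize` here would change the results only marginally, reflecting the robustness of random forests to this hyperparameter.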

3.5.4. Comparative Analysis

The comparative analysis clearly summarizes the results from the previous sections. The procedure of the final comparison of models was as follows: The implemented models were optimized using cross-validation. In the case of linear regression models, the optimization concerned the appropriate selection of independent variables. For the decision tree and random forest models, the optimization concerned the selection of the minbucket and nodesize parameters. Table 10 summarizes the results of the cross-validation. In decision trees and random forest models, only the models with the optimized value of hyperparameter are stated.
The models RF1 and RF2 and the backward linear regression model LIN4 achieved the best generalization accuracy in cross-validation. In other words, these models achieved, on average, the lowest error rate when tested on multiple validation sets composed of held-out data. The worst performers were the VIF-based models, which used fewer variables.
Since cross-validation was used to find the optimal value of the hyperparameter, especially in the random forest and decision tree models, the final training of the model (60 observations) and the final testing on the test set, which included 19 observations, was performed. This was to show the true predictive power of the implemented models on our specific test set. Table 11 shows the results on the final test set.
The RDT2 tree model was more accurate than the RDT1 model; its percentage error in the final testing was 9%. Encouragingly, none of the models exceeded a 10% error on the final test set. The linear regression model implemented through backward regression was the most accurate, and the other linear regression models were also very accurate.

4. Discussion

The assumption that the random forest model is generally a better predictor than the decision tree model was confirmed. The random forest model generates many trees that are not mutually correlated, and this contributes to results that are more reliable and ultimately more accurate than simple decision trees.
As stated in the previous section, we decided to also construct models with all independent variables. The reason was that decision tree and random forest models do not presuppose a linear relationship between the independent variables and the dependent variable. An interesting finding is that the random forest and decision tree models achieved comparable results even when the input dataset contained all independent variables, not only those identified in the correlation analysis as statistically significantly correlated with the dependent variable. In other words, the regression decision tree that used all variables in the prediction was nearly as accurate as the tree that used only variables with a statistical dependence. However, the RDT1 model, whose input set included even the variables for which no significant correlation was found, delivered slightly worse MAPE results on the cross-validation sets. This could be due to the decision tree creation algorithm, in which some statistically insignificant variables (when selecting a suitable split variable) lead to a local optimum and to some extent worsen the overall predictive power of the model. For this reason, we recommend choosing only correlated variables when constructing decision tree models. A similar situation occurred when quantifying the random forest models; in this case, based on the performed experiments, it can be stated that the choice between correlated and all variables as potential inputs plays a negligible role in the accuracy of the model.

4.1. Significant Factors Influencing the GPA

Based on the correlation analysis, we identified the following significant factors influencing the student’s GPA:
  • Motivation to study again on the faculty (variable: pFriAgain: Would you choose the faculty for your university studies again?)
  • Regular studying during the whole semester (variable: sContinuousLearning, answer to the questions: Do you study continuously during the whole semester?)
  • Lecture attendance (variable: sLectures, answer to the question: How often do you attend lectures?)
  • Grade from the subject Algorithms and Data Structures 1 (variable: fAaDS1, answer to the question: What was your grade in the subject Algorithms and Data Structures 1?)
  • Grade from the subject Database Systems (variable: fDS, answer to the question: What was your grade in the subject Database Systems?)
  • Grade from the subject Informatics 2 (variable: fInf2, answer to the question: What was your grade in the subject Informatics 2?)
  • Grade from the subject Mathematical Analysis 1 (variable: fMatA1, answer to the question: What was your grade in the subject Mathematical Analysis 1?)
  • Grade from the subject Discrete Optimization (variable: fDO, answer to the question: What was your grade in the subject Discrete Optimization?)
  • Grade from the subject Probability and Statistics (variable: fPS, answer to the question: What was your grade in the subject Probability and Statistics?)
  • Use of consultation during studies (variable: fConsultations, answer to the question: To what extent do you use consultations with the teacher?)
  • Watching specific teaching videos on YouTube (variable: fVideosAlgPS, answer to the question: Have you watched videos on YouTube, where the teacher from the faculty explains topics from Algebra and from Probability and Statistics? How many have you seen?)
  • GPA in the first year of college (variable: sAverageFirst, answer to the question: What is your GPA in the first year of college?)
  • GPA in the second year of college (variable: sAverageSecond, answer to the question: What is your GPA in the second year of college?)
  • GPA at high school (variable: sAverageHighSchool, answer to the question: What is your GPA at high school?)
  • Group of students a student usually studies with (variable: sGroupSize, answer to the question: How big is the group of students you usually study with?)
  • Number of hours in a day a student uses social networks (variable: sSocialNetworks, answer to the question: How many hours a day do you spend on social media? (Facebook, Instagram, YouTube))
  • Number of hours a student sleeps per day (variable: sSleep, answer to the question: How many hours a day do you sleep?)
In our models, the variables most often identified as determinants of the overall GPA were sAverageFirst and sAverageSecond, i.e., we predict the student’s GPA most accurately when variables indicating the GPA achieved in the completed years of study are available. The GPA that the student achieved in high school also proved to be significant, as did the grades from specific subjects that the student completed at the faculty. In addition, in the implementation section, we also implemented two models (RDT1 and RF1) that included all available variables, i.e., even those for which no statistically significant correlation with the GPA variable was found. An interesting finding is that, after quantifying the RDT1 decision tree with all independent variables, the model identified as significant two variables that were not statistically correlated with the GPA: the number of household members (dHouseholdMembers) and the number of cups of coffee a student drinks in a day (sCoffee).
It is no surprise that not all variables identified in section correlation analysis proved to be statistically significant in the quantified models. This could be due to the fact that some of these independent variables were correlated. In this case, it was not necessary to include both variables in the model to explain the dependent variable, but for a sufficient explanation of the dependent variable, it was sufficient to include only one of the two mutually correlated variables.

4.2. Research Limitations

In our research, one can find some limitations. The first is the subjectivity of our results. Even though random forest models are generally considered more accurate than linear regression, in our case the linear regression models produced more accurate predictions on the final test set than the random forest or decision tree models. Although this result is unusual, we believe there are two possible reasons for it. The first may be coincidence: the linear regression models were simply lucky and captured the data of this specific test set better. Since there was only one test set, there is some probability that linear regression could achieve higher accuracy on this particular set. The second reason may be that the final test set contained data that could be modelled well by a linear function; since linear regression is a linear model, it could model such data better than decision tree or random forest models. To reduce the subjectivity of our results, in addition to standard ex-post testing, we also performed five-fold cross-validation, in which we divided the training data into five folds, created five datasets for training and five for validation, and evaluated the models by their average validation error. Under this more objective procedure, the ranking of the models differed from the ranking in the ex-post test: while the random forest and decision tree models achieved the best results in cross-validation, this was not the case in the final test. This means that the decision tree and random forest models generalize better, but this may not hold for a specific test set randomly selected from all data; in this one case, the linear regression models achieved better results.
Nevertheless, based on this higher objectivity, we recommend choosing a random forest as the default model for predicting the GPA on different datasets, as its generalization ability was better than that of the linear regression and decision tree models.
The second limitation is the method selection. We are aware that many more machine learning methods exist for regression problems; we used only some of them, selecting basic methods (linear regression, decision trees) as well as a more advanced one (random forest). It is possible that other supervised machine learning methods, such as deep neural networks, support vector machines or conditional random field models, would achieve higher accuracy. However, we believe that even with the selected methods the accuracy was relatively high and the models were able to model the GPA.
The third limitation is the width of our data (the number of variables). Even though we collected quite a number of variables from our respondents, we believe there are other factors that influence the GPA significantly. It is not possible to collect all conceivable data from respondents; however, we believe that additional information from students could improve our models and make the GPA prediction more accurate. For example, new variables could include grades from specific school-leaving subjects, the success rate in the written part of the school-leaving examination, or grades from further subjects the student has completed at the faculty.
The fourth limitation is the size of our dataset (the number of observations). Our sample consisted of the answers of 79 respondents (university students). The results would obviously not be identical with a different dataset, although it is questionable whether a larger dataset would change our conclusions. We tried to generalize our findings to the whole population by using statistical hypothesis testing and by evaluating the models on multiple validation sets in the cross-validation procedure.
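The Mann–Whitney U-test used for the group comparisons in the study can be sketched in a few lines of Python. This sketch uses the large-sample normal approximation (without tie correction) rather than the exact test, and the two group samples below are made up:

```python
import math

def mann_whitney_u(a, b):
    """Mann-Whitney U statistic: over all pairs, count how often a value
    from sample a exceeds a value from sample b (ties count as 0.5)."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

def mw_p_value(a, b):
    """Two-sided p-value from the normal approximation (no tie correction);
    adequate for samples of roughly 20+ observations."""
    n1, n2 = len(a), len(b)
    u = mann_whitney_u(a, b)
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))   # 2 * (1 - Phi(|z|))

# Made-up GPAs of two groups (e.g., men vs. women)
men   = [1.4, 2.1, 2.6, 1.8, 3.0, 2.2]
women = [1.5, 2.0, 2.4, 1.9, 2.8, 2.3]
p = mw_p_value(men, women)
```

A large p-value here would mirror the paper's finding that most group differences (e.g., by sex) were not statistically significant.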
The composition of our dataset is the fifth limitation. We sampled students of the Faculty of Management Science and Informatics, and most of our respondents studied Informatics. It is possible that respondents from other fields of study would yield different results; however, this is only a hypothesis.
Finally, we believe that extending our dataset with unstructured data could increase accuracy. For example, we could use face images of students together with a convolutional neural network, or collect text data from students and apply text analytics to improve our models.

5. Conclusions

The evaluation of students' abilities and knowledge is summarized by the GPA. According to it, we consider students to be good, bad, talented or lazy. Students' learning outcomes and achievements are influenced by many factors, such as learning patterns, talent, interpersonal relationships and motivation. If we were able to identify the factors that influence the GPA of a specific student, we could nudge students toward better results, which would benefit both the school and the student. Therefore, the first main objective of our research was to identify the factors that influence students' learning results. Moreover, if we were able to predict GPAs, school management could optimize the functioning of the university, e.g., in the enrolment process for optional subjects or in the admission process to the university. Therefore, the second main objective of our research was to predict the GPA of students.
We performed a literature review of the current state of the art. Analyzing studies from many authors, we identified the factors that influence students' results and divided them into psychological, sociological and study factors. Using these findings, we designed a questionnaire and collected data from students of the Faculty of Management Science and Informatics at the University of Zilina.
To become familiar with the data, we first used basic and structured exploratory analysis. We compared the answers of different groups of respondents and tested the differences in means between the groups using the Shapiro–Wilk test and the non-parametric Mann–Whitney U-test. To meet our first research objective (to identify the factors that influence students' learning results), we performed a correlation analysis, in which we examined the statistically significant influence of the factors on the dependent variable, the student's GPA, using the Pearson test and ANOVA. Based on the correlation analysis, we identified the factors with a statistically significant dependence with the GPA. The identified factors were as follows: motivation to study at the faculty again, regular studying during the whole semester, lecture attendance, grades from the subjects Algorithms and Data Structures 1, Database Systems, Informatics 2, Mathematical Analysis 1, Discrete Optimization and Probability and Statistics, use of consultations during studies, watching specific teaching videos on YouTube, GPA in the first year of college, GPA in the second year of college, GPA at high school, the group of students a student usually studies with, the number of hours per day a student spends on social networks, the number of hours a student sleeps per day, the number of household members, and the number of cups of coffee a student drinks per day.
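The Pearson correlation underlying this analysis can be illustrated with a minimal Python sketch (the study itself was implemented in R); the coefficient below is the statistic behind the Pearson test, and the GPA values are made up:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative (made-up) data: first-year GPA vs. overall GPA
first_year = [1.3, 2.0, 2.8, 1.6, 3.2, 2.4]
overall    = [1.4, 2.1, 2.6, 1.8, 3.0, 2.5]
r = pearson_r(first_year, overall)   # strong positive correlation
```

A coefficient near +1, as here, matches the very small p-values reported for sAverageFirst and sAverageSecond in the correlation tables.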
To meet our second main objective (to predict the GPAs of students), we implemented supervised machine learning models in the R programming language, assuming that these models would be able to predict the GPA from the other, independent variables. We created 10 models using linear regression, decision trees and random forests; one random forest model and one decision tree model used all variables as inputs. Based on the MAPE metric on the test set, the model created by the backward regression procedure of linear regression provided the best results. However, the random forest model RF1 achieved the best average accuracy on the five validation sets in the cross-validation procedure, i.e., the best generalization accuracy on new data. Therefore, we recommend the use of a random forest as a starting model for modeling learning outcomes from other independent variables.
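As an illustration of how the recommended tree-based models arrive at a prediction, the following Python sketch finds the single best split of a regression tree, the building block that random forests grow and aggregate. The attendance and GPA numbers are made up, and the function is our own minimal construct, not the paper's R code:

```python
def best_split(x, y):
    """Find the threshold on one feature x that minimizes the squared error
    of predicting each side by its mean (a depth-1 regression tree)."""
    best = (None, float("inf"), None, None)
    for threshold in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        if not left or not right:
            continue  # a split must leave observations on both sides
        ml = sum(left) / len(left)
        mr = sum(right) / len(right)
        sse = sum((yi - ml) ** 2 for yi in left) + sum((yi - mr) ** 2 for yi in right)
        if sse < best[1]:
            best = (threshold, sse, ml, mr)
    return best  # (threshold, sse, mean_left, mean_right)

# Made-up example: lecture attendance (hours/week) vs. GPA (1.0 is best)
hours = [0, 1, 1, 2, 4, 5, 6, 6]
gpa   = [3.2, 3.0, 2.9, 2.8, 1.6, 1.5, 1.3, 1.2]
threshold, _, low_attend_gpa, high_attend_gpa = best_split(hours, gpa)
```

A full regression tree applies this search recursively over all features (the `minbucket` parameter in Tables 6 and 7 bounds the leaf size), and a random forest averages many such trees fitted on bootstrap samples.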

Author Contributions

Conceptualization, L.F.; methodology, L.F. and T.P.; software, T.P. and L.F.; validation, T.P. and L.F.; investigation, T.P. and L.F.; resources, T.P.; data curation, T.P.; writing—original draft preparation, T.P. and L.F.; writing—review and editing, L.F.; visualization, T.P. and L.F.; supervision, L.F.; project administration, L.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are available on Figshare—https://figshare.com/articles/dataset/data1_csv/18319514 (accessed on 1 February 2022).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Table 1. Comparison of selections of two independent samples.

Variable | Groups | p-Value | Significant (p < 0.05)
grade | sex (Men vs. Women) | 0.82868 | no
grade | high school (vocational vs. grammar) | 0.72870 | no
GPA | sex (Men vs. Women) | 0.62718 | no
GPA | high school (vocational vs. grammar) | 0.91689 | no
pEnjoyField | sex (Men vs. Women) | 0.48462 | no
sSocialNetworks | sex (Men vs. Women) | 0.04281 | yes
pFriAgain | sex (Men vs. Women) | 0.57432 | no
pFriAgain | high school (vocational vs. grammar) | 0.28716 | no
fADS1 | sex (Men vs. Women) | 0.93858 | no
fADS1 | high school (vocational vs. grammar) | 0.29945 | no
fDS | sex (Men vs. Women) | 0.52692 | no
fDS | high school (vocational vs. grammar) | 0.58013 | no
fInf2 | sex (Men vs. Women) | 0.01143 | yes
fInf2 | high school (vocational vs. grammar) | 0.02241 | yes
fMatA1 | sex (Men vs. Women) | 0.50800 | no
fMatA1 | high school (vocational vs. grammar) | 0.58014 | no
fPS | sex (Men vs. Women) | 0.47345 | no
fPS | high school (vocational vs. grammar) | 0.22013 | no
fDO | sex (Men vs. Women) | 0.16697 | no
fDO | high school (vocational vs. grammar) | 0.23402 | no

Acronyms as follows: grade—average grade at college; pEnjoyField—Do you enjoy your field of study?; sSocialNetworks—How many hours per day do you spend on social networks?; pFriAgain—Would you choose the same faculty again if you could change your decision?; fADS1—grade from the subject Algorithms and Data Structures 1; fDS—grade from the subject Database Systems; fInf2—grade from the subject Informatics 2; fMatA1—grade from the subject Mathematical Analysis 1; fPS—grade from the subject Probability and Statistics; fDO—grade from the subject Discrete Optimization.
Table 2. Correlation between dependent variable (GPA) and categorical ordinal variables.

Variable | p-Value
dGrade | 8.574630 × 10^−1
pPerspectiveEmployment | 1.351942 × 10^−1
pFieldEnjoy | 2.786545 × 10^−1
pFieldAgain | 7.809487 × 10^−2
pFriAgain | 8.856353 × 10^−3
sContinuousLearning | 1.788001 × 10^−2
sSeminars | 2.056409 × 10^−1
sLectures | 1.801961 × 10^−2
sOnlineMoreComf | 4.560202 × 10^−1
sOnlineGrades | 9.486936 × 10^−2
fADS1 | 2.860071 × 10^−6
fDS | 1.113868 × 10^−4
fInf2 | 5.971734 × 10^−4
fMatA1 | 1.627070 × 10^−2
fDO | 9.961179 × 10^−6
fPS | 1.620201 × 10^−4
fVideosHelped | 8.609253 × 10^−1

Acronyms as follows: dGrade—grade of study; pPerspectiveEmployment—The perspective of my field of study is more important to me than whether I enjoy my field; pFieldEnjoy—My field of study is fun and fulfilling; pFieldAgain—Would you choose this field of study again?; pFriAgain—Would you choose the faculty for your university studies again?; sContinuousLearning—Do you study continuously during the whole semester?; sSeminars—How often do you attend seminars?; sLectures—How often do you attend lectures?; sOnlineMoreComf—Do you consider online studying to be more comfortable?; sOnlineGrades—Have your grades improved after switching to distance learning?; fADS1—grade from the subject Algorithms and Data Structures 1; fDS—grade from the subject Database Systems; fInf2—grade from the subject Informatics 2; fMatA1—grade from the subject Mathematical Analysis 1; fDO—grade from the subject Discrete Optimization; fPS—grade from the subject Probability and Statistics; fVideosHelped—How much did videos from Probability and Statistics on YouTube help you understand the topic?
Table 3. Correlation between GPA and nominal variables.

Variable | p-Value
dSex | 0.6445772990
dDormitory | 0.3276207182
sHighSchool | 0.3817681990
sHighSchoolType | 0.7749678426
sRoommateLearning | 0.4693267659
fPreparatoryCourses | 0.9171159973
fConsultations | 0.0001257365
fTypeWork | 0.3472054095
fBestFocus | 0.4180869036
fPracticumMath | 0.3231354458
fPracticumProgr | 0.2651187880
fVideosAlgPS | 0.0082646657

Acronyms as follows: dSex—What is your sex?; dDormitory—Do you live at a dormitory or not?; sHighSchool—GPA at high school; sHighSchoolType—type of high school; sRoommateLearning—If you live at a dormitory, do you study with your roommates?; fPreparatoryCourses—What preparatory courses did you attend before studying at the faculty?; fConsultations—To what extent do you use consultations with the teacher?; fTypeWork—What type of seminar work suits you best?; fBestFocus—In which part of the day do you focus best?; fPracticumMath—If you have completed a course in math practice, how do you rate this course?; fPracticumProgr—If you have completed a course in programming practice, how do you rate this course?; fVideosAlgPS—Have you watched videos on YouTube, where the teacher from the faculty explains topics from Algebra and from Probability and Statistics? How many have you seen?
Table 4. Correlation between GPA and numerical variables.

Variable | p-Value
dAge | 1.489211 × 10^−1
dSiblings | 2.610907 × 10^−1
dHouseholdMembers | 3.462971 × 10^−1
sAverageFirst | 2.368003 × 10^−19
sAverageSecond | 3.293248 × 10^−23
sAverageHighSchool | 6.416135 × 10^−3
sGroupSize | 1.400622 × 10^−2
sSocialNetworks | 5.113872 × 10^−3
sCoffee | 8.288364 × 10^−1
sSport | 1.855124 × 10^−1
sSleep | 3.030258 × 10^−2

Acronyms as follows: dAge—age of a respondent; dSiblings—number of siblings of a respondent; dHouseholdMembers—number of household members; sAverageFirst—GPA in the first year of your university studies; sAverageSecond—GPA in the second year of your university studies; sAverageHighSchool—GPA at high school; sGroupSize—number of people you usually study with; sSocialNetworks—number of hours you spend on social networks per day; sCoffee—number of cups of coffee you drink per day; sSport—number of hours of sport during the week; sSleep—number of hours you sleep per day.
Table 5. Error metrics of linear regression models, average results on 5-fold cross-validation.

Model | Avg MSE (train) | Avg MAPE [%] (train) | Avg MSE (test) | Avg MAPE [%] (test)
LIN1 | 0.02209241 | 6.14795342 | 0.11141661 | 14.87208277
LIN2 | 0.03921313 | 8.20007161 | 0.12469694 | 15.53585540
LIN3 | 0.02209241 | 6.14795342 | 0.11141661 | 14.87208277
LIN3v | 0.03921313 | 8.20007161 | 0.12469694 | 15.53585540
LIN4 | 0.02921812 | 7.09180468 | 0.06782395 | 11.91675271
LIN4v | 0.04322373 | 8.65998242 | 0.13475872 | 16.23674155

LIN2: regressors filtered after VIF; LIN3: forward regression; LIN3v: forward regression with regressors filtered after VIF; LIN4: backward regression; LIN4v: backward regression with regressors filtered after VIF.
Table 6. RDT1 model error metrics, 5-fold cross-validation.

Minbucket | Avg MSE (train) | Avg MAPE [%] (train) | Avg MSE (test) | Avg MAPE [%] (test)
1 | 0.01101541 | 4.357704 | 0.1626867 | 17.12107
2 | 0.01465093 | 5.105888 | 0.1289060 | 15.28378
3 | 0.02221390 | 5.937367 | 0.1136131 | 14.27289
4 | 0.03621173 | 7.959550 | 0.1384408 | 15.65354
5 | 0.04756119 | 8.912896 | 0.1359141 | 15.93231
6 | 0.05120883 | 9.295309 | 0.1336549 | 15.67859
7 | 0.06278420 | 10.166809 | 0.1579339 | 16.47122
8 | 0.07063393 | 11.119335 | 0.1459742 | 15.58582
9 | 0.09828617 | 13.099079 | 0.1573543 | 15.66941
10 | 0.10388070 | 13.654471 | 0.1697236 | 16.36678
11 | 0.10388070 | 13.654471 | 0.1697236 | 16.36678
12 | 0.11220728 | 14.186691 | 0.1901627 | 17.63549
13 | 0.11220728 | 14.186691 | 0.1901627 | 17.63549
14 | 0.11233876 | 13.940644 | 0.1693422 | 16.58075
15 | 0.11233876 | 13.940644 | 0.1693422 | 16.58075
Table 7. RDT2 model error metrics, 5-fold cross-validation.

Minbucket | Avg MSE (train) | Avg MAPE [%] (train) | Avg MSE (test) | Avg MAPE [%] (test)
1 | 0.01299025 | 4.895281 | 0.1109121 | 14.60411
2 | 0.01670686 | 5.748583 | 0.1001298 | 13.52104
3 | 0.02569819 | 6.632120 | 0.1045627 | 13.68382
4 | 0.03665586 | 7.921788 | 0.1398148 | 15.73569
5 | 0.04813336 | 9.112613 | 0.1380231 | 16.14253
6 | 0.05125344 | 9.369087 | 0.1400916 | 16.15284
7 | 0.06278420 | 10.166809 | 0.1579339 | 16.47122
8 | 0.07063393 | 11.119335 | 0.1459742 | 15.58582
9 | 0.09828617 | 13.099079 | 0.1573543 | 15.66941
10 | 0.10388070 | 13.654471 | 0.1697236 | 16.36678
11 | 0.10388070 | 13.654471 | 0.1697236 | 16.36678
12 | 0.11220728 | 14.186691 | 0.1901627 | 17.63549
13 | 0.11220728 | 14.186691 | 0.1901627 | 17.63549
14 | 0.11233876 | 13.940644 | 0.1693422 | 16.58075
15 | 0.11233876 | 13.940644 | 0.1693422 | 16.58075
Table 8. Random forest RF1—accuracy error metrics, 5-fold cross-validation.

Nodesize | Avg MSE (train) | Avg MAPE [%] (train) | Avg MSE (test) | Avg MAPE [%] (test)
1 | 0.08695614 | 11.50494 | 0.08595961 | 11.49746
2 | 0.08749527 | 11.51510 | 0.08215867 | 11.51712
3 | 0.09158489 | 11.60859 | 0.08529704 | 11.53510
4 | 0.08874726 | 11.55299 | 0.08671523 | 11.42037
5 | 0.08875384 | 11.42312 | 0.08212290 | 11.12689
6 | 0.08836823 | 11.46728 | 0.08269044 | 11.40530
7 | 0.08814417 | 11.55690 | 0.08692935 | 11.66649
8 | 0.08624920 | 11.17409 | 0.08113720 | 11.24279
9 | 0.08702532 | 11.33934 | 0.08559722 | 11.44480
10 | 0.08766777 | 11.62171 | 0.08683290 | 11.64031
11 | 0.08797640 | 11.50625 | 0.09115274 | 11.82005
12 | 0.09239382 | 11.78369 | 0.08494978 | 11.38928
13 | 0.09173662 | 11.71659 | 0.08616982 | 11.32459
14 | 0.09357773 | 11.86118 | 0.09020726 | 12.14479
15 | 0.09108029 | 11.75776 | 0.08853183 | 11.65980
Table 9. RF2 random forest—accuracy error metrics, 5-fold cross-validation.

Nodesize | MSE (train) | MAPE [%] (train) | MSE (test) | MAPE [%] (test)
1 | 0.07778958 | 11.03145 | 0.07763078 | 11.62394
2 | 0.07671013 | 11.03411 | 0.08017573 | 11.57939
3 | 0.07418463 | 10.80416 | 0.08000238 | 11.93554
4 | 0.07515196 | 10.96224 | 0.08058369 | 11.84307
5 | 0.07596767 | 10.86967 | 0.07839428 | 11.69903
6 | 0.07547489 | 11.28770 | 0.08239564 | 11.91910
7 | 0.07736840 | 10.96301 | 0.07882811 | 11.65105
8 | 0.08007958 | 11.17961 | 0.07992665 | 11.88414
9 | 0.07885461 | 11.18626 | 0.07884320 | 11.77858
10 | 0.08145756 | 11.22768 | 0.08252397 | 12.04561
11 | 0.08104617 | 11.25166 | 0.08065937 | 11.87481
12 | 0.08211288 | 11.36819 | 0.08366853 | 12.01931
13 | 0.08280258 | 11.31227 | 0.08165990 | 11.84619
14 | 0.08603487 | 11.54173 | 0.08693637 | 12.34537
15 | 0.08243315 | 11.42266 | 0.08893866 | 12.50105
Table 10. Resulting error characteristics of our regression models (5-fold cross validation).

Model | MSE (train) | MAPE [%] (train) | MSE (test) | MAPE [%] (test)
LIN1 | 0.02209241 | 6.14795342 | 0.11141661 | 14.87208277
LIN2 | 0.03921313 | 8.20007161 | 0.12469694 | 15.53585540
LIN3 | 0.02209241 | 6.14795342 | 0.11141661 | 14.87208277
LIN3v | 0.03921313 | 8.20007161 | 0.12469694 | 15.53585540
LIN4 | 0.02921812 | 7.09180468 | 0.06782395 | 11.91675271
LIN4v | 0.04322373 | 8.65998242 | 0.13475872 | 16.23674155
RDT1 | 0.02221390 | 5.93736708 | 0.11361307 | 14.27289437
RDT2 | 0.01670686 | 5.748583 | 0.1001298 | 13.52104036
RF1 | 0.08875384 | 11.42311961 | 0.08212290 | 11.12689471
RF2 | 0.07671013 | 11.03411 | 0.08017573 | 11.57939

LIN2: regressors filtered after VIF; LIN3: forward regression; LIN3v: forward regression with regressors filtered after VIF; LIN4: backward regression; LIN4v: backward regression with regressors filtered after VIF; RDT1: regression decision tree with all data as inputs; RDT2: regression decision tree with correlated inputs; RF1: random forest with all data as inputs; RF2: random forest with only correlated inputs.
Table 11. Resulting error characteristics of models (train and test set).

Rank | Model | MSE (train) | MAPE [%] (train) | MSE (test) | MAPE [%] (test)
1 | LIN4 | 0.03294952 | 7.55477630 | 0.02099940 | 4.66910893
2 | LIN3 | 0.02755033 | 6.85268699 | 0.01992249 | 5.65941975
3 | LIN1 | 0.02755033 | 6.85268699 | 0.01992249 | 5.65941975
4 | LIN2 | 0.04532379 | 8.96011573 | 0.04690230 | 7.74740457
5 | LIN3v | 0.04532379 | 8.96011573 | 0.04690230 | 7.74740457
6 | LIN4v | 0.05132918 | 9.36851912 | 0.05857849 | 7.99286348
7 | RDT2 (minbucket = 2) | 0.02103747 | 6.39241377 | 0.09312832 | 8.99162523
8 | RDT1 (minbucket = 3) | 0.03096613 | 6.85644788 | 0.09081706 | 9.34168190
9 | RF2 (nodesize = 2) | 0.07511018 | 11.10445503 | 0.06689458 | 9.61943009
10 | RF1 (nodesize = 5) | 0.07920765 | 11.05388887 | 0.07428699 | 9.83121416

LIN2: regressors filtered after VIF; LIN3: forward regression; LIN3v: forward regression with regressors filtered after VIF; LIN4: backward regression; LIN4v: backward regression with regressors filtered after VIF; RDT1: regression decision tree with all data as inputs; RDT2: regression decision tree with correlated inputs; RF1: random forest with all data as inputs; RF2: random forest with only correlated inputs.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Falát, L.; Piscová, T. Predicting GPA of University Students with Supervised Regression Machine Learning Models. Appl. Sci. 2022, 12, 8403. https://doi.org/10.3390/app12178403

