1. Introduction
Quality education provides the basis for equality in society, and high-quality education is one of the most basic public services [1]. The quality of education is vital for every citizen [2]. Educational institutions should therefore focus on improving the academic performance of every student individually. To achieve overall academic success, students need to perform well in all courses [3]. It is difficult for educators to keep track of their students' academic performance and improve it in each course [4]. Because they cannot manage individual course-wise records manually, they are unable to improve students' performance when each student's needs demand different attention [5]. Thus, an automated system is required that provides detailed information on students' progression, taking as input exam results, assignments, and performance in other course activities [6]. Researchers are working on a variety of statistical and machine learning models for predicting the impact of academic factors on different students [7].
Educational Data Mining (EDM) is a field that has been used to analyze academic data [8,9]. EDM has various applications in system development, as it uses different computational methods to detect patterns in large-scale data. The prediction of student results in terms of academic performance, ratings, or grades is a well-known application of EDM [10,11,12,13]. Predictive modeling techniques have been applied to students' academic performance [14,15]; in this regard, classification techniques turn out to be the most effective for this problem.
Data mining can be used in different fields to improve overall efficiency through pattern analysis [16]. This is possible only by extracting valuable, previously undiscovered information [17] from a stored dataset. The extracted information can then be used to resolve issues faced during the development of the structural model. Today, data mining and its applications in the education sector have gained more importance than ever before. Thus, Educational Data Mining can be defined as 'the process of transforming raw data from the educational system into useful information that can be used later by stakeholders for further applications' [18]. Ultimately, it helps educational institutions review, improve, and strengthen students' learning processes. To enhance the environment of any educational institution, the most important requirement is to understand students' learning processes. Understanding this process has several advantages, such as optimizing learning outcomes for students and making the system strong enough to support weak students [19]. As a result, the rate of students failing courses and dropping out will decrease [20].
In the geography education literature, spatial thinking is closely tied to spatial skills, aptitudes, and ideas [21,22]. Choosing the best route to commute to work or school is just one example of how spatial thinking is used daily. Geographic Information Systems (GIS), in particular, can improve spatial thinking because they make it possible to analyze geospatial data and find hidden patterns within it [23,24]. A GIS is a system used for the management, storage, and analysis of geospatial data. GIS-based applications are widely available online and can be used by anyone to process data with respect to its spatial features [25].
Geospatial data comprise both the location and the characteristics of spatial features. To define a lane, for example, reference is made to its position (i.e., where it is) and its attributes (e.g., length, name, speed limit, and direction). A GIS enables the user to handle road data and many other kinds of geospatial data, thereby separating them from non-spatial business data management systems. In addition to geospatial data, a GIS comprises hardware, software, people, and organization. GIS hardware includes computers, printers, plotters, digitizers, scanners, Global Positioning System (GPS) receivers, and mobile devices. GIS software, whether commercial or open-source, contains programs and applications for data management, interpretation, display, and other computing tasks [26].
Student dropout is a very important issue in higher education; if the dropout rate is high, it wastes institutional resources and damages the institution's credibility whenever an institutional evaluation is performed [27]. Consequently, there is a pressing need for a model that estimates students' final results from their previous records in order to reduce the dropout rate. This will also enhance the quality of education. To this end, faculty members, administrators, and the educational system as a whole should take responsibility for designing better learning outlines and establishing systems that expand learning opportunities for students [28].
Hence, in this paper, we first identified student performance risk factors and semester behavioral factors from the literature in order to predict student performance. We then conducted three experiments to meet our objectives. In the first experiment, we predicted student performance using a scientific analysis technique, the Fuzzy Delphi Method (FDM), to screen and shortlist the key student performance factors. In the second experiment, we incorporated all identified risk factors to predict student academic performance. In the third experiment, we used a machine learning feature engineering technique, the Variance Threshold (VT), to predict student academic performance. The main objective of this paper is, first, to find and use spatial locational factors and semester behavioral factors for predicting student academic performance; second, to identify the key factors that have the most impact on student performance; and finally, to analyze the spatial data using spatial statistical analysis.
2. Related Work
The emergence of Educational Data Mining (EDM), a discipline that has been used for over a decade in the development, study, and application of computerized methods for pattern detection, has helped exponentially in the analysis of vast educational data [29] that would otherwise be difficult to process due to its sheer volume. The prediction of student results, where the aim is to estimate unknown outcomes such as knowledge gained, ratings, or grades, is considered one of the most established and well-known applications of EDM [30]. A predictive model of student success built from historical student data is a highly recommended technique for investigating students' problems [31,32].
To increase the overall efficiency of a system, data mining (DM) can be applied in different fields. This can be achieved by extracting valuable and previously undiscovered knowledge from a stored dataset [33]. The information learned in this way helps solve several challenges and develop the current structure [34].
The use of DM in education is of increasing importance. Indeed, conventional DM techniques can be applied to educational data on college learners to obtain results. EDM is defined as the process [18] of transforming raw data collected by education systems into useful information that teachers can use to take corrective action and answer research questions. Thus, EDM assists education centers in reviewing and strengthening students' learning processes. In enhancing an institution's educational environment, understanding students' knowledge-based learning plays a huge part in developing skills. Such awareness brings many advantages, such as optimizing learning outcomes for students and the ability to prepare support for weak students. The number of students dropping out or failing classes would decrease as a consequence [35].
Estimating student performance is not an easy task, and it is important for both students and teachers to be aware of it. Early estimation is helpful for students and teachers alike: teachers can play an important role by warning students at risk of dropping a course or subject at university, and can also help students who need extra support [35].
The student dropout rate, one of the significant problems in higher education, affects the resources of the university and eventually the institutional evaluation process [27]. It is necessary to propose an evaluation model for estimating students' results. This will support the academic quality process and reduce dropout rates. We should give priority to education and communication in our societies; it is the responsibility of teachers, education systems, and their administrators to develop better learning outlines and establish systems that expand learning opportunities [28].
We need to identify weak students in a class through performance prediction using different techniques, so as to give them proper attention and prevent them from dropping out of their studies [36]. Therefore, to address the student dropout rate, early warning systems need to be built and considered [37]. Because the information systems of educational organizations are often incomplete and faulty, student behavioral characteristics are used for student performance prediction [38].
Locational features also have some impact on students' performance. In this regard, the geographic location of public schools has been considered [39], and the study concluded that it does affect students' academic performance. Another study concluded that a rural teaching location was associated with high student satisfaction [40]. Yet another study on location found that school resources vary across geographic locations: communities in small rural areas have the lowest socioeconomic profile [41], lower student academic performance, and shortages of education staff and instructional material, while schools in the neighborhood of towns have more resources, greater availability of teaching staff, and higher student satisfaction [42]. Another study [43] concluded that students in remote areas do not perform as well as their counterparts elsewhere in the country. A further study [44] concluded that students in city schools achieve significantly higher marks than those in other geographic areas.
Student dropout also causes financial loss for both students and their education sectors. It affects graduation rates and lowers employment opportunities in highly qualified positions. When an institution loses a student, its retention rate decreases. Education for Sustainable Development is an important factor in making societies better, and higher education guarantees that a society produces future professionals and leaders [28]. That is why predicting student performance is a significant study area: it can make students aware of their expected results before final exams.
This prediction will alert weaker students that they must put in extra effort to achieve better results than predicted. From an institutional perspective, different prediction techniques can identify the affected students, so that their teachers can give them full attention, ease their studies, and keep them from dropping out. With the help of these predictions, early warning systems can be built to decrease student dropout rates [37].
4. Proposed Methodology and Results
The proposed methodology includes three experiments for the prediction of student performance. The first experiment used the Fuzzy Delphi Method output. The second experiment was applied to the full dataset containing all academic, locational, and behavioral features. The last experiment applied different feature selection techniques to the complete dataset before predicting student performance.
4.1. Methodology
In the methodology phase, multiple data processing steps were carried out. In the data preprocessing phase, we performed several steps, starting with data cleaning. In the data cleaning step, a record was removed because its semester information was missing. A record contains a student's data for a semester, and because some important features were unavailable, we had to remove that record before using the data in the experiments. Data binning was performed on some columns containing numeric values, such as the feature 'Past Performance'. Some categorical columns, with values 'Yes' and 'No', were normalized directly to 1 and 0 in the data normalization step. In the last phase of data preprocessing, label encoding was performed on columns with more than two unique values. Our dataset has a multi-label class (Good-Performer, Avg-Performer, Bad-Performer). To synthesize the dataset, SMOTE (Synthetic Minority Oversampling Technique) was used. Before applying SMOTE, the dataset had 799 records and 47 features; after applying SMOTE, the number of records increased to 1428 with the same 47 features (Figure 6).
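The preprocessing steps above (binning, Yes/No normalization, and label encoding) can be sketched as follows. This is a minimal illustration on a toy frame; the column names and bin edges are assumptions for demonstration, not the paper's actual schema.

```python
import pandas as pd

# Toy frame standing in for the student dataset; column names and values
# are illustrative, not the paper's actual schema.
df = pd.DataFrame({
    "past_performance": [55.0, 72.5, 91.0, 64.0],   # numeric -> binned
    "scholarship":      ["Yes", "No", "Yes", "No"], # binary -> 1/0
    "region":           ["urban", "rural", "urban", "suburban"],  # >2 values -> label encoded
})

# 1) Data binning: discretize a numeric column into ordered bands.
df["past_performance"] = pd.cut(
    df["past_performance"], bins=[0, 60, 80, 100], labels=[0, 1, 2]
).astype(int)

# 2) Data normalization: map binary Yes/No columns straight to 1/0.
df["scholarship"] = df["scholarship"].map({"Yes": 1, "No": 0})

# 3) Label encoding: columns with more than two unique values get integer codes.
df["region"] = df["region"].astype("category").cat.codes

print(df)
```

The final synthesis step could then be applied with an oversampler such as `SMOTE` from the `imbalanced-learn` package, which interpolates new minority-class records between existing samples and their nearest neighbors.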
We performed three experiments to predict the academic performance of students (Figure 7); in all experiments we performed the same preprocessing steps (data binning, label encoding, data normalization, and data synthesis using SMOTE). The aim of the first experiment (Exp-1) is to consider only those features that have higher importance according to experts and to obtain their consensus on the suitability of each item presented in the questionnaire. In this experiment, we used the Fuzzy Delphi Method (FDM), a scientific analysis technique for consolidating consensus within a panel of experts. FDM was used to shortlist the 47 features. After obtaining consensus from 42 experts, we were left with 26 features for Exp-1. After performing the preprocessing steps, we had two datasets for Exp-1: without SMOTE, the dataset had 799 records and 66 features, and with SMOTE, it had 1428 records and 66 features.
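The FDM screening step can be sketched as follows. The triangular fuzzy numbers for the 5-point Likert scale and the 0.5 acceptance threshold used here are common FDM conventions assumed for illustration; the paper's exact scale and threshold may differ.

```python
import numpy as np

# Triangular fuzzy numbers (l, m, u) for a 5-point Likert scale; this mapping
# and the 0.5 acceptance threshold are common FDM conventions, assumed here.
FUZZY = {1: (0.0, 0.0, 0.25), 2: (0.0, 0.25, 0.5), 3: (0.25, 0.5, 0.75),
         4: (0.5, 0.75, 1.0), 5: (0.75, 1.0, 1.0)}

def fdm_accepts(expert_scores, threshold=0.5):
    """Average the experts' fuzzy numbers, defuzzify, and compare to the cutoff."""
    tri = np.array([FUZZY[s] for s in expert_scores])  # one (l, m, u) per expert
    l, m, u = tri.mean(axis=0)                         # aggregated fuzzy number
    defuzzified = (l + m + u) / 3.0                    # simple centroid defuzzification
    return bool(defuzzified >= threshold), float(defuzzified)

# Hypothetical panel ratings for two candidate features.
print(fdm_accepts([5, 4, 5, 4, 4]))  # strong consensus -> feature kept
print(fdm_accepts([2, 1, 3, 2, 2]))  # weak support -> feature dropped
```

Running this screen over every candidate feature, keeping only those whose defuzzified score exceeds the threshold, mirrors the 47-to-26 shortlist described above.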
For the second experiment (Exp-2), the aim is to consider all the features identified as important in the literature. We collected all 47 such features and used them in this experiment. After preprocessing, we again obtained two datasets for Exp-2: without SMOTE, the dataset had 799 records and 116 features, and with SMOTE, it had 1428 records and 116 features.
Lastly, the third experiment (Exp-3) aims to consider the features that have higher importance according to machine learning feature selection techniques. In this experiment, we applied three feature engineering techniques: Select K-Best, Variance Threshold, and L1-based selection. After obtaining the results of these three techniques in a pilot experiment, we found that the Variance Threshold produced much better results than the other two; the accuracies achieved with the Variance Threshold are clearly higher, as can be seen in Figure 8.
As a result, we used the Variance Threshold method. Here again, we have two datasets in Exp-3 after preprocessing: without SMOTE, the dataset has 799 records and 43 features, and with SMOTE, it has 1428 records and 37 features. The feature counts differ because the Variance Threshold selection is applied separately to each data source (without-SMOTE data and with-SMOTE data).
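The Variance Threshold technique drops features whose variance does not exceed a cutoff, on the reasoning that near-constant columns carry little predictive signal. A minimal scikit-learn sketch on a toy matrix (the data and threshold are illustrative):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: 4 samples x 4 features; column 1 is constant, so it carries
# no information and should be removed by the selector.
X = np.array([
    [2.1, 0.0, 1.0, 5.0],
    [1.9, 0.0, 0.0, 3.0],
    [3.3, 0.0, 1.0, 4.5],
    [2.7, 0.0, 0.0, 4.0],
])

selector = VarianceThreshold(threshold=0.0)  # drop zero-variance columns only
X_reduced = selector.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
print("kept columns:", selector.get_support())
```

Because the selector is fitted on the data it sees, running it separately on the without-SMOTE and with-SMOTE datasets can keep different feature subsets, which is why the two Exp-3 datasets end up with 43 and 37 features respectively.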
Figure 9 summarizes the dataset properties of all the experiments.
In this research, we used different machine learning and deep learning algorithms for predicting student academic performance: Naïve Bayes (NB), Decision Tree (DT), Long Short-Term Memory (LSTM), Multi-Layer Perceptron (MLP), Random Forest (RF), and Support Vector Machine (SVM). Hyperparameter tuning via grid search cross-validation (GridSearchCV) was applied in all experiments, both without SMOTE and with SMOTE. Several evaluation metrics were used to assess model performance: accuracy, precision, recall, F1-score, and the ROC (Receiver Operating Characteristic) curve.
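The tuning-and-evaluation loop can be sketched with scikit-learn's `GridSearchCV`. The synthetic three-class data below stands in for the Good/Avg/Bad performer labels, and the parameter grid is illustrative rather than the paper's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with three classes, mirroring the multi-label
# performance classes; not the paper's dataset.
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

# Illustrative grid; the paper's grids per model are not specified here.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)   # exhaustive search with 5-fold cross-validation

print("best params:", search.best_params_)
print("held-out accuracy:", search.score(X_te, y_te))
```

The same pattern, with a per-model grid, yields the "best hyperparameters for each model on each data source" reported in the results tables.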
Spatial statistical analysis techniques have also been applied to examine the spatial behavior of the dataset. We used two methods: Multivariate Clustering and Average Nearest Neighbor.
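The Average Nearest Neighbor statistic compares the observed mean distance to each point's nearest neighbor with the distance expected under a random (Poisson) pattern. A self-contained sketch, using brute-force distances and synthetic point sets (GIS packages such as ArcGIS compute this natively):

```python
import numpy as np

def average_nearest_neighbor(points, area):
    """ANN ratio: observed mean nearest-neighbor distance over the expected
    value 0.5*sqrt(area/n) for a random pattern.
    Ratio < 1 suggests clustering, ratio > 1 suggests dispersion."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)                                     # ignore self-distance
    observed = d.min(axis=1).mean()                                 # mean NN distance
    expected = 0.5 * np.sqrt(area / n)
    return observed / expected

# Synthetic point patterns in a 100 x 100 study area.
rng = np.random.default_rng(1)
dispersed = rng.uniform(0, 100, size=(200, 2))   # roughly random pattern
clustered = rng.normal(50, 2, size=(200, 2))     # tight cluster around (50, 50)
print(average_nearest_neighbor(dispersed, 100 * 100))
print(average_nearest_neighbor(clustered, 100 * 100))
```

Applied to student coordinates labeled by performance class, a ratio well below 1 would indicate that students of a given class cluster geographically.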
4.2. Results
We carried out three experiments to test our methodology and to compare the performance of the proposed experiments using state-of-the-art machine learning and deep learning classification algorithms.
4.2.1. Experiment 1 (Exp-1)
Experiment 1 was performed on the Fuzzy Delphi output: the input features shortlisted by FDM were used for student academic performance prediction. In this experiment, the Support Vector Machine (SVM) achieved the best accuracy compared with the Decision Tree (DT), Long Short-Term Memory (LSTM), Multi-Layer Perceptron (MLP), Naïve Bayes (NB), and Random Forest (RF). The experiment was performed both without and with SMOTE, which is used for data balancing. With SMOTE, SVM obtains a higher accuracy of 89.5 than all other models, especially Random Forest and LSTM, as shown in Figure 10.
Regarding precision, recall, and F1-score, the SVM again performs well with SMOTE; the MLP also obtains good results with SMOTE compared with the other models, especially RF and the deep learning model LSTM (Table 2).
Hyperparameter tuning with grid search CV was applied to both data sources (without-SMOTE and with-SMOTE data), and the best hyperparameters were retrieved for each machine learning algorithm on each data source (Table 3).
4.2.2. Experiment 2 (Exp-2)
Experiment 2 was performed on the full dataset of 799 records and 47 features. In this experiment, Random Forest (RF) obtained the highest accuracy without SMOTE compared with DT, LSTM, MLP, NB, and SVM. The experiment was performed both without and with SMOTE. With SMOTE, LSTM achieved the highest accuracy of 90.9, as shown in Figure 11.
With regard to precision, recall, and F1-score, LSTM performed better than the other models on both data sources (without-SMOTE and with-SMOTE data) (Table 4).
Detailed hyperparameter tuning of the LSTM was performed in this experiment. First, the number of neurons in a single LSTM layer was identified with an epoch value of 70 and a batch size of 10. Then, the number of LSTM layers was identified using the best number of neurons and the same epoch and batch-size values. Next, we found the best number of dense (fully connected) layers in the LSTM model, given the best-identified LSTM layers and neurons and the same epoch and batch-size values. In the fourth phase, the best number of neurons in the dense layer(s) was identified. In the fifth and sixth phases, the numbers of epochs and the batch size were identified given the best values of LSTM layers, LSTM neurons, dense layers, and dense neurons. This procedure was performed on both data sources (without-SMOTE and with-SMOTE data), but we visualize the with-SMOTE run, as it achieved the highest accuracy (Figure 12).
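The phase-by-phase procedure above is a greedy, one-parameter-at-a-time search: each phase fixes the previous winners and varies only the next setting. A generic sketch follows; in practice `evaluate` would train the LSTM and return validation accuracy, but here a stand-in scoring function is used so the sketch is runnable, and the candidate grids are illustrative.

```python
# Greedy, phase-by-phase hyperparameter tuning: each parameter is tuned in
# turn while all previously tuned parameters stay frozen at their best values.

def staged_search(phases, evaluate, base):
    """Tune parameters one phase at a time, freezing each best value."""
    best = dict(base)
    for name, candidates in phases:
        scores = {v: evaluate({**best, name: v}) for v in candidates}
        best[name] = max(scores, key=scores.get)   # freeze the winner
    return best

# Phases mirror the text: LSTM neurons -> LSTM layers -> dense layers ->
# dense neurons -> epochs -> batch size (candidate grids are assumptions).
phases = [
    ("lstm_neurons",  [32, 64, 128]),
    ("lstm_layers",   [1, 2, 3]),
    ("dense_layers",  [1, 2]),
    ("dense_neurons", [16, 32, 64]),
    ("epochs",        [50, 70, 100]),
    ("batch_size",    [10, 20, 32]),
]

# Stand-in evaluator that pretends one specific configuration is optimal;
# a real run would train the model and return validation accuracy instead.
target = {"lstm_neurons": 64, "lstm_layers": 2, "dense_layers": 1,
          "dense_neurons": 32, "epochs": 70, "batch_size": 10}
def evaluate(cfg):
    return -sum(abs(cfg[k] - v) for k, v in target.items())

base = {"lstm_neurons": 32, "lstm_layers": 1, "dense_layers": 1,
        "dense_neurons": 16, "epochs": 70, "batch_size": 10}
print(staged_search(phases, evaluate, base))
```

This staged strategy trains far fewer models than an exhaustive grid over all six parameters, at the cost of possibly missing interactions between settings tuned in different phases.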
The LSTM loss and accuracy were visualized on both the training and test splits of the dataset, and the ROC curve was plotted to evaluate the LSTM model (Figure 13). The ROC curve shows good results for the LSTM model. The hyperparameters of this experiment are listed in Table 5.
4.2.3. Experiment 3 (Exp-3)
Experiment 3 was performed on the full dataset, which initially had 799 records and 47 features. In this experiment, different machine learning feature selection techniques were applied. The Variance Threshold provided the best performance in the pilot experiment across all machine learning models compared with the other feature engineering techniques (Select K-Best and L1-based selection). For this reason, experiment 3 uses this feature selection technique.
In this experiment, the SVM achieved the highest accuracy without SMOTE compared with DT, LSTM, MLP, NB, and RF. The experiment was performed both without and with SMOTE. With SMOTE, RF achieved the highest accuracy of 88.8 (Figure 14).
With regard to precision, recall, and F1-score, the Random Forest again achieved the best scores with SMOTE, as shown in Table 6.
Hyperparameter tuning with grid search CV was applied to both data sources (without-SMOTE and with-SMOTE data), and the best hyperparameters for each model on each data source were retrieved (Table 7).
4.3. Results Comparison
We performed three different experiments on our research problem and compared their accuracies. With the without-SMOTE data, the highest accuracy was achieved in experiment 2 by Random Forest (RF); with the with-SMOTE data, the highest accuracy was also achieved in experiment 2, by Long Short-Term Memory (LSTM), as shown in Figure 15 and the compared results in Table 8.
4.4. Significance of Features (P-Value)
In our research context, eleven features have a significant p-value (p < 0.05). Features such as past performance, society status, and semester behavior have a strong impact on students' performance (Figure 16).
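A feature's significance can be checked by comparing its values across outcome groups with an independent two-sample t-test, as in the following sketch. The numbers are synthetic, not the paper's data, and the group means are assumptions chosen only to illustrate a significant difference.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one feature (e.g. 'past performance') split by
# outcome group; means and spreads are illustrative assumptions.
rng = np.random.default_rng(7)
good_performers = rng.normal(75, 8, size=60)
bad_performers  = rng.normal(62, 8, size=60)

# Independent two-sample t-test: does the feature's mean differ by group?
t_stat, p_value = stats.ttest_ind(good_performers, bad_performers)
print("t statistic:", t_stat)
print("significant at 0.05:", p_value < 0.05)
```

Repeating this test per feature and keeping those with p < 0.05 reproduces the kind of significance screen summarized in the figure.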
5. Conclusions and Future Works
In the current study, our main concern was to predict students' academic performance at an early stage of the semester, so that early predictions make students aware of their expected results and early warning systems can be built to reduce student dropout rates. This study also performed an extensive literature review to establish the importance of key factors that can play an important role in predicting students' academic performance. Moreover, this study focused on students' locational factors, which can play an important role in their academic lives. Through locational features, we can identify the areas or regions that are lagging or need an uplift, so that proper educational facilities can be provided to them.
Our study focused on finding and working with important key features from which student performance can be predicted at early stages. We collected all key factors highlighted in the relevant articles. We then asked educational experts to score each factor on a 5-point Likert scale. To obtain a threshold for the scores, we applied the Fuzzy Delphi Method and kept all factors whose scores exceeded the threshold value. The purpose of this process was to highlight all important factors that affect student performance, so that future researchers do not need to establish the importance of these educational factors again. Educational institutions can also work on these factors and predict student performance early, thereby minimizing dropout rates.
Another main aspect of our study was to consider locational factors, as one of our goals was to establish the importance of certain locational features (e.g., student location (urban, rural), distance to school, society status, and geo-coordinates of the student's location). We also performed GIS analytics to discover the areas or regions that are lagging or need an uplift, so that proper educational facilities can be provided to them.
Our study predicted student academic performance with high accuracy at early stages and highlighted the key factors affecting student performance. These features are important not only for predicting students' academic performance but also for decreasing dropout rates, increasing graduation rates, reducing financial losses for both students and the education sector, and, most importantly, improving employment rates.
In our findings, the deep learning model LSTM achieved the highest accuracy of 90.9 compared with state-of-the-art machine learning algorithms; LSTM performed better when the features and dataset records were large in number. Student academic performance prediction was performed in three experiments: (a) using the Fuzzy Delphi Method; (b) using educational key factors; and (c) using machine learning feature engineering techniques. SMOTE data synthesis proved a significant method for dealing with the unbalanced nature of the dataset in all the experiments. According to this research, all the factors considered in the study are highly correlated, and maximum accuracy is obtained when all factors are used for predicting students' academic performance. In this specific research context, we can conclude that the scientific analysis technique FDM yields better accuracy than the machine learning feature engineering technique (Variance Threshold). Along with past performance and social status, the semester behavior factors have a strong impact on students' performance (t-test). Spatial statistical analysis provided spatial information about the results (performance areas and the spatial correlation of features via Average Nearest Neighbor).
Estimating student performance demands the attention of both students and teachers, where teachers can play an integral part by helping students who need extra support and warning them of the risk of dropping a course or subject. Student dropout is a potential problem that causes financial loss to both students and the education sector, affects graduation rates, and lowers employment opportunities in highly qualified positions. We therefore proposed a model based on students' previous records that helps estimate their final results and reduce the dropout rate.
Substantial work has been performed on predicting students' academic performance using data mining and machine learning models, but less importance has been given to locational features. In this research, we combined geospatial and machine learning tools to relate students' locational factors to their academic performance. The main purpose was to establish a GIS-based system that takes the geographic location of students, evaluates various educational strategies based on machine learning techniques, and generates results from multiple input data to predict their academic performance.
This study focused on finding the key factors, predicting performance, and clustering students in different areas based on their data class, making particular use of students' geospatial locations. This study could not consider geo-socio-demographic features because we did not have the data. In the future, this work can be extended to predict student academic performance from geo-locational attributes. As our dataset includes students' coordinates, we can enlarge the dataset and, from the coordinates, extract geo-locational or area-specific features (e.g., the number of schools, universities, and hospitals) to characterize the socio-demographics of any region. Working with students' geo-locations in this way will also allow student performance prediction to be viewed from a new and broader perspective.