This section presents experiments conducted on three tasks in the education domain: knowledge mapping [30], exercise difficulty prediction [11], and student performance prediction [31]. The knowledge mapping task is a multi-class classification task, while exercise difficulty prediction is a regression task; both are distinct tasks within the domain of educational exercise modeling. The knowledge mapping task accurately delineates the knowledge points encompassed by each exercise, allowing for the tracking of a student's engagement with specific knowledge points and the identification of their weak areas. Exercise difficulty prediction, on the other hand, enables a more rational assembly of exercise sets of varying difficulties based on individual student learning profiles, with the goal of enhancing overall learning outcomes. Finally, the student performance prediction task serves as a valuable tool for analyzing a student's learning situation, facilitating personalized educational services through tailored exercise sets. These three tasks stand as quintessential and representative challenges in the field of educational exercise modeling, and their effective utilization represents a pivotal step in advancing intelligent education. The exercise characterization modules in these tasks are replaced with the comparison models selected for the experiments. The primary goal is to compare the effectiveness of these models in extracting and fusing heterogeneous features across the three tasks. Additionally, this experiment visualizes the vectors generated by the multimodal characterization model using T-SNE [32] to analyze their ability to capture the semantic information of the input exercise. This analysis serves to verify the effectiveness and advancement of the proposed MIFM model in the task of multimodal exercise characterization.
4.1. Dataset
The experimental dataset consists of high school chemistry exercises obtained from the educational website https://www.jyeoo.com using crawling techniques. The web crawler was implemented using various libraries and tools, including Requests, BeautifulSoup, Scrapy, and Selenium. Throughout the data collection process, the crawler strictly adhered to the website's crawling protocol, limiting data crawling to the boundaries defined by the website's terms of use. Additionally, it is important to note that the obtained data are used exclusively to investigate the model proposed in this paper, without engaging in any unauthorized or illicit activities. This approach ensures both the integrity of the data collection process and compliance with legal and ethical standards. The experimental dataset mainly includes three types of exercises: multiple-choice, judgment, and quiz exercises. The collected information includes exercise text, exercises with diagrams, exercise scores, exercise difficulty, exercise types, knowledge concepts, and other relevant information. Exercise difficulty ranges from 1 to 10, categorizing each exercise into one of ten distinct difficulty grades; a higher numerical value indicates a more difficult exercise. The final format of the exercise difficulty data is as follows: (Exercise Number, Exercise Text, Knowledge Concept Number, Exercise Attached Diagram Path, Difficulty Level (1–10)). During the crawling process, the issue of duplicate data retrieval was encountered. To address this, the experiment utilizes the GloVe word-embedding model [27] mentioned earlier to obtain word embeddings for the crawled exercises. A simple cosine calculation is then applied to these word embeddings to determine the similarity between exercises, and exercises with a similarity score exceeding 0.5 are excluded from database insertion. Additionally, for exercises containing formulas, a third-party tool, MML2OMML.xsl, is employed to convert the formulas into MathML format, facilitating further processing of the formulas within the exercises. Ultimately, after duplicate removal, formula conversion, and punctuation elimination, the dataset comprises a total of 186,525 exercises of two types: text-only exercises and exercises containing both text and image data. The distribution of exercises with different difficulty levels and exercises containing different knowledge points is shown in Figure 4.
Figure 5 illustrates the distribution of exercises with varying lengths after processing.
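The deduplication step described above can be sketched as follows. The pooling of word embeddings into a single exercise vector (simple averaging here) and the toy GloVe lookup table are assumptions for illustration; the paper does not specify the pooling scheme.

```python
import numpy as np

def exercise_vector(tokens, glove, dim=100):
    """Pool GloVe word embeddings into one exercise vector by averaging.
    `glove` is a hypothetical token -> vector lookup table."""
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def deduplicate(exercises, glove, threshold=0.5):
    """Insert an exercise only if its similarity to every already-kept
    exercise is at or below the threshold (0.5 in the paper)."""
    kept, kept_vecs = [], []
    for ex in exercises:
        v = exercise_vector(ex.split(), glove)
        if all(cosine(v, kv) <= threshold for kv in kept_vecs):
            kept.append(ex)
            kept_vecs.append(v)
    return kept
```

In practice the similarity check would be batched against the database rather than held in memory, but the filtering rule is the same.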
4.2. Experimental Evaluation Metrics
In the experimental process, various evaluation metrics are employed to assess model performance across the different tasks. For the knowledge mapping task, model performance is evaluated using ACC, Precision, Recall, and F1 values. In the exercise difficulty prediction task, MAE, RMSE, and PCC are utilized to evaluate the model's predictive accuracy. In the student performance prediction task, MAE, RMSE, ACC, and AUC values are employed. Since ACC, Precision, Recall, and F1 are relatively common and straightforward to calculate, only the remaining metrics and related concepts are introduced below.
MAE (Mean Absolute Error) measures the average absolute difference between the true value and the predicted value, and it is calculated using Equation (10):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{10}$$

where $n$ is the number of samples, $y_i$ is the true value, and $\hat{y}_i$ is the predicted value.
RMSE (Root Mean Square Error) calculates the square root of the mean of the squared differences between the true value and the predicted value. It is computed using Equation (11):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{11}$$

where $n$ is the number of samples, $y_i$ is the true value, and $\hat{y}_i$ is the predicted value.
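The two error metrics can be sketched directly from their definitions:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average of |y_i - y_hat_i|."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean square error: sqrt of the mean squared difference."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

Note that RMSE penalizes large individual errors more heavily than MAE because the differences are squared before averaging.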
AUC (Area Under the ROC Curve) measures the performance of a binary classifier, and it is determined by calculating the area under the receiver operating characteristic curve. Equation (12) demonstrates its rank-based calculation:

$$\mathrm{AUC} = \frac{\sum_{i \in \text{positive}} \mathrm{rank}_i - \frac{M(M+1)}{2}}{M \cdot N} \tag{12}$$

where $\mathrm{rank}_i$ denotes the ordinal number (rank, in ascending order of predicted score) of the $i$th sample, $M$ and $N$ are the numbers of positive and negative samples, respectively, and $\sum_{i \in \text{positive}} \mathrm{rank}_i$ represents the summation of the ordinal numbers of the positive samples.
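A minimal sketch of this rank-based AUC computation; ties in the predicted scores are not averaged here, which a full implementation would handle:

```python
import numpy as np

def rank_auc(labels, scores):
    """AUC via the rank identity: (sum of positive ranks - M(M+1)/2) / (M*N).
    Ranks are assigned in ascending order of predicted score (rank 1 = lowest)."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    M, N = int(pos.sum()), int((~pos).sum())
    return float((ranks[pos].sum() - M * (M + 1) / 2) / (M * N))
```

The subtracted term $M(M+1)/2$ is the smallest possible rank sum for the positives, so the result is 1.0 when every positive outranks every negative and 0.5 for random scores on average.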
PCC (Pearson Correlation Coefficient) quantifies the linear relationship between two variables $X$ and $Y$. It is computed using Equation (13):

$$\mathrm{PCC} = \frac{\sum_{i=1}^{n}\left(X_i - \mu_X\right)\left(Y_i - \mu_Y\right)}{n\,\sigma_X\,\sigma_Y} \tag{13}$$

where $\mu_X$ and $\sigma_X$ denote the mean and standard deviation of variable $X$, respectively (and likewise $\mu_Y$ and $\sigma_Y$ for $Y$), $X$ denotes the sample, and $Y$ denotes the prediction label.
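The coefficient can be sketched directly from its definition, using population (not sample) standard deviations to match the $1/n$ averaging in the numerator:

```python
import numpy as np

def pcc(x, y):
    """Pearson correlation: covariance of x and y divided by the product
    of their (population) standard deviations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return float(cov / (x.std() * y.std()))
```

A value of 1 indicates a perfect positive linear relationship between predictions and true difficulties, -1 a perfect negative one, and 0 no linear relationship.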
4.4. Experiment and Analysis
In order to assess the effectiveness of the MIFM model, comparative experiments were conducted on three educational tasks. The first task, knowledge concept mapping, is a multi-category classification task whose input data are in the format (e, kc), where 'e' represents the exercise, including its text and any accompanying diagram, and 'kc' denotes the knowledge concept associated with the exercise. In this task, the comparison model replaces the data characterization module of the classification model. The exercise vector is obtained through the exercise characterization model, and the classification task then assigns the exercise to its corresponding knowledge concept. Evaluation metrics such as ACC, Precision, Recall, and F1 are used to assess model performance on the knowledge concept-mapping task.
The second task, exercise difficulty prediction, is a regression task aimed at estimating the difficulty level of an exercise on a scale of 1 to 10. The input data for this task follow the format (e, diff), where 'e' represents the exercise and 'diff' represents the exercise difficulty, with diff ∈ {1, 2, …, 10}. The comparison model replaces the data encoding module in the exercise difficulty prediction model to compare prediction results. In this task, model performance is evaluated using metrics such as MAE, RMSE, and PCC.
The last task is student performance prediction, which aims to predict student performance on each exercise based on student answer records. Similar to the previous tasks, the selected comparison model replaces the exercise characterization module in the task model. Model performance is then evaluated using evaluation metrics such as MAE, RMSE, ACC, and AUC.
In this experiment, the HAN model is utilized for exercise knowledge concept mapping, the TACNN model for exercise difficulty prediction, and the EERNN for student performance prediction. These models serve as the baseline models for each respective task. To evaluate the performance of the exercise characterization models, the exercise characterization module in the aforementioned task models is replaced, and the model’s performance on the three tasks is compared. The selected exercise characterization models include the following:
ELMo: This model is a pre-trained language model based on LSTM feature extraction, generating dynamic word embeddings [33]. It employs a Bi-LSTM network to extract features from input text. When the input data include modalities other than text, only the text data are considered for feature extraction, while the other modal data are ignored.
BERT: This model is a text-based pre-trained model [34], implemented internally using a stack of transformer layers. It generates text characterization vectors that capture rich contextual semantic information through self-supervised learning. Like ELMo, it accepts only textual input and disregards data from other modalities.
m-CNN: This model is an enhanced multimodal characterization model based on CNN that can handle heterogeneous data input [35]. It employs a multimodal convolutional approach to fuse exercise text and accompanying exercise images, generating multimodal characterization vectors for the exercise.
MIFM: The model proposed in this paper is a multimodal information fusion model for exercise characterization. It accepts data input from both text and image modalities, using a dual-stream architecture with different modal encoders. Features are extracted from the heterogeneous input data, followed by feature fusion using a cross-modal attention mechanism. The model outputs a multimodal characterization vector that combines text and image features, resulting in a comprehensive exercise characterization fusing both heterogeneous features.
For the three aforementioned tasks, if a selected comparison model includes the same module or network as the proposed MIFM, the relevant parameters of those shared modules are set to identical values to ensure the rigor of the experimental results. When the input exercise comprises only text data, the vectorization process is performed solely on the text data. Conversely, if the exercise contains both text and image data, feature extraction and fusion are carried out to create the exercise's characterization vector. Some of the comparison models chosen for the experiment support only unimodal data; in such cases, feature extraction is conducted solely on the required input data type. If the comparison model is multimodal, data from both modalities are processed concurrently. Experiments are then conducted for each educational task to assess the performance of the selected comparison models.
Table 2 presents the experimental results of each model on the knowledge mapping task. Upon examining the performance of each model across different metrics, it is evident that the original model, ELMo, and BERT, being unimodal feature extraction models, can only extract features from the text data input and do not accommodate data from other modalities. Consequently, the exercise characterization vectors generated by these models struggle to capture the complete semantic information of the exercises, resulting in subpar overall performance. In contrast, the m-CNN model possesses multimodal feature extraction and fusion capabilities. It accepts input data from both text and image modalities, extracting text and image features and fusing them using multimodal convolution to create exercise characterization vectors with fused heterogeneous features. However, m-CNN employs a unimodal feature extractor to extract multimodal features, which can lead to parameter confusion and information loss during both feature extraction and fusion processes. As a result, the performance improvement achieved by this model is limited. On the other hand, the MIFM model proposed in this paper adopts a dual-stream architecture and distinct modality-specific feature extractors. It separately extracts features from data of different modalities and employs cross-modal fusion to integrate heterogeneous features. This approach yields significant improvements across all four performance metrics.
Table 3 and
Table 4 display the performance of each model on the exercise difficulty prediction task and the student–answer performance prediction task, respectively. A comparison of the metrics for these two tasks reveals that models with multimodal data extraction and fusion capabilities outperform the unimodal models across various evaluation metrics. The effectiveness of the feature extraction and fusion method directly influences the semantic richness of the exercise characterization vectors generated by the multimodal models, thereby impacting the performance of downstream tasks.
As shown in
Figure 6, the visualization experiment selects 300 exercises encompassing five different knowledge concepts. Through the exercise multimodal characterization model, which fuses heterogeneous exercise data (exercise knowledge concepts, exercise text, and exercise example images), the corresponding characterization vectors are obtained. These vectors are then visualized using T-SNE, a technique that extends the stochastic neighbor embedding algorithm to visualize high-dimensional data. It converts pairwise data similarities into joint probabilities and minimizes the Kullback-Leibler divergence between the joint probabilities of the high-dimensional data and those of the low-dimensional embedding, so that each data point is assigned a position in two- or three-dimensional space. The visualization results reveal that the multimodal characterization vectors of exercises sharing the same knowledge concept are closely grouped together, while exercises with different knowledge concepts are scattered across the visualization graph; some exercises with similar content intersect on the graph. These visualization outcomes demonstrate that the multimodal characterization model MIFM effectively extracts the corresponding features from the input heterogeneous data and produces multimodal exercise characterization vectors that highly preserve the semantic information of the original data.
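The similarity-to-joint-probability conversion that t-SNE performs on the high-dimensional vectors can be sketched in simplified form. A fixed Gaussian bandwidth is assumed here, whereas real t-SNE tunes a per-point bandwidth to match a target perplexity (and the actual figure was presumably produced with an off-the-shelf t-SNE implementation):

```python
import numpy as np

def joint_probabilities(X, sigma=1.0):
    """Build the symmetric joint-probability matrix P that t-SNE matches
    against the low-dimensional embedding's distribution.
    Simplification: one fixed bandwidth `sigma` for all points."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
    P = np.exp(-d2 / (2 * sigma ** 2))   # Gaussian affinities
    np.fill_diagonal(P, 0.0)             # no self-affinity
    P /= P.sum(axis=1, keepdims=True)    # conditional p(j|i), rows sum to 1
    return (P + P.T) / (2 * len(X))      # symmetrized joint distribution
```

The optimization stage (not shown) then places points in 2-D so that a Student-t based distribution over the embedded pairwise distances has minimal KL divergence from this P, which is what pulls same-concept exercises into tight clusters.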
The performance of the MIFM model and the selected comparison models across three distinct educational tasks, as well as the T-SNE vector visualization results, affirm the effectiveness of the proposed multimodal information fusion-based exercise characterization model, MIFM. The model excels in heterogeneous feature extraction and information fusion operations, highlighting the significance of fusing heterogeneous data (exercise knowledge concepts and exercise accompanying images) for expressive exercise vectors. The experimental findings indicate that image data in multimodal datasets also contains rich semantic information. By extracting image features and fusing them with text feature characterization, a global characterization vector representing multimodal data is obtained. The utilization of multimodal characterization vectors with fused heterogeneous information significantly enhances the performance of downstream tasks.
4.5. Time Analysis
To substantiate the efficiency of the proposed MIFM model in extracting features from multimodal data, the experiment conducts training under the experimental environment and model parameters described in the "Experimental Environment and Parameters" section. The experiment utilizes a training dataset comprising approximately 180,000 samples and a computing system equipped with 32 GB of RAM, an NVIDIA GeForce RTX 3060 graphics card, and an i5-12400F CPU, investing approximately 100 h in the training process to refine the MIFM model.
Subsequently, this model is applied to perform ten multimodal vectorization operations on a subset of 100 samples, each containing both text and image data. The total processing time, from input to output, across the ten runs is recorded at 16.7 s, an average execution time of 0.0167 s per sample for a single multimodal vectorization operation. The efficiency of the MIFM model owes much to the incorporation of the transformer architecture, particularly the multi-head attention mechanism, which computes attention weights for all positions independently and thus processes an entire sequence in a single pass. Furthermore, the dual-stream architecture proposed in this paper further amplifies the model's capacity to process multimodal data in parallel.
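A timing harness of the kind described can be sketched as follows; `vectorize` is a hypothetical stand-in for the MIFM forward pass, and a real measurement would also include a warm-up run to exclude one-time initialization costs:

```python
import time

def average_vectorization_time(vectorize, samples, runs=10):
    """Repeat the batch vectorization `runs` times and return the
    average wall-clock seconds per sample."""
    start = time.perf_counter()
    for _ in range(runs):
        for s in samples:
            vectorize(s)
    elapsed = time.perf_counter() - start
    return elapsed / (runs * len(samples))
```

With ten runs over 100 samples, dividing the total elapsed time by 1000 yields the per-sample figure reported above.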