Article

Evaluating Familiarity Ratings of Domain Concepts with Interpretable Machine Learning: A Comparative Study

School of Educational Information Technology, South China Normal University, No. 55 Western Zhongshan Avenue, Guangzhou 510631, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(23), 12818; https://doi.org/10.3390/app132312818
Submission received: 7 October 2023 / Revised: 21 November 2023 / Accepted: 27 November 2023 / Published: 29 November 2023

Abstract

Psycholinguistic properties such as concept familiarity and concreteness have been investigated in relation to technological innovations in teaching and learning. Owing to ongoing advances in semantic representation and machine learning technologies, the automatic extrapolation of lexical psycholinguistic properties has received increased attention across a number of disciplines in recent years. However, little attention has been paid to the reliable and interpretable assessment of familiarity ratings for domain concepts. To address this gap, we present a regression model grounded in advanced natural language processing and interpretable machine learning techniques that can predict domain concepts’ familiarity ratings based on their lexical features. Each domain concept is represented at both the orthographic–phonological level and the semantic level by means of pretrained word embedding models. We then compare the performance of six tree-based regression models (adaptive boosting, gradient boosting, extreme gradient boosting, a light gradient boosting machine, categorical boosting, and a random forest) on predicting domain concepts’ familiarity ratings. Experimental results show that categorical boosting, with the lowest MAPE (0.09) and the highest R2 value (0.02), is best suited to predicting domain concepts’ familiarity. The results also reveal the prospect of integrating tree-based regression models and interpretable machine learning techniques to expand psycholinguistic resources. Specifically, the findings show that the semantic information of raw words and the parts of speech in domain concepts are reliable indicators when predicting familiarity ratings. Our study underlines the importance of leveraging domain concepts’ familiarity ratings; future research should aim to improve familiarity extrapolation methods. Scholars should also investigate the correlation between students’ engagement in online discussions and their familiarity with domain concepts.

1. Introduction

Scholars and practitioners have begun to ponder ways to blend psycholinguistic theories with technological advances in education to enhance teaching and learning [1,2,3]. Linguistic measures represent a core aspect of psycholinguistics and have been deemed useful for assessing teaching and learning outcomes. For example, the psycholinguistic features of words derived from online discussions can help characterize patterns of interaction [4]. These attributes can also convey students’ cognitive states [5] and reflect students’ learning engagement [6]. Psycholinguistic theories hold word familiarity as instrumental to word recognition. Such familiarity is especially important in educational settings, such as language learning and e-learning: it can inform students’ cognitive patterns while reading [7] and acquiring word meanings [8]. However, familiarity is conceptually ambiguous [9]. Academic efforts to use word familiarity to discern students’ challenges with cognitive processing have often relied on predefined databases. Building an appropriately sized word familiarity corpus is time consuming and calls for expertise. This type of database also has yet to consistently reveal students’ cognitive characteristics in language learning. An efficient, reliable means of measuring word familiarity is still needed to gain robust insight into language learners’ cognitive abilities [10].
Considerable research has addressed word familiarity ratings, usually via human empirical judgment and automatic extrapolation. Public databases of word familiarity norms are available in multiple languages [11,12,13]. Words’ familiarity ratings are normally determined in psycholinguistics via expert judgment. For example, Stadthagen-Gonzalez and Davis provided these ratings for 3394 English single words [10]. Juhasz et al. developed a database containing 629 English compound words with such ratings [11]. Similarly, Liu et al. amassed familiarity ratings for 2423 single-character simplified Chinese words [12]; Su et al. described 1300 native Mandarin Chinese speakers’ familiarity norms regarding 20,275 two-character, 1231 three-character, and 2819 four-character simplified Chinese words [9]. However, this compilation task is time intensive and contingent on raters’ experience; these aspects greatly limit its applicability to teaching and learning. Researchers have thus recently sought to extrapolate lexical psycholinguistic properties using advanced word representation and machine learning technologies. The automated extrapolation paradigm takes lexical knowledge or semantic information as word representation features. A mathematical method is then adopted to estimate the associations between these features and words’ psycholinguistic property ratings. Estimation can involve either similarity-based or regression-based models. Lakhzoum et al. reported French semantic similarity norms for 630 word pairs with varying levels of similarity and associated abstractness [13]. When designing a similarity-based model to identify words’ familiarity scores in relation to their nearest neighbors, each word can be projected onto a high-dimensional semantic space based on contextual counts in a given corpus [14,15]. Likewise, when evaluating word representation with lexical and semantic knowledge, regression models can be constructed from part of a psycholinguistic database based on human judgment ratings; the remainder of the database can then serve as a test set for these models [16,17,18,19]. However, no approaches exist to reliably assess familiarity ratings for domain concepts, which involve compound words or phrases and include rich semantic information. Prior work also has not fully considered regression models’ interpretability when predicting domain concept familiarity. Advanced machine learning technologies can unearth the details of domain concepts’ familiarity ratings, and the associated findings will reinforce the trustworthiness of familiarity rating prediction.
We focused on domain concept familiarity in this study, referring to the understanding of domain-specific (or domain-related) terms. This type of familiarity indicates concepts’ complexity. Students must possess a clear understanding, or even mastery, of domain concepts to use them well while communicating. Students’ familiarity with these concepts partly captures their learning engagement and effectiveness. We employed interpretable machine learning techniques to predict familiarity ratings for domain concepts. The findings offer quantitative insight into domain concept familiarity. In brief, our study was rooted in two aims: (1) to establish a robust regression model that can predict domain concepts’ familiarity scores based on lexical features; and (2) to investigate which lexical factors contribute to familiarity prediction in light of explainable artificial intelligence. Three research questions (RQs) guided this effort. The results offer implications for Chinese lexical processing. Our findings can also aid instructors and designers in improving students’ learning experiences and evaluating academic performance.
  • RQ1: What are the characteristics of lexical features within domain concepts?
  • RQ2: Can lexical features derived from domain concepts reliably predict accompanying familiarity ratings?
    • RQ2(a): Which tree-based regression model is ideal for predicting familiarity ratings?
    • RQ2(b): How accurately does the best regression model predict familiarity ratings?
  • RQ3: What are the most predictive lexical features from an interpretable machine learning perspective? How do lexical features affect the prediction of domain concepts’ familiarity ratings?

2. Materials and Methods

2.1. Overall Framework

We employed a supervised machine learning approach to fit a series of tree-based ensemble regression models to a set of domain concepts. This approach enabled familiarity prediction based on lexical features. Compared with other types of regression, tree-based ensemble models allow for more rapid computation, are appropriate for tabular data, and yield results that are easy to explain [20,21,22]. Tree-based ensemble models are also insensitive to missing values, which mattered here because out-of-vocabulary issues in the pretrained word embedding models occasionally prevented a domain concept from being vectorized. Figure 1 depicts the flow of the research process in response to the research questions.
Initially, a collection of domain concepts was manually obtained from an online course and assembled into raw datasets. We then used advanced natural language processing (NLP) techniques to generate lexical features for the collected domain concepts and employed domain experts to determine these concepts’ familiarity ratings. An experimental dataset for predicting domain concepts’ familiarity was thereby established. To address RQ1, we examined the lexical features’ distribution characteristics and computed Pearson’s correlation coefficient (r) to test the features’ independence. With regard to RQ2, we evaluated how well our ensemble regression models predicted familiarity ratings in three steps. First, the experimental dataset was partitioned into a training dataset and a testing dataset in a ratio of 1:2. Second, to identify the best regression model, we tested six tree-based ensemble regression models [23] on the training dataset: adaptive boosting (AdaBoost), gradient boosting (GBoost), extreme gradient boosting (XGBoost), a light gradient boosting machine (LightGBM), categorical boosting (CatBoost), and a random forest (RF). We specifically used ten-fold cross validation with randomized hyperparameter search to assess the candidate models and tune their hyperparameters for model fitting and evaluation. We adopted common evaluation metrics [24] to compare these models’ performance. Third, based on the hyperparameter settings that led to the highest average validation scores, we retrained the optimal regression model on the entire training dataset. Prediction outcomes were also collected for the test dataset to verify the chosen model’s performance. To improve the results’ reliability, we calculated Pearson’s correlation coefficient (r) between domain concepts’ predicted and human-judged familiarity ratings; this coefficient signals the accuracy of the predicted ratings [18]. To answer RQ3, the Shapley additive explanation (SHAP) method [25] was leveraged to illustrate lexical features’ contributions to model predictions and lend interpretability to the best regression model’s familiarity predictions.
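As a concrete illustration of this model-comparison step, the sketch below pairs ten-fold cross validation with randomized hyperparameter search over the six candidate regressors. The package choices (scikit-learn, xgboost, lightgbm, catboost), the small search grids, and the synthetic stand-in data are our assumptions; the paper itself reports using the PyCaret package (see Section 2.6).

```python
# A minimal sketch, assuming scikit-learn-compatible implementations of the
# six candidate regressors; search grids and stand-in data are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

rng = np.random.default_rng(42)
X = rng.random((1832, 12))             # stand-in for the 12 lexical features
y = rng.uniform(1, 7, 1832)            # stand-in 7-point familiarity ratings

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2 / 3, random_state=42)  # 1:2 training:testing split

candidates = {
    "AdaBoost": (AdaBoostRegressor(), {"n_estimators": [50, 100]}),
    "GBoost":   (GradientBoostingRegressor(), {"learning_rate": [0.03, 0.1]}),
    "XGBoost":  (XGBRegressor(), {"max_depth": [3, 6]}),
    "LightGBM": (LGBMRegressor(), {"num_leaves": [15, 31]}),
    "CatBoost": (CatBoostRegressor(verbose=0), {"depth": [4, 6]}),
    "RF":       (RandomForestRegressor(), {"n_estimators": [100, 300]}),
}

for name, (model, grid) in candidates.items():
    # Ten-fold cross validation with randomized hyperparameter search,
    # scored by negative MAPE (one of the paper's evaluation metrics).
    search = RandomizedSearchCV(model, grid, cv=10, n_iter=2, random_state=42,
                                scoring="neg_mean_absolute_percentage_error")
    search.fit(X_train, y_train)
    print(f"{name}: cross-validated MAPE = {-search.best_score_:.3f}")
```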

2.2. Data Collection

Our study concerned a massive open online course (MOOC), “Artificial Intelligence in Education”. This course was hosted on www.icourse163.org (accessed on 7 September 2023), a popular MOOC platform among Chinese universities; the website represents one of the largest online course providers in China. Three domain experts randomly chose 1832 concepts related to artificial intelligence and educational technology. These data were stored in Excel for analysis.
We took domain concepts’ familiarity scores as dependent variables. To develop a corpus, two undergraduates studying educational technology were instructed to indicate how familiar they found each concept. The instructions were adapted from psycholinguistic norms for simplified Chinese words [9]. As the examples in Table 1 illustrate, domain concepts were evaluated on a seven-point Likert scale, with 1 = very unfamiliar and 7 = very familiar. A rating of “very unfamiliar” implies that a concept is either unrecognizable or seldom encountered in daily life; a rating of “very familiar” suggests that a concept is common in everyday life. Before asking these two students to assign formal ratings, we provided them with 50 examples to ensure they understood how to determine domain concepts’ familiarity. Each participant rated the concepts independently and without a time limit. Each concept’s score was finalized by taking the harmonic mean of the two students’ ratings, as sketched below.
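As a small worked example of this aggregation (the two ratings below are hypothetical), the final score is the harmonic mean of the raters’ judgments:

```python
# Final familiarity score = harmonic mean of the two raters' 7-point judgments.
from scipy.stats import hmean

rater_a, rater_b = 6, 4              # hypothetical ratings for one concept
print(hmean([rater_a, rater_b]))     # 4.8
```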
Table 2 presents descriptive statistics for the full sample.

2.3. Lexical Features

Domain concept representation in memory involves two main types of lexical knowledge: (1) representations of word forms, i.e., the orthographic and phonological codes used when writing or speaking; and (2) concepts’ meanings, i.e., semantic representations within a given context. The following six types of lexical features, representing domain concepts at both the orthographic–phonological level and the semantic level, were included in our regression analysis.
  • Concept length (CL): A target concept’s length is identical to its number of characters. Word length is closely correlated with psycholinguistic properties such as word familiarity, age of acquisition, and concreteness [26,27,28].
  • Stroke number (SN): Strokes are crucial in identifying Chinese characters, as each character comprises specific stroke patterns. Extensive research has explored character recognition by varying Chinese characters’ number of strokes [29,30,31]. We speculated that Chinese words or phrases with a larger number of strokes would be more challenging and time intensive to learn than words with a smaller number of strokes.
  • Frequency of use (FOU): Frequency is a known proxy for familiarity ratings and can be easily estimated using a corpus [32]. An adequate database increases the correlation between words’ log-frequencies and familiarity [33]. We presumed that domain concepts’ log-frequency and familiarity ratings would be strongly related.
  • Chinese phonetic alphabet (CPA): The Chinese phonetic alphabet, known as Pinyin, is a system meant to facilitate characters’ pronunciation. Historically, Chinese characters’ ideographic functions far outweighed their phonetic ones. As such, a word’s pronunciation is not always evident from its corresponding character pattern. Little empirical evidence has supported a direct correlation between Pinyin and word familiarity [9,34]. Nevertheless, Chinese characters’ phonetic information enhances word recognition [34]. Pinyin helps readers say unfamiliar words correctly as they extract phonological information for pronunciation [35,36].
  • Part of speech (POS): A word’s part-of-speech tag signifies the word’s membership in a syntactic category (e.g., noun, verb, adjective). Similar to Pinyin, POS directly influences word learning [37,38]. For example, nouns are generally characterized by their concreteness, imageability, and meaningfulness; these words are often easier to learn than verbs [39].
  • Semantic information (SI): When using a semantic database such as WordNet or a pretrained word embedding model, semantic information possesses predictive power with respect to familiarity ratings [17,18,40]. We hypothesized that, the closer the semantic distance between two domain concepts, the more likely the concepts were to have similar familiarity ratings.
We developed a feature extraction program using state-of-the-art NLP techniques to derive lexical features from raw domain concepts. The original domain concepts were compound words. We transformed them into a list of Chinese characters and tokenized the concepts into a set of words with part-of-speech tags using the jieba Chinese processing toolkit. Each character’s number of strokes per domain concept was obtained from a stroke dictionary. Each word’s frequency and Pinyin expression within a given domain concept were then successively identified with the jieba toolkit and the xpinyin package. Domain concepts’ SNs and frequency were, respectively, determined by taking the harmonic mean of each character’s SN and associated word frequency.
Every raw word within the domain concepts, along with its Pinyin and POS, was converted to a vectorized representation using pretrained embedding models. These models spanned 300-dimension word embeddings, 100-dimension Pinyin embeddings, and 30-dimension part-of-speech embeddings; each was built with skip-gram negative sampling on the Chinese Wikipedia corpus. The vector of a domain concept was computed as the average vector of its words. Word embeddings can be manipulated to represent context-dependent meanings [41,42]. To reduce redundant information within the vectorized representation, we conducted principal component analysis to project domain concepts’ word vectors, Pinyin vectors, and part-of-speech vectors onto three-dimensional spaces. This approach retains human judgment about domain concepts’ familiarity ratings, in line with the suggestion by Richie et al. [42] that the representational space required for predicting human judgments within a specific domain is much sparser than the space provided by pretrained word embeddings. CPA, POS, and SI were each reduced to three principal components: (CPA_C1, CPA_C2, CPA_C3), (POS_C1, POS_C2, POS_C3), and (SI_C1, SI_C2, SI_C3). Each domain concept was therefore transformed into a bag of twelve lexical features (i.e., CL, SN, FOU, CPA_C1, CPA_C2, CPA_C3, POS_C1, POS_C2, POS_C3, SI_C1, SI_C2, and SI_C3) and used for familiarity prediction, as sketched below.
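The sketch below outlines this extraction pipeline. The jieba, xpinyin, scipy, and scikit-learn calls are real APIs; the stroke dictionary, the embedding lookup table, and all sample values are hypothetical placeholders, not the paper’s actual resources.

```python
# A hedged sketch of the lexical feature extraction (Section 2.3).
import numpy as np
import jieba.posseg as pseg
from xpinyin import Pinyin
from scipy.stats import hmean
from sklearn.decomposition import PCA

pinyin = Pinyin()
stroke_dict = {"人": 2, "工": 3, "智": 12, "能": 10}  # hypothetical stroke counts
word_vectors = {}                                     # hypothetical word -> 300-dim vector

def lexical_features(concept):
    tokens = [(t.word, t.flag) for t in pseg.cut(concept)]  # words with POS tags
    cl = len(concept)                                       # CL: character count
    sn = hmean([stroke_dict[ch] for ch in concept])         # SN: harmonic-mean strokes
    py = pinyin.get_pinyin(concept)                         # CPA: Pinyin string for embedding lookup
    # SI: concept vector = average of its word vectors (300 dimensions).
    si = np.mean([word_vectors[w] for w, _ in tokens], axis=0)
    return cl, sn, py, tokens, si

# PCA then reduces each embedding family (SI, CPA, POS) to three components;
# si_matrix stands in for the stacked 300-dim vectors of all 1832 concepts.
si_matrix = np.zeros((1832, 300))
si_components = PCA(n_components=3).fit_transform(si_matrix)
```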

2.4. Model Selection

2.4.1. Adaptive Boosting

Freund and Schapire [43] introduced the AdaBoost method, which combines weak learners in a weighted manner to enhance the final prediction’s overall accuracy. Weak learners are trained using weak learning algorithms (e.g., decision trees) that only need to show a slight improvement over random guessing. Different weak learners, trained on distinct data sets, serve as joint decision-making members to reach a final prediction. A key characteristic of AdaBoost is its ability to modify the weights assigned to training samples to train weak learners effectively. Incorrectly predicted samples are given higher weights, resulting in increased attention. These revised samples are subsequently used to train a new weak learner. A group of weak learners is assembled after several iterations. By increasing the weights of weak learners with accurate predictions, all trained weak learners eventually constitute a strong learner that can make highly accurate predictions.

2.4.2. Gradient Boosting

GBoost, in which multiple weak learners are iteratively constructed using the gradient descent technique, seeks to reduce the residual error left by the previous learner [44]. GBoost initializes a base predictive model to fit the target values of training data. In each iteration, a new weak learner is built to capture the prior learner’s prediction errors. Progressive iterations minimize errors and refine model predictions. GBoost can include different base learners, typically decision trees (i.e., GBDTs). The method is particularly feasible for medium-sized data sets or tasks requiring high predictive performance in classification or regression. However, it is prone to overfitting and cannot handle high-dimensional sparse data well. Careful hyperparameter tuning and consideration of computational costs are thus necessary.

2.4.3. Extreme Gradient Boosting

XGBoost builds upon the GBoost algorithm but is tuned for greater prediction accuracy, better generalizability, and shorter training times [45]. Specifically, XGBoost expands the loss function to its second-order Taylor approximation, allowing gradient boosting tree models to approximate the actual loss more closely and thereby achieve better predictive accuracy. By adding regularization terms to the loss function, setting shrinkage rates in additive models, and using column subsampling techniques, it prevents model overfitting and improves generalizability. Model training is more efficient thanks to weighted quantile sketch and sparsity-aware algorithms, which also optimize caching and model parallelism. XGBoost has demonstrated impressive prediction performance and handles parallel processing and missing values well, although its parameter configuration is quite complex and requires intensive computational resources.

2.4.4. Light Gradient Boosting Machine

LightGBM is a highly efficient distributed gradient boosting framework [46]. It was created to address machine learning tasks featuring large-scale datasets and high-dimensional features, and it has shown outstanding performance in numerous machine learning competitions and real-world applications. Compared with GBoost and XGBoost, LightGBM boasts even swifter training and lower memory usage. These benefits are attributable to the histogram algorithm, the gradient-based one-side sampling technique, and the leaf-wise growth strategy. First, LightGBM uses a histogram algorithm that buckets continuous feature values into discrete bins, accelerating split finding when training gradient boosting decision trees. Second, the LightGBM algorithm takes advantage of gradient-based one-side sampling to select data points for each subset. This strategy prioritizes data points with higher gradients and discards those with lower gradients; it effectively decreases the amount of training data while preserving model accuracy, thus expediting training. Lastly, leaf-wise growth minimizes the loss function and consequently enhances model accuracy: in each step, the leaf node with the highest gain is chosen from all leaf nodes and split, until the specified number of leaves is reached.

2.4.5. Categorical Boosting

CatBoost, a modified GBDT algorithm proposed by Dorogush et al. [47], is renowned for its comparatively low parameter count, robust handling of categorical data, and superior predictive accuracy. Compared with GBDT-based algorithms such as XGBoost and LightGBM, CatBoost more efficiently solves statistical learning problems. A major strength of CatBoost lies in its processing of categorical features. XGBoost and LightGBM normally require these features to be one-hot encoded or processed via alternative methods, producing high-dimensional feature representations and high computational costs. CatBoost addresses raw categorical features without encoding, as the sketch below illustrates. Avoiding the dilemmas associated with traditional methods (e.g., one-hot encoding and feature interactions) conserves memory and computational resources while boosting training speed. CatBoost also uses the symmetric tree growth algorithm to optimize trees’ structure. Conventional GBDT algorithms apply depth-first growth strategies, which can leave some branch nodes with small sample sizes and thereby hurt model generalization. To mitigate this issue, the symmetric tree growth strategy regards each node as a binary tree structure and employs a local greedy algorithm to determine thresholds for node splits. This approach alleviates the small-sample problem and enhances generalizability. In addition, CatBoost assuages concerns about gradient bias and prediction shift, so the risk of overfitting diminishes while the model’s accuracy and generalizability improve.
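To illustrate the point about raw categorical inputs, the toy sketch below passes an unencoded string column straight to CatBoost via the library’s cat_features argument; the data are hypothetical.

```python
# CatBoost consumes raw categorical features without one-hot encoding.
import pandas as pd
from catboost import CatBoostRegressor

df = pd.DataFrame({"pos": ["noun", "verb", "noun", "adj"],  # raw categorical column
                   "length": [4, 2, 6, 3]})
y = [5.5, 4.0, 3.2, 4.8]                                    # hypothetical targets

model = CatBoostRegressor(iterations=50, verbose=0)
model.fit(df, y, cat_features=["pos"])                      # no manual encoding needed
print(model.predict(df))
```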

2.4.6. Random Forest

RF is an effective ensemble learning method derived from the bagging approach [48]. The decision tree is the essential structural unit in RF: each tree is built on a random sample of data with replacement. Prediction outcomes are identified by aggregating multiple decision trees’ predictions. Each tree is independently trained with randomly selected data and features. Thus, RF tends to be robust to overfitting and is well-suited to large-scale and high-dimensional data. These inherent advantages have earned RF recognition in diverse fields involving classification, regression, feature selection, anomaly detection, and data compression.
In brief, the above-mentioned tree-based regression models fall into two common types of ensemble strategies: (1) boosting, including AdaBoost, GBoost, XGBoost, LightGBM, and CatBoost; and (2) bagging (e.g., RF). The essential idea behind boosting is to iteratively combine weak learners into a single strong predictor, taking the previous learners’ mistakes into account when training each new weak learner. Unlike boosting, bagging uses bootstrap sampling to train several weak learners separately and aggregates the learners’ outputs for the final prediction. Because the ensemble strategies differ, their final predictions differ accordingly. The final prediction of the boosting methods can be written as Equation (1), where $X$ is the training dataset, $M$ is the number of weak learners, $f_i^B(X)$ is the $i$-th weak learner’s prediction, and $\alpha_i$ is the weight of the $i$-th weak learner. The bagging methods yield predictions through Equation (2), where $N$ is the number of weak learners and $f_j^b(X)$ is the $j$-th weak learner’s prediction.
$\hat{F}_B(X) = \sum_{i=1}^{M} \alpha_i \, f_i^B(X)$ (1)
$\hat{F}_b(X) = \frac{1}{N} \sum_{j=1}^{N} f_j^b(X)$ (2)
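As a small numeric illustration of Equations (1) and (2), suppose three weak learners predict a concept’s familiarity (all values hypothetical):

```python
# Boosting combines weak predictions by weight (Eq. (1));
# bagging averages them (Eq. (2)).
import numpy as np

weak_preds = np.array([4.6, 5.1, 4.9])   # f_i(X): three weak learners' outputs
alphas = np.array([0.5, 0.3, 0.2])       # boosting weights alpha_i

print(np.sum(alphas * weak_preds))       # boosting: 4.81
print(np.mean(weak_preds))               # bagging:  ~4.87
```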

2.5. Performance Evaluation Metrics

We employed five popular metrics to compare the regression models’ effectiveness: the mean absolute percentage error (MAPE), mean squared error (MSE), mean absolute error (MAE), root mean squared error (RMSE), and R-squared (R2). Suppose $N$ samples are used to establish a regression model; the actual value of the $i$-th sample is denoted $y_i$, and its predicted value (per the regression model) is denoted $\hat{y}_i$. The chosen evaluation metrics are defined as follows:
  • MAPE is common in statistics and machine learning when evaluating the performance of a predictive model or estimator. As Equation (3) shows, MAPE quantifies the average size of forecasting errors, representing the magnitude of the error in percentage form. A lower MAPE value indicates relatively smaller relative errors and better model performance.
$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \times 100\%$ (3)
  • MSE (see Equation (4)) reflects the mean of the squared differences between projected and observed values. More weight is given to large errors, resulting in increased sensitivity to outliers.
$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ (4)
  • MAE (see Equation (5)) is used in regression analysis to measure the average magnitude of errors between predicted and actual values. It provides a straightforward method to assess a regression model’s prediction accuracy.
$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |\hat{y}_i - y_i|$ (5)
  • RMSE (see Equation (6)) quantifies the disparity between a model’s predicted and observed values. This metric is the square root of the average of the squared differences between predicted and actual values.
$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$ (6)
  • R2 (see Equation (7)) evaluates a regression model’s goodness of fit. This score typically ranges from 0 to 1, with a higher score demonstrating a better fit (i.e., the model explains a larger proportion of the variance in the dependent variable). In Equation (7), $\bar{y}$ is the mean of the actual values. Importantly, a high R2 score does not necessarily indicate that the model fits the data well; such a score may instead signal overfitting, in that the model captures noise in the data rather than underlying patterns. One must therefore consider other factors and use additional evaluation metrics when measuring models’ performance (a combined sketch of all five metrics follows the equations).
$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$ (7)
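A minimal numpy rendering of Equations (3)–(7), applied here to hypothetical predictions, makes the definitions concrete:

```python
# The five evaluation metrics from Equations (3)-(7).
import numpy as np

def regression_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mape = np.mean(np.abs(err / y_true)) * 100                # Eq. (3), percent
    mse = np.mean(err ** 2)                                   # Eq. (4)
    mae = np.mean(np.abs(err))                                # Eq. (5)
    rmse = np.sqrt(mse)                                       # Eq. (6)
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)  # Eq. (7)
    return mape, mse, mae, rmse, r2

print(regression_metrics([5.0, 4.0, 6.5], [4.8, 4.4, 6.1]))  # hypothetical ratings
```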

2.6. SHAP Analysis

We used the SHAP method to assign a value to each lexical feature; doing so illuminated each feature’s importance in the optimal regression model. The SHAP method is based on game theory and local explanations. Suppose that a domain concept, $C$, is represented as a collection of lexical features, $(x_1, x_2, x_3, \ldots, x_N)$, where $N$ is the number of lexical features. The explanation model $G(X')$ with simplified features $X' = (x'_1, x'_2, x'_3, \ldots, x'_N)$ for a prediction model $H(X)$ can be written as Equation (8), where $\phi_0$ represents a constant value when all lexical features are absent, $\phi_i$ is the Shapley value of the lexical feature $x_i$, and $x'_i$ is the simplified feature, which takes the value 1 when $x_i$ is present and 0 when it is absent. The Shapley value $\phi_i$ can be calculated as Equation (9), where $Z' \subseteq X'$ denotes all vectors whose non-zero entries are a subset of the non-zero entries in $X'$; $|Z'|$ is the number of non-zero entries in $Z'$; $f_x(Z')$ is the model’s prediction with the $i$-th lexical feature present; and $f_x(Z' \setminus i)$ is the prediction without it. The solution to Equation (8) proposed by Lundberg and Lee [25] is known as SHAP values, where $f_x(Z') = E[f_x(X) \mid X_S]$ and $S$ is the set of non-zero indices in $Z'$.
$H(X) = G(X') = \phi_0 + \sum_{i=1}^{N} \phi_i x'_i$ (8)
$\phi_i = \sum_{Z' \subseteq X'} \frac{|Z'|! \, (N - |Z'| - 1)!}{N!} \left[ f_x(Z') - f_x(Z' \setminus i) \right]$ (9)
Consequently, the SHAP method was devised to enhance model interpretability, which affirms users’ trust in model predictions. SHAP also allows for a firmer understanding of features’ roles in improving model performance. SHAP values signify features’ respective contributions to prediction outcomes. Put simply, a feature’s SHAP value indicates the expected change in the prediction relative to the base value. Figure 2 presents an example from our dataset. Compared with the base value (5.253), the lexical features in red pushed the model’s prediction upward, and the lexical features in blue pulled it downward. Overall, using this method, features’ importance can be ranked by averaging the magnitude of their SHAP values across all instances in a dataset. For implementation, we built a Python program with the SciPy package for statistical analysis and the PyCaret package for interpretable machine learning.
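The sketch below shows one standard route to such SHAP values for a tree ensemble, using the shap package’s TreeExplainer on a fitted CatBoost model; the stand-in data and the feature names are our assumptions (the paper reports using PyCaret).

```python
# A hedged SHAP sketch: attribute a CatBoost model's predictions to the 12
# lexical features; the data here are synthetic stand-ins.
import numpy as np
import shap
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
feature_names = ["CL", "SN", "FOU", "CPA_C1", "CPA_C2", "CPA_C3",
                 "POS_C1", "POS_C2", "POS_C3", "SI_C1", "SI_C2", "SI_C3"]
X = rng.random((200, 12))              # stand-in lexical features
y = rng.uniform(1, 7, 200)             # stand-in familiarity ratings

model = CatBoostRegressor(iterations=100, verbose=0).fit(X, y)
explainer = shap.TreeExplainer(model)      # game-theoretic attribution for trees
shap_values = explainer.shap_values(X)     # one value per feature per concept

# Global importance ranking (mean |SHAP| per feature) and summary plot.
shap.summary_plot(shap_values, X, feature_names=feature_names)
```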

3. Results

3.1. Characteristics of Lexical Features within Domain Concepts

Table 3 lists descriptive statistics for the lexical features in our dataset. Domain concepts’ CL and SN values ranged from 1 to 15 and from 1 to 12.48, respectively, indicating large differences in word forms. Domain concepts’ FOU values were transformed into log frequencies ranging from 0 to 4.51. The mean values of the three principal components in SI, POS, and CPA were roughly equal to 0, as all of these values were standardized during principal component analysis. Results of the Kolmogorov–Smirnov (K–S) test further suggested that the lexical features’ values did not conform to a normal distribution (i.e., all p values were less than 0.05); an example of this check appears below.
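For reference, the normality check on a single feature can be run as follows; this is a common, if approximate, approach (scipy’s kstest against a standard normal after z-scoring), and the sample values are stand-ins.

```python
# K-S normality test for one lexical feature; p < 0.05 rejects normality.
import numpy as np
from scipy.stats import kstest, zscore

values = np.random.default_rng(3).exponential(2.0, 500)  # stand-in feature values
stat, p = kstest(zscore(values), "norm")
print(stat, p)
```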
Feature correlations are a key aspect of data modeling, including familiarity prediction in our case. Highly correlated features can obscure each feature’s standalone importance, so multicollinearity tests are usually performed for clarity. Figure 3 displays the results of a multicollinearity test on the lexical features. The features’ correlations and clustering were determined based on Pearson correlation coefficients and dendrogram clustering; the dendrogram groups more similar lexical features together, revealing patterns that pairwise correlation analysis alone might miss (a sketch of this check follows). The appropriate number of clusters depends on the dendrogram’s hierarchy. We accordingly grouped the lexical features into three clusters: (POS_C3, CL, SN, SI_C2, POS_C2, and CPA_C2), (FOU, SI_C1, and CPA_C1), and (SI_C3, POS_C1, and CPA_C3). Although these clusters may not definitively distinguish the feature groups, they do offer clues about strong relationships among some features (e.g., the association between CL and SN; the semantic associations among POS, CPA, and SI). We observed the strongest association between SI_C1 and CPA_C1, with a coefficient of 0.726. Multicollinearity did not appear to be an issue in this study.
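A brief sketch of this check pairs a Pearson correlation matrix with average-linkage dendrogram clustering on the correlation distance; the data are stand-ins, and the feature set is truncated for brevity.

```python
# Pearson correlations plus hierarchical clustering of lexical features.
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
features = pd.DataFrame(rng.random((200, 4)), columns=["CL", "SN", "FOU", "SI_C1"])

corr = features.corr(method="pearson")     # pairwise Pearson r
dist = 1 - corr.abs()                      # correlation distance
linkage = hierarchy.linkage(squareform(dist.values, checks=False), method="average")
tree = hierarchy.dendrogram(linkage, labels=list(corr.columns), no_plot=True)
print(tree["ivl"])                         # leaf order hints at feature clusters
```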

3.2. Predictive Performance of Regression Models

Table 4 presents the candidate regression models’ familiarity predictions on the training dataset under ten-fold cross validation with 75% training samples and 25% holdout samples. All methods returned promising results, as evidenced by MAPE values consistently lower than 0.13. As expected, the CatBoost method showed the best performance, with the lowest average MAPE (0.09) and the highest average R2 value (0.02), highlighted in bold in Table 4. Notably, several approaches, such as AdaBoost and LightGBM, produced negative R2 values. Negative R2 scores may result from parameter constraints that cause the sum of squared errors to exceed the sum of squared deviations of the dependent variable from its mean [49,50]. For an in-depth performance analysis on the training samples, the residual distributions between the predicted and actual values for the six tree-based ensemble regression models are illustrated in Figure 4. In the cross-validation procedure, the six models’ R2 values on the training samples were much higher than those on the holdout samples. The CatBoost method showed acceptable reliability and robustness, as indicated by its R2 values of 0.612 on the training samples and 0.112 on the holdout samples. Its residuals tended to follow a normal distribution, as shown on the right side of Figure 4a, albeit with a few high residual points. This observation indicates that the CatBoost method delivered strong predictive performance in estimating familiarity ratings for domain concepts, except in several cases of extremely high familiarity ratings. The AdaBoost method produced lower R2 values on both the training and holdout samples, suggesting suboptimal fitting. Both the LightGBM and XGBoost methods had R2 values greater than 0.8 on the training samples but less than 0.1 on the holdout samples, indicating overfitting from excessive learning of noise and details in the training samples and thus poor generalization to the holdout samples. The CatBoost method was therefore best suited to predicting domain concepts’ familiarity.
In addition, Table 5 summarizes the CatBoost method’s performance on the testing dataset, reinforcing the model’s robustness. The correlation between predicted and human-judged ratings was significant (r = 0.36, p < 0.001), implying that the CatBoost method can accurately predict domain concepts’ familiarity ratings based on lexical features; this check reduces to a single call, as sketched below.
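The correlation check itself is a one-liner with scipy, assuming arrays of predicted and human-judged ratings (hypothetical here):

```python
# Pearson's r between predicted and human-judged familiarity ratings.
from scipy.stats import pearsonr

human = [5.2, 3.1, 6.0, 4.4, 2.8]        # hypothetical human-judged ratings
predicted = [4.9, 3.5, 5.6, 4.1, 3.2]    # hypothetical model predictions
r, p = pearsonr(human, predicted)
print(f"r = {r:.2f}, p = {p:.3f}")
```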

3.3. Feature Importance Based on SHAP Values

We conducted a SHAP analysis for the CatBoost model to enhance its interpretability and ascertain features’ contributions to the prediction process. Figure 5 reveals the range and distribution of lexical features’ impacts on the familiarity rating predictions for domain concepts. Each data point in Figure 5 represents a SHAP value assigned to a concept and its corresponding features within our dataset. The vertical axis indicates lexical features’ importance in descending order; the horizontal axis shows features’ degree of impact on familiarity ratings. SI_C1, CL, and SN were the three most meaningful variables for prediction. Regarding SI_C1, the data points in dark blue were primarily in the negative area; that is, a lower value for the first principal component of semantic information in domain concepts corresponded to a lower SHAP value in the prediction model. Taking into account the multiple correlation (R2 = 0.067, F-statistic = 44.68, p < 0.001) between domain concepts’ familiarity and SI (i.e., SI_C1, SI_C2, and SI_C3), we can infer that domain concepts carrying little semantic information are perceived as less familiar. This outcome aligns with our expectation that richer concept-related semantic information (i.e., derived from a vast knowledge base) affords individuals more opportunities to understand a concept’s usage and differentiate it from other concepts. In contrast, the yellow data points for CL and SN were distributed in the negative SHAP region, indicating that domain concepts with more characters and strokes tend to be less familiar to students. This observation substantiates the negative effects of word length and stroke number on students’ lexical processing and word recognition [31,51]. Familiar words show weaker word-length and stroke-number effects than unfamiliar words because familiarity is more readily apparent in whole-word processing [52].
To further explore how lexical features and their interactions quantitatively affect the prediction of domain concept familiarity, we performed SHAP dependency analyses on the three most important lexical features (Figure 6); the sketch after this paragraph shows how such plots can be generated. Setting aside the colors in Figure 6, the dependency plots show how the SHAP values of SI_C1, CL, and SN vary. Figure 6a reveals that SI_C1, across various values of SI_C2, had a positive impact on familiarity prediction when SI_C1 was greater than zero. Interestingly, SI_C2 was complementary to SI_C1 when SI_C1 was below zero. According to the point distribution in the CL and SI_C1 dependency plot in Figure 6b, a CL value greater than three may compromise familiarity prediction regardless of shifts in SI_C1; no discernible interaction pattern emerged between CL and SI_C1 in terms of SHAP values. Relatedly, as pictured in Figure 6c, an SN value beyond five may adversely affect familiarity prediction, although rising values of SI_C2 might alleviate this effect to some degree.
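Dependence plots like those in Figure 6 can be produced with the shap package, continuing the sketch from Section 2.6; the feature names and interaction pairings follow the analysis above.

```python
# SHAP dependence plots for the three most important features, coloured by a
# second feature to expose interactions (reusing shap_values, X, and
# feature_names from the Section 2.6 sketch).
shap.dependence_plot("SI_C1", shap_values, X, feature_names=feature_names,
                     interaction_index="SI_C2")
shap.dependence_plot("CL", shap_values, X, feature_names=feature_names,
                     interaction_index="SI_C1")
shap.dependence_plot("SN", shap_values, X, feature_names=feature_names,
                     interaction_index="SI_C2")
```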

4. Discussion

We set out to establish a robust and interpretable regression model to predict domain concepts’ familiarity ratings based on these concepts’ lexical features. We applied advanced NLP techniques to identify six types of lexical features and compared six tree-based regression methods to predict familiarity ratings. SHAP analysis uncovered lexical features’ contributions to the predicted ratings. To the best of our knowledge, this study is the first to automatically extrapolate psycholinguistic properties of domain concepts with a focus on interpretability and responsibility.

4.1. Domain Concepts’ Lexical Characteristics

The inherent structure of lexical features influences how people perceive domain concepts’ familiarity. Principal component analysis and multicollinearity testing unearthed two intriguing aspects of these concepts’ lexical characteristics. First, POS values did not follow a normal distribution; domain concepts thus seemed to favor certain parts of speech. For instance, in our dataset, “differentiating instruction” and “multiple assessment methods” represent adjective–noun and adjective–verb–noun patterns, respectively. Van Valin observed that grammatical patterns are informed by prescriptive grammatical rules as well as by distinctive language usage and context [53]. This claim supports our finding that understanding language usage preferences for domain concepts enables better cognitive processing of these concepts. Second, multicollinearity testing revealed a few interactions among lexical features; however, no critical multicollinearity problem hindered model performance. The strongest correlation, between SI_C1 and CPA_C1, implied an interplay between the semantic information processing of domain concepts and their phonological rules. This outcome mirrors the belief that phonological information from Chinese characters enhances word recognition [34]. Furthermore, the observed association between CL and SN suggests that these features jointly indicate domain concepts’ morphological complexity. The semantic relationships among POS, CPA, and SI likewise open avenues through which researchers can further explore these concepts’ cognitive processing.

4.2. Familiarity Rating Prediction of Domain Concepts with Tree-Based Regression Models

Various approaches exist to address familiarity ratings, including expert ratings, corpus-based statistics, and regression-based prediction. However, using expert ratings and constructing a corpus for domain concept familiarity require substantial effort and resources; a way to automatically evaluate domain concept familiarity is still needed. Accordingly, our findings demonstrate the effectiveness of using domain concepts’ lexical features to predict familiarity ratings. Six tree-based regression models were implemented here to estimate domain concept familiarity. The models’ R2 values were consistently low. However, a low R2 value does not always imply a poor model fit: these values can be quite low for completely linear and noticeably non-linear models alike [54]. Studies showing inadequate model fits frequently ascribe these results to insufficient model complexity (i.e., too little to capture complex relationships within data), limited samples, or noise [55,56]. We identified CatBoost as optimal for predicting domain concept familiarity after comparing the six models’ results: with the lowest MAPE and the highest R2 value, CatBoost most accurately estimated the target variable, likely owing to its symmetric tree growth strategy [47]. The encouraging experimental results strengthen the case for bringing pretrained word embeddings to the automatic extrapolation of familiarity ratings, which is conducive to expanding psycholinguistic resources. This finding is in line with a great deal of previous work on familiarity rating prediction [14,15,16,17,18]. For example, Köper and Im Walde used a supervised learning algorithm with word2vec-based representations to generate affective norms for 350,000 German lemmas [15]; Paetzold and Specia [17] used pretrained word embedding models to predict the familiarity ratings of 85,942 words in the MRC (Machine Readable Dictionary) database. Furthermore, we chose skip-gram word vectors to capture the semantic information within domain concepts. Our experimental results support the view that the accuracy of extrapolating psycholinguistic ratings using different word vector representations can be informative about the representations themselves [57]. Opportunities thus exist for fellow scholars to explore the effects of different types of word embedding models on domain concepts’ familiarity ratings. Other research could extend our approach to investigate individual differences in the perception of domain concepts’ familiarity within educational contexts, such as online collaborative learning [1] and language learning [3]. In addition, the performance of familiarity rating models can likely improve via several means: gathering additional data, using advanced feature engineering techniques, and applying optimization algorithms to fine-tune model parameters. For example, Yang et al. increased rating-task accuracy by integrating CatBoost and the sparrow search algorithm [58]; their work offers valuable insight for enhancing CatBoost’s performance.

4.3. Impacts of Lexical Features on Familiarity Rating Prediction

Our SHAP analysis paints an informative picture of concept familiarity rating prediction. More precisely, the scatter point distributions in our SHAP summary plots shed light on our preferred model for familiarity prediction, and the SHAP dependence plots portrayed lexical features’ joint effects. This approach can unveil underlying interactions or patterns (e.g., among SN, SI, word forms, and the semantic meanings of domain concepts). These findings add to the growing consensus that raw words’ semantic information and parts of speech in domain concepts are reliable predictors and regression features in pretrained embedding models [17,59]. Studies have demonstrated the pronounced predictive utility of frequency for familiarity extrapolation [33,39]; however, the lexical features in our study did not conform to this stereotype. Rather, our results indicate that it is imperative to emphasize SN and CL when predicting familiarity ratings in the Chinese context instead of solely considering usage frequency. Indeed, recent studies have shown that word length and SN each influence students’ reading proficiency [60,61]. Moreover, language instructors [3], learning analysts [4,31], and neuroscientists [40] can draw on our study for the interpretability of lexical features in concepts’ familiarity ratings. Our results suggest that interpretable machine learning allows for a more detailed picture of lexical features’ contributions to concepts’ familiarity ratings. For instance, neuroscientists can use interpretable machine learning to examine the interactions between linguistic contexts and semantic features in concept understanding, simulating how these two types of meaning representation are orchestrated in the human brain to support rich conceptual representation [40]. In teaching practice, the extent to which students are familiar with domain concepts reflects the engagement and effectiveness of their learning [5,6]. This finding can thus guide instructional practitioners in better understanding students’ cognitive states and in selecting learning materials that help students better understand domain concepts.

5. Conclusions and Future Work

To conclude, this study provided meaningful insight into domain concepts’ familiarity ratings. Understanding these ratings is valuable in three regards. First, we have underlined the importance of focusing on domain concepts’ familiarity ratings to pinpoint helpful linguistic predictors for assessing students’ cognitive engagement during language learning or online discussions. For instance, a positive relationship has been observed between concept familiarity and high-quality argumentation, especially in students’ epistemic and social interactions [62]. Second, our work opens doors to applying tree-based regression models to expand psycholinguistic resources. Administering and scoring experimental tasks in linguistics are often prohibitively expensive in cognitive and educational studies unless substantial psycholinguistic resources are available [8,63]. If regression models can reliably approximate human raters, then psycholinguistic properties can be estimated under various educational settings. Word familiarity has been widely acknowledged as a significant predictor of word difficulty in word learning [64]: simpler and more common words promote online discussion and participation [65]. Finally, our SHAP analysis strengthened the explanatory power of our familiarity predictions. Identifying the factors that determine these concepts’ familiarity ratings is essential to holistically grasping lexical features’ importance. Such recognition can facilitate assessments of students’ learning engagement in educational settings, such as during language instruction and online discussions.
Two limitations of this research stand to be addressed. First, our dataset was rated by a pair of undergraduate science majors, which tempers the findings’ generalizability; caution is warranted when extending them to a larger population. Students from other disciplines, such as art, should be recruited in the future to obtain domain concepts’ familiarity ratings. We posit that optimizing data collection would enhance extrapolation methods’ performance. Second, our study was carried out in a monolingual context. A language-independent approach to estimating domain concepts’ psycholinguistic property values remains important [18]: such a method can foster the productive use of psycholinguistic data in multilingual environments [13]. Our subsequent work will seek to further refine familiarity extrapolation methods, namely by expanding our dataset with more diverse familiarity ratings for domain concepts. Additional research should also examine the relationship between students’ engagement in online discussions and their familiarity with domain concepts. Finally, scholars should investigate which lexical features of domain concepts shape students’ engagement.

Author Contributions

Conceptualization, J.H. and Y.Z.; Data curation, X.W. and L.L.; Formal analysis, L.L. and Y.Z.; Funding acquisition, J.H.; Investigation, J.W., C.H. and M.L.; Methodology, J.H.; Project administration, Y.Z.; Resources, J.W., C.H. and M.L.; Software, J.H.; Supervision, J.H.; Validation, X.W. and Y.Z.; Visualization, J.H.; Writing—original draft, J.H. and X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China: 62207015, and the Humanities and Social Sciences Youth Foundation of the Chinese Ministry of Education: 22YJC880021.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Schallert, D.L.; Sanders, A.J.Z.; Williams, K.M.; Seo, E.; Yu, L.-T.; Vogler, J.S.; Song, K.; Williamson, Z.H.; Knox, M.C. Does It Matter If the Teacher Is There?: A Teacher’s Contribution to Emerging Patterns of Interactions in Online Classroom Discussions. Comput. Educ. 2015, 82, 315–328. [Google Scholar]
  2. Yang, X.; Kuo, L.-J.; Ji, X.; McTigue, E. A Critical Examination of the Relationship among Research, Theory, and Practice: Technology and Reading Instruction. Comput. Educ. 2018, 125, 62–73. [Google Scholar] [CrossRef]
  3. Li, R. Investigating Effects of Computer-Mediated Feedback on L2 Vocabulary Learning. Comput. Educ. 2023, 198, 104763. [Google Scholar] [CrossRef]
  4. Vercellone-Smith, P.; Jablokow, K.; Friedel, C. Characterizing Communication Networks in a Web-Based Classroom: Cognitive Styles and Linguistic Behavior of Self-Organizing Groups in Online Discussions. Comput. Educ. 2012, 59, 222–235. [Google Scholar] [CrossRef]
  5. Almatrafi, O.; Johri, A.; Rangwala, H. Needle in a Haystack: Identifying Learner Posts That Require Urgent Response in MOOC Discussion Forums. Comput. Educ. 2018, 118, 1–9. [Google Scholar] [CrossRef]
  6. Xing, W.; Gao, F. Exploring the Relationship between Online Discourse and Commitment in Twitter Professional Learning Communities. Comput. Educ. 2018, 126, 388–398. [Google Scholar] [CrossRef]
  7. Aghababian, V.; Nazir, T.A. Developing Normal Reading Skills: Aspects of the Visual Processes Underlying Word Recognition. J. Exp. Child Psychol. 2000, 76, 123–150. [Google Scholar] [CrossRef]
  8. Neveu, A.; Kaushanskaya, M. Paired-Associate versus Cross-Situational: How Do Verbal Working Memory and Word Familiarity Affect Word Learning? Mem. Cognit. 2023, 51, 1670–1682. [Google Scholar] [CrossRef]
  9. Su, Y.; Li, Y.; Li, H. Familiarity Ratings for 24,325 Simplified Chinese Words. Behav. Res. Methods 2023, 55, 1496–1509. [Google Scholar] [CrossRef]
  10. Stadthagen-Gonzalez, H.; Davis, C.J. The Bristol Norms for Age of Acquisition, Imageability, and Familiarity. Behav. Res. Methods 2006, 38, 598–605. [Google Scholar] [CrossRef]
  11. Juhasz, B.J.; Lai, Y.-H.; Woodcock, M.L. A Database of 629 English Compound Words: Ratings of Familiarity, Lexeme Meaning Dominance, Semantic Transparency, Age of Acquisition, Imageability, and Sensory Experience. Behav. Res. Methods 2015, 47, 1004–1019. [Google Scholar] [CrossRef] [PubMed]
  12. Liu, Y.; Shu, H.; Li, P. Word Naming and Psycholinguistic Norms: Chinese. Behav. Res. Methods 2007, 39, 192–198. [Google Scholar] [CrossRef] [PubMed]
  13. Lakhzoum, D.; Izaute, M.; Ferrand, L. Semantic Similarity and Associated Abstractness Norms for 630 French Word Pairs. Behav. Res. Methods 2021, 53, 1166–1178. [Google Scholar] [CrossRef] [PubMed]
  14. Mohler, M.; Tomlinson, M.T.; Bracewell, D.B.; Rink, B. Semi-Supervised Methods for Expanding Psycholinguistics Norms by Integrating Distributional Similarity with the Structure of WordNet. In Proceedings of the 9th Language Resources and Evaluation Conference, Reykjavik, Iceland, 26–31 May 2014; European Language Resources Association (ELRA): Paris, France, 2014; pp. 3020–3026. [Google Scholar]
  15. Köper, M.; Im Walde, S.S. Automatically Generated Affective Norms of Abstractness, Arousal, Imageability and Valence for 350,000 German Lemmas. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 23–28 May 2016; European Language Resources Association (ELRA): Paris, France, 2016; pp. 2595–2598. [Google Scholar]
  16. Keselman, A.; Tse, T.; Crowell, J.; Browne, A.; Ngo, L.; Zeng, Q. Assessing Consumer Health Vocabulary Familiarity: An Exploratory Study. J. Med. Internet Res. 2007, 9, e5. [Google Scholar] [CrossRef] [PubMed]
  17. Paetzold, G.; Specia, L. Inferring Psycholinguistic Properties of Words. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 435–440. [Google Scholar]
  18. Ehara, Y. Language-Independent Prediction of Psycholinguistic Properties of Words. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Taipei, Taiwan, 27 November–1 December 2017; Asian Federation of Natural Language Processing: Taipei, Taiwan, 2017; pp. 330–336. [Google Scholar]
  19. Sun, K.; Lu, X. Assessing Lexical Psychological Properties in Second Language Production: A Dynamic Semantic Similarity Approach. Front. Psychol. 2021, 12, 672243. [Google Scholar] [CrossRef] [PubMed]
  20. Lu, H.; Ma, X. Hybrid Decision Tree-Based Machine Learning Models for Short-Term Water Quality Prediction. Chemosphere 2020, 249, 126169. [Google Scholar] [CrossRef] [PubMed]
  21. Shwartz-Ziv, R.; Armon, A. Tabular Data: Deep Learning Is Not All You Need. Inf. Fusion 2022, 81, 84–90. [Google Scholar] [CrossRef]
  22. Iranmanesh, M.; Seyedabrishami, S.; Moridpour, S. Identifying High Crash Risk Segments in Rural Roads Using Ensemble Decision Tree-Based Models. Sci. Rep. 2022, 12, 20024. [Google Scholar] [CrossRef]
  23. Sagi, O.; Rokach, L. Ensemble Learning: A Survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1249. [Google Scholar] [CrossRef]
  24. Naser, M.Z.; Alavi, A.H. Error Metrics and Performance Fitness Indicators for Artificial Intelligence and Machine Learning in Engineering and Sciences. Archit. Struct. Constr. 2021, 3, 499–517. [Google Scholar] [CrossRef]
  25. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774. [Google Scholar]
  26. Juhasz, B.J.; Rayner, K. Investigating the Effects of a Set of Intercorrelated Variables on Eye Fixation Durations in Reading. J. Exp. Psychol. Learn. Mem. Cogn. 2003, 29, 1312–1318. [Google Scholar] [CrossRef] [PubMed]
  27. Juhasz, B.J. The Processing of Compound Words in English: Effects of Word Length on Eye Movements during Reading. Lang. Cogn. Process. 2008, 23, 1057–1088. [Google Scholar] [CrossRef]
  28. Culligan, B. A Comparison of Three Test Formats to Assess Word Difficulty. Lang. Test. 2015, 32, 503–520. [Google Scholar] [CrossRef]
  29. Chen, H.-Y.; Chang, E.C.; Chen, S.H.Y.; Lin, Y.-C.; Wu, D.H. Functional and Anatomical Dissociation between the Orthographic Lexicon and the Orthographic Buffer Revealed in Reading and Writing Chinese Characters by fMRI. Neuroimage 2016, 129, 105–116. [Google Scholar] [CrossRef] [PubMed]
  30. Jiang, N.; Hou, F.; Jiang, X. Analytic versus Holistic Recognition of Chinese Words among L2 Learners. Mod. Lang. J. 2020, 104, 567–580. [Google Scholar] [CrossRef]
  31. Jiang, N.; Feng, L. Analytic Visual Word Recognition among Chinese L2 Learners. Foreign Lang. Ann. 2022, 55, 540–558. [Google Scholar] [CrossRef]
  32. Juhasz, B.J. Age-of-Acquisition Effects in Word and Picture Identification. Psychol. Bull. 2005, 131, 684–712. [Google Scholar] [CrossRef]
  33. Tanaka-Ishii, K.; Terada, H. Word Familiarity and Frequency. Stud. Linguist. 2011, 65, 96–116. [Google Scholar] [CrossRef]
  34. Liu, X.; Vermeylen, L.; Wisniewski, D.; Brysbaert, M. The Contribution of Phonological Information to Visual Word Recognition: Evidence from Chinese Phonetic Radicals. Cortex 2020, 133, 48–64. [Google Scholar] [CrossRef]
  35. Chen, M.J.; Yuen, J.C.-K. Effects of Pinyin and Script Type on Verbal Processing: Comparisons of China, Taiwan, and Hong Kong Experience. Int. J. Behav. Dev. 1991, 14, 429–448. [Google Scholar] [CrossRef]
  36. Meade, G. The Role of Phonology during Visual Word Learning in Adults: An Integrative Review. Psychon. Bull. Rev. 2020, 27, 15–23. [Google Scholar] [CrossRef] [PubMed]
  37. Melinger, A.; Koenig, J.-P. Part-of-Speech Persistence: The Influence of Part-of-Speech Information on Lexical Processes. J. Mem. Lang. 2007, 56, 472–489. [Google Scholar] [CrossRef]
  38. Bolger, D.J.; Balass, M.; Landen, E.; Perfetti, C.A. Context Variation and Definitions in Learning the Meanings of Words: An Instance-Based Learning Approach. Discourse Process. 2008, 45, 122–159. [Google Scholar] [CrossRef]
  39. Crossley, S.A.; Subtirelu, N.; Salsbury, T. Frequency Effects or Context Effects in Second Language Word Learning: What Predicts Early Lexical Production? Stud. Second Lang. Acquis. 2013, 35, 727–755. [Google Scholar] [CrossRef]
  40. Wang, X.; Wu, W.; Ling, Z.; Xu, Y.; Fang, Y.; Wang, X.; Binder, J.R.; Men, W.; Gao, J.-H.; Bi, Y. Organizational Principles of Abstract Words in the Human Brain. Cereb. Cortex 2018, 28, 4305–4318. [Google Scholar] [CrossRef] [PubMed]
  41. Grand, G.; Blank, I.A.; Pereira, F.; Fedorenko, E. Semantic Projection Recovers Rich Human Knowledge of Multiple Object Features from Word Embeddings. Nat. Hum. Behav. 2022, 6, 975–987. [Google Scholar] [CrossRef] [PubMed]
  42. Richie, R.; Zou, W.; Bhatia, S. Predicting High-Level Human Judgment across Diverse Behavioral Domains. Collabra Psychol. 2019, 5, 50. [Google Scholar] [CrossRef]
  43. Freund, Y.; Schapire, R.E. Experiments with a New Boosting Algorithm. In Proceedings of the 13th International Conference on International Conference on Machine Learning, Bari, Italy, 3–6 July 1996; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1996; pp. 148–156. [Google Scholar]
  44. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  45. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794. [Google Scholar]
46. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3149–3157. [Google Scholar]
  47. Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient Boosting with Categorical Features Support. arXiv 2018, arXiv:1810.11363. [Google Scholar]
  48. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  49. Becker, W.; Kennedy, P. A Lesson in Least Squares and R Squared. Am. Stat. 1992, 46, 282–283. [Google Scholar] [CrossRef]
  50. Book, S.A.; Young, P.H. The Trouble with R2. J. Parametr. 2006, 25, 87–114. [Google Scholar] [CrossRef]
  51. New, B.; Ferrand, L.; Pallier, C.; Brysbaert, M. Reexamining the Word Length Effect in Visual Word Recognition: New Evidence from the English Lexicon Project. Psychon. Bull. Rev. 2006, 13, 45–52. [Google Scholar] [CrossRef] [PubMed]
  52. Barton, J.J.S.; Hanif, H.M.; Eklinder Björnström, L.; Hills, C. The Word-Length Effect in Reading: A Review. Cogn. Neuropsychol. 2014, 31, 378–412. [Google Scholar] [CrossRef] [PubMed]
  53. Van Valin, R.D. Review of Constructions at Work: The Nature of Generalization in Language, by A. E. Goldberg. J. Linguist. 2007, 43, 234–240. [Google Scholar] [CrossRef]
  54. Chicco, D.; Warrens, M.J.; Jurman, G. The Coefficient of Determination R-Squared Is More Informative than SMAPE, MAE, MAPE, MSE and RMSE in Regression Analysis Evaluation. PeerJ Comput. Sci. 2021, 7, e623. [Google Scholar] [CrossRef]
  55. Wisz, M.S.; Hijmans, R.J.; Li, J.; Peterson, A.T.; Graham, C.H.; Guisan, A.; NCEAS Predicting Species Distributions Working Group. Effects of Sample Size on the Performance of Species Distribution Models. Divers. Distrib. 2008, 14, 763–773. [Google Scholar] [CrossRef]
  56. Khan, Z.Y.; Niu, Z.; Sandiwarno, S.; Prince, R. Deep Learning Techniques for Rating Prediction: A Survey of the State-of-the-Art. Artif. Intell. Rev. 2021, 54, 95–135. [Google Scholar] [CrossRef]
57. Mandera, P.; Keuleers, E.; Brysbaert, M. How Useful Are Corpus-Based Methods for Extrapolating Psycholinguistic Variables? Q. J. Exp. Psychol. 2015, 68, 1623–1642. [Google Scholar] [CrossRef] [PubMed]
  58. Yang, R.; Wang, P.; Qi, J. A Novel SSA-CatBoost Machine Learning Model for Credit Rating. J. Intell. Fuzzy Syst. 2023, 44, 2269–2284. [Google Scholar] [CrossRef]
59. Crossley, S.; Holmes, L. Assessing Receptive Vocabulary Using State-of-the-Art Natural Language Processing Techniques. J. Second Lang. Stud. 2023, 6, 1–28. [Google Scholar] [CrossRef]
  60. Zang, C.; Fu, Y.; Bai, X.; Yan, G.; Liversedge, S.P. Investigating Word Length Effects in Chinese Reading. J. Exp. Psychol. Hum. Percept. Perform. 2018, 44, 1831–1841. [Google Scholar] [CrossRef] [PubMed]
  61. Zhang, G.; Yao, P.; Ma, G.; Wang, J.; Zhou, J.; Huang, L.; Xu, P.; Chen, L.; Chen, S.; Gu, J.; et al. The Database of Eye-Movement Measures on Words in Chinese Reading. Sci. Data 2022, 9, 411. [Google Scholar] [CrossRef] [PubMed]
  62. Grooms, J.; Sampson, V.; Enderle, P. How Concept Familiarity and Experience with Scientific Argumentation Are Related to the Way Groups Participate in an Episode of Argumentation. J. Res. Sci. Teach. 2018, 55, 1264–1286. [Google Scholar] [CrossRef]
  63. Keuleers, E.; Balota, D.A. Megastudies, Crowdsourcing, and Large Datasets in Psycholinguistics: An Overview of Recent Developments. Q. J. Exp. Psychol. 2015, 68, 1457–1468. [Google Scholar] [CrossRef]
  64. Williams, R.; Morris, R. Eye Movements, Word Familiarity, and Vocabulary Acquisition. Eur. J. Cogn. Psychol. 2004, 16, 312–339. [Google Scholar] [CrossRef]
  65. Markowitz, D.M.; Shulman, H.C. The Predictive Utility of Word Familiarity for Online Engagements and Funding. Proc. Natl. Acad. Sci. USA 2021, 118, e2026045118. [Google Scholar] [CrossRef]
Figure 1. Workflow of our methodology for predicting domain concepts' familiarity.
Figure 2. An example of SHAP values in our dataset.
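To make Figure 2 concrete, the sketch below shows one way to compute SHAP values [25] for a tree-based familiarity regressor. It is a minimal illustration rather than the authors' code: the feature names follow Table 3, but the data and ratings are randomly generated stand-ins, and the CatBoost settings are library defaults.

```python
# Minimal sketch: SHAP values for a tree-based regressor (cf. Figure 2).
# Feature names follow Table 3; the data and ratings are synthetic stand-ins.
import numpy as np
import pandas as pd
import shap
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
features = ["CL", "SN", "FOU", "SI_C1", "SI_C2", "SI_C3",
            "POS_C1", "POS_C2", "POS_C3", "CPA_C1", "CPA_C2", "CPA_C3"]
X = pd.DataFrame(rng.normal(size=(200, len(features))), columns=features)
y = rng.uniform(2.0, 6.5, size=200)  # familiarity ratings on the 1-7 scale

model = CatBoostRegressor(verbose=0, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # exact SHAP values for tree ensembles
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)
print(shap_values[0])                   # per-feature contributions for one concept
```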
Figure 3. Analysis of lexical feature correlations and clustering.
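A feature correlation and clustering analysis in the spirit of Figure 3 can be sketched as follows, reusing the synthetic X from the previous snippet. The choice of Spearman correlation and average-linkage clustering is our assumption; the paper's exact settings are not restated here.

```python
# Sketch of lexical-feature correlation and hierarchical clustering (cf. Figure 3),
# continuing from the synthetic X above. Correlation/linkage choices are assumptions.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

corr = X.corr(method="spearman")  # pairwise feature correlations
dist = 1.0 - corr.abs()           # turn correlation strength into a distance
link = linkage(squareform(dist.values, checks=False), method="average")

dendrogram(link, labels=list(corr.columns))
plt.tight_layout()
plt.show()
```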
Figure 4. Residual plots for the six regression models under ten-fold cross validation on the training dataset, with 75% training samples and 25% holdout samples.
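The residual analysis in Figure 4 can be approximated with the sketch below, continuing from the snippets above; the 75/25 split matches the caption, while the single CatBoost model and its default settings are illustrative.

```python
# Sketch of a residual plot on a 75%/25% split (cf. Figure 4); the model
# choice and hyperparameters are illustrative, not the study's exact setup.
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.75, random_state=0)
model = CatBoostRegressor(verbose=0, random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
plt.scatter(pred, y_te - pred, s=10)            # residual = observed - predicted
plt.axhline(0.0, color="grey", linestyle="--")  # zero-residual reference line
plt.xlabel("Predicted familiarity")
plt.ylabel("Residual")
plt.show()
```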
Figure 5. Summary plots of lexical feature importance in the best prediction model.
Figure 6. SHAP dependency plots for the three most important features from the prediction model: (a) SHAP dependency between SI_C1 and SI_C2; (b) SHAP dependency between CL and SI_C1; (c) SHAP dependency between SN and SI_C2.
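The plots in Figures 5 and 6 correspond to standard shap library visualizations; continuing from the Figure 2 sketch, they could be produced roughly as follows (the feature pair shown mirrors panel (a) of Figure 6).

```python
# Sketch of the SHAP summary and dependency plots (cf. Figures 5 and 6),
# continuing from the explainer defined in the Figure 2 sketch.
shap.summary_plot(shap_values, X)  # global view of lexical feature importance
shap.dependence_plot("SI_C1", shap_values, X,
                     interaction_index="SI_C2")  # pairwise dependency, as in panel (a)
```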
Table 1. Examples of domain concepts' familiarity determined by domain experts (Note: ☑ selected; ☐ unselected). Experts rated each concept on a seven-point scale: Very Unfamiliar (1), Unfamiliar (2), Somewhat Unfamiliar (3), Neutral (4), Somewhat Familiar (5), Familiar (6), Very Familiar (7). The example concepts were Modern Educational Technology, Collaborative Learning, Information Literacy, Learning Platform, Learning Behavior, Online Learning, Scaffolding Instruction, Topic Map, Resource Sharing, and Self-Directed Learning.
Table 2. Summary of domain concepts' familiarity ratings in the experimental dataset.

| Statistic         | Value |
|-------------------|-------|
| Number of ratings | 1832  |
| Mean              | 5.238 |
| Std               | 0.633 |
| Min               | 2.000 |
| Max               | 6.500 |
| 25th percentile   | 5.000 |
| 50th percentile   | 5.500 |
| 75th percentile   | 5.500 |
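Summary statistics of this kind are straightforward to reproduce with pandas; the sketch below uses the synthetic ratings from the Figure 2 snippet as a stand-in for the 1832 expert ratings.

```python
# Reproducing Table 2-style summary statistics; `y` here is the synthetic
# stand-in from the Figure 2 sketch, not the study's 1832 expert ratings.
ratings = pd.Series(y)
print(ratings.describe(percentiles=[0.25, 0.50, 0.75]).round(3))
```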
Table 3. Descriptive statistics and normality test results of lexical features.

| Feature | Mean | STD  | Min   | Max   | 25th  | 50th  | 75th  | K-S Statistic | p Value |
|---------|------|------|-------|-------|-------|-------|-------|---------------|---------|
| CL      | 5.11 | 1.71 | 1.00  | 15.00 | 4.00  | 5.00  | 6.00  | 368.52        | 0.00    |
| SN      | 7.09 | 1.74 | 1.00  | 12.48 | 5.98  | 7.10  | 8.20  | 27.78         | 0.00    |
| FOU     | 2.51 | 1.17 | 0.00  | 4.51  | 1.53  | 2.76  | 3.50  | 795.99        | 0.00    |
| SI_C1   | 0.00 | 1.25 | −3.03 | 4.44  | −0.94 | −0.06 | 0.85  | 30.96         | 0.00    |
| SI_C2   | 0.00 | 1.14 | −2.62 | 4.29  | −0.92 | −0.07 | 0.72  | 65.62         | 0.00    |
| SI_C3   | 0.00 | 0.86 | −2.99 | 3.22  | −0.63 | 0.05  | 0.60  | 8.34          | 0.02    |
| POS_C1  | 0.00 | 0.30 | −0.34 | 1.00  | −0.34 | 0.00  | 0.15  | 101.9         | 0.00    |
| POS_C2  | 0.00 | 0.20 | −0.49 | 0.72  | −0.12 | −0.04 | 0.15  | 19.65         | 0.00    |
| POS_C3  | 0.00 | 0.18 | −0.29 | 1.31  | −0.09 | −0.01 | −0.01 | 1267.03       | 0.00    |
| CPA_C1  | 0.00 | 0.15 | −0.28 | 0.67  | −0.12 | −0.03 | 0.12  | 87.94         | 0.00    |
| CPA_C2  | 0.00 | 0.11 | −0.28 | 0.46  | −0.08 | −0.02 | 0.07  | 104.18        | 0.00    |
| CPA_C3  | 0.00 | 0.10 | −0.35 | 0.47  | −0.06 | −0.01 | 0.05  | 78.98         | 0.00    |
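A per-feature descriptive and normality check like Table 3 can be run as sketched below, continuing from the synthetic X. Note that scipy's Kolmogorov–Smirnov statistic lies in [0, 1], so the much larger statistics reported in Table 3 presumably follow a different software convention; the snippet illustrates the test itself, not the exact reported values.

```python
# Descriptive statistics plus a Kolmogorov-Smirnov normality check per lexical
# feature (cf. Table 3). scipy's K-S statistic is bounded by 1, so the scale of
# Table 3's statistics likely reflects a different software convention.
from scipy import stats

for col in X.columns:
    x = X[col]
    z = (x - x.mean()) / x.std()  # standardize before testing against N(0, 1)
    ks_stat, p_value = stats.kstest(z, "norm")
    print(f"{col}: mean={x.mean():.2f} std={x.std():.2f} "
          f"KS={ks_stat:.3f} p={p_value:.3f}")
```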
Table 4. Performance comparisons of familiarity prediction methods on the training dataset through ten-fold cross validation.

| Regression Model | MAPE | MSE  | MAE  | RMSE | R²    |
|------------------|------|------|------|------|-------|
| CatBoost         | 0.09 | 0.39 | 0.45 | 0.62 | 0.02  |
| RF               | 0.10 | 0.39 | 0.47 | 0.62 | 0.01  |
| GBoost           | 0.10 | 0.39 | 0.45 | 0.62 | 0.01  |
| AdaBoost         | 0.10 | 0.40 | 0.49 | 0.63 | −0.01 |
| LightGBM         | 0.10 | 0.42 | 0.48 | 0.65 | −0.07 |
| XGBoost          | 0.10 | 0.45 | 0.49 | 0.67 | −0.15 |
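The comparison in Table 4 can be sketched with scikit-learn's cross-validation utilities, continuing from the earlier 75/25 split; all hyperparameters below are library defaults and may well differ from the study's tuned settings.

```python
# Sketch of a ten-fold cross-validated comparison of the six tree-based
# regressors (cf. Table 4), using default hyperparameters throughout.
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

models = {
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "GBoost": GradientBoostingRegressor(random_state=0),
    "XGBoost": XGBRegressor(random_state=0),
    "LightGBM": LGBMRegressor(random_state=0),
    "CatBoost": CatBoostRegressor(verbose=0, random_state=0),
    "RF": RandomForestRegressor(random_state=0),
}
for name, reg in models.items():
    scores = cross_val_score(reg, X_tr, y_tr, cv=10, scoring="r2")
    print(f"{name}: mean R2 = {scores.mean():.2f}")
```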
Table 5. Performance of the CatBoost method on the testing dataset.

| MAPE | MSE  | MAE  | RMSE | R²   |
|------|------|------|------|------|
| 0.09 | 0.35 | 0.44 | 0.59 | 0.11 |
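The five metrics in Table 5 map directly onto scikit-learn's regression metrics; a sketch, continuing from the fitted model and split above:

```python
# Computing the Table 5 metrics on the holdout set, continuing from the
# Figure 4 sketch's fitted model and 75/25 split.
from sklearn.metrics import (mean_absolute_error, mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

pred = model.predict(X_te)
mse = mean_squared_error(y_te, pred)
print(f"MAPE={mean_absolute_percentage_error(y_te, pred):.2f} "
      f"MSE={mse:.2f} MAE={mean_absolute_error(y_te, pred):.2f} "
      f"RMSE={mse ** 0.5:.2f} R2={r2_score(y_te, pred):.2f}")
```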