
Psychometric Methods: Theory and Practice
Topic Information
Dear Colleagues,
Measurement and quantification are ubiquitous in modern society. The historical foundation of psychometrics arose from the need to measure human abilities through suitable tests. The discipline then underwent rapid conceptual growth through the incorporation of advanced mathematical and statistical methods. Today, psychometrics not only covers virtually all statistical methods but also incorporates advanced techniques from machine learning and data mining that are useful for the behavioral and social sciences, including the handling of missing data, the combination of multiple-source information with measured data, measurement obtained from special experiments, visualization of statistical outcomes, and measurement that discloses underlying problem-solving strategies. Psychometric methods now apply across a wide range of disciplines, including education, psychology, the social sciences, behavioral genetics, neuropsychology, clinical psychology, medicine, and even the visual arts and music.
The dramatic development of psychometric methods and the rigorous incorporation of psychometrics, data science, and even artificial intelligence techniques in interdisciplinary fields have attracted significant attention and led to pressing discussions about the future of measurement.
The aim of this Special Topic is to gather studies on the latest developments in psychometric methods, covering a broad range of approaches from traditional statistical methods to advanced data-driven techniques, and to highlight discussions of different approaches (e.g., theory-driven vs. data-driven) to addressing challenges in psychometric theory and practice.
This Special Topic consists of two subtopics: (1) theory-driven psychometric methods that demonstrate the advancement of psychometric and statistical modeling in measurement and contribute to the development of psychological and hypothesized theories; and (2) data-driven computational methods that leverage new data sources and machine learning/data mining/artificial intelligence techniques to address new psychometric challenges.
In this issue, we seek original empirical or methodological studies, thematic/conceptual review articles, and discussion and comment papers highlighting pressing topics related to psychometrics.
Interested authors should submit a letter of intent including (1) a working title for the manuscript, (2) names, affiliations, and contact information for all authors, and (3) an abstract of no more than 500 words detailing the content of the proposed manuscript to the topic editors.
There is a two-stage submission process. Initially, interested authors are requested to submit only abstracts of their proposed papers. Authors of the selected abstracts will then be invited to submit full papers. Please note that the invitation to submit does not guarantee acceptance/publication in the Special Topic. Invited manuscripts will be subject to the usual review standards of the participating journals, including a rigorous peer review process.
Dr. Qiwei He
Dr. Yunxiao Chen
Prof. Dr. Carolyn Jane Anderson
Topic Editors
Participating Journals
Journal Name | Impact Factor | CiteScore | Launched Year | First Decision (median) | APC
---|---|---|---|---|---
Behavioral Sciences | 2.6 | 3.0 | 2011 | 21.2 Days | CHF 2200
Education Sciences | 3.0 | 4.0 | 2011 | 21.6 Days | CHF 1400
Journal of Intelligence | 3.5 | 2.5 | 2013 | 28.1 Days | CHF 2600
Psych | - | - | 2019 | 24.6 Days | CHF 1200
Published Papers (3 papers)
Planned Papers
The list below represents only planned manuscripts. Some of these manuscripts have not yet been received by the Editorial Office. Papers submitted to MDPI journals are subject to peer review.
Title: Metric Invariance in Exploratory Graph Analysis via Permutation Testing
Authors: Laura Jamison1; Hudson Golino1; Alexander P. Christensen2
Affiliation: 1 University of Virginia; 2 Vanderbilt University
Abstract: Establishing measurement invariance (MI) is vital when using any psychological measurement to ensure applicability and comparability across groups (or time points). If MI is violated, mean differences among groups could be due to the measurement rather than true differences in the latent variable. SEM is one of the most common frameworks for testing MI; however, existing methods are subject to many sources of reduced power due to model misspecification (e.g., noninvariant referent indicators, reliance on data-driven methods). Research has shown that many studies testing MI report inaccurate or inadequately described models, where errors in MI modeling are primarily predicted by software choice. Additionally, unequal group sample sizes may affect the goodness-of-fit measures used in testing MI. In network psychometrics, the available methods for comparing network structures are not conceptually analogous to MI and rely on the testing of partial correlations. We propose a more conceptually consistent method for testing MI within the Exploratory Graph Analysis (EGA) framework using network loadings, which are comparable to factor loadings. We calculate the difference in network loadings between groups and, using permutation testing, compare each original network loading difference to a permuted null distribution to determine significance. We conducted a simulation study following data structures commonly found in psychological research, including unequal group sample sizes, and found that, compared to SEM methods for testing partial metric invariance, the proposed method achieves comparable power, and even improved power in conditions such as smaller or unequal sample sizes with lower noninvariance effect sizes.
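As a rough illustration of the permutation logic described in this abstract (a sketch of ours, not the authors' implementation), the code below assumes a hypothetical `estimate_loadings` callable that maps a persons-by-items data matrix to one network loading per item, e.g., from an EGA-style estimator:

```python
import numpy as np

def permutation_loading_test(data_a, data_b, estimate_loadings, n_perm=1000, seed=0):
    """Permutation test for group differences in network loadings.

    `estimate_loadings` is a hypothetical callable that maps an
    (n_persons x n_items) array to a vector of network loadings
    (one per item), e.g., from an EGA-style estimator.
    """
    rng = np.random.default_rng(seed)
    observed_diff = estimate_loadings(data_a) - estimate_loadings(data_b)

    pooled = np.vstack([data_a, data_b])
    n_a = data_a.shape[0]
    null_diffs = np.empty((n_perm, pooled.shape[1]))
    for p in range(n_perm):
        idx = rng.permutation(pooled.shape[0])              # shuffle group membership
        perm_a, perm_b = pooled[idx[:n_a]], pooled[idx[n_a:]]
        null_diffs[p] = estimate_loadings(perm_a) - estimate_loadings(perm_b)

    # Two-sided p-value per item: proportion of permuted differences
    # at least as extreme as the observed difference.
    p_values = (np.abs(null_diffs) >= np.abs(observed_diff)).mean(axis=0)
    return observed_diff, p_values
```

Group labels are shuffled to build an item-wise null distribution of loading differences, and significance is judged against that permuted distribution, as the abstract describes.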
Title: Explanatory Cognitive Diagnosis Models Incorporating Item Features
Authors: Manqian Liao; Hong Jiao; Qiwei He
Affiliation: Duolingo; University of Maryland, College Park; Educational Testing Service
Abstract: Item quality is crucial to psychometric analyses for cognitive diagnosis. In Cognitive Diagnosis Models (CDMs), item quality is often quantified in terms of unobserved item parameters (e.g., guessing and slipping parameters). Calibrating the item parameters with only item response data, as is common practice, can make it difficult to identify the cause of low-quality items (e.g., a correct answer that is easy to guess), to devise an effective plan to improve item quality, or even to collect sufficient response data for item calibration. To resolve these challenges, we propose item explanatory CDMs, in which the CDM item parameters are explained by item features, so that item features can serve as an additional source of information about the item parameters. The utility of the proposed models is demonstrated with released items and response data from the Trends in International Mathematics and Science Study (TIMSS): around 20 item linguistic features were extracted from the item stems with natural language processing techniques, and the item feature engineering process is elaborated in the paper; the proposed models are used to examine the relationships between the guessing/slipping parameters of the higher-order DINA model and eight of the item features. Findings from a follow-up simulation study are also presented, which corroborate the validity of the inferences drawn from the empirical data analysis. Finally, future research directions are discussed.
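As a toy sketch of the structural idea only (the link function, intercept, and feature effects below are our assumptions, not the authors' specification), item guessing parameters might be expressed as a logit-linear function of item features layered on top of a CDM:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative explanatory layer: item guessing parameters g_j explained by
# item features x_j through a logit link, g_j = logistic(x_j @ beta + beta0),
# on top of a CDM such as the higher-order DINA model (hypothetical values).
rng = np.random.default_rng(1)
n_items, n_features = 30, 8
item_features = rng.normal(size=(n_items, n_features))  # e.g., NLP-derived item features
beta = rng.normal(scale=0.5, size=n_features)            # hypothetical feature effects
beta0 = -1.5                                             # hypothetical intercept
guessing = logistic(item_features @ beta + beta0)        # one guessing parameter per item
print(guessing.round(3))
```

In a full explanatory CDM, such feature effects would be estimated jointly with the measurement model rather than fixed as in this sketch.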
Title: Psychometric Modeling to Identify Examinee Strategy Differences Over the Course of Testing
Authors: Susan Embretson1; Clifford E. Hauenstein2
Affiliation: 1Georgia Institute of Technology; 2Johns Hopkins University
Abstract: Aptitude test scores are typically interpreted similarly for examinees with the same overall score. However, research has found evidence of strategy differences between examinees, as well as in examinees’ application of appropriate procedures over the course of testing. Research has also shown that strategy differences can impact the correlates of test scores. Hence, the relevance of test interpretations for equivalent scores can be questionable. The purpose of this study is to present several item response theory (IRT) models that are relevant to identifying examinee differences in strategies and understanding of test-taking procedures. First, mixture item response theory models identify latent clusters of examinees with different patterns of item responses. Early mixture IRT models (e.g., Rost & von Davier, 1995; Mislevy & Wilson, 1996) identify latent classes differing in patterns of item difficulty. More recently, item response times have been combined with item accuracy in joint IRT models to identify latent clusters of examinees with distinct response patterns. Although mixture IRT models have long been available, they are not routinely applied. Second, more recent IRT-based models can also identify strategy shifts over the course of testing (e.g., de Boeck & Jeon, 2019; Hauenstein & Embretson, 2022; Molenaar & de Boeck, 2018). That is, within-person differences in item-specific strategies are identified. In this study, the relevant IRT models will be illustrated on tests measuring various aspects of intelligence. The tests to be used include items on non-verbal reasoning, spatial ability, and mathematical problem solving.
Title: IRTrees: Selection Validity and Adverse Impact in the Presence of Extreme Response Styles
Authors: Justin L. Kern; Victoria L. Quirk
Affiliation: University of Illinois at Urbana-Champaign
Abstract: The measurement of psychological constructs is frequently based on self-report tests, often with Likert-type items rated from “Strongly Disagree” to “Strongly Agree.” However, previous research has suggested that responses to these types of items are often not solely a function of the content trait of interest but also of other systematic response tendencies due to item format. These tendencies, called response styles, can introduce noise into the measurement of a content trait. Previous research also demonstrates demographic group differences in response styles, potentially introducing bias into the scoring of tests. In recent years, a family of item response theory (IRT) models called IRTree models has been developed that allows researchers to parse out content traits (e.g., personality traits) from noise traits (e.g., response styles). IRTree models, in which IRT models are organized into a decision tree structure, allow for unique parameters at each node of a theorized response process. While there has been research evaluating the efficacy of IRTree models in modeling response processes in comparison to other models, there are few studies on the practical implications of decisions made based on this structured model. In this study, we analyze the selection validity and adverse impact of classification decisions made based on an IRTree model that controls for response style tendencies compared to an IRT model that does not. Here, we consider situations where respondents may display an extreme response style, that is, a tendency to systematically prefer extreme response categories (e.g., “Strongly Disagree” or “Strongly Agree”), though the approach could be extended to any theorized response process.
Title: Investigating Pre-knowledge and Speed Effects in an IRTree Modeling Framework
Authors: Justin L. Kern; Hahyeong Kim
Affiliation: University of Illinois at Urbana-Champaign
Abstract: Pre-knowledge in testing refers to the situation in which examinees have gained access to exam questions or answers prior to taking an exam. The items that examinees have been exposed to in this way are called compromised items. Exposure to compromised items can result in an artificial boost in exam scores, jeopardizing test validity and reliability, test security, and test fairness. Furthermore, it has been argued that pre-knowledge may result in quicker responses. A better understanding of the effects of pre-knowledge can help test creators and psychometricians overcome the problems pre-knowledge can cause. There is a growing literature in psychometrics focusing on pre-knowledge, primarily on the detection of person pre-knowledge. However, the majority of this work has used data where it is unknown whether a person has had prior exposure to items. This research aims to explore the effects of pre-knowledge with experimentally obtained data using the Revised Purdue Spatial Visualization Test (PSVT:R). To collect these data, we carried out an online experiment manipulating pre-knowledge levels among groups of participants. This was done by exposing participants to varying numbers of compromised items in a practice session prior to test administration. Recently, there has also been a growing modeling paradigm using tree-based item response theory models, called IRTree models, to embed cognitive theories into a model of item responding. One such application examined the role of speed on intelligence tests, positing differentiated fast and slow test-taking processes (DiTrapani et al., 2016). To investigate this, they proposed a two-level IRTree model with the first level controlled by speed (i.e., is the item answered quickly or slowly?) and the second level controlled by an intelligence trait. This approach allows for separate parameters at the second level depending on whether the responses were fast or slow; these can be separate item parameters, person parameters, or both. Building on this literature, we are interested in determining whether and how item pre-knowledge impacts item properties. In this approach, the effects to be studied include 1) whether pre-knowledge impacts the first-level IRTree parameters, affecting response time; 2) whether pre-knowledge impacts the second-level IRTree parameters, affecting response accuracy; and 3) whether the first-level response (i.e., fast or slow) impacts the second-level IRTree parameters. In all cases, an interesting sub-question is whether any of these effects are constant across items. Estimation of the models will be done using the mirt package in R. To determine the efficacy of the IRTree modeling approach for answering these questions, a simulation study will be run under various conditions. Factors to be included are sample size, effect size, and model. The outcomes will include empirical Type I error and power rates. The approach will then be applied to the collected pre-knowledge data.
Title: Bayesian Monte Carlo Simulation Studies in Psychometrics: Practice and Implications
Authors: Allison J. Ames; Brian C. Leventhal; Nnamdi C. Ezike; Kathryn S. Thompson
Affiliation: Amazon
Abstract: Data simulation and Monte Carlo simulation studies (MCSS) are important skills for researchers and practitioners of educational and psychological measurement. Harwell et al. (1996) and Feinberg and Rubright (2016) outline an eight-step process for MCSS:
1. Specifying the research question(s),
2. Defining and justifying conditions,
3. Specifying the experimental design and outcome(s) of interest,
4. Simulating data under the specified conditions,
5. Estimating parameters,
6. Comparing true and estimated parameters,
7. Replicating the procedure a specified number of times, and
8. Analyzing results based on the design and research questions.
There are a few didactic resources for psychometric MCSS (e.g., Leventhal & Ames, 2020) and software demonstrations. For example, Ames et al. (2020) demonstrate how to operationalize the eight steps for IRT using SAS software, and Feinberg and Rubright (2016) demonstrate similar concepts in R. Despite these resources, there is no current accounting of MCSS practice in psychometrics. For example, there are no resources that describe the typical number of replications for MCSS (step 7) and whether this varies by outcome of interest (step 3) or number of conditions (step 2). Further, there are no resources describing how Bayesian MCSS differ from frequentist MCSS. To understand the current practice of MCSS and provide a resource for researchers using MCSS, we reviewed six journals focusing on educational and psychological measurement from 2015 to 2019. This review examined a total of 1004 journal articles. Across all published manuscripts in those six journals, 55.8% contained an MCSS (n=560), of which 18.8% contained Bayesian simulations (n=105). Full results of the review will be presented in the manuscript. Because there is little guidance for Bayesian MCSS, the practice of Bayesian MCSS often relies on frequentist techniques. This fails, in our opinion, to leverage the benefits of Bayesian methodology. We examined the outcomes of interest in frequentist and Bayesian MCSS. One trend that emerged from our review is the use of Bayesian posterior point estimates alone, disregarding other aspects of the posterior distribution. Specifically, while 58.72% examined some form of bias (e.g., absolute, relative), relying upon a posterior point estimate, only 10.09% examined coverage rates, defined as the proportion of times the true (generating) value was covered by a specified posterior interval. To address the gap in information specific to Bayesian MCSS, this study focuses on current practice and Bayesian-specific decisions within the MCSS steps. Related to current practice, we ask the following: 1) What are the current practices in psychometric Bayesian MCSS across the reviewed journals during the five-year period? 2) How are the philosophical differences between frequentist and Bayesian practice operationalized in MCSS? 3) What overlap exists between the practice of MCSS in the Bayesian and frequentist frameworks? Regarding Bayesian decisions in MCSS, we ask: 4) What are the implications of differing decisions across the eight steps for common MCSS types (e.g., parameter recovery)?
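As a rough, self-contained illustration of steps 4 through 8 and of the coverage-rate outcome defined above (a toy conjugate normal-mean example of ours, not taken from the review or the authors' materials), a minimal Bayesian parameter-recovery MCSS might look like this:

```python
import numpy as np

def bayesian_mcss(true_mu=0.5, n=100, n_reps=500, prior_var=1.0, seed=42):
    """Toy Bayesian parameter-recovery MCSS for a normal mean with known
    unit variance and a conjugate N(0, prior_var) prior.

    Outcomes: bias of the posterior mean and coverage of the 95% credible
    interval (proportion of replications whose interval contains true_mu).
    """
    rng = np.random.default_rng(seed)
    post_means = np.empty(n_reps)
    covered = np.empty(n_reps, dtype=bool)
    for r in range(n_reps):
        y = rng.normal(loc=true_mu, scale=1.0, size=n)     # step 4: simulate data
        post_prec = n + 1.0 / prior_var                     # conjugate update
        post_mean = y.sum() / post_prec                     # step 5: estimate
        post_sd = np.sqrt(1.0 / post_prec)
        lo, hi = post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd
        post_means[r] = post_mean
        covered[r] = lo <= true_mu <= hi                    # step 6: compare to truth
    return {"bias": post_means.mean() - true_mu,            # step 8: summarize outcomes
            "coverage": covered.mean()}

print(bayesian_mcss())
```

Replication (step 7) is the outer loop; in a realistic psychometric MCSS the conjugate update would be replaced by MCMC sampling of the measurement model under each design condition (step 2).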
Title: New Insights on Internal Response Processes: A Novel Application of a Latent Space Item Response Modeling Approach to Item Response Tree Data
Authors: Nana Kim; Minjeong Jeon
Affiliation: University of Minnesota, Twin Cities, USA; University of California, Los Angeles, USA
Abstract: Item response tree (IRTree) models, a recent extension of item response theory (IRT) models, are designed to model internal response processes with a tree structure. The main advantage of an IRTree model is that it helps to statistically separate the latent traits that may be involved in different response subprocesses. For example, when applied to agree-disagree Likert-type scale data, IRTree models can differentiate the process of choosing extreme versus non-extreme response options (e.g., ‘strongly disagree’ versus ‘disagree’) from the process of choosing whether to agree or disagree with the item, thereby allowing us to differentiate response style latent traits. Other applications of IRTree models include modeling omitted response processes and differentiating slow versus fast intelligence.
IRTree approaches expand original item response data based on a tree structure consisting of multiple sub-trees that correspond to postulated subprocesses. Such tree-based expanded data, referred to as IRTree data, are typically modeled by fitting separate, possibly correlated IRT models to different sub-trees. In this study, we propose an alternative approach to gain new insights into IRTree data: the latent space item response model (LSIRM). The LSIRM visualizes conditional dependence, or interactions, between respondents and test items in a low-dimensional Euclidean space called an interaction map. By fitting the LSIRM to IRTree data, the interaction map can reveal unknown dependence within and across sub-trees at the item and respondent levels, possibly elucidating the nature and relationships of the internal response processes assumed in the tree structure. The geometrical representation of person-item node interactions can potentially facilitate the understanding of heterogeneity in response processes across and within respondents. Both empirical and simulated examples will be provided to demonstrate the proposed application of the LSIRM to IRTree data.
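As a minimal sketch of the tree-based data expansion described above (assuming a 4-point agree-disagree scale and a two-node tree, direction then extremity; the mapping is illustrative, not the authors' specification):

```python
import numpy as np

# Illustrative IRTree expansion for a 4-point Likert item:
# node 1 = agree (1) vs. disagree (0); node 2 = extreme (1) vs. non-extreme (0).
# Categories: 0 = strongly disagree, 1 = disagree, 2 = agree, 3 = strongly agree.
TREE_MAP = {
    0: (0, 1),   # strongly disagree -> disagree, extreme
    1: (0, 0),   # disagree          -> disagree, non-extreme
    2: (1, 0),   # agree             -> agree, non-extreme
    3: (1, 1),   # strongly agree    -> agree, extreme
}

def expand_irtree(responses):
    """Expand an (n_persons x n_items) matrix of Likert categories into
    pseudo-items: one column per item per tree node."""
    n_persons, n_items = responses.shape
    expanded = np.empty((n_persons, n_items * 2), dtype=int)
    for j in range(n_items):
        for i in range(n_persons):
            direction, extremity = TREE_MAP[responses[i, j]]
            expanded[i, 2 * j] = direction       # sub-tree 1: agree/disagree
            expanded[i, 2 * j + 1] = extremity   # sub-tree 2: extreme/non-extreme
    return expanded

demo = np.array([[0, 3], [2, 1]])
print(expand_irtree(demo))
```

Each sub-tree's pseudo-items could then be modeled by separate, possibly correlated IRT models, or, as proposed in the abstract, the full expanded matrix could be analyzed with an LSIRM-type interaction map.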
Title: A Comparative Study of IRT Models for Mixed Discrete-Continuous Responses
Authors: Cengiz Zopluoglu; J.R. Lockwood
Affiliation: University of Oregon
Duolingo
Abstract: The rapid evolution of machine learning and AI-driven technologies is increasing the complexity of features that can be extracted from constructed responses to tasks in both teaching and assessment contexts. A critical step in using these features is understanding how person-level and task-level traits interact to determine the responses and, consequently, the extracted features, while making inferences about a person's proficiency. Item response theory (IRT) models facilitate the separation of person and task traits, and novel IRT models designed to achieve this separation with increasingly sophisticated features continue to be developed. For example, Molenaar et al. (2022) recently proposed several families of IRT models suitable for features with distributions having a mixture of discrete point masses and continuous values on a bounded interval. Such distributions may arise in contexts where a constructed response is compared to some target response using features based on a distance metric, and a nontrivial fraction of responses have a distance of zero because they perfectly match the target. As these and other novel IRT models emerge, empirical evaluations of their performance with real data from diverse contexts are important to both the research and practitioner communities to improve the science of measurement.
To this end, we will present a case study evaluating novel IRT models using constructed responses from a high-stakes assessment of English language proficiency. The response data arise from a dictation task, in which a test taker is asked to listen to a sentence spoken in English and then to type (in English) the sentence that they heard. Prompts are designed to span a broad range of grammatical and linguistic complexity and are assigned adaptively as part of a computerized adaptive testing (CAT) system. This results in a dataset with a sparse structure, consisting of approximately 275,000 test takers, 2,700 prompts, and about 6 prompts per test taker. This provides a challenging context for item calibration via IRT models.
One of the features extracted from the typed response is an edit distance from the target sentence, transformed to the unit interval so that a value of 1 indicates exact agreement between the typed sentence and the target, while smaller values indicate larger discrepancies. The responses thus exhibit a mixture distribution, with a point mass at 1, a smaller point mass at 0, and a continuous distribution on (0,1), making the aforementioned models for zero-and-one-inflated continuous responses recently developed by Molenaar et al. (2022) potentially useful. Our study will evaluate the performance of three of these models (zero-and-one-inflated extensions of the Beta IRT, Simplex IRT, and Samejima's Continuous IRT models) when applied to the sparsely structured response data. We will examine the heterogeneity of item parameters across prompts (target sentences) and the reliability of the item parameters estimated with and without using collateral information about person proficiency via latent regression, for each of the candidate models. We will also compare the models using various cross-validation criteria and explore relationships between estimated item parameters and NLP-based features extracted from the prompts.
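As a small sketch of the kind of feature construction described above (our own illustration; the operational scaling may differ, and here the edit distance is simply normalized by the longer sentence length):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity_feature(response: str, target: str) -> float:
    """Edit distance rescaled to the unit interval: 1 = exact agreement,
    smaller values = larger discrepancies (floored at 0)."""
    if not target:
        return 1.0 if not response else 0.0
    dist = levenshtein(response, target)
    return max(0.0, 1.0 - dist / max(len(response), len(target)))

print(similarity_feature("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(similarity_feature("the cat sat", "the cat sat on the mat"))             # < 1.0
```

Perfect transcriptions pile up at 1 and empty or unrelated responses near 0, which produces exactly the zero-and-one-inflated shape that motivates the Molenaar et al. (2022) model family.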
Title: Using Keystroke Log Data to Detect Non-Genuine Behaviors in Writing Assessment: A Subgroup Analysis
Authors: Yang Jiang; Mo Zhang; Jiangang Hao; Paul Deane
Affiliation: Educational Testing Service
Abstract: In this paper, we will explore the use of keystroke logs (recordings of every keypress) for detecting non-genuine writing behaviors in writing assessment, with a particular focus on fairness issues across demographic subgroups. When writing assessments are delivered online and remotely, meaning the tests can be taken anywhere outside of a well-proctored and monitored testing center, threats to test security arise accordingly. While writing assessments usually require candidates to produce original text in response to a prompt, there are many possible ways to cheat, especially in at-home testing. For example, candidates may hire an imposter to write responses for them; they may memorize a concealed script or general shell text and simply apply it to whatever prompt they receive; or they may copy text directly from other sources, either entirely or partially. Therefore, predicting non-genuine writing behaviors/texts is of great interest to test developers and administrators. Deane et al. (2022) reported that, using keystroke log patterns, various machine learning prediction models produced an overall prediction accuracy between .85 and .90, and the ROC curve indicated a true positive rate of around 80% and a false negative rate of roughly 10%. In this paper, we plan to apply similar machine learning methods to predict non-genuine writing, but in addition to prediction accuracy, we will focus more on subgroup invariance. It is an important validity concern whether non-genuine writing can be predicted equally well across different demographic groups (e.g., race, gender, country, etc.). We will use a large-scale operational dataset for this exploration.
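As a minimal sketch of the kind of subgroup comparison described above (our own illustration; the grouping variable, classification threshold, and metrics are assumptions, not the study's analysis plan):

```python
import numpy as np
import pandas as pd

def subgroup_rates(y_true, y_score, groups, threshold=0.5):
    """Per-subgroup true positive / false positive rates for a binary
    'non-genuine writing' classifier. Inputs are 1-D arrays of equal length;
    `groups` holds demographic labels (illustrative)."""
    df = pd.DataFrame({"y": np.asarray(y_true),
                       "pred": (np.asarray(y_score) >= threshold).astype(int),
                       "group": np.asarray(groups)})
    rows = []
    for g, sub in df.groupby("group"):
        tp = ((sub.pred == 1) & (sub.y == 1)).sum()
        fp = ((sub.pred == 1) & (sub.y == 0)).sum()
        rows.append({"group": g,
                     "tpr": tp / max((sub.y == 1).sum(), 1),   # detection rate
                     "fpr": fp / max((sub.y == 0).sum(), 1),   # false-flag rate
                     "n": len(sub)})
    return pd.DataFrame(rows)
```

Large gaps in these per-group rates would signal the kind of subgroup non-invariance the abstract proposes to investigate, regardless of how strong the overall accuracy or ROC performance looks.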