Article

Effect Sizes for Estimating Differential Item Functioning Influence at the Test Level

W. Holmes Finch 1 and Brian F. French 2
1 Department of Educational Psychology, Ball State University, Muncie, IN 47304, USA
2 Learning and Performance Research Centre, Psychometric Laboratory, Washington State University, Pullman, WA 99164, USA
* Author to whom correspondence should be addressed.
Psych 2023, 5(1), 133-147; https://doi.org/10.3390/psych5010013
Submission received: 2 November 2022 / Revised: 9 February 2023 / Accepted: 13 February 2023 / Published: 15 February 2023
(This article belongs to the Special Issue Computational Aspects and Software in Psychometrics II)

Abstract

Differential item functioning (DIF) analysis is a critical step in providing evidence to support a scoring inference when building a validity argument for a psychological or educational assessment. Effect sizes can assist in understanding the accumulation of DIF at the test score level. The current simulation study investigated the performance of several proposed effect size measures under a variety of conditions. Conditions under study included varied sample sizes, DIF effect sizes, the proportion of items with DIF, and the type of DIF (additive vs. non-additive). The DIF effect sizes under study were sDTF%, uDTF%, τ̂w², d, R̄Δ², IDIF²*, and SDIFV. The results suggest that, across study conditions, τ̂w², IDIF²*, and d were consistently the most accurate measures of the DIF effects. The effect sizes were also estimated in an empirical example. Recommendations and implications for practice are discussed.

1. Introduction

A key requirement of educational measurement is that assessments be fair for individuals in different groups within the broader population [1]. Establishing test fairness involves the investigation of differential item/bundle functioning (DIF/DBF) present in the assessment. DIF refers to differential performance on assessment items for individuals in different groups who have the same level of the latent trait being measured, whereas DBF refers to differential group performance on sets, or bundles, of items. When differential functioning is considered for the entire set of test items, we refer to differential test functioning (DTF). For the remainder of this paper, we will use DTF to refer to the presence of DIF in an entire assessment. DTF has been investigated in different contexts, from personnel selection [2] to health care [3], including comparisons of effect sizes [2]. Capturing a lack of item invariance at the test level can assist in assessing whether such issues influence score-level decisions and in understanding the practical importance of a lack of invariance [4,5]. Given the centrality of assessment in educational systems, paramount importance must be given to ensuring fair and equal measurement of abilities for all individuals. DIF/DTF analysis does not guarantee fairness or the equitable measurement or use of scores, but it does provide a piece of evidence to support a path in that direction.
Work on test fairness and DIF has focused on item-level differences. However, given that decisions about individuals are based on test scores, understanding the influence of DIF on overall test scores is critical [6]. Effect sizes at the score level can aid this process. A number of such effect sizes have been suggested, including the DIF variance estimate from a random effects model, both unweighted (τ̂²) and weighted (τ̂w²) [7], Cohen's d for between-group average DIF [8,9], the average variance associated with DIF from logistic regression (R̄Δ²) [8,10], the signed (sDTF) and unsigned (uDTF) DTF, and the percent scoring difference of the test (sDTF%/uDTF%). Doebler [11] described a new DIF assessment framework designed to obviate the need to account for differences in group means on the measured latent trait (impact) or for item purification. The statistics associated with this approach provide information about DTF, including a measure of the variance in differences between group difficulty parameters (IDIF²*) and a standardized DIF variance measure (SDIFV) for the difference between group difficulty parameters. The effectiveness of these statistics has not yet been studied.
The remainder of this manuscript is organized as follows. First, a review of various methods for test-level DIF effect sizes is presented, after which the goals of this study are described. The simulation study used to assess the performance of these effect sizes is then described, followed by a presentation of the results for both the simulation and an applied data example. Finally, these results are discussed, and their implications for practice are considered. The simulation study was motivated by the ability to compare all effect sizes under known and controlled conditions, as has been common in other DTF effect size comparisons [2,12], and moves beyond comparisons conducted only on applied datasets [4].

1.1. Effect Size Methods for Quantifying the Magnitude of DIF in an Assessment

Effect size use in DIF detection has been emphasized for DIF identification [6] and for general measurement invariance work [13]. A number of effect size measures for quantifying DIF for a single assessment have been proposed. One such approach for quantifying the total amount of DIF in a set of items (e.g., a subtest) is based on a random effects item response theory (IRT) model [7]. Consider the 2-parameter logistic model for DIF (2PL-DIF):
$$P(\theta) = \frac{\exp\left[1.7\,a_i\left(\theta_j - b_i + G\xi_i\right)\right]}{1 + \exp\left[1.7\,a_i\left(\theta_j - b_i + G\xi_i\right)\right]} \qquad (1)$$
In Equation (1), a_i is the discrimination parameter for item i, b_i is the difficulty parameter for item i, θ_j is the latent trait value for examinee j, G is group membership (focal group = 0 and reference group = 1), and ξ_i is the DIF effect for item i. In this context, ξ_i = 0 signals no DIF, ξ_i > 0 signals DIF against the focal group, and ξ_i < 0 signals DIF against the reference group. The log odds ratio from the Mantel–Haenszel test is an effective and appropriate estimate of ξ_i [7] and can be modeled in the random effects context as follows:
$$\log(\hat{\alpha}_{MHi}) = \mu + \xi_i + \varepsilon_i \qquad (2)$$
where μ is the mean DIF across all items and ε_i is the estimation error.
When no DIF is present, the variance of the ξ_i, τ², will be 0. In short, τ² provides an estimate of the amount of DIF in a set of items, with larger values indicating a greater amount of cumulative DIF. Camilli and Penfield [7] proposed two approaches to estimating τ². The unweighted estimator takes the form:
$$\hat{\tau}^2 = \frac{\sum_{i=1}^{I}\left(\log(\hat{\alpha}_{MHi}) - \hat{\mu}\right)^2 - \sum_{i=1}^{I} S_i^2}{I} \qquad (3)$$
where μ̂ is the mean of log(α̂_MHi) across the I items of the test and S_i² is the error variance of log(α̂_MHi) for item i. The weighted estimator is as follows:
$$\hat{\tau}_w^2 = \frac{\sum_{i=1}^{I} w_i^2\left(\log(\hat{\alpha}_{MHi}) - \hat{\mu}\right)^2 - \sum_{i=1}^{I} w_i^2 S_i^2}{\sum_{i=1}^{I} w_i^2} \qquad (4)$$
where w_i² = 1/S_i². The index τ̂w² can provide a more accurate estimation of the true amount of cumulative DIF (DTF) present in a set of items [7].
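To make Equations (3) and (4) concrete, the following R sketch computes both estimators from a vector of Mantel–Haenszel log odds ratios and their standard errors. The function names and inputs are ours and purely illustrative, and the precision weights follow the reading w_i² = 1/S_i² given above.

```r
# Sketch of the unweighted (Eq. 3) and weighted (Eq. 4) DIF variance estimators.
# `log_or` and `se` are hypothetical vectors of MH log odds ratios and their
# standard errors for the I items of a test.
tau2_unweighted <- function(log_or, se) {
  I <- length(log_or)
  (sum((log_or - mean(log_or))^2) - sum(se^2)) / I
}

tau2_weighted <- function(log_or, se) {
  w2 <- 1 / se^2                    # assumed precision weights, w_i^2 = 1/S_i^2
  mu <- sum(w2 * log_or) / sum(w2)  # precision-weighted mean DIF
  (sum(w2 * (log_or - mu)^2) - sum(w2 * se^2)) / sum(w2)
}

# Illustrative 10-item example (values chosen for demonstration only)
log_or <- c(0.05, -0.10, 0.42, 0.03, -0.02, 0.38, 0.01, -0.04, 0.06, -0.08)
se     <- rep(0.15, 10)
tau2_unweighted(log_or, se)
tau2_weighted(log_or, se)
```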
A second proposed method for quantifying the amount of DIF in an assessment involves converting log(α̂_MHi) for item i to a Cohen's d effect size using a transformation first proposed by Hasselblad and Hedges [9]. Specifically, assuming a logistic distribution and homogeneous variances, a log odds ratio such as log(α̂_MHi) can be converted to d as follows:
$$d = \frac{\sqrt{3}}{\pi}\,\log(\hat{\alpha}_{MHi}) \qquad (5)$$
In the context of comparing the amount of DTF present in two or more scales, we propose to calculate log(α̂_MHi) for each item in the test or item bundle, convert these to d using Equation (5), and then average the d values across items, creating d̄. The potential advantage is that d is on a scale familiar to researchers and has established interpretive values for application. Two values of d̄ will be calculated in this study: d̄s, in which the signs of the d values are taken into consideration, and d̄u, in which the absolute values of the ds are used.
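As an illustration, Equation (5) and the two proposed averages take only a few lines of R; `log_or` is a hypothetical vector of per-item MH log odds ratios, as in the previous sketch.

```r
log_or  <- c(0.05, -0.10, 0.42, 0.03, -0.02, 0.38)  # illustrative values only
d_items <- (sqrt(3) / pi) * log_or  # Hasselblad-Hedges conversion (Eq. 5)
d_bar_s <- mean(d_items)            # signed average: opposite-direction DIF cancels
d_bar_u <- mean(abs(d_items))       # unsigned average: cancellation is ignored
```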
A third approach is based on logistic regression (LR), a popular method for DIF detection [14]. When LR is used for DIF assessment, the effect size RΔ² is reported along with the statistical test for each item. RΔ² reflects the amount of DIF present for the item. To characterize the magnitude of uniform DIF for an item, this statistic is calculated as the difference between the R² for the model including only the latent trait as a predictor of the item response and the R² for the model including both the latent trait and an indicator for the grouping variable. In order to estimate the amount of DIF present for an entire scale, we propose the use of the mean of RΔ², denoted R̄Δ², taken across the items in an item bundle or an entire assessment. Thus, for the set of items of interest, RΔ² will be calculated for each item, and R̄Δ² will be taken across the set of items, with larger values reflecting a greater amount of DIF present in the set of items. Note that this estimate of DTF is unsigned and therefore does not reflect DIF cancellation.
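A minimal R sketch of this procedure appears below. It uses a McFadden-type R² computed from model deviances; this is one of several R² variants used in the LR DIF literature, and the input names (`resp`, `total`, `group`) are hypothetical.

```r
# Per-item R2 increment attributable to uniform DIF. `resp` is an N x I matrix
# of 0/1 item responses, `total` a matching score (e.g., rest score), and
# `group` a 0/1 grouping indicator.
r2_delta_items <- function(resp, total, group) {
  sapply(seq_len(ncol(resp)), function(i) {
    m0 <- glm(resp[, i] ~ total,         family = binomial)
    m1 <- glm(resp[, i] ~ total + group, family = binomial)
    r2 <- function(m) 1 - m$deviance / m$null.deviance  # McFadden-type R^2
    r2(m1) - r2(m0)
  })
}

# Scale-level effect size: the mean R2 increment across items
# r2_bar <- mean(r2_delta_items(resp, total, group))
```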
Doebler [11] described DIF effect sizes that have shown promise for assessing the magnitude of DIF at the item level, and that may be useful for describing the impact of DIF at the test level. One of these statistics is based upon the variance in the difference of item level parameter estimates (e.g., item difficulty) across groups. For the two groups’ case, this difference is estimated as follows:
$$\hat{\Delta}_{\beta_i} = \hat{\beta}_{Fi} - \hat{\beta}_{Ri} \qquad (6)$$
where
β̂_Fi = focal group difficulty estimate for item i
β̂_Ri = reference group difficulty estimate for item i
When uniform DIF is not present for an item, Δ̂β_i = 0. The variance associated with this statistic can be used to estimate the total amount of DIF present in a test. One such example is the IDIF² statistic, which is based upon a statistic used in meta-analysis and can be calculated as follows:
$$I^2_{DIF} = \frac{\hat{v}^2}{\hat{v}^2 + s^2_{\Delta\beta}} \qquad (7)$$
where
v̂² = variance in the Δ̂β_i
s²_Δβ = mean squared error of the Δ̂β_i
The s²_Δβ value is estimated as
$$s^2_{\Delta\beta} = \frac{(I-1)\sum_{i=1}^{I} w_i}{\left(\sum_{i=1}^{I} w_i\right)^2 - \sum_{i=1}^{I} w_i^2} \qquad (8)$$
where
I = number of items
w_i = 1/s²_Δ̂β_i, the inverse squared standard error of Δ̂β_i
Larger values of IDIF² indicate greater inconsistency in the DIF structure across items.
In practice, the REML estimator of v̂² cannot be obtained in closed form. An alternative estimator of IDIF² can be calculated using the estimator of v̂² proposed by DerSimonian and Laird (1986). This estimator takes the form:
$$I^{2*}_{DIF} = \max\left(0,\; \frac{Q - (I-1)}{Q}\right) \qquad (9)$$
where
$$Q = \sum_{i=1}^{I} \frac{\left(\hat{\Delta}_{\beta_i} - \bar{\Delta}\right)^2}{s^2_{\hat{\Delta}_{\beta_i}}}, \qquad \bar{\Delta} = \frac{\sum_{i=1}^{I} w_i \hat{\Delta}_{\beta_i}}{\sum_{i=1}^{I} w_i}$$
IDIF²* can be interpreted as the ratio of excess heterogeneity to the total heterogeneity in DIF across the items, and it serves as an estimate of IDIF² such that larger values indicate a higher level of DIF in the assessment being studied.
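The following R sketch assembles Equation (9) from its components; `delta` and `se` are hypothetical vectors holding the Δ̂β_i from Equation (6) and their standard errors.

```r
# DerSimonian-Laird style estimator of IDIF2* (Eq. 9)
idif2_star <- function(delta, se) {
  w     <- 1 / se^2                       # w_i = 1 / s^2
  d_bar <- sum(w * delta) / sum(w)        # weighted mean difference (Delta-bar)
  Q     <- sum((delta - d_bar)^2 / se^2)  # Cochran's Q across the I items
  I     <- length(delta)
  max(0, (Q - (I - 1)) / Q)               # excess heterogeneity as a share of total
}
```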
Chalmers et al. [15] described two approaches for characterizing the amount of DIF in an assessment, both based on the group-specific expected test score function in Equation (10):
$$T(\theta, \psi_G) = \sum_{i=1}^{n} S_i(\theta, \psi_G) \qquad (10)$$
where
ψ_G = item parameters for group G
S_i(θ, ψ_G) = $\sum_{k=0}^{K-1} k \cdot P(y = k \mid \theta, \psi_G)$, the score function for item i
y = item response
k = item category
θ = latent trait value
In turn, T(θ, ψ_G) can be used to ascertain whether there is differential test functioning (DTF) across multiple groups. One such statistic is an index of signed DTF:
$$sDTF = \int \left[T(\theta, \psi_R) - T(\theta, \psi_F)\right] g(\theta)\, d\theta \qquad (11)$$
where
g(θ) = weight function such that ∫ g(θ) dθ = 1
ψ_R = reference group item parameters
ψ_F = focal group item parameters
The sample estimate of sDTF is obtained by evaluating the function in Equation (11) at a set of Q quadrature points, as in Equation (12):
$$\widehat{sDTF} = \sum_{q=1}^{Q} \left[T(X_q, \psi_R) - T(X_q, \psi_F)\right] g(X_q) \qquad (12)$$
where
X_q = quadrature node
g(X_q) = weight associated with quadrature node q
Researchers can interpret sDTF as the mean difference between the test response functions for the two groups, with negative values indicating that the reference group had lower mean scores on the assessment and positive values indicating that the focal group had lower mean scores, after accounting for item parameter values.
An alternative to the sDTF is an unsigned measure of DTF, which focuses on group differences in the test score function without respect to which group is favored. This unsigned DTF measure is expressed as follows:
$$uDTF = \int \left|T(\theta, \psi_R) - T(\theta, \psi_F)\right| g(\theta)\, d\theta \qquad (13)$$
Note that the uDTF is identical to the sDTF except that the absolute value of the test function differences is taken in Equation (13). As with the sDTF, quadrature is used to obtain estimates for specific samples, as in Equation (14):
$$\widehat{uDTF} = \sum_{q=1}^{Q} \left|T(X_q, \psi_R) - T(X_q, \psi_F)\right| g(X_q) \qquad (14)$$
The uDTF reflects the mean absolute difference between the two groups' test score functions and can be standardized in order to make interpretation easier, as in Equation (15):
$$uDTF\% = \frac{uDTF}{TS} \times 100 \qquad (15)$$
where
TS = highest possible test score
Larger values of these statistics reflect a greater degree of DTF.
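For dichotomous Rasch items, Equations (12), (14), and (15) reduce to weighted sums over quadrature nodes, as in the R sketch below. The standard normal weight function for g(θ) and the group-specific difficulty vectors `b_ref` and `b_foc` are illustrative assumptions.

```r
# Signed and unsigned DTF for dichotomous Rasch items via quadrature
dtf_rasch <- function(b_ref, b_foc, nodes = seq(-4, 4, length.out = 61)) {
  g <- dnorm(nodes); g <- g / sum(g)  # quadrature weights summing to 1
  T_ref <- sapply(nodes, function(t) sum(plogis(t - b_ref)))  # reference test function
  T_foc <- sapply(nodes, function(t) sum(plogis(t - b_foc)))  # focal test function
  sDTF <- sum((T_ref - T_foc) * g)    # Eq. (12)
  uDTF <- sum(abs(T_ref - T_foc) * g) # Eq. (14)
  c(sDTF = sDTF, uDTF = uDTF,
    uDTF_pct = 100 * uDTF / length(b_ref))  # Eq. (15); TS = number of items
}

# Example: uniform DIF of 0.4 on the second of three items
dtf_rasch(b_ref = c(-1, 0, 1), b_foc = c(-1, 0.4, 1))
```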
Doebler [11] described another effect size measure of DIF across a set of items, the standardized DIF variance (SDIFV). This statistic takes the form:
$$SDIFV = \frac{(J_R + J_F - 2)\,\hat{\tau}^2}{(J_R - 1)\,\sigma_R^2 + (J_F - 1)\,\sigma_F^2} \qquad (16)$$
where
J_R = reference group sample size
J_F = focal group sample size
σ_R² = variance in the latent trait for the reference group
σ_F² = variance in the latent trait for the focal group
Doebler argues that, in contrast to IDIF²*, SDIFV is largely independent of sample size and thus may be particularly useful as an effect size.
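Under the pooled-variance reading of Equation (16) given above, SDIFV is a one-line computation; all argument names here are ours.

```r
# Standardized DIF variance (Eq. 16), assuming the pooled-variance denominator
sdifv <- function(tau2, n_ref, n_foc, var_ref, var_foc) {
  (n_ref + n_foc - 2) * tau2 /
    ((n_ref - 1) * var_ref + (n_foc - 1) * var_foc)
}
```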
There is a dearth of work examining the performance of these test-level DIF (DTF) effect size measures. Finch et al. [8] found that when comparing the amount of DIF present in two assessments of the same construct, τ̂², d̄s, d̄u, and R̄Δ² were all effective in identifying the test with greater DIF across its items. In contrast, when the tests were not of equal length, τ̂² yielded more accurate results than the other methods studied. Furthermore, when DIF was present for both tests but was greater in one than the other, τ̂² was better able than d̄s, d̄u, and R̄Δ² to identify which test had the greater level of DIF. With respect to IDIF²* and SDIFV, Doebler [11] demonstrated their use with an extant dataset and showed that they were able to identify cumulative DIF across a set of items in a scale.

1.2. Study Goals

This study's goals were to (a) describe a variety of effect size statistics designed to translate the impact of uniform DIF at the item level into impact at the score level, (b) compare these measures across simulated conditions, and (c) demonstrate these effect sizes in an applied example. A simulation was used to compare the effect sizes, following past research [2,5]. In this work, population parameter effect size values were obtained by generating populations consisting of 20,000,000 individuals in each group under a specific DIF condition. The population data generating conditions appear in Table 1 and are described in the Methods section below. This study adds to the literature by comparing a number of test-level DIF effect sizes, several of which have not heretofore been examined using a simulation methodology.

2. Materials and Methods

A Monte Carlo simulation study (1000 replications per condition) was conducted to compare the effect size measures with one another across a variety of study conditions. Item response data for dichotomous items were generated using a Rasch model in order to follow the modeling used in Doebler [11]. The study conditions, which are described below and appear in Table 1, were completely crossed, and were selected to represent what is common in many DIF studies.
Data generation and analysis were completed with the R software system, version 4.0 [16]. The item difficulty parameters used to generate data for the reference group were drawn from a widely used intelligence test [17] and appear in Table 2. The method used to induce DIF for the focal group is described below.

2.1. Manipulated Factors

2.1.1. Number of Items

Prior research has shown that the assessment of DIF/DTF can be influenced by test length, with longer tests potentially providing more items for better matching of groups in DIF detection [18,19,20,21]. For this reason, two test lengths were simulated in this study (10 and 20 items), representing short to moderate length tests. These values are common in practice, especially with psychological scales [22,23,24], and have appeared in previous simulation studies of DIF detection [18,19,20,25].

2.1.2. Percentage, Magnitude, and Type of DIF

DIF was simulated in 0%, 10%, or 20% of items, percentages found in practice [26]. DIF was induced by adjusting the item difficulty parameter values for the focal group, with the adjustment determined by manipulating the area between the item response functions via Raju's formula. These differences were employed in order to simulate no DIF (0), small DIF (0.4), and large DIF (0.8) [21]. By necessity, the 0% DIF condition was not crossed with the DIF magnitude conditions. As noted above, the item difficulty values used to generate the data for the reference group appear in Table 2. When DIF was present, the magnitude values listed above were added to the Table 2 values for items 2, 3, 6, and 7 (for the appropriate proportion of items with DIF) in order to obtain the item difficulty parameters for the focal group. These items were selected randomly. DIF was simulated to be either additive (all DIF items favored the same group) or non-additive (half of the DIF items favored the reference group and half favored the focal group); a sketch of this step appears below.
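The following R sketch illustrates this data-generating step: dichotomous Rasch responses with additive uniform DIF induced by shifting the focal group's difficulties. The reference difficulties are the first ten values in Table 2; the specific items shifted here (2 and 6) and the sample sizes are illustrative.

```r
# Generate dichotomous Rasch responses for one group
gen_rasch <- function(n, b, mean_theta = 0) {
  theta <- rnorm(n, mean_theta, 1)       # latent trait values
  p     <- plogis(outer(theta, b, "-"))  # P(X = 1) = logistic(theta - b)
  matrix(rbinom(length(p), 1, p), nrow = n)
}

b_ref <- c(-1.92, -0.96, -0.55, -0.14, 0.01, 0.22, 0.41, 0.59, -0.60, 0.23)
b_foc <- b_ref
b_foc[c(2, 6)] <- b_foc[c(2, 6)] + 0.4   # 20% of items, magnitude 0.4, additive

ref <- gen_rasch(1000, b_ref)
foc <- gen_rasch(1000, b_foc, mean_theta = -0.5)  # -0.5 impact condition
```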

2.1.3. Sample Size/Sample Size Ratio

Focal and reference group sample sizes were simulated at a variety of levels, as seen in Table 1. These values represent both the applied and simulation literature on DIF [14,21,27,28].

2.1.4. Group Ability Differences (Impact)

Group mean differences in the latent trait are associated with Type I error inflation for uniform DIF [27,29,30,31]. Therefore, to ascertain the influence of impact on the performance of the effect sizes, mean abilities of the focal and reference groups were set to be 0/0, −0.5/0, or −1/0.

2.1.5. Study Outcomes

The study outcomes of interest were the mean effect size values, the empirical standard errors for the effect sizes, and the relative estimation bias of the effect size measures. The empirical standard error is the standard deviation of the 1000 parameter estimates for each combination of conditions. In other words, for each combination of the study conditions, the standard deviation of each effect size measure was calculated for the 1000 estimated values. Likewise, the mean effect size value for each effect size statistic was calculated across the 1000 parameter estimates for each combination of conditions. The relative bias was calculated as follows:
$$RB = \frac{\hat{\theta} - \theta}{\theta}$$
where
θ = population parameter value
θ̂ = parameter estimate
The population parameter effect size values were obtained by generating population simulations consisting of 20,000,000 individuals in each group under a specific DIF condition (e.g., a magnitude of 0.8 with 20% of items containing DIF). To ensure the stability of these values, this procedure was conducted 3 times for each condition; the resulting effect size values were virtually identical across these replications. In order to assess which main effects and interactions of the manipulated conditions were related to relative bias, analysis of variance (ANOVA) was used in conjunction with ω², with values above 0.10 used to single out factors for examination.
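For a single cell of the design, the three outcomes reduce to the following R computations; `est` stands in for the 1000 replicate estimates of one effect size, and `theta_pop` for the corresponding population value (both placeholders here).

```r
est       <- rnorm(1000, mean = 0.05, sd = 0.02)  # placeholder replicate estimates
theta_pop <- 0.05                                 # placeholder population value

mean_est <- mean(est)                             # mean effect size value
emp_se   <- sd(est)                               # empirical standard error
rel_bias <- (mean_est - theta_pop) / theta_pop    # relative bias (RB formula above)
```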
With respect to the interpretation of the outcomes, RB is most useful for comparisons across the various effect size statistics because it accounts for differences in the scale of each statistic by standardizing against the population value. In other words, for each effect size measure, RB reflects the same metric: the difference between the estimated and population values divided by the population value. The empirical standard error is most useful for comparisons within the same effect size statistic across different simulation conditions; it is not helpful for comparing across effect size measures because they are calculated on different scales. Finally, the mean of the raw effect size statistic is likewise useful for comparisons across simulation conditions within an effect size measure and not across effect sizes, again due to differences in scale.

3. Results

3.1. No DIF

When there was no DIF present in the population, there were no statistically significant terms associated with the values of the effect sizes. Table 3 contains the means and empirical standard errors of the DTF effect size statistics for the no-DIF condition. Because no DIF was present, values of 0 should be observed for these effect sizes. None of the statistics yielded values that were exactly 0, with all but sDTF% exhibiting a positive bias. Note that it is not possible to calculate relative bias in the no-DIF condition because doing so would require division by 0.

3.2. DIF Present

When DIF was simulated to be present among the items, the ANOVA results revealed that the interactions of the proportion of DIF items by DIF magnitude (F(1, 441) = 48.77, p < 0.0001, ω² = 0.094), impact by type of DIF (F(2, 441) = 12.07, p < 0.0001, ω² = 0.073), and DIF magnitude by type of DIF (F(1, 441) = 5.74, p = 0.017, ω² = 0.013) were associated with relative bias. Table 4 includes the relative bias, effect size means, and standard errors across simulation replications by the proportion of DIF items and the magnitude of DIF. The relative biases of τ̂w², d, and sDTF% were generally uninfluenced by the magnitude of DIF or the proportion of items that contained DIF. In contrast, IDIF²*, SDIFV, and R̄Δ² yielded different levels of relative bias depending on DIF magnitude and proportion. More specifically, IDIF²* yielded somewhat higher levels of bias for the lower DIF magnitude, whereas SDIFV had higher bias for the larger DIF magnitude. The R̄Δ² effect size yielded higher levels of relative bias for both greater DIF magnitude and a larger proportion of DIF items. In addition, for each of the effect sizes, mean values were larger when a greater proportion of items exhibited DIF and when the magnitude of DIF was greater.
The empirical standard errors by DIF magnitude and proportion of DIF items, as well as for the No DIF condition, appear in the third panel of Table 4.
For SDIFV and R̄Δ², the standard errors in the no-DIF case were lower than when DIF was present. In contrast, IDIF²*, uDTF%, and sDTF% exhibited empirical standard errors in the no-DIF case that were equal to or greater than those when DIF was present. The standard errors for τ̂w² and d were not affected by the presence of DIF. When DIF was present, the empirical standard errors for IDIF²*, uDTF%, and sDTF% were larger for a higher proportion of DIF items when the magnitude of DIF was 0.4, whereas for a magnitude of 0.8 the reverse pattern was in evidence. The results for SDIFV showed the opposite pattern of these three statistics, and for τ̂w², d, and R̄Δ² there appeared to be little to no relationship between either the DIF magnitude or the proportion of DIF items and the empirical standard error.
Table 5 includes the relative bias, mean effect size values, and standard errors by DIF magnitude and type of DIF.
When DIF was additive in nature (i.e., all DIF items favored the same group), the relative bias was consistently higher for IDIF²*, SDIFV, and uDTF%. This effect was particularly marked for SDIFV in the higher DIF magnitude condition. In contrast, the relative biases of τ̂w², d, and sDTF% were not affected by either DIF magnitude or type. Finally, R̄Δ² yielded the highest relative bias across conditions.
With respect to the mean estimates, each effect size yielded larger values when the magnitude of DIF was larger. This result was evident in both the mean effect size values and the ratios of these values to the no-DIF case. When DIF was not additive (i.e., equal numbers of items favored the two groups), IDIF²*, SDIFV, and d displayed little to no difference in mean values between the two DIF magnitude levels. In addition, the difference in mean values between DIF magnitudes of 0.4 and 0.8 was smaller for uDTF% and sDTF% under non-additive DIF. Finally, τ̂w² and R̄Δ² both appeared to be largely unaffected by the type of DIF, such that the effect of DIF magnitude on the mean effect size value was similar for both types of DIF.
The empirical standard errors for the effect sizes by DIF magnitude and type also appear in Table 5. The overall patterns of relative empirical standard error size for the no-DIF and DIF conditions by effect size statistic that were discussed with respect to Table 4 are also evident in Table 5. In terms of the combination of DIF magnitude and type, there was no impact on the empirical standard errors of either τ̂w² or d. The standard errors for SDIFV and R̄Δ² were larger for the non-additive DIF condition, and this pattern was stronger when the DIF magnitude was 0.8. In contrast, the empirical standard errors for IDIF²*, uDTF%, and sDTF% were smaller in the non-additive DIF case, and this difference from the additive DIF condition was more pronounced for the greater DIF magnitude.
Table 6 includes the relative estimation bias, mean effect size values, and empirical standard errors by impact and type of DIF.
As was evident in Table 5, the relative biases of τ̂w², d, and sDTF% were affected relatively little by the type of DIF, regardless of the level of impact. In contrast, the type of DIF did influence the values obtained for the other statistics to at least some extent. More specifically, the type of DIF had a greater influence on the relative bias at greater levels of impact for IDIF²*, SDIFV, and uDTF%. Across conditions, R̄Δ² yielded the highest degree of relative bias among the effect sizes included in this study.
A similar overall pattern of results was evident for the mean values of the effect sizes themselves. For example, R̄Δ² was unaffected by the level of impact. Furthermore, τ̂w² and d evinced relatively small differences in mean values across the different levels of impact. In contrast, the magnitudes of several of the effect size measures were affected differentially by the combination of impact and DIF type. As an example, the values of IDIF²* and SDIFV increased concomitantly with greater levels of impact when the type of DIF was additive. Conversely, when the type of DIF was non-additive, the values of these two effect sizes decreased with increasing levels of impact. In other words, both IDIF²* and SDIFV were less likely to reflect the presence of DIF when impact was larger and the type of DIF was non-additive, but were more likely to reflect the presence of more DIF among the items when the type of DIF was additive and the level of impact was higher. In contrast to this pattern, the results in Table 6 show that for both uDTF% and sDTF%, values increased as the level of impact increased. However, this increase was greater for non-additive DIF, as reflected in both the means and the ratios to the no-DIF condition.
The empirical standard errors for the effect size statistics by impact and type of DIF appear in the third panel of Table 6, along with the standard errors when no DIF was present. As was true in the other tables, there was no relationship between either impact or the type of DIF and the empirical standard errors for τ̂w² or d. In addition, the standard errors for R̄Δ² were larger for the non-additive DIF condition, and this result was consistent across levels of impact. As was the case in Table 4, the empirical standard errors for the additive DIF condition were larger for both uDTF% and sDTF%, and this difference was larger for uDTF% when greater amounts of impact were present. Effect size values, empirical standard errors, and relative estimation bias by sample size (Table S1) and by number of items (Table S2) appear in the Supplemental Materials.

3.3. Empirical Example

In order to demonstrate the use of these effect size statistics in an applied setting, a DIF analysis was conducted for an empirical example, following past examples in this area [2]. The data used for this purpose were drawn from the verbal aggression dataset included in the difR package [32]. The data include 316 responses to a 24-item verbal aggression questionnaire. The items asked respondents about four scenarios (a bus fails to stop for me; I miss a train because a clerk gave me faulty information; the grocery store closes just as I am about to enter; the operator disconnects me when I have used my last 10 cents for a call). For each scenario, respondents are asked whether they would curse, scold, or shout, or would want to (but not actually) curse, scold, or shout. The combination of scenarios (4) and potential actions (6) yields 24 items. For each item, individuals respond either yes (1) or no (0). Uniform DIF for these items was assessed by both the Mantel–Haenszel test and logistic regression based on gender, with 243 women and 73 men in the sample. The effect sizes featured in the simulation study were then applied to the verbal aggression dataset.
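A sketch of this analysis using the difR package follows. difMH() and difLogistic() are the package's Mantel–Haenszel and logistic regression DIF routines; we assume here that Gender is coded 1 for men, who serve as the focal group.

```r
library(difR)
data(verbal)                 # 316 respondents; 24 items plus Anger and Gender
items <- verbal[, 1:24]

# Mantel-Haenszel test for uniform DIF by gender
mh <- difMH(Data = items, group = verbal$Gender, focal.name = 1)

# Logistic regression, uniform DIF only
lr <- difLogistic(Data = items, group = verbal$Gender, focal.name = 1,
                  type = "udif")
```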
Table 7 includes the difficulty parameter estimates by group, the ETS Δ effect size, the MH α, and the RΔ² effect size from logistic regression for each item in the verbal aggression scale. Statistically significant DIF is denoted by *.
Based on these results, items 6, 12, 16, 17, 19, and 20 were identified as having statistically significant uniform DIF by the MH test. The LR results identified items 4, 6, 14, 16, 17, and 19 as having gender-based DIF. The difficulty parameter estimates were lower for females on items 4, 6, and 12, whereas they were higher for females on items 16, 17, 19, and 20. Higher item difficulty means that an individual needs a higher level of the latent trait (verbal aggression) to endorse the item.
The DTF effect size estimates for the verbal aggression data appear in Table 8.
Effect size values were calculated both for the full set of 24 items and for the subset of 17 items that were not found to have statistically significant DIF. The purpose of the latter analysis was to provide a point of comparison between the two scenarios (with and without DIF). It is clear from these results that the effect size values were larger when the DIF items were included than when they were excluded. It should also be noted that these effect sizes are each designed such that the number of items should not influence their magnitude. Therefore, differences in effect size values reflect differences in the amount of collective DIF present in the items.

4. Discussion

Assessment fairness and equity are increasingly important issues in educational and psychological measurement. In order for scores from such assessments to be meaningful at the individual and group levels, the measures themselves must perform in the same way for individuals from across the population. When this is not the case, and an instrument performs differentially for individuals from different subgroups, the scores cannot be assumed to be valid reflections of the construct being measured. Traditionally, DIF detection studies have focused primarily on identifying individual items that might exhibit DIF so that they can be edited or removed. Relatively less attention has been paid to assessing the impact of DIF at the total scale level. Recently, however, authors have suggested such measures for use in ascertaining the extent of the impact of DIF on the total test score [8,11], as well as for determining guidelines for the size of these effects and for when a lack of invariance is of practical importance [3]. The focus of this study was on comparing the performance of several such effect sizes in order to determine which of them might provide measurement professionals the greatest insight into DIF impacts at the test level, and under what conditions.
The results presented above suggest that several of the effect sizes included in this study are potentially useful tools for researchers interested in assessing the effect of DIF across the items of an instrument as a whole. In particular, τ̂w², sDTF%, and d were perhaps the best performers when considering the totality of the evidence. When DIF was not present, τ̂w² exhibited the lowest degree of bias from the population value of 0, with d also demonstrating relatively low bias. In addition, when DIF was present in the population, these three statistics had relatively low standard errors and low relative bias. These effect size statistics were also largely impervious to group differences in the mean of the latent trait being measured by the assessment.

5. Directions for Future Research

Additional research needs to be conducted with an eye toward developing guidelines for interpreting the effect sizes studied here. Appropriate guidelines in the DIF literature, and in particular DTF, have been a challenge and will continue to be as new DIF procedures are proposed. In addition, the guidelines may differ by the type of outcome studied (e.g., observed vs. latent scores [5]). Future work should focus on a greater array of DIF magnitudes and proportions of items than were included in the current study. In addition, researchers may also want to consider developing confidence intervals for these effect sizes. In particular, a bootstrap approach to estimating standard errors that could then be used to estimate confidence intervals would be useful in providing researchers with greater information about the actual impact of DIF on the total assessment. Given the popularity of polytomous items in the social sciences, the performance of these statistics should be examined for items beyond the dichotomous case presented here, with appropriate modifications made where necessary. In this same vein, future research should also include a greater variety of dichotomous data-generating models, such as the 2-parameter logistic and 3-parameter logistic. The items selected to have DIF in the current study tended to be relatively easy, with the most difficult having a population difficulty parameter value of 0.41. Further simulations with more difficult items exhibiting DIF would be useful for future studies. Work focused on a different set of DIF magnitude values would also be quite useful. Although featured in previous research [21], the DIF magnitudes selected for this study were relatively large. Therefore, it would be useful for practitioners if future research examined the sensitivity of these effect size statistics to cases when DIF was smaller (e.g., 0.1 or 0.2).

6. Conclusions

The results presented above demonstrate that there exist multiple useful statistics for characterizing the effect of DIF at the full assessment level. These effect sizes should be useful tools for researchers as they consider the impact of DIF on a scale in order to determine whether, in fact, it is problematic. These results show that three of the statistics studied here, τ̂w², IDIF²*, and d, may prove to be particularly useful. Therefore, measurement practitioners are encouraged to use them in practice, and psychometricians are encouraged to give them further study in order to develop guidelines for interpreting them. These effect sizes should serve as informative additions to DIF studies by providing information about the total impact of DIF on the total score taken from a scale. It is hoped that the current study has provided useful information in that regard, as well as a roadmap for future work in this area.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/psych5010013/s1. Table S1, Effect size values by sample size when DIF was present; Table S2, Effect size values by number of items when DIF was present.

Author Contributions

Conceptualization, B.F.F. and W.H.F.; methodology, B.F.F. and W.H.F.; software, B.F.F. and W.H.F.; validation, B.F.F. and W.H.F.; formal analysis, B.F.F. and W.H.F.; investigation, B.F.F. and W.H.F.; resources, B.F.F. and W.H.F.; data curation, B.F.F. and W.H.F.; writing—original draft preparation, B.F.F. and W.H.F.; writing—review and editing, B.F.F. and W.H.F.; visualization, B.F.F. and W.H.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article and the Supplementary Materials. The empirical data and associated R code are available upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. American Educational Research Association; American Psychological Association; National Council on Measurement in Education. Standards for Educational and Psychological Testing; American Educational Research Association: Washington, DC, USA, 2014.
2. Stark, S.; Chernyshenko, O.S.; Drasgow, F. Examining the Effects of Differential Item (Functioning and Differential) Test Functioning on Selection Decisions: When Are Statistically Significant Effects Practically Important? J. Appl. Psychol. 2004, 89, 497–508.
3. Badia, X.; Prieto, L.; Roset, M.; Díez-Pérez, A.; Herdman, M. Development of a short osteoporosis quality of life questionnaire by equating items from two existing instruments. J. Clin. Epidemiol. 2002, 55, 32–40.
4. Meade, A.W. A taxonomy of effect size measures for the differential functioning of items and scales. J. Appl. Psychol. 2010, 95, 728–743.
5. Nye, C.D.; Drasgow, F. Effect size indices for analyses of measurement equivalence: Understanding the practical importance of differences between groups. J. Appl. Psychol. 2011, 96, 966–980.
6. Steinberg, L.; Thissen, D. Using effect sizes for research reporting: Examples using item response theory to analyze differential item functioning. Psychol. Methods 2006, 11, 402–415.
7. Camilli, G.; Penfield, R.A. Variance estimation for Differential Test Functioning based on Mantel-Haenszel statistics. J. Educ. Meas. 1997, 34, 123–139.
8. Finch, W.H.; French, B.F.; Hernandez, M.F. Quantifying Item Invariance for the Selection of the Least Biased Assessment. J. Appl. Meas. 2019, 20, 13–26.
9. Hasselblad, V.; Hedges, L.V. Meta-analysis of screening and diagnostic tests. Psychol. Bull. 1995, 117, 167–178.
10. Zumbo, B.D.; Thomas, D.R. A Measure of Effect Size for a Model-Based Approach for Studying DIF; University of Northern British Columbia, Edgeworth Laboratory for Quantitative Behavioral Science: Prince George, BC, Canada, 1997.
11. Doebler, A. Looking at DIF from a new perspective: A structure-based approach acknowledging inherent indefinability. Appl. Psychol. Meas. 2019, 43, 303–321.
12. Nye, C.D.; Bradburn, J.; Olenick, J.; Bialko, C.; Drasgow, F. How big are my effects? Examining the magnitude of effect sizes in studies of measurement equivalence. Organ. Res. Methods 2019, 22, 678–709.
13. Gunn, H.J.; Grimm, K.J.; Edwards, M.C. Evaluation of six effect size measures of measurement non-invariance for continuous outcomes. Struct. Equ. Model. A Multidiscip. J. 2020, 27, 503–514.
14. Narayanan, P.; Swaminathan, H. Identification of items that show nonuniform DIF. Appl. Psychol. Meas. 1996, 20, 257–274.
15. Chalmers, R.P.; Counsell, A.; Flora, D.B. It might not make a big DIF: Improved Differential Test Functioning statistics that account for sampling variability. Educ. Psychol. Meas. 2016, 76, 114–140.
16. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria. Available online: https://www.R-project.org/ (accessed on 20 January 2021).
17. Wechsler, D. Wechsler Adult Intelligence Scale—Fourth Edition (WAIS-IV); Pearson: San Antonio, TX, USA, 2008.
18. Belzak, W.; Bauer, D.J. Improving the assessment of measurement invariance: Using regularization to select anchor items and identify differential item functioning. Psychol. Methods 2020, 25, 673–690.
19. Furlow, C.F.; Raiford Ross, T.; Gagné, P. The impact of multidimensionality on the detection of differential bundle functioning using the simultaneous item bias test. Appl. Psychol. Meas. 2009, 33, 441–464.
20. Woods, C.M. Empirical selection of anchors for tests of differential item functioning. Appl. Psychol. Meas. 2009, 33, 42–57.
21. Finch, W.H.; French, B.F. Detection of crossing differential item functioning: A comparison of four methods. Educ. Psychol. Meas. 2007, 67, 565–582.
22. Cole, S.R.; Kawachi, I.; Maller, S.J.; Berkman, L.F. Test of item-response bias in the CES-D scale: Experience from the New Haven EPESE study. J. Clin. Epidemiol. 2000, 53, 285–289.
23. Mak, K.K.; Young, K.S. Development and differential item functioning of the Internet Addiction Test-Revised (IAT-R): An item response theory approach. Cyberpsychol. Behav. Soc. Netw. 2020, 23, 312–328.
24. Murphy, S.; Elklit, A.; Chen, Y.Y.; Ghazali, S.R.; Shevlin, M. Sex differences in PTSD symptoms: A differential item functioning approach. Psychol. Trauma Theory Res. Pract. Policy 2019, 11, 319.
25. Candell, G.L.; Drasgow, F. An iterative procedure for linking metrics and assessing item bias in item response theory. Appl. Psychol. Meas. 1988, 12, 253–260.
26. Oshima, T.C.; Miller, M.D. Multidimensionality and item bias in item response theory. Appl. Psychol. Meas. 1992, 16, 237–248.
27. Berrío, Á.I.; Herrera, A.N.; Gómez-Benito, J. Effect of sample size ratio and model misfit when using the difficulty parameter differences procedure to detect DIF. J. Exp. Educ. 2019, 87, 367–383.
28. Rogers, H.J.; Swaminathan, H. A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Appl. Psychol. Meas. 1993, 17, 105–116.
29. Kim, S.-H.; Cohen, A.S. Detection of differential item functioning under the graded response model with the likelihood ratio test. Appl. Psychol. Meas. 1998, 22, 345–355.
30. DeMars, C. Item Response Theory; Oxford University Press: Oxford, UK, 2010.
31. Roussos, L.A.; Stout, W.F. Simulation studies of the effects of small sample size and studied item parameters on SIBTEST and Mantel-Haenszel Type I error performance. J. Educ. Meas. 1996, 33, 215–230.
32. Magis, D.; Beland, S.; Tuerlinckx, F.; De Boeck, P. A general framework and an R package for the detection of dichotomous differential item functioning. Behav. Res. Methods 2010, 42, 847–862.
Table 1. Study factors and conditions used in the simulation.

Factor                    Levels
Number of items           10, 20
Group sample sizes        250/250, 500/500, 1000/1000, 2000/2000, 500/250, 1000/250, 2000/250, 1000/500, 2000/500, 2000/1000
Group mean difference     0, −0.5, −1
Proportion of DIF items   0, 0.1, 0.2
Uniform DIF magnitude     0, 0.4, 0.8
DIF effect                Additive, Non-additive
Effect sizes              sDTF%, uDTF%, τ̂w², d, R̄Δ² from logistic regression, IDIF²*, SDIFV
Table 2. Data generating item difficulty parameter values for the reference group (20-item condition).

Item   Reference Group Item Difficulty
1      −1.92
2      −0.96
3      −0.55
4      −0.14
5      0.01
6      0.22
7      0.41
8      0.59
9      −0.60
10     0.23
11     0.77
12     0.81
13     0.92
14     0.93
15     1.02
16     1.15
17     1.30
18     1.37
19     1.68
20     1.74
Table 3. DTF effect size statistics and empirical standard errors when DIF was absent.

Statistic   Mean    Standard Error
IDIF²*      0.04    0.11
SDIFV       0.08    0.04
τ̂w²         0.003   0.02
R̄Δ²         0.001   0.003
d           0.01    0.02
uDTF%       2.08    2.88
sDTF%       −0.48   3.23
Table 4. Effect size values, empirical standard errors, and relative estimation bias by magnitude of DIF and proportion of items with DIF.

Mean effect size
Mag DIF   Prop DIF   IDIF²*   SDIFV   τ̂w²    R̄Δ²     d      uDTF%   sDTF%
0.4       0.1        0.57     0.12    0.02   0.004   0.10   13.40   −13.38
0.4       0.2        0.64     0.19    0.06   0.008   0.12   21.64   −17.59
0.8       0.1        0.61     0.22    0.05   0.011   0.11   22.04   −16.80
0.8       0.2        0.70     0.25    0.08   0.012   0.14   25.12   −18.85

Empirical standard error
0.4       0.1        0.08     0.06    0.02   0.01    0.02   2.18    2.94
0.4       0.2        0.11     0.05    0.01   0.01    0.02   2.78    3.20
0.8       0.1        0.09     0.06    0.02   0.01    0.02   2.66    3.17
0.8       0.2        0.07     0.07    0.02   0.01    0.02   2.60    3.15

Relative estimation bias
0.4       0.1        0.02     0.01    0.003  −0.06   −0.01  −0.02   −0.01
0.4       0.2        0.02     0.02    0.003  −0.08   −0.01  −0.01   −0.01
0.8       0.1        0.01     0.04    0.003  −0.07   −0.01  −0.01   −0.01
0.8       0.2        0.01     0.04    0.002  −0.08   −0.01  −0.02   −0.01
Table 5. Effect size values, empirical standard errors, and relative estimation bias by magnitude of DIF and type of DIF.

Mean effect size
Mag DIF   DIF Type   IDIF²*   SDIFV   τ̂w²    R̄Δ²     d      uDTF%   sDTF%
0.4       Add        0.63     0.18    0.04   0.007   0.12   18.80   −13.39
0.4       Non        0.58     0.13    0.04   0.005   0.11   17.59   −7.84
0.8       Add        0.73     0.35    0.09   0.02    0.14   22.98   −17.10
0.8       Non        0.57     0.14    0.09   0.02    0.11   21.41   −8.45

Empirical standard error
0.4       Add        0.10     0.05    0.02   0.005   0.02   2.56    3.21
0.4       Non        0.09     0.06    0.02   0.006   0.02   2.47    2.91
0.8       Add        0.09     0.05    0.02   0.006   0.02   2.76    3.34
0.8       Non        0.05     0.09    0.02   0.009   0.02   2.46    2.91

Relative estimation bias
0.4       Add        0.03     0.02    0.003  −0.07   −0.02  −0.02   −0.01
0.4       Non        0.01     0.01    0.003  −0.07   −0.01  −0.01   −0.01
0.8       Add        0.02     0.06    0.003  −0.08   −0.01  −0.02   −0.01
0.8       Non        0.01     0.02    0.003  −0.07   −0.01  −0.01   −0.01
Table 6. Effect size values, empirical standard errors, and relative estimation bias by impact and type of DIF.

Mean effect size
Impact   DIF Type   IDIF²*   SDIFV   τ̂w²    R̄Δ²     d      uDTF%   sDTF%
0        Add        0.66     0.24    0.06   0.01    0.13   18.34   −12.80
0        Non        0.61     0.17    0.06   0.01    0.12   15.30   −5.44
−0.5     Add        0.67     0.25    0.06   0.01    0.12   20.73   −17.07
−0.5     Non        0.59     0.14    0.05   0.01    0.12   19.01   −11.29
−1.0     Add        0.71     0.29    0.06   0.01    0.13   24.19   −20.68
−1.0     Non        0.53     0.10    0.05   0.01    0.10   23.59   −17.89

Empirical standard error
0        Add        0.09     0.06    0.02   0.005   0.02   2.45    3.38
0        Non        0.08     0.07    0.02   0.007   0.02   2.41    2.90
−0.5     Add        0.09     0.05    0.02   0.006   0.02   2.71    3.21
−0.5     Non        0.07     0.07    0.02   0.007   0.02   2.47    2.88
−1.0     Add        0.11     0.04    0.02   0.005   0.02   2.86    3.26
−1.0     Non        0.07     0.08    0.02   0.007   0.02   2.56    2.96

Relative estimation bias
0        Add        0.01     0.02    0.01   −0.06   −0.01  −0.01   −0.01
0        Non        0.01     0.03    0.01   −0.07   −0.01  −0.01   −0.01
−0.5     Add        0.03     0.05    0.01   −0.07   −0.02  −0.02   −0.01
−0.5     Non        −0.01    0.01    0.01   −0.08   −0.01  −0.01   −0.01
−1.0     Add        0.04     0.07    0.01   −0.08   −0.01  −0.03   −0.01
−1.0     Non        −0.02    −0.02   0.01   −0.06   −0.02  −0.01   −0.01
Table 7. Item difficulty estimates by group and item-level DIF results for the verbal aggression data.

Item   Female Item Difficulty   Male Item Difficulty   ETS Δ     MH α   Logistic Regression RΔ²
1      −1.26                    −1.11                  −1.25     1.70   0.007
2      −0.59                    −0.49                  −1.34     1.77   0.006
3      −0.11                    0.02                   −0.87     1.45   0.007
4      −1.85                    −1.46                  −1.56     1.94   0.020 *
5      −0.76                    −0.56                  −1.61     1.98   0.016
6      −0.18                    0.53                   −2.49 *   2.88   0.034 *
7      −0.50                    −0.64                  0.14      0.94   0.000
8      0.86                     0.16                   0.77      0.72   0.005
9      1.47                     1.74                   −1.00     1.53   0.008
10     −1.13                    −0.95                  −1.23     1.69   0.007
11     0.43                     0.09                   −0.20     1.09   0.000
12     0.96                     1.34                   −2.00 *   2.35   0.013
13     −1.11                    −1.65                  0.53      0.80   0.001
14     −0.18                    −1.11                  1.63      0.50   0.011 *
15     0.89                     0.83                   −0.38     1.18   0.002
16     −0.62                    −1.85                  2.67 *    0.32   0.024 *
17     0.34                     −0.87                  2.31 *    0.38   0.025 *
18     1.59                     1.17                   0.55      0.79   0.000
19     0.48                     −0.64                  1.82 *    0.46   0.023 *
20     1.78                     0.77                   1.76 *    0.47   0.017
21     3.16                     2.50                   1.06      0.64   0.003
22     −0.55                    −1.28                  1.03      0.64   0.006
23     0.57                     −0.20                  1.05      0.64   0.007
24     1.98                     2.08                   −1.11     1.61   0.004
* Statistically significant (α = 0.05) uniform DIF.
Table 8. DTF effect size statistics for the verbal aggression data.

Statistic   Full Verbal Aggression Data   Non-DIF Verbal Aggression Data
IDIF²*      0.78                          0.54
SDIFV       0.17                          0.08
τ̂w²         0.24                          0.05
R̄Δ²         0.01                          0.004
d           0.34                          0.01
uDTF%       62.46                         34.08
sDTF%       −36.29                        −16.53

