Abstract
Purpose
- This study examined the impact of differential item functioning (DIF) on ability estimation in a computerized adaptive testing (CAT) environment using real response data from the 2017 Korean Medical Licensing Examination (KMLE). We hypothesized that excluding gender-based DIF items would improve estimation accuracy, particularly for examinees at the extremes of the ability scale.
Methods
- The study was conducted in 2 steps: (1) DIF detection and (2) post-hoc simulation. The analysis used data from 3,259 examinees who completed all 360 dichotomous items. Gender-based DIF was detected with the residual-based DIF method (reference group: males; focal group: females). Two CAT conditions (all items vs. DIF-excluded) were compared against a “true θ” estimated from a fixed-form test of 264 non-DIF items. Accuracy was evaluated using bias, root mean square error (RMSE), and correlation with true θ.
Results
- In the CAT condition excluding DIF items, accuracy improved, with RMSE reduced and correlation with true θ increased. However, bias was slightly larger in magnitude. Gender-specific analyses showed that DIF removal reduced the underestimation of female ability but increased the underestimation of male ability, yielding estimates that were fairer across genders. When DIF items were included, estimation errors were more pronounced at both low and high ability levels.
Conclusion
- Managing DIF in CAT-based high-stakes examinations can enhance fairness and precision. Using real examinee data, this study provides practical evidence of the implications of DIF for CAT-based measurement and supports fairness-oriented test design.
Keywords: Computerized adaptive testing; Educational measurement; Psychometrics; Undergraduate medical education; Republic of Korea
Introduction
- Background
- Computerized adaptive testing (CAT) is widely recognized as an efficient and precise approach to ability estimation in educational and licensure examinations. It increases efficiency and accuracy by adapting item selection to an examinee’s ability, thereby reducing test length while maintaining measurement precision [1,2]. CAT also strengthens test security by lowering item exposure rates and enabling more frequent item reuse, which is particularly beneficial for high-stakes examinations with limited item pools.
- The validity of CAT depends on the stability and fairness of item characteristics across examinee groups. Differential item functioning (DIF) arises when examinees from different groups, but with the same underlying ability, have unequal probabilities of answering an item correctly. In CAT, biased items are especially problematic if presented early, as they can distort ability estimates and disproportionately affect subsequent item selection, raising fairness concerns in high-stakes testing [3,4].
- Several methods have been proposed for detecting DIF, including logistic regression, item response theory (IRT) likelihood ratio tests, and the CAT-adapted SIBTEST [3,4]. Although effective, these approaches are often resource-intensive. As an alternative, Lim et al. [5] introduced the residual-based DIF (RDIF) method, which compares group differences in item-level residuals without requiring separate calibrations or matching variables. RDIF offers several advantages: it is simple to compute, avoids group-specific calibrations or equating, and maintains competitive power while controlling Type I error [5,6]. Its efficiency makes it applicable to both fixed-form and adaptive testing contexts. RDIF consists of 3 statistics—RDIFR, RDIFS, and RDIFRS—that vary in sensitivity to different types of DIF. RDIFR is most effective for detecting uniform DIF, RDIFS is more sensitive to non-uniform DIF, and RDIFRS combines both indices to provide a robust overall test with well-controlled Type I error [5].
- Most prior research on CAT and DIF has emphasized detecting DIF within CAT environments, while relatively few studies have examined its direct impact on ability estimation. Seo [3] demonstrated through simulation that DIF items administered early in a CAT session can substantially bias ability estimates, especially for examinees at the low and high ends of the ability scale. Şahin Kürşad and Yalçın [7] reported that DIF can reduce measurement precision and distort test information under different item bank sizes and design conditions. Despite such findings, few empirical studies have validated the impact of DIF on CAT-based ability estimation using operational test data.
- Objectives
- This study evaluated the impact of gender-based DIF on CAT ability estimation using real response data from the 2017 Korean Medical Licensing Examination (KMLE). DIF items were identified with the RDIF method, and post-hoc CAT simulations were conducted under 2 conditions: including all items versus excluding DIF items. Ability estimates from both conditions were compared with a reference θ based on non-DIF items. We hypothesized that excluding DIF items would improve accuracy and reduce group-level bias, particularly for examinees at the extremes of the ability scale.
Methods
- Ethics statement
- This study did not require institutional review board approval because it used only de-identified secondary data. The data were collected by the Korea Health Personnel Licensing Examination Institute, and the authors used only examinees’ gender and item-level responses from the 2017 KMLE. No information that could identify individual examinees was included.
- Study design
- This was a post-hoc simulation study using real response data from the 2017 KMLE. The purpose was to evaluate the impact of gender-based DIF on ability estimation in a CAT environment.
- Setting
- This was a secondary analysis of the 2017 KMLE.
- Participants
- The dataset included complete responses from 3,259 examinees (2,071 males and 1,188 females) who sat for the 2017 KMLE. All examinees completed the same fixed-form test consisting of 360 dichotomously scored multiple-choice items. No additional inclusion or exclusion criteria were applied beyond having complete gender and item-level response data.
- Variables
- The outcome variables were RDIF values, bias, root mean square error (RMSE), the Pearson correlation between estimated and true θ, and average test length.
- Data sources
- The 360 KMLE items were scored as correct (1) or incorrect (0). For this study, only gender and item-level responses were analyzed. No personal identifiers were present in the dataset. The raw response data are available in Dataset 1.
- Measurement
- This study used item parameters and gender-based DIF classifications derived from the real response data of 3,259 examinees from the 2017 KMLE. For the post-hoc CAT simulations, each examinee's true θ was first estimated from the 264 DIF-excluded items of the fixed-form test. Response patterns under the “all-items” and “DIF-excluded” CAT conditions were then simulated from these true θ values and the respective item parameters under the Rasch model. This approach provided a controlled environment for assessing the impact of DIF while grounding item characteristics and examinee abilities in real operational data.
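- To make the response-generation step concrete, the following minimal R sketch (not the authors' code) simulates dichotomous responses under the Rasch model from fixed true θ values; `true_theta` and `b` are random stand-ins for the EAP estimates from the DIF-free items and the calibrated item difficulties.

```r
# Sketch: Rasch-model response generation from fixed "true" abilities.
# `true_theta` and `b` are hypothetical stand-ins, not the KMLE values.
set.seed(2017)
true_theta <- rnorm(5, mean = 0.9, sd = 0.55)  # stand-in true abilities
b <- c(-1.2, -0.4, 0.0, 0.6, 1.5)              # stand-in item difficulties

# Rasch model: P(correct) = exp(theta - b) / (1 + exp(theta - b))
p <- plogis(outer(true_theta, b, "-"))

# Draw a 0/1 response for every examinee-item pair
responses <- matrix(rbinom(length(p), 1, p), nrow = length(true_theta))
responses
```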
- Item parameter estimates and DIF classifications for all 360 items are provided in Supplement 1; note that the order of items in Supplement 1 does not correspond to the original test booklet sequence.
- Bias
- This study examined DIF as a source of measurement bias in ability estimation. The definition of “true θ” based on DIF-excluded items served as a crucial reference point to quantify the magnitude and direction of estimation bias under different CAT conditions. Potential biases inherent in post-hoc simulation studies—such as the idealized nature of simulated response patterns compared with real-world examinee behavior and operational CAT complexities—are acknowledged as limitations of this methodology.
- Study size
- This study utilized the complete dataset of 3,259 examinees from the 2017 KMLE. Because the study was based on post-hoc simulations using an existing dataset rather than prospective sampling, no a priori sample size calculation or post-hoc power analysis was conducted. The objective was to assess the impact of DIF across the entire operational examinee cohort.
- Statistical methods
- The first step was DIF detection. Gender-based DIF was detected using the RDIF method [5,6], with males as the reference group and females as the focal group. Because the items were dichotomously scored, we employed the Rasch model [8], which fixes item discrimination and is suitable for analyzing the KMLE data. However, because the Rasch model does not estimate discrimination parameters, it is not suitable for detecting non-uniform DIF. Accordingly, we applied the RDIFR statistic, which is most appropriate for detecting uniform DIF, as the criterion in this study. RDIFR for item j is defined as:
$$\mathrm{RDIF}_{R_j} = \bar{r}_{jF} - \bar{r}_{jR} = \frac{1}{n_F}\sum_{i=1}^{n_F} r_{ijF} - \frac{1}{n_R}\sum_{i=1}^{n_R} r_{ijR}$$
- where $r_{ijF}=x_{ijF}-P_{ijF}$ and $r_{ijR}=x_{ijR}-P_{ijR}$ are the raw residuals between the observed binary response (0=incorrect, 1=correct) of examinee $i$ on item $j$ and the model-predicted probability of a correct response ($P$), for the focal ($F$; females) and reference ($R$; males) groups of sizes $n_F$ and $n_R$. RDIFR asymptotically follows a normal distribution; therefore, a 2-tailed Z-test was conducted, and items were flagged as exhibiting DIF when the RDIFR statistic was significant at the α=0.05 level [5].
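- As an illustration only (the actual detection was carried out with the irtQ package), the following R sketch computes RDIFR and its Z-test for a single item, using the model-implied Bernoulli variance of the residuals as a plug-in standard error; see Lim et al. [5] for the formal derivation. All inputs (`x`, `theta`, `b_j`, `focal`) are hypothetical.

```r
# Sketch: RDIF_R for one item under the Rasch model (hypothetical inputs).
rdif_r <- function(x, theta, b_j, focal) {
  p <- plogis(theta - b_j)                  # model-predicted P(correct)
  r <- x - p                                # raw residuals
  stat <- mean(r[focal]) - mean(r[!focal])  # focal minus reference mean residual
  # Plug-in SE from the model-implied Bernoulli variance P(1 - P)
  se <- sqrt(sum(p[focal] * (1 - p[focal])) / sum(focal)^2 +
             sum(p[!focal] * (1 - p[!focal])) / sum(!focal)^2)
  z <- stat / se
  c(RDIF_R = stat, z = z, p_value = 2 * pnorm(-abs(z)))
}

# Toy check: inject uniform DIF favoring the focal group
set.seed(1)
theta <- rnorm(1000)
focal <- rep(c(TRUE, FALSE), each = 500)
x <- rbinom(1000, 1, plogis(theta - 0.2 + 0.3 * focal))
rdif_r(x, theta, b_j = 0.2, focal = focal)
```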
- The next step was post-hoc simulation. Two post‐hoc CAT simulations were conducted. In the all‐items condition, all 360 items were available for selection. In the DIF‐excluded condition, only the 264 items not flagged for DIF were available.
- In both conditions, the CAT began with an initial ability estimate of 0.00. The minimum expected posterior variance (MEPV) method was used for item selection [9], and ability was estimated using the expected a posteriori (EAP) method after each item administration [10]. These methods were chosen because Seo and Choi [11] demonstrated that, under the Rasch model, the combination of MEPV item selection and EAP scoring produced the highest estimation accuracy in a post-hoc CAT simulation study of the KMLE. The stopping rule was a standard error of measurement less than 0.3. Item exposure was unrestricted, as the goal was to isolate the effect of DIF rather than optimize item pool usage.
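- The following self-contained sketch shows this design with the catR package (Rasch item bank, MEPV selection, EAP scoring, and the SEM&lt;0.3 stopping rule). The item bank is a random stand-in whose difficulties mirror the reported b distribution, not the operational KMLE bank.

```r
library(catR)

# Stand-in Rasch item bank in catR's (a, b, c, d) format
set.seed(42)
bank <- cbind(a = rep(1, 360),
              b = rnorm(360, mean = -0.465, sd = 1.694),  # mirrors Table 1
              c = rep(0, 360),
              d = rep(1, 360))

res <- randomCAT(trueTheta = 0.8,                  # one hypothetical examinee
                 itemBank  = bank,
                 start = list(theta = 0),          # initial estimate of 0.00
                 test  = list(method = "EAP", itemSelect = "MEPV"),
                 stop  = list(rule = "precision", thr = 0.3),  # SEM < 0.3
                 final = list(method = "EAP"))

res$thFinal            # final EAP ability estimate
length(res$testItems)  # number of items administered
```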
- The accuracy of CAT ability estimates was evaluated using bias, RMSE, the Pearson correlation between estimated and true θ, and average test length, defined as follows:

$$\mathrm{Bias}=\frac{1}{N}\sum_{i=1}^{N}\left(\hat{\theta}_i-\theta_i\right),\qquad \mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{\theta}_i-\theta_i\right)^2},\qquad \rho=\operatorname{cor}\left(\hat{\theta},\theta\right)$$

- Bias quantifies the average signed difference between estimated ($\hat{\theta}_i$) and true ($\theta_i$) ability scores across all $N$ examinees. A positive bias indicates overestimation, while a negative bias indicates underestimation.
- RMSE reflects the overall magnitude of estimation error, incorporating both bias and variability. Lower RMSE values indicate greater precision in ability estimation.
- The Pearson correlation coefficient (ρ) measures the strength of the linear relationship between estimated and true abilities. Values closer to 1 indicate stronger alignment between CAT estimates and the reference scores. For all calculations, the true θ for each examinee was defined as the ability estimate derived from the 264 DIF-excluded items. Finally, average test length was calculated as the mean number of items administered until the stopping criterion was reached.
- These outcomes were computed both overall and separately for the reference and focal groups, and were also analyzed across ability levels to assess differential impact.
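- A small R helper of the kind used for these summaries is sketched below; `est`, `true`, and `female` are hypothetical vectors standing in for the CAT estimates, reference θ values, and gender indicator.

```r
# Sketch: accuracy summaries for CAT estimates (hypothetical inputs).
accuracy <- function(est, true) {
  c(bias = mean(est - true),            # signed error; + means overestimation
    rmse = sqrt(mean((est - true)^2)),  # overall error magnitude
    r    = cor(est, true))              # Pearson correlation with true theta
}

# Overall and by group, given a logical `female` indicator:
# rbind(total  = accuracy(est, true),
#       male   = accuracy(est[!female], true[!female]),
#       female = accuracy(est[female],  true[female]))
```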
- All analyses were conducted using R software ver. 4.4.1 (R Core Team) [12]. CAT simulations were implemented using the catR package [13], and DIF detection was carried out using the irtQ package [14]. Supplement 2 provides a simplified R code snippet illustrating the core CAT simulation logic and the use of the catR package. The example generates responses for a small number of hypothetical examinees to demonstrate the functions applied in the full-scale simulation of 3,259 examinees.
Results
- DIF detection
- Table 1 and Fig. 1 present the distribution of item difficulty parameters for all items, DIF-free items, and DIF items.
- The item bank contained 360 dichotomous items, of which 96 were identified as exhibiting gender-based DIF using the RDIF method. Under the Rasch model, the mean item difficulty for the full bank was –0.465 (standard deviation [SD]=1.694; range, –5.697 to 6.770). Most DIF items were concentrated around the mid-difficulty range, with relatively few at the extremes. As a result, when these items were excluded, the remaining DIF-free bank (n=264) showed a slightly easier mean difficulty (–0.713) and a wider spread of difficulties (SD=1.756), since the exclusion disproportionately removed mid-difficulty items.
- In the paper-based test (PBT), females had higher mean ability estimates than males in both the all-items condition (1.026 vs. 0.878) and the DIF-free condition (1.070 vs. 0.851). The gender gap widened slightly after DIF removal, with females showing an increase (+0.044) and males a decrease (–0.027). This pattern suggests that DIF items tended to underestimate female ability and overestimate male ability. Accordingly, PBT estimates based on the DIF-free items were considered the least biased measure of ability and were used as the reference (“true θ”) in subsequent CAT analyses (Table 2).
- CAT simulation
- Post-hoc CAT simulations were conducted under 2 conditions: all items versus DIF-free items, using the Rasch model, MEPV item selection, and EAP scoring. The DIF-free condition showed modest improvements, with RMSE reduced by 0.051 and correlation increased by 0.058 compared with the all-items condition. Although bias shifted slightly (–0.095 vs. –0.082), the change was small, suggesting that overall estimation error was reduced without introducing meaningful systematic distortion (Table 3, Fig. 2).
- Gender-specific analyses showed that for males, bias increased from –0.050 (all-items) to –0.081 (DIF-free), while RMSE decreased from 0.330 to 0.283 and correlation increased from 0.833 to 0.890. For females, bias decreased from –0.137 to –0.118, RMSE decreased from 0.336 to 0.277, and correlation increased from 0.841 to 0.898 (Figs. 3, 4).
- The average test length was longer in the DIF-free condition (49.820 items, SD=4.896) than in the all-items condition (48.789 items, SD=2.020). The increase was observed for both genders, but was larger for females (Δ=1.242) than for males (Δ=0.910) (Fig. 5).
- Figs. 2–5 illustrate how DIF removal affected performance across the ability spectrum. Bias and RMSE patterns across ability bins showed minimal error near the average ability level and larger errors at the extremes under both conditions. In the high-ability bin (≥2.0), bias was higher in the DIF-free condition, but RMSE was lower. Correlation with true θ peaked in the mid-ability range and decreased toward the extremes. For examinees with θ≤0, the all-items condition showed slightly higher correlations than the DIF-free condition.
- Mean test length was consistently greater in the DIF-free condition across most ability bins, with the largest increases observed at the low and high ends of the ability scale. Detailed numerical results for bias, RMSE, correlation, and test length by ability level (corresponding to Figs. 2–5) are available in Table 4.
Discussion
- Key results
- This study examined the impact of removing gender-based DIF items on ability estimation in CAT using real data from the 2017 KMLE. Gender-based DIF was detected with the RDIF method, identifying 96 DIF items out of 360. Most DIF items were of mid-level difficulty, with very few at the extremes. Removing these items produced a slightly easier overall difficulty level and greater variability (SD) in the remaining item pool. In the post-hoc CAT simulations, the DIF-free condition demonstrated reduced overall error, as reflected in lower RMSE and higher correlations with the reference measure. At the same time, bias shifted modestly in direction, highlighting the trade-off between minimizing random error and introducing a small but consistent offset. Mean test length increased in the DIF-free condition, reflecting the reduced and less optimally distributed item pool after DIF removal.
- Interpretation
- The finding that DIF-free CAT improved accuracy, even if only modestly, underscores the importance of eliminating items that unfairly advantage or disadvantage certain groups. These benefits were most evident at the extremes of the ability scale, where DIF had the greatest impact on bias in the all-items CAT.
- Bias and RMSE diverged for high-ability examinees (≥2.0), with higher bias but lower RMSE in the DIF-free condition. This reflects the distinction between bias, which captures directional error, and RMSE, which reflects overall error magnitude. Thus, estimates may shift slightly in one direction while still being more consistent overall.
- Gender-specific results further clarify the effect of DIF removal. In the all-items condition, female ability was substantially underestimated (bias=–0.137) compared with males (bias=–0.050). After removing DIF items, underestimation decreased for females (bias=–0.118) but increased for males (bias=–0.081). This pattern suggests that DIF disproportionately disadvantaged female examinees by lowering their ability estimates, while slightly inflating male estimates. Removing DIF items reduced this imbalance, producing estimates that were fairer across genders.
- Correlation with true θ was generally highest in the mid-ability range and declined toward the extremes. For examinees with θ≤0, the all-items condition showed slightly higher correlations than the DIF-free condition. This may reflect the advantage of a larger item pool in better targeting low-ability examinees, even though it contained biased items, while the DIF-free pool had fewer items suited to this range.
- The increase in test length in the DIF-free condition can be attributed to 2 factors: (1) the reduction in the total number of items after DIF removal, and (2) the disproportionate removal of mid-difficulty items, which are most informative for examinees near the average ability level. This reduction in optimal targeting required the administration of more items to meet the stopping criterion, particularly at the extremes of the ability distribution.
- Overall, these findings suggest that while removing DIF items enhances fairness and validity—especially by reducing gender-related bias—it also reduces testing efficiency. In operational CAT settings, this trade-off should be carefully managed, potentially through replenishing the pool with unbiased items that match the difficulty ranges of those removed.
- Limitations
- This study analyzed data from a single high-stakes examination in Korea, relied solely on the RDIF method for DIF detection, and used post-hoc CAT simulations that may not fully reflect operational CAT administration. While valuable for controlled analysis, post-hoc simulations inherently introduce limitations. For example, examinees’ adaptive behavior and the dynamic nature of item exposure in live CAT administrations are simplified, which may lead to an overestimation of accuracy or an underestimation of certain biases compared with real-world testing environments.
- Generalizability
- Findings from this study, based on a large dataset from the KMLE, provide insights relevant to other high-stakes educational and licensure examinations employing or considering CAT. Although the context is the KMLE, the psychometric principles concerning the impact of DIF on ability estimation are broadly generalizable to health professions education. The demonstrated trade-off between fairness/accuracy and test efficiency is an important consideration for test developers worldwide.
- Suggestion for further studies
- Future research should apply multiple DIF detection methods, explore different IRT models and item selection algorithms, and investigate replenishing the pool with unbiased items that target the difficulty ranges most affected by DIF removal. Such strategies could help preserve CAT efficiency while enhancing fairness and validity in high-stakes assessments.
- Conclusion
- Removing gender-based DIF items modestly improved fairness and accuracy in CAT, though at the cost of longer test lengths due to a smaller and less balanced item pool. By analyzing data from actual test-takers, this study demonstrates the real-world impact of DIF on CAT and provides evidence supporting the development of fairer examinations.
Authors’ contributions
Conceptualization: DGS. Data curation: DGS, DK, JC. Methodology/formal analysis/validation: DGS, DK, JC. Funding acquisition: DGS. Writing–original draft: DK. Writing–review & editing: DGS, JC.
Conflict of interest
No potential conflict of interest relevant to this article was reported.
Funding
This study was supported by a research grant from Hallym University (HRF-202504-006).
Data availability
Data files are available from https://doi.org/10.7910/DVN/PETWZF
Dataset 1. Raw item-level responses and gender data from the 2017 KMLE.
Acknowledgments
None.
Supplementary materials
Supplementary files are available from https://doi.org/10.7910/DVN/PETWZF
Fig. 1. Distribution of item difficulty parameters for all items (A), differential item functioning (DIF)-free items (B), and DIF items (C).
Fig. 2. Bias across ability levels in computerized adaptive testing simulations, comparing performance with all items, including those with differential item functioning (DIF), versus DIF-excluded items.
Fig. 3. Root mean square error (RMSE) across ability levels in computerized adaptive testing (CAT) simulations using all items and differential item functioning (DIF)-free items.
Fig. 4. Correlation between estimated and true θ across ability levels in computerized adaptive testing (CAT) simulations.
Fig. 5. Mean test length across ability levels in computerized adaptive testing (CAT) simulations using all items and differential item functioning (DIF)-free items.
Table 1. Item parameter (b) summary for all items, DIF items, and DIF-free items

| Statistic | All items | DIF items | DIF-free items |
|---|---|---|---|
| No. of items | 360 | 96 | 264 |
| Mean | –0.465 | 0.215 | –0.713 |
| Standard deviation | 1.694 | 1.293 | 1.756 |
| Maximum | 6.770 | 3.156 | 6.770 |
| Minimum | –5.697 | –3.201 | –5.697 |

DIF, differential item functioning.
Table 2. Ability estimates from paper-based testing using all items versus DIF-free items, by gender

| Group | All items | DIF-free items | Δ Mean |
|---|---|---|---|
| Male | 0.878±0.521 | 0.851±0.586 | 0.027 |
| Female | 1.026±0.506 | 1.070±0.564 | –0.044 |
| All | 0.932±0.520 | 0.931±0.587 | 0.001 |

Values are presented as mean±standard deviation. Δ Mean=all-items mean−DIF-free mean. DIF, differential item functioning.
Table 3. Bias, RMSE, correlation, and mean (SD) test length in CAT simulations using all items versus DIF-free items, by group

| Group | Bias: all items | Bias: DIF-free | Δ | RMSE: all items | RMSE: DIF-free | Δ | Correlation: all items | Correlation: DIF-free | Δ | Test length: all items | Test length: DIF-free | Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Total | –0.082 | –0.095 | –0.013 | 0.332 | 0.281 | –0.051 | 0.838 | 0.895 | 0.058 | 48.789±2.020 | 49.820±4.896 | 1.031 |
| Male | –0.050 | –0.081 | –0.031 | 0.330 | 0.283 | –0.047 | 0.833 | 0.890 | 0.057 | 48.721±2.071 | 49.631±3.407 | 0.910 |
| Female | –0.137 | –0.118 | 0.019 | 0.336 | 0.277 | –0.058 | 0.841 | 0.898 | 0.057 | 48.908±1.923 | 50.151±6.737 | 1.242 |

Test length is presented as mean±SD; Δ=DIF-free value−all-items value. RMSE, root mean square error; SD, standard deviation; CAT, computerized adaptive testing; DIF, differential item functioning.
Table 4. Ability-level comparison of bias, RMSE, correlation, and test length between all items and DIF-free items in CAT simulations

| θ Bin | No. | All items: bias | All items: RMSE | All items: correlation | All items: test length | DIF-free: bias | DIF-free: RMSE | DIF-free: correlation | DIF-free: test length |
|---|---|---|---|---|---|---|---|---|---|
| ≤–1.0 | 22 | 0.412 | 0.670 | 0.586 | 49.045 | –0.074 | 0.459 | 0.588 | 49.364 |
| –1.0 to –0.5 | 40 | 0.382 | 0.544 | 0.249 | 47.675 | 0.007 | 0.426 | 0.121 | 48.125 |
| –0.5 to 0.0 | 96 | 0.154 | 0.384 | 0.241 | 47.958 | 0.001 | 0.326 | 0.167 | 48.146 |
| 0.0 to 0.5 | 358 | 0.151 | 0.354 | 0.229 | 47.908 | 0.046 | 0.305 | 0.313 | 48.425 |
| 0.5 to 1.0 | 1,016 | 0.003 | 0.275 | 0.299 | 48.288 | –0.024 | 0.248 | 0.346 | 48.905 |
| 1.0 to 1.5 | 1,134 | –0.171 | 0.313 | 0.364 | 48.751 | –0.146 | 0.271 | 0.485 | 49.505 |
| 1.5 to 2.0 | 505 | –0.283 | 0.390 | 0.466 | 49.838 | –0.228 | 0.310 | 0.526 | 51.487 |
| ≥2.0 | 88 | –0.287 | 0.387 | 0.566 | 53.977 | –0.218 | 0.277 | 0.742 | 63.284 |

RMSE, root mean square error; CAT, computerized adaptive testing; DIF, differential item functioning.
References
- 1. van der Linden WJ, Glas CA. Computerized adaptive testing: theory and practice. Springer Science & Business Media; 2000.
- 2. Weiss DJ, Sahin A. Computerized adaptive testing: from concept to implementation. Guilford Press; 2024.
- 3. Seo M. The impact of DIF on ability estimation in computerized adaptive testing: a simulation study. Korean J Educ Eval 2022;35:1-22. https://doi.org/10.31158/JEEV.2022.35.1.1
- 4. Nandakumar R, Roussos L. Evaluation of the CATSIB DIF procedure in a pretest setting. J Educ Behav Stat 2004;29:177-199. https://doi.org/10.3102/10769986029002177
- 5. Lim H, Choe EM, Han KT. A residual-based differential item functioning detection framework in item response theory. J Educ Meas 2022;59:80-104. https://doi.org/10.1111/jedm.12313
- 6. Lim H, Choe EM. Detecting differential item functioning in CAT using IRT residual DIF approach. J Educ Meas 2023;60:626-650. https://doi.org/10.1111/jedm.12366
- 7. Şahin Kürşad M, Yalçın S. Effect of differential item functioning on computer adaptive testing under different conditions. Appl Psychol Meas 2024;48:303-322. https://doi.org/10.1177/01466216241284295
- 8. Rasch G. Probabilistic models for some intelligence and attainment tests. Danish Institute for Educational Research; 1960.
- 9. Owen RJ. A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. J Am Stat Assoc 1975;70:351-356. https://doi.org/10.2307/2285821
- 10. Bock RD, Mislevy RJ. Adaptive EAP estimation of ability in a microcomputer environment. Appl Psychol Meas 1982;6:431-444. https://doi.org/10.1177/014662168200600405
- 11. Seo DG, Choi J. Post-hoc simulation study of computerized adaptive testing for the Korean Medical Licensing Examination. J Educ Eval Health Prof 2018;15:14. https://doi.org/10.3352/jeehp.2018.15.14
- 12. R Core Team. R: a language and environment for statistical computing [Internet]. R Foundation for Statistical Computing; 2024 [cited 2025 Aug 10]. Available from: https://www.R-project.org/
- 13. Magis D, Raiche G. Random generation of response patterns under computerized adaptive testing with the R package catR. J Stat Softw 2012;48:1-31. https://doi.org/10.18637/jss.v048.i08
- 14. Lim H, Kang K. The irtQ R package: a user-friendly tool for item response theory-based test data analysis and calibration. J Educ Eval Health Prof 2024;21:23. https://doi.org/10.3352/jeehp.2024.21.23