
JEEHP : Journal of Educational Evaluation for Health Professions

Research article
The impact of differential item functioning on ability estimation using the Korean Medical Licensing Examination with computerized adaptive testing: a post-hoc simulation study
Dogyeong Kim, Jeongwook Choi, Dong Gi Seo*

DOI: https://doi.org/10.3352/jeehp.2025.22.31
Published online: October 10, 2025

Department of Psychology, College of Social Science, Hallym University, Chuncheon, Korea

*Corresponding email: dgseo@hallym.ac.kr

Editor: A Ra Cho, The Catholic University of Korea, Korea

• Received: August 21, 2025   • Accepted: September 22, 2025

© 2025 Korea Health Personnel Licensing Examination Institute

This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

  • Purpose
    This study examined the impact of differential item functioning (DIF) on ability estimation in a computerized adaptive testing (CAT) environment using real response data from the 2017 Korean Medical Licensing Examination (KMLE). We hypothesized that excluding gender-based DIF items would improve estimation accuracy, particularly for examinees at the extremes of the ability scale.
  • Methods
    The study was conducted in 2 steps: (1) DIF detection and (2) post-hoc simulation. The analysis used data from 3,259 examinees who completed all 360 dichotomous items. Gender-based DIF was detected with the residual-based DIF method (reference group: males; focal group: females). Two CAT conditions (all items vs. DIF-excluded) were compared against a “true θ” estimated from a fixed-form test of 264 non-DIF items. Accuracy was evaluated using bias, root mean square error (RMSE), and correlation with true θ.
  • Results
    In the CAT condition excluding DIF items, accuracy improved, with RMSE reduced and correlation with true θ increased. However, bias was slightly larger in magnitude. Gender-specific analyses showed that DIF removal reduced the underestimation of female ability but increased the underestimation of male ability, yielding estimates that were fairer across genders. When DIF items were included, estimation errors were more pronounced at both low and high ability levels.
  • Conclusion
    Managing DIF in CAT-based high-stakes examinations can enhance fairness and precision. Using real examinee data, this study provides practical evidence of the implications of DIF for CAT-based measurement and supports fairness-oriented test design.
Background
Computerized adaptive testing (CAT) is widely recognized as an efficient and precise approach to ability estimation in educational and licensure examinations. It increases efficiency and accuracy by adapting item selection to an examinee’s ability, thereby reducing test length while maintaining measurement precision [1,2]. CAT also strengthens test security by lowering item exposure rates and enabling more frequent item reuse, which is particularly beneficial for high-stakes examinations with limited item pools.
The validity of CAT depends on the stability and fairness of item characteristics across examinee groups. Differential item functioning (DIF) arises when examinees from different groups, but with the same underlying ability, have unequal probabilities of answering an item correctly. In CAT, biased items are especially problematic if presented early, as they can distort ability estimates and disproportionately affect subsequent item selection, raising fairness concerns in high-stakes testing [3,4].
Several methods have been proposed for detecting DIF, including logistic regression, item response theory (IRT) likelihood ratio tests, and the CAT-adapted SIBTEST [3,4]. Although effective, these approaches are often resource-intensive. As an alternative, Lim et al. [5] introduced the residual-based DIF (RDIF) method, which compares group differences in item-level residuals without requiring separate calibrations or matching variables. RDIF offers several advantages: it is simple to compute, avoids group-specific calibrations or equating, and maintains competitive power while controlling Type I error [5,6]. Its efficiency makes it applicable to both fixed-form and adaptive testing contexts. RDIF consists of 3 statistics—RDIFR, RDIFS, and RDIFRS—that vary in sensitivity to different types of DIF. RDIFR is most effective for detecting uniform DIF, RDIFS is more sensitive to non-uniform DIF, and RDIFRS combines both indices to provide a robust overall test with well-controlled Type I error [5].
Most prior research on CAT and DIF has emphasized detecting DIF within CAT environments, while relatively few studies have examined its direct impact on ability estimation. Seo [3] demonstrated through simulation that DIF items administered early in a CAT session can substantially bias ability estimates, especially for examinees at the low and high ends of the ability scale. Şahin Kürşad and Yalçın [7] reported that DIF can reduce measurement precision and distort test information under different item bank sizes and design conditions. Despite such findings, few empirical studies have validated the impact of DIF on CAT-based ability estimation using operational test data.
Objectives
This study evaluated the impact of gender-based DIF on CAT ability estimation using real response data from the 2017 Korean Medical Licensing Examination (KMLE). DIF items were identified with the RDIF method, and post-hoc CAT simulations were conducted under 2 conditions: including all items versus excluding DIF items. Ability estimates from both conditions were compared with a reference θ based on non-DIF items. We hypothesized that excluding DIF items would improve accuracy and reduce group-level bias, particularly for examinees at the extremes of the ability scale.
Ethics statement
This study did not require institutional review board approval because it used only de-identified secondary data. The data were collected by the Korea Health Personnel Licensing Examination Institute, and the authors used only examinees’ gender and item-level responses from the 2017 KMLE. No information that could identify individual examinees was included.
Study design
This was a post-hoc simulation study using real response data from the 2017 KMLE. The purpose was to evaluate the impact of gender-based DIF on ability estimation in a CAT environment.
Setting
This was a secondary analysis of the 2017 KMLE.
Participants
The dataset included complete responses from 3,259 examinees (2,071 males and 1,188 females) who sat for the 2017 KMLE. All examinees completed the same fixed-form test consisting of 360 dichotomously scored multiple-choice items. No additional inclusion or exclusion criteria were applied beyond having complete gender and item-level response data.
Variables
The outcome variables were RDIF values, bias, root mean square error (RMSE), and the Pearson correlation between estimated and true θ.
Data sources
The 360 KMLE items were scored as correct (1) or incorrect (0). For this study, only gender and item-level responses were analyzed. No personal identifiers were present in the dataset. The raw response data are available in Dataset 1.
Measurement
This study used item parameters and gender-based DIF classifications derived from the real response data of 3,259 examinees from the 2017 KMLE. For the post-hoc CAT simulations, examinee abilities (true θ) were first established by estimating their abilities from the 264 DIF-excluded items of the fixed-form test. Subsequently, response patterns under the “all-items” and “DIF-excluded” CAT conditions were simulated using these established true θ values and the respective item parameters under the Rasch model. This approach allowed for a controlled environment to assess the impact of DIF while grounding item characteristics and examinee abilities in real operational data.
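As a concrete illustration of this response-generation step, the Rasch probability and a simulated 0/1 response matrix can be sketched in a few lines (a minimal Python analogue of the R-based procedure; the θ and difficulty values shown are hypothetical, not the KMLE parameters):

```python
import numpy as np

rng = np.random.default_rng(7)  # fixed seed for reproducibility

def rasch_prob(theta, b):
    """P(correct) under the Rasch model: logistic in (theta - b)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def simulate_responses(thetas, difficulties, rng):
    """Simulate an examinee-by-item 0/1 response matrix under the Rasch model."""
    p = rasch_prob(thetas[:, None], difficulties[None, :])
    return (rng.random(p.shape) < p).astype(int)

# Hypothetical abilities and item difficulties, for illustration only
thetas = np.array([-1.0, 0.0, 1.5])
difficulties = np.array([-0.7, 0.0, 0.8, 2.0])
responses = simulate_responses(thetas, difficulties, rng)  # shape (3, 4)
```

In the study itself, the true θ values were the fixed-form estimates from the 264 DIF-excluded items, and the item parameters were the operational Rasch calibrations.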
We provided item parameter estimates and DIF classifications for all 360 items in Supplement 1. The order of items in Supplement 1 does not correspond to the original test booklet sequence.
Bias
This study examined DIF as a source of measurement bias in ability estimation. The definition of “true θ” based on DIF-excluded items served as a crucial reference point to quantify the magnitude and direction of estimation bias under different CAT conditions. Potential biases inherent in post-hoc simulation studies—such as the idealized nature of simulated response patterns compared with real-world examinee behavior and operational CAT complexities—are acknowledged as limitations of this methodology.
Study size
This study utilized the complete dataset of 3,259 examinees from the 2017 KMLE. Because the study was based on post-hoc simulations using an existing dataset rather than prospective sampling, no a priori sample size calculation or post-hoc power analysis was conducted. The objective was to assess the impact of DIF across the entire operational examinee cohort.
Statistical methods
The first step was DIF detection. Gender-based DIF was detected using the RDIF method [5,6], with males as the reference group and females as the focal group. Because the items were dichotomously scored, we employed the Rasch model [8], which fixes item discrimination and is suitable for analyzing the KMLE data. However, because the Rasch model does not estimate discrimination parameters, it is not suitable for detecting non-uniform DIF. Accordingly, we applied the RDIFR statistic, which is most appropriate for detecting uniform DIF, as the criterion in this study. RDIFR for item j is defined as:
$$\mathrm{RDIF}_{R} = \frac{1}{n_{F}}\sum_{i=1}^{n_{F}} r_{iF} \;-\; \frac{1}{n_{R}}\sum_{j=1}^{n_{R}} r_{jR}$$
where $r_{iF} = x_{iF} - P_{iF}$ and $r_{jR} = x_{jR} - P_{jR}$ are the raw residuals between the observed binary response ($x$; 0=incorrect, 1=correct) and the model-predicted probability of a correct response ($P$), computed for examinee $i$ in the focal group ($n_F$ examinees) and examinee $j$ in the reference group ($n_R$ examinees). RDIFR asymptotically follows a normal distribution; therefore, the 2-tailed Z-test was conducted, and items were flagged as exhibiting DIF when the RDIFR statistic was significant at the α=0.05 level [5].
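For readers who want the mechanics, the statistic can be sketched as a difference of group mean residuals with a large-sample Z-test (an illustrative Python simplification; the study used the irtQ package in R, and the exact standard-error expression in Lim et al. [5] may differ from the pooled form assumed here):

```python
import numpy as np
from math import erf, sqrt

def rdif_r(x_focal, p_focal, x_ref, p_ref):
    """Uniform-DIF statistic: mean residual (observed - predicted) in the
    focal group minus the mean residual in the reference group."""
    r_f = np.asarray(x_focal, float) - np.asarray(p_focal, float)
    r_r = np.asarray(x_ref, float) - np.asarray(p_ref, float)
    stat = r_f.mean() - r_r.mean()
    # Pooled large-sample variance of a difference of two independent means
    # (an assumed simplification of the published RDIF derivation)
    se = sqrt(r_f.var(ddof=1) / len(r_f) + r_r.var(ddof=1) / len(r_r))
    z = stat / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2-tailed normal
    return stat, z, p_value
```

A positive statistic indicates that the focal group outperforms the model's prediction relative to the reference group on that item; an item is flagged when the p-value falls below 0.05.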
The next step was post-hoc simulation. Two post‐hoc CAT simulations were conducted. In the all‐items condition, all 360 items were available for selection. In the DIF‐excluded condition, only the 264 items not flagged for DIF were available.
In both conditions, the CAT began with an initial ability estimate of 0.00. The minimum expected posterior variance (MEPV) method was used for item selection [9], and ability was estimated using the expected a posteriori (EAP) method after each item administration [10]. These methods were chosen because Seo and Choi [11] demonstrated that, under the Rasch model, the combination of MEPV item selection and EAP scoring produced the highest estimation accuracy in a post-hoc CAT simulation study of the KMLE. The stopping rule was a standard error of measurement less than 0.3. Item exposure was unrestricted, as the goal was to isolate the effect of DIF rather than optimize item pool usage.
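A compact sketch of this adaptive loop is given below (an illustrative Python reimplementation, not the catR code used in the study; it places a standard normal prior on a quadrature grid, selects items by expected posterior variance, scores by EAP after each item, and stops once the posterior SD falls below 0.3):

```python
import numpy as np

def eap_update(prior_w, nodes, b, x):
    """Multiply prior weights by the Rasch likelihood of response x
    to an item with difficulty b, then renormalize."""
    p = 1.0 / (1.0 + np.exp(-(nodes - b)))
    w = prior_w * (p if x == 1 else 1.0 - p)
    return w / w.sum()

def posterior_mean_var(w, nodes):
    mean = (w * nodes).sum()
    return mean, (w * (nodes - mean) ** 2).sum()

def run_cat(true_theta, bank_b, rng, se_stop=0.3, max_items=60):
    """Post-hoc CAT: MEPV item selection, EAP scoring, SE < se_stop stopping."""
    nodes = np.linspace(-4, 4, 81)
    w = np.exp(-0.5 * nodes ** 2)       # standard normal prior -> initial EAP 0.00
    w /= w.sum()
    available = list(range(len(bank_b)))
    administered = []
    theta_hat, var = posterior_mean_var(w, nodes)
    while available and len(administered) < max_items and np.sqrt(var) >= se_stop:
        # MEPV: pick the item minimizing the expected posterior variance
        best, best_epv = None, np.inf
        for j in available:
            p_nodes = 1.0 / (1.0 + np.exp(-(nodes - bank_b[j])))
            p_marg = (w * p_nodes).sum()            # predictive P(correct)
            _, v1 = posterior_mean_var(eap_update(w, nodes, bank_b[j], 1), nodes)
            _, v0 = posterior_mean_var(eap_update(w, nodes, bank_b[j], 0), nodes)
            epv = p_marg * v1 + (1 - p_marg) * v0
            if epv < best_epv:
                best, best_epv = j, epv
        # Simulate the examinee's response from the true theta
        p_true = 1.0 / (1.0 + np.exp(-(true_theta - bank_b[best])))
        x = int(rng.random() < p_true)
        w = eap_update(w, nodes, bank_b[best], x)
        theta_hat, var = posterior_mean_var(w, nodes)
        available.remove(best)
        administered.append(best)
    return theta_hat, np.sqrt(var), len(administered)
```

The grid size, prior bounds, and 60-item cap are illustrative assumptions; the operational simulation imposed no exposure control, as noted above.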
The accuracy of CAT ability estimates was evaluated using bias, RMSE, Pearson correlation between estimated and true θ, and average test length, defined as follows:
$$\mathrm{Bias} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{\theta}_{i} - \theta_{i}\right)$$
Bias quantifies the average signed difference between estimated and true ability scores. A positive bias indicates overestimation, while a negative bias indicates underestimation.
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{\theta}_{i} - \theta_{i}\right)^{2}}$$
RMSE reflects the overall magnitude of estimation error, incorporating both bias and variability. Lower RMSE values indicate greater precision in ability estimation.
$$\rho = \frac{\mathrm{Cov}(\hat{\theta}, \theta)}{\sigma_{\hat{\theta}}\,\sigma_{\theta}}$$
The Pearson correlation coefficient (ρ) measures the strength of the linear relationship between estimated and true abilities. Values closer to 1 indicate stronger alignment between CAT estimates and the reference scores. For all calculations, the true θ for each examinee was defined as the ability estimate derived from the 264 DIF-excluded items. Finally, average test length was calculated as the mean number of items administered until the stopping criterion was reached.
These outcomes were computed both overall and separately for the reference and focal groups, and were also analyzed across ability levels to assess differential impact.
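Given vectors of estimated and true abilities, the three accuracy indices defined above reduce to a few lines (a minimal sketch; the study computed them in R):

```python
import numpy as np

def accuracy_metrics(theta_hat, theta_true):
    """Bias, RMSE, and Pearson correlation between estimated and true theta."""
    theta_hat = np.asarray(theta_hat, float)
    theta_true = np.asarray(theta_true, float)
    err = theta_hat - theta_true
    bias = err.mean()                      # signed average error
    rmse = np.sqrt((err ** 2).mean())      # overall error magnitude
    rho = np.corrcoef(theta_hat, theta_true)[0, 1]
    return bias, rmse, rho
```

Applied separately to the all-items and DIF-excluded conditions (and within each gender group), these quantities yield the comparisons reported in Tables 3 and 4.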
All analyses were conducted using R software ver. 4.4.1 (R Core Team) [12]. CAT simulations were implemented using the catR package [13], and DIF detection was carried out using the irtQ package [14]. Supplement 2 provides a simplified R code snippet illustrating the core CAT simulation logic and the use of the catR package. The example generates responses for a small number of hypothetical examinees to demonstrate the functions applied in the full-scale simulation of 3,259 examinees.
DIF detection
Table 1 and Fig. 1 present the distribution of item difficulty parameters for all items, DIF-free items, and DIF items.
The item bank contained 360 dichotomous items, of which 96 were identified as exhibiting gender-based DIF using the RDIF method. Under the Rasch model, the mean item difficulty for the full bank was –0.465 (standard deviation [SD]=1.694; range, –5.697 to 6.770). Most DIF items were concentrated around the mid-difficulty range, with relatively few at the extremes. As a result, when these items were excluded, the remaining DIF-free bank (n=264) showed a slightly easier mean difficulty (–0.713) and a wider spread of difficulties (SD=1.756), since the exclusion disproportionately removed mid-difficulty items.
In the paper-based test (PBT), females had higher mean ability estimates than males in both the all-items condition (1.026 vs. 0.878) and the DIF-free condition (1.070 vs. 0.851). The gender gap widened slightly after DIF removal, with females showing an increase (+0.044) and males a decrease (–0.027). This pattern suggests that DIF items tended to underestimate female ability and overestimate male ability. Accordingly, PBT estimates based on the DIF-free items were considered the least biased measure of ability and were used as the reference (“true θ”) in subsequent CAT analyses (Table 2).
CAT simulation
Post-hoc CAT simulations were conducted under 2 conditions: all items versus DIF-free items, using the Rasch model, MEPV item selection, and EAP scoring. The DIF-free condition showed modest improvements, with RMSE reduced by 0.051 and correlation increased by 0.058 compared with the all-items condition. Although bias shifted slightly (–0.095 in the DIF-free condition vs. –0.082 with all items), the change was small, suggesting that overall estimation error was reduced without introducing meaningful systematic distortion (Table 3, Fig. 2).
Gender-specific analyses showed that for males, bias increased from –0.050 (all-items) to –0.081 (DIF-free), while RMSE decreased from 0.330 to 0.283 and correlation increased from 0.833 to 0.890. For females, bias decreased from –0.137 to –0.118, RMSE decreased from 0.336 to 0.277, and correlation increased from 0.841 to 0.898 (Figs. 3, 4).
The average test length was longer in the DIF-free condition (49.820 items, SD=4.896) than in the all-items condition (48.789 items, SD=2.020). The increase was observed for both genders, but was larger for females (Δ=1.242) than for males (Δ=0.910) (Fig. 5).
Figs. 2–5 illustrate how DIF removal affected performance across the ability spectrum. Bias and RMSE patterns across ability bins showed minimal error near the average ability level and larger errors at the extremes under both conditions. In the high-ability bin (≥2.0), bias was higher in the DIF-free condition, but RMSE was lower. Correlation with true θ peaked in the mid-ability range and decreased toward the extremes. For examinees with θ≤0, the all-items condition showed slightly higher correlations than the DIF-free condition.
Mean test length was consistently greater in the DIF-free condition across most ability bins, with the largest increases observed at the low and high ends of the ability scale. Detailed numerical results for bias, RMSE, correlation, and test length by ability level (corresponding to Figs. 2–5) are available in Table 4.
Key results
This study examined the impact of removing gender-based DIF items on ability estimation in CAT using real data from the 2017 KMLE. Gender-based DIF was detected with the RDIF method, identifying 96 DIF items out of 360. Most DIF items were of mid-level difficulty, with very few at the extremes. Removing these items produced a slightly easier overall difficulty level and greater variability (SD) in the remaining item pool. In the post-hoc CAT simulations, the DIF-free condition demonstrated reduced overall error, as reflected in lower RMSE and higher correlations with the reference measure. At the same time, bias shifted modestly in direction, highlighting the trade-off between minimizing random error and introducing a small but consistent offset. Mean test length increased in the DIF-free condition, reflecting the reduced and less optimally distributed item pool after DIF removal.
Interpretation
The finding that DIF-free CAT improved accuracy, even if only modestly, underscores the importance of eliminating items that unfairly advantage or disadvantage certain groups. These benefits were most evident at the extremes of the ability scale, where DIF had the greatest impact on bias in the all-items CAT.
Bias and RMSE diverged for high-ability examinees (≥2.0), with higher bias but lower RMSE in the DIF-free condition. This reflects the distinction between bias, which captures directional error, and RMSE, which reflects overall error magnitude. Thus, estimates may shift slightly in one direction while still being more consistent overall.
Gender-specific results further clarify the effect of DIF removal. In the all-items condition, female ability was substantially underestimated (bias=–0.137) compared with males (bias=–0.050). After removing DIF items, underestimation decreased for females (bias=–0.118) but increased for males (bias=–0.081). This pattern suggests that DIF disproportionately disadvantaged female examinees by lowering their ability estimates, while slightly inflating male estimates. Removing DIF items reduced this imbalance, producing estimates that were fairer across genders.
Correlation with true θ was generally highest in the mid-ability range and declined toward the extremes. For examinees with θ≤0, the all-items condition showed slightly higher correlations than the DIF-free condition. This may reflect the advantage of a larger item pool in better targeting low-ability examinees, even though it contained biased items, while the DIF-free pool had fewer items suited to this range.
The increase in test length in the DIF-free condition can be attributed to 2 factors: (1) the reduction in the total number of items after DIF removal, and (2) the disproportionate removal of mid-difficulty items, which are most informative for examinees near the average ability level. This reduction in optimal targeting required the administration of more items to meet the stopping criterion, particularly at the extremes of the ability distribution.
Overall, these findings suggest that while removing DIF items enhances fairness and validity—especially by reducing gender-related bias—it also reduces testing efficiency. In operational CAT settings, this trade-off should be carefully managed, potentially through replenishing the pool with unbiased items that match the difficulty ranges of those removed.
Limitations
This study analyzed data from a single high-stakes examination in Korea, relied solely on the RDIF method for DIF detection, and used post-hoc CAT simulations that may not fully reflect operational CAT administration. While valuable for controlled analysis, post-hoc simulations inherently introduce limitations. For example, examinees’ adaptive behavior and the dynamic nature of item exposure in live CAT administrations are simplified, which may lead to an overestimation of accuracy or an underestimation of certain biases compared with real-world testing environments.
Generalizability
Findings from this study, based on a large dataset from the KMLE, provide insights relevant to other high-stakes educational and licensure examinations employing or considering CAT. Although the context is the KMLE, the psychometric principles concerning the impact of DIF on ability estimation are broadly generalizable to health professions education. The demonstrated trade-off between fairness/accuracy and test efficiency is an important consideration for test developers worldwide.
Suggestion for further studies
Future research should apply multiple DIF detection methods, explore different IRT models and item selection algorithms, and investigate replenishing the pool with unbiased items that target the difficulty ranges most affected by DIF removal. Such strategies could help preserve CAT efficiency while enhancing fairness and validity in high-stakes assessments.
Conclusion
Removing gender-based DIF items modestly improved fairness and accuracy in CAT, though at the cost of longer test lengths due to a smaller and less balanced item pool. By analyzing data from actual test-takers, this study demonstrates the real-world impact of DIF on CAT and provides evidence supporting the development of fairer examinations.

Authors’ contributions

Conceptualization: DGS. Data curation: DGS, DK, JC. Methodology/formal analysis/validation: DGS, DK, JC. Funding acquisition: DGS. Writing–original draft: DK. Writing–review & editing: DGS, JC.

Conflict of interest

No potential conflict of interest relevant to this article was reported.

Funding

This study was supported by a research grant from Hallym University (HRF-202504-006).

Data availability

Data files are available from https://doi.org/10.7910/DVN/PETWZF

Dataset 1.

Acknowledgments

None.

Supplementary files are available from https://doi.org/10.7910/DVN/PETWZF
Supplement 1. Item parameter estimates and gender-based DIF indicators.
jeehp-22-31-suppl1.csv
Supplement 2. R code for the CAT simulations.
jeehp-22-31-suppl2.txt
Supplement 3. Audio recording of the abstract.
jeehp-22-31-abstract-recording.avi
Fig. 1.
Distribution of item difficulty parameters for all items (A), differential item functioning (DIF)-free items (B), and DIF items (C).
jeehp-22-31f1.jpg
Fig. 2.
Bias across ability levels in computerized adaptive testing simulations, comparing performance with all items, including those with differential item functioning (DIF), versus DIF-excluded items.
jeehp-22-31f2.jpg
Fig. 3.
Root mean square error (RMSE) across ability levels in computerized adaptive testing (CAT) simulations using all items and differential item functioning (DIF)-free items.
jeehp-22-31f3.jpg
Fig. 4.
Correlation between estimated and true θ across ability levels in computerized adaptive testing (CAT) simulations.
jeehp-22-31f4.jpg
Fig. 5.
Mean test length across ability levels in computerized adaptive testing (CAT) simulations using all items and differential item functioning (DIF)-free items.
jeehp-22-31f5.jpg
Table 1.
Item parameter (b) summary for all items, DIF items, and DIF-free items
| Statistic | All items | DIF items | DIF-free items |
|---|---|---|---|
| No. of items | 360 | 96 | 264 |
| Mean | –0.465 | 0.215 | –0.713 |
| Standard deviation | 1.694 | 1.293 | 1.756 |
| Maximum value | 6.770 | 3.156 | 6.770 |
| Minimum value | –5.697 | –3.201 | –5.697 |

DIF, differential item functioning.

Table 2.
Ability estimates from paper-based testing using all items versus DIF-free items, by gender
| Group | All items | DIF-free items | Δ Mean |
|---|---|---|---|
| Male | 0.878±0.521 | 0.851±0.586 | 0.027 |
| Female | 1.026±0.506 | 1.070±0.564 | –0.044 |
| All | 0.932±0.520 | 0.931±0.587 | 0.001 |

Values are presented as mean±standard deviation.

DIF, differential item functioning; Δ Mean, difference in mean ability estimates between all items and DIF-free items.

Table 3.
Bias, RMSE, correlation, and mean (SD) test length in CAT simulations using all items versus DIF-free items, by group
| Group | Bias: all items | Bias: DIF-free | Δ | RMSE: all items | RMSE: DIF-free | Δ | Correlation: all items | Correlation: DIF-free | Δ | Test length: all items | Test length: DIF-free | Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Total | –0.082 | –0.095 | –0.013 | 0.332 | 0.281 | –0.051 | 0.838 | 0.895 | 0.058 | 48.789±2.020 | 49.820±4.896 | 1.031 |
| Male | –0.050 | –0.081 | –0.031 | 0.330 | 0.283 | –0.047 | 0.833 | 0.890 | 0.057 | 48.721±2.071 | 49.631±3.407 | 0.910 |
| Female | –0.137 | –0.118 | 0.019 | 0.336 | 0.277 | –0.058 | 0.841 | 0.898 | 0.057 | 48.908±1.923 | 50.151±6.737 | 1.242 |

RMSE, root mean square error; SD, standard deviation; CAT, computerized adaptive testing; DIF, differential item functioning; Δ, difference between the DIF-free items and all-items conditions.

Table 4.
Ability‐level comparison of bias, RMSE, correlation, and test length between all items and DIF‐free items in CAT simulations
| θ bin | No. | Bias (all) | RMSE (all) | Correlation (all) | Test length (all) | Bias (DIF-free) | RMSE (DIF-free) | Correlation (DIF-free) | Test length (DIF-free) |
|---|---|---|---|---|---|---|---|---|---|
| ≤–1.0 | 22 | 0.412 | 0.670 | 0.586 | 49.045 | –0.074 | 0.459 | 0.588 | 49.364 |
| –1.0 to –0.5 | 40 | 0.382 | 0.544 | 0.249 | 47.675 | 0.007 | 0.426 | 0.121 | 48.125 |
| –0.5 to 0.0 | 96 | 0.154 | 0.384 | 0.241 | 47.958 | 0.001 | 0.326 | 0.167 | 48.146 |
| 0.0 to 0.5 | 358 | 0.151 | 0.354 | 0.229 | 47.908 | 0.046 | 0.305 | 0.313 | 48.425 |
| 0.5 to 1.0 | 1,016 | 0.003 | 0.275 | 0.299 | 48.288 | –0.024 | 0.248 | 0.346 | 48.905 |
| 1.0 to 1.5 | 1,134 | –0.171 | 0.313 | 0.364 | 48.751 | –0.146 | 0.271 | 0.485 | 49.505 |
| 1.5 to 2.0 | 505 | –0.283 | 0.390 | 0.466 | 49.838 | –0.228 | 0.310 | 0.526 | 51.487 |
| ≥2.0 | 88 | –0.287 | 0.387 | 0.566 | 53.977 | –0.218 | 0.277 | 0.742 | 63.284 |

RMSE, root mean square error; DIF, differential item functioning; CAT, computerized adaptive testing.
