Purpose This study evaluates the use of ChatGPT-4o in creating tailored continuing professional development (CPD) plans for radiography students, addressing the challenge of aligning CPD with Medical Radiation Practice Board of Australia (MRPBA) requirements. We hypothesized that ChatGPT-4o could support students in CPD planning while meeting regulatory standards.
Methods A descriptive, experimental design was used to generate 3 unique CPD plans using ChatGPT-4o, each tailored to a hypothetical graduate radiographer in a different clinical setting. Each plan followed MRPBA guidelines, focusing on computed tomography specialization by the second year. Three MRPBA-registered academics assessed the plans from October 2024 to November 2024 using the criteria of appropriateness, timeliness, relevance, reflection, and completeness. Ratings were analyzed using the Friedman test and the intraclass correlation coefficient (ICC) to measure consistency among evaluators.
Results ChatGPT-4o-generated CPD plans generally adhered to regulatory standards across scenarios. The Friedman test indicated no significant differences among raters (P=0.420, 0.761, and 0.807 for scenarios 1, 2, and 3, respectively), suggesting consistent scores within scenarios. However, ICC values were low (–0.96, 0.41, and 0.058 for scenarios 1, 2, and 3), revealing variability among raters, particularly for the timeliness and completeness criteria, and suggesting limitations in ChatGPT-4o’s ability to address individualized and context-specific needs.
Conclusion ChatGPT-4o demonstrates the potential to ease the cognitive demands of CPD planning, offering structured support in CPD development. However, human oversight remains essential to ensure plans are contextually relevant and deeply reflective. Future research should focus on enhancing artificial intelligence’s personalization for CPD evaluation, highlighting ChatGPT-4o’s potential and limitations as a tool in professional education.
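As a minimal sketch of the rater-agreement analysis described in the Methods above, the snippet below runs a Friedman test and computes ICCs on hypothetical ratings (the study's raw scores are not reproduced here); the scipy and pingouin libraries and all illustrative values are assumptions, not part of the original study.

```python
# Sketch of the rater-agreement analysis (hypothetical ratings, not the study's data).
import pandas as pd
from scipy.stats import friedmanchisquare
import pingouin as pg

# Three academics rating one scenario on five criteria (1-5 scale) - illustrative values.
ratings = pd.DataFrame({
    "criterion": ["appropriateness", "timeliness", "relevance", "reflection", "completeness"] * 3,
    "rater": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "score": [5, 4, 5, 4, 4,  4, 3, 5, 4, 3,  5, 4, 4, 4, 4],
})

# Friedman test: do the three raters differ systematically across criteria?
wide = ratings.pivot(index="criterion", columns="rater", values="score")
stat, p = friedmanchisquare(wide["A"], wide["B"], wide["C"])
print(f"Friedman chi-square={stat:.3f}, P={p:.3f}")

# Intraclass correlation coefficients (pingouin reports all six ICC forms).
icc = pg.intraclass_corr(data=ratings, targets="criterion", raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```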
Purpose This study aimed to compare cognitive, non-cognitive, and overall learning outcomes for sepsis and trauma resuscitation skills in novices with virtual patient simulation (VPS) versus in-person simulation (IPS).
Methods A randomized controlled trial was conducted on junior doctors in 1 emergency department from January to December 2022, comparing 70 minutes of VPS (n=19) versus IPS (n=21) in sepsis and trauma resuscitation. Using the nominal group technique, we created skills assessment checklists and determined Bloom’s taxonomy domains for each checklist item. Two blinded raters observed participants leading 1 sepsis and 1 trauma resuscitation simulation. Satisfaction was measured using the Student Satisfaction with Learning Scale (SSLS). The SSLS and checklist scores were analyzed using the 2-tailed t-test.
Results For sepsis, there was no significant difference between VPS and IPS in overall scores (2.0; 95% confidence interval [CI], -1.4 to 5.4; Cohen’s d=0.38), as well as in items that were cognitive (1.1; 95% CI, -1.5 to 3.7) and not only cognitive (0.9; 95% CI, -0.4 to 2.2). Likewise, for trauma, there was no significant difference in overall scores (-0.9; 95% CI, -4.1 to 2.3; Cohen’s d=0.19), as well as in items that were cognitive (-0.3; 95% CI, -2.8 to 2.1) and not only cognitive (-0.6; 95% CI, -2.4 to 1.3). The median SSLS scores were lower with VPS than with IPS (-3.0; 95% CI, -5.0 to -1.0).
Conclusion For novices, there were no major differences in overall and non-cognitive learning outcomes for sepsis and trauma resuscitation between VPS and IPS. Learners were more satisfied with IPS than with VPS (clinicaltrials.gov identifier: NCT05201950).
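A hedged sketch of the between-group analysis reported above: the snippet computes the mean difference, its 95% CI, and Cohen's d for two simulated groups of checklist scores (n=19 vs. n=21); the numbers are illustrative, not the trial's data.

```python
# Sketch of the VPS-vs-IPS comparison (hypothetical checklist scores, not the trial's data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
vps = rng.normal(20, 4, 19)   # n=19 virtual patient simulation
ips = rng.normal(18, 4, 21)   # n=21 in-person simulation

t, p = stats.ttest_ind(vps, ips)                     # 2-tailed t-test, equal variances
diff = vps.mean() - ips.mean()

# Pooled SD for Cohen's d and the 95% CI of the mean difference.
n1, n2 = len(vps), len(ips)
sp = np.sqrt(((n1 - 1) * vps.var(ddof=1) + (n2 - 1) * ips.var(ddof=1)) / (n1 + n2 - 2))
d = diff / sp
se = sp * np.sqrt(1 / n1 + 1 / n2)
ci = diff + np.array([-1, 1]) * stats.t.ppf(0.975, n1 + n2 - 2) * se

print(f"diff={diff:.1f}, 95% CI {ci[0]:.1f} to {ci[1]:.1f}, Cohen's d={d:.2f}, P={p:.3f}")
```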
Purpose With the coronavirus disease 2019 pandemic, online high-stakes exams became a viable alternative to traditional paper-based testing. This study evaluated the feasibility of computer-based testing (CBT) for medical residency applications in Brazil and its impact on item quality and applicants’ access compared to paper-based testing.
Methods In 2020, an online CBT was conducted at the Ribeirao Preto Clinical Hospital in Brazil. In total, 120 multiple-choice question items were constructed. Two years later, the exam was administered as paper-based testing. Item construction processes were similar for both exams. Difficulty and discrimination indexes, point-biserial coefficients, and Cronbach’s α coefficients were measured based on classical test theory, and difficulty, discrimination, and guessing parameters were estimated based on item response theory. Internet stability for applicants was monitored.
Results In 2020, 4,846 individuals (57.1% female, mean age of 26.64±3.37 years) applied to the residency program, versus 2,196 individuals (55.2% female, mean age of 26.47±3.20 years) in 2022. For CBT, there was an increase of 2,650 applicants (120.7%), albeit with significant differences in demographic characteristics. There was a significant increase in applicants from more distant and lower-income Brazilian regions, such as the North (5.6% vs. 2.7%) and Northeast (16.9% vs. 9.0%). No significant differences were found in difficulty and discrimination indexes, point-biserial coefficients, and Cronbach’s α coefficients between the 2 exams.
Conclusion Online CBT with multiple-choice questions was a viable format for a residency application exam, improving accessibility without compromising exam integrity and quality.
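To illustrate the classical test theory indices named in the Methods above (difficulty and discrimination indexes, point-biserial coefficients, and Cronbach's α), the sketch below computes them from a simulated 0/1 response matrix; the simulated data and the 27% upper/lower grouping rule are assumptions for demonstration only.

```python
# Sketch of classical test theory item analysis on a simulated 0/1 response matrix.
import numpy as np

rng = np.random.default_rng(1)
n_examinees, n_items = 1000, 120
ability = rng.normal(size=(n_examinees, 1))
difficulty = rng.normal(size=(1, n_items))
responses = (rng.random((n_examinees, n_items)) < 1 / (1 + np.exp(-(ability - difficulty)))).astype(int)

total = responses.sum(axis=1)

# Difficulty index: proportion of examinees answering each item correctly.
p_values = responses.mean(axis=0)

# Discrimination index: upper 27% minus lower 27% proportion correct.
order = np.argsort(total)
k = int(0.27 * n_examinees)
low, high = responses[order[:k]], responses[order[-k:]]
discrimination = high.mean(axis=0) - low.mean(axis=0)

# Point-biserial: correlation of each item with the rest-of-test score.
pbis = np.array([np.corrcoef(responses[:, i], total - responses[:, i])[0, 1] for i in range(n_items)])

# Cronbach's alpha for the whole exam.
item_var = responses.var(axis=0, ddof=1).sum()
alpha = n_items / (n_items - 1) * (1 - item_var / total.var(ddof=1))

print(f"mean difficulty={p_values.mean():.2f}, mean discrimination={discrimination.mean():.2f}, alpha={alpha:.2f}")
```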
Purpose The primary aim of this study is to validate the Blended Learning Usability Evaluation–Questionnaire (BLUE-Q) for use in the field of health professions education through a Bayesian approach. As Bayesian questionnaire validation remains elusive, a secondary aim of this article is to serve as a simplified tutorial for engaging in such validation practices in health professions education.
Methods A total of 10 experts in blended learning from the field of health professions education were recruited to participate in a 30-minute interviewer-administered survey. On a 5-point Likert scale, the experts rated how well they perceived each item of the BLUE-Q to reflect its underlying usability domain (i.e., effectiveness, efficiency, satisfaction, accessibility, organization, and learner experience). Ratings were descriptively analyzed and converted into beta prior distributions. Participants were also given the option to provide qualitative comments on each item.
Results After reviewing the computed expert prior distributions, 31 quantitative items were identified as having a probability of “low endorsement” and were thus removed from the questionnaire. Additionally, qualitative comments were used to revise the phrasing and order of items to ensure clarity and logical flow. The BLUE-Q’s final version comprises 23 Likert-scale items and 6 open-ended items.
Conclusion Questionnaire validation can generally be a complex, time-consuming, and costly process, inhibiting many from engaging in proper validation practices. In this study, we demonstrate that a Bayesian questionnaire validation approach can be a simple, resource-efficient, yet rigorous solution to validating a tool for content and item-domain correlation through the elicitation of domain expert endorsement ratings.
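The conversion of expert Likert ratings into beta prior distributions can be illustrated with one plausible scheme (the BLUE-Q study's exact conversion rule is not given in this abstract): the snippet below treats ratings of 4–5 as endorsements, builds a Beta distribution from the endorsement counts, and computes the probability of low endorsement. The 0.7 threshold and the ratings are hypothetical.

```python
# Sketch of eliciting a Beta prior from expert Likert endorsements (one plausible scheme;
# the BLUE-Q study's exact conversion rule is not specified in the abstract).
from scipy.stats import beta

# Ten experts rate one item on a 5-point scale; treat 4-5 as "endorse", 1-3 as "not endorse".
ratings = [5, 4, 4, 3, 5, 4, 2, 4, 5, 3]          # hypothetical ratings
endorse = sum(r >= 4 for r in ratings)
not_endorse = len(ratings) - endorse

# Beta(1 + successes, 1 + failures): uniform prior updated by the expert "votes".
prior = beta(1 + endorse, 1 + not_endorse)

# Probability that the item's true endorsement rate is below 0.7 ("low endorsement").
p_low = prior.cdf(0.7)
print(f"P(endorsement < 0.7) = {p_low:.2f}")   # flag the item for removal if this is high
```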
Purpose This study aimed to develop and validate the Student Perceptions of Real Patient Use in Physical Therapy Education (SPRP-PTE) survey to assess physical therapy student (SPT) perceptions regarding real patient use in didactic education.
Methods This cross-sectional observational study developed a 48-item survey and tested the survey on 130 SPTs. Face and content validity were determined by an expert review and content validity index (CVI). Construct validity and internal consistency reliability were determined via exploratory factor analysis (EFA) and Cronbach’s α.
Results Three main constructs were identified (value, satisfaction, and confidence), each having 4 subconstruct components (overall, cognitive, psychomotor, and affective learning). Expert review demonstrated adequate face and content validity (CVI=96%). The initial EFA of the 48-item survey revealed items with inconsistent loadings and low correlations, leading to the removal of 18 items. An EFA of the 30-item survey demonstrated 1-factor loadings for all survey constructs except satisfaction and the entire survey. All constructs had adequate internal consistency (Cronbach’s α >0.85).
Conclusion The SPRP-PTE survey provides a reliable and valid way to assess student perceptions of real patient use. Future studies are encouraged to validate the SPRP-PTE survey further.
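A sketch of the construct-validity workflow described in the Methods above, using simulated item responses rather than the SPRP-PTE data: an exploratory factor analysis with varimax rotation followed by Cronbach's α for one construct. The factor_analyzer and pingouin packages and the 10-items-per-construct structure are assumptions for illustration.

```python
# Sketch of the construct-validity workflow: exploratory factor analysis followed by
# Cronbach's alpha per construct (simulated responses, not the SPRP-PTE data).
import numpy as np
import pandas as pd
import pingouin as pg
from factor_analyzer import FactorAnalyzer

rng = np.random.default_rng(2)
n_students, n_items = 130, 30
latent = rng.normal(size=(n_students, 3))                    # value, satisfaction, confidence
loadings = np.repeat(np.eye(3), 10, axis=0)                  # 10 items per construct
data = pd.DataFrame(latent @ loadings.T + rng.normal(0, 0.5, (n_students, n_items)),
                    columns=[f"item{i+1}" for i in range(n_items)])

fa = FactorAnalyzer(n_factors=3, rotation="varimax")
fa.fit(data)
print(pd.DataFrame(fa.loadings_, index=data.columns).round(2))   # inspect cross-loadings

# Internal consistency for one construct (items 1-10).
alpha, ci = pg.cronbach_alpha(data=data.iloc[:, :10])
print(f"Cronbach's alpha={alpha:.2f} (95% CI {ci[0]:.2f}-{ci[1]:.2f})")
```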
This study investigated the performance of ChatGPT-4.0o in evaluating the quality of positioning in radiographic images. Thirty radiographs depicting a variety of knee, elbow, ankle, hand, pelvis, and shoulder projections were produced using anthropomorphic phantoms and uploaded to ChatGPT-4.0o. The model was prompted to identify any positioning errors, justify its assessment, and suggest improvements. A panel of radiographers assessed the model’s responses against established positioning criteria, using a grading scale of 1–5. ChatGPT-4.0o correctly recognized all errors with justifications and offered correct suggestions for improvement in only 20% of projections. The most commonly occurring score was 3 (9 cases, 30%), wherein the model recognized at least 1 specific error and provided a correct improvement. The mean score was 2.9. Overall, accuracy was low, with most projections receiving only partially correct solutions. The findings reinforce the importance of robust radiography education and clinical experience.
This study examines the legality and appropriateness of keeping the multiple-choice question items of the Korean Medical Licensing Examination (KMLE) confidential. Through an analysis of cases from the United States, Canada, and Australia, where medical licensing exams are conducted using item banks and computer-based testing, we found that exam items are kept confidential to ensure fairness and prevent cheating. In Korea, the Korea Health Personnel Licensing Examination Institute (KHPLEI) has been disclosing KMLE questions despite concerns over exam integrity. Korean courts have consistently ruled that multiple-choice question items prepared by public institutions are non-public information under Article 9(1)(v) of the Korea Official Information Disclosure Act (KOIDA), which exempts disclosure if it significantly hinders the fairness of exams or research and development. The Constitutional Court of Korea has upheld this provision. Given the time and cost involved in developing high-quality items and the need to accurately assess examinees’ abilities, there are compelling reasons to keep KMLE items confidential. As a public institution responsible for selecting qualified medical practitioners, KHPLEI should establish its disclosure policy based on a balanced assessment of public interest, without influence from specific groups. We conclude that KMLE questions qualify as non-public information under KOIDA, and KHPLEI may choose to maintain their confidentiality to ensure exam fairness and efficiency.
Purpose The Dr. LEE Jong-wook Fellowship Program, established by the Korea Foundation for International Healthcare (KOFIH), aims to strengthen healthcare capacity in partner countries. The aim of the study was to develop new performance evaluation indicators for the program to better assess long-term educational impact across various courses and professional roles.
Methods A 3-stage process was employed. First, a literature review of established evaluation models (Kirkpatrick’s 4 levels, context/input/process/product evaluation model, Organization for Economic Cooperation and Development Assistance Committee criteria) was conducted to devise evaluation criteria. Second, these criteria were validated via a 2-round Delphi survey with 18 experts in training projects from May 2021 to June 2021. Third, the relative importance of the evaluation criteria was determined using the analytic hierarchy process (AHP), calculating weights and ensuring consistency through the consistency index and consistency ratio (CR), with CR values below 0.1 indicating acceptable consistency.
Results The literature review led to a combined evaluation model, resulting in 4 evaluation areas, 20 items, and 92 indicators. The Delphi surveys confirmed the validity of these indicators, with content validity ratio values exceeding 0.444. The AHP analysis assigned weights to each indicator, and CR values below 0.1 indicated consistency. The final set of evaluation indicators was confirmed through a workshop with KOFIH and adopted as the new evaluation tool.
Conclusion The developed evaluation framework provides a comprehensive tool for assessing the long-term outcomes of the Dr. LEE Jong-wook Fellowship Program. It enhances evaluation capabilities and supports improvements in the training program’s effectiveness and international healthcare collaboration.
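The AHP weighting and consistency check described in the Methods above follow Saaty's standard procedure; the sketch below derives priority weights from a hypothetical 4×4 pairwise comparison matrix and computes the consistency ratio (the matrix is illustrative, not KOFIH's).

```python
# Sketch of the analytic hierarchy process step: deriving weights from a pairwise
# comparison matrix and checking the consistency ratio (illustrative matrix, not KOFIH's).
import numpy as np

# Pairwise comparisons of 4 evaluation areas on Saaty's 1-9 scale (hypothetical).
A = np.array([
    [1,   3,   5,   2],
    [1/3, 1,   3,   1/2],
    [1/5, 1/3, 1,   1/4],
    [1/2, 2,   4,   1],
], dtype=float)

n = A.shape[0]
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()                       # priority weights of the 4 areas

lambda_max = eigvals[k].real
CI = (lambda_max - n) / (n - 1)                # consistency index
RI = {3: 0.58, 4: 0.90, 5: 1.12}[n]            # Saaty's random index
CR = CI / RI                                   # consistency ratio; < 0.1 is acceptable

print("weights:", weights.round(3), "CR:", round(CR, 3))
```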
Purpose The System for Improving and Measuring Procedure Learning (SIMPL), a smartphone-based operative assessment application, was developed to assess the intraoperative performance of surgical residents. This study aims to examine the reliability of the SIMPL assessment and determine the optimal number of procedures for a reliable assessment.
Methods In this retrospective observational study, we analyzed data collected between 2015 and 2023 from 4,616 residents across 94 General Surgery Residency programs in the United States that utilized the SIMPL smartphone application. We employed multivariate generalizability theory and initially conducted generalizability studies to estimate the variance components associated with procedures. We then performed decision studies to estimate the reliability coefficient and the minimum number of procedures required for a reproducible assessment.
Results We estimated that the reliability of the assessment of surgical trainees’ intraoperative autonomy and performance using SIMPL exceeded 0.70. Additionally, the optimal number of procedures required for a reproducible assessment was 10, 17, 15, and 17 for postgraduate year (PGY) 2, PGY 3, PGY 4, and PGY 5, respectively. Notably, the study highlighted that the assessment of residents in their senior years necessitated a larger number of procedures compared to those in their junior years.
Conclusion The study demonstrated that the SIMPL assessment is reliably effective for evaluating the intraoperative performance of surgical trainees. Adjusting the number of procedures based on the trainees’ training stage enhances the assessment process’s accuracy and effectiveness.
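The decision-study logic can be illustrated with the standard single-facet generalizability formula, in which the G coefficient rises as more procedures are rated; the variance components below are hypothetical stand-ins, not SIMPL's estimates, and the study's multivariate design is simplified to one facet.

```python
# Sketch of a single-facet decision study: how the generalizability coefficient grows
# with the number of procedures rated (hypothetical variance components, not SIMPL's).

def g_coefficient(var_person: float, var_residual: float, n_procedures: int) -> float:
    """E-rho^2 = sigma^2_p / (sigma^2_p + sigma^2_residual / n) for a p x i design."""
    return var_person / (var_person + var_residual / n_procedures)

var_person, var_residual = 0.30, 1.20   # illustrative variance components
for n in (5, 10, 15, 17, 20):
    print(f"n_procedures={n:2d}  G={g_coefficient(var_person, var_residual, n):.2f}")
```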
Purpose The coronavirus disease 2019 (COVID-19) pandemic limited healthcare professional education and training opportunities in rural communities. Because the US Department of Veterans Affairs (VA) has robust programs to train clinicians in the United States, this study examined VA trainee perspectives regarding pandemic-related training in rural and urban areas and interest in future employment with the VA.
Methods Survey responses were collected nationally from VA physician and nursing trainees before and during the COVID-19 pandemic (2018 to 2021). Logistic regression models were used to test the association of pandemic timing (pre-pandemic or pandemic), trainee program (physician or nurse), and the interaction of pandemic timing and program with trainee satisfaction and trainees’ likelihood to consider future VA employment in rural and urban areas.
Results While physician trainees at urban facilities reported decreases in overall training satisfaction and corresponding decreases in the likelihood of considering future VA employment from pre-pandemic to pandemic, rural physician trainees showed no changes in either outcome. In contrast, while nursing trainees at both urban and rural sites had decreases in training satisfaction associated with the pandemic, there was no corresponding effect on the likelihood of future employment by nurses at either urban or rural VA sites.
Conclusion The study’s findings suggest differences in the training experiences of physicians and nurses at rural sites, as well as between physician trainees at urban and rural sites. Understanding these nuances can inform the development of targeted approaches to address the ongoing provider shortages that rural communities in the United States are facing.
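A minimal sketch of the logistic model described in the Methods above, fit to simulated survey rows: satisfaction is regressed on pandemic timing, trainee program, and their interaction. The variable names and simulated effects are assumptions, not the VA survey data.

```python
# Sketch of the logistic model: satisfaction as a function of pandemic timing,
# trainee program, and their interaction (simulated survey rows, not VA data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 500
df = pd.DataFrame({
    "pandemic": rng.integers(0, 2, n),                    # 0 = pre-pandemic, 1 = pandemic
    "program": rng.choice(["physician", "nurse"], n),
})
logit_p = 0.8 - 0.6 * df["pandemic"] * (df["program"] == "physician")
df["satisfied"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

model = smf.logit("satisfied ~ pandemic * program", data=df).fit(disp=False)
print(model.summary().tables[1])        # interaction term tests the differential effect
```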
Purpose This study investigated the effect of simulation-based training on nursing students’ problem-solving skills, critical thinking skills, and self-efficacy.
Methods A single-group pretest and posttest study was conducted among 173 second-year nursing students at a public university in Vietnam from May 2021 to July 2022. Each student participated in the adult nursing preclinical practice course, which utilized a moderate-fidelity simulation teaching approach. Instruments including the Personal Problem-Solving Inventory Scale, Critical Thinking Skills Questionnaire, and General Self-Efficacy Questionnaire were employed to measure participants’ problem-solving skills, critical thinking skills, and self-efficacy. Data were analyzed using descriptive statistics and the paired-sample t-test with the significance level set at P<0.05.
Results The mean score on the Personal Problem-Solving Inventory was lower at posttest (127.24±12.11) than at pretest (131.42±16.95); because lower scores on this inventory indicate better perceived problem-solving ability, this suggests an improvement in participants’ problem-solving skills (t(172)=2.55, P=0.011). There was no statistically significant difference in critical thinking skills between the pretest and posttest (P=0.854). Self-efficacy among nursing students showed a substantial increase from the pretest (27.91±5.26) to the posttest (28.71±3.81), with t(172)=-2.26 and P=0.025.
Conclusion The results suggest that simulation-based training can improve problem-solving skills and increase self-efficacy among nursing students. Therefore, the integration of simulation-based training in nursing education is recommended.
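The pretest-posttest comparison above uses a paired-sample t-test; a small sketch with simulated scores (not the study's data) follows.

```python
# Sketch of the pretest-posttest comparison with a paired t-test (simulated scores,
# not the study's data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
pretest = rng.normal(131.4, 17.0, 173)
posttest = pretest - rng.normal(4.2, 20.0, 173)    # lower posttest = better problem-solving

t, p = stats.ttest_rel(pretest, posttest)
print(f"t(172)={t:.2f}, P={p:.3f}, mean change={np.mean(posttest - pretest):.1f}")
```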
Computerized adaptive testing (CAT) has become a widely adopted test design for high-stakes licensing and certification exams, particularly in the health professions in the United States, due to its ability to tailor test difficulty in real time, reducing testing time while providing precise ability estimates. A key component of CAT is item response theory (IRT), which facilitates the dynamic selection of items based on examinees' ability levels during a test. Accurate estimation of item and ability parameters is essential for successful CAT implementation, necessitating convenient and reliable software to ensure precise parameter estimation. This paper introduces the irtQ R package (http://CRAN.R-project.org/), which simplifies IRT-based analysis and item calibration under unidimensional IRT models. While it does not directly simulate CAT, it provides essential tools to support CAT development, including parameter estimation using marginal maximum likelihood estimation via the expectation-maximization algorithm, pretest item calibration through fixed item parameter calibration and fixed ability parameter calibration methods, and examinee ability estimation. The package also enables users to compute item and test characteristic curves and information functions necessary for evaluating the psychometric properties of a test. This paper illustrates the key features of the irtQ package through examples using simulated datasets, demonstrating its utility in IRT applications such as test data analysis and ability scoring. By providing a user-friendly environment for IRT analysis, irtQ significantly enhances the capacity for efficient adaptive testing research and operations. Finally, the paper highlights additional core functionalities of irtQ, emphasizing its broader applicability to the development and operation of IRT-based assessments.
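irtQ itself is an R package, so the sketch below simply illustrates in Python the unidimensional 3PL formulas behind the item characteristic curves and item information functions that such tools report; it is not the irtQ API.

```python
# Python illustration of the unidimensional 3PL formulas underlying item characteristic
# curves and item information functions (irtQ itself is an R package; this is not its API).
import numpy as np

def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """Item information: a^2 * (Q/P) * ((P - c) / (1 - c))^2."""
    p = icc_3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

theta = np.linspace(-4, 4, 9)
a, b, c = 1.2, 0.5, 0.2                     # discrimination, difficulty, guessing
print(np.column_stack([theta, icc_3pl(theta, a, b, c), info_3pl(theta, a, b, c)]).round(3))
```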
Purpose Evaluating medical school selection tools is vital for evidence-based student selection. With previous reviews revealing knowledge gaps, this meta-analysis offers insights into the effectiveness of these selection tools.
Methods A systematic review and meta-analysis were conducted applying the following criteria: peer-reviewed articles available in English, published from 2010 onward, and including empirical data linking performance on selection tools with assessment and dropout outcomes of undergraduate-entry medical programs. Systematic reviews, meta-analyses, general opinion pieces, and commentaries were excluded. Effect sizes (ESs) for the predictability of academic and clinical performance within and by the end of the medicine program were extracted, and the pooled ESs were presented.
Results Sixty-seven out of 2,212 articles were included, yielding 236 ESs. Previous academic achievement predicted academic performance in the medical program (Cohen’s d=0.697 early in the program; 0.619 at the end of the program) and clinical exams (0.545 at the end of the program). Within aptitude tests, verbal reasoning and quantitative reasoning predicted academic achievement early in the program and in the last years (0.704 and 0.643, respectively). Overall aptitude tests predicted academic achievement in both the early and last years (0.550 and 0.371, respectively). Neither panel interviews, multiple mini-interviews, nor situational judgement tests (SJTs) yielded statistically significant pooled ESs.
Conclusion Current evidence suggests that learning outcomes are predicted by previous academic achievement and aptitude tests. The predictive value of SJTs, as well as topics such as selection algorithms, features of the interview (e.g., the content of the questions), and the way interviewers’ reports are used, warrants further research.
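As a hedged illustration of how study-level effect sizes are pooled in a meta-analysis like this one, the snippet below combines hypothetical Cohen's d values with fixed-effect inverse-variance weights (the review's actual model and data are not reproduced here).

```python
# Sketch of pooling Cohen's d values with inverse-variance weights
# (hypothetical study-level effects, not the review's data).
import numpy as np

d = np.array([0.70, 0.55, 0.62, 0.48])          # study effect sizes (Cohen's d)
se = np.array([0.10, 0.12, 0.08, 0.15])         # their standard errors

w = 1 / se**2                                    # inverse-variance weights (fixed effect)
pooled = np.sum(w * d) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))
ci = pooled + np.array([-1.96, 1.96]) * pooled_se

print(f"pooled d={pooled:.3f}, 95% CI {ci[0]:.3f} to {ci[1]:.3f}")
```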
Purpose This study aimed to determine whether ChatGPT-4o, a generative artificial intelligence (AI) platform, was able to pass a simulated written European Board of Interventional Radiology (EBIR) exam and whether GPT-4o can be used to train medical students and interventional radiologists of different levels of expertise by generating exam items on interventional radiology.
Methods GPT-4o was asked to answer 370 simulated exam items of the Cardiovascular and Interventional Radiology Society of Europe (CIRSE) for EBIR preparation (CIRSE Prep). Subsequently, GPT-4o was requested to generate exam items on interventional radiology topics at levels of difficulty suitable for medical students and for the EBIR exam. Those generated items were answered by 4 participants: a medical student, a resident, a consultant, and an EBIR holder. The correctly answered items were counted. One investigator checked the answers and items generated by GPT-4o for correctness and relevance. This work was done from April to July 2024.
Results GPT-4o correctly answered 248 of the 370 CIRSE Prep items (67.0%). For 50 CIRSE Prep items, the medical student answered 46.0%, the resident 42.0%, the consultant 50.0%, and the EBIR holder 74.0% correctly. All participants answered 82.0% to 92.0% of the 50 GPT-4o-generated items at the student level correctly. For the 50 GPT-4o items at the EBIR level, the medical student answered 32.0%, the resident 44.0%, the consultant 48.0%, and the EBIR holder 66.0% correctly. All participants could pass the GPT-4o-generated items at the student level, while the EBIR holder could pass the GPT-4o-generated items at the EBIR level. Two items (1.3%) out of the 150 generated by GPT-4o were assessed as implausible.
Conclusion GPT-4o could pass the simulated written EBIR exam and create exam items of varying difficulty to train medical students and interventional radiologists.