
JEEHP : Journal of Educational Evaluation for Health Professions

Search results: 3 articles matching "Certification"
Research articles
Performance of GPT-3.5 and GPT-4 on standardized urology knowledge assessment items in the United States: a descriptive study  
Max Samuel Yudovich, Elizaveta Makarova, Christian Michael Hague, Jay Dilip Raman
J Educ Eval Health Prof. 2024;21:17.   Published online July 8, 2024
DOI: https://doi.org/10.3352/jeehp.2024.21.17
  • 7,729 View
  • 356 Download
  • 20 Web of Science
  • 21 Crossref
Abstract
Purpose
This study aimed to evaluate the performance of Chat Generative Pre-Trained Transformer (ChatGPT) with respect to standardized urology multiple-choice items in the United States.
Methods
In total, 700 multiple-choice urology board exam-style items were submitted to GPT-3.5 and GPT-4, and responses were recorded. Items were categorized based on topic and question complexity (recall, interpretation, and problem-solving). The accuracy of GPT-3.5 and GPT-4 was compared across item types in February 2024.
Results
GPT-4 answered 44.4% of items correctly compared to 30.9% for GPT-3.5 (P<0.00001). GPT-4 (vs. GPT-3.5) had higher accuracy with urologic oncology (43.8% vs. 33.9%, P=0.03), sexual medicine (44.3% vs. 27.8%, P=0.046), and pediatric urology (47.1% vs. 27.1%, P=0.012) items. Endourology (38.0% vs. 25.7%, P=0.15), reconstruction and trauma (29.0% vs. 21.0%, P=0.41), and neurourology (49.0% vs. 33.3%, P=0.11) items did not show significant differences in performance across versions. GPT-4 also outperformed GPT-3.5 on recall (45.9% vs. 27.4%, P<0.00001) and interpretation (45.6% vs. 31.5%, P=0.0005) items; for the highest-complexity, problem-solving items, the difference was not significant (41.8% vs. 34.5%, P=0.56).
Conclusions
ChatGPT performs relatively poorly on standardized multiple-choice urology board exam-style items, with GPT-4 outperforming GPT-3.5. The accuracy was below the proposed minimum passing standard for the American Board of Urology’s Continuing Urologic Certification knowledge reinforcement activity (60%). As artificial intelligence models continue to advance, ChatGPT may become more capable and accurate on board examination items. For now, its responses should be scrutinized.
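The headline comparison (44.4% vs. 30.9% correct out of 700 items each) can be checked with a standard two-proportion z-test. The sketch below is illustrative and uses only the Python standard library; the correct-answer counts 311 and 216 are back-calculated from the reported percentages, which is an assumption rather than data taken from the article.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided two-proportion z-test with pooled variance."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Counts implied by the reported accuracies: 44.4% and 30.9% of 700 items
z, p = two_proportion_z(311, 700, 216, 700)
print(f"z = {z:.2f}, p = {p:.2e}")
```

The resulting p-value is far below 0.00001, consistent with the significance level reported in the abstract.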

Citations

Citations to this article as recorded by Crossref
  • Response to letter to the editor Re: Advancements in large language model accuracy for answering physical medicine and rehabilitation board review questions
    Jason Bitterman, Alexander D'Angelo, Alexandra Holachek, James E. Eubanks
    PM&R.2026; 18(1): 109.     CrossRef
  • Artificial Intelligence as a Drug Information Resource: Limitations and Strategies to Optimize in Pharmacy Practice
    Christopher Soujah, Carole Bejjani, Nour Adra, Laura Blackburn
    Hospital Pharmacy.2026; 61(2): 117.     CrossRef
  • Evaluating the Performance of ChatGPT4.0 Versus ChatGPT3.5 on the Hand Surgery Self-Assessment Exam: A Comparative Analysis of Performance on Image-Based Questions
    Kiera L Vrindten, Megan Hsu, Yuri Han, Brian Rust, Heili Truumees, Brian M Katt
    Cureus.2025;[Epub]     CrossRef
  • Assessing the performance of large language models (GPT-3.5 and GPT-4) and accurate clinical information for pediatric nephrology
    Nadide Melike Sav
    Pediatric Nephrology.2025; 40(9): 2879.     CrossRef
  • Retrieval-augmented generation enhances large language model performance on the Japanese orthopedic board examination
    Juntaro Maruyama, Satoshi Maki, Takeo Furuya, Yuki Nagashima, Kyota Kitagawa, Yasunori Toki, Shuhei Iwata, Megumi Yazaki, Takaki Kitamura, Sho Gushiken, Yuji Noguchi, Masataka Miura, Masahiro Inoue, Yasuhiro Shiga, Kazuhide Inage, Sumihisa Orita, Seiji Oh
    Journal of Orthopaedic Science.2025; 30(6): 1193.     CrossRef
  • Advancements in large language model accuracy for answering physical medicine and rehabilitation board review questions
    Jason Bitterman, Alexander D'Angelo, Alexandra Holachek, James E. Eubanks
    PM&R.2025; 17(9): 1091.     CrossRef
  • Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis
    Ling Wang, Jinglin Li, Boyang Zhuang, Shasha Huang, Meilin Fang, Cunze Wang, Wen Li, Mohan Zhang, Shurong Gong
    Journal of Medical Internet Research.2025; 27: e64486.     CrossRef
  • OpenAI o1 Large Language Model Outperforms GPT-4o, Gemini 1.5 Flash, and Human Test Takers on Ophthalmology Board–Style Questions
    Ryan Shean, Tathya Shah, Sina Sobhani, Alan Tang, Ali Setayesh, Kyle Bolo, Van Nguyen, Benjamin Xu
    Ophthalmology Science.2025; 5(6): 100844.     CrossRef
  • A comparative analysis of DeepSeek R1, DeepSeek-R1-Lite, OpenAi o1 Pro, and Grok 3 performance on ophthalmology board-style questions
    Ryan Shean, Tathya Shah, Aditya Pandiarajan, Alan Tang, Kyle Bolo, Van Nguyen, Benjamin Xu
    Scientific Reports.2025;[Epub]     CrossRef
  • Assessing the utility of a natural language processing model in answering common urological questions
    Wyatt MacNevin, John‐David Brown, Nicholas Dawe, Jesse T. R. Spooner, Nicholas R. Paterson, Daniel T. Keefe, David G. Bell
    UroPrecision.2025;[Epub]     CrossRef
  • The performance of ChatGPT on medical image-based assessments and implications for medical education
    Xiang Yang, Wei Chen
    BMC Medical Education.2025;[Epub]     CrossRef
  • Applications, Challenges, and Prospects of Generative Artificial Intelligence Empowering Medical Education: Scoping Review
    Yuhang Lin, Zhiheng Luo, Zicheng Ye, Nuoxi Zhong, Lijian Zhao, Long Zhang, Xiaolan Li, Zetao Chen, Yijia Chen
    JMIR Medical Education.2025; 11: e71125.     CrossRef
  • ChatGPT’s role in the rapidly evolving hematologic cancer landscape
    Tiffany Nong, Sean Britton, Viralkumar Bhanderi, Justin Taylor
    Future Science OA.2025;[Epub]     CrossRef
  • Performance of ChatGPT-4 on the French Board of Plastic Reconstructive and Aesthetic Surgery written exam: a descriptive study
    Emma Dejean-Bouyer, Anoujat Kanlagna, François Thuau, Pierre Perrot, Ugo Lancien
    Journal of Educational Evaluation for Health Professions.2025; 22: 27.     CrossRef
  • Technologies, opportunities, challenges, and future directions for integrating generative artificial intelligence into medical education: a narrative review
    Junseok Kang, Jihyun Ahn
    Ewha Medical Journal.2025; 48(4): e53.     CrossRef
  • Performance of the ChatGPT-5 Language Model in Solving a Specialty Examination in Balneology and Physical Medicine
    Michalina Loson-Kawalec, Anna Kowalczyk, Dawid Szymanski, Patrycja Dadynska, Aleksander Tabor, Dawid Bartosik , Marta Zerek, Gracjan Sitarek, Bartosz Starzynski, Alina Keska, Bartlomiej Cwikla, Piotr Sawina, Tomasz Dolata, Adrianna Pielech, Maciej Majchrz
    Cureus.2025;[Epub]     CrossRef
  • Diagnostic accuracy and bias in open access and subscription-based large language models for multiple sclerosis and neuromyelitis optica spectrum disorder
    Tom G. Punnen, Kevin S. Shan, Mahi A. Patel, Morgan C. McCreary, Diem H. Tran, Jose R. Santoyo, Katy W. Burgess, Tatum M. Moog, Alexander D. Smith, Darin T. Okuda
    Intelligence-Based Medicine.2025; 12: 100314.     CrossRef
  • Privacy-by-Design Framework for Large Language Model Chatbots in Urology
    Eun Joung Kim, JungYoon Kim
    International Neurourology Journal.2025; 29(Suppl 2): S65.     CrossRef
  • Potential and pitfalls: accuracy versus adequacy of ChatGPT’s performance on surgery shelf examination
    Baylee Brochu, Michael D. Cobler-Lichter, Talia R. Arcieri, Nikita M. Shah, Jessica M. Delamater, Ana M. Reyes, Matthew S. Sussman, Edward B. Lineen, Laurence R. Sands, Vanessa W. Hui, Steven E. Rodgers, Chad M. Thorson
    Global Surgical Education - Journal of the Association for Surgical Education.2025;[Epub]     CrossRef
  • From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance
    Markus Kipp
    Information.2024; 15(9): 543.     CrossRef
  • Artificial Intelligence can Facilitate Application of Risk Stratification Algorithms to Bladder Cancer Patient Case Scenarios
    Max S Yudovich, Ahmad N Alzubaidi, Jay D Raman
    Clinical Medicine Insights: Oncology.2024;[Epub]     CrossRef
Performance of the Ebel standard-setting method for the spring 2019 Royal College of Physicians and Surgeons of Canada internal medicine certification examination consisting of multiple-choice questions  
Jimmy Bourque, Haley Skinner, Jonathan Dupré, Maria Bacchus, Martha Ainslie, Irene W. Y. Ma, Gary Cole
J Educ Eval Health Prof. 2020;17:12.   Published online April 20, 2020
DOI: https://doi.org/10.3352/jeehp.2020.17.12
  • 9,442 View
  • 183 Download
  • 11 Web of Science
  • 12 Crossref
Abstract
Purpose
This study aimed to assess the performance of the Ebel standard-setting method for the spring 2019 Royal College of Physicians and Surgeons of Canada internal medicine certification examination consisting of multiple-choice questions. Specifically, the following parameters were evaluated: inter-rater agreement, the correlations between Ebel scores and item facility indices, the impact of raters’ knowledge of correct answers on the Ebel score, and the effects of raters’ specialty on inter-rater agreement and Ebel scores.
Methods
Data were drawn from a Royal College of Physicians and Surgeons of Canada certification exam. The Ebel method was applied to 203 multiple-choice questions by 49 raters. Facility indices came from 194 candidates. We computed the Fleiss kappa and the Pearson correlations between Ebel scores and item facility indices. We investigated differences in the Ebel score according to whether correct answers were provided or not and differences between internists and other specialists using the t-test.
Results
The Fleiss kappa was below 0.15 for both facility and relevance. The correlation between Ebel scores and facility indices was low when correct answers were provided and negligible when they were not. The Ebel score was the same whether the correct answers were provided or not. Inter-rater agreement and Ebel scores were not significantly different between internists and other specialists.
Conclusion
Inter-rater agreement and correlations between item Ebel scores and facility indices were consistently low; furthermore, raters’ knowledge of the correct answers and raters’ specialty had no effect on Ebel scores in the present setting.
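The Fleiss kappa reported above can be computed directly from a table of per-item category counts. A minimal standard-library sketch follows; the rating matrices at the bottom are illustrative toy data, not the study's 49-rater dataset.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for inter-rater agreement.

    counts[i][j] = number of raters assigning item i to category j;
    every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item observed agreement
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Chance agreement from the overall category proportions
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    p_j = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Illustrative: 4 items, 3 raters, 2 categories
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]   # full agreement on every item
mixed = [[2, 1], [1, 2], [2, 1], [1, 2]]     # raters split on every item
print(fleiss_kappa(perfect))  # → 1.0
print(fleiss_kappa(mixed))
```

Values near 0 (such as the below-0.15 kappas reported above) indicate agreement barely better than chance; negative values indicate agreement worse than chance.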

Citations

Citations to this article as recorded by Crossref
  • Optimizing Radiology Resident Competency in Pediatric Musculoskeletal Radiograph Interpretation
    Kathy Boutis, Carl Starvaggi, Andrea S. Doria, Maryse Bouchard, Mark Camp, Jana Taylor, Cameron J. Hauge, Olivia Carter, Jennifer Stimec
    Academic Radiology.2026;[Epub]     CrossRef
  • Refining competency benchmarks: a scoping review of Angoff standard-setting in dental education
    Galvin Sim Siang Lin, Abdul Rauf Badrul Hisham, Muhammad Nazmi Abdul Majid, Chan Choong Foong, Ting Khee Ho, Lara T. Friedlander
    BMC Oral Health.2026;[Epub]     CrossRef
  • Sustained Improvement in Medical Students' Academic Performance in Renal Physiology Through the Application of Flipped Classroom
    Thana Thongsricome, Rahat Longsomboon, Nattacha Srithawatpong, Sarunyapong Atchariyapakorn, Kasiphak Kaikaew, Danai Wangsaturaka
    The Clinical Teacher.2025;[Epub]     CrossRef
  • Investigating assessment standards and fixed passing marks in dental undergraduate finals: a mixed-methods approach
    Ting Khee Ho, Lucy O’Malley, Reza Vahid Roudsari
    BMC Medical Education.2025;[Epub]     CrossRef
  • Standard setting for dental knowledge tests: reproducibility of the modified Angoff and Ebel method across judges
    Ting Khee Ho, Noor Lide Abu Kassim, Lucy O’Malley, Reza Vahid Roudsari
    BMC Medical Education.2025;[Epub]     CrossRef
  • Competency Standard Derivation for Point-of-Care Ultrasound Image Interpretation for Emergency Physicians
    Maya Harel-Sterling, Charisse Kwan, Jonathan Pirie, Mark Tessaro, Dennis D. Cho, Ailish Coblentz, Mohamad Halabi, Eyal Cohen, Lynne E. Nield, Martin Pusic, Kathy Boutis
    Annals of Emergency Medicine.2023; 81(4): 413.     CrossRef
  • The effects of a land-based home exercise program on surfing performance in recreational surfers
    Jerry-Thomas Monaco, Richard Boergers, Thomas Cappaert, Michael Miller, Jennifer Nelson, Meghan Schoenberger
    Journal of Sports Sciences.2023; 41(4): 358.     CrossRef
  • Medical specialty certification exams studied according to the Ottawa Quality Criteria: a systematic review
    Daniel Staudenmann, Noemi Waldner, Andrea Lörwald, Sören Huwendiek
    BMC Medical Education.2023;[Epub]     CrossRef
  • A Target Population Derived Method for Developing a Competency Standard in Radiograph Interpretation
    Michelle S. Lee, Martin V. Pusic, Mark Camp, Jennifer Stimec, Andrew Dixon, Benoit Carrière, Joshua E. Herman, Kathy Boutis
    Teaching and Learning in Medicine.2022; 34(2): 167.     CrossRef
  • Pediatric Musculoskeletal Radiographs: Anatomy and Fractures Prone to Diagnostic Error Among Emergency Physicians
    Winny Li, Jennifer Stimec, Mark Camp, Martin Pusic, Joshua Herman, Kathy Boutis
    The Journal of Emergency Medicine.2022; 62(4): 524.     CrossRef
  • Possibility of independent use of the yes/no Angoff and Hofstee methods for the standard setting of the Korean Medical Licensing Examination written test: a descriptive study
    Do-Hwan Kim, Ye Ji Kang, Hoon-Ki Park
    Journal of Educational Evaluation for Health Professions.2022; 19: 33.     CrossRef
  • Image interpretation: Learning analytics–informed education opportunities
    Elana Thau, Manuela Perez, Martin V. Pusic, Martin Pecaric, David Rizzuti, Kathy Boutis
    AEM Education and Training.2021;[Epub]     CrossRef
Review article
Overview and current management of computerized adaptive testing in licensing/certification examinations  
Dong Gi Seo
J Educ Eval Health Prof. 2017;14:17.   Published online July 26, 2017
DOI: https://doi.org/10.3352/jeehp.2017.14.17
  • 42,360 View
  • 407 Download
  • 16 Web of Science
  • 17 Crossref
Abstract
Computerized adaptive testing (CAT) has been implemented in high-stakes examinations such as the National Council Licensure Examination-Registered Nurses in the United States since 1994. Subsequently, the National Registry of Emergency Medical Technicians in the United States adopted CAT for certifying emergency medical technicians in 2007. This review provides an overview of CAT to support its implementation in medical and health licensing examinations. Most implementations of CAT are based on item response theory, which hypothesizes that both the examinee and items have their own characteristics that do not change. There are 5 steps for implementing CAT: first, determining whether the CAT approach is feasible for a given testing program; second, establishing an item bank; third, pretesting, calibrating, and linking item parameters via statistical analysis; fourth, determining the specification for the final CAT related to the 5 components of the CAT algorithm; and finally, deploying the final CAT after specifying all the necessary components. The 5 components of the CAT algorithm are as follows: item bank, starting item, item selection rule, scoring procedure, and termination criterion. CAT management includes content balancing, item analysis, item scoring, standard setting, practice analysis, and item bank updates. Remaining issues include the cost of constructing CAT platforms and deploying the computer technology required to build an item bank. In conclusion, in order to ensure more accurate estimations of examinees’ ability, CAT may be a good option for national licensing examinations. Measurement theory can support its implementation for high-stakes examinations.
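The 5 components of the CAT algorithm named above (item bank, starting item, item selection rule, scoring procedure, and termination criterion) can be sketched as a minimal adaptive loop under a 2PL item response theory model, with expected a posteriori (EAP) scoring on a grid and maximum-information item selection. The item bank, discrimination values, stopping threshold, and simulated examinee below are all illustrative assumptions, not values from any real examination.

```python
import math
import random

def prob(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap(responses, items):
    """EAP ability estimate and posterior SD under a N(0,1) prior."""
    grid = [-4 + 0.1 * k for k in range(81)]
    post = []
    for t in grid:
        w = math.exp(-0.5 * t * t)  # standard normal prior (unnormalized)
        for (a, b), u in zip(items, responses):
            p = prob(t, a, b)
            w *= p if u == 1 else (1.0 - p)
        post.append(w)
    total = sum(post)
    mean = sum(t * w for t, w in zip(grid, post)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, post)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, se_stop=0.4, max_items=25):
    """Adaptive loop: pick the most informative item, rescore, stop on SE."""
    used, responses, theta, se = [], [], 0.0, 1.0  # starting item at theta = 0
    remaining = list(bank)
    while remaining and len(used) < max_items and se > se_stop:
        # Item selection rule: maximize Fisher information at current theta
        item = max(remaining,
                   key=lambda ab: ab[0] ** 2 * prob(theta, *ab) * (1 - prob(theta, *ab)))
        remaining.remove(item)
        used.append(item)
        responses.append(answer(item))
        theta, se = eap(responses, used)  # scoring procedure
    return theta, se, len(used)

# Illustrative bank: 30 items, difficulties spread over [-3, 3], a = 1.5
bank = [(1.5, -3 + 6 * i / 29) for i in range(30)]
random.seed(0)
true_theta = 1.0
simulee = lambda ab: 1 if random.random() < prob(true_theta, *ab) else 0
theta, se, n = run_cat(bank, simulee)
print(f"estimate={theta:.2f}, SE={se:.2f}, items administered={n}")
```

Because each item is chosen where it is most informative for the current estimate, the loop typically terminates well before exhausting the bank, which is the efficiency argument for CAT made in the abstract.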

Citations

Citations to this article as recorded by Crossref
  • From Development to Validation: Exploring the Efficiency of Numetrive, a Computerized Adaptive Assessment of Numerical Reasoning
    Marianna Karagianni, Ioannis Tsaousis
    Behavioral Sciences.2025; 15(3): 268.     CrossRef
  • Global harmonization in advanced therapeutics: balancing innovation, safety, and access
    Ankit Dahiya, Kartikey Singh, Anunav Ashish, Nipun, Aayush Bhadyaria, Shubham Thakur, Manish Kumar, Ghanshyam Das Gupta, Balak Das Kurmi, Ravi Raj Pal
    Personalized Medicine.2025; 22(3): 181.     CrossRef
  • Development of a CAT based Diagnostic System for Assessing Basic Academic Skills in Undergraduate Students
    Woo-Jin Han, Jeongwook Choi, Dong-Gi Seo
    The Korean Association of General Education.2025; 19(3): 177.     CrossRef
  • Beyond the Score: Riding the Possibilities of Item Response Theory in Medical Education
    Gurjeet Singh, Nilakantan Ananthakrishnan, Raksha Singh
    CHRISMED Journal of Health and Research.2025; 12(3): 221.     CrossRef
  • Validation of the cognitive section of the Penn computerized adaptive test for neurocognitive and clinical psychopathology assessment (CAT-CCNB)
    Akira Di Sandro, Tyler M. Moore, Eirini Zoupou, Kelly P. Kennedy, Katherine C. Lopez, Kosha Ruparel, Lucky J. Njokweni, Sage Rush, Tarlan Daryoush, Olivia Franco, Alesandra Gorgone, Andrew Savino, Paige Didier, Daniel H. Wolf, Monica E. Calkins, J. Cobb S
    Brain and Cognition.2024; 174: 106117.     CrossRef
  • Comparison of real data and simulated data analysis of a stopping rule based on the standard error of measurement in computerized adaptive testing for medical examinations in Korea: a psychometric study
    Dong Gi Seo, Jeongwook Choi, Jinha Kim
    Journal of Educational Evaluation for Health Professions.2024; 21: 18.     CrossRef
  • The irtQ R package: a user-friendly tool for item response theory-based test data analysis and calibration
    Hwanggyu Lim, Kyungseok Kang
    Journal of Educational Evaluation for Health Professions.2024; 21: 23.     CrossRef
  • Implementing Computer Adaptive Testing for High-Stakes Assessment: A Shift for Examinations Council of Lesotho
    Musa Adekunle Ayanwale, Julia Chere-Masopha, Mapulane Mochekele, Malebohang Catherine Morena
    International Journal of New Education.2024;[Epub]     CrossRef
  • The current utilization of the patient-reported outcome measurement information system (PROMIS) in isolated or combined total knee arthroplasty populations
    Puneet Gupta, Natalia Czerwonka, Sohil S. Desai, Alirio J. deMeireles, David P. Trofa, Alexander L. Neuwirth
    Knee Surgery & Related Research.2023;[Epub]     CrossRef
  • Evaluating a Computerized Adaptive Testing Version of a Cognitive Ability Test Using a Simulation Study
    Ioannis Tsaousis, Georgios D. Sideridis, Hannan M. AlGhamdi
    Journal of Psychoeducational Assessment.2021; 39(8): 954.     CrossRef
  • Accuracy and Efficiency of Web-based Assessment Platform (LIVECAT) for Computerized Adaptive Testing
    Do-Gyeong Kim, Dong-Gi Seo
    The Journal of Korean Institute of Information Technology.2020; 18(4): 77.     CrossRef
  • Transformaciones en educación médica: innovaciones en la evaluación de los aprendizajes y avances tecnológicos (parte 2)
    Veronica Luna de la Luz, Patricia González-Flores
    Investigación en Educación Médica.2020; 9(34): 87.     CrossRef
  • Introduction to the LIVECAT web-based computerized adaptive testing platform
    Dong Gi Seo, Jeongwook Choi
    Journal of Educational Evaluation for Health Professions.2020; 17: 27.     CrossRef
  • Computerised adaptive testing accurately predicts CLEFT-Q scores by selecting fewer, more patient-focused questions
    Conrad J. Harrison, Daan Geerards, Maarten J. Ottenhof, Anne F. Klassen, Karen W.Y. Wong Riff, Marc C. Swan, Andrea L. Pusic, Chris J. Sidey-Gibbons
    Journal of Plastic, Reconstructive & Aesthetic Surgery.2019; 72(11): 1819.     CrossRef
  • Presidential address: Preparing for permanent test centers and computerized adaptive testing
    Chang Hwi Kim
    Journal of Educational Evaluation for Health Professions.2018; 15: 1.     CrossRef
  • Updates from 2018: Being indexed in Embase, becoming an affiliated journal of the World Federation for Medical Education, implementing an optional open data policy, adopting principles of transparency and best practice in scholarly publishing, and appreci
    Sun Huh
    Journal of Educational Evaluation for Health Professions.2018; 15: 36.     CrossRef
  • Linear programming method to construct equated item sets for the implementation of periodical computer-based testing for the Korean Medical Licensing Examination
    Dong Gi Seo, Myeong Gi Kim, Na Hui Kim, Hye Sook Shin, Hyun Jung Kim
    Journal of Educational Evaluation for Health Professions.2018; 15: 26.     CrossRef
