
JEEHP : Journal of Educational Evaluation for Health Professions

OPEN ACCESS
Review
Performance of large language models in medical licensing examinations: a systematic review and meta-analysis
Haniyeh Nouri1, Abdollah Mahdavi2, Ali Abedi3, Alireza Mohammadnia2, Mahnaz Hamedan2, Masoud Amanzadeh2*

DOI: https://doi.org/10.3352/jeehp.2025.22.36
Published online: November 18, 2025

1Student Research Committee, School of Medicine, Ardabil University of Medical Sciences, Ardabil, Iran

2Department of Health Information Management, School of Medicine, Ardabil University of Medical Sciences, Ardabil, Iran

3Department of Physiology, School of Medicine, Ardabil University of Medical Sciences, Ardabil, Iran

*Corresponding email: M.amanzadeh@arums.ac.ir

Editor: A Ra Cho, The Catholic University of Korea, Korea

• Received: September 23, 2025   • Accepted: October 30, 2025

© 2025 Korea Health Personnel Licensing Examination Institute

This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

  • Purpose
    This study systematically evaluates and compares the performance of large language models (LLMs) in answering medical licensing examination questions. By conducting subgroup analyses based on language, question format, and model type, this meta-analysis aims to provide a comprehensive overview of LLM capabilities in medical education and clinical decision-making.
  • Methods
    This systematic review, registered in PROSPERO and following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, searched MEDLINE (PubMed), Scopus, and Web of Science for relevant articles published up to February 1, 2025. The search strategy included Medical Subject Headings (MeSH) terms and keywords related to (“ChatGPT” OR “GPT” OR “LLM variants”) AND (“medical licensing exam*” OR “medical exam*” OR “medical education” OR “radiology exam*”). Eligible studies evaluated LLM accuracy on medical licensing examination questions. Pooled accuracy was estimated using a random-effects model, with subgroup analyses by LLM type, language, and question format. Publication bias was assessed using Egger’s regression test.
  • Results
    This systematic review identified 2,404 records. After removal of duplicates, title and abstract screening, and full-text review, 36 studies were included. The pooled accuracy was 72% (95% confidence interval, 70.0% to 75.0%) with high heterogeneity (I2=99%, P<0.001). Among LLMs, GPT-4 achieved the highest accuracy (81%), followed by Bing (79%), Claude (74%), Gemini/Bard (70%), and GPT-3.5 (60%) (P=0.001). Performance differences across languages (range, 62% in Polish to 77% in German) were not statistically significant (P=0.170).
  • Conclusion
    LLMs, particularly GPT-4, can match or exceed medical students’ examination performance and may serve as supportive educational tools. However, due to variability and the risk of errors, they should be used cautiously as complements rather than replacements for traditional learning methods.
Background
Large language models (LLMs) represent a category of artificial intelligence (AI) systems that have acquired human-like comprehension and reasoning capabilities through transformer architectures [1,2]. By leveraging deep learning and advanced artificial neural networks, these models can interpret relationships between characters and words and generate coherent text. They are trained on billions of parameters to automatically identify complex patterns and relationships within massive datasets [3]. In recent years, LLMs have demonstrated considerable potential across various domains, including programming, commerce, law, translation, and others [4-6]. This technology has also garnered substantial attention in medical sciences and healthcare, and the utilization of these models among students and faculty for health and medical examinations has recently become a prevalent practice [7,8]. These examinations serve as practical benchmarks for assessing model accuracy, and the results can be used for educational purposes and for comparison with healthcare students [9-11]. In recent years, numerous studies have investigated LLM performance on medical licensing examination questions in various languages and countries [12-18]. Nevertheless, despite the promising potential of these models, their application in healthcare contexts faces notable limitations. Given the sensitivity of the field, the generation of incorrect or misleading outputs due to hallucination, misinterpretation, bias, over-reliance, incomplete training, or lack of transparency continues to hinder full trust and adoption [19-22].
Objectives
Although many primary studies have examined the performance of LLMs across different medical examinations, their findings vary widely depending on model version, specialty domain, and question type. These inconsistencies make it difficult to draw a clear conclusion regarding the overall capabilities of LLMs in medical education. Therefore, a systematic review and meta-analysis is warranted. Previous reviews have often focused on a single model—most notably ChatGPT [4,9,23]—without comparing other emerging systems. In contrast, our meta-analysis evaluates multiple LLMs, including ChatGPT (OpenAI), Gemini (Google), Copilot (Microsoft), and Claude (Anthropic). Furthermore, we conducted subgroup analyses based on language, question format, and content type, which have received limited attention in earlier research. Considering these factors, this meta-analysis provides a comprehensive summary of the performance of 4 widely used LLMs across various health-related examinations, without restrictions by country or language. Ultimately, this study explores the capacity of these models to serve as supportive educational tools and offers insights into their potential application in decision-making within medical education.
Ethics statement
This study was based entirely on previously published literature; therefore, ethical approval and informed consent were not required.
Study design
This study followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Fig. 1) and was registered in PROSPERO (CRD420251055880).
Eligibility criteria
Eligibility criteria for study selection were defined based on the PICOS (population, intervention, comparison, outcome, study design) framework (Table 1).
Information sources
A comprehensive search of electronic databases, including MEDLINE (PubMed), Scopus, and Web of Science (WOS), was performed to identify relevant studies published up to February 1, 2025.
Search strategy
The search strategy used a combination of Medical Subject Headings (MeSH) terms and relevant keywords, including (“ChatGPT” OR “GPT” OR “Generative Pre-trained Transformer” OR “Gemini” OR “Bard” OR “Claude” OR “Copilot” OR “Bing” OR “large language model*” OR “LLM”) AND (“medical licensing exam*” OR “medical exam*” OR “medical license*” OR “medical education”). The complete search strategy is provided in Supplement 1. All retrieved records were managed in EndNote ver. 20.0 (Clarivate), and duplicates were removed prior to screening.
Selection process
Two investigators (M.A. and H.N.) independently and in duplicate screened all studies. Titles and abstracts were reviewed first to identify potentially eligible studies, followed by full-text assessments to confirm inclusion. Disagreements were resolved through discussion and consensus; if consensus was not reached, a third reviewer (M.H.) provided the final decision. The screening process was documented, including the number of articles reviewed and reasons for exclusion at each stage (Supplement 2).
Data collection process
Two reviewers (M.A. and H.N.) independently extracted data from all included studies using a standardized data-extraction form in Microsoft Excel (Microsoft Corp.). Any discrepancies were resolved through discussion, with arbitration by a third reviewer (M.H.) when necessary. The extraction form is available in Supplement 3.
Data items
The following information was extracted from each study: first author’s name, publication year, country, question language, question source, number of questions, question format, type of LLM used, and LLM accuracy. Accuracy was defined as the percentage of correct responses provided by the LLM out of the total number of examination questions. If accuracy was not explicitly reported, it was calculated by dividing the number of correct answers by the total number of questions. In studies evaluating multiple LLMs, data for each model were collected separately (Dataset 1).
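As an illustration of this calculation, the short Python sketch below derives a study-level accuracy and an exact (Clopper-Pearson) 95% confidence interval from the extracted counts. The function name and the example counts are illustrative assumptions, not part of the published workflow; for instance, a study reporting 90% accuracy on 750 questions corresponds to 675 correct answers.

```python
# Illustrative sketch (not the authors' code): study-level accuracy and an exact
# Clopper-Pearson 95% CI computed from the extracted counts described above.
from scipy.stats import beta

def accuracy_with_ci(n_correct: int, n_total: int, alpha: float = 0.05):
    """Accuracy = correct answers / total questions, with an exact binomial CI."""
    acc = n_correct / n_total
    lower = beta.ppf(alpha / 2, n_correct, n_total - n_correct + 1) if n_correct > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, n_correct + 1, n_total - n_correct) if n_correct < n_total else 1.0
    return acc, lower, upper

# Hypothetical example: 675 correct answers out of 750 questions (90% accuracy).
print(accuracy_with_ci(675, 750))
```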
Study risk of bias assessment
The quality and risk of bias of included studies were evaluated using a modified version of the QUADAS-2 tool (University of Bristol), previously applied in similar research [9]. This adapted framework comprises 21 items across 4 domains: question selection, index model, reference standard, and flow/timing. Two reviewers (M.A. and H.N.) conducted independent assessments, with discrepancies resolved by a third reviewer (M.H.).
Synthesis methods
A meta-analysis was performed using a random-effects model to estimate the pooled accuracy of LLMs. Forest plots were generated to visualize overall accuracy along with corresponding 95% confidence intervals (CIs). Heterogeneity across studies was assessed using the I2 statistic, with values greater than 50% indicating substantial heterogeneity. Subgroup analyses were performed based on LLM type (e.g., GPT-4, GPT-3.5, Gemini), question format (text-based or image-based), and question language (English vs. non-English).
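To make the pooling step concrete, the following is a minimal Python sketch of a DerSimonian-Laird random-effects pooled proportion on the logit scale, together with the I2 statistic. It is an illustrative stand-in rather than the authors' code: the published analysis was run in Stata, whose meta and metaprop commands apply their own transformations and weighting.

```python
# Minimal sketch: DerSimonian-Laird random-effects pooling of proportions (logit scale)
# with Cochran's Q and I^2. Assumes 0 < correct < total for every study.
import numpy as np

def pool_proportions(correct, total):
    correct = np.asarray(correct, dtype=float)
    total = np.asarray(total, dtype=float)
    p = correct / total                                # study-level accuracies
    yi = np.log(p / (1 - p))                           # logit-transformed proportions
    vi = 1.0 / correct + 1.0 / (total - correct)       # approximate within-study variances

    wi = 1.0 / vi                                      # fixed-effect (inverse-variance) weights
    theta_fe = np.sum(wi * yi) / np.sum(wi)
    Q = np.sum(wi * (yi - theta_fe) ** 2)              # Cochran's Q
    df = len(yi) - 1
    tau2 = max(0.0, (Q - df) / (np.sum(wi) - np.sum(wi ** 2) / np.sum(wi)))
    i2 = max(0.0, (Q - df) / Q) * 100 if Q > 0 else 0.0

    wi_re = 1.0 / (vi + tau2)                          # random-effects weights
    theta = np.sum(wi_re * yi) / np.sum(wi_re)
    se = np.sqrt(1.0 / np.sum(wi_re))
    inv_logit = lambda x: 1.0 / (1.0 + np.exp(-x))     # back-transform to a proportion
    ci = (inv_logit(theta - 1.96 * se), inv_logit(theta + 1.96 * se))
    return inv_logit(theta), ci, i2

# Example with three hypothetical studies (675/750, 155/180, and 98/196 correct).
print(pool_proportions([675, 155, 98], [750, 180, 196]))
```

Subgroup estimates can be obtained by applying the same pooling function separately to the studies in each stratum.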
To explore potential sources of heterogeneity, sensitivity analyses were conducted using the leave-one-out method, recalculating pooled accuracy after sequentially excluding each study. Publication bias was evaluated through visual inspection of funnel-plot asymmetry and statistically tested using Egger’s regression test; a significant result indicated potential publication bias. All statistical analyses were performed using the meta and metaprop packages in Stata ver. 17.0 (Stata Corp.).
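The two checks described above can likewise be sketched in Python. The illustrative functions below (not the authors' Stata code) re-pool the effect after omitting each study in turn and run Egger's regression of the standardized effect on precision; `yi` and `vi` denote study-level effect estimates (e.g., logit-transformed accuracies) and their variances.

```python
# Illustrative sketches of the leave-one-out sensitivity analysis and Egger's regression test.
import numpy as np
from scipy import stats

def dl_pool(yi, vi):
    """DerSimonian-Laird random-effects pooled estimate for effects yi with variances vi."""
    wi = 1.0 / vi
    theta_fe = np.sum(wi * yi) / np.sum(wi)
    Q = np.sum(wi * (yi - theta_fe) ** 2)
    df = len(yi) - 1
    tau2 = max(0.0, (Q - df) / (np.sum(wi) - np.sum(wi ** 2) / np.sum(wi)))
    wi_re = 1.0 / (vi + tau2)
    return np.sum(wi_re * yi) / np.sum(wi_re)

def leave_one_out(yi, vi):
    """Pooled estimate recomputed after omitting each study in turn."""
    yi, vi = np.asarray(yi, float), np.asarray(vi, float)
    mask = np.ones(len(yi), dtype=bool)
    results = []
    for i in range(len(yi)):
        mask[i] = False
        results.append(dl_pool(yi[mask], vi[mask]))
        mask[i] = True
    return results

def egger_test(yi, vi):
    """Egger's test: regress yi/se on 1/se; an intercept far from 0 suggests small-study effects."""
    yi, vi = np.asarray(yi, float), np.asarray(vi, float)
    se = np.sqrt(vi)
    z, precision = yi / se, 1.0 / se
    X = np.column_stack([np.ones_like(precision), precision])
    b, *_ = np.linalg.lstsq(X, z, rcond=None)          # ordinary least squares
    resid = z - X @ b
    dof = len(z) - 2
    cov = (resid @ resid / dof) * np.linalg.inv(X.T @ X)
    t_stat = b[0] / np.sqrt(cov[0, 0])
    return b[0], 2 * stats.t.sf(abs(t_stat), dof)       # intercept and two-sided P-value
```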
Study selection
In this systematic review, 2,404 records were identified across the databases. After removing duplicates and screening titles and abstracts, 127 articles were selected for full-text review. Following the eligibility assessment, 36 articles met the inclusion criteria and were included in the final analysis. The study selection process is presented in Fig. 1.
Study characteristics
The characteristics of the included studies are summarized in Table 2. The studies were published between 2023 and 2024, with 21 appearing in 2024 and 15 in 2023. They originated from various countries, including Germany (n=5) [12,16,24-27], Japan (n=5) [18,28-31], Poland (n=5) [14,32-35], USA (n=4) [36-39], China (n=4) [17,40-43], Peru (n=2) [44,45], Saudi Arabia (n=2) [46,47], Taiwan (n=2) [48,49], Brazil (n=1) [23], UK (n=1) [50], Australia (n=1) [15], Belgium (n=1) [51], Chile (n=1) [52], Iran (n=1) [53], and Spain (n=1) [13]. Regarding the language of the questions, most studies used English (n=11), while others used Chinese (n=6), Japanese (n=5), Polish (n=5), Spanish (n=4), German (n=2), Portuguese (n=1), Arabic (n=1), and Italian (n=1).
The number of questions per study ranged from 95 to 2,700. Twenty-seven studies used text-based questions exclusively, while 9 combined text and images. In terms of question format, 31 studies employed multiple-choice questions (MCQs), 3 used single-choice questions (SCQs), and 2 incorporated both MCQs and SCQs. Further study details are provided in Supplement 4.
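For readers working with Dataset 1, a hypothetical pandas sketch of how such descriptive counts could be tabulated is shown below; the column names ("study", "country", "language", "question_format", "n_questions") are assumptions for illustration and may not match the actual headers of the published file.

```python
# Hypothetical sketch: tabulating study characteristics from the extraction sheet (Dataset 1).
import pandas as pd

df = pd.read_excel("jeehp-22-36-dataset1.xlsx")     # one row per LLM evaluation (assumed layout)

studies = df.drop_duplicates(subset="study")         # collapse to one row per included study
print(studies["country"].value_counts())             # e.g., Germany, Japan, Poland, ...
print(studies["language"].value_counts())            # English, Chinese, Japanese, ...
print(studies["question_format"].value_counts())     # text vs. text and image
print(studies["n_questions"].agg(["min", "max"]))    # range of questions per study
```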
Risk of bias in studies
Using the QUADAS-2 method, we evaluated study quality and potential bias. Most studies demonstrated a low risk of bias across all domains; however, some uncertainties were noted in the index test and flow/timing domains. Detailed quality assessment results are provided in Supplements 5 and 6.
Accuracy of LLMs
Across the included studies, 63 separate LLM evaluations were analyzed. These comprised GPT-4 (n=30), GPT-3.5 (n=23), Bard/Gemini (n=3), Claude (n=3), and Bing (n=4). Reported LLM accuracy ranged from 43% for GPT-3.5 to 90% for GPT-4. The pooled accuracy of all models was 72% (95% CI, 70.0%–75.0%). The heterogeneity across studies was significant (I2=99.99%, P<0.001). A forest plot summarizing pooled accuracy is shown in Fig. 2.
Subgroup analysis of the accuracy of LLMs
Subgroup analyses were conducted based on LLM type, question language, and question format (Table 3). By model type, pooled accuracy was highest for GPT-4 (81%), followed by Bing (79%), Claude (74%), Gemini/Bard (70%), and GPT-3.5 (60%), with statistically significant differences among models (P=0.001). When stratified by question format, studies incorporating both text- and image-based questions achieved slightly higher pooled accuracy compared with text-only studies (78% vs. 71%, P=0.03). In terms of question language, pooled LLM accuracy ranged from 62% to 77%, with the highest accuracy observed for German and the lowest for Polish. These differences, however, were not statistically significant (P=0.17). Detailed subgroup results and corresponding forest plots are provided in Supplements 7–11.
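The "test of group differences" reported in Table 3 compares the subgroup pooled estimates with a between-group Q statistic on k−1 degrees of freedom. A minimal illustrative sketch is shown below; it assumes the subgroup estimates and standard errors from the random-effects pooling are already available and does not reproduce the exact Stata procedure used for the published analysis.

```python
# Minimal sketch of a between-subgroup heterogeneity (Q-between) test.
import numpy as np
from scipy import stats

def subgroup_difference_test(estimates, standard_errors):
    """Q-between test across subgroup pooled estimates (e.g., one per LLM type)."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(standard_errors, dtype=float)
    w = 1.0 / se ** 2
    grand_mean = np.sum(w * est) / np.sum(w)
    q_between = np.sum(w * (est - grand_mean) ** 2)
    dof = len(est) - 1
    return q_between, stats.chi2.sf(q_between, dof)    # statistic and P-value
```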
Sensitivity analysis and publication bias
A sensitivity analysis using a leave-one-out approach was conducted to explore potential sources of heterogeneity. The results indicated that excluding individual studies did not meaningfully alter the pooled accuracy (Supplement 12). Publication bias was assessed through funnel-plot visualization and Egger’s regression test (Supplement 13). Visual inspection revealed no substantial asymmetry, consistent with Egger’s test results showing no evidence of small-study effects, thereby supporting the absence of publication bias (P=0.75).
Discussion

In this systematic review and meta-analysis, we examined 36 peer-reviewed studies comprising 63 LLM evaluations on medical licensing examination questions. The analysis encompassed diverse languages, question formats, and examination contexts. The overall pooled accuracy of the models was 72%, reflecting generally strong performance. Among the LLMs assessed, GPT-4, GPT-3.5, Google Gemini/Bard, and Microsoft Bing Chat were the most frequently evaluated. GPT-4 demonstrated the highest pooled accuracy (81%), followed by Bing (79%), Claude (74%), Gemini/Bard (70%), and GPT-3.5 (60%), with significant differences among models.

A recent study by Liu et al. [9] reported pooled accuracies of 81% for GPT-4 and 58% for GPT-3, findings consistent with our results. In another study, Waldock et al. [10] found a pooled accuracy of 61% for all LLMs and 64% for ChatGPT. The discrepancy may reflect both temporal improvements in model performance and methodological differences: Waldock et al. [10] combined GPT-3.5 and GPT-4 in their estimates, whereas our analysis reports each model separately. The observed heterogeneity (I2=99%) underscores the variation in study design, question type, language, and model selection, reflecting real-world complexity but complicating direct comparisons.

Subgroup analyses confirmed that model type substantially influences performance, with GPT-4 consistently surpassing earlier models. This trend aligns with ongoing advances in natural language processing and highlights the importance of continuous model refinement for specialized medical applications. Our results support previous findings demonstrating GPT-4’s superior performance in clinical reasoning, discipline-specific tasks, and general test-taking ability [9]. For example, studies in ophthalmology and general medicine have shown that GPT-4 performs at or above the level of medical students when responding to board-style questions [11]. These findings collectively suggest that model architecture, training data scope, and domain-specific tuning play critical roles in enhancing diagnostic and interpretive accuracy in medical contexts.
The subgroup analysis further revealed that incorporating image-based content significantly improved pooled accuracy—78% compared with 71% for text-only formats (P=0.030). This improvement aligns with studies showing that multimodal prompting, particularly in GPT-4, enhances diagnostic reasoning and clinical comprehension [54]. Although LLM performance varied across languages—from 62% in Polish to 77% in German—these differences were not statistically significant, contrasting with earlier reports that emphasized language-related disparities [21,55]. The convergence observed here may reflect improved multilingual training and cross-lingual generalization in recent model architectures, especially GPT-4.
The remarkable accuracy of models such as GPT-4 and Bing underscores their potential as valuable tools in medical education, particularly in resource-limited settings where access to expert instructors is constrained. Their consistent accuracy across languages and question types also suggests the feasibility of global deployment. Furthermore, their ability to process image-based questions highlights new opportunities for training in clinically relevant skills such as radiologic interpretation. Despite these encouraging results, substantial heterogeneity (I2>95%) indicates that variations in study design, evaluation criteria, and question sources remain influential. Differences in model versions (e.g., GPT-3.5 vs. GPT-4), question difficulty, and dataset size likely contributed to this variability. The inclusion of both official examination questions and public question banks may also account for part of the heterogeneity observed. These methodological differences should therefore be considered when interpreting pooled estimates.
This study has several limitations. First, the high heterogeneity among studies—likely due to differences in question sources, languages, test formats, and study designs—limits the generalizability of the results. Second, a substantial proportion of the reviewed studies focused on OpenAI models (GPT-3.5 and GPT-4), which may have introduced bias into the analysis. Third, detailed information on prompt design, question difficulty, or the specific model versions used was not available in all studies. Furthermore, because of continuous model updates and new releases, the performance reported at a given point in time may not accurately reflect the future performance of these models. Another limitation is the potential overlap between examination questions and the training data of LLMs. As many of the question banks used in the included studies were publicly available, it is possible that some items were encountered by the models during pretraining, which could have slightly inflated performance estimates. However, our subgroup analysis based on question source showed only a minimal difference in accuracy (71% for internal exams vs. 73% for public databases), suggesting that this effect was limited. Nonetheless, potential data contamination should be taken into account when interpreting the results, particularly for widely trained models such as GPT-4.
Future research should focus on improving LLM architectures for medical reasoning, exploring their ability to directly interpret imaging data, and evaluating their impact on real-world diagnostic workflows. In addition, longitudinal studies examining the integration of LLMs into medical curricula—particularly in simulation-based training and formative assessments—could provide valuable frameworks for advancing the practical application of this technology in medical education. Ultimately, while current LLMs show considerable promise, their integration into medical education and practice must proceed with careful awareness of their limitations, ensuring that AI supports rather than replaces expert human judgment. Ethical considerations and potential challenges associated with these models must also be carefully addressed.
Conclusion

This systematic review and meta-analysis highlight the promising performance of LLMs, particularly GPT-4, in answering medical licensing examination questions. Our findings indicate that LLMs hold significant potential to enhance medical education, assessment, and clinical decision-making. Despite substantial progress, the high heterogeneity among studies and the intrinsic limitations of LLMs suggest that these technologies are not yet ready to replace traditional educational resources or to be used independently in formal assessments. Therefore, LLMs should be regarded as supportive educational tools that complement conventional learning methods, guided by ethical principles and scientific standards. Future efforts should aim to develop standardized assessment protocols and conduct comparative research in authentic educational and clinical settings to ensure the safe and effective implementation of these technologies.

Authors’ contributions

Conceptualization: MA, HN, AM, MH, AA, AM. Data curation: MA, HN. Methodology/formal analysis/validation: MA, HN, MH. Project administration: none. Funding acquisition: none. Writing–original draft: MA. Writing–review and editing: HN, AM, MH, AA, AM.

Conflict of interest

No potential conflict of interest relevant to this article was reported.

Funding

None.

Data availability

Data files are available from https://doi.org/10.7910/DVN/KMHVAF

Dataset 1. Extracted data for meta-analysis.

jeehp-22-36-dataset1.xlsx

Acknowledgments

None.

Supplement files are available from https://doi.org/10.7910/DVN/KMHVAF
Supplement 1. Search strategy for retrieval of publications from databases.
jeehp-22-36-suppl1.docx
Supplement 2. Excluded studies with reasons.
jeehp-22-36-suppl2.docx
Supplement 3. Data extraction form.
jeehp-22-36-suppl3.docx
Supplement 4. Additional details of the included studies.
jeehp-22-36-suppl4.docx
Supplement 5. Quality assessment of the included studies.
jeehp-22-36-suppl5.docx
Supplement 6. Detailed results of the quality assessment.
jeehp-22-36-suppl6.docx
Supplement 7. Subgroup analysis of LLM accuracy based on source of questions.
jeehp-22-36-suppl7.docx
Supplement 8. Subgroup analysis of LLM accuracy based on different models.
jeehp-22-36-suppl8.docx
Supplement 9. Subgroup analysis of LLM accuracy based on question language.
jeehp-22-36-suppl9.docx
Supplement 10. Subgroup analysis of LLM accuracy based on question format.
jeehp-22-36-suppl10.docx
Supplement 11. Subgroup analysis of LLM accuracy based on question type.
jeehp-22-36-suppl11.docx
Supplement 12. Leave-one-out meta-analysis.
jeehp-22-36-suppl12.docx
Supplement 13. Funnel plot for publication bias assessment.
jeehp-22-36-suppl13.docx
Supplement 14. Audio recording of the abstract.
jeehp-22-36-abstract-recording.avi
Fig. 1.
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram of study selection. LLM, large language model.
jeehp-22-36f1.jpg
Fig. 2.
Forest plot for the pooled accuracy of large language models (LLMs). CI, confidence interval.
jeehp-22-36f2.jpg
Graphical abstract
jeehp-22-36f3.jpg
Table 1.
Inclusion and exclusion criteria based on the PICOS framework
Population (P)
 Inclusion: Studies assessing the accuracy of LLMs in answering medical license examination questions.
 Exclusion: Studies not involving answers to medical license examination questions.
Intervention (I)
 Inclusion: Studies that use LLMs (e.g., GPT-4, Gemini, Claude) to answer questions.
 Exclusion: Research that does not use LLMs.
Comparator (C)
 Inclusion: Studies evaluating the accuracy of LLMs against human performance (such as medical students) or other AI models.
 Exclusion: -
Outcome (O)
 Inclusion: Studies reporting the accuracy of LLMs, or providing data for calculating accuracy (number of correct answers and total number of questions).
 Exclusion: Studies that do not report accuracy or lack sufficient data for metric calculation.
Study design (S)
 Inclusion: Original peer-reviewed studies that examine the performance of LLMs in medical license examinations, published in English regardless of the language of the examination questions.
 Exclusion: Reviews, editorials, commentaries, letters to the editor, case series, case reports, conference abstracts, and preprint articles.

LLMs, large language models; AI, artificial intelligence.

Table 2.
Characteristics of the included studies
Study (year) LLMs Country Language Question type Question format No. of questions ACC (%)
Rodrigues Alessi et al. [23] (2024) GPT-3.5 Brazil Portuguese MCQs Text 333 68
Alfertshofer et al. [24] (2024) GPT-4 Germany Italian MCQs Text 300 73
Aljindan et al. [46] (2023) GPT-4 Saudi Arabia Arabic MCQs Text 220 87
Bicknell et al. [36] (2024) GPT-4 USA English MCQs Text 750 90
GPT-3.5 60
Ebrahimian et al. [53] (2023) GPT-4 Iran English MCQs Text 200 69
Fang et al. [40] (2023) GPT-4 China Chinese MCQs Text 600 74
Flores-Cohaila et al. [44] (2023) GPT-4 Peru Spanish MCQs Text 180 86
GPT-3.5 77
Funk et al. [12] (2024) GPT-3.5 Germany English MCQs Text 2,700 58
GPT-4 86
Garabet et al. [37] (2024) GPT-4 USA English MCQs Text 1,300 86
Guillen-Grima et al. [13] (2023) GPT-3.5 Spain Spanish MCQs Text-image 182 63
GPT-4 87
Haze et al. [28] (2023) GPT-4 Japan Japanese MCQs–SCQs Text 861 81
GPT-3.5 56
Huang et al. [48] (2024) GPT-4 Taiwan Chinese MCQs Text-image 600 88
Jaworski et al. [14] (2024) GPT-3.5 Poland Polish MCQs Text 196 50
GPT-4 78
Kleinig et al. [15] (2023) GPT-4 Australia English MCQs Text 150 79
GPT-3.5 66
Bing 72
Knoedler et al. [25] (2024) GPT-4 Germany English MCQs Text-image 299 85
GPT-3.5 Text 1,840 57
Kufel et al. [32] (2024) GPT-3.5 Poland Polish MCQs–SCQs Text 117 56
Lai et al. [50] (2023) GPT-4 UK English SCQs Text 573 76
Lin et al. [49] (2024) GPT-4 Taiwan Chinese SCQs Text-image 1,280 82
Liu et al. [9] (2024) GPT-4 Japan Japanese MCQs Text-image 790 89
Gemini 80
Claude 82
Mackey et al. [38] (2024) GPT-4 USA English MCQs Text 900 89
Meo et al. [47] (2023) GPT-3.5 Saudi Arabia English MCQs Text 100 72
Meyer et al. [16] (2024) GPT-3.5 Germany German MCQs Text-image 937 58
GPT-4 85
Ming et al. [17] (2024) GPT-3.5 China Chinese MCQs–SCQs Text 600 54
GPT-4 73
Morreel et al. [51] (2024) GPT-3.5 Belgium English MCQs Text 95 67
GPT-4 76
Claude 67
Bing 76
Bard 62
Nakao et al. [30] (2024) GPT-4 Japan Japanese MCQs Text-image 108 68
Rojas et al. [52] (2024) GPT-3.5 Chile Spanish MCQs Text-image 540 57
GPT-4 79
Roos et al. [27] (2023) GPT-4 Germany German MCQs Text-image 630 88
Bing 86
GPT-3.5 66
Shieh et al. [39] (2024) GPT-3.5 USA English MCQs Text 109 48
GPT-4 78
Siebielec et al. [33] (2024) GPT-3.5 Poland Polish MCQs Text 980 60
Suwała et al. [34] (2024) GPT-3.5 Poland Polish MCQs Text 2,138 59
Tanaka et al. [18] (2024) GPT-3.5 Japan Japanese MCQs Text 397 63
GPT-4 81
Tong et al. [41] (2023) GPT-4 China Chinese SCQs Text 160 81
Torres-Zegarra et al. [45] (2023) GPT-3.5 Peru Spanish MCQs Text 180 69
GPT-4 87
Bing 82
Bard 69
Claude 72
Wojcik et al. [35] (2024) GPT-4 Poland Polish MCQs Text 120 67
Yanagita et al. [31] (2023) GPT-3.5 Japan Japanese MCQs Text 292 43
GPT-4 81
Zong et al. [43] (2024) GPT-3.5 China Chinese MCQs Text 600 60

LLM, large language model; ACC, accuracy; MCQ, multiple choice question; SCQ, single choice question.

Table 3.
The results of subgroup analysis
Subgroup No. (%) Accuracy (%) 95% CI (%) I2 (%) P-value Test of group differences
Overall 63 (100) 72 70.0–75.0 99 0.001 -
LLM types 0.001
 GPT-4 30 (48) 81 79.0–83.0 99 0.001
 Bing 4 (6) 79 73.0–85.0 99 0.001
 GPT-3.5 23 (36) 60 57.0–63.0 99 0.001
 Claude 3 (5) 74 65.0–82.0 99 0.001
 Gemini/Bard 3 (5) 70 60.0–81.0 99 0.001
Question language 0.17
 English 21 (33) 72 64.0–77.0 99 0.001
 Chinese 7 (11) 73 64.0–82.0 99 0.001
 German 5 (8) 77 65.0–89.0 99 0.001
 Japanese 10 (16) 72 63.0–81.0 99 0.001
 Polish 6 (10) 62 54.0–69.0 99 0.001
 Spanish 11 (17) 75 69.0–81.0 99 0.001
 Other 3 (5) 76 65.0–87.0 99 0.001
Question format 0.03
 Text 47 (75) 71 67.0–74.0 99 0.001
 Text and image 16 (25) 78 72.0–83.0 99 0.001
Question type 0.001
 MCQ 55 (87) 73 70.0–76.0 99 0.001
 MCQ and SCQ 5 (8) 64 53.0–75.0 99 0.001
 SCQ 3 (5) 72 70.0–75.0 99 0.001
Source of questions 0.63
 Public database 54 (86) 73 69.0–76.0 99 0.001
 Internal exam 9 (14) 71 65.0–77.0 99 0.001

CI, confidence interval; LLM, large language model; MCQ, multiple choice question; SCQ, single choice question.

  • 1. Kasneci E, Seßler K, Kuchemann S, Bannert M, Dementieva D, Fischer F, Gasser U, Groh G, Gunnemann S, Hullermeier E, Krusche S. ChatGPT for good?: on opportunities and challenges of large language models for education. Learn Individ Differ 2023;103:102274. https://doi.org/10.1016/j.lindif.2023.102274
  • 2. Shool S, Adimi S, Saboori Amleshi R, Bitaraf E, Golpira R, Tara M. A systematic review of large language model (LLM) evaluations in clinical medicine. BMC Med Inform Decis Mak 2025;25:117. https://doi.org/10.1186/s12911-025-02954-4
  • 3. Toloka Team. The history, timeline, and future of LLMs: Essential ML Guide [Internet]. Toloka Team; 2023 [cited 2025 Sep 10]. Available from: https://toloka.ai/blog/history-of-llms/
  • 4. Jin HK, Lee HE, Kim E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC Med Educ 2024;24:1013. https://doi.org/10.1186/s12909-024-05944-8
  • 5. Okonkwo CW, Ade-Ibijola A. Chatbots applications in education: a systematic review. Comput Educ Artif Intell 2021;2:100033. https://doi.org/10.1016/j.caeai.2021.100033
  • 6. Ray PP. ChatGPT: a comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet Things Cyber-Phys Syst 2023;3:121-154. https://doi.org/10.1016/j.iotcps.2023.04.003
  • 7. Lee H. The rise of ChatGPT: Exploring its potential in medical education. Anat Sci Educ 2024;17:926-931. https://doi.org/10.1002/ase.2270
  • 8. Telenti A, Auli M, Hie BL, Maher C, Saria S, Ioannidis JP. Large language models for science and medicine. Eur J Clin Invest 2024;54:e14183. https://doi.org/10.1111/eci.14183
  • 9. Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, Kiuchi T. Performance of ChatGPT across different versions in medical licensing examinations worldwide: systematic review and meta-analysis. J Med Internet Res 2024;26:e60807. https://doi.org/10.2196/60807
  • 10. Waldock WJ, Zhang J, Guni A, Nabeel A, Darzi A, Ashrafian H. The accuracy and capability of artificial intelligence solutions in health care examinations and certificates: systematic review and meta-analysis. J Med Internet Res 2024;26:e56532. https://doi.org/10.2196/56532
  • 11. Wu JH, Nishida T, Liu TY. Accuracy of large language models in answering ophthalmology board-style questions: a meta-analysis. Asia Pac J Ophthalmol (Phila) 2024;13:100106. https://doi.org/10.1016/j.apjo.2024.100106
  • 12. Funk PF, Hoch CC, Knoedler S, Knoedler L, Cotofana S, Sofo G, Bashiri Dezfouli A, Wollenberg B, Guntinas-Lichius O, Alfertshofer M. ChatGPT’s response consistency: a study on repeated queries of medical examination questions. Eur J Investig Health Psychol Educ 2024;14:657-668. https://doi.org/10.3390/ejihpe14030043
  • 13. Guillen-Grima F, Guillen-Aguinaga S, Guillen-Aguinaga L, Alas-Brun R, Onambele L, Ortega W, Montejo R, Aguinaga-Ontoso E, Barach P, Aguinaga-Ontoso I. Evaluating the efficacy of ChatGPT in navigating the Spanish Medical Residency Entrance Examination (MIR): promising horizons for AI in clinical medicine. Clin Pract 2023;13:1460-1487. https://doi.org/10.3390/clinpract13060130
  • 14. Jaworski A, Jasinski D, Jaworski W, Hop A, Janek A, Slawinska B, Konieczniak L, Rzepka M, Jung M, Syslo O, Jarzabek V, Blecha Z, Harazinski K, Jasinska N. Comparison of the performance of artificial intelligence versus medical professionals in the Polish Final Medical Examination. Cureus 2024;16:e66011. https://doi.org/10.7759/cureus.66011
  • 15. Kleinig O, Gao C, Bacchi S. This too shall pass: the performance of ChatGPT-3.5, ChatGPT-4 and New Bing in an Australian medical licensing examination. Med J Aust 2023;219:237. https://doi.org/10.5694/mja2.52061
  • 16. Meyer A, Riese J, Streichert T. Comparison of the performance of GPT-3.5 and GPT-4 with that of medical students on the Written German Medical Licensing Examination: observational study. JMIR Med Educ 2024;10:e50965. https://doi.org/10.2196/50965
  • 17. Ming S, Guo Q, Cheng W, Lei B. Influence of model evolution and system roles on ChatGPT’s performance in Chinese Medical Licensing Exams: comparative study. JMIR Med Educ 2024;10:e52784. https://doi.org/10.2196/52784
  • 18. Tanaka Y, Nakata T, Aiga K, Etani T, Muramatsu R, Katagiri S, Kawai H, Higashino F, Enomoto M, Noda M, Kometani M, Takamura M, Yoneda T, Kakizaki H, Nomura A. Performance of generative pretrained transformer on the National Medical Licensing Examination in Japan. PLOS Digit Health 2024;3:e0000433. https://doi.org/10.1371/journal.pdig.0000433
  • 19. Briganti G. A clinician’s guide to large language models. Future Med AI 2023;1:FMAI1. https://doi.org/10.2217/fmai-2023-0003
  • 20. Jiang Y, Qiu R, Zhang Y, Zhang PF. Balanced and explainable social media analysis for public health with large language models. In: Bao Z, Borovica-Gajic R, Qiu R, Choudhury F, Yang Z, editors. Databases theory and applications. Proceedings of the 34th Australasian Database Conference, ADC 2023; 2023 Nov 1-3; Melbourne, Australia. Springer-Verlag; 2023. p. 73-86. https://doi.org/10.1007/978-3-031-47843-7_6
  • 21. Mahdavi A, Amanzadeh M, Hamedan M. The role of large language models in modern medical education: opportunities and challenges. Shiraz E-Med J 2024;25:e144847. https://doi.org/10.5812/semj-144847
  • 22. Nazi ZA, Peng W. Large language models in healthcare and medical domain: a review. Informatics 2024;11:57. https://doi.org/10.3390/informatics11030057
  • 23. Rodrigues Alessi M, Gomes HA, Lopes de Castro M, Terumy Okamoto C. Performance of ChatGPT in solving questions from the Progress Test (Brazilian National Medical Exam): a potential artificial intelligence tool in medical practice. Cureus 2024;16:e64924. https://doi.org/10.7759/cureus.64924
  • 24. Alfertshofer M, Hoch CC, Funk PF, Hollmann K, Wollenberg B, Knoedler S, Knoedler L. Sailing the seven seas: a multinational comparison of ChatGPT’s performance on medical licensing examinations. Ann Biomed Eng 2024;52:1542-1545. https://doi.org/10.1007/s10439-023-03338-3
  • 25. Knoedler L, Alfertshofer M, Knoedler S, Hoch CC, Funk PF, Cotofana S, Maheta B, Frank K, Brebant V, Prantl L, Lamby P. Pure wisdom or Potemkin villages?: a comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE step 3 style questions: quantitative analysis. JMIR Med Educ 2024;10:e51148. https://doi.org/10.2196/51148
  • 26. Knoedler L, Knoedler S, Hoch CC, Prantl L, Frank K, Soiderer L, Cotofana S, Dorafshar AH, Schenck T, Vollbach F, Sofo G, Alfertshofer M. In-depth analysis of ChatGPT’s performance based on specific signaling words and phrases in the question stem of 2377 USMLE step 1 style questions. Sci Rep 2024;14:13553. https://doi.org/10.1038/s41598-024-63997-7
  • 27. Roos J, Kasapovic A, Jansen T, Kaczmarczyk R. Artificial intelligence in medical education: comparative analysis of ChatGPT, Bing, and medical students in Germany. JMIR Med Educ 2023;9:e46482. https://doi.org/10.2196/46482
  • 28. Haze T, Kawano R, Takase H, Suzuki S, Hirawa N, Tamura K. Influence on the accuracy in ChatGPT: differences in the amount of information per medical field. Int J Med Inform 2023;180:105283. https://doi.org/10.1016/j.ijmedinf.2023.105283
  • 29. Liu M, Okuhara T, Dai Z, Huang W, Gu L, Okada H, Furukawa E, Kiuchi T. Evaluating the effectiveness of advanced large language models in medical knowledge: a comparative study using Japanese national medical examination. Int J Med Inform 2025;193:105673. https://doi.org/10.1016/j.ijmedinf.2024.105673
  • 30. Nakao T, Miki S, Nakamura Y, Kikuchi T, Nomura Y, Hanaoka S, Yoshikawa T, Abe O. Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: evaluation study. JMIR Med Educ 2024;10:e54393. https://doi.org/10.2196/54393
  • 31. Yanagita Y, Yokokawa D, Uchida S, Tawara J, Ikusaka M. Accuracy of ChatGPT on medical questions in the National Medical Licensing Examination in Japan: evaluation study. JMIR Form Res 2023;7:e48023. https://doi.org/10.2196/48023
  • 32. Kufel J, Bielowka M, Rojek M, Mitrega A, Czogalik L, Kaczynska D, Kondol D, Palkij K, Mielcarska S. Assessing ChatGPT’s performance in national nuclear medicine specialty examination: an evaluative analysis. Iran J Nucl Med 2024;32:60-65. https://doi.org/10.22034/IRJNM.2023.129434.1580
  • 33. Siebielec J, Ordak M, Oskroba A, Dworakowska A, Bujalska-Zadrozny M. Assessment study of ChatGPT-3.5’s performance on the final Polish Medical Examination: accuracy in answering 980 questions. Healthcare (Basel) 2024;12:1637. https://doi.org/10.3390/healthcare12161637
  • 34. Suwala S, Szulc P, Guzowski C, Kaminska B, Dorobiała J, Wojciechowska K, Berska M, Kubicka O, Kosturkiewicz O, Kosztulska B, Rajewska A, Junik R. ChatGPT-3.5 passes Poland’s medical final examination: is it possible for ChatGPT to become a doctor in Poland? SAGE Open Med 2024;12:20503121241257777. https://doi.org/10.1177/20503121241257777
  • 35. Wojcik S, Rulkiewicz A, Pruszczyk P, Lisik W, Pobozy M, Domienik-Karlowicz J. Reshaping medical education: performance of ChatGPT on a PES medical examination. Cardiol J 2024;31:442-450. https://doi.org/10.5603/cj.97517
  • 36. Bicknell BT, Butler D, Whalen S, Ricks J, Dixon CJ, Clark AB, Spaedy O, Skelton A, Edupuganti N, Dzubinski L, Tate H, Dyess G, Lindeman B, Lehmann LS. ChatGPT-4 omni performance in USMLE disciplines and clinical skills: comparative analysis. JMIR Med Educ 2024;10:e63430. https://doi.org/10.2196/63430
  • 37. Garabet R, Mackey BP, Cross J, Weingarten M. ChatGPT-4 performance on USMLE step 1 style questions and its implications for medical education: a comparative study across systems and disciplines. Med Sci Educ 2024;34:145-152. https://doi.org/10.1007/s40670-023-01956-z
  • 38. Mackey BP, Garabet R, Maule L, Tadesse A, Cross J, Weingarten M. Evaluating ChatGPT-4 in medical education: an assessment of subject exam performance reveals limitations in clinical curriculum support for students. Discov Artif Intell 2024;4:38. https://doi.org/10.1007/s44163-024-00135-2
  • 39. Shieh A, Tran B, He G, Kumar M, Freed JA, Majety P. Assessing ChatGPT 4.0’s test performance and clinical diagnostic accuracy on USMLE step 2 CK and clinical case reports. Sci Rep 2024;14:9330. https://doi.org/10.1038/s41598-024-58760-x
  • 40. Fang C, Wu Y, Fu W, Ling J, Wang Y, Liu X, Jiang Y, Wu Y, Chen Y, Zhou J, Zhu Z, Yan Z, Yu P, Liu X. How does ChatGPT-4 preform on non-English national medical licensing examination?: an evaluation in Chinese language. PLOS Digit Health 2023;2:e0000397. https://doi.org/10.1371/journal.pdig.0000397
  • 41. Tong W, Guan Y, Chen J, Huang X, Zhong Y, Zhang C, Zhang H. Artificial intelligence in global health equity: an evaluation and discussion on the application of ChatGPT, in the Chinese National Medical Licensing Examination. Front Med (Lausanne) 2023;10:1237432. https://doi.org/10.3389/fmed.2023.1237432
  • 42. Wang X, Gong Z, Wang G, Jia J, Xu Y, Zhao J, Fan Q, Wu S, Hu W, Li X. ChatGPT performs on the Chinese National Medical Licensing Examination. J Med Syst 2023;47:86. https://doi.org/10.1007/s10916-023-01961-0
  • 43. Zong H, Li J, Wu E, Wu R, Lu J, Shen B. Performance of ChatGPT on Chinese national medical licensing examinations: a five-year examination evaluation study for physicians, pharmacists and nurses. BMC Med Educ 2024;24:143. https://doi.org/10.1186/s12909-024-05125-7
  • 44. Flores-Cohaila JA, Garcia-Vicente A, Vizcarra-Jimenez SF, De la Cruz-Galan JP, Gutierrez-Arratia JD, Quiroga Torres BG, Taype-Rondan A. Performance of ChatGPT on the Peruvian National Licensing Medical Examination: cross-sectional study. JMIR Med Educ 2023;9:e48039. https://doi.org/10.2196/48039
  • 45. Torres-Zegarra BC, Rios-Garcia W, Nana-Cordova AM, Arteaga-Cisneros KF, Chalco XC, Ordonez MA, Rios CJ, Godoy CA, Quezada KL, Gutierrez-Arratia JD, Flores-Cohaila JA. Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study. J Educ Eval Health Prof 2023;20:30. https://doi.org/10.3352/jeehp.2023.20.30
  • 46. Aljindan FK, Al Qurashi AA, Albalawi IA, Alanazi AM, Aljuhani HA, Falah Almutairi F, Aldamigh OA, Halawani IR, K Zino Alarki SM. ChatGPT conquers the Saudi Medical Licensing Exam: exploring the accuracy of artificial intelligence in medical knowledge assessment and implications for modern medical education. Cureus 2023;15:e45043. https://doi.org/10.7759/cureus.45043
  • 47. Meo SA, Al-Masri AA, Alotaibi M, Meo MZ, Meo MO. ChatGPT knowledge evaluation in basic and clinical medical sciences: multiple choice question examination-based performance. Healthcare (Basel) 2023;11:2046. https://doi.org/10.3390/healthcare11142046
  • 48. Huang CH, Hsiao HJ, Yeh PC, Wu KC, Kao CH. Performance of ChatGPT on stage 1 of the Taiwanese medical licensing exam. Digit Health 2024;10:20552076241233144. https://doi.org/10.1177/20552076241233144
  • 49. Lin SY, Chan PK, Hsu WH, Kao CH. Exploring the proficiency of ChatGPT-4: an evaluation of its performance in the Taiwan advanced medical licensing examination. Digit Health 2024;10:20552076241237678. https://doi.org/10.1177/20552076241237678
  • 50. Lai UH, Wu KS, Hsu TY, Kan JK. Evaluating the performance of ChatGPT-4 on the United Kingdom Medical Licensing Assessment. Front Med (Lausanne) 2023;10:1240915. https://doi.org/10.3389/fmed.2023.1240915
  • 51. Morreel S, Verhoeven V, Mathysen D. Microsoft Bing outperforms five other generative artificial intelligence chatbots in the Antwerp University multiple choice medical license exam. PLOS Digit Health 2024;3:e0000349. https://doi.org/10.1371/journal.pdig.0000349
  • 52. Rojas M, Rojas M, Burgess V, Toro-Perez J, Salehi S. Exploring the performance of ChatGPT versions 3.5, 4, and 4 with vision in the Chilean Medical Licensing Examination: observational study. JMIR Med Educ 2024;10:e55048. https://doi.org/10.2196/55048
  • 53. Ebrahimian M, Behnam B, Ghayebi N, Sobhrakhshankhah E. ChatGPT in Iranian Medical Licensing Examination: evaluating the diagnostic accuracy and decision-making capabilities of an AI-based model. BMJ Health Care Inform 2023;30:e100815. https://doi.org/10.1136/bmjhci-2023-100815
  • 54. Li Y, Liu Y, Wang Z, Liang X, Liu L, Wang L, Cui L, Tu Z, Wang L, Zhou L. A comprehensive study of gpt-4v’s multimodal capabilities in medical imaging. medRxiv [Preprint] 2023 Nov 4. https://doi.org/10.1101/2023.11.03.23298067
  • 55. Tarighatnia A, Amanzadeh M, Hamedan M, Mobareke MK, Nader ND. Opportunities and challenges of large language models in medical imaging. Front Biomed Technol 2025;12:205-209. https://doi.org/10.18502/fbt.v12i2.18267
