Abstract
Purpose
- This study systematically evaluates and compares the performance of large language models (LLMs) in answering medical licensing examination questions. By conducting subgroup analyses based on language, question format, and model type, this meta-analysis aims to provide a comprehensive overview of LLM capabilities in medical education and clinical decision-making.
Methods
- This systematic review, registered in PROSPERO and following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, searched MEDLINE (PubMed), Scopus, and Web of Science for relevant articles published up to February 1, 2025. The search strategy included Medical Subject Headings (MeSH) terms and keywords related to (“ChatGPT” OR “GPT” OR “LLM variants”) AND (“medical licensing exam*” OR “medical exam*” OR “medical education” OR “radiology exam*”). Eligible studies evaluated LLM accuracy on medical licensing examination questions. Pooled accuracy was estimated using a random-effects model, with subgroup analyses by LLM type, language, and question format. Publication bias was assessed using Egger’s regression test.
Results
- The search identified 2,404 records. After duplicate removal and title and abstract screening, 127 articles underwent full-text review, of which 36 studies were included. The pooled accuracy was 72% (95% confidence interval, 70.0% to 75.0%), with high heterogeneity (I2=99%, P<0.001). Among LLMs, GPT-4 achieved the highest accuracy (81%), followed by Bing (79%), Claude (74%), Gemini/Bard (70%), and GPT-3.5 (60%) (P=0.001). Performance differences across languages (range, 62% in Polish to 77% in German) were not statistically significant (P=0.170).
Conclusion
- LLMs, particularly GPT-4, can match or exceed medical students’ examination performance and may serve as supportive educational tools. However, due to variability and the risk of errors, they should be used cautiously as complements rather than replacements for traditional learning methods.
Keywords: Artificial intelligence; Large language models; Medical education; Medical examination
Graphical abstract
Introduction
- Background
- Large language models (LLMs) are a category of artificial intelligence (AI) systems that have acquired human-like comprehension and reasoning capabilities through transformer architectures [1,2]. By leveraging deep learning and advanced artificial neural networks, these models can interpret relationships between characters and words and generate coherent text. They contain billions of parameters and are trained on massive datasets, from which they automatically identify complex patterns and relationships [3]. In recent years, LLMs have demonstrated considerable potential across domains including programming, commerce, law, and translation [4-6]. The technology has also garnered substantial attention in medical sciences and healthcare, and the use of these models by students and faculty for health and medical examinations has recently become widespread [7,8]. Such examinations serve as practical benchmarks for assessing model accuracy, and the results can inform educational practice and comparisons with healthcare students [9-11]. In recent years, numerous studies have investigated LLM performance on medical licensing examination questions in various languages and countries [12-18]. Nevertheless, despite the promising potential of these models, their application in healthcare contexts faces notable limitations. Given the sensitivity of the field, incorrect or misleading outputs arising from hallucination, misinterpretation, bias, over-reliance, incomplete training, or lack of transparency continue to hinder full trust and adoption [19-22].
- Objectives
- Although many primary studies have examined the performance of LLMs across different medical examinations, their findings vary widely depending on model version, specialty domain, and question type. These inconsistencies make it difficult to draw a clear conclusion regarding the overall capabilities of LLMs in medical education. Therefore, a systematic review and meta-analysis is warranted. Previous reviews have often focused on a single model—most notably ChatGPT [4,9,23]—without comparing other emerging systems. In contrast, our meta-analysis evaluates multiple LLMs, including ChatGPT (OpenAI), Gemini (Google), Copilot (Microsoft), and Claude (Anthropic). Furthermore, we conducted subgroup analyses based on language, question format, and content type, which have received limited attention in earlier research. Considering these factors, this meta-analysis provides a comprehensive summary of the performance of 4 widely used LLMs across various health-related examinations, without restrictions by country or language. Ultimately, this study explores the capacity of these models to serve as supportive educational tools and offers insights into their potential application in decision-making within medical education.
Methods
- Ethics statement
- This study was based entirely on previously published literature; therefore, ethical approval and informed consent were not required.
- Study design
- This study followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Fig. 1) and was registered in PROSPERO (CRD420251055880).
- Eligibility criteria
- Eligibility criteria for study selection were defined based on the PICOS (population, intervention, comparison, outcome, study design) framework (Table 1).
- Information sources
- A comprehensive search of electronic databases, including MEDLINE (PubMed), Scopus, and Web of Science (WOS), was performed to identify relevant studies published up to February 1, 2025.
- Search strategy
- The search strategy used a combination of Medical Subject Headings (MeSH) terms and relevant keywords, including (“ChatGPT” OR “GPT” OR “Generative Pre-trained Transformer” OR “Gemini” OR “Bard” OR “Claude” OR “Copilot” OR “Bing” OR “large language model*” OR “LLM”) AND (“medical licensing exam*” OR “medical exam*” OR “medical license*” OR “medical education”). The complete search strategy is provided in Supplement 1. All retrieved records were managed in EndNote ver. 20.0 (Clarivate), and duplicates were removed prior to screening.
- Selection process
- Two investigators (M.A. and H.N.) independently and in duplicate screened all studies. Titles and abstracts were reviewed first to identify potentially eligible studies, followed by full-text assessments to confirm inclusion. Disagreements were resolved through discussion and consensus; if consensus was not reached, a third reviewer (M.H.) provided the final decision. The screening process was documented, including the number of articles reviewed and reasons for exclusion at each stage (Supplement 2).
- Data collection process
- Two reviewers (M.A. and H.N.) independently extracted data from all included studies using a standardized data-extraction form in Microsoft Excel (Microsoft Corp.). Any discrepancies were resolved through discussion, with arbitration by a third reviewer (M.H.) when necessary. The extraction form is available in Supplement 3.
- Data items
- The following information was extracted from each study: first author’s name, publication year, country, question language, question source, number of questions, question format, type of LLM used, and LLM accuracy. Accuracy was defined as the percentage of correct responses provided by the LLM out of the total number of examination questions. If accuracy was not explicitly reported, it was calculated by dividing the number of correct answers by the total number of questions. In studies evaluating multiple LLMs, data for each model were collected separately (Dataset 1).
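The accuracy calculation described above can be sketched as follows; the study names and counts in this snippet are illustrative placeholders, not data from the included studies.

```python
# When a study reports only counts, accuracy is (correct answers / total
# questions) x 100; studies evaluating multiple LLMs contribute one record
# per model. Names and numbers below are hypothetical.
records = [
    {"study": "Example et al.", "llm": "GPT-4", "correct": 162, "total": 200},
    {"study": "Example et al.", "llm": "GPT-3.5", "correct": 120, "total": 200},
]
for r in records:
    r["accuracy_pct"] = round(100 * r["correct"] / r["total"], 1)
# Example et al.: GPT-4 -> 81.0%, GPT-3.5 -> 60.0%
```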
- Study risk of bias assessment
- The quality and risk of bias of included studies were evaluated using a modified version of the QUADAS-2 tool (University of Bristol), previously applied in similar research [9]. This adapted framework comprises 21 items across 4 domains: question selection, index model, reference standard, and flow/timing. Two reviewers (M.A. and H.N.) conducted independent assessments, with discrepancies resolved by a third reviewer (M.H.).
- Synthesis methods
- A meta-analysis was performed using a random-effects model to estimate the pooled accuracy of LLMs. Forest plots were generated to visualize overall accuracy along with corresponding 95% confidence intervals (CIs). Heterogeneity across studies was assessed using the I2 statistic, with values greater than 50% indicating substantial heterogeneity. Subgroup analyses were performed based on LLM type (e.g., GPT-4, GPT-3.5, Gemini), question format (text-based or image-based), and question language (English vs. non-English).
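The random-effects pooling and I2 computation described above can be sketched as follows. This is a minimal illustration assuming raw proportions with binomial variances and the DerSimonian-Laird between-study variance estimator; the actual analysis used Stata's meta and metaprop routines, and the accuracies and sample sizes below are illustrative, not study data.

```python
import math

def dersimonian_laird(props, ns):
    """Random-effects pooled proportion (DerSimonian-Laird) with I2."""
    v = [p * (1 - p) / n for p, n in zip(props, ns)]  # binomial variances
    w = [1 / vi for vi in v]                          # fixed-effect weights
    sw = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, props)) / sw
    # Cochran's Q and the DL between-study variance tau^2
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, props))
    df = len(props) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)
    # Random-effects weights incorporate tau^2
    w_star = [1 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_star, props)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), i2

# Illustrative per-study accuracies (proportion correct) and question counts:
accs = [0.68, 0.73, 0.87, 0.90, 0.60]
ns = [333, 300, 220, 750, 750]
pooled, ci, i2 = dersimonian_laird(accs, ns)
```

With widely spread accuracies and large question counts, I2 approaches 100%, mirroring the extreme heterogeneity reported in this review.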
- To explore potential sources of heterogeneity, sensitivity analyses were conducted using the leave-one-out method, recalculating pooled accuracy after sequentially excluding each study. Publication bias was evaluated through visual inspection of funnel-plot asymmetry and statistically tested using Egger’s regression test; a significant result indicated potential publication bias. All statistical analyses were performed using the meta and metaprop packages in Stata ver. 17.0 (Stata Corp.).
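The leave-one-out procedure and Egger's regression described above can be sketched as follows. This is a self-contained illustration: Egger's test regresses the standardized effect on precision and examines the intercept, and the leave-one-out step recomputes a simple inverse-variance pooled estimate after dropping each study. The accuracies and sample sizes are illustrative; the actual analysis was run in Stata.

```python
import math

def egger_test(effects, ses):
    """Egger's regression: standardized effect vs. precision.
    An intercept far from zero suggests small-study effects
    (funnel-plot asymmetry)."""
    z = [e / s for e, s in zip(effects, ses)]   # standardized effects
    x = [1 / s for s in ses]                    # precisions
    k = len(z)
    mx, mz = sum(x) / k, sum(z) / k
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z)) / sxx
    intercept = mz - slope * mx
    rss = sum((zi - intercept - slope * xi) ** 2 for xi, zi in zip(x, z))
    se_int = math.sqrt(rss / (k - 2) * (1 / k + mx ** 2 / sxx))
    t = intercept / se_int  # compare against a t distribution with k-2 df
    return intercept, se_int, t

def leave_one_out(effects, ses):
    """Inverse-variance pooled estimate after dropping each study in turn."""
    pooled = []
    for i in range(len(effects)):
        e = effects[:i] + effects[i + 1:]
        s = ses[:i] + ses[i + 1:]
        w = [1 / si ** 2 for si in s]
        pooled.append(sum(wi * ei for wi, ei in zip(w, e)) / sum(w))
    return pooled

# Illustrative accuracies with binomial standard errors:
accs = [0.68, 0.73, 0.87, 0.90, 0.60, 0.74]
ns = [333, 300, 220, 750, 750, 600]
ses = [math.sqrt(p * (1 - p) / n) for p, n in zip(accs, ns)]
intercept, se_int, t = egger_test(accs, ses)
loo = leave_one_out(accs, ses)
```

A pooled estimate that stays stable across all leave-one-out iterations, together with a non-significant Egger intercept, supports the robustness and absence-of-bias conclusions reported in the Results.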
Results
- Study selection
- In this systematic review, 2,404 studies were identified across multiple databases. After removing duplicates and screening titles and abstracts, 127 articles were selected for full-text review. Following the eligibility assessment, 36 articles met the inclusion criteria and were included in the final analysis. The study selection process is presented in Fig. 1.
- Study characteristics
- The characteristics of the included studies are summarized in Table 2. The studies were published between 2023 and 2024, with 21 appearing in 2024 and 15 in 2023. They originated from various countries, including Germany (n=5) [12,16,24-27], Japan (n=5) [18,28-31], Poland (n=5) [14,32-35], USA (n=4) [36-39], China (n=4) [17,40-43], Peru (n=2) [44,45], Saudi Arabia (n=2) [46,47], Taiwan (n=2) [48,49], Brazil (n=1) [23], UK (n=1) [50], Australia (n=1) [15], Belgium (n=1) [51], Chile (n=1) [52], Iran (n=1) [53], and Spain (n=1) [13]. Regarding the language of the questions, most studies used English (n=11), while others used Chinese (n=6), Japanese (n=5), Polish (n=5), Spanish (n=4), German (n=2), Portuguese (n=1), Arabic (n=1), and Italian (n=1).
- The number of questions per study ranged from 95 to 2,700. Twenty-seven studies used text-based questions exclusively, while 9 combined text and images. In terms of question format, 30 studies employed multiple-choice questions (MCQs) only, 3 used single-choice questions (SCQs), and 3 incorporated both MCQs and SCQs. Further study details are provided in Supplement 4.
- Risk of bias in studies
- Using the QUADAS-2 method, we evaluated study quality and potential bias. Most studies demonstrated a low risk of bias across all domains; however, some uncertainties were noted in the index test and flow/timing domains. Detailed quality assessment results are provided in Supplements 5 and 6.
- Accuracy of LLMs
- Across the included studies, 63 LLM evaluations (model-study pairs) were performed on medical examinations: GPT-4 (n=30), GPT-3.5 (n=23), Bard/Gemini (n=3), Claude (n=3), and Bing (n=4). Reported accuracy ranged from 43% for GPT-3.5 to 90% for GPT-4. The pooled accuracy of all models was 72% (95% CI, 70.0%–75.0%), with significant heterogeneity across studies (I2=99%, P<0.001). A forest plot summarizing pooled accuracy is shown in Fig. 2.
- Subgroup analysis of the accuracy of LLMs
- Subgroup analyses were conducted based on LLM type, question language, and question format (Table 3). By model type, pooled accuracy was highest for GPT-4 (81%), followed by Bing (79%), Claude (74%), Gemini/Bard (70%), and GPT-3.5 (60%), with statistically significant differences among models (P=0.001). When stratified by question format, studies incorporating both text- and image-based questions achieved slightly higher pooled accuracy compared with text-only studies (78% vs. 71%, P=0.03). In terms of question language, pooled LLM accuracy ranged from 62% to 77%, with the highest accuracy observed for German and the lowest for Polish. These differences, however, were not statistically significant (P=0.17). Detailed subgroup results and corresponding forest plots are provided in Supplements 7–11.
- Sensitivity analysis and publication bias
- A sensitivity analysis using the leave-one-out approach was conducted to explore potential sources of heterogeneity; excluding individual studies did not meaningfully alter the pooled accuracy (Supplement 12). Publication bias was assessed through funnel-plot visualization and Egger’s regression test (Supplement 13). Visual inspection revealed no substantial asymmetry, and Egger’s test showed no evidence of small-study effects (P=0.75), supporting the absence of publication bias.
Discussion
- In this systematic review and meta-analysis, we examined 36 peer-reviewed studies comprising 63 LLM evaluations on medical licensing examination questions, spanning diverse languages, question formats, and examination contexts. The overall pooled accuracy was 72%, reflecting generally strong performance. Among the LLMs assessed, GPT-4, GPT-3.5, Google Gemini/Bard, and Microsoft Bing Chat were the most frequently evaluated. GPT-4 demonstrated the highest pooled accuracy (81%), exceeding Bing (79%), Claude (74%), Gemini/Bard (70%), and GPT-3.5 (60%), with statistically significant differences among models (P=0.001). A recent study by Liu et al. [9] reported pooled accuracies of 81% for GPT-4 and 58% for GPT-3, findings consistent with our results. In another study, Waldock et al. [10] found a pooled accuracy of 61% for all LLMs and 64% for ChatGPT. The discrepancy may reflect both temporal improvements in model performance and methodological differences: Waldock et al. [10] combined GPT-3.5 and GPT-4 in their estimates, whereas our analysis reports each model separately. The observed heterogeneity (I2=99%) underscores the variation in study design, question type, language, and model selection, reflecting real-world complexity but complicating direct comparisons.
- Subgroup analyses confirmed that model type substantially influences performance, with GPT-4 consistently surpassing earlier models. This trend aligns with ongoing advances in natural language processing and highlights the importance of continuous model refinement for specialized medical applications. Our results support previous findings demonstrating GPT-4’s superior performance in clinical reasoning, discipline-specific tasks, and general test-taking ability [9]. For example, studies in ophthalmology and general medicine have shown that GPT-4 performs at or above the level of medical students when responding to board-style questions [11]. These findings collectively suggest that model architecture, training data scope, and domain-specific tuning play critical roles in enhancing diagnostic and interpretive accuracy in medical contexts.
- The subgroup analysis further revealed that incorporating image-based content significantly improved pooled accuracy—78% compared with 71% for text-only formats (P=0.030). This improvement aligns with studies showing that multimodal prompting, particularly in GPT-4, enhances diagnostic reasoning and clinical comprehension [54]. Although LLM performance varied across languages—from 62% in Polish to 77% in German—these differences were not statistically significant, contrasting with earlier reports that emphasized language-related disparities [21,55]. The convergence observed here may reflect improved multilingual training and cross-lingual generalization in recent model architectures, especially GPT-4.
- The remarkable accuracy of models such as GPT-4 and Bing underscores their potential as valuable tools in medical education, particularly in resource-limited settings where access to expert instructors is constrained. Their consistent accuracy across languages and question types also suggests the feasibility of global deployment. Furthermore, their ability to process image-based questions highlights new opportunities for training in clinically relevant skills such as radiologic interpretation. Despite these encouraging results, substantial heterogeneity (I2>95%) indicates that variations in study design, evaluation criteria, and question sources remain influential. Differences in model versions (e.g., GPT-3.5 vs. GPT-4), question difficulty, and dataset size likely contributed to this variability. The inclusion of both official examination questions and public question banks may also account for part of the heterogeneity observed. These methodological differences should therefore be considered when interpreting pooled estimates.
- This study has several limitations. First, the high heterogeneity among studies—likely due to differences in question sources, languages, test formats, and study designs—limits the generalizability of the results. Second, a substantial proportion of the reviewed studies focused on OpenAI models (GPT-3.5 and GPT-4), which may have introduced bias into the analysis. Third, detailed information on prompt design, question difficulty, or the specific model versions used was not available in all studies. Furthermore, because of continuous model updates and new releases, the performance reported at a given point in time may not accurately reflect the future performance of these models. Another limitation is the potential overlap between examination questions and the training data of LLMs. As many of the question banks used in the included studies were publicly available, it is possible that some items were encountered by the models during pretraining, which could have slightly inflated performance estimates. However, our subgroup analysis based on question source showed only a minimal difference in accuracy (71% for internal exams vs. 73% for public databases), suggesting that this effect was limited. Nonetheless, potential data contamination should be taken into account when interpreting the results, particularly for widely trained models such as GPT-4.
- Future research should focus on improving LLM architectures for medical reasoning, exploring their ability to directly interpret imaging data, and evaluating their impact on real-world diagnostic workflows. In addition, longitudinal studies examining the integration of LLMs into medical curricula—particularly in simulation-based training and formative assessments—could provide valuable frameworks for advancing the practical application of this technology in medical education. Ultimately, while current LLMs show considerable promise, their integration into medical education and practice must proceed with careful awareness of their limitations, ensuring that AI supports rather than replaces expert human judgment. Ethical considerations and potential challenges associated with these models must also be carefully addressed.
Conclusion
- This systematic review and meta-analysis highlight the promising performance of LLMs, particularly GPT-4, in answering medical licensing examination questions. Our findings indicate that LLMs hold significant potential to enhance medical education, assessment, and clinical decision-making. Despite substantial progress, the high heterogeneity among studies and the intrinsic limitations of LLMs suggest that these technologies are not yet ready to replace traditional educational resources or to be used independently in formal assessments. Therefore, LLMs should be regarded as supportive educational tools that complement conventional learning methods, guided by ethical principles and scientific standards. Future efforts should aim to develop standardized assessment protocols and conduct comparative research in authentic educational and clinical settings to ensure the safe and effective implementation of these technologies.
Authors’ contributions
Conceptualization: MA, HN, AM, MH, AA, AM. Data curation: MA, HN. Methodology/formal analysis/validation: MA, HN, MH. Project administration: none. Funding acquisition: none. Writing–original draft: MA. Writing–review and editing: HN, AM, MH, AA, AM.
Conflict of interest
No potential conflict of interest relevant to this article was reported.
Funding
None.
Data availability
Data files are available from https://doi.org/10.7910/DVN/KMHVAF
Dataset 1. Extracted data for meta-analysis.
jeehp-22-36-dataset1.xlsx
Acknowledgments
None.
Supplementary materials
Supplement files are available from https://doi.org/10.7910/DVN/KMHVAF
Fig. 1. PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram of study selection. LLM, large language model.
Fig. 2. Forest plot for the pooled accuracy of large language models (LLMs). CI, confidence interval.
Table 1. Inclusion and exclusion criteria based on the PICOS framework

| PICOS | Inclusion criteria | Exclusion criteria |
|---|---|---|
| Population (P) | Studies assessing the accuracy of LLMs in answering medical license examination questions | Studies not involving answers to medical license examination questions |
| Intervention (I) | Studies that use LLMs (e.g., GPT-4, Gemini, Claude) to answer questions | Research that does not use LLMs |
| Comparator (C) | Studies evaluating the accuracy of LLMs against human performance (such as medical students) or other AI models | - |
| Outcome (O) | Studies reporting the accuracy of LLMs, or providing data for calculating accuracy (number of correct answers and total number of questions) | Studies that do not report accuracy or lack sufficient data for metric calculation |
| Study design (S) | Original peer-reviewed studies, published in English, that examine the performance of LLMs in medical license examinations, regardless of the language of the examination questions | Reviews, editorials, commentaries, letters to the editor, case series, case reports, conference abstracts, and preprints |
Table 2. Characteristics of the included studies

| Study (year) | LLMs | Country | Language | Question type | Question format | No. of questions | ACC (%) |
|---|---|---|---|---|---|---|---|
| Rodrigues Alessi et al. [23] (2024) | GPT-3.5 | Brazil | Portuguese | MCQs | Text | 333 | 68 |
| Alfertshofer et al. [24] (2024) | GPT-4 | Germany | Italian | MCQs | Text | 300 | 73 |
| Aljindan et al. [46] (2023) | GPT-4 | Saudi Arabia | Arabic | MCQs | Text | 220 | 87 |
| Bicknell et al. [36] (2024) | GPT-4 | USA | English | MCQs | Text | 750 | 90 |
| | GPT-3.5 | | | | | | 60 |
| Ebrahimian et al. [53] (2023) | GPT-4 | Iran | English | MCQs | Text | 200 | 69 |
| Fang et al. [40] (2023) | GPT-4 | China | Chinese | MCQs | Text | 600 | 74 |
| Flores-Cohaila et al. [44] (2023) | GPT-4 | Peru | Spanish | MCQs | Text | 180 | 86 |
| | GPT-3.5 | | | | | | 77 |
| Funk et al. [12] (2024) | GPT-3.5 | Germany | English | MCQs | Text | 2,700 | 58 |
| | GPT-4 | | | | | | 86 |
| Garabet et al. [37] (2024) | GPT-4 | USA | English | MCQs | Text | 1,300 | 86 |
| Guillen-Grima et al. [13] (2023) | GPT-3.5 | Spain | Spanish | MCQs | Text-image | 182 | 63 |
| | GPT-4 | | | | | | 87 |
| Haze et al. [28] (2023) | GPT-4 | Japan | Japanese | MCQs–SCQs | Text | 861 | 81 |
| | GPT-3.5 | | | | | | 56 |
| Huang et al. [48] (2024) | GPT-4 | Taiwan | Chinese | MCQs | Text-image | 600 | 88 |
| Jaworski et al. [14] (2024) | GPT-3.5 | Poland | Polish | MCQs | Text | 196 | 50 |
| | GPT-4 | | | | | | 78 |
| Kleinig et al. [15] (2023) | GPT-4 | Australia | English | MCQs | Text | 150 | 79 |
| | GPT-3.5 | | | | | | 66 |
| | Bing | | | | | | 72 |
| Knoedler et al. [25] (2024) | GPT-4 | Germany | English | MCQs | Text-image | 299 | 85 |
| | GPT-3.5 | | | | Text | 1,840 | 57 |
| Kufel et al. [32] (2024) | GPT-3.5 | Poland | Polish | MCQs–SCQs | Text | 117 | 56 |
| Lai et al. [50] (2023) | GPT-4 | UK | English | SCQs | Text | 573 | 76 |
| Lin et al. [49] (2024) | GPT-4 | Taiwan | Chinese | SCQs | Text-image | 1,280 | 82 |
| Liu et al. [9] (2024) | GPT-4 | Japan | Japanese | MCQs | Text-image | 790 | 89 |
| | Gemini | | | | | | 80 |
| | Claude | | | | | | 82 |
| Mackey et al. [38] (2024) | GPT-4 | USA | English | MCQs | Text | 900 | 89 |
| Meo et al. [47] (2023) | GPT-3.5 | Saudi Arabia | English | MCQs | Text | 100 | 72 |
| Meyer et al. [16] (2024) | GPT-3.5 | Germany | German | MCQs | Text-image | 937 | 58 |
| | GPT-4 | | | | | | 85 |
| Ming et al. [17] (2024) | GPT-3.5 | China | Chinese | MCQs–SCQs | Text | 600 | 54 |
| | GPT-4 | | | | | | 73 |
| Morreel et al. [51] (2024) | GPT-3.5 | Belgium | English | MCQs | Text | 95 | 67 |
| | GPT-4 | | | | | | 76 |
| | Claude | | | | | | 67 |
| | Bing | | | | | | 76 |
| | Bard | | | | | | 62 |
| Nakao et al. [30] (2024) | GPT-4 | Japan | Japanese | MCQs | Text-image | 108 | 68 |
| Rojas et al. [52] (2024) | GPT-3.5 | Chile | Spanish | MCQs | Text-image | 540 | 57 |
| | GPT-4 | | | | | | 79 |
| Roos et al. [27] (2023) | GPT-4 | Germany | German | MCQs | Text-image | 630 | 88 |
| | Bing | | | | | | 86 |
| | GPT-3.5 | | | | | | 66 |
| Shieh et al. [39] (2024) | GPT-3.5 | USA | English | MCQs | Text | 109 | 48 |
| | GPT-4 | | | | | | 78 |
| Siebielec et al. [33] (2024) | GPT-3.5 | Poland | Polish | MCQs | Text | 980 | 60 |
| Suwała et al. [34] (2024) | GPT-3.5 | Poland | Polish | MCQs | Text | 2,138 | 59 |
| Tanaka et al. [18] (2024) | GPT-3.5 | Japan | Japanese | MCQs | Text | 397 | 63 |
| | GPT-4 | | | | | | 81 |
| Tong et al. [41] (2023) | GPT-4 | China | Chinese | SCQs | Text | 160 | 81 |
| Torres-Zegarra et al. [45] (2023) | GPT-3.5 | Peru | Spanish | MCQs | Text | 180 | 69 |
| | GPT-4 | | | | | | 87 |
| | Bing | | | | | | 82 |
| | Bard | | | | | | 69 |
| | Claude | | | | | | 72 |
| Wojcik et al. [35] (2024) | GPT-4 | Poland | Polish | MCQs | Text | 120 | 67 |
| Yanagita et al. [31] (2023) | GPT-3.5 | Japan | Japanese | MCQs | Text | 292 | 43 |
| | GPT-4 | | | | | | 81 |
| Zong et al. [43] (2024) | GPT-3.5 | China | Chinese | MCQs | Text | 600 | 60 |

LLM, large language model; ACC, accuracy; MCQs, multiple-choice questions; SCQs, single-choice questions.
Table 3. The results of subgroup analysis

| Subgroup | No. (%) | Accuracy (%) | 95% CI (%) | I2 (%) | P-value | Test of group differences (P) |
|---|---|---|---|---|---|---|
| Overall | 63 (100) | 72 | 70.0–75.0 | 99 | 0.001 | - |
| LLM type | | | | | | 0.001 |
| GPT-4 | 30 (48) | 81 | 79.0–83.0 | 99 | 0.001 | |
| Bing | 4 (6) | 79 | 73.0–85.0 | 99 | 0.001 | |
| GPT-3.5 | 23 (36) | 60 | 57.0–63.0 | 99 | 0.001 | |
| Claude | 3 (5) | 74 | 65.0–82.0 | 99 | 0.001 | |
| Gemini/Bard | 3 (5) | 70 | 60.0–81.0 | 99 | 0.001 | |
| Question language | | | | | | 0.17 |
| English | 21 (33) | 72 | 64.0–77.0 | 99 | 0.001 | |
| Chinese | 7 (11) | 73 | 64.0–82.0 | 99 | 0.001 | |
| German | 5 (8) | 77 | 65.0–89.0 | 99 | 0.001 | |
| Japanese | 10 (16) | 72 | 63.0–81.0 | 99 | 0.001 | |
| Polish | 6 (10) | 62 | 54.0–69.0 | 99 | 0.001 | |
| Spanish | 11 (17) | 75 | 69.0–81.0 | 99 | 0.001 | |
| Other | 3 (5) | 76 | 65.0–87.0 | 99 | 0.001 | |
| Question format | | | | | | 0.03 |
| Text | 47 (75) | 71 | 67.0–74.0 | 99 | 0.001 | |
| Text and image | 16 (25) | 78 | 72.0–83.0 | 99 | 0.001 | |
| Question type | | | | | | 0.001 |
| MCQ | 55 (87) | 73 | 70.0–76.0 | 99 | 0.001 | |
| MCQ and SCQ | 5 (8) | 64 | 53.0–75.0 | 99 | 0.001 | |
| SCQ | 3 (5) | 72 | 70.0–75.0 | 99 | 0.001 | |
| Source of questions | | | | | | 0.63 |
| Public database | 54 (86) | 73 | 69.0–76.0 | 99 | 0.001 | |
| Internal exam | 9 (14) | 71 | 65.0–77.0 | 99 | 0.001 | |

CI, confidence interval; LLM, large language model; MCQ, multiple-choice question; SCQ, single-choice question.
References
- 1. Kasneci E, Seßler K, Kuchemann S, Bannert M, Dementieva D, Fischer F, Gasser U, Groh G, Gunnemann S, Hullermeier E, Krusche S. ChatGPT for good?: on opportunities and challenges of large language models for education. Learn Individ Differ 2023;103:102274. https://doi.org/10.1016/j.lindif.2023.102274
- 2. Shool S, Adimi S, Saboori Amleshi R, Bitaraf E, Golpira R, Tara M. A systematic review of large language model (LLM) evaluations in clinical medicine. BMC Med Inform Decis Mak 2025;25:117. https://doi.org/10.1186/s12911-025-02954-4
- 3. Toloka Team. The history, timeline, and future of LLMs: Essential ML Guide [Internet]. Toloka Team; 2023 [cited 2025 Sep 10]. Available from: https://toloka.ai/blog/history-of-llms/
- 4. Jin HK, Lee HE, Kim E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC Med Educ 2024;24:1013. https://doi.org/10.1186/s12909-024-05944-8
- 5. Okonkwo CW, Ade-Ibijola A. Chatbots applications in education: a systematic review. Comput Educ Artif Intell 2021;2:100033. https://doi.org/10.1016/j.caeai.2021.100033
- 6. Ray PP. ChatGPT: a comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet Things Cyber-Phys Syst 2023;3:121-154. https://doi.org/10.1016/j.iotcps.2023.04.003
- 7. Lee H. The rise of ChatGPT: exploring its potential in medical education. Anat Sci Educ 2024;17:926-931. https://doi.org/10.1002/ase.2270
- 8. Telenti A, Auli M, Hie BL, Maher C, Saria S, Ioannidis JP. Large language models for science and medicine. Eur J Clin Invest 2024;54:e14183. https://doi.org/10.1111/eci.14183
- 9. Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, Kiuchi T. Performance of ChatGPT across different versions in medical licensing examinations worldwide: systematic review and meta-analysis. J Med Internet Res 2024;26:e60807. https://doi.org/10.2196/60807
- 10. Waldock WJ, Zhang J, Guni A, Nabeel A, Darzi A, Ashrafian H. The accuracy and capability of artificial intelligence solutions in health care examinations and certificates: systematic review and meta-analysis. J Med Internet Res 2024;26:e56532. https://doi.org/10.2196/56532
- 11. Wu JH, Nishida T, Liu TY. Accuracy of large language models in answering ophthalmology board-style questions: a meta-analysis. Asia Pac J Ophthalmol (Phila) 2024;13:100106. https://doi.org/10.1016/j.apjo.2024.100106
- 12. Funk PF, Hoch CC, Knoedler S, Knoedler L, Cotofana S, Sofo G, Bashiri Dezfouli A, Wollenberg B, Guntinas-Lichius O, Alfertshofer M. ChatGPT’s response consistency: a study on repeated queries of medical examination questions. Eur J Investig Health Psychol Educ 2024;14:657-668. https://doi.org/10.3390/ejihpe14030043
- 13. Guillen-Grima F, Guillen-Aguinaga S, Guillen-Aguinaga L, Alas-Brun R, Onambele L, Ortega W, Montejo R, Aguinaga-Ontoso E, Barach P, Aguinaga-Ontoso I. Evaluating the efficacy of ChatGPT in navigating the Spanish Medical Residency Entrance Examination (MIR): promising horizons for AI in clinical medicine. Clin Pract 2023;13:1460-1487. https://doi.org/10.3390/clinpract13060130
- 14. Jaworski A, Jasinski D, Jaworski W, Hop A, Janek A, Slawinska B, Konieczniak L, Rzepka M, Jung M, Syslo O, Jarzabek V, Blecha Z, Harazinski K, Jasinska N. Comparison of the performance of artificial intelligence versus medical professionals in the Polish Final Medical Examination. Cureus 2024;16:e66011. https://doi.org/10.7759/cureus.66011
- 15. Kleinig O, Gao C, Bacchi S. This too shall pass: the performance of ChatGPT-3.5, ChatGPT-4 and New Bing in an Australian medical licensing examination. Med J Aust 2023;219:237. https://doi.org/10.5694/mja2.52061
- 16. Meyer A, Riese J, Streichert T. Comparison of the performance of GPT-3.5 and GPT-4 with that of medical students on the Written German Medical Licensing Examination: observational study. JMIR Med Educ 2024;10:e50965. https://doi.org/10.2196/50965
- 17. Ming S, Guo Q, Cheng W, Lei B. Influence of model evolution and system roles on ChatGPT’s performance in Chinese Medical Licensing Exams: comparative study. JMIR Med Educ 2024;10:e52784. https://doi.org/10.2196/52784
- 18. Tanaka Y, Nakata T, Aiga K, Etani T, Muramatsu R, Katagiri S, Kawai H, Higashino F, Enomoto M, Noda M, Kometani M, Takamura M, Yoneda T, Kakizaki H, Nomura A. Performance of generative pretrained transformer on the National Medical Licensing Examination in Japan. PLOS Digit Health 2024;3:e0000433. https://doi.org/10.1371/journal.pdig.0000433
- 19. Briganti G. A clinician’s guide to large language models. Future Med AI 2023;1:FMAI1. https://doi.org/10.2217/fmai-2023-0003
- 20. Jiang Y, Qiu R, Zhang Y, Zhang PF. Balanced and explainable social media analysis for public health with large language models. In: Bao Z, Borovica-Gajic R, Qiu R, Choudhury F, Yang Z, editors. Databases theory and applications. Proceedings of the 34th Australasian Database Conference, ADC 2023; 2023 Nov 1-3; Melbourne, Australia. Springer-Verlag; 2023. p. 73-86. https://doi.org/10.1007/978-3-031-47843-7_6
- 21. Mahdavi A, Amanzadeh M, Hamedan M. The role of large language models in modern medical education: opportunities and challenges. Shiraz E-Med J 2024;25:e144847. https://doi.org/10.5812/semj-144847
- 22. Nazi ZA, Peng W. Large language models in healthcare and medical domain: a review. Informatics 2024;11:57. https://doi.org/10.3390/informatics11030057
- 23. Rodrigues Alessi M, Gomes HA, Lopes de Castro M, Terumy Okamoto C. Performance of ChatGPT in solving questions from the Progress Test (Brazilian National Medical Exam): a potential artificial intelligence tool in medical practice. Cureus 2024;16:e64924. https://doi.org/10.7759/cureus.64924
- 24. Alfertshofer M, Hoch CC, Funk PF, Hollmann K, Wollenberg B, Knoedler S, Knoedler L. Sailing the seven seas: a multinational comparison of ChatGPT’s performance on medical licensing examinations. Ann Biomed Eng 2024;52:1542-1545. https://doi.org/10.1007/s10439-023-03338-3
- 25. Knoedler L, Alfertshofer M, Knoedler S, Hoch CC, Funk PF, Cotofana S, Maheta B, Frank K, Brebant V, Prantl L, Lamby P. Pure wisdom or Potemkin villages?: a comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE step 3 style questions: quantitative analysis. JMIR Med Educ 2024;10:e51148. https://doi.org/10.2196/51148
- 26. Knoedler L, Knoedler S, Hoch CC, Prantl L, Frank K, Soiderer L, Cotofana S, Dorafshar AH, Schenck T, Vollbach F, Sofo G, Alfertshofer M. In-depth analysis of ChatGPT’s performance based on specific signaling words and phrases in the question stem of 2377 USMLE step 1 style questions. Sci Rep 2024;14:13553. https://doi.org/10.1038/s41598-024-63997-7 ArticlePubMedPMC
- 27. Roos J, Kasapovic A, Jansen T, Kaczmarczyk R. Artificial intelligence in medical education: comparative analysis of ChatGPT, Bing, and medical students in Germany. JMIR Med Educ 2023;9:e46482. https://doi.org/10.2196/46482 ArticlePubMedPMC
- 28. Haze T, Kawano R, Takase H, Suzuki S, Hirawa N, Tamura K. Influence on the accuracy in ChatGPT: differences in the amount of information per medical field. Int J Med Inform 2023;180:105283. https://doi.org/10.1016/j.ijmedinf.2023.105283 ArticlePubMed
- 29. Liu M, Okuhara T, Dai Z, Huang W, Gu L, Okada H, Furukawa E, Kiuchi T. Evaluating the effectiveness of advanced large language models in medical knowledge: a comparative study using Japanese national medical examination. Int J Med Inform 2025;193:105673. https://doi.org/10.1016/j.ijmedinf.2024.105673 ArticlePubMed
- 30. Nakao T, Miki S, Nakamura Y, Kikuchi T, Nomura Y, Hanaoka S, Yoshikawa T, Abe O. Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: evaluation study. JMIR Med Educ 2024;10:e54393. https://doi.org/10.2196/54393 ArticlePubMedPMC
- 31. Yanagita Y, Yokokawa D, Uchida S, Tawara J, Ikusaka M. Accuracy of ChatGPT on medical questions in the National Medical Licensing Examination in Japan: evaluation study. JMIR Form Res 2023;7:e48023. https://doi.org/10.2196/48023 ArticlePubMedPMC
- 32. Kufel J, Bielowka M, Rojek M, Mitrega A, Czogalik L, Kaczynska D, Kondol D, Palkij K, Mielcarska S. Assessing ChatGPT’s performance in national nuclear medicine specialty examination: an evaluative analysis. Iran J Nucl Med 2024;32:60-65. https://doi.org/10.22034/IRJNM.2023.129434.1580 Article
- 33. Siebielec J, Ordak M, Oskroba A, Dworakowska A, Bujalska-Zadrozny M. Assessment study of ChatGPT-3.5’s performance on the final Polish Medical Examination: accuracy in answering 980 questions. Healthcare (Basel) 2024;12:1637. https://doi.org/10.3390/healthcare12161637 ArticlePubMedPMC
- 34. Suwala S, Szulc P, Guzowski C, Kaminska B, Dorobiała J, Wojciechowska K, Berska M, Kubicka O, Kosturkiewicz O, Kosztulska B, Rajewska A, Junik R. ChatGPT-3.5 passes Poland’s medical final examination: is it possible for ChatGPT to become a doctor in Poland? SAGE Open Med 2024;12:20503121241257777. https://doi.org/10.1177/20503121241257777 ArticlePubMedPMC
- 35. Wojcik S, Rulkiewicz A, Pruszczyk P, Lisik W, Pobozy M, Domienik-Karlowicz J. Reshaping medical education: performance of ChatGPT on a PES medical examination. Cardiol J 2024;31:442-450. https://doi.org/10.5603/cj.97517 ArticlePubMedPMC
- 36. Bicknell BT, Butler D, Whalen S, Ricks J, Dixon CJ, Clark AB, Spaedy O, Skelton A, Edupuganti N, Dzubinski L, Tate H, Dyess G, Lindeman B, Lehmann LS. ChatGPT-4 omni performance in USMLE disciplines and clinical skills: comparative analysis. JMIR Med Educ 2024;10:e63430. https://doi.org/10.2196/63430 ArticlePubMedPMC
- 37. Garabet R, Mackey BP, Cross J, Weingarten M. ChatGPT-4 performance on USMLE step 1 style questions and its implications for medical education: a comparative study across systems and disciplines. Med Sci Educ 2024;34:145-152. https://doi.org/10.1007/s40670-023-01956-z ArticlePubMed
- 38. Mackey BP, Garabet R, Maule L, Tadesse A, Cross J, Weingarten M. Evaluating ChatGPT-4 in medical education: an assessment of subject exam performance reveals limitations in clinical curriculum support for students. Discov Artif Intell 2024;4:38. https://doi.org/10.1007/s44163-024-00135-2 Article
- 39. Shieh A, Tran B, He G, Kumar M, Freed JA, Majety P. Assessing ChatGPT 4.0’s test performance and clinical diagnostic accuracy on USMLE step 2 CK and clinical case reports. Sci Rep 2024;14:9330. https://doi.org/10.1038/s41598-024-58760-x ArticlePubMedPMC
- 40. Fang C, Wu Y, Fu W, Ling J, Wang Y, Liu X, Jiang Y, Wu Y, Chen Y, Zhou J, Zhu Z, Yan Z, Yu P, Liu X. How does ChatGPT-4 preform on non-English national medical licensing examination?: an evaluation in Chinese language. PLOS Digit Health 2023;2:e0000397. https://doi.org/10.1371/journal.pdig.0000397 ArticlePubMedPMC
- 41. Tong W, Guan Y, Chen J, Huang X, Zhong Y, Zhang C, Zhang H. Artificial intelligence in global health equity: an evaluation and discussion on the application of ChatGPT, in the Chinese National Medical Licensing Examination. Front Med (Lausanne) 2023;10:1237432. https://doi.org/10.3389/fmed.2023.1237432 ArticlePubMedPMC
- 42. Wang X, Gong Z, Wang G, Jia J, Xu Y, Zhao J, Fan Q, Wu S, Hu W, Li X. ChatGPT performs on the Chinese National Medical Licensing Examination. J Med Syst 2023;47:86. https://doi.org/10.1007/s10916-023-01961-0 ArticlePubMed
- 43. Zong H, Li J, Wu E, Wu R, Lu J, Shen B. Performance of ChatGPT on Chinese national medical licensing examinations: a five-year examination evaluation study for physicians, pharmacists and nurses. BMC Med Educ 2024;24:143. https://doi.org/10.1186/s12909-024-05125-7 ArticlePubMedPMC
- 44. Flores-Cohaila JA, Garcia-Vicente A, Vizcarra-Jimenez SF, De la Cruz-Galan JP, Gutierrez-Arratia JD, Quiroga Torres BG, Taype-Rondan A. Performance of ChatGPT on the Peruvian National Licensing Medical Examination: cross-sectional study. JMIR Med Educ 2023;9:e48039. https://doi.org/10.2196/48039 ArticlePubMedPMC
- 45. Torres-Zegarra BC, Rios-Garcia W, Nana-Cordova AM, Arteaga-Cisneros KF, Chalco XC, Ordonez MA, Rios CJ, Godoy CA, Quezada KL, Gutierrez-Arratia JD, Flores-Cohaila JA. Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study. J Educ Eval Health Prof 2023;20:30. https://doi.org/10.3352/jeehp.2023.20.30 ArticlePubMedPMC
- 46. Aljindan FK, Al Qurashi AA, Albalawi IA, Alanazi AM, Aljuhani HA, Falah Almutairi F, Aldamigh OA, Halawani IR, K Zino Alarki SM. ChatGPT conquers the Saudi Medical Licensing Exam: exploring the accuracy of artificial intelligence in medical knowledge assessment and implications for modern medical education. Cureus 2023;15:e45043. https://doi.org/10.7759/cureus.45043 ArticlePubMedPMC
- 47. Meo SA, Al-Masri AA, Alotaibi M, Meo MZ, Meo MO. ChatGPT knowledge evaluation in basic and clinical medical sciences: multiple choice question examination-based performance. Healthcare (Basel) 2023;11:2046. https://doi.org/10.3390/healthcare11142046 ArticlePubMedPMC
- 48. Huang CH, Hsiao HJ, Yeh PC, Wu KC, Kao CH. Performance of ChatGPT on stage 1 of the Taiwanese medical licensing exam. Digit Health 2024;10:20552076241233144. https://doi.org/10.1177/20552076241233144 ArticlePMC
- 49. Lin SY, Chan PK, Hsu WH, Kao CH. Exploring the proficiency of ChatGPT-4: an evaluation of its performance in the Taiwan advanced medical licensing examination. Digit Health 2024;10:20552076241237678. https://doi.org/10.1177/20552076241237678 ArticlePubMedPMC
- 50. Lai UH, Wu KS, Hsu TY, Kan JK. Evaluating the performance of ChatGPT-4 on the United Kingdom Medical Licensing Assessment. Front Med (Lausanne) 2023;10:1240915. https://doi.org/10.3389/fmed.2023.1240915 ArticlePubMedPMC
- 51. Morreel S, Verhoeven V, Mathysen D. Microsoft Bing outperforms five other generative artificial intelligence chatbots in the Antwerp University multiple choice medical license exam. PLOS Digit Health 2024;3:e0000349. https://doi.org/10.1371/journal.pdig.0000349 ArticlePubMedPMC
- 52. Rojas M, Rojas M, Burgess V, Toro-Perez J, Salehi S. Exploring the performance of ChatGPT versions 3.5, 4, and 4 with vision in the Chilean Medical Licensing Examination: observational study. JMIR Med Educ 2024;10:e55048. https://doi.org/10.2196/55048 ArticlePubMedPMC
- 53. Ebrahimian M, Behnam B, Ghayebi N, Sobhrakhshankhah E. ChatGPT in Iranian Medical Licensing Examination: evaluating the diagnostic accuracy and decision-making capabilities of an AI-based model. BMJ Health Care Inform 2023;30:e100815. https://doi.org/10.1136/bmjhci-2023-100815 Article
- 54. Li Y, Liu Y, Wang Z, Liang X, Liu L, Wang L, Cui L, Tu Z, Wang L, Zhou L. A comprehensive study of GPT-4V’s multimodal capabilities in medical imaging. medRxiv [Preprint] 2023 Nov 4. https://doi.org/10.1101/2023.11.03.23298067
- 55. Tarighatnia A, Amanzadeh M, Hamedan M, Mobareke MK, Nader ND. Opportunities and challenges of large language models in medical imaging. Front Biomed Technol 2025;12:205-209. https://doi.org/10.18502/fbt.v12i2.18267