Abstract
Purpose
- This study aims to evaluate the performance of Chat Generative Pre-Trained Transformer 4 (ChatGPT-4) on the French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination and to assess its role as a supplementary resource in helping residents prepare for the qualification examination in plastic surgery.
Methods
- This descriptive study evaluated ChatGPT-4’s performance on 213 items from the October 2024 French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination. Independent reviewers assessed each response for accuracy, logical reasoning, and use of internal and external information, and categorized incorrect answers by type of fallacy. Statistical significance was assessed with the chi-square test and Fisher’s exact test.
Results
- ChatGPT-4 answered all questions across the 10 modules, achieving an overall accuracy rate of 77.5%. The model applied logical reasoning in 98.1% of the questions, utilized internal information in 94.4%, and incorporated external information in 91.1%.
Conclusion
- ChatGPT-4 performs satisfactorily on the French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination. Its accuracy met the minimum passing standards for the exam. While responses generally align with expected knowledge, careful verification remains necessary, particularly for questions involving image interpretation. As artificial intelligence continues to evolve, ChatGPT-4 is expected to become an increasingly reliable tool for medical education. At present, it remains a valuable resource for assisting plastic surgery residents in their training.
Keywords: Artificial intelligence; Medical education; Internship and residency; Plastic surgery; Software; France
Introduction
- Background
- Large language models (LLMs) have demonstrated remarkable abilities in answering complex medical questions, summarizing literature [1], and performing standardized medical exams across various specialties [2,3]. Several studies have evaluated LLM performance on national medical licensing exams, such as the United States Medical Licensing Examination [4], with varying degrees of accuracy. Artificial intelligence (AI)-powered tools are increasingly being integrated into medical education and clinical decision-making, offering potential benefits such as enhanced learning, rapid information retrieval, and assistance with exam preparation. Among these models, Chat Generative Pre-Trained Transformer (ChatGPT), developed by OpenAI, stands out as an advanced LLM capable of generating human-like responses conversationally by predicting word sequences based on prior input. One of its latest iterations, ChatGPT-4, represents a significant improvement over previous versions, particularly in accuracy and contextual understanding, with added multimodal capabilities such as image analysis [5]. At the time of this study, ChatGPT-4 had been trained on publicly available internet data up to September 2023.
- Over the past 3 years, researchers have evaluated ChatGPT’s medical knowledge using licensing examinations from various countries and across different specialties [6]. The specificity and complexity of plastic surgery board exams, especially in non-English contexts such as the French system, pose unique challenges for AI models because they often involve highly technical knowledge and case-based scenarios requiring deep domain expertise and understanding.
- Objectives
- This study aims to evaluate ChatGPT-4’s performance on the French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination. This assessment may help define its potential as an interactive educational tool to support French residents’ learning and exam preparation.
Methods
- Ethics statement
- No human subjects or healthcare data were used in this investigation, and ethical approval was not required.
- Study design
- We conducted a descriptive study evaluating the performance of ChatGPT-4 on items from the French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination.
- Setting
- We evaluated questions from the October 2024 French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination. The exam was divided into 10 modules: anatomy, head and neck, general principles, pediatrics, thorax and abdomen, upper limb, lower limb, oncology, aesthetic surgery, and medico-legal aspects. Each question was submitted to ChatGPT-4 exactly as it appeared on exam day, in its original French (Fig. 1), except for those containing images. For image-based questions, the software required additional details. For example, in annotation tasks, each arrow had to be specified, and the question reformulated accordingly for each label (Fig. 2). All interactions with ChatGPT-4 were carried out using the freely available web-based interface provided by OpenAI (https://chat.openai.com) during December 2024. We deliberately used the free version to ensure reproducibility and accessibility for users without a paid subscription, such as medical students or trainees. No plugins or API integrations were employed.
- Variables
- The primary outcome measured was the accuracy of ChatGPT-4 on the French Plastic Surgery board examination.
- Data sources/measurement
- The accuracy of ChatGPT-4 was evaluated for every question. Each question was pasted into the chatbox exactly as it appeared on the exam, with narrative text followed by multiple-choice options. The model’s answer was compared with the correct answer determined by examiners and scored according to the official grading scale. Two independent reviewers (E.D.B. and U.L.) assessed each response, with a third reviewer (A.K.) resolving any discrepancies. ChatGPT-4’s answers were further analyzed using the framework of natural coherence, a cognitive model describing how humans construct meaning from text by integrating different types of information [7]. This evaluation considered 3 characteristics: (1) logical reasoning: use of stepwise logic to generate an answer based on information in the question stem; (2) internal information: reliance on information contained within the question stem; and (3) external information: incorporation of external knowledge, including literature references.
- Incorrect responses were categorized as follows: (1) logical fallacy: stepwise reasoning present but the final answer incorrect; (2) informational fallacy: logic applied but failure to use a key piece of information from the question stem; and (3) explicit fallacy: absence of logical reasoning and inability to use information from the question stem (Dataset 1).
- Bias
- Each question was submitted to ChatGPT-4 in a new chat session to avoid answer contamination across items. Exceptions were made for clinical case-based questions, where items built logically on one another. In these cases, questions were entered sequentially within the same session to preserve contextual continuity.
- Study size
- ChatGPT-4 was tested on all 213 items, representing the complete set of written examination questions from the 2024 French Board of Plastic Surgery.
- Statistical methods
- Statistical analyses were performed using R software ver. 4.3.2 (The R Foundation for Statistical Computing) (Supplement 1). The chi-square test was applied to assess statistical significance for categorical variables. For image-based questions, Fisher’s exact test was used because the expected frequencies in the “image present” group were less than 5, ensuring an appropriate statistical evaluation of the observed differences. A 5% significance level was applied for all tests.
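- As an illustration, the comparison of correct and incorrect responses reported in Table 2 can be reproduced with a few lines of base R. This is a minimal sketch using the counts from Table 2, not the authors’ original analysis script; R’s default chisq.test() applies the Yates continuity correction to a 2×2 table and returns the reported P-value of 0.002 for logical reasoning.

```r
# Minimal sketch (not the original analysis script): chi-square test on the
# 2x2 table of logical reasoning by answer correctness, using counts from Table 2.
logical_tab <- matrix(
  c(165, 0,   # correct answers: logical reasoning present / absent
    44,  4),  # incorrect answers: logical reasoning present / absent
  nrow = 2, byrow = TRUE,
  dimnames = list(answer = c("correct", "incorrect"),
                  reasoning = c("present", "absent"))
)
chisq.test(logical_tab)  # X-squared ~ 9.9, df = 1, P ~ 0.002 (Yates correction)
# Note: R warns that the chi-square approximation may be inaccurate here,
# because one expected cell count is below 5.
```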
Results
- As described above, ChatGPT-4 was instructed to answer questions from all 10 modules of the exam. To ensure greater precision, responses to multiple-choice questions were analyzed individually, resulting in a total of 213 assessed items. Overall, ChatGPT-4 answered 165 items correctly, yielding an accuracy rate of 77.5% and a final score of 58.72/100 on the examination grading scale (Table 1). Compared with the candidates’ results, ChatGPT-4 would have ranked 36th out of 38. The software applied logical reasoning in 98.1% of questions, used internal information in 94.4%, and incorporated external information in 91.1%. Among incorrect responses, the most common fallacy was informational fallacy (81.3%), followed by logical fallacy (71.2%) and explicit fallacy (12.5%).
- When responses were stratified according to correctness, ChatGPT-4 incorporated logical reasoning, internal information, and external information in 100% of correct answers. In contrast, for incorrect responses, 91.7% included logical reasoning, 75.0% used internal information, and 60.4% relied on external information. Comparing correct and incorrect answers showed statistically significant differences for logical reasoning (P=0.002), internal information (P<0.001), and external information (P<0.001) (Table 2).
- A small subset of questions (n=4; 2%) included images in the question stem. For these items, ChatGPT-4 provided the correct answer in 25% (n=1) of cases (Table 3). For questions without images, the model achieved an accuracy rate of 78.4%. This difference was statistically significant (P=0.04).
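- For reference, the image stratification above can be reproduced from the counts in Table 3 with Fisher’s exact test in R. This minimal sketch (not the authors’ script) returns a two-sided P-value of approximately 0.037, consistent with the reported P=0.04.

```r
# Minimal sketch: Fisher's exact test on accuracy by image presence,
# using the counts reported in Table 3.
image_tab <- matrix(
  c(1,   3,    # image present: correct / incorrect
    164, 45),  # image absent:  correct / incorrect
  nrow = 2, byrow = TRUE,
  dimnames = list(image = c("present", "absent"),
                  answer = c("correct", "incorrect"))
)
fisher.test(image_tab)  # two-sided P ~ 0.037, consistent with the reported P = 0.04
```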
Discussion
- Key results
- This study aimed to quantitatively assess ChatGPT-4’s ability to interpret complex surgical and clinical information and to evaluate its potential implications for surgical education and training. We found that ChatGPT-4 performed at a satisfactory level on the French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination, with an accuracy rate that met the minimum passing standard.
- Interpretation
- ChatGPT-4 used internal and external information in more than 90% of cases and applied logical reasoning in 98%. We identified statistically significant differences in the use of logical reasoning, internal information, and external information between correct and incorrect responses. The significance of all 3 categories suggests that ChatGPT-4’s optimal performance depends on the interaction of these fundamental cognitive mechanisms. This may indicate that the model is not only retrieving facts but also synthesizing and contextualizing knowledge effectively. These results are consistent with prior studies that reported improved item accuracy in standardized exams, particularly with ChatGPT-4 [3,8].
- Although ChatGPT-4 passed the examination, it ranked 36th out of 38 candidates, with a score of 58.72/100 according to the official grading system. While sufficient to meet the minimum passing requirement, this score remains below the average human candidate’s performance. This discrepancy likely reflects the model's limitations in interpreting nuanced contextual cues, resolving ambiguous phrasing, or applying test-taking strategies commonly used by experienced candidates. Nevertheless, when items were considered independently, ChatGPT-4 achieved a 77.5% success rate, indicating a solid grasp of core surgical knowledge.
- Most existing studies comparing ChatGPT versions (3.5 vs. 4) have not emphasized image-based analysis. Our study incorporated this element and demonstrated a notable decrease in accuracy for image-based questions: 25% compared with 78.4% for text-only items. This highlights a major limitation of ChatGPT-4 in visually intensive specialties such as plastic surgery, in line with previous findings [9,10]. However, this observation should be interpreted cautiously, as only 4 image-related questions were included in the 2024 exam.
- Comparison with previous studies
- We selected ChatGPT-4 for this study because of its demonstrated superiority in medical applications, as documented in recent literature [11,12]. To our knowledge, this is the first application of ChatGPT-4 to plastic surgery in France. Considerable variability has been observed among surgical subspecialties with respect to written board examinations. Our findings are consistent with those of Gupta et al. [1], who reported that ChatGPT-3.5 achieved a satisfactory score on the Plastic Surgery In-Service Training Exam (PSITE), with further improvements observed with ChatGPT-4 [13]. In contrast to our findings, however, their statistical analysis revealed significant differences only in the use of external information between correct and incorrect responses, and image stratification showed no significant effect. These differences may reflect both improvements in model architecture and the unique characteristics of exam design. The French board exam is structured around thematically integrated, clinically oriented modules, requiring greater contextual interpretation. In contrast, the PSITE may rely more on factual recall, where external information alone can suffice. This could explain the stronger role of internal information and logical coherence in our study.
- Limitations
- As with all ChatGPT-related studies, one limitation is the variability of responses over time. Chen et al. [14] demonstrated that GPT-3.5 and GPT-4 performance can fluctuate significantly across testing periods. Our data were collected in December 2024 and thus reflect the model’s performance at that time. Additionally, each question was asked only once, so intra-model variability was not assessed and may have influenced some responses. Furthermore, ChatGPT-4’s training data were last updated in September 2023, potentially limiting accuracy for more recent developments. This study was also limited to the 2024 edition of the French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination, restricting generalizability across exam years. Only ChatGPT-4 was tested; no comparisons were made with other LLMs. We intentionally used the freely available public interface to simulate the experience of typical learners. Although ChatGPT-5 has since been released with reported improvements in reasoning and contextual analysis, we selected ChatGPT-4 for its stability, accessibility, and reliability at the time. Paid versions or enterprise integrations may also offer additional features that were not evaluated in this study.
- Furthermore, although a statistically significant result was observed in the stratified analysis based on image use, the number of image-based questions was minimal. Therefore, caution is warranted when interpreting these findings regarding ChatGPT’s ability to process visual content. Lastly, since language models such as ChatGPT may be influenced by prior exposure to publicly available materials, the availability of past exam content online could have contributed to an overestimation of performance.
- Generalizability
- These findings may have implications for medical education beyond the French board examination. The ability of a language model to perform reasonably well on a highly specialized assessment suggests that such tools could serve as valuable learning companions. In particular, they may provide support for exam preparation, content review, and instructional material development, especially in settings where access to expert faculty is limited. Similar applications have already been explored in other medical specialties. For example, Ebel et al. [15] demonstrated that ChatGPT-4o could create exam items of varying difficulty to train medical students and interventional radiologists. In plastic surgery, AI could be integrated into national board preparation programs by offering adaptive quizzes, realistic case simulations, and image-based exercises tailored to each resident’s training level and stage. Faculty review would remain essential to ensure accuracy and alignment with board standards. Although results from a single national exam cannot be universally generalized, they invite further investigation into the role of AI in health professions education across different languages, systems, and specialties. At the same time, concerns remain about the accuracy of generated content, the risk of learner overreliance, and the spread of outdated or incorrect information. Importantly, ChatGPT-4 cannot verify its own sources. Ethical issues—including transparency, data privacy, and equitable access—must also be carefully addressed as these technologies expand in use.
- Suggestions
- Future studies should compare multiple AI models to evaluate their relative effectiveness and accuracy in diverse educational contexts. Incorporating more visual materials, such as annotated images, would also enhance the assessment of AI’s ability to process and interpret visual data—an essential skill in surgical education. Additionally, longitudinal reviews of past exams could reveal performance trends and areas for improvement, helping to establish a framework for evaluating the long-term impact of AI-driven educational tools in health professional training. Another promising direction would be the implementation of user-selectable options within ChatGPT to prioritize peer-reviewed or highly reliable sources when generating scientific or medical information. Finally, with the release of ChatGPT-5, direct comparisons with ChatGPT-4 would be valuable. However, given the restricted functionality of the free version, such comparisons would likely require access to paid versions such as ChatGPT Plus or Pro.
- Conclusion
- Our study demonstrates that ChatGPT-4 is capable of passing the French plastic surgery board examination, though its performance remains below that of the average human candidate. These results suggest that ChatGPT-4 can serve as an interactive educational tool, helping residents review specific topics, reinforce key concepts, and practice exam-style questions. However, responses must be critically evaluated by learners and cannot be relied upon as a sole source of knowledge. Looking forward, continued progress in AI development may allow such systems to evolve into comprehensive training tools for plastic surgery residents.
Authors’ contributions
Conceptualization: EDB. Data curation: EDB, UL. Methodology/formal analysis/validation: EDB, AK, UL. Project administration: PP. Funding acquisition: none. Writing–original draft: EDB. Writing–review & editing: EDB, AK, FT, PP, UL.
Conflict of interest
No potential conflict of interest relevant to this article was reported.
Funding
None.
Data availability
Data files are available from https://doi.org/10.7910/DVN/ROXC5Y
Dataset 1. Answers classification.
jeehp-22-27-dataset1.xlsx
Acknowledgments
None.
Supplementary materials
Supplement files are available from https://doi.org/10.7910/DVN/ROXC5Y
Fig. 1. Example of a multiple-choice question submitted to ChatGPT-4, along with its response.
Fig. 2. Example of a question containing an image.
Table 1. Qualitative analysis of ChatGPT-4’s responses to the 2024 French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination

| Module | Overall | Correct responses | Logical reasoning | Internal information | External information | Incorrect responses | Logical fallacy | Informational fallacy | Explicit fallacy |
|---|---|---|---|---|---|---|---|---|---|
| Anatomy | 21 | 16 | 21 | 21 | 21 | 5 | 4 | 2 | 0 |
| General principles | 6 | 3 | 4 | 4 | 3 | 3 | 0 | 3 | 0 |
| Head and neck | 9 | 2 | 8 | 8 | 2 | 7 | 6 | 7 | 0 |
| Pediatrics | 13 | 11 | 13 | 13 | 13 | 2 | 2 | 1 | 0 |
| Thorax and abdomen | 49 | 40 | 48 | 49 | 49 | 9 | 5 | 5 | 5 |
| Oncology | 13 | 11 | 13 | 13 | 13 | 2 | 1 | 2 | 0 |
| Lower limb | 12 | 9 | 12 | 9 | 9 | 3 | 3 | 3 | 0 |
| Aesthetic | 14 | 8 | 14 | 8 | 8 | 6 | 6 | 6 | 0 |
| Upper limb | 42 | 35 | 42 | 42 | 42 | 7 | 7 | 6 | 1 |
| Medico-legal | 34 | 30 | 34 | 34 | 34 | 4 | 4 | 4 | 0 |
| Total | 213 | 165 | 209 | 201 | 194 | 48 | 38 | 39 | 6 |
| % |  | 77.5 | 98.1 | 94.4 | 91.1 | 22.5 | 71.2 | 81.3 | 12.5 |

Logical reasoning, internal information, and external information are the question characteristics used in ChatGPT-4’s responses; the fallacy columns give the reasons for incorrect answers.
Table 2. Types of responses generated by ChatGPT-4, stratified based on correctness (n=213)

| Response type | Total | Correct answers (n=165) | Incorrect answers (n=48) | P-value |
|---|---|---|---|---|
| Logical reasoning | 209 (98.1) | 165 (100.0) | 44 (91.7) | 0.002 |
| Internal information | 201 (94.4) | 165 (100.0) | 36 (75.0) | <0.001 |
| External information | 194 (91.1) | 165 (100.0) | 29 (60.4) | <0.001 |

Values are presented as number (%).
Table 3. Accuracy stratified by image presence in the question stem

| Image presence | Correct | Incorrect | Accuracy rate (%) | P-value |
|---|---|---|---|---|
| Yes (n=4) | 1 | 3 | 25.0 | 0.04* |
| No (n=209) | 164 | 45 | 78.4 |  |

*Fisher’s exact test.
References
- 1. Gupta R, Herzog I, Park JB, Weisberger J, Firouzbakht P, Ocon V, Chao J, Lee ES, Mailey BA. Performance of ChatGPT on the Plastic Surgery Inservice Training Examination. Aesthet Surg J 2023;43:NP1078-NP1082. https://doi.org/10.1093/asj/sjad128
- 2. Oh N, Choi GS, Lee WY. ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models. Ann Surg Treat Res 2023;104:269-273. https://doi.org/10.4174/astr.2023.104.5.269
- 3. Yudovich MS, Makarova E, Hague CM, Raman JD. Performance of GPT-3.5 and GPT-4 on standardized urology knowledge assessment items in the United States: a descriptive study. J Educ Eval Health Prof 2024;21:17. https://doi.org/10.3352/jeehp.2024.21.17
- 4. Tan S, Xin X, Wu D. ChatGPT in medicine: prospects and challenges: a review article. Int J Surg 2024;110:3701-3706. https://doi.org/10.1097/JS9.0000000000001312
- 5. OpenAI Platform [Internet]. OpenAI; 2025 [cited 2025 Feb 18]. Available from: https://platform.openai.com
- 6. Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, Kiuchi T. Performance of ChatGPT across different versions in medical licensing examinations worldwide: systematic review and meta-analysis. J Med Internet Res 2024;26:e60807. https://doi.org/10.2196/60807
- 7. Trabasso T. The development of coherence in narratives by understanding intentional action. Adv Psychol 1991;79:297-314. https://doi.org/10.1016/S0166-4115(08)61559-9
- 8. Aljindan FK, Shawosh MH, Altamimi L, Arif S, Mortada H. Utilization of ChatGPT-4 in plastic and reconstructive surgery: a narrative review. Plast Reconstr Surg Glob Open 2023;11:e5305. https://doi.org/10.1097/GOX.0000000000005305
- 9. Posner KM, Bakus C, Basralian G, Chester G, Zeiman M, O’Malley GR, Klein GR. Evaluating ChatGPT’s capabilities on orthopedic training examinations: an analysis of new image processing features. Cureus 2024;16:e55945. https://doi.org/10.7759/cureus.55945
- 10. Sato H, Ogasawara K. ChatGPT (GPT-4) passed the Japanese National License Examination for Pharmacists in 2022, answering all items including those with diagrams: a descriptive study. J Educ Eval Health Prof 2024;21:4. https://doi.org/10.3352/jeehp.2024.21.4
- 11. DiDonna N, Shetty PN, Khan K, Damitz L. Unveiling the potential of AI in plastic surgery education: a comparative study of leading AI platforms’ performance on in-training examinations. Plast Reconstr Surg Glob Open 2024;12:e5929. https://doi.org/10.1097/GOX.0000000000005929
- 12. Zong H, Wu R, Cha J, Wang J, Wu E, Li J, Zhou Y, Zhang C, Feng W, Shen B. Large language models in worldwide medical exams: platform development and comprehensive analysis. J Med Internet Res 2024;26:e66114. https://doi.org/10.2196/66114
- 13. Gupta R, Park JB, Herzog I, Yosufi N, Mangan A, Firouzbakht PK, Mailey BA. Applying GPT-4 to the Plastic Surgery Inservice Training Examination. J Plast Reconstr Aesthet Surg 2023;87:78-82. https://doi.org/10.1016/j.bjps.2023.09.027
- 14. Chen L, Zaharia M, Zou J. How is ChatGPT’s behavior changing over time? arXiv [Preprint] 2023 Oct 31. https://doi.org/10.48550/arXiv.2307.09009
- 15. Ebel S, Ehrengut C, Denecke T, Goessmann H, Beeskow AB. GPT-4o’s competency in answering the simulated written European Board of Interventional Radiology exam compared to a medical student and experts in Germany and its ability to generate exam items on interventional radiology: a descriptive study. J Educ Eval Health Prof 2024;21:21. https://doi.org/10.3352/jeehp.2024.21.21