JEEHP : Journal of Educational Evaluation for Health Professions

Research article
Performance of ChatGPT-4 on the French Board of Plastic Reconstructive and Aesthetic Surgery written exam: a descriptive study
Emma Dejean-Bouyer1*, Anoujat Kanlagna1, François Thuau1,2, Pierre Perrot1,2, Ugo Lancien1,2

DOI: https://doi.org/10.3352/jeehp.2025.22.27
Published online: September 30, 2025

1Plastic, Reconstructive, and Aesthetic Surgery Unit, Nantes University Hospital, Nantes, France

2Laboratory Regenerative Medicine and Skeleton (RMeS), UMRS 1229, INSERM, Nantes, France

*Corresponding email: emma.dejeanbouyer@chu-nantes.fr

Editor: A Ra Cho, The Catholic University of Korea, Korea

• Received: July 25, 2025   • Accepted: September 10, 2025

© 2025 Korea Health Personnel Licensing Examination Institute

This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Purpose
This study aims to evaluate the performance of Chat Generative Pre-Trained Transformer 4 (ChatGPT-4) on the French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination and to assess its role as a supplementary resource for residents preparing for the qualification examination in plastic surgery.

Methods
This descriptive study evaluated ChatGPT-4's performance on 213 items from the October 2024 French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination. Independent reviewers assessed each response for accuracy, logical reasoning, and use of internal and external information, and categorized incorrect answers by type of fallacy. Statistical analyses included chi-square tests and Fisher's exact test.

Results
ChatGPT-4 answered all questions across the 10 modules, achieving an overall accuracy rate of 77.5%. The model applied logical reasoning in 98.1% of questions, used internal information in 94.4%, and incorporated external information in 91.1%.

Conclusion
ChatGPT-4 performed satisfactorily on the French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination; its accuracy met the minimum passing standard. While its responses generally align with expected knowledge, careful verification remains necessary, particularly for questions involving image interpretation. As artificial intelligence continues to evolve, ChatGPT-4 is expected to become an increasingly reliable tool for medical education. At present, it remains a valuable resource for assisting plastic surgery residents in their training.
Background
Large language models (LLMs) have demonstrated remarkable abilities in answering complex medical questions, summarizing literature [1], and performing standardized medical exams across various specialties [2,3]. Several studies have evaluated LLM performance on national medical licensing exams, such as the United States Medical Licensing Examination [4], with varying degrees of accuracy. Artificial intelligence (AI)-powered tools are increasingly being integrated into medical education and clinical decision-making, offering potential benefits such as enhanced learning, rapid information retrieval, and assistance with exam preparation. Among these models, Chat Generative Pre-Trained Transformer (ChatGPT), developed by OpenAI, stands out as an advanced LLM capable of generating human-like responses conversationally by predicting word sequences based on prior input. One of its latest iterations, ChatGPT-4, represents a significant improvement over previous versions, particularly in accuracy and contextual understanding, with added multimodal capabilities such as image analysis [5]. At the time of this study, ChatGPT-4 had been trained on publicly available internet sources up to September 2023.
Over the past 3 years, researchers have evaluated ChatGPT’s medical knowledge using licensing examinations from various countries and across different specialties [6]. The specificity and complexity of plastic surgery board exams, especially in non-English contexts such as the French system, pose unique challenges for AI models because they often involve highly technical knowledge and case-based scenarios requiring deep domain expertise and understanding.
Objectives
This study aims to evaluate ChatGPT-4’s performance on the French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination. This assessment may help define its potential as an interactive educational tool to support French residents’ learning and exam preparation.
Ethics statement
No human subjects or healthcare data were used in this investigation, and ethical approval was not required.
Study design
We conducted a descriptive study evaluating the performance of ChatGPT-4 on items from the French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination.
Setting
We evaluated questions from the October 2024 French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination. The exam was divided into 10 modules: anatomy, head and neck, general principles, pediatrics, thorax and abdomen, upper limb, lower limb, oncology, aesthetic surgery, and medico-legal aspects. Each question was submitted to ChatGPT-4 exactly as it appeared on exam day, in its original French, except for those containing images (Fig. 1). For image-based questions, the software required additional details: in annotation tasks, for example, each arrow had to be specified and the question reformulated accordingly for each label (Fig. 2). All interactions with ChatGPT-4 were carried out during December 2024 using the freely available web-based interface provided by OpenAI (https://chat.openai.com). We deliberately used the free version to ensure reproducibility and accessibility for users without a paid subscription, such as medical students or trainees. No plugins or API integrations were employed.
Variables
The primary outcome measured was the accuracy of ChatGPT-4 on the French Plastic Surgery board examination.
Data sources/measurement
The accuracy of ChatGPT-4 was evaluated for every question. Each question was pasted into the chatbox exactly as it appeared on the exam, with narrative text followed by multiple-choice options. The model’s answer was compared with the correct answer determined by examiners and scored according to the official grading scale. Two independent reviewers (E.D.B. and U.L.) assessed each response, with a third reviewer (A.K.) resolving any discrepancies. ChatGPT-4’s answers were further analyzed using the framework of natural coherence, a cognitive model describing how humans construct meaning from text by integrating different types of information [7]. This evaluation considered 3 characteristics: (1) logical reasoning: use of stepwise logic to generate an answer based on information in the question stem; (2) internal information: reliance on information contained within the question stem; and (3) external information: incorporation of external knowledge, including literature references.
Incorrect responses were categorized as follows: (1) logical fallacy: stepwise reasoning present but the final answer incorrect; (2) informational fallacy: logic applied but a key piece of information from the question stem not used; and (3) explicit fallacy: absence of logical reasoning and inability to use information from the question stem (Dataset 1).
Bias
Each question was submitted to ChatGPT-4 in a new chat session to avoid answer contamination across items. Exceptions were made for clinical case-based questions, where items built logically on one another. In these cases, questions were entered sequentially within the same session to preserve contextual continuity.
Study size
ChatGPT-4 was tested on all 213 items, representing the complete set of written examination questions from the 2024 French Board of Plastic Surgery.
Statistical methods
Statistical analyses were performed using R software ver. 4.3.2 (The R Foundation for Statistical Computing) (Supplement 1). The unpaired chi-square test was used to assess the statistical significance of differences in categorical variables. For image-based questions, Fisher's exact test was used because the expected frequencies in the "image present" group were fewer than 5, ensuring appropriate statistical evaluation of the observed differences. A 5% significance level was applied for all tests.
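As a check, the 2×2 comparison for logical reasoning in Table 2 can be reproduced from the published counts alone. The sketch below is an illustration in Python rather than the authors' R code (Supplement 1); it assumes the Yates continuity correction that R's chisq.test applies to 2×2 tables by default, which recovers the reported P-value:

```python
from math import erfc, sqrt

# Logical reasoning vs. correctness (counts from Table 2):
# 165/165 correct answers used logic; 44/48 incorrect answers did.
obs = [[165, 0],   # correct answers: logic present, logic absent
       [44, 4]]    # incorrect answers
row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]
n = sum(row)

# Yates-corrected chi-square statistic for a 2x2 table
chi2 = sum((abs(obs[i][j] - row[i] * col[j] / n) - 0.5) ** 2
           / (row[i] * col[j] / n)
           for i in range(2) for j in range(2))

# Upper-tail chi-square probability with 1 degree of freedom
p_value = erfc(sqrt(chi2 / 2))
print(round(p_value, 3))  # 0.002, matching Table 2
```

The same computation applied to the internal- and external-information rows yields the P<0.001 values reported in Table 2.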
Results
As described above, ChatGPT-4 was instructed to answer each of the 10 sections of the exam. For greater precision, responses to multiple-choice questions were analyzed individually, yielding a total of 213 assessed items. Overall, ChatGPT-4 answered 165 items correctly, corresponding to an accuracy rate of 77.5% and a final score of 58.72/100 on the examination grading scale (Table 1). Compared with the candidates' results, ChatGPT-4 would have ranked 36th out of 38. The software applied logical reasoning in 98.1% of questions, used internal information in 94.4%, and incorporated external information in 91.1%. Among incorrect responses, the most common fallacy was informational fallacy (81.3%), followed by logical fallacy (71.2%) and explicit fallacy (12.5%).
When responses were stratified according to correctness, ChatGPT-4 incorporated logical reasoning, internal information, and external information in 100% of correct answers. In contrast, for incorrect responses, 91.7% included logical reasoning, 75.0% used internal information, and 60.4% relied on external information. Comparing correct and incorrect answers showed statistically significant differences for logical reasoning (P=0.002), internal information (P<0.001), and external information (P<0.001) (Table 2).
A small subset of questions (n=4; 2%) included images in the question stem. For these items, ChatGPT-4 provided the correct answer in 25% (n=1) of cases (Table 3). For questions without images, the model achieved an accuracy rate of 78.4%. This difference was statistically significant (P=0.04).
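The image-stratified result can be checked the same way. The following Python sketch (again an illustration, not the authors' R code) computes the two-sided Fisher's exact P-value for Table 3 directly from the hypergeometric distribution:

```python
from math import comb

# Table 3: image-based questions (n=4, 3 incorrect) vs.
# text-only questions (n=209, 45 incorrect); 48/213 incorrect overall
total, incorrect_total, n_image, incorrect_image = 213, 48, 4, 3

def pmf(k):
    """Hypergeometric probability of k incorrect answers among the 4 image items."""
    return (comb(incorrect_total, k)
            * comb(total - incorrect_total, n_image - k)
            / comb(total, n_image))

p_obs = pmf(incorrect_image)
# Two-sided Fisher's exact test: sum the probabilities of all outcomes
# no more likely than the observed table
p_value = sum(pmf(k) for k in range(n_image + 1)
              if pmf(k) <= p_obs * (1 + 1e-9))
print(round(p_value, 2))  # 0.04, matching Table 3
```

With only 4 image items, the exact test is the appropriate choice here, since the expected count of incorrect answers in that group (4 × 48/213 ≈ 0.9) falls well below 5.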
Key results
This study aimed to quantitatively assess ChatGPT-4’s ability to interpret complex surgical and clinical information and to evaluate its potential implications for surgical education and training. We found that ChatGPT-4 performed at a satisfactory level on the French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination, with an accuracy rate that met the minimum passing standard.
Interpretation
ChatGPT-4 used internal and external information in more than 90% of cases and applied logical reasoning in 98%. We identified statistically significant differences in the use of logical reasoning, internal information, and external information between correct and incorrect responses. The significance of all 3 categories suggests that ChatGPT-4’s optimal performance depends on the interaction of these fundamental cognitive mechanisms. This may indicate that the model is not only retrieving facts but also synthesizing and contextualizing knowledge effectively. These results are consistent with prior studies that reported improved item accuracy in standardized exams, particularly with ChatGPT-4 [3,8].
Although ChatGPT-4 passed the examination, it ranked 36th out of 38 candidates, with a score of 58.72/100 according to the official grading system. While sufficient to meet the minimum passing requirement, this score remains below the average human candidate’s performance. This discrepancy likely reflects the model's limitations in interpreting nuanced contextual cues, resolving ambiguous phrasing, or applying test-taking strategies commonly used by experienced candidates. Nevertheless, when items were considered independently, ChatGPT-4 achieved a 77.5% success rate, indicating a solid grasp of core surgical knowledge.
Most existing studies comparing ChatGPT versions (3.5 vs. 4) have not emphasized image-based analysis. Our study incorporated this element and demonstrated a notable decrease in accuracy for image-based questions: 25% compared with 78.4% for text-only items. This highlights a major limitation of ChatGPT-4 in visually intensive specialties such as plastic surgery, in line with previous findings [9,10]. However, this observation should be interpreted cautiously, as only 4 image-related questions were included in the 2024 exam.
Comparison with previous studies
We selected ChatGPT-4 for this study because of its demonstrated superiority in medical applications, as documented in recent literature [11,12]. To our knowledge, this is the first application of ChatGPT-4 to plastic surgery in France. Considerable variability has been observed among surgical subspecialties with respect to written board examinations. Our findings are consistent with those of Gupta et al. [1], who reported that ChatGPT-3.5 achieved a satisfactory score on the Plastic Surgery In-Service Training Exam (PSITE), with further improvements observed with ChatGPT-4 [13]. In contrast to our findings, however, their statistical analysis revealed significant differences only in the use of external information between correct and incorrect responses, and image stratification showed no significant effect. These differences may reflect both improvements in model architecture and the unique characteristics of exam design. The French board exam is structured around thematically integrated, clinically oriented modules, requiring greater contextual interpretation. In contrast, the PSITE may rely more on factual recall, where external information alone can suffice. This could explain the stronger role of internal information and logical coherence in our study.
Limitations
As with all ChatGPT-related studies, one limitation is the variability of responses over time. Chen et al. [14] demonstrated that GPT-3.5 and GPT-4 performance can fluctuate significantly across testing periods. Our data were collected in September 2024 and thus reflect the model’s performance at that time. Additionally, each question was asked only once, so intra-model variability was not assessed and may have influenced some responses. Furthermore, ChatGPT-4’s training data were last updated in September 2023, potentially limiting accuracy for more recent developments. This study was also limited to the 2024 edition of the French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination, restricting generalizability across exam years. Only ChatGPT-4 was tested; no comparisons were made with other LLMs. We intentionally used the freely available public interface to simulate the experience of typical learners. Although ChatGPT-5 has since been released with reported improvements in reasoning and contextual analysis, we selected ChatGPT-4 for its stability, accessibility, and reliability at the time. Paid versions or enterprise integrations may also offer additional features that were not evaluated in this study.
Furthermore, although a statistically significant result was observed in the stratified analysis based on image use, the number of image-based questions was minimal. Therefore, caution is warranted when interpreting these findings regarding ChatGPT’s ability to process visual content. Lastly, since language models such as ChatGPT may be influenced by prior exposure to publicly available materials, the availability of past exam content online could have contributed to an overestimation of performance.
Generalizability
These findings may have implications for medical education beyond the French board examination. The ability of a language model to perform reasonably well on a highly specialized assessment suggests that such tools could serve as valuable learning companions. In particular, they may provide support for exam preparation, content review, and instructional material development, especially in settings where access to expert faculty is limited. Similar applications have already been explored in other medical specialties. For example, Ebel et al. [15] demonstrated that ChatGPT-4o could create exam items of varying difficulty to train medical students and interventional radiologists. In plastic surgery, AI could be integrated into national board preparation programs by offering adaptive quizzes, realistic case simulations, and image-based exercises tailored to each resident’s training level and stage. Faculty review would remain essential to ensure accuracy and alignment with board standards. Although results from a single national exam cannot be universally generalized, they invite further investigation into the role of AI in health professions education across different languages, systems, and specialties. At the same time, concerns remain about the accuracy of generated content, the risk of learner overreliance, and the spread of outdated or incorrect information. Importantly, ChatGPT-4 cannot verify its own sources. Ethical issues—including transparency, data privacy, and equitable access—must also be carefully addressed as these technologies expand in use.
Suggestions
Future studies should compare multiple AI models to evaluate their relative effectiveness and accuracy in diverse educational contexts. Incorporating more visual materials, such as annotated images, would also enhance the assessment of AI’s ability to process and interpret visual data—an essential skill in surgical education. Additionally, longitudinal reviews of past exams could reveal performance trends and areas for improvement, helping to establish a framework for evaluating the long-term impact of AI-driven educational tools in health professional training. Another promising direction would be the implementation of user-selectable options within ChatGPT to prioritize peer-reviewed or highly reliable sources when generating scientific or medical information. Finally, with the release of ChatGPT-5, direct comparisons with ChatGPT-4 would be valuable. However, given the restricted functionality of the free version, such comparisons would likely require access to paid versions such as ChatGPT Plus or Pro.
Conclusion
Our study demonstrates that ChatGPT-4 is capable of passing the French plastic surgery board examination, though its performance remains below that of the average human candidate. These results suggest that ChatGPT-4 can serve as an interactive educational tool, helping residents review specific topics, reinforce key concepts, and practice exam-style questions. However, responses must be critically evaluated by learners and cannot be relied upon as a sole source of knowledge. Looking forward, continued progress in AI development may allow such systems to evolve into comprehensive training tools for plastic surgery residents.

Authors’ contributions

Conceptualization: EDB. Data curation: EDB, UL. Methodology/formal analysis/validation: EDB, AK, UL. Project administration: PP. Funding acquisition: none. Writing–original draft: EDB. Writing–review & editing: EDB, AK, FT, PP, UL.

Conflict of interest

No potential conflict of interest relevant to this article was reported.

Funding

None.

Data availability

Data files are available from https://doi.org/10.7910/DVN/ROXC5Y

Dataset 1. Answers classification.

jeehp-22-27-dataset1.xlsx

Acknowledgments

None.

Supplement files are available from https://doi.org/10.7910/DVN/ROXC5Y
Supplement 1. R code for the analysis of data.
jeehp-22-27-suppl1.docx
Supplement 2. Audio recording of the abstract.
jeehp-22-27-abstract-recording.avi
Fig. 1.
Example of a multiple-choice question submitted to ChatGPT-4, along with its response.
jeehp-22-27f1.jpg
Fig. 2.
Example of a question containing an image.
jeehp-22-27f2.jpg
jeehp-22-27f3.jpg
Table 1.
Qualitative analysis of ChatGPT-4's responses to the 2024 French Board of Plastic, Reconstructive, and Aesthetic Surgery written examination. The "Logical," "Internal," and "External" columns give the number of responses showing each question characteristic; the fallacy columns give the reasons for incorrect answers (categories are not mutually exclusive).

Module | Overall | Correct | Logical | Internal | External | Incorrect | Logical fallacy | Informational fallacy | Explicit fallacy
Anatomy | 21 | 16 | 21 | 21 | 21 | 5 | 4 | 2 | 0
General principles | 6 | 3 | 4 | 4 | 3 | 3 | 0 | 3 | 0
Head and neck | 9 | 2 | 8 | 8 | 2 | 7 | 6 | 7 | 0
Pediatrics | 13 | 11 | 13 | 13 | 13 | 2 | 2 | 1 | 0
Thorax and abdomen | 49 | 40 | 48 | 49 | 49 | 9 | 5 | 5 | 5
Oncology | 13 | 11 | 13 | 13 | 13 | 2 | 1 | 2 | 0
Lower limb | 12 | 9 | 12 | 9 | 9 | 3 | 3 | 3 | 0
Aesthetic | 14 | 8 | 14 | 8 | 8 | 6 | 6 | 6 | 0
Upper limb | 42 | 35 | 42 | 42 | 42 | 7 | 7 | 6 | 1
Medico-legal | 34 | 30 | 34 | 34 | 34 | 4 | 4 | 4 | 0
Total | 213 | 165 | 209 | 201 | 194 | 48 | 38 | 39 | 6
% | | 77.5 | 98.1 | 94.4 | 91.1 | 22.5 | 71.2 | 81.3 | 12.5
Table 2.
Types of responses generated by ChatGPT-4, stratified by correctness (n=213)

Response type | Total | Correct answers (n=165) | Incorrect answers (n=48) | P-value
Logical | 209 (98.1) | 165 (100.0) | 44 (91.7) | 0.002
Internal | 201 (94.4) | 165 (100.0) | 36 (75.0) | <0.001
External | 194 (91.1) | 165 (100.0) | 29 (60.4) | <0.001

Values are presented as number (%), unless otherwise stated.

Table 3.
Accuracy stratified by image presence in the question stem

Image presence | Correct | Incorrect | Accuracy rate (%) | P-value
Yes (n=4) | 1 | 3 | 25.0 | 0.04*
No (n=209) | 164 | 45 | 78.4 |

* P<0.05.
