Yasmine El Moussaoui, Laila Lahlou, Imad Chakri, Hicham Nassik
Research and Innovation Laboratory in Health Science, Faculty of Medicine and Pharmacy, Ibn Zohr University, Agadir, Morocco
© 2025 Korea Health Personnel Licensing Examination Institute
This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Authors’ contributions
Conceptualization: SL, YE, LL, HN. Data curation: SL, YE, LL. Methodology/formal analysis/validation: SL, YE, LL, IC, HN. Project administration: SL, LL. Funding acquisition: SL, YE, LL. Writing–original draft: SL, YE, IC. Writing–review & editing: SL, YE, LL, IC, HN.
Conflict of interest
No potential conflict of interest relevant to this article was reported.
Funding
None.
Data availability
Not applicable.
Acknowledgments
None.
| Authors (year) (country) | Study design | Sample size & population | Intervention (experimental) | Intervention (control) | Study outcomes | Assessment tool | AI approach | Key findings | MMAT criteria met |
|---|---|---|---|---|---|---|---|---|---|
| Kron et al. [22] (2017) (USA) | RCT | 421 second-year medical students | MPathic-VR | Self-directed online CBL module | Interprofessional & intercultural communication | MPathic-VR; OSCE | Symbolic | Significant effect on overall communication skills, with MPathic-VR students scoring higher (mean=0.806, SD=0.201) than CBL students (mean=0.752, SD=0.198); F(1,414)=6.09, P=0.014, η²=0.0145. | 3/5 Some criteria unmet |
| Fazlollahi et al. [17] (2022) (Canada) | RCT | 70 first- or second-year medical students | VOA | Simulation without feedback | Emotional regulation | MES | Computational | Positive emotions increased (mean difference=+0.36; P<0.001) and negative emotions decreased (mean difference=–0.59; P<0.001) after simulation training, with no significant difference between groups (P>0.05). | 5/5 All criteria met |
| Wang et al. [26] (2022) (China) | Quasi-experimental | 15 medical graduate students | Real records + AI feedback | None | Clinical thinking | AIteach system evaluation | Computational | Significant improvement in clinical thinking scores between pre-test (mean=69.87, SD=14.69) and post-test (mean=85.6, SD=11.31) (P<0.01). | 3/5 Some criteria unmet |
| Borg et al. [15] (2024) (Sweden) | Qualitative | 23 third-year medical students | Social robot | Computer-based VP | Clinical reasoning | Semi-structured interviews | Computational | The social robotics platform improved clinical reasoning through symptom-based reasoning, hypothesis generation, and adapting to new patient information. | 5/5 All criteria met |
| Holderried et al. [20] (2024) (Germany) | Mixed methods | 26 medical students | GPT-3.5 chatbot | None | Communication (history-taking) | QAPs | Computational | GPT-3.5 chatbot enabled realistic patient interaction for practicing communication, with 97.9% of answers rated as plausible. | 4/5 Most criteria met |
| Zheng et al. [29] (2024) (China) | Quasi-experimental | 66 first-year medical students | ChatGPT | Theory (PowerPoint) and lab sessions | Critical thinking, communication | Mini-CEX; Clinical Critical Thinking Scale | Computational | The experimental group scored significantly higher in clinical critical thinking (mean=92.94, SD=2.13 vs. mean=89.31, SD=2.53; P<0.001) and communication skills (mean=76.24, SD=12.30 vs. mean=70.19, SD=11.26; P<0.05) than the control group. | 5/5 All criteria met |
| Yang & Shulghuf [28] (2019) (Taiwan) | Quasi-experimental | 72 medical interns | WKS-2RII AI-enabled suturing system | Standard clinical training | Self-confidence | Questionnaire | Computational | Self-confidence in suturing/ligature skills increased in all groups, with significant gains in the AI-assisted group after repeated practice (1 session: mean=3.8, SD=0.3 → 3 sessions: mean=4.7, SD=1.2; out of 5 points; P<0.05 compared with regular and expert-led groups). | 5/5 All criteria met |
| Gutterman et al. [18] (2019) (USA) | Mixed methods | 417 second-year medical students | MPathic-VR | Standard computer module | Communication | MPathic-VR; OSCE | Symbolic | Significant effects of the MPathic-VR intervention on communication were observed (intercultural: mean=11.7 → 5.9; interprofessional: mean=7.6 → 4.6; P<0.001), compared to the control group. | 5/5 All criteria met |
| Mestre et al. [23] (2022) (Portugal) | Quasi-experimental | 293 medical students (1st–3rd year) | Body Interact | None | Clinical reasoning, decision-making, communication, confidence, conflict management | Questionnaire | Symbolic | Significant improvements in clinical reasoning (mean=5.09 → 5.46), decision-making (mean=4.72 → 5.20), communication (mean=4.62 → 5.03), confidence (mean=4.75 → 5.18), and conflict management (mean=4.34 → 4.89). | 5/5 All criteria met |
| Watari et al. [27] (2020) (Japan) | Quasi-experimental | 169 fourth-year medical students | Body Interact | None | Clinical reasoning | Questionnaire | Symbolic | Significant improvements in clinical reasoning (mean=5.39 → 7.81; P<0.001). | 5/5 All criteria met |
| Jeddi et al. [21] (2024) (Iran) | Quasi-experimental | 80 medical interns (6th–7th year) | HAMTA computer case-based simulation | Traditional lectures | Decision-making | Scenario-based tests | Symbolic | No statistically significant differences existed between the groups in clinical decision-making (diagnosis and antibiotic prescription) (P>0.21). | 5/5 All criteria met |
| Borg et al. [14] (2025) (Sweden) | Mixed methods | 62 third-year medical students | Social robotic platform + LLM | None | Clinical reasoning, communication, decision-making | Questionnaire | Computational | The social robotic platform improved communication, clinical reasoning, and decision-making (clinical reasoning: mean=4.4 vs. 4.1, P=0.01; decision-making: mean=4.4 vs. 3.9, P=0.03). | 4/5 Most criteria met |
| Mukadam et al. [24] (2025) (UK) | Mixed methods | 27 fourth- and fifth-year medical students | ChatGPT-4o voice AI standardized patient | None | Communication, self-confidence | Self-reported assessments; ITEM | Computational | ChatGPT improved the perceived usefulness of communication (median 3 → 4; P=0.010) and increased students’ confidence in managing difficult patients, delivering bad news, and counselling anxious patients (P<0.001). | 5/5 All criteria met |
| Holderried et al. [19] (2024) (Germany) | Quasi-experimental | 106 third-year medical students | GPT-4 virtual patient | None | Communication (history-taking) | GPT-4–based communication skills assessment system | Computational | High communication performance, with strong agreement between GPT-4 and the human expert (κ=0.832). | 3/5 Some criteria unmet |
| Aster et al. [13] (2025) (Germany) | Quasi-experimental | 35 third-year medical students | ChatGPT 3.5 virtual patient | None | Empathic history-taking | ECCS | Computational | Students demonstrated adequate history taking and empathic communication, with 14% of interactions identified as empathic (ECCS coding, ICC=0.770). | 3/5 Some criteria unmet |
| Derakhshan et al. [16] (2025) (Iran) | Quasi-experimental | 404 third-semester medical students | Safir humanoid robot | Non-robotic virtual agent Medobot | Empathy, communication | OSVEs; Mini-CEX | Symbolic | Interaction with the AI robot improved students’ empathy and communication skills compared to the control group (mean listening score=15.86 vs. 13.65; mean speaking score=16.15 vs. 14.21; P<0.001). | 5/5 All criteria met |
| Pears et al. [25] (2024) (UK) | Mixed methods | 27 urology trainees | GPT-4 | None | Communication, critical thinking, empathy | Self-reported questionnaires; structured custom form (by expert) | Computational | Significantly higher scores were observed for communication (linguistic terminology: U=155.5, P=0.003; complexity: U=184, P=0.020), critical thinking (U=496, P=0.020, ES=0.398), and empathy (mean 3.63 vs. 3.04; P=0.021). | 5/5 All criteria met |
| Yamamoto et al. [32] (2024) (Japan) | Quasi-experimental | 145 fourth-year medical students | GPT-4 Turbo via miibo | Traditional program without AI | Communication (medical interview) | OSCE | Computational | Significant improvements in medical interview scores were observed in the AI group compared to the control group (mean=28.1 vs. 27.1; P<0.05). | 3/5 Some criteria unmet |
| Brügge et al. [30] (2024) (Germany) | RCT | 21 medical students (mostly 3rd semester) | ChatGPT 3.5 | Simulated patient conversation (without AI feedback) | Clinical reasoning, decision-making | CRI-HTI | Computational | Significant improvements in clinical reasoning and decision-making were observed in the AI feedback group compared to the control group (F(1,20)=4.44, P=0.049, η²=0.198). | 4/5 Most criteria met |
| Wang et al. [31] (2025) (China) | RCT | 56 fifth-year medical students | GPT-4 | Traditional role-playing with experienced instructors | Communication (history taking), clinical reasoning | OSCE | Computational | Significant improvements in history taking and clinical reasoning were observed in the GPT group (mean=86.79, SD=5.46) compared to the control group (mean=73.64, SD=4.76; t=9.60, P<0.001). | 5/5 All criteria met |
The “MMAT criteria met” column shows the number of criteria fulfilled; the qualitative summary reflects the extent to which criteria were met. No overall score or percentage was calculated, in line with MMAT guidance.
AI, artificial intelligence; MMAT, Mixed Methods Appraisal Tool; RCT, randomized controlled trial; VR, virtual reality; OSCE, objective structured clinical examination; SD, standard deviation; CBL, computer-based learning; VOA, Virtual Operative Assistant; MES, Medical Emotion Scale; VP, virtual patient; GPT-3, Generative Pre-trained Transformer 3; QAPs, question–answer pairs; Mini-CEX, Mini Clinical Evaluation Exercise; WKS-2RII, Waseda–Kyoto–Kagaka Suture No. 2 Refined II; LLM, large language model; ITEM, Immersive Technology Evaluation Measure; ECCS, Empathic Communication Coding System; ICC, intraclass correlation coefficient; OSVEs, Objective Structured Video Examinations; ES, effect size; CRI-HTI, Clinical Reasoning Indicator–History Taking Inventory.
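As a side note, the reported effect sizes can be cross-checked against the test statistics in the table. A minimal sketch, assuming the conventional one-way ANOVA relation η² = (F·df1)/(F·df1 + df2) (this formula is an assumption, not stated in the source), applied to the statistics reported by Kron et al. [22]:

```python
def eta_squared(f_stat: float, df1: int, df2: int) -> float:
    """Eta squared recovered from a one-way ANOVA F statistic
    and its numerator (df1) and denominator (df2) degrees of freedom."""
    return (f_stat * df1) / (f_stat * df1 + df2)

# Kron et al. [22] report F(1,414)=6.09 with eta squared = 0.0145
eta = eta_squared(6.09, 1, 414)
print(round(eta, 4))  # prints 0.0145, matching the tabled value
```

Note that the same relation does not exactly reproduce the η²=0.198 reported by Brügge et al. [30] for F(1,20)=4.44, which may reflect a partial η² computed from a different error term; the original report would need to be consulted.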
| PICOS items | Inclusion and exclusion criteria |
|---|---|
| Population | Undergraduate and postgraduate medical learners, regardless of academic level or grades |
| Intervention | All types of simulation-based artificial intelligence interventions |
| Comparison | If a control group was included, it consisted of learners who used either a simulation technique or traditional teaching methods |
| Outcome | Included studies had to assess the non-technical skills under investigation |
| Study design | Quantitative, qualitative, and mixed-methods studies |
PICOS, population, intervention, comparison, outcome, and study design.
| Skill category | Contents |
|---|---|
| Cognitive skills | Critical thinking, reflection, reflective practice, decision-making, problem-solving, clinical reasoning |
| Interpersonal skills | Communication, teamwork, leadership |
| Emotional-social skills | Empathy, emotional regulation, self-confidence, stress management |