Background ChatGPT is a large language model (LLM) based on artificial intelligence (AI) capable of responding in multiple languages and generating nuanced and highly complex responses. While ChatGPT holds promising applications in medical education, its limitations and potential risks cannot be ignored.
Methods A scoping review was conducted for English articles discussing ChatGPT in the context of medical education published after 2022. A literature search was performed using PubMed/MEDLINE, Embase, and Web of Science databases, and information was extracted from the relevant studies that were ultimately included.
Results ChatGPT exhibits various potential applications in medical education, such as providing personalized learning plans and materials, creating clinical practice simulation scenarios, and assisting in writing articles. However, challenges associated with academic integrity, data accuracy, and potential harm to learning were also highlighted in the literature. The paper emphasizes certain recommendations for using ChatGPT, including the establishment of guidelines. Based on the review, 3 key research areas were proposed: cultivating the ability of medical students to use ChatGPT correctly, integrating ChatGPT into teaching activities and processes, and proposing standards for the use of AI by medical students.
Conclusion ChatGPT has the potential to transform medical education, but careful consideration is required for its full integration. To harness the full potential of ChatGPT in medical education, attention should not only be given to the capabilities of AI but also to its impact on students and teachers.
Ariel Shana Frey-Vogel, Kristina Dzara, Kimberly Anne Gifford, Yoon Soo Park, Justin Berk, Allison Heinly, Darcy Wolcott, Daniel Adam Hall, Shannon Elliott Scott-Vernaglia, Katherine Anne Sparger, Erica Ye-pyng Chung
J Educ Eval Health Prof. 2024;21:3. Published online February 23, 2024
Purpose Despite educational mandates to assess resident teaching competence, limited instruments with validity evidence exist for this purpose. Existing instruments do not allow faculty to assess resident-led teaching in a large group format or whether teaching was interactive. This study gathers validity evidence on the use of the Resident-led Large Group Teaching Assessment Instrument (Relate), an instrument used by faculty to assess resident teaching competency. Relate comprises 23 behaviors divided into 6 elements: learning environment, goals and objectives, content of talk, promotion of understanding and retention, session management, and closure.
Methods Messick’s unified validity framework was used for this study. Investigators used video recordings of resident-led teaching from 3 pediatric residency programs to develop Relate and a rater guidebook. Faculty were trained on instrument use through frame-of-reference training. Resident teaching at all sites was video-recorded during 2018–2019. Two trained faculty raters assessed each video. Descriptive statistics on performance were obtained. Validity evidence sources include: rater training effect (response process), reliability and variability (internal structure), and impact on Milestones assessment (relations to other variables).
Results Forty-eight videos, from 16 residents, were analyzed. Rater training improved inter-rater reliability from 0.04 to 0.64. The Φ-coefficient reliability was 0.50. There was a significant correlation between overall Relate performance and the pediatric teaching Milestone (r=0.34, P=0.019).
Conclusion Relate provides validity evidence with sufficient reliability to measure resident-led large-group teaching competence.
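The abstract reports inter-rater reliability improving from 0.04 to 0.64 after rater training. As an illustration of one common two-rater reliability summary, the sketch below computes Cohen's kappa by hand on invented category ratings; the study does not specify which coefficient it used, and these ratings are not the Relate data.

```python
# Cohen's kappa for two raters, computed by hand on synthetic ratings
# (illustrative only; not the Relate study's data or necessarily its metric).
import numpy as np

rater_a = np.array([2, 3, 3, 1, 2, 3, 2, 1, 3, 2, 2, 3])
rater_b = np.array([2, 3, 2, 1, 2, 3, 2, 1, 3, 2, 3, 3])

categories = np.union1d(rater_a, rater_b)
observed = np.mean(rater_a == rater_b)          # raw percent agreement
p_a = np.array([(rater_a == c).mean() for c in categories])
p_b = np.array([(rater_b == c).mean() for c in categories])
expected = float(p_a @ p_b)                     # agreement expected by chance
kappa = (observed - expected) / (1 - expected)  # chance-corrected agreement
print(f"agreement={observed:.2f}, kappa={kappa:.2f}")
```

Kappa discounts the agreement two raters would reach by guessing from their marginal category frequencies, which is why it is preferred over raw agreement for rating instruments like this one.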
Purpose Coronavirus disease 2019 (COVID-19) has heavily impacted medical clinical education in Taiwan. Medical curricula have been altered to minimize exposure and limit transmission. This study investigated the effect of COVID-19 on Taiwanese medical students’ clinical performance using online standardized evaluation systems and explored the factors influencing medical education during the pandemic.
Methods Medical students were scored from 0 to 100 based on their clinical performance from 1/1/2018 to 6/30/2021. The students were placed into pre-COVID-19 (before 2/1/2020) and midst-COVID-19 (on and after 2/1/2020) groups. Each group was further categorized into COVID-19-affected specialties (pulmonary, infectious, and emergency medicine) and other specialties. Generalized estimating equations (GEEs) were used to compare and examine the effects of relevant variables on student performance.
Results In total, 16,944 clinical scores were obtained for COVID-19-affected specialties and other specialties. For the COVID-19-affected specialties, the midst-COVID-19 score (88.51±3.52) was significantly lower than the pre-COVID-19 score (90.14±3.55) (P<0.0001). For the other specialties, the midst-COVID-19 score (88.32±3.68) was also significantly lower than the pre-COVID-19 score (90.06±3.58) (P<0.0001). There were 1,322 students (837 males and 485 females). Male students had significantly lower scores than female students (89.33±3.68 vs. 89.99±3.66, P=0.0017). GEE analysis revealed that the COVID-19 pandemic (unstandardized beta coefficient [B]=-1.99, standard error [SE]=0.13, P<0.0001), COVID-19-affected specialties (B=0.26, SE=0.11, P=0.0184), female students (B=1.10, SE=0.20, P<0.0001), and female attending physicians (B=-0.19, SE=0.08, P=0.0145) were independently associated with students’ scores.
Conclusion COVID-19 negatively impacted medical students' clinical performance, regardless of their specialty. Female students outperformed male students, irrespective of the pandemic.
Purpose The purpose of this project was to implement a process for learner-driven, formative, prospective, ad-hoc, entrustment assessment in Doctor of Physical Therapy clinical education. Our goals were to develop an innovative entrustment assessment tool, and then explore whether the tool detected (1) differences between learners at different stages of development and (2) differences within learners across the course of a clinical education experience. We also investigated whether there was a relationship between the number of assessments and change in performance.
Methods A prospective, observational, cohort of clinical instructors (CIs) was recruited to perform learner-driven, formative, ad-hoc, prospective, entrustment assessments. Two entrustable professional activities (EPAs) were used: (1) gather a history and perform an examination and (2) implement and modify the plan of care, as needed. CIs provided a rating on the entrustment scale and provided narrative support for their rating.
Results Forty-nine learners participated across 4 clinical experiences (CEs), resulting in 453 EPA learner-driven assessments. For both EPAs, statistically significant changes were detected both between learners at different stages of development and within learners across the course of a CE. Improvement within each CE was significantly related to the number of feedback opportunities.
Conclusion The results of this pilot study provide preliminary support for the use of learner-driven, formative, ad-hoc assessments of competence based on EPAs with a novel entrustment scale. The number of formative assessments requested correlated with change on the EPA scale, suggesting that formative feedback may augment performance improvement.
Purpose This study aimed to evaluate the impact of a transcultural nursing course on enhancing the cultural competency of graduate nursing students in Korea. We hypothesized that participants’ cultural competency would significantly improve in areas such as communication, biocultural ecology and family, dietary habits, death rituals, spirituality, equity, and empowerment and intermediation after completing the course. Furthermore, we assessed the participants’ overall satisfaction with the course.
Methods A before-and-after study was conducted with graduate nursing students at Hallym University, Chuncheon, Korea, from March to June 2023. A transcultural nursing course was developed based on Giger & Haddad’s transcultural nursing model and Purnell’s theoretical model of cultural competence. Data were collected using a cultural competence scale for registered nurses developed by Kim and his colleagues. A total of 18 students participated, and the paired t-test was employed to compare pre- and post-intervention scores.
Results The study revealed significant improvements in all 7 categories of cultural nursing competence (P<0.01). Specifically, the mean differences in scores (pre–post) ranged from 0.74 to 1.09 across the categories. Additionally, participants expressed high satisfaction with the course, with an average score of 4.72 out of a maximum of 5.0.
Conclusion The transcultural nursing course effectively enhanced the cultural competency of graduate nursing students. Such courses are imperative to ensure quality care for the increasing multicultural population in Korea.
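The before-and-after design above boils down to a paired t-test on each participant's pre- and post-course scores. A minimal sketch with scipy, on made-up scores (not the study's data):

```python
# Paired t-test for a single-group pre/post design, as in the abstract above.
# The pre/post scores below are invented for illustration.
from scipy import stats

pre  = [3.1, 3.4, 2.9, 3.6, 3.2, 3.0, 3.5, 3.3]
post = [3.9, 4.5, 3.6, 4.4, 4.2, 3.8, 4.3, 4.0]

# ttest_rel pairs each pre score with the same participant's post score.
t_stat, p_value = stats.ttest_rel(post, pre)
mean_diff = sum(b - a for a, b in zip(pre, post)) / len(pre)
print(f"mean difference={mean_diff:.2f}, t={t_stat:.2f}, P={p_value:.4f}")
```

Because each participant serves as their own control, the test operates on the per-person differences, which is why it is far more sensitive than an unpaired comparison when individual baselines vary.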
ChatGPT (GPT-3.5) has entered higher education and there is a need to determine how to use it effectively. This descriptive study compared the ability of GPT-3.5 and teachers to answer questions from dental students and construct detailed intended learning outcomes. When analyzed according to a Likert scale, we found that GPT-3.5 answered the questions from dental students in a similar or even more elaborate way compared to the answers that had previously been provided by a teacher. GPT-3.5 was also asked to construct detailed intended learning outcomes for a course in microbial pathogenesis, and when these were analyzed according to a Likert scale they were, to a large degree, found irrelevant. Since students are using GPT-3.5, it is important that instructors learn how to make the best use of it both to be able to advise students and to benefit from its potential.
Citations to this article as recorded by
Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review Xiaojun Xu, Yixiao Chen, Jing Miao Journal of Educational Evaluation for Health Professions.2024; 21: 6. CrossRef
Information amount, accuracy, and relevance of generative artificial intelligence platforms’ answers regarding learning objectives of medical arthropodology evaluated in English and Korean queries in December 2023: a descriptive study Hyunju Lee, Soobin Park Journal of Educational Evaluation for Health Professions.2023; 20: 39. CrossRef
Purpose This study presents item analysis results of the 26 health personnel licensing examinations managed by the Korea Health Personnel Licensing Examination Institute (KHPLEI) in 2022.
Methods The item difficulty index, item discrimination index, and reliability were calculated. The item discrimination index was calculated in 2 ways: using a discrimination index based on the upper and lower 27% rule and using the item-total correlation.
Results Out of 468,352 total examinees, 418,887 (89.4%) passed. The pass rates ranged from 27.3% for health educators level 1 to 97.1% for oriental medical doctors. Most examinations had a high average difficulty index, albeit to varying degrees, ranging from 61.3% for prosthetists and orthotists to 83.9% for care workers. The average discrimination index based on the upper and lower 27% rule ranged from 0.17 for oriental medical doctors to 0.38 for radiological technologists. The average item-total correlation ranged from 0.20 for oriental medical doctors to 0.38 for radiological technologists. The Cronbach α, as a measure of reliability, ranged from 0.872 for health educators level 3 to 0.978 for medical technologists. The correlation coefficient between the average difficulty index and average discrimination index was -0.2452 (P=0.1557), that between the average difficulty index and the average item-total correlation was 0.3502 (P=0.0392), and that between the average discrimination index and the average item-total correlation was 0.7944 (P<0.0001).
Conclusion This technical report presents the item analysis results and reliability of the recent examinations by the KHPLEI, demonstrating an acceptable range of difficulty index and discrimination index values, as well as good reliability.
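The four statistics in this report (difficulty index, upper/lower 27% discrimination index, item-total correlation, and Cronbach α) are all simple computations on a 0/1 response matrix. The sketch below works them out on a small synthetic matrix (rows = examinees, columns = items); the numbers are invented, not KHPLEI data:

```python
# Classical item analysis on a synthetic 0/1 response matrix.
import numpy as np

rng = np.random.default_rng(1)
ability = rng.normal(0, 1, 200)                    # 200 simulated examinees
difficulty = np.linspace(-1.5, 1.5, 8)             # 8 items, easy to hard
prob = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random((200, 8)) < prob).astype(int)

totals = responses.sum(axis=1)

# Difficulty index: proportion of examinees answering each item correctly.
p = responses.mean(axis=0)

# Discrimination index via the upper and lower 27% rule: correct-rate gap
# between the top and bottom 27% of examinees by total score.
order = np.argsort(totals)
k = int(round(0.27 * len(totals)))
disc_27 = responses[order[-k:]].mean(axis=0) - responses[order[:k]].mean(axis=0)

# Corrected item-total correlation (item vs. total of the remaining items).
item_total = np.array([
    np.corrcoef(responses[:, j], totals - responses[:, j])[0, 1]
    for j in range(responses.shape[1])
])

# Cronbach's alpha from item variances and total-score variance.
n_items = responses.shape[1]
alpha = (n_items / (n_items - 1)) * (
    1 - responses.var(axis=0, ddof=1).sum() / totals.var(ddof=1))
print(p.round(2), disc_27.round(2), item_total.round(2), round(alpha, 3))
```

Correcting the item-total correlation by removing the item from the total avoids the inflation that occurs when an item is correlated with a sum that already contains it.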
Purpose We aimed to describe the performance and evaluate the educational value of justifications provided by artificial intelligence chatbots, including GPT-3.5, GPT-4, Bard, Claude, and Bing, on the Peruvian National Medical Licensing Examination (P-NLME).
Methods This was a cross-sectional analytical study. On July 25, 2023, each multiple-choice question (MCQ) from the P-NLME was entered into each chatbot (GPT-3.5, GPT-4, Bing, Bard, and Claude) 3 times. Then, 4 medical educators categorized the MCQs in terms of medical area, item type, and whether the MCQ required Peru-specific knowledge. They assessed the educational value of the justifications from the 2 top performers (GPT-4 and Bing).
Results GPT-4 scored 86.7% and Bing scored 82.2%, followed by Bard and Claude, and the historical performance of Peruvian examinees was 55%. Among the factors associated with correct answers, only MCQs that required Peru-specific knowledge had lower odds (odds ratio, 0.23; 95% confidence interval, 0.09–0.61), whereas the remaining factors showed no associations. In assessing the educational value of justifications provided by GPT-4 and Bing, neither showed any significant differences in certainty, usefulness, or potential use in the classroom.
Conclusion Among chatbots, GPT-4 and Bing were the top performers, with Bing performing better at Peru-specific MCQs. Moreover, the educational value of justifications provided by the GPT-4 and Bing could be deemed appropriate. However, it is essential to start addressing the educational value of these chatbots, rather than merely their performance on examinations.
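The reported odds ratio of 0.23 for Peru-specific MCQs can be illustrated with a standard 2×2-table calculation and a Wald 95% confidence interval on the log scale. The counts below are hypothetical, chosen only so the arithmetic is visible; they are not the study's data:

```python
# Odds ratio with a 95% CI from a 2x2 table (hypothetical counts).
# Rows: whether the MCQ needed Peru-specific knowledge; columns: answer outcome.
import math

a, b = 10, 8    # Peru-specific MCQs: correct, incorrect
c, d = 70, 13   # general MCQs:       correct, incorrect

odds_ratio = (a * d) / (b * c)
# Wald CI: the log odds ratio is approximately normal with this SE.
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
print(f"OR={odds_ratio:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```

An odds ratio below 1 with a confidence interval excluding 1, as in the abstract, indicates that Peru-specific questions had significantly lower odds of being answered correctly.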
Citations to this article as recorded by
Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study Masao Noda, Takayoshi Ueno, Ryota Koshu, Yuji Takaso, Mari Dias Shimada, Chizu Saito, Hisashi Sugimoto, Hiroaki Fushiki, Makoto Ito, Akihiro Nomura, Tomokazu Yoshizaki JMIR Medical Education.2024; 10: e57054. CrossRef
Response to Letter to the Editor re: “Artificial Intelligence Versus Expert Plastic Surgeon: Comparative Study Shows ChatGPT ‘Wins' Rhinoplasty Consultations: Should We Be Worried? [1]” by Durairaj et al. Kay Durairaj, Omer Baker Facial Plastic Surgery & Aesthetic Medicine.2024;[Epub] CrossRef
Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review Xiaojun Xu, Yixiao Chen, Jing Miao Journal of Educational Evaluation for Health Professions.2024; 21: 6. CrossRef
Information amount, accuracy, and relevance of generative artificial intelligence platforms’ answers regarding learning objectives of medical arthropodology evaluated in English and Korean queries in December 2023: a descriptive study Hyunju Lee, Soobin Park Journal of Educational Evaluation for Health Professions.2023; 20: 39. CrossRef
Purpose This study aimed to assess the performance of ChatGPT (GPT-3.5 and GPT-4) as a study tool in solving biostatistical problems and to identify any potential drawbacks that might arise from using ChatGPT in medical education, particularly in solving practical biostatistical problems.
Methods ChatGPT was tested to evaluate its ability to solve biostatistical problems from the Handbook of Medical Statistics by Peacock and Peacock in this descriptive study. Tables from the problems were transformed into textual questions. Ten biostatistical problems were randomly chosen and used as text-based input for conversation with ChatGPT (versions 3.5 and 4).
Results GPT-3.5 solved 5 practical problems in the first attempt, related to categorical data, cross-sectional study, measuring reliability, probability properties, and the t-test. GPT-3.5 failed to provide correct answers regarding analysis of variance, the chi-square test, and sample size within 3 attempts. GPT-4 also solved a task related to the confidence interval in the first attempt and solved all questions within 3 attempts, with precise guidance and monitoring.
Conclusion The assessment of both versions of ChatGPT on 10 biostatistical problems revealed that the performance of GPT-3.5 and GPT-4 was below average, with correct response rates of 5 and 6 out of 10, respectively, on the first attempt. GPT-4 succeeded in providing all correct answers within 3 attempts. These findings indicate that this tool, even when providing and calculating different statistical analyses, can be wrong; students should be aware of ChatGPT’s limitations and be careful when incorporating this model into medical education.
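One practical safeguard suggested by these findings is to verify a chatbot's statistical answers with standard software. For example, a chi-square test of independence, one of the problem types GPT-3.5 failed, takes a few lines with scipy (the 2×2 table below is hypothetical, not from the handbook used in the study):

```python
# Verifying a chi-square test of independence on a hypothetical 2x2 table,
# as a student might do to check a chatbot's worked answer.
from scipy.stats import chi2_contingency

table = [[30, 10],
         [20, 40]]
# Returns the test statistic, P-value, degrees of freedom, and expected counts.
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, P={p:.4f}")
```

Comparing the chatbot's reported statistic and P-value against such an independent computation catches the arithmetic and procedural errors documented above.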
Citations to this article as recorded by
Can Generative AI and ChatGPT Outperform Humans on Cognitive-Demanding Problem-Solving Tasks in Science? Xiaoming Zhai, Matthew Nyaaba, Wenchao Ma Science & Education.2024;[Epub] CrossRef
Purpose This study aimed to identify the effects of a 12-week interprofessional simulation program, operated between February 2020 and January 2021, on the patient safety competencies of healthcare professionals in Switzerland.
Methods The simulation training was based on 2 scenarios of hospitalized patients with septic shock and respiratory failure, and trainees were expected to demonstrate patient safety competencies. A single-group before-and-after study of the simulation program was conducted using a measurement tool (the Health Professional Education in Patient Safety Survey) to measure the perceived competencies of physicians, nurses, and nursing assistants. Out of 57 participants, 37 answered the questionnaire surveys 4 times: 48 hours before the training, followed by post-surveys at 24 hours, 6 weeks, and 12 weeks after the training. The linear mixed effect model was applied for the analysis.
Results Four of the 6 perceived patient safety competencies improved at 6 weeks but returned to near pre-training levels at 12 weeks. Improvements in the competencies of “communicating effectively,” “managing safety risks,” “understanding human and environmental factors that influence patient safety,” and “recognizing and responding to remove immediate risks of harm” were statistically significant, both overall and in the comparison between before the training and 6 weeks after the training.
Conclusion Interprofessional simulation programs contributed to developing some areas of patient safety competencies of healthcare professionals, but only for a limited time. Interprofessional simulation programs should be repeated and combined with other forms of support, including case discussions and debriefings, to ensure lasting effects.
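The linear mixed effect model mentioned above handles the 4 repeated surveys per participant with a random intercept for each person. A rough sketch with statsmodels on synthetic data (participant counts and effect sizes are invented, not the study's):

```python
# Linear mixed model with a random intercept per participant across
# 4 survey time points, fit on synthetic data mimicking the design above.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 37
subject = np.repeat(np.arange(n), 4)
time = np.tile([0, 1, 2, 3], n)            # pre, 24 h, 6 wk, 12 wk
subj_effect = np.repeat(rng.normal(0, 0.5, n), 4)
# Simulated pattern: scores rise after training, then fade by 12 weeks.
time_effect = np.array([0.0, 0.6, 0.5, 0.1])[time]
score = 3.5 + time_effect + subj_effect + rng.normal(0, 0.3, n * 4)

df = pd.DataFrame({"subject": subject, "time": time.astype(str), "score": score})
# groups= introduces the per-participant random intercept.
model = smf.mixedlm("score ~ C(time)", df, groups=df["subject"])
result = model.fit()
print(result.params)
```

The fixed-effect coefficients for each time point estimate the change from baseline, while the random intercept absorbs stable differences between participants, matching the single-group repeated-measures design described in the abstract.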
Purpose This study investigated the validity of introducing a clinical skills examination (CSE) to the Korean Oriental Medicine Licensing Examination through a mixed-method modified Delphi study.
Methods A 3-round Delphi study was conducted between September and November 2022. The expert panel comprised 21 oriental medicine education experts who were officially recommended by relevant institutions and organizations. The questionnaires included potential content for the CSE and a detailed implementation strategy. Subcommittees were formed to discuss concerns around the introduction of the CSE, which were collected as open-ended questions. In this study, a 66.7% or greater agreement rate was defined as achieving a consensus.
Results The expert panel’s evaluation of the proposed clinical presentations and basic clinical skills suggested their priorities. Of the 10 items investigated for building a detailed implementation strategy for the introduction of the CSE to the Korean Oriental Medicine Licensing Examination, a consensus was achieved on 9. However, the agreement rate on the timing of the introduction of the CSE was low. Concerns around 4 clinical topics were discussed in the subcommittees, and potential solutions were proposed.
Conclusion This study offers preliminary data and raises some concerns that can be used as a reference while discussing the introduction of the CSE to the Korean Oriental Medicine Licensing Examination.
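The 66.7% consensus rule used in this Delphi study is a simple threshold on the panel's agreement rate. A minimal sketch, with invented vote counts for a 21-member panel (the items and tallies are illustrative, not the study's):

```python
# Consensus check against a two-thirds (66.7%) agreement threshold,
# as in the Delphi procedure described above. Vote counts are invented.
endorsements = {"item_1": 18, "item_2": 15, "item_3": 9}  # of 21 panelists
panel_size = 21
threshold = 2 / 3  # the study's 66.7% agreement rule

consensus = {item: votes / panel_size >= threshold
             for item, votes in endorsements.items()}
print(consensus)
```

Items falling below the threshold (like the timing question in the abstract) are typically carried into the next Delphi round or referred to a subcommittee for discussion.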
Purpose This study aims to suggest the number of test items in each of 8 nursing activity categories of the Korean Nursing Licensing Examination, which comprises 134 activity statements including 275 items. The examination will be able to evaluate the minimum ability that nursing graduates must have to perform their duties.
Methods Two opinion surveys involving the members of 7 academic societies were conducted from March 19 to May 14, 2021. The survey results were reviewed by members of 4 expert associations from May 21 to June 4, 2021. The results for revised numbers of items in each category were compared with those reported by Tak and his colleagues and the National Council Licensure Examination for Registered Nurses of the United States.
Results Based on 2 opinion surveys and previous studies, the suggestions for item allocation to 8 nursing activity categories of the Korean Nursing Licensing Examination in this study are as follows: 50 items for management of care and improvement of professionalism, 33 items for safety and infection control, 40 items for management of potential risk, 28 items for basic care, 47 items for physiological integrity and maintenance, 33 items for pharmacological and parenteral therapies, 24 items for psychosocial integrity and maintenance, and 20 items for health promotion and maintenance. Twenty other items related to health and medical laws were not included due to their mandatory status.
Conclusion These suggestions for the number of test items for each activity category will be helpful in developing new items for the Korean Nursing Licensing Examination.
Learning about one’s implicit bias is crucial for improving one’s cultural competency and thereby reducing health inequity. To evaluate bias among medical students following a previously developed cultural training program targeting New Zealand Māori, we developed a text-based, self-evaluation tool called the Similarity Rating Test (SRT). The development process of the SRT was resource-intensive, limiting its generalizability and applicability. Here, we explored the potential of ChatGPT, an automated chatbot, to assist in the development process of the SRT by comparing ChatGPT’s and students’ evaluations of the SRT. Despite results showing non-significant equivalence and difference between ChatGPT’s and students’ ratings, ChatGPT’s ratings were more consistent than students’ ratings. The consistency rate was higher for non-stereotypical than for stereotypical statements, regardless of rater type. Further studies are warranted to validate ChatGPT’s potential for assisting in SRT development for implementation in medical education and evaluation of ethnic stereotypes and related topics.
Citations to this article as recorded by
Efficacy and limitations of ChatGPT as a biostatistical problem-solving tool in medical education in Serbia: a descriptive study Aleksandra Ignjatović, Lazar Stevanović Journal of Educational Evaluation for Health Professions.2023; 20: 28. CrossRef
Purpose This study aimed to detect relationships between undergraduate students’ attitudes toward communication skills learning and demographic variables (such as age, academic year, and gender). Understanding these relationships could provide information for communication skills facilitators and curriculum planners on structuring course delivery and integrating communication skills training into the medical curriculum.
Methods The descriptive study involved a survey of 369 undergraduate students from 2 medical schools in Zambia who participated in communication skills training stratified by academic year using the Communication Skills Attitude Scale. Data were collected between October and December 2021 and analyzed using IBM SPSS for Windows version 28.0.
Results One-way analysis of variance revealed a significant difference in attitudes among the 5 academic years. There was a significant difference in attitudes between the 2nd and 5th academic years (t=5.95, P<0.001). No significant difference in attitudes existed among the academic years on the negative subscale; on the positive subscale, however, the 2nd year differed significantly from the 3rd (t=3.82, P=0.004), 4th (t=3.61, P=0.011), 5th (t=8.36, P<0.001), and 6th (t=4.20, P=0.001) academic years. Age showed no correlation with attitudes. There was a more favorable attitude toward learning communication skills among the women participants than among the men participants (P=0.006).
Conclusion Despite generally positive attitudes toward learning communication skills, the differences in attitudes between the genders and across academic years suggest that the curriculum and teaching methods should be re-evaluated to structure courses appropriately for each academic year and to provide a learning process that addresses gender differences.
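The one-way ANOVA reported in this abstract compares attitude scores across academic-year groups. A minimal sketch with scipy, on invented scale totals for three of the year groups (not the study's data):

```python
# One-way ANOVA across academic-year groups, as in the analysis above.
# The attitude-scale totals below are invented for illustration.
from scipy import stats

year2 = [52, 55, 50, 53, 54, 51]
year4 = [56, 58, 55, 57, 59, 56]
year5 = [60, 62, 59, 61, 63, 60]

f_stat, p_value = stats.f_oneway(year2, year4, year5)
print(f"F={f_stat:.2f}, P={p_value:.4f}")
```

A significant omnibus F, as here, only says that at least one pair of groups differs; the pairwise t-tests reported in the abstract are the follow-up step that locates which years differ.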
Citations to this article as recorded by
Attitudes toward learning communication skills among Iranian medical students Naser Yousefzadeh Kandevani, Ali Labaf, Azim Mirzazadeh, Pegah Salimi Pormehr BMC Medical Education.2024;[Epub] CrossRef
Purpose This study aimed to assess students’ performance on and perspectives of an objective structured practical examination (OSPE) for the assessment of laboratory and preclinical skills in biomedical laboratory science (BLS). It also aimed to investigate the perception, acceptability, and usefulness of the OSPE from the students’ and examiners’ points of view.
Methods This was a longitudinal study to implement an OSPE in BLS. The student group consisted of 198 BLS students enrolled in semester 4, 2015–2019 at Karolinska University Hospital Huddinge, Sweden. Fourteen teachers evaluated the performance by completing a checklist and global rating scales. A student survey questionnaire was administered to the participants to evaluate the student perspective. To assess quality, 4 independent observers were included to monitor the examiners.
Results Almost 50% of the students passed the initial OSPE, and 73% passed the repeat OSPE. There was a statistically significant difference between the first and the second attempt (P<0.01) but not between the first and the third attempt (P=0.09). The student survey questionnaire was completed by 99 of the 198 students (50%), and only 63 students (32%) responded to the free-text questions. According to these responses, some stations were perceived as more difficult than others, although the students considered the assessment to be valid. The observers found that the assessment protocols and examiners’ instructions assured the objectivity of the examination.
Conclusion The introduction of an OSPE in the education of biomedical laboratory scientists was a reliable and useful examination of practical skills.