JEEHP : Journal of Educational Evaluation for Health Professions

OPEN ACCESS
Search results for "Education": 165 articles
Research articles
Performance of GPT-3.5 and GPT-4 on standardized urology knowledge assessment items in the United States: a descriptive study
Max Samuel Yudovich, Elizaveta Makarova, Christian Michael Hague, Jay Dilip Raman
J Educ Eval Health Prof. 2024;21:17.   Published online July 8, 2024
DOI: https://doi.org/10.3352/jeehp.2024.21.17    [Epub ahead of print]
  • 186 View
  • 51 Download
Abstract PDF
Purpose
This study aimed to evaluate the performance of Chat Generative Pre-Trained Transformer (ChatGPT) with respect to standardized urology multiple-choice items in the United States.
Methods
In total, 700 multiple-choice urology board exam-style items were submitted to GPT-3.5 and GPT-4, and responses were recorded. Items were categorized based on topic and question complexity (recall, interpretation, and problem-solving). The accuracy of GPT-3.5 and GPT-4 was compared across item types in February 2024.
Results
GPT-4 answered 44.4% of items correctly, compared to 30.9% for GPT-3.5 (P<0.0001). GPT-4 (vs. GPT-3.5) had higher accuracy with urologic oncology (43.8% vs. 33.9%, P=0.03), sexual medicine (44.3% vs. 27.8%, P=0.046), and pediatric urology (47.1% vs. 27.1%, P=0.012) items. Endourology (38.0% vs. 25.7%, P=0.15), reconstruction and trauma (29.0% vs. 21.0%, P=0.41), and neurourology (49.0% vs. 33.3%, P=0.11) items did not show significant differences in performance across versions. GPT-4 also outperformed GPT-3.5 on recall (45.9% vs. 27.4%, P<0.00001) and interpretation (45.6% vs. 31.5%, P=0.0005) items, but not on problem-solving items (41.8% vs. 34.5%, P=0.56), the highest-complexity category.
Conclusions
ChatGPT performs relatively poorly on standardized multiple-choice urology board exam-style items, with GPT-4 outperforming GPT-3.5. The accuracy was below the proposed minimum passing standards for the American Board of Urology’s Continuing Urologic Certification knowledge reinforcement activity (60%). As artificial intelligence progresses in complexity, ChatGPT may become more capable and accurate with respect to board examination items. For now, its responses should be scrutinized.
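The accuracy comparisons in this abstract are comparisons of two proportions. As a rough illustration only (not the authors' analysis code), a two-proportion z-test can be sketched in Python, with item counts rounded from the reported rates:

```python
import math

def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> tuple:
    """Two-sided two-proportion z-test: returns (z statistic, p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 44.4% vs. 30.9% correct out of 700 items each (counts rounded from rates)
z, p = two_proportion_z(round(0.444 * 700), 700, round(0.309 * 700), 700)
```

With these inputs the resulting p-value falls well below 0.0001, in line with the overall comparison reported above.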
Events related to medication errors and related factors involving nurses’ behavior to reduce medication errors in Japan: a Bayesian network modeling-based factor analysis and scenario analysis  
Naotaka Sugimura, Katsuhiko Ogasawara
J Educ Eval Health Prof. 2024;21:12.   Published online June 11, 2024
DOI: https://doi.org/10.3352/jeehp.2024.21.12
  • 335 View
  • 121 Download
Abstract PDF Supplementary Material
Purpose
This study aimed to identify the relationships between medication errors and the factors affecting nurses’ knowledge and behavior in Japan using Bayesian network modeling. It also aimed to identify important factors through scenario analysis with consideration of nursing students’ and nurses’ education regarding patient safety and medications.
Methods
We used mixed methods. First, error events related to medications and related factors were qualitatively extracted from 119 actual incident reports filed in 2022 in the database of the Japan Council for Quality Health Care. These events and factors were then quantitatively evaluated in a flow model using a Bayesian network, and a scenario analysis was conducted to estimate the posterior probabilities of events when the prior probabilities of selected factors were set to 0%.
Results
There were 10 types of events related to medication errors. A 5-layer flow model was created using Bayesian network analysis. The scenario analysis revealed that “failure to confirm the 5 rights,” “unfamiliarity with operations of medications,” “insufficient knowledge of medications,” and “assumptions and forgetfulness” were factors significantly associated with the occurrence of medication errors.
Conclusion
This study provided an estimate of the effects of mitigating nurses’ behavioral factors that trigger medication errors. The flow model itself can also be used as an educational tool to reflect on behavior when incidents occur. It is expected that patient safety education will be recognized as a major element of nursing education worldwide and that an integrated curriculum will be developed.
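The scenario analysis described here sets a factor's prior probability to 0% and recomputes the event's posterior. A minimal single-factor sketch of that idea, with purely illustrative probabilities (not values from the study):

```python
def event_probability(p_factor: float,
                      p_event_given_factor: float,
                      p_event_given_not: float) -> float:
    """P(event) by total probability over one binary factor."""
    return (p_event_given_factor * p_factor
            + p_event_given_not * (1.0 - p_factor))

# Illustrative numbers only: baseline vs. a scenario in which the factor
# (e.g., "failure to confirm the 5 rights") is fully mitigated (prior = 0%).
baseline = event_probability(0.30, 0.40, 0.05)
scenario = event_probability(0.00, 0.40, 0.05)
```

A full Bayesian network chains such conditional tables across all 5 layers of the flow model; this sketch shows only the single-factor propagation step.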
Challenges and potential improvements in the Accreditation Standards of the Korean Institute of Medical Education and Evaluation 2019 (ASK2019) derived through meta-evaluation: a cross-sectional study  
Yoonjung Lee, Min-jung Lee, Junmoo Ahn, Chungwon Ha, Ye Ji Kang, Cheol Woong Jung, Dong-Mi Yoo, Jihye Yu, Seung-Hee Lee
J Educ Eval Health Prof. 2024;21:8.   Published online April 2, 2024
DOI: https://doi.org/10.3352/jeehp.2024.21.8
  • 768 View
  • 222 Download
  • 1 Web of Science
  • 1 Crossref
Abstract PDF Supplementary Material
Purpose
This study aimed to identify challenges and potential improvements in Korea's medical education accreditation process according to the Accreditation Standards of the Korean Institute of Medical Education and Evaluation 2019 (ASK2019). Meta-evaluation was conducted to survey the experiences and perceptions of stakeholders, including self-assessment committee members, site visit committee members, administrative staff, and medical school professors.
Methods
A cross-sectional study was conducted using surveys sent to 40 medical schools. The 332 participants included self-assessment committee members, site visit team members, administrative staff, and medical school professors. The t-test, one-way analysis of variance and the chi-square test were used to analyze and compare opinions on medical education accreditation between the categories of participants.
Results
Site visit committee members placed greater importance on the necessity of accreditation than faculty members. Self-evaluation committee members and professors shared a positive view of accreditation’s role in improving educational quality. Administrative staff highly regarded the Korean Institute of Medical Education and Evaluation’s reliability and objectivity, unlike the self-evaluation committee members. Site visit committee members perceived the clarity of accreditation standards positively, differing from self-assessment committee members. Administrative staff were most optimistic about implementing the standards. However, the accreditation process encountered challenges, especially content duplication and the burden of preparing self-evaluation reports. Finally, perceptions of the accuracy of final site visit reports differed significantly between the self-evaluation committee members and the site visit committee members.
Conclusion
This study revealed diverse views on medical education accreditation, highlighting the need for improved communication, expectation alignment, and stakeholder collaboration to refine the accreditation process and quality.

Citations

Citations to this article as recorded by  
  • The new placement of 2,000 entrants at Korean medical schools in 2025: is the government’s policy evidence-based?
    Sun Huh
    The Ewha Medical Journal.2024;[Epub]     CrossRef
Review
Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review  
Xiaojun Xu, Yixiao Chen, Jing Miao
J Educ Eval Health Prof. 2024;21:6.   Published online March 15, 2024
DOI: https://doi.org/10.3352/jeehp.2024.21.6
  • 1,874 View
  • 345 Download
  • 2 Crossref
Abstract PDF Supplementary Material
Background
ChatGPT is a large language model (LLM) based on artificial intelligence (AI) capable of responding in multiple languages and generating nuanced and highly complex responses. While ChatGPT holds promising applications in medical education, its limitations and potential risks cannot be ignored.
Methods
A scoping review was conducted for English articles discussing ChatGPT in the context of medical education published after 2022. A literature search was performed using PubMed/MEDLINE, Embase, and Web of Science databases, and information was extracted from the relevant studies that were ultimately included.
Results
ChatGPT exhibits various potential applications in medical education, such as providing personalized learning plans and materials, creating clinical practice simulation scenarios, and assisting in writing articles. However, challenges associated with academic integrity, data accuracy, and potential harm to learning were also highlighted in the literature. The paper emphasizes certain recommendations for using ChatGPT, including the establishment of guidelines. Based on the review, 3 key research areas were proposed: cultivating the ability of medical students to use ChatGPT correctly, integrating ChatGPT into teaching activities and processes, and proposing standards for the use of AI by medical students.
Conclusion
ChatGPT has the potential to transform medical education, but careful consideration is required for its full integration. To harness the full potential of ChatGPT in medical education, attention should not only be given to the capabilities of AI but also to its impact on students and teachers.

Citations

Citations to this article as recorded by  
  • Chatbots in neurology and neuroscience: Interactions with students, patients and neurologists
    Stefano Sandrone
    Brain Disorders.2024; 15: 100145.     CrossRef
  • ChatGPT in education: unveiling frontiers and future directions through systematic literature review and bibliometric analysis
    Buddhini Amarathunga
    Asian Education and Development Studies.2024;[Epub]     CrossRef
Research articles
Development and validity evidence for the resident-led large group teaching assessment instrument in the United States: a methodological study  
Ariel Shana Frey-Vogel, Kristina Dzara, Kimberly Anne Gifford, Yoon Soo Park, Justin Berk, Allison Heinly, Darcy Wolcott, Daniel Adam Hall, Shannon Elliott Scott-Vernaglia, Katherine Anne Sparger, Erica Ye-pyng Chung
J Educ Eval Health Prof. 2024;21:3.   Published online February 23, 2024
DOI: https://doi.org/10.3352/jeehp.2024.21.3
  • 781 View
  • 184 Download
Abstract PDF Supplementary Material
Purpose
Despite educational mandates to assess resident teaching competence, limited instruments with validity evidence exist for this purpose. Existing instruments do not allow faculty to assess resident-led teaching in a large group format or whether teaching was interactive. This study gathers validity evidence on the use of the Resident-led Large Group Teaching Assessment Instrument (Relate), an instrument used by faculty to assess resident teaching competency. Relate comprises 23 behaviors divided into 6 elements: learning environment, goals and objectives, content of talk, promotion of understanding and retention, session management, and closure.
Methods
Messick’s unified validity framework was used for this study. Investigators used video recordings of resident-led teaching from 3 pediatric residency programs to develop Relate and a rater guidebook. Faculty were trained on instrument use through frame-of-reference training. Resident teaching at all sites was video-recorded during 2018–2019. Two trained faculty raters assessed each video. Descriptive statistics on performance were obtained. Validity evidence sources include: rater training effect (response process), reliability and variability (internal structure), and impact on Milestones assessment (relations to other variables).
Results
Forty-eight videos, from 16 residents, were analyzed. Rater training improved inter-rater reliability from 0.04 to 0.64. The Φ-coefficient reliability was 0.50. There was a significant correlation between overall Relate performance and the pediatric teaching Milestone (r=0.34, P=0.019).
Conclusion
Relate provides validity evidence with sufficient reliability to measure resident-led large-group teaching competence.
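The abstract reports inter-rater reliability improving from 0.04 to 0.64 after frame-of-reference training, without naming the statistic used; Cohen's kappa is one common choice for two raters. A minimal sketch with made-up ratings (not the study's data):

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: chance-corrected agreement for two raters
    scoring the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    # Expected chance agreement from each rater's marginal distribution
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical behavior ratings from two raters on 8 teaching videos
a = [1, 1, 2, 2, 3, 3, 1, 2]
b = [1, 1, 2, 3, 3, 3, 1, 1]
kappa = cohens_kappa(a, b)
```

Kappa near 0 indicates chance-level agreement; values above roughly 0.6 are conventionally read as substantial, which is why the jump from 0.04 to 0.64 matters.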
Negative effects on medical students’ scores for clinical performance during the COVID-19 pandemic in Taiwan: a comparative study  
Eunice Jia-Shiow Yuan, Shiau-Shian Huang, Chia-An Hsu, Jiing-Feng Lirng, Tzu-Hao Li, Chia-Chang Huang, Ying-Ying Yang, Chung-Pin Li, Chen-Huan Chen
J Educ Eval Health Prof. 2023;20:37.   Published online December 26, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.37
  • 1,286 View
  • 84 Download
  • 1 Web of Science
  • 1 Crossref
Abstract PDF Supplementary Material
Purpose
Coronavirus disease 2019 (COVID-19) has heavily impacted medical clinical education in Taiwan. Medical curricula have been altered to minimize exposure and limit transmission. This study investigated the effect of COVID-19 on Taiwanese medical students’ clinical performance using online standardized evaluation systems and explored the factors influencing medical education during the pandemic.
Methods
Medical students were scored from 0 to 100 based on their clinical performance from 1/1/2018 to 6/30/2021. The students were placed into pre-COVID-19 (before 2/1/2020) and midst-COVID-19 (on and after 2/1/2020) groups. Each group was further categorized into COVID-19-affected specialties (pulmonary, infectious, and emergency medicine) and other specialties. Generalized estimating equations (GEEs) were used to compare and examine the effects of relevant variables on student performance.
Results
In total, 16,944 clinical scores were obtained for COVID-19-affected specialties and other specialties. For the COVID-19-affected specialties, the midst-COVID-19 score (88.51±3.52) was significantly lower than the pre-COVID-19 score (90.14±3.55) (P<0.0001). For the other specialties, the midst-COVID-19 score (88.32±3.68) was also significantly lower than the pre-COVID-19 score (90.06±3.58) (P<0.0001). There were 1,322 students (837 males and 485 females). Male students had significantly lower scores than female students (89.33±3.68 vs. 89.99±3.66, P=0.0017). GEE analysis revealed that the COVID-19 pandemic (unstandardized beta coefficient [B]=-1.99, standard error [SE]=0.13, P<0.0001), COVID-19-affected specialties (B=0.26, SE=0.11, P=0.0184), female students (B=1.10, SE=0.20, P<0.0001), and female attending physicians (B=-0.19, SE=0.08, P=0.0145) were independently associated with students’ scores.
Conclusion
COVID-19 negatively impacted medical students' clinical performance, regardless of their specialty. Female students outperformed male students, irrespective of the pandemic.

Citations

Citations to this article as recorded by  
  • The emergence of generative artificial intelligence platforms in 2023, journal metrics, appreciation to reviewers and volunteers, and obituary
    Sun Huh
    Journal of Educational Evaluation for Health Professions.2024; 21: 9.     CrossRef
Use of learner-driven, formative, ad-hoc, prospective assessment of competence in physical therapist clinical education in the United States: a prospective cohort study  
Carey Holleran, Jeffrey Konrad, Barbara Norton, Tamara Burlis, Steven Ambler
J Educ Eval Health Prof. 2023;20:36.   Published online December 8, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.36
  • 1,155 View
  • 127 Download
Abstract PDF Supplementary Material
Purpose
The purpose of this project was to implement a process for learner-driven, formative, prospective, ad-hoc, entrustment assessment in Doctor of Physical Therapy clinical education. Our goals were to develop an innovative entrustment assessment tool, and then explore whether the tool detected (1) differences between learners at different stages of development and (2) differences within learners across the course of a clinical education experience. We also investigated whether there was a relationship between the number of assessments and change in performance.
Methods
A prospective, observational, cohort of clinical instructors (CIs) was recruited to perform learner-driven, formative, ad-hoc, prospective, entrustment assessments. Two entrustable professional activities (EPAs) were used: (1) gather a history and perform an examination and (2) implement and modify the plan of care, as needed. CIs provided a rating on the entrustment scale and provided narrative support for their rating.
Results
Forty-nine learners participated across 4 clinical experiences (CEs), resulting in 453 EPA learner-driven assessments. For both EPAs, statistically significant changes were detected both between learners at different stages of development and within learners across the course of a CE. Improvement within each CE was significantly related to the number of feedback opportunities.
Conclusion
The results of this pilot study provide preliminary support for the use of learner-driven, formative, ad-hoc assessments of competence based on EPAs with a novel entrustment scale. The number of formative assessments requested correlated with change on the EPA scale, suggesting that formative feedback may augment performance improvement.
Effect of a transcultural nursing course on improving the cultural competency of nursing graduate students in Korea: a before-and-after study  
Kyung Eui Bae, Geum Hee Jeong
J Educ Eval Health Prof. 2023;20:35.   Published online December 4, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.35
  • 1,367 View
  • 160 Download
Abstract PDF Supplementary Material
Purpose
This study aimed to evaluate the impact of a transcultural nursing course on enhancing the cultural competency of graduate nursing students in Korea. We hypothesized that participants’ cultural competency would significantly improve in areas such as communication, biocultural ecology and family, dietary habits, death rituals, spirituality, equity, and empowerment and intermediation after completing the course. Furthermore, we assessed the participants’ overall satisfaction with the course.
Methods
A before-and-after study was conducted with graduate nursing students at Hallym University, Chuncheon, Korea, from March to June 2023. A transcultural nursing course was developed based on Giger & Haddad’s transcultural nursing model and Purnell’s theoretical model of cultural competence. Data were collected using a cultural competence scale for registered nurses developed by Kim and colleagues. A total of 18 students participated, and the paired t-test was employed to compare pre- and post-intervention scores.
Results
The study revealed significant improvements in all 7 categories of cultural nursing competence (P<0.01). Specifically, the mean differences in scores (pre–post) ranged from 0.74 to 1.09 across the categories. Additionally, participants expressed high satisfaction with the course, with an average score of 4.72 out of a maximum of 5.0.
Conclusion
The transcultural nursing course effectively enhanced the cultural competency of graduate nursing students. Such courses are imperative to ensure quality care for the increasing multicultural population in Korea.
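A paired t-test such as the one used here compares each participant's pre- and post-course scores on the same scale. A minimal sketch with illustrative data (not the study's):

```python
import math

def paired_t(pre: list, post: list) -> tuple:
    """Paired t statistic and degrees of freedom for before/after scores."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    return t, n - 1

# Hypothetical pre/post cultural-competency scores for 6 students
pre = [3.1, 3.4, 2.8, 3.0, 3.6, 3.2]
post = [4.0, 4.1, 3.5, 3.9, 4.2, 4.0]
t_stat, df = paired_t(pre, post)
```

Because each student serves as their own control, the test works on the within-person differences, which is why it suits a single-group before-and-after design like this one.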
Brief report
ChatGPT (GPT-3.5) as an assistant tool in microbial pathogenesis studies in Sweden: a cross-sectional comparative study  
Catharina Hultgren, Annica Lindkvist, Volkan Özenci, Sophie Curbo
J Educ Eval Health Prof. 2023;20:32.   Published online November 22, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.32
  • 1,246 View
  • 110 Download
  • 2 Web of Science
  • 2 Crossref
Abstract PDF Supplementary Material
ChatGPT (GPT-3.5) has entered higher education and there is a need to determine how to use it effectively. This descriptive study compared the ability of GPT-3.5 and teachers to answer questions from dental students and construct detailed intended learning outcomes. When analyzed according to a Likert scale, we found that GPT-3.5 answered the questions from dental students in a similar or even more elaborate way compared to the answers that had previously been provided by a teacher. GPT-3.5 was also asked to construct detailed intended learning outcomes for a course in microbial pathogenesis, and when these were analyzed according to a Likert scale they were, to a large degree, found irrelevant. Since students are using GPT-3.5, it is important that instructors learn how to make the best use of it both to be able to advise students and to benefit from its potential.

Citations

Citations to this article as recorded by  
  • Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review
    Xiaojun Xu, Yixiao Chen, Jing Miao
    Journal of Educational Evaluation for Health Professions.2024; 21: 6.     CrossRef
  • Information amount, accuracy, and relevance of generative artificial intelligence platforms’ answers regarding learning objectives of medical arthropodology evaluated in English and Korean queries in December 2023: a descriptive study
    Hyunju Lee, Soobin Park
    Journal of Educational Evaluation for Health Professions.2023; 20: 39.     CrossRef
Technical report
Item difficulty index, discrimination index, and reliability of the 26 health professions licensing examinations in 2022, Korea: a psychometric study
Yoon Hee Kim, Bo Hyun Kim, Joonki Kim, Bokyoung Jung, Sangyoung Bae
J Educ Eval Health Prof. 2023;20:31.   Published online November 22, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.31
  • 943 View
  • 81 Download
Abstract PDF Supplementary Material
Purpose
This study presents item analysis results of the 26 health personnel licensing examinations managed by the Korea Health Personnel Licensing Examination Institute (KHPLEI) in 2022.
Methods
The item difficulty index, item discrimination index, and reliability were calculated. The item discrimination index was calculated using a discrimination index based on the upper and lower 27% rule and the item-total correlation.
Results
Out of 468,352 total examinees, 418,887 (89.4%) passed. The pass rates ranged from 27.3% for health educators level 1 to 97.1% for oriental medical doctors. Most examinations had a high average difficulty index, albeit to varying degrees, ranging from 61.3% for prosthetists and orthotists to 83.9% for care workers. The average discrimination index based on the upper and lower 27% rule ranged from 0.17 for oriental medical doctors to 0.38 for radiological technologists. The average item-total correlation ranged from 0.20 for oriental medical doctors to 0.38 for radiological technologists. The Cronbach α, as a measure of reliability, ranged from 0.872 for health educators-level 3 to 0.978 for medical technologists. The correlation coefficient between the average difficulty index and average discrimination index was -0.2452 (P=0.1557), that between the average difficulty index and the average item-total correlation was 0.3502 (P=0.0392), and that between the average discrimination index and the average item-total correlation was 0.7944 (P<0.0001).
Conclusion
This technical report presents the item analysis results and reliability of the recent examinations by the KHPLEI, demonstrating an acceptable range of difficulty index and discrimination index values, as well as good reliability.
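The difficulty index, upper/lower 27% discrimination index, and Cronbach's α reported here can all be computed from a 0/1-scored response matrix. A minimal pure-Python sketch with a toy dataset (not KHPLEI data):

```python
def item_analysis(responses: list) -> dict:
    """Classical item analysis for a 0/1-scored response matrix
    (rows = examinees, columns = items)."""
    n = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]

    # Difficulty index: proportion of examinees answering each item correctly
    difficulty = [sum(row[j] for row in responses) / n for j in range(n_items)]

    # Discrimination index by the upper/lower 27% rule
    order = sorted(range(n), key=lambda i: totals[i])
    k = max(1, round(0.27 * n))
    lower, upper = order[:k], order[-k:]
    discrimination = [
        (sum(responses[i][j] for i in upper)
         - sum(responses[i][j] for i in lower)) / k
        for j in range(n_items)
    ]

    # Cronbach's alpha from item variances and total-score variance
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = [var([row[j] for row in responses]) for j in range(n_items)]
    alpha = (n_items / (n_items - 1)) * (1 - sum(item_vars) / var(totals))

    return {"difficulty": difficulty,
            "discrimination": discrimination,
            "alpha": alpha}

# Toy data: 5 examinees, 4 items
responses = [[1, 1, 1, 0],
             [1, 1, 0, 0],
             [1, 0, 1, 0],
             [1, 1, 1, 1],
             [1, 0, 0, 0]]
stats = item_analysis(responses)
```

On real examinations the report above also uses the item-total correlation as a second discrimination measure; that would be a Pearson correlation between each item column and the total score, omitted here for brevity.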
Research articles
Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study  
Betzy Clariza Torres-Zegarra, Wagner Rios-Garcia, Alvaro Micael Ñaña-Cordova, Karen Fatima Arteaga-Cisneros, Xiomara Cristina Benavente Chalco, Marina Atena Bustamante Ordoñez, Carlos Jesus Gutierrez Rios, Carlos Alberto Ramos Godoy, Kristell Luisa Teresa Panta Quezada, Jesus Daniel Gutierrez-Arratia, Javier Alejandro Flores-Cohaila
J Educ Eval Health Prof. 2023;20:30.   Published online November 20, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.30
  • 1,814 View
  • 184 Download
  • 6 Web of Science
  • 6 Crossref
Abstract PDF Supplementary Material
Purpose
We aimed to describe the performance and evaluate the educational value of justifications provided by artificial intelligence chatbots, including GPT-3.5, GPT-4, Bard, Claude, and Bing, on the Peruvian National Medical Licensing Examination (P-NLME).
Methods
This was a cross-sectional analytical study. On July 25, 2023, each multiple-choice question (MCQ) from the P-NLME was entered into each chatbot (GPT-3.5, GPT-4, Bing, Bard, and Claude) 3 times. Then, 4 medical educators categorized the MCQs in terms of medical area, item type, and whether the MCQ required Peru-specific knowledge. They assessed the educational value of the justifications from the 2 top performers (GPT-4 and Bing).
Results
GPT-4 scored 86.7% and Bing scored 82.2%, followed by Bard and Claude, and the historical performance of Peruvian examinees was 55%. Among the factors associated with correct answers, only MCQs that required Peru-specific knowledge had lower odds (odds ratio, 0.23; 95% confidence interval, 0.09–0.61), whereas the remaining factors showed no associations. In assessing the educational value of justifications provided by GPT-4 and Bing, neither showed any significant differences in certainty, usefulness, or potential use in the classroom.
Conclusion
Among chatbots, GPT-4 and Bing were the top performers, with Bing performing better on Peru-specific MCQs. Moreover, the educational value of the justifications provided by GPT-4 and Bing could be deemed appropriate. However, it is essential to start addressing the educational value of these chatbots, rather than merely their performance on examinations.
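The association reported above (odds ratio, 0.23; 95% confidence interval, 0.09–0.61) is an odds ratio with a Wald-type interval. A minimal sketch of that calculation from a 2×2 table, with illustrative counts (not the study's data):

```python
import math

def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96) -> tuple:
    """Odds ratio with a Wald-type 95% CI from a 2x2 table [[a, b], [c, d]]
    (e.g., rows = Peru-specific MCQ yes/no, cols = correct/incorrect)."""
    or_ = (a * d) / (b * c)
    # Standard error of log(OR) from the cell counts
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se_log)
    upper = math.exp(math.log(or_) + z * se_log)
    return or_, lower, upper

# Illustrative counts only
or_, lower, upper = odds_ratio_ci(10, 20, 30, 15)
```

An interval lying entirely below 1, as in the study, indicates that Peru-specific items had significantly lower odds of being answered correctly.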

Citations

Citations to this article as recorded by  
  • Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study
    Masao Noda, Takayoshi Ueno, Ryota Koshu, Yuji Takaso, Mari Dias Shimada, Chizu Saito, Hisashi Sugimoto, Hiroaki Fushiki, Makoto Ito, Akihiro Nomura, Tomokazu Yoshizaki
    JMIR Medical Education.2024; 10: e57054.     CrossRef
  • Response to Letter to the Editor re: “Artificial Intelligence Versus Expert Plastic Surgeon: Comparative Study Shows ChatGPT ‘Wins' Rhinoplasty Consultations: Should We Be Worried? [1]” by Durairaj et al
    Kay Durairaj, Omer Baker
    Facial Plastic Surgery & Aesthetic Medicine.2024; 26(3): 276.     CrossRef
  • Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review
    Xiaojun Xu, Yixiao Chen, Jing Miao
    Journal of Educational Evaluation for Health Professions.2024; 21: 6.     CrossRef
  • Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: A Systematic Review and Meta-Analysis (Preprint)
    Mingxin Liu, Tsuyoshi Okuhara, XinYi Chang, Ritsuko Shirabe, Yuriko Nishiie, Hiroko Okada, Takahiro Kiuchi
    Journal of Medical Internet Research.2024;[Epub]     CrossRef
  • Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: a cross-sectional study
    Giacomo Rossettini, Lia Rodeghiero, Federica Corradi, Chad Cook, Paolo Pillastrini, Andrea Turolla, Greta Castellini, Stefania Chiappinotto, Silvia Gianola, Alvisa Palese
    BMC Medical Education.2024;[Epub]     CrossRef
  • Information amount, accuracy, and relevance of generative artificial intelligence platforms’ answers regarding learning objectives of medical arthropodology evaluated in English and Korean queries in December 2023: a descriptive study
    Hyunju Lee, Soobin Park
    Journal of Educational Evaluation for Health Professions.2023; 20: 39.     CrossRef
Efficacy and limitations of ChatGPT as a biostatistical problem-solving tool in medical education in Serbia: a descriptive study  
Aleksandra Ignjatović, Lazar Stevanović
J Educ Eval Health Prof. 2023;20:28.   Published online October 16, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.28
  • 2,558 View
  • 180 Download
  • 4 Web of Science
  • 7 Crossref
Abstract PDF Supplementary Material
Purpose
This study aimed to assess the performance of ChatGPT (GPT-3.5 and GPT-4) as a study tool in solving biostatistical problems and to identify any potential drawbacks that might arise from using ChatGPT in medical education, particularly in solving practical biostatistical problems.
Methods
ChatGPT was tested to evaluate its ability to solve biostatistical problems from the Handbook of Medical Statistics by Peacock and Peacock in this descriptive study. Tables from the problems were transformed into textual questions. Ten biostatistical problems were randomly chosen and used as text-based input for conversation with ChatGPT (versions 3.5 and 4).
Results
GPT-3.5 solved 5 practical problems in the first attempt, related to categorical data, cross-sectional study, measuring reliability, probability properties, and the t-test. GPT-3.5 failed to provide correct answers regarding analysis of variance, the chi-square test, and sample size within 3 attempts. GPT-4 also solved a task related to the confidence interval in the first attempt and solved all questions within 3 attempts, with precise guidance and monitoring.
Conclusion
The assessment of both ChatGPT versions on 10 biostatistical problems revealed below-average performance, with first-attempt correct response rates of 5 of 10 for GPT-3.5 and 6 of 10 for GPT-4. GPT-4 succeeded in providing all correct answers within 3 attempts. These findings indicate that this tool, even when providing and calculating different statistical analyses, can be wrong; students should be aware of ChatGPT’s limitations and be careful when incorporating this model into medical education.

Citations

Citations to this article as recorded by  
  • Can Generative AI and ChatGPT Outperform Humans on Cognitive-Demanding Problem-Solving Tasks in Science?
    Xiaoming Zhai, Matthew Nyaaba, Wenchao Ma
    Science & Education.2024;[Epub]     CrossRef
  • Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review
    Xiaojun Xu, Yixiao Chen, Jing Miao
    Journal of Educational Evaluation for Health Professions.2024; 21: 6.     CrossRef
  • Comparing the Performance of ChatGPT-4 and Medical Students on MCQs at Varied Levels of Bloom’s Taxonomy
    Ambadasu Bharatha, Nkemcho Ojeh, Ahbab Mohammad Fazle Rabbi, Michael Campbell, Kandamaran Krishnamurthy, Rhaheem Layne-Yarde, Alok Kumar, Dale Springer, Kenneth Connell, Md Anwarul Majumder
    Advances in Medical Education and Practice.2024; Volume 15: 393.     CrossRef
  • Revolutionizing Cardiology With Words: Unveiling the Impact of Large Language Models in Medical Science Writing
    Abhijit Bhattaru, Naveena Yanamala, Partho P. Sengupta
    Canadian Journal of Cardiology.2024;[Epub]     CrossRef
  • ChatGPT in medicine: prospects and challenges: a review article
    Songtao Tan, Xin Xin, Di Wu
    International Journal of Surgery.2024; 110(6): 3701.     CrossRef
  • In-depth analysis of ChatGPT’s performance based on specific signaling words and phrases in the question stem of 2377 USMLE step 1 style questions
    Leonard Knoedler, Samuel Knoedler, Cosima C. Hoch, Lukas Prantl, Konstantin Frank, Laura Soiderer, Sebastian Cotofana, Amir H. Dorafshar, Thilo Schenck, Felix Vollbach, Giuseppe Sofo, Michael Alfertshofer
    Scientific Reports.2024;[Epub]     CrossRef
  • Evaluating the quality of responses generated by ChatGPT
    Danimir Mandić, Gordana Miščević, Ljiljana Bujišić
    Metodicka praksa.2024; 27(1): 5.     CrossRef
Effect of an interprofessional simulation program on patient safety competencies of healthcare professionals in Switzerland: a before and after study  
Sylvain Boloré, Thomas Fassier, Nicolas Guirimand
J Educ Eval Health Prof. 2023;20:25.   Published online August 28, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.25
  • 1,741 View
  • 156 Download
AbstractAbstract PDFSupplementary Material
Purpose
This study aimed to identify the effects of a 12-week interprofessional simulation program, operated between February 2020 and January 2021, on the patient safety competencies of healthcare professionals in Switzerland.
Methods
The simulation training was based on 2 scenarios of hospitalized patients with septic shock and respiratory failure, and trainees were expected to demonstrate patient safety competencies. A single-group before-and-after study of the intervention (the simulation program) was conducted using a measurement tool, the Health Professional Education in Patient Safety Survey, to assess the perceived competencies of physicians, nurses, and nursing assistants. Of the 57 participants, 37 completed the questionnaire 4 times: 48 hours before the training and 24 hours, 6 weeks, and 12 weeks after it. A linear mixed-effects model was applied for the analysis.
Results
Four of the 6 perceived patient safety competencies improved at 6 weeks but returned to near pre-training levels at 12 weeks. Improvements in the competencies of “communicating effectively,” “managing safety risks,” “understanding human and environmental factors that influence patient safety,” and “recognize and respond to remove immediate risks of harm” were statistically significant, both overall and in the comparison between before the training and 6 weeks after it.
Conclusion
Interprofessional simulation programs contributed to developing some areas of patient safety competencies of healthcare professionals, but only for a limited time. Interprofessional simulation programs should be repeated and combined with other forms of support, including case discussions and debriefings, to ensure lasting effects.
Implementation strategy for introducing a clinical skills examination to the Korean Oriental Medicine Licensing Examination: a mixed-method modified Delphi study  
Chan-Young Kwon, Sanghoon Lee, Min Hwangbo, Chungsik Cho, Sangwoo Shin, Dong-Hyeon Kim, Aram Jeong, Hye-Yoon Lee
J Educ Eval Health Prof. 2023;20:23.   Published online July 17, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.23
  • 1,927 View
  • 145 Download
AbstractAbstract PDFSupplementary Material
Purpose
This study investigated the validity of introducing a clinical skills examination (CSE) to the Korean Oriental Medicine Licensing Examination through a mixed-method modified Delphi study.
Methods
A 3-round Delphi study was conducted between September and November 2022. The expert panel comprised 21 oriental medicine education experts who were officially recommended by relevant institutions and organizations. The questionnaires included potential content for the CSE and a detailed implementation strategy. Subcommittees were formed to discuss concerns around the introduction of the CSE, which were collected as open-ended questions. In this study, a 66.7% or greater agreement rate was defined as achieving a consensus.
Results
The expert panel’s evaluation of the proposed clinical presentations and basic clinical skills yielded priority rankings for them. Of the 10 items investigated to build a detailed implementation strategy for introducing the CSE to the Korean Oriental Medicine Licensing Examination, a consensus was achieved on 9. However, the agreement rate on the timing of the introduction of the CSE was low. Concerns about 4 clinical topics were discussed in the subcommittees, and potential solutions were proposed.
Conclusion
This study offers preliminary data, and raises several concerns, that can serve as a reference in discussions of the introduction of the CSE to the Korean Oriental Medicine Licensing Examination.
Suggestion for item allocation to 8 nursing activity categories of the Korean Nursing Licensing Examination: a survey-based descriptive study  
Kyunghee Kim, So Young Kang, Younhee Kang, Youngran Kweon, Hyunjung Kim, Youngshin Song, Juyeon Cho, Mi-Young Choi, Hyun Su Lee
J Educ Eval Health Prof. 2023;20:18.   Published online June 12, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.18
  • 1,784 View
  • 114 Download
AbstractAbstract PDFSupplementary Material
Purpose
This study aimed to suggest the number of test items in each of 8 nursing activity categories of the Korean Nursing Licensing Examination, which comprises 134 activity statements covering 275 items. The examination should evaluate the minimum ability that nursing graduates must have to perform their duties.
Methods
Two opinion surveys involving the members of 7 academic societies were conducted from March 19 to May 14, 2021. The survey results were reviewed by members of 4 expert associations from May 21 to June 4, 2021. The revised numbers of items in each category were compared with those reported by Tak et al. and with the National Council Licensure Examination for Registered Nurses of the United States.
Results
Based on the 2 opinion surveys and previous studies, the suggested item allocations for the 8 nursing activity categories of the Korean Nursing Licensing Examination are as follows: 50 items for management of care and improvement of professionalism, 33 items for safety and infection control, 40 items for management of potential risk, 28 items for basic care, 47 items for physiological integrity and maintenance, 33 items for pharmacological and parenteral therapies, 24 items for psychosocial integrity and maintenance, and 20 items for health promotion and maintenance. Twenty other items related to health and medical laws were not included due to their mandatory status.
Conclusion
These suggestions for the number of test items in each activity category will be helpful in developing new items for the Korean Nursing Licensing Examination.