Abstract
Purpose
- Manual grading is time-consuming and prone to inconsistencies, prompting the exploration of generative artificial intelligence tools such as GPT-4 to enhance efficiency and reliability. This study investigated GPT-4’s potential in grading pharmacy students’ exam responses, focusing on the impact of optimized prompts. Specifically, it evaluated the alignment between GPT-4 and human raters, assessed GPT-4’s consistency over time, and determined its error rates in grading pharmacy students’ exam responses.
Methods
- We conducted a comparative study using past exam responses graded by university-trained raters and by GPT-4. Responses were randomized before evaluation by GPT-4, accessed via a Plus account between April and September 2024. Prompt optimization was performed on 16 responses, followed by evaluation of 3 prompt delivery methods. We then applied the optimized approach across 4 item types. Intraclass correlation coefficients and error analyses were used to assess consistency and agreement between GPT-4 and human ratings.
Results
- GPT-4’s ratings aligned reasonably well with human raters, demonstrating moderate to excellent reliability (intraclass correlation coefficient=0.617–0.933), depending on item type and the optimized prompt. When stratified by grade bands, GPT-4 was less consistent in marking high-scoring responses (Z=–5.71 to 4.62, P<0.001). Overall, despite achieving substantial alignment with human raters in many cases, discrepancies across item types and a tendency to commit basic errors necessitate continued educator involvement to ensure grading accuracy.
Conclusion
- With optimized prompts, GPT-4 shows promise as a supportive tool for grading pharmacy students’ exam responses, particularly for objective tasks. However, its limitations—including errors and variability in grading high-scoring responses—require ongoing human oversight. Future research should explore advanced generative artificial intelligence models and broader assessment formats to further enhance grading reliability.
Keywords: Artificial intelligence; Education; Reproducibility of results; Pharmacy; Universities
Introduction
- Background
- Grading is a labor-intensive process that requires significant time and effort from educators. This workload limits the time available for teaching and other academic responsibilities, which may negatively affect the quality of instruction and student engagement [1]. Additionally, ensuring fairness in grading adds complexity, as perceptions of fairness can vary depending on assessment type, gender, and instructional approach [2].
- Evidence indicates that traditional human grading often lacks consistency. For example, the UK’s National Assessment Agency reported variability in marking reliability across subjects and formats [3]. Similarly, studies have shown that biases and differences in mark distributions can occur in human grading, affecting perceptions of both fairness and reliability [4,5]. Recognition of these issues has led educators to seek improvements in assessment practices to foster greater trust and equity.
- To enhance grading reliability, moderation and double marking have been introduced. However, these methods are time-consuming and lack strong evidence for their effectiveness [4]. The limitations of traditional systems highlight the need for innovative approaches. Generative artificial intelligence (GenAI) models are systems capable of creating novel content in text and other formats [6]. GenAI tools such as OpenAI ChatGPT, Google Gemini, and Microsoft Copilot offer potential for improving grading efficiency and consistency. Their performance has been assessed across various disciplines, revealing both strengths and limitations.
- In information technology, GenAI tools have been used to grade structured query language tasks but show reduced reliability in nuanced or complex scenarios [7]. Similarly, Li et al. [8] reported variability in grading coding assignments, indicating the need for further refinement. Conversely, Jukiewicz [9] found strong alignment between GenAI-generated scores and human evaluations in coding assessments.
- For text-based items in language, education, and healthcare, GenAI tools have also been studied. Yavuz et al. [10] examined their use in grading English essays, finding consistent scores but a limited ability to interpret cultural nuances and complex meanings. In medical education, GenAI has shown potential for grading short-answer items, though it remains an imperfect substitute for human raters [11]. In public health education, ChatGPT demonstrated consistent performance in marking reflective essays [8]. However, Awidi [12] observed that ChatGPT applied stricter evaluative criteria than human raters, raising concerns about misalignment with human grading standards. Similarly, Mizumoto and Eguchi [13] noted that ChatGPT’s grading did not fully align with human scores, underscoring the need for continued human oversight to ensure fairness and accuracy.
- Grading pharmacy students’ exam responses poses unique challenges due to their highly specialized nature and the need for context-specific validation. While some studies have explored different prompt types, few have systematically optimized prompts—a critical limitation, as GenAI performance is closely tied to prompt quality [7,10].
- Objectives
- This study investigated the effectiveness of GPT-4 in grading pharmacy students’ exam responses when augmented with optimized prompts. Optimized prompts were created for each item type through systematic modification until GPT-4’s responses sufficiently aligned with the grading criteria. Specifically, the study evaluated the alignment between GPT-4 and human raters, assessed GPT-4’s consistency over time and areas of divergence, and determined its evaluation error rates to establish its reliability as a grading tool.
Methods
- Ethics statement
- This study was approved by the Monash University Human Research Ethics Committee (Project #43846). Potential identifiers in the collected data were carefully screened for and removed to ensure complete anonymity and confidentiality. Additionally, any option on the ChatGPT platform permitting data use for model training was disabled.
- Study design
- We conducted a comparative study using selected past exam items and students’ text-based answers. Responses had been graded and cross-checked by a team of qualified graders following the university’s standard practices. All responses were randomized and anonymized before evaluation by GPT-4. GPT-4 was accessed via a Plus account on the ChatGPT website between April and September 2024.
- Setting
- This study was conducted within the Bachelor of Pharmacy program at Monash University Malaysia, a 4-year program covering pharmaceutical sciences, basic sciences, clinical and therapeutic sciences, and pharmacy practice. The study focused on items from a drug delivery systems unit and a pharmacy practice unit. The overall approach of the study is summarized in Fig. 1.
- Phase 1: Prompt optimization
- This phase aimed to develop prompts enabling GPT-4 to produce rational evaluations closely matching those of experienced graders. We tested prompts using 16 responses—14 randomly selected student responses and 2 zero-scoring control responses. Each prompt began with a persona description for GPT-4, followed by the exam item and marking criteria. Prompt optimization was performed iteratively, addressing issues identified in earlier versions to maximize intraclass correlation coefficient (ICC) values and minimize GPT-4 scoring errors. A general example is provided in Supplement 1.
- The prompting sequence involved confirming GPT-4’s understanding of instructions before sending responses for evaluation, one at a time, until all 16 were completed.
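- As a concrete illustration (not the study’s verbatim prompt), the minimal Python sketch below shows how a persona, exam item, and marking criteria of the kind described above could be assembled into a single grading prompt. In this study, prompts were written and delivered manually through the ChatGPT web interface; the function name, persona wording, and rubric text here are hypothetical.

```python
# A minimal sketch (not the study's verbatim prompt) of a grading prompt that
# combines a persona, the exam item, and the marking criteria. Prompts in the
# study were delivered manually via the ChatGPT web interface; the function
# name, persona wording, and rubric below are hypothetical.

def build_grading_prompt(exam_item: str, marking_criteria: str) -> str:
    """Assemble the persona, exam item, and marking criteria into one prompt."""
    persona = (
        "You are an experienced pharmacy educator grading exam responses. "
        "Apply only the marking criteria provided; do not award marks beyond them."
    )
    return (
        f"{persona}\n\n"
        f"Exam item:\n{exam_item}\n\n"
        f"Marking criteria:\n{marking_criteria}\n\n"
        "Confirm that you understand these instructions. Student responses will "
        "then be sent one at a time; for each, state the marks awarded per "
        "criterion, the total score, and a brief rationale."
    )


if __name__ == "__main__":
    print(build_grading_prompt(
        exam_item="Compare the attributes of two polymer-based drug delivery systems. (10 marks)",
        marking_criteria="Award 1 mark for every correct point, up to 10 marks.",
    ))
```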
- Phase 2: Prompt delivery evaluation
- To determine the most efficient method for evaluating large response samples, we tested 3 prompt delivery methods on a complete response set for one item using the finalized prompt:
- Method 1: one-time prompt, then responses passed one-by-one until complete.
- Method 2: prompt followed by 15 responses evaluated one-by-one, then re-prompted after each batch of 15.
- Method 3: new browser window and prompt for every 15 responses.
- All methods followed the same basic prompting sequence as in Phase 1.
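- The sketch below illustrates the batching logic of Method 2, in which the optimized prompt is re-sent after every 15 responses within the same conversation. Because the study used the ChatGPT web interface rather than the API, the OpenAI client call shown is an assumption included only so the loop can be shown end to end.

```python
# Illustrative sketch of Method 2: the optimized prompt is repeated after every
# 15 responses within the same conversation. The study graded responses manually
# in the ChatGPT web interface; the OpenAI API call below is an assumption made
# so the batching logic can be shown end to end.
from openai import OpenAI

BATCH_SIZE = 15
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def grade_with_reprompting(optimized_prompt: str, responses: list[str]) -> list[str]:
    """Grade responses one by one, repeating the prompt at the start of each batch."""
    evaluations = []
    messages: list[dict] = []
    for index, response in enumerate(responses):
        if index % BATCH_SIZE == 0:
            # Method 2: re-prompt at the start of each batch of 15 responses.
            messages.append({"role": "user", "content": optimized_prompt})
        messages.append({"role": "user", "content": response})
        reply = client.chat.completions.create(model="gpt-4", messages=messages)
        evaluation = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": evaluation})
        evaluations.append(evaluation)
    return evaluations
```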
- Phase 3: Item grading performance
- We applied the optimized prompt and prompt delivery method across the full set of responses for all 4 items, evaluating ICC scores between human graders and GPT-4, the effect of stratification by score categories on ICC, the reproducibility of different GPT-4 attempts, and finally, the error rates and types of errors observed during evaluation. The total number of responses evaluated is shown in Table 1.
- Subjects
- Four structured items with varying attributes were selected for this study, as shown in Table 1.
- Statistical methods
- We applied ICC analyses to evaluate both the consistency and agreement of scores between GPT-4 and human graders, as well as across multiple GPT-4 attempts. The ICC is a widely used statistical measure for assessing the reliability of quantitative ratings among different raters or measurement instances [14]. ICC values below 0.50 indicate poor reliability, 0.50–0.75 moderate reliability, 0.75–0.90 good reliability, and values above 0.90 indicate excellent reliability [14]. To assess whether reliability relative to human raters improved with repeated GPT-4 attempts, we compared ICC values (inter-rater, 2-way random effects model) for 1, 2, and 3 GPT-4 trials, averaging scores when multiple attempts were used. We also evaluated the consistency of scores across the 3 GPT-4 trials (intra-rater, 2-way mixed effects model).
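- For readers wishing to reproduce this type of analysis, the sketch below shows how the inter-rater ICC model described above could be computed with the pingouin package; the study itself used IBM SPSS, and the column names and scores here are invented for illustration.

```python
# Illustrative only: the study computed its ICCs in IBM SPSS ver. 28.0. The
# pingouin package reports the same ICC family; the data below are invented.
import pandas as pd
import pingouin as pg

# Long-format data: one row per (response, rater) pair.
scores = pd.DataFrame({
    "response": [1, 1, 2, 2, 3, 3, 4, 4],
    "rater": ["human", "gpt4"] * 4,
    "score": [7.0, 6.5, 3.0, 3.5, 9.0, 7.5, 5.0, 5.5],
})

icc = pg.intraclass_corr(data=scores, targets="response",
                         raters="rater", ratings="score")

# ICC2 = 2-way random effects, single rater (the inter-rater model used here).
# For the intra-rater analysis, the "rater" column would instead hold the 3
# GPT-4 attempts, and ICC3/ICC3k (2-way mixed effects) would be read.
print(icc[icc["Type"] == "ICC2"][["Type", "ICC", "CI95%"]])
```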
- The Wilcoxon signed-rank test was used to compare differences between GPT-4 and human raters. Scoring errors by GPT-4 were tracked and categorized as either summing mistakes or rubric mismatches. Examples of scoring errors are provided in Supplement 2. Summing mistakes were corrected before analysis. All statistical analyses were performed using IBM SPSS ver. 28.0 (IBM Corp.).
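- Similarly, the paired comparison could be run with SciPy as sketched below; the scores are hypothetical and stand in for the corrected GPT-4 and human marks used in the study.

```python
# Hypothetical data standing in for the corrected GPT-4 and human marks; the
# Wilcoxon signed-rank comparison mirrors the paired test described above.
from scipy.stats import wilcoxon

human_scores = [7.0, 3.0, 9.0, 5.0, 6.0, 8.5, 4.0, 7.5]
gpt4_scores = [6.5, 3.2, 7.4, 5.9, 6.1, 7.9, 4.8, 7.2]

stat, p_value = wilcoxon(human_scores, gpt4_scores)
print(f"Wilcoxon statistic = {stat:.1f}, P = {p_value:.3f}")
```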
Results
- Phase 1: Prompt optimization
- As seen in Fig. 2, between 5 and 12 prompt versions were created for each item before GPT-4’s grading satisfactorily aligned with human graders. The final version for each item was used for inter-day reproducibility testing, and ICC values remained consistent across days. In reproducibility tests, Items 1 and 2 showed no scoring errors, while Items 3 and 4 had relatively low error rates. Scores obtained during prompt optimization are available in Dataset 1, and complete ICC analyses are provided in Supplement 3.
- Phase 2: Prompt delivery evaluation
- To identify an effective method for rating full response sets (n=125–132) in GPT-4, 3 prompt delivery methods were tested using Item 1, as described in the Methods (Phase 2). Method 3 produced the lowest ICC value and was excluded. Method 1 showed consecutive scoring errors in the final quarter of responses, which inflated its error rate. Method 2 yielded a higher ICC (0.617; 95% CI, 0.455–0.732; P<0.001), a lower error frequency (8.8%), and acceptable evaluation quality. Therefore, Method 2 was used for evaluating all 4 exam items. Complete results are available in Supplement 4.
- Phase 3: Grading performance
Reliability between GPT-4 and human raters
- ICC values between GPT-4 and human raters were measured for all items and stratified by score bands. Items 1 and 2 showed moderate agreement, Item 4 showed good agreement, and Item 3 demonstrated excellent agreement, as detailed in Supplement 5. When stratified into low, intermediate, and high-scoring bands, agreement decreased, particularly for high-scoring responses. Averaging 2 or 3 GPT-4 attempts slightly increased ICC values overall, but this trend was inconsistent across score bands. Scores obtained during the full trial are available in Dataset 2.
Reliability of GPT-4 grading between attempts
- Intra-rater ICCs across 3 GPT-4 trials varied. Single measures showed moderate reliability for Item 1, good reliability for Items 2 and 4, and excellent reliability for Item 3. Average measures indicated good reliability for Item 1 and excellent reliability for the remaining items. Further results can be found in Supplement 6.
Difference between GPT-4 and human raters in pharmacy students’ exam responses
- The Wilcoxon signed-rank test showed significant differences in GPT-4 versus human scores for Items 1 (Z=–4.329, P<0.001), 2 (Z=8.708, P<0.001), and 4 (Z=5.710, P<0.001). Item 3 showed no significant difference (Z=–0.226, P=0.821). When stratified by score bands, all items showed significant differences in the high band. Differences in the low and intermediate bands varied by item, as shown in Supplement 7.
Score calculation errors and other errors identified in GPT-4’s evaluations
- Score calculation errors were highest in Item 3 (34.07%±8.75%), followed by Item 4 (7.82%±8.60%), Item 1 (6.67%±2.44%), and Item 2 (2.88%±2.40%), as shown in Supplement 8.
- A review of GPT-4’s evaluation rationale revealed recurring issues across all items. Despite explicit instructions, it sometimes applied its own discretion, awarding marks beyond the rubric scope or for restated ideas. It was more lenient with flexible marking criteria and occasionally over-allocated marks. Other errors included failing to recognize correct responses, missing parts of responses, altering responses to fit the rubric, or misinterpreting language. Mean frequencies of these issues are summarized in Supplement 9.
Discussion
- Key results
- This study emphasized both the potential and limitations of GPT-4 in grading pharmacy students’ exam responses, particularly in its alignment with human grading practices. By systematically optimizing prompts and evaluating performance across several metrics, we found that GPT-4 can achieve moderate to excellent inter-rater reliability depending on the item type. However, variability in agreement—especially for high-scoring answers—underscores the challenge of matching artificial intelligence (AI) evaluations to human judgment in discipline-specific settings.
- Interpretation/comparison with previous studies
Effects of prompt optimization
- Prompt optimization was critical for improving GPT-4’s grading reliability and reducing errors. Li et al. [8] observed significant score variation across 5 prompts when grading coding and reflective essay assignments, emphasizing the importance of task-specific prompts. In our study, iterative, item-specific refinement led to substantial ICC improvements, with most exam items achieving moderate to excellent reliability. Yavuz et al. [10] similarly reported that fine-tuned prompts with low temperature settings produced more consistent grading than default prompts. While refinement requires upfront effort, it improves long-term reliability. Approaches like those of Jukiewicz [9], which employ developer-informed prompt engineering strategies, could accelerate this process.
- Combining optimized prompts with structured rubrics further enhanced consistency between GPT-4 and human grading. Jukiewicz [9] and Morjaria et al. [11] both highlighted the value of well-structured inputs for improving AI grading reliability. The former attributed high correlations between AI and human raters in programming tasks to well-designed prompts, while the latter demonstrated that detailed rubrics reduced score discrepancies in medical student assessments.
Reasons for variability in GPT-4’s evaluations
- In terms of marking performance stratified by response quality, we found larger discrepancies in scores for high-performing students, consistent with trends noted in evaluations of English essays by Yavuz et al. [10]. Conversely, some studies found that score discrepancies were higher for low-performing students when GenAI was used to grade reflective essays and coding assignments [8]. This suggests that assessment type may play a role in grading variability. In addition, poor structure of responses may also contribute to this scoring variability [7].
- We observed excellent intra-rater reliability across multiple GPT-4 grading attempts, indicating consistent performance within the model. Furthermore, the ICC between GPT-4 and human raters improved when the average of 2 or more GPT-4 attempts was used. This finding, also reported by Jukiewicz [9] and Yavuz et al. [10], suggests that averaging multiple grading attempts by GPT-4 is advantageous. Hamtini and Assaf [7] noted erratic behavior in score calculations and the awarding of partial marks by GenAI when not explicitly instructed to do so—a challenge we also identified. Limitations in GPT-4’s “working memory” may have contributed to these issues [15].
Potential barriers to implementing GPT-4 as a grading tool
- Several challenges emerged when using GPT-4 for grading. A major issue was the need to tailor prompts for each exam item to ensure consistency. The accuracy of GPT-4’s grading was also influenced by the clarity of the rubric, as vague or overly complex criteria often led to inconsistent scores. Another issue was the need to validate and correct AI-generated scores due to GPT-4’s simple calculation errors, which negated some of the time savings achieved through automation. Furthermore, account usage limits restricted GPT-4’s capacity to grade in bulk, with only about 30 responses processed every 4 hours. This posed scalability issues for high-volume grading. Additionally, GPT-4’s processing limitations hindered its ability to evaluate longer, more complex responses, particularly those with multi-part answers, indicating a need for model improvement.
- Limitations
- GPT-4’s grades were benchmarked against human-graded scores, which are inherently variable. Also, the study was conducted between April and September 2024, and AI models evolve quickly, potentially reducing generalizability. Prompt optimization was tested on a small subset of responses and focused on ICC and scoring errors. Larger datasets and broader metrics could improve future applications. Nonetheless, our method effectively compared optimized versus unrefined prompts.
- Implications
- GPT-4 is best used as a supportive tool rather than as an independent grading solution. A practical approach is to have GPT-4 grade in parallel with human graders to check for consistency and fairness. Educators should develop item-specific prompts and ensure grading criteria are clear and specific. Averaging multiple GPT-4 attempts (as in this study, with up to 3 attempts) enhances reliability. GPT-4 performs best on structured, straightforward items—such as calculations—and less effectively on subjective or open-ended tasks. Human oversight remains essential to validate scores and correct errors, balancing accuracy with AI-driven efficiency. For batch limitations, automated submission scheduling aligned with GPT-4’s usage window may help. Overall, GenAI can support educators by reducing grading time, increasing grading consistency, and improving fairness in assessment.
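- As a hypothetical illustration of the scheduling idea mentioned above, the sketch below submits responses in chunks sized to the usage cap observed in this study (about 30 responses every 4 hours) and waits for the window to reset between chunks; the grade_batch callable is a placeholder for whatever grading routine is used.

```python
# Hypothetical sketch of scheduling submissions around a usage cap of roughly
# 30 responses per 4-hour window, as observed in this study. The grade_batch
# argument is a placeholder for whatever grading routine is used.
import time

USAGE_CAP = 30             # responses accepted per usage window (observed)
WINDOW_SECONDS = 4 * 3600  # approximate length of the usage window


def schedule_grading(responses: list[str], grade_batch) -> list[str]:
    """Grade all responses in capped chunks, pausing between usage windows."""
    results = []
    for start in range(0, len(responses), USAGE_CAP):
        results.extend(grade_batch(responses[start:start + USAGE_CAP]))
        if start + USAGE_CAP < len(responses):
            time.sleep(WINDOW_SECONDS)  # wait for the next usage window
    return results
```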
- Suggestions for further studies
- Several areas merit further investigation. Comparing various GenAI models and more advanced versions may reveal improved grading capabilities, especially for batch processing and complex responses. Assessing GPT-4’s capacity to grade non-text assessments (e.g., images or videos) could also extend its utility. Creating a shared prompt bank of optimized, item-specific templates tailored to common question types may streamline GenAI-assisted marking, ensure grading consistency, and save educators time in prompt optimization.
- Conclusion
- GPT-4 shows promise in improving efficiency and consistency in pharmacy assessment grading. Tailored prompts significantly enhance its reliability, especially for objective tasks like calculations. However, variability in scoring across item types and error tendencies highlight its limitations as a standalone grading tool. GenAI is best positioned to support, not replace, human graders. Future research should focus on alternative models, prompt refinement, and broader assessment formats.
Authors’ contributions
Conceptualization/design: RFSL, SWHL, PSS, WJW. Data collection: WSY. Methodology/formal analysis: RFSL, WSY. Visualization: RFSL. Statistical consultation: SWHL. Writing–original draft: WSY, PSS, RFSL. Writing–review and editing: all authors.
Conflict of interest
No potential conflict of interest relevant to this article was reported.
Funding
This study was funded by the Monash University Malaysia Learning & Teaching Grant Scheme (N-M010-INA-000010).
Data availability
Data files are available from https://doi.org/10.7910/DVN/CMT1TF
Dataset 1. Scores obtained for each prompt iteration during the prompt optimization phase (items 1–4).
jeehp-22-20-dataset1.docx
Dataset 2. Scores obtained using the prompts optimized for items 1–4 during the full trial.
jeehp-22-20-dataset2.docx
Acknowledgments
None.
Supplementary materials
Supplement files are available from https://doi.org/10.7910/DVN/CMT1TF
Supplement 1. Outline of a finalized prompt used to instruct GPT-4’s evaluations of the student responses.
jeehp-22-20-suppl1.docx
Supplement 3. Mean and standard deviation (SD) of intraclass correlation coefficients (ICCs) and frequency of score calculation errors of 3 inter-day trials using finalized prompts.
jeehp-22-20-suppl3.docx
Supplement 5. Inter-rater reliability intraclass correlation coefficients (ICCs) for absolute agreement (2-way random effects model, single rater), comparing GPT-4’s scores with the human assessors stratified by low, intermediate, and high scorers. Scores from up to 3 GPT-4 trials were averaged for comparison.
jeehp-22-20-suppl5.docx
Supplement 6. Intra-rater reliability intraclass correlation coefficients (ICCs) for absolute agreement between 3 GPT-4 attempts (2-way mixed effects model).
jeehp-22-20-suppl6.docx
Supplement 7. Wilcoxon signed-rank test comparing the average of GPT-4’s scores with the human assessors stratified by low, intermediate, and high scorers.
jeehp-22-20-suppl7.docx
Supplement 9. Types of issues encountered and their mean frequencies in GPT-4’s evaluations.
jeehp-22-20-suppl9.docx
Fig. 1. Flowchart of the approach used in this study.
Fig. 2. Part (A) shows the intraclass correlation coefficient (ICC), and Part (B) presents the frequency of score calculation errors (%) of prompts during iterations of prompt optimization. Part (C) shows the ICC, and Part (D) presents the frequency of score calculation errors when optimized prompts were evaluated for inter-day reliability.
Table 1. Descriptions of the selected exam items

| Label | Item type | Subject matter | Knowledge assessed (Bloom's level) | Mark allocation | No. of student responses evaluated^a) |
|---|---|---|---|---|---|
| Exam item 1 | Comparative analysis | Science | Comparing attributes of polymer-based drug delivery systems (analysis) | 10 marks; two-part item; every correct point 1 mark | 125 |
| Exam item 2 | Evaluative reasoning | Science | Judging whether a statement about dosage forms is true or false (evaluation) | 10 marks; five-part item; every correct point 1 mark | 127 |
| Exam item 3 | Dosage calculation | Clinical | Calculating weight-based dosages for pyrantel squares (application) | 6 marks; every correct calculation 0.5 marks; every correct final answer 1 mark | 132 |
| Exam item 4 | Critical appraisal | Regulatory standards | Reasoning whether a new pharmaceutical product can be stocked (evaluation) | 6 marks; every correct point 1 mark | 132 |
References
- 1. Ward H. Workload: tens of thousands of teachers spend more than 11 hours marking every week. Tes Magazine [Internet]. 2016 Apr 18 [cited 2025 Apr 24]. Available from: https://www.tes.com/magazine/archive/workload-tens-thousands-teachers-spend-more-11-hours-marking-every-week
- 2. Burger R. Student perceptions of the fairness of grading procedures: a multilevel investigation of the role of the academic environment. High Educ 2017;74:301-320. https://doi.org/10.1007/s10734-016-0049-1
- 3. Meadows M, Billington L. A review of the literature on marking reliability. National Assessment Agency; 2005.
- 4. Yorke M. Grading student achievement in higher education: signals and shortcomings. Routledge; 2007.
- 5. Bloxham S. Marking and moderation in the UK: false assumptions and wasted resources. Assess Eval High Educ 2009;34:209-220. https://doi.org/10.1080/02602930801955978
- 6. Sengar SS, Hasan AB, Kumar S, Carroll F. Generative artificial intelligence: a systematic review and applications. Multimed Tools Appl 2025;84:23661-23700. https://doi.org/10.1007/s11042-024-20016-1
- 7. Hamtini T, Assaf AJ. Exploring the efficacy of GenAI in grading SQL query tasks: a case study. Cybern Inf Technol 2024;24:102-111. https://doi.org/10.2478/cait-2024-0027
- 8. Li J, Jangamreddy NK, Hisamoto R, Bhansali R, Dyda A, Zaphir L, Glencross M. AI-assisted marking: functionality and limitations of ChatGPT in written assessment evaluation. Australas J Educ Technol 2024;40:56-72. https://doi.org/10.14742/ajet.9463
- 9. Jukiewicz M. The future of grading programming assignments in education: the role of ChatGPT in automating the assessment and feedback process. Think Skills Creat 2024;52:101522. https://doi.org/10.1016/j.tsc.2024.101522
- 10. Yavuz F, Celik O, Yavas Celik G. Utilizing large language models for EFL essay grading: an examination of reliability and validity in rubric-based assessments. Br J Educ Technol 2025;56:150-166. https://doi.org/10.1111/bjet.13494
- 11. Morjaria L, Burns L, Bracken K, Levinson AJ, Ngo QN, Lee M, Sibbald M. Examining the efficacy of ChatGPT in marking short-answer assessments in an undergraduate medical program. Int Med Educ 2024;3:32-43. https://doi.org/10.3390/ime3010004
- 12. Awidi IT. Comparing expert tutor evaluation of reflective essays with marking by generative artificial intelligence (AI) tool. Comput Educ Artif Intell 2024;6:100226. https://doi.org/10.1016/j.caeai.2024.100226
- 13. Mizumoto A, Eguchi M. Exploring the potential of using an AI language model for automated essay scoring. Res Methods Appl Linguist 2023;2:100050. https://doi.org/10.1016/j.rmal.2023.100050
- 14. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med 2016;15:155-163. https://doi.org/10.1016/j.jcm.2016.02.012
- 15. Kortemeyer G, Nohl J, Onishchuk D. Grading assistance for a handwritten thermodynamics exam using artificial intelligence: an exploratory study. Phys Rev Phys Educ Res 2024;20:020144. https://doi.org/10.1103/PhysRevPhysEducRes.20.020144