, Junseok Kang2, Min-Young Kim3, Jihyun Ahn4*
1College of Nursing, Research Institute of Nursing Innovation, Kyungpook National University, Daegu, Korea
2Harvard John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
3Department of Dental Hygiene, Howon University, Gunsan, Korea
4Department of Internal Medicine, Korea Medical Institute, Seoul, Korea
© 2025 Korea Health Personnel Licensing Examination Institute
This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Authors’ contributions
Conceptualization: JA. Data curation: JA. Formal analysis: JK, JA. Methodology: BK, MYK, JA. Project administration: JA. Visualization: JA. Writing–original draft: BK, JA. Writing–review & editing: JK, JA.
Conflict of interest
No potential conflict of interest relevant to this article was reported.
Funding
No funding to declare.
Data availability
Not applicable.
Acknowledgments
Fig. 2 in this manuscript was generated using Google Whisk and Imagen tools for image generation, based on author-provided prompts.
| Approach | Characteristics | Advantages | Disadvantages | Concrete examples |
|---|---|---|---|---|
| Zero-shot prompting | Model generates multiple-choice questions without examples. | Simple, fast; diverse coverage (recall, reasoning, problem-solving). | Often deviates from required formats; implausible distractors; inconsistent option ordering. | Used only for initial ideation, followed immediately by structural checklists and automated cueing detection. |
| Few-shot prompting with exemplars | Supplies 3–5 validated single-best answer items before generation. | Improves structural conformity, distractor plausibility, and clinical realism; reproduces tone/length; adapts to exam conventions. | Requires high-quality, curated exemplars. | KMLE-aligned sequence (demographics → chief complaint → history → exam → labs); improves cultural and linguistic appropriateness. |
| Multi-stage prompting | Stepwise process: topic → vignette → lead-in → options → rationale → self-critique. | Reduces logical errors and answer leakage; enables iterative correction/self-refinement. | Adds workflow complexity. | Six-stage schema with blueprint alignment, vignette drafting, option balancing, rationale logging, and automated self-critique before expert review. |
| CoT prompting | Explicit reasoning steps during generation or validation. | Improves few-shot item generation performance; strengthens logical consistency and distractor plausibility. | Rationales must be redacted from learner-facing materials. | Example: Kawasaki vignette with IVIG as correct key; reasoning retained internally for validation only. |
| Structured formats (markdown/JSON) | JSON schemas for field consistency; markdown rubrics for readability. | Substantially improves accuracy and structure; enables systematic validation. | Requires upfront schema design. | JSON schema with fields (blueprint node, vignette, options A–E, key, rationale_hidden); markdown used for evaluation rubrics. |
| RAG | Grounds prompts in blueprints, curricula, guidelines. | Enforces proportional coverage; reduces scope drift; ensures cultural and jurisdictional alignment. | Requires curated retrieval index. | Aligns items to KMLE blueprint weights; enforces Korean guidelines and reference ranges. |
| Self-evaluation/self-refinement | Checklist-based validation and iterative correction. | Improves clarity, vignette structure, distractor quality, and option ordering. | Cannot replace human review; still requires expert oversight. | Model critiques and regenerates vignettes or options, re-evaluates, then passes to expert editing. |
| Hybrid human-AI workflow | AI drafts and validates; humans finalize and bank items. | Scalable; reduces faculty burden; preserves psychometric rigor. | Still requires dual expert review, pilot testing, and governance protocols. | Division of labor: AI drafts single-best answer pool, human experts verify clinical accuracy, style, and psychometrics before banking. |
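The structured-format row above can be sketched as a minimal pre-review check. The field names (`blueprint_node`, `vignette`, `options`, `key`, `rationale_hidden`) follow the example given in the table; the validation logic itself is an illustrative assumption, not a published implementation.

```python
# Required fields for a single-best answer item, following the field list
# named in the table (field names are illustrative).
REQUIRED_FIELDS = {"blueprint_node", "vignette", "options", "key", "rationale_hidden"}

def validate_item(item: dict) -> list[str]:
    """Return a list of structural problems; an empty list means the item passes."""
    problems = []
    missing = REQUIRED_FIELDS - item.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
        return problems
    options = item["options"]
    # Exactly five options labelled A-E, per single-best answer convention.
    if sorted(options) != ["A", "B", "C", "D", "E"]:
        problems.append("options must be labelled A-E")
    if item["key"] not in options:
        problems.append("key must be one of the option labels")
    return problems

item = {
    "blueprint_node": "Pediatrics > Kawasaki disease",
    "vignette": "A 3-year-old presents with 5 days of fever...",
    "options": {"A": "Aspirin alone", "B": "IVIG", "C": "Oral steroids",
                "D": "Antibiotics", "E": "Observation"},
    "key": "B",
    "rationale_hidden": "IVIG within 10 days reduces coronary complications.",
}
print(validate_item(item))  # an empty list indicates a structurally valid item
```

Checks of this kind complement, but do not replace, the expert review steps described in the tables.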
| Author goal | Recommended strategy | Key advantages | Precautions/validation requirements |
|---|---|---|---|
| Brainstorming new item ideas | Zero-shot prompting | Generates diverse clinical scenarios quickly. | Use only for ideation; filter with structural checklists and automated cueing detection. |
| Producing format-faithful single-best answer items | Few-shot prompting with high-quality exemplars | Improves structural conformity, distractor plausibility, tone, and length. | Exemplars must strictly follow local exam rules (e.g., KMLE vignette order, lab notation). |
| Reducing logical errors and answer leakage | Multi-stage prompting/CoT prompting | Strengthens reasoning consistency; prevents leakage; enables iterative correction. | Rationales/CoT reasoning must remain hidden from learner materials. |
| Ensuring blueprint and curriculum alignment | Retrieval-augmented generation | Guarantees proportional coverage and jurisdiction-specific scope. | Requires curated retrieval index; all content must be verified against authoritative sources. |
| Automating pre-review checks | Structured JSON/Markdown + AI self-validation | Enforces uniform schema; supports efficient checklist-based validation. | Complements, but cannot replace, human review. |
| Enhancing distractor quality and option balance | Iterative self-refinement loops | Improves plausibility, grammatical consistency, and option ordering. | Requires systematic re-validation and expert editing. |
| Scaling defensibly | Hybrid human-AI workflow | Expands item pools efficiently with defensible audit trails. | Requires dual expert review, psychometric testing, and adherence to governance frameworks. |
KMLE, Korean Medical Licensing Examination; CoT, chain-of-thought; IVIG, intravenous immunoglobulin; JSON, JavaScript Object Notation; RAG, retrieval-augmented generation; AI, artificial intelligence.
KMLE, Korean Medical Licensing Examination; CoT, chain-of-thought; JSON, JavaScript Object Notation; AI, artificial intelligence.
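The multi-stage and self-refinement strategies recommended above can be combined into a single drafting loop, sketched below. `call_model` is a stand-in for any large language model API call, and the stage prompts, critique wording, and stopping rule are illustrative assumptions rather than a published protocol.

```python
# Stages follow the six-stage schema summarized in Table 1:
# topic -> vignette -> lead-in -> options -> rationale -> self-critique.
STAGES = ["topic", "vignette", "lead-in", "options", "rationale", "self-critique"]

def call_model(prompt: str) -> str:
    # Placeholder: a real workflow would call a language model API here.
    return f"[model output for: {prompt[:40]}]"

def draft_item(blueprint_node: str, max_revisions: int = 2) -> dict:
    draft = {}
    # Generate each stage in order, conditioning on earlier stages.
    for stage in STAGES[:-1]:
        draft[stage] = call_model(f"Stage '{stage}' for {blueprint_node}, given {draft}")
    # Self-critique loop: flag answer leakage or cueing, then regenerate options.
    for _ in range(max_revisions):
        critique = call_model(f"Critique for leakage and cueing: {draft}")
        if "no issues" in critique.lower():
            break
        draft["options"] = call_model(f"Revise options per critique: {critique}")
    # Per the hybrid workflow, the model never finalizes items on its own.
    draft["status"] = "ready for expert review"
    return draft

item = draft_item("Pediatrics > Kawasaki disease")
print(item["status"])  # prints "ready for expert review"
```

Consistent with the precautions column, the loop ends by handing the draft to human experts for dual review and psychometric testing rather than banking it directly.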