Can AI Chatbots Write My Exam Questions?

Writing good multiple-choice questions takes longer than most people realise. A batch of exam-ready MCQs for a biomedical programme can consume 96 person-hours once you account for content sourcing, stem drafting, distractor development, peer review, psychometric checks, and revision cycles. That’s weeks of faculty time for a single assessment.

So when ChatGPT can produce a plausible-looking question in seconds, the temptation is obvious. Paste in a learning objective, get back a formatted MCQ. The efficiency gains are real: peer-reviewed studies confirm that AI can cut the initial drafting burden in health sciences education by half or more.

But faster is not the same as better.

What the Evidence Actually Says

The research is consistent on one point: AI-generated MCQs are not dramatically worse than human-written ones on basic psychometric measures. A 2025 PLOS meta-analysis found that ChatGPT-4 outputs were “not better than people — but also not worse” on difficulty and discrimination indices. The QUEST-AI validation study reported high clinician-rated validity when LLM-generated questions were benchmarked against USMLE standards.

That sounds reassuring. It shouldn’t be.

Matching human performance on psychometrics is a low bar when we know that human-written questions are themselves riddled with structural flaws. Studies consistently report that the majority of MCQs in medical education contain at least one item-writing flaw — cueing, implausible distractors, misaligned cognitive demand, ambiguous stems. Saying AI matches this standard is saying AI is equally flawed.
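For readers less familiar with the two indices mentioned above, here is a minimal sketch of how they are commonly computed under classical test theory. It assumes difficulty is the proportion of candidates answering correctly and discrimination is the point-biserial correlation between the item score and the total score; the cited studies may define them differently. The function names and the response data are invented for illustration only.

```python
from statistics import mean, pstdev


def item_difficulty(item_scores):
    """Proportion of candidates who answered the item correctly (0 to 1)."""
    return mean(item_scores)


def item_discrimination(item_scores, total_scores):
    """Point-biserial correlation between item score (0/1) and total score.

    Uses the uncorrected total, which includes the item itself.
    """
    p = mean(item_scores)
    sd = pstdev(total_scores)
    if p in (0, 1) or sd == 0:
        return 0.0  # undefined when everyone scores the same; report zero
    correct = [t for s, t in zip(item_scores, total_scores) if s == 1]
    incorrect = [t for s, t in zip(item_scores, total_scores) if s == 0]
    return (mean(correct) - mean(incorrect)) / sd * (p * (1 - p)) ** 0.5


# Invented data: eight candidates, 1 = correct on this item, plus exam totals.
item_scores = [1, 1, 1, 0, 1, 0, 0, 1]
total_scores = [9, 8, 7, 4, 6, 3, 5, 8]

print(item_difficulty(item_scores))
print(item_discrimination(item_scores, total_scores))
```

Note what these numbers cannot see: an item can post a respectable difficulty and discrimination value and still contain a cue or an ambiguous stem. That is why parity on these indices says little about item-writing quality.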

Where Chatbots Consistently Fail

Three problems recur across the studies:

Cognitive collapse. Chatbots default to testing recall, even when prompted for higher-order reasoning. They produce questions that look clinical but resolve to simple fact recognition: the scenario adds complexity without adding cognitive demand, and the item can be answered by knowing a single fact.

Hallucination and factual drift. Language models generate plausible-sounding content that is occasionally wrong. In clinical assessment, a hallucinated drug interaction or an incorrect diagnostic criterion is not a minor error. It’s a question that teaches the wrong thing and rewards the wrong answer.

Structural cues and distractor failure. AI-generated distractors frequently betray the correct answer through length disparity, grammatical inconsistency, or semantic implausibility. Test-wise candidates eliminate options without engaging with the clinical content. The question discriminates on exam technique, not knowledge.

These are not edge cases. They are the default behaviour of current language models when applied to assessment without expert oversight.
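Of the three, the length cue is the easiest to screen for automatically. The sketch below is a hypothetical heuristic, not a validated tool: the function name, the 1.5 ratio threshold, and the sample item are all assumptions made for illustration. It flags an item when the keyed answer is markedly longer than every distractor.

```python
def longest_option_cue(options, correct_index, ratio=1.5):
    """Return True when the keyed answer is at least `ratio` times as long
    (in characters) as the longest distractor.

    The 1.5 default is an arbitrary illustrative threshold.
    """
    key_length = len(options[correct_index])
    distractor_lengths = [
        len(text) for i, text in enumerate(options) if i != correct_index
    ]
    return key_length >= ratio * max(distractor_lengths)


# Invented item: the key (index 0) carries an explanation the distractors lack.
cued_options = [
    "Vitamin C, because it is required for collagen hydroxylation",
    "Vitamin A",
    "Vitamin D",
    "Vitamin K",
]
balanced_options = ["Vitamin C", "Vitamin A", "Vitamin D", "Vitamin K"]

print(longest_option_cue(cued_options, correct_index=0))
print(longest_option_cue(balanced_options, correct_index=0))
```

A check like this only catches the crudest cue. Grammatical inconsistency, implausible distractors, and factual errors still need a subject expert reading the item.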

The Prompt Engineering Trap

Some educators have responded by investing in sophisticated prompting strategies — feeding the model structured templates, specifying Bloom’s taxonomy levels, providing rubrics for distractor quality. This helps, but it introduces a new problem: the quality of AI output becomes dependent on the prompt engineering skill of the individual author.

An academic who understands assessment design well enough to write an effective AI prompt probably understands assessment design well enough to write the question themselves. The chatbot saves time on typing, not on thinking. And thinking is where question quality lives.

The Real Question

The issue is not whether AI can draft exam questions. It can. The issue is whether the output is safe to use without the same expert review process you would apply to any human-written draft.

The answer, consistently, is no.

Every study that reports acceptable AI performance includes expert review as a mandatory step. Remove the expert, and you are publishing unvalidated assessment content generated by a model that does not understand what it is testing.

What This Means for Your Institution

AI-generated questions are not ready to use out of the box. They require the same scrutiny as any other draft — and arguably more, because their surface plausibility masks deeper structural and clinical problems.

The time saving at the drafting stage is real, but drafting is only one part of those 96 person-hours. If expert review, revision, and validation take just as long as they would for a human-written question, the net efficiency gain is smaller than it appears.
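The arithmetic is worth making explicit. In the sketch below, the 96 person-hours and the "halve" figure come from this post, but the share of time spent on drafting is a purely hypothetical 25% chosen for illustration; substitute your own programme's figures.

```python
def net_hours_saved(total_hours, drafting_share, drafting_reduction):
    """Hours saved when only the drafting stage gets faster.

    Review, validation, and revision time is assumed unchanged.
    """
    drafting_hours = total_hours * drafting_share
    return drafting_hours * drafting_reduction


# 96 hours per batch (from this post), 25% spent drafting (hypothetical),
# drafting time halved by AI (from this post).
saved = net_hours_saved(total_hours=96, drafting_share=0.25, drafting_reduction=0.5)
print(f"{saved} of 96 hours saved")
```

Under those assumptions, most of the cost of a batch stays exactly where it was: in expert time.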

The question for assessment leaders is not “can AI write my questions?” It can. The question is “should it?” — and if so, under what conditions?

That’s the subject of the next post in this series.

This is part one of a series on AI in assessment. Next: Should You Trust AI With Your Exam Content? — the security risks of using public chatbots for question development.

Sources

AI-generated multiple-choice questions in health science education. PMC, 2025. pmc.ncbi.nlm.nih.gov/articles/PMC12340502
AI versus human-generated multiple-choice questions for medical education. PubMed, 2025. pubmed.ncbi.nlm.nih.gov/39923067
ChatGPT prompts for generating multiple-choice questions in health sciences. PubMed, 2024. pubmed.ncbi.nlm.nih.gov/38840505
AI versus human-generated MCQs: a systematic comparison. PMC, 2025. pmc.ncbi.nlm.nih.gov/articles/PMC11806894
AI in MCQ generation: a systematic review and network meta-analysis. PLOS One, 2025. journals.plos.org/plosone/article?id=10.1371/journal.pone.0340277
QUEST-AI: a system for question generation, verification, and validation. PubMed, 2024. pubmed.ncbi.nlm.nih.gov/39670361

CrtQ — Sharper questions. Smarter exams. crtq.ai