A Comparative Analysis of Multiple-Choice Questions: ChatGPT-Generated Items vs. Human-Developed Items
This chapter is part of: Chapelle, C. A., Beckett, G. H., & Ranalli, J. (Eds.). (2024). Exploring artificial intelligence in applied linguistics. Iowa State University Digital Press. https://doi.org/10.31274/isudp.2024.154.
Description
This study explores the potential of incorporating ChatGPT (GPT-3.5) to improve test development efficiency. It evaluates the quality of ChatGPT-generated multiple-choice questions (MCQs) compared to those written by human test developers. Additionally, the study seeks to identify the general characteristics of ChatGPT-generated items. A total of 80 items, 40 from ChatGPT and 40 from human writers, were developed from 20 authentic Korean passages. The quality of the items was evaluated by three raters on a five-point Likert scale against a rubric of eight criteria. Both quantitative and qualitative methods were employed, incorporating an analysis of rating scores and the raters' written comments. The results indicate an overall comparability between ChatGPT-generated items and those created by human writers. However, ChatGPT's ability to create plausible distractors was significantly limited. These findings underscore the importance of human judgment, particularly in the creation of effective distractors, to fully leverage ChatGPT in the test development process.
Details
Published: July 31, 2024
Published By: Iowa State University Digital Press
Pages: 19
DOI: 10.31274/isudp.2024.154.08
License Information: ©2024 The authors. Published under a CC BY license.
Citation: Chun, J. Y., & Barley, N. (2024). A comparative analysis of multiple-choice questions: ChatGPT-generated items vs. human-developed items. In C. A. Chapelle, G. H. Beckett, & J. Ranalli (Eds.), Exploring artificial intelligence in applied linguistics (pp. 118–136). Iowa State University Digital Press. https://doi.org/10.31274/isudp.2024.154.08