Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach
English summary
This study evaluates four frontier LLMs (GPT, Claude Opus, Gemini, and GLM) on grading 1200 real student responses to Linux/bash command questions across four cognitive levels, from information retrieval to advanced system management. Gemini 3.0 Pro with rubric-enhanced prompting achieved the best human-AI agreement (ICC=0.888, MAE=0.10, bias=-0.014). Agreement decreased consistently as question complexity increased, with the largest discrepancies at the higher taxonomy levels. Rubric quality had a larger impact than model choice, and structured prompts consistently improved agreement. The work provides a taxonomy-based framework for deciding which questions are suitable for AI-assisted grading and which need human review, along with reusable evaluation protocols and prompt templates.
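Chinese summary (translated)
The study evaluated four frontier large language models (GPT, Claude Opus, Gemini, and GLM) on scoring 1200 real student answers to a Linux/bash command examination, with questions spanning four cognitive levels from information retrieval to advanced system management. Gemini 3.0 Pro with rubric-based prompting achieved the highest human-AI agreement (ICC=0.888, MAE=0.10, bias=-0.014). As the cognitive level of the questions rose, agreement steadily declined, with the largest discrepancies on the highest-level questions. Rubric quality mattered more than model choice, and structured prompts consistently improved agreement. The work offers a framework based on a cognitive taxonomy for determining which questions suit AI-assisted grading and which require human review, together with reusable evaluation protocols and prompt templates.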
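The three agreement statistics quoted in the summaries (ICC, MAE, bias) can be computed from paired human and model scores. The sketch below is illustrative only: the source does not state which ICC form was used, so ICC(2,1) (two-way random effects, absolute agreement, single rater) is assumed, bias is assumed to mean the average of model score minus human score, and the function name `agreement_metrics` is hypothetical.

```python
from statistics import mean


def agreement_metrics(human, model):
    """Return (mae, bias, icc) for two equally long lists of scores.

    Assumptions (not confirmed by the source):
    - bias = mean(model - human), so a negative value means the model
      scores lower than the human grader on average;
    - icc is ICC(2,1): two-way random effects, absolute agreement,
      single rater, with the two graders as the two raters.
    """
    human, model = list(human), list(model)
    if len(human) != len(model) or len(human) < 2:
        raise ValueError("need two score lists of equal length >= 2")

    n, k = len(human), 2  # n responses, k raters
    diffs = [m - h for h, m in zip(human, model)]
    mae = mean(abs(d) for d in diffs)
    bias = mean(diffs)

    # Two-way ANOVA decomposition of the n x k score table.
    rows = list(zip(human, model))
    grand = mean(human + model)
    row_means = [mean(r) for r in rows]
    col_means = [mean(human), mean(model)]
    ss_rows = k * sum((rm - grand) ** 2 for rm in row_means)
    ss_cols = n * sum((cm - grand) ** 2 for cm in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-response mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

    denom = msr + (k - 1) * mse + k * (msc - mse) / n
    if denom == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    icc = (msr - mse) / denom
    return mae, bias, icc
```

Because this ICC form measures absolute agreement, a model that ranks responses exactly as the human does but is systematically stricter or more lenient is still penalised, which is why it is reported alongside the bias.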
Key points
Four frontier LLMs (GPT, Claude Opus, Gemini, GLM) were evaluated on grading 1200 real Linux/bash exam responses across four cognitive levels.
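(Translated from the Chinese) Four large language models, GPT, Claude Opus, Gemini, and GLM, were assessed on their ability to score 1200 real Linux/bash exam answers at four cognitive levels.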
Gemini 3.0 Pro with rubric-enhanced prompting achieved the best agreement with human experts (ICC=0.888, MAE=0.10, bias=-0.014).
(Translated from the Chinese) Gemini 3.0 Pro with rubric-based prompting showed the highest agreement with human experts (ICC=0.888, MAE=0.10, bias=-0.014).
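Agreement between LLMs and human graders declined consistently as the cognitive complexity of the question increased, with the largest errors on advanced system-management tasks.
(Translated from the Chinese) As the cognitive complexity of the questions increased, agreement between the large language models and human scoring steadily declined, with the largest errors on advanced system-management questions.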
The quality of the rubric and prompting had a larger effect on grading accuracy than the choice of LLM provider; structured prompts consistently improved agreement.
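(Translated from the Chinese) The quality of the rubric and the prompt affected grading accuracy more than the choice of model did, and structured prompts consistently improved agreement.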
The study establishes a transferable, taxonomy-based framework and evaluation protocol to decide which exam questions are suitable for AI-assisted grading.
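(Translated from the Chinese) The study establishes a transferable framework and evaluation protocol, based on a cognitive taxonomy, for deciding which exam questions are suitable for AI-assisted grading.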
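One way to picture the taxonomy-based decision described in the key points is a per-level routing rule: measure human-AI agreement separately for each cognitive level, then send a level to AI-assisted grading only when its agreement clears a chosen bar. The following is a minimal sketch under stated assumptions, not the study's protocol: the function name `route_levels`, the labels, and the default threshold of 0.75 are all hypothetical, and the numbers in the example are placeholders rather than results from the study.

```python
def route_levels(icc_by_level, min_icc=0.75):
    """Map each taxonomy level to a grading route.

    icc_by_level: dict of level -> human-AI ICC measured on a pilot set.
    min_icc: hypothetical cut-off; an institution would choose its own.
    """
    return {
        level: "ai_assisted" if icc >= min_icc else "human_review"
        for level, icc in icc_by_level.items()
    }


# Placeholder values for illustration only (not taken from the study).
example = {1: 0.90, 2: 0.85, 3: 0.70, 4: 0.60}
routes = route_levels(example)
```

With these placeholder inputs the two lower levels would be routed to AI-assisted grading and the two higher levels to human review, mirroring the reported trend that agreement falls as cognitive complexity rises.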