Source: arXiv | July 3, 2026 | Importance: 5/5

Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach

English summary

This study evaluates four frontier LLMs (GPT, Claude Opus, Gemini, and GLM) on grading 1200 real student responses to Linux/bash command questions spanning four cognitive levels, from information retrieval to advanced system management. Gemini 3.0 Pro with rubric-enhanced prompting achieved the best human-AI agreement (intraclass correlation coefficient ICC=0.888, mean absolute error MAE=0.10, bias=-0.014). Agreement decreased consistently as question complexity increased, with the largest discrepancies at the higher taxonomy levels. Rubric quality had a larger impact than model choice, and structured prompts consistently improved agreement. The work provides a taxonomy-based framework for deciding which questions are suitable for AI-assisted grading and which need human review, along with reusable evaluation protocols and prompt templates.
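The three agreement figures quoted above (ICC, MAE, bias) can be illustrated with a minimal sketch. The summary does not say which ICC variant the paper uses, so the code below assumes ICC(2,1) (two-way random effects, absolute agreement, single rater); the sign convention for bias (AI score minus human score) and the toy scores are likewise assumptions made for illustration, not data from the study.

```python
from statistics import mean


def mae(human, ai):
    """Mean absolute error between paired human and AI scores."""
    return mean(abs(a - h) for h, a in zip(human, ai))


def bias(human, ai):
    """Signed mean difference, AI minus human (sign convention assumed)."""
    return mean(a - h for h, a in zip(human, ai))


def icc_2_1(human, ai):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    Assumed variant; the source summary does not name the ICC form.
    """
    rows = list(zip(human, ai))          # one row per graded response
    n, k = len(rows), 2                  # n responses, k = 2 raters
    grand = mean(x for row in rows for x in row)
    row_means = [mean(row) for row in rows]
    col_means = [mean(col) for col in zip(*rows)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in rows for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-response mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Toy scores, purely illustrative.
human_scores = [1.0, 0.5, 0.0, 1.0, 0.5, 0.0]
ai_scores = [1.0, 0.5, 0.5, 1.0, 0.0, 0.0]
print(mae(human_scores, ai_scores),
      bias(human_scores, ai_scores),
      icc_2_1(human_scores, ai_scores))
```

Because this ICC form measures absolute agreement, a grader that is perfectly consistent but systematically one point too generous still scores below 1, which is why reporting bias alongside ICC and MAE is informative.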

Key points

Four frontier LLMs (GPT, Claude Opus, Gemini, GLM) were evaluated on grading 1200 real Linux/bash exam responses across four cognitive levels.
Gemini 3.0 Pro with rubric-enhanced prompting achieved the best agreement with human experts (ICC=0.888, MAE=0.10, bias=-0.014).
Agreement between LLMs and human graders declined consistently as the cognitive complexity of the question increased, with the largest errors on advanced system-management tasks.
The quality of the rubric and prompt had a larger effect on grading accuracy than the choice of LLM provider; structured prompts consistently improved agreement.
The study establishes a transferable, taxonomy-based framework and evaluation protocol for deciding which exam questions are suitable for AI-assisted grading.
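The idea of using the taxonomy to decide which questions go to AI-assisted grading and which need human review can be sketched as follows. This is only a hedged illustration: the per-level MAE criterion, the `max_mae` threshold of 0.15, the function names, and the toy records are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def per_level_mae(records):
    """Mean absolute human-AI score difference for each cognitive level.

    records: iterable of (level, human_score, ai_score) tuples.
    """
    errors = defaultdict(list)
    for level, human, ai in records:
        errors[level].append(abs(ai - human))
    return {level: mean(errs) for level, errs in sorted(errors.items())}


def route_levels(records, max_mae=0.15):
    """Label each level 'ai_assisted' or 'human_review'.

    max_mae is a hypothetical cut-off chosen for this sketch.
    """
    return {
        level: "ai_assisted" if level_mae <= max_mae else "human_review"
        for level, level_mae in per_level_mae(records).items()
    }


# Toy validation sample: (taxonomy level, human score, AI score).
toy_records = [
    (1, 1.0, 1.0), (1, 0.5, 0.5),
    (2, 1.0, 0.9), (2, 0.0, 0.1),
    (3, 0.5, 0.7), (3, 1.0, 0.8),
    (4, 0.5, 1.0), (4, 1.0, 0.6),
]
print(route_levels(toy_records))
```

In practice the cut-off would be calibrated on a human-graded validation sample, and a stricter criterion (for example, one that also bounds bias) could replace the single MAE threshold.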
