Tutorial: Building a Stable Colab Workflow for the Fable 5 Traces Dataset — Parsing Tool Calls, Auditing Data, and Training Naive Bayes Baselines
English summary
This tutorial presents a reproducible Colab workflow for the Fable 5 Traces coding-agent dataset (Glint-Research/Fable-5-traces) from Hugging Face. It manually downloads the merged JSONL trace file, then builds pure-Python utilities to extract tool names, arguments, and text payloads from agent outputs. An audit of the dataset includes detection and redaction of potential secret-like patterns (API keys, tokens) and distribution plots for output types, tools, source roots, and text lengths. Safe no‑CoT chat/SFT exports are created with train/validation/test splits by converting each trace into system-user-assistant messages. A pure NumPy TF-IDF + SVD projection visualises context embeddings, and two pure-Python Naive Bayes classifiers are trained to predict assistant output type and tool name from context, with metrics and top tokens saved. The workflow outputs analysis indices, classifier reports, and a keyword search demo, all without requiring fragile scientific libraries.
Chinese summary
本教程为 Hugging Face 上的 Fable 5 Traces 编码智能体数据集 (Glint-Research/Fable-5-traces) 展示了一套可复现的 Colab 工作流。它手动下载合并的 JSONL 轨迹文件,然后构建纯 Python 工具从智能体输出中提取工具名称、参数和文本载荷。审计环节包括检测和脱敏潜在的 API 密钥、token 等秘密模式,并生成输出类型、工具、来源节点和文本长度的分布图。接着将每条轨迹转换为系统-用户-助手的消息格式,生成安全的无 CoT 对话/SFT 导出(训练/验证/测试划分)。通过纯 NumPy 的 TF-IDF + SVD 投影对上下文嵌入进行可视化,并训练两个纯 Python 朴素贝叶斯分类器,根据上下文预测助手的输出类型和工具名称,同时保存评估指标和关键词。工作流最终输出分析索引、分类器报告和关键词搜索演示,全程不依赖脆弱的科学计算库。
Key points
Works with the Fable 5 Traces dataset (Glint-Research/Fable-5-traces) by manually downloading and parsing the merged JSONL in Colab without using the datasets library.
在 Colab 中无需 datasets 库,通过手动下载并解析合并的 JSONL 文件,处理 Fable 5 Traces 数据集 (Glint-Research/Fable-5-traces)。
Implements utilities to extract tool names, arguments, and text payloads from diverse output structures, and audits the dataset for missing fields, duplicates, and potential secret patterns (which are redacted).
实现工具名称、参数和文本载荷的提取实用程序,审计数据集的缺失字段、重复项和潜在秘密模式(并进行脱敏)。
Generates safe no‑CoT chat/SFT exports (train/validation/test) by formatting each trace into system prompt, user context, and assistant target, with optional reasoning trace export.
通过将每条轨迹格式化为系统提示、用户上下文和助手目标,生成安全的无 CoT 对话/SFT 导出(训练/验证/测试),并可选导出推理轨迹。
Trains two pure-Python Naive Bayes classifiers from scratch: one to predict output type (text vs. tool call) and another to predict the specific tool name, using only the context field.
从零开始训练两个纯 Python 朴素贝叶斯分类器:一个预测输出类型(文本或工具调用),另一个仅依据上下文字段预测具体工具名称。
Produces a pure NumPy TF-IDF + SVD projection of context embeddings, keyword search across traces, and a comprehensive analysis report with all artifacts saved for reproducibility.
使用纯 NumPy 实现 TF-IDF + SVD 的上下文嵌入投影,提供跨轨迹的关键词搜索,并生成包含所有产物的综合分析报告以确保可复现性。