ClawHub Security Signals: A Coding Guide to End-to-End Security Signal Analysis and Verdict Classification on the AI Skills Dataset
Summary
This tutorial walks through a complete workflow for analyzing the ClawHub Security Signals dataset: data loading, exploratory analysis, and machine learning. It examines how different security scanners (VirusTotal, static analysis, SkillSpector) assess AI skills and how often their findings agree. A logistic regression pipeline then combines text features from SKILL.md with numerical scanner signals to predict the ClawScan verdict. The model is evaluated on a test set using a confusion matrix and an analysis of misclassified examples. The whole workflow is designed to run in a Colab-friendly environment.
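The scanner-agreement step described in the summary can be sketched roughly as follows. This is a minimal illustration on toy data, not the tutorial's own code: the column names (`virustotal_flag`, `static_flag`, `skillspector_flag`) and the helper `pairwise_agreement` are hypothetical, and it assumes each scanner's result has already been reduced to a binary flag per skill.

```python
from itertools import combinations

import pandas as pd
from sklearn.metrics import cohen_kappa_score, jaccard_score

# Toy stand-in for per-skill scanner flags (1 = flagged, 0 = clean).
# Column names are hypothetical, not taken from the dataset schema.
flags = pd.DataFrame({
    "virustotal_flag":   [1, 0, 1, 0, 0, 1, 0, 0],
    "static_flag":       [1, 0, 0, 0, 1, 1, 0, 0],
    "skillspector_flag": [1, 1, 1, 0, 0, 1, 0, 1],
})


def pairwise_agreement(df: pd.DataFrame) -> pd.DataFrame:
    """Compute Jaccard similarity and Cohen's kappa for every scanner pair."""
    rows = []
    for a, b in combinations(df.columns, 2):
        rows.append({
            "scanner_a": a,
            "scanner_b": b,
            # Overlap of flagged skills relative to skills flagged by either scanner.
            "jaccard": jaccard_score(df[a], df[b]),
            # Agreement corrected for the agreement expected by chance.
            "kappa": cohen_kappa_score(df[a], df[b]),
        })
    return pd.DataFrame(rows)


positive_rates = flags.mean()          # share of skills each scanner flags
agreement = pairwise_agreement(flags)

print(positive_rates)
print(agreement)
```

Jaccard ignores the skills that both scanners consider clean, while kappa accounts for them and discounts chance agreement, so the two numbers can diverge noticeably when positives are rare.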
Key points
Load the ClawHub Security Signals dataset from its Hugging Face Parquet conversion, concatenating shards where the data is split across several files.
Explore the verdict distribution, scanner positive rates, and overlap between scanners using Jaccard similarity and Cohen's kappa.
Visualize data with count plots, bar charts, and box plots to understand class imbalance and scanner behavior.
Build a logistic regression pipeline that combines TF-IDF features from SKILL.md text with numerical scanner features.
Evaluate the classifier on the test set and examine sample misclassifications.
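The modeling and evaluation points above could look something like the sketch below. It is an illustration under stated assumptions rather than the tutorial's actual pipeline: the column names (`skill_md`, `vt_positives`, `static_findings`, `verdict`), the verdict labels, and the toy rows are all hypothetical, and the hyperparameters are ordinary scikit-learn defaults chosen for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical column names; the real dataset schema may differ.
TEXT_COL = "skill_md"
NUM_COLS = ["vt_positives", "static_findings"]
LABEL_COL = "verdict"

# Toy rows standing in for the dataset (labels are illustrative only).
templates = [
    ("Format markdown tables and summarize meeting notes", 0, 0, "benign"),
    ("Translate documents and tidy spreadsheet columns", 0, 1, "benign"),
    ("Download remote script and execute with shell access", 3, 4, "suspicious"),
    ("Read credentials file and upload contents to server", 5, 2, "suspicious"),
]
df = pd.DataFrame(templates * 10, columns=[TEXT_COL, *NUM_COLS, LABEL_COL])

# TF-IDF takes a single column name (a 1-D series of strings);
# the scaler takes a list of numeric columns (a 2-D frame).
features = ColumnTransformer([
    ("text", TfidfVectorizer(), TEXT_COL),
    ("num", StandardScaler(), NUM_COLS),
])
model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

X_train, X_test, y_train, y_test = train_test_split(
    df[[TEXT_COL, *NUM_COLS]], df[LABEL_COL],
    test_size=0.25, stratify=df[LABEL_COL], random_state=0,
)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Rows are true labels, columns are predicted labels, ordered as model.classes_.
cm = confusion_matrix(y_test, pred, labels=model.classes_)
print(pd.DataFrame(cm, index=model.classes_, columns=model.classes_))

# Collect the test rows the model got wrong for manual inspection.
mask = pred != y_test.to_numpy()
misclassified = X_test.assign(true=y_test, predicted=pred)[mask]
print(misclassified.head())
```

Because the toy rows repeat, the scores here say nothing about real performance; the point is only the shape of the pipeline. `class_weight="balanced"` is one common way to compensate for an imbalanced verdict distribution, and stratifying the split keeps the class proportions similar in the training and test sets.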