quicktok: an open-source C++ BPE tokenizer 2–11× faster than tiktoken, with byte-identical output

Summary

quicktok is a new open-source C++ BPE tokenizer that produces token IDs byte-identical to tiktoken while encoding substantially faster. On an Apple M1 (single-threaded), it encodes 2–3.6× faster than bpe-openai and 4–11× faster than tiktoken on benchmarks covering The Pile, code, and web text. To cut memory accesses, the implementation uses a 2-byte trie, dense validity caches, and a hand-compiled pretokenizer in place of a regex. It ships with prebuilt vocabularies for cl100k, o200k, GPT-OSS, Llama-3, and Qwen2.5/3. Install it with `pip install quicktok-v1`; the code is available on GitHub.
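To make the "byte-identical token IDs" claim concrete, here is a minimal sketch of the rank-ordered merge rule that byte-level BPE encoders in the tiktoken family are generally described as using: start from single bytes and repeatedly merge the adjacent pair with the lowest rank. This is not quicktok's code and not its API; `bpe_encode` and `TOY_RANKS` are hypothetical names, and the six-entry vocabulary is made up for illustration. A faster implementation has to reproduce exactly this output for every input, however it organizes its data structures.

```python
# Toy rank table (an assumption for illustration, not a real vocabulary).
# In this scheme a token's rank doubles as its token ID.
TOY_RANKS = {
    b"a": 0, b"b": 1, b"c": 2,
    b"ab": 3, b"abc": 4, b"bc": 5,
}

def bpe_encode(piece: bytes, ranks: dict) -> list:
    """Encode one pretokenized piece with naive lowest-rank-first merging."""
    # Start with one part per input byte.
    parts = [bytes([b]) for b in piece]
    while len(parts) > 1:
        best_rank, best_i = None, None
        # Find the adjacent pair whose concatenation has the lowest rank;
        # the strict '<' keeps the leftmost pair on ties.
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:
            break  # no adjacent pair is in the vocabulary
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return [ranks[p] for p in parts]

print(bpe_encode(b"abcab", TOY_RANKS))  # [4, 3]
```

The loop above rescans every pair on each merge, which is the simple quadratic form of the algorithm; it serves only as a reference against which a faster encoder's output could be compared.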

Key points

Token IDs byte-identical to tiktoken for cl100k, o200k, and other LLM vocabularies.
2–3.6× faster than bpe-openai and 4–11× faster than tiktoken in single-threaded benchmarks on an Apple M1.
Built on a 2-byte trie, dense validity caches, and a hand-compiled pretokenizer that replaces the regex, all aimed at reducing memory accesses.
Supports the cl100k, o200k, GPT-OSS, Llama-3, and Qwen2.5/3 tokenizers out of the box.
Installable via pip (`pip install quicktok-v1`) and fully open source on GitHub.
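As a rough illustration of what replacing a regex with a hand-written pretokenizer means, the sketch below implements a toy split rule twice: once as a regular expression and once as a single left-to-right scan over character classes. The pattern here is a deliberately simplified assumption, not the cl100k or o200k pattern, and the function names are hypothetical; quicktok's actual pretokenizer is C++ and is not shown in the source.

```python
import re

# Toy split rule (an assumption, far simpler than real tokenizer patterns):
# runs of ASCII letters, runs of digits, runs of whitespace, runs of anything else.
TOY_PATTERN = re.compile(r"[A-Za-z]+|[0-9]+|[ \t\n]+|[^A-Za-z0-9 \t\n]+")

def char_class(ch: str) -> int:
    """Map one character to one of the four classes used by TOY_PATTERN."""
    if "a" <= ch <= "z" or "A" <= ch <= "Z":
        return 0
    if "0" <= ch <= "9":
        return 1
    if ch in " \t\n":
        return 2
    return 3

def pretokenize_scan(text: str) -> list:
    """Hand-written scanner: cut wherever the character class changes."""
    pieces = []
    start = 0
    for i in range(1, len(text) + 1):
        # Close the current piece at end of input or on a class change.
        if i == len(text) or char_class(text[i]) != char_class(text[start]):
            pieces.append(text[start:i])
            start = i
    return pieces

def pretokenize_regex(text: str) -> list:
    """Reference implementation using the regex engine."""
    return TOY_PATTERN.findall(text)

sample = "Hello, world 42!"
print(pretokenize_scan(sample))
# ['Hello', ',', ' ', 'world', ' ', '42', '!']
```

Both functions return the same pieces for this rule because every character belongs to exactly one class. The scanner only makes the control flow explicit so that no general-purpose regex engine sits on the hot path.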
