quicktok: an open-source C++ BPE tokenizer, 4–11× faster than tiktoken with byte-identical output
Summary
quicktok is a new open-source BPE tokenizer written in C++ that produces token IDs byte-identical to tiktoken's while running substantially faster. In single-thread benchmarks on an Apple M1, it encodes 2–3.6× faster than bpe-openai and 4–11× faster than tiktoken across The Pile, code, and web text. To cut memory accesses, the implementation uses a 2-byte trie, dense caches, and a hand-compiled pretokenizer in place of regex. It ships with prebuilt vocabularies for cl100k, o200k, GPT-OSS, Llama-3, and Qwen2.5/3. The library can be installed with `pip install quicktok-v1`, and the code is available on GitHub.
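To make the "hand-compiled pretokenizer instead of regex" idea concrete, here is a minimal Python sketch. It is not quicktok's code: the pattern below is a deliberately simplified stand-in (not the real cl100k or o200k pretokenizer pattern), and the names `SIMPLE_PATTERN`, `CLASS_TABLE`, `pretokenize_regex`, and `pretokenize_scan` are hypothetical. It only illustrates how a regex split can be replaced by a single pass over the bytes driven by a dense 256-entry lookup table.

```python
import re

# Simplified stand-in pattern (an assumption for illustration only):
# runs of ASCII letters, digits, ASCII whitespace, or anything else.
SIMPLE_PATTERN = re.compile(
    rb"[A-Za-z]+|[0-9]+|[ \t\n\r\x0b\x0c]+|[^A-Za-z0-9 \t\n\r\x0b\x0c]+"
)

LETTER, DIGIT, SPACE, OTHER = 0, 1, 2, 3


def _build_class_table() -> bytes:
    """Build a dense 256-entry table mapping each byte value to its class."""
    table = [OTHER] * 256
    for b in range(ord("A"), ord("Z") + 1):
        table[b] = LETTER
    for b in range(ord("a"), ord("z") + 1):
        table[b] = LETTER
    for b in range(ord("0"), ord("9") + 1):
        table[b] = DIGIT
    for b in b" \t\n\r\x0b\x0c":
        table[b] = SPACE
    return bytes(table)


CLASS_TABLE = _build_class_table()


def pretokenize_regex(data: bytes) -> list:
    """Reference splitter: let the regex engine find the pieces."""
    return SIMPLE_PATTERN.findall(data)


def pretokenize_scan(data: bytes) -> list:
    """Hand-written splitter: one table lookup per byte, no regex engine."""
    pieces = []
    start, n = 0, len(data)
    while start < n:
        cls = CLASS_TABLE[data[start]]
        end = start + 1
        # Extend the piece while the following bytes share the same class.
        while end < n and CLASS_TABLE[data[end]] == cls:
            end += 1
        pieces.append(data[start:end])
        start = end
    return pieces
```

For this simplified pattern both functions return the same pieces. A production pretokenizer would have to reproduce the real vocabulary's pattern exactly, including its Unicode categories and contraction rules, which this sketch does not attempt.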
Key points
Output is byte-identical to tiktoken's for cl100k, o200k, and other LLM vocabularies.
2–3.6× faster than bpe-openai and 4–11× faster than tiktoken in single-thread benchmarks on an Apple M1.
Engineered with a 2-byte trie, dense validity caches, and a hand-compiled pretokenizer to reduce memory accesses.
Supports the cl100k, o200k, GPT-OSS, Llama-3, and Qwen2.5/3 tokenizers out of the box.
Installable via pip and fully open source on GitHub.
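The two headline claims, identical token IDs and a measurable speedup, can be checked with a small harness like the one below. This is a sketch under stated assumptions: the source does not describe quicktok's Python API, so the harness takes plain callables, and the names `first_mismatch` and `speedup` are hypothetical helpers, not part of any library.

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple

# An encoder is any callable that maps text to a list of token IDs.
Encoder = Callable[[str], List[int]]


def first_mismatch(
    reference: Encoder, candidate: Encoder, samples: Sequence[str]
) -> Optional[Tuple[int, str]]:
    """Return (index, text) of the first sample whose token IDs differ, else None."""
    for i, text in enumerate(samples):
        if list(reference(text)) != list(candidate(text)):
            return i, text
    return None


def speedup(
    reference: Encoder, candidate: Encoder, samples: Sequence[str], repeats: int = 5
) -> float:
    """Return reference_time / candidate_time, using the best of `repeats` runs."""

    def best_time(encode: Encoder) -> float:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            for text in samples:
                encode(text)
            best = min(best, time.perf_counter() - t0)
        return max(best, 1e-12)  # guard against a zero reading on tiny inputs

    return best_time(reference) / best_time(candidate)
```

With tiktoken installed, the reference encoder could be `tiktoken.get_encoding("cl100k_base").encode`; the candidate would be whatever encode function quicktok exposes for the same vocabulary. Reported speedups depend heavily on hardware and on the text being encoded, so results from this harness will not necessarily match the M1 figures above.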