TokenPilot: Cache-Efficient Context Management for LLM Agents
English summary
The paper introduces TokenPilot, a dual-granularity context management framework for long-horizon LLM agents that reduces token footprint while preserving prompt cache continuity. It combines a global Ingestion-Aware Compaction stage, which stabilizes prompt prefixes and filters environmental noise as content enters the context, with a local Lifecycle-Aware Eviction stage, which monitors each segment's utility and evicts a segment only once its task relevance expires. Compared with prior systems, TokenPilot reduces cost on PinchBench and Claw-Eval by 61% and 56%, respectively, in isolated mode, and by 61% and 87% in continuous mode, while maintaining competitive performance. The method has been integrated into the open-source LightMem2 library.
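The tension described above comes from how provider-side prompt caches generally work: they reuse computation only for the longest prefix shared with an earlier request. Rewriting earlier context therefore invalidates everything after the edit, while appending new content leaves the cached prefix intact. The Python sketch below illustrates that effect alongside a toy ingestion-time noise filter. It is not TokenPilot's algorithm. `compact_at_ingestion`, `cached_prefix_len`, the noise prefixes, and the segment-level (rather than token-level) prefix comparison are all hypothetical simplifications chosen for illustration.

```python
from typing import List

# Hypothetical noise markers; TokenPilot's real compaction criteria are not given here.
NOISE_PREFIXES = ("[=", "Downloading", "DEBUG")


def compact_at_ingestion(tool_output: str, max_lines: int = 20) -> str:
    """Toy ingestion-time filter: drop blank/noisy lines and cap the length
    BEFORE the text is appended, so earlier context never needs rewriting."""
    kept = [
        line for line in tool_output.splitlines()
        if line.strip() and not line.lstrip().startswith(NOISE_PREFIXES)
    ]
    if len(kept) > max_lines:
        omitted = len(kept) - max_lines
        kept = kept[:max_lines] + [f"... ({omitted} lines omitted)"]
    return "\n".join(kept)


def cached_prefix_len(prev: List[str], curr: List[str]) -> int:
    """Count leading segments shared by two consecutive prompts.
    Real caches match tokens; whole segments are a coarse proxy."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


history = ["SYSTEM: you are an agent", "USER: list files", "TOOL: a.py\nb.py"]
raw = "DEBUG start\n\nresult: 3 tests passed\nDownloading x"

# Strategy A: compact the new content at ingestion and append it.
appended = history + ["TOOL: " + compact_at_ingestion(raw)]
# Strategy B: retroactively rewrite an earlier segment (e.g. summarize in place).
rewritten = [history[0], "USER: (summarized)"] + history[2:] + ["TOOL: " + compact_at_ingestion(raw)]

print(cached_prefix_len(history, appended))   # 3: every earlier segment is reusable
print(cached_prefix_len(history, rewritten))  # 1: cache breaks at the rewritten segment
```

The point of the comparison is that both strategies shrink what gets added, but only the append-only one keeps the earlier prefix byte-identical for the cache.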
Key points
Proposes TokenPilot, a context management framework that combines global Ingestion-Aware Compaction with local Lifecycle-Aware Eviction to resolve the trade-off between text sparsity and prompt cache continuity.
Global compaction stabilizes prompt prefixes and removes environmental noise at the ingestion gate; local eviction monitors each segment's residual utility and evicts a segment only once its task relevance expires.
Experiments on PinchBench and Claw-Eval show cost reductions of 61% and 56%, respectively, in isolated mode, and 61% and 87% in continuous mode, while maintaining competitive performance.
The framework has been implemented in the open-source LightMem2 library (github.com/zjunlp/LightMem2).
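The local, lifecycle-aware half can be pictured as a per-segment utility score that decays while a segment goes unreferenced and is refreshed when it is used, with eviction restricted to segments whose score has fallen below an expiry threshold. The Python sketch below is a hypothetical illustration under those assumptions. `Segment`, `LifecycleContext`, the exponential `decay`, the `expiry` threshold, and the token `budget` are invented for this example; the paper's actual utility signal is not described in this summary.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Segment:
    seg_id: str
    text: str
    tokens: int
    utility: float = 1.0   # hypothetical residual-utility score in [0, 1]
    pinned: bool = False   # e.g. system prompt or task spec; never evicted


class LifecycleContext:
    """Hypothetical sketch: decay utility each step, refresh it on reference,
    and evict only unpinned segments whose utility fell below `expiry`."""

    def __init__(self, decay: float = 0.5, expiry: float = 0.2, budget: int = 1000):
        self.segments: List[Segment] = []
        self.decay = decay
        self.expiry = expiry
        self.budget = budget

    def add(self, seg: Segment) -> None:
        self.segments.append(seg)

    def total_tokens(self) -> int:
        return sum(s.tokens for s in self.segments)

    def step(self, referenced: Set[str]) -> List[str]:
        # Update residual utility: referenced segments reset to 1.0, others decay.
        for seg in self.segments:
            seg.utility = 1.0 if seg.seg_id in referenced else seg.utility * self.decay
        # Evict lazily: do nothing while under budget, so the cached prefix survives.
        if self.total_tokens() <= self.budget:
            return []
        evicted = [s.seg_id for s in self.segments
                   if not s.pinned and s.utility < self.expiry]
        self.segments = [s for s in self.segments if s.seg_id not in evicted]
        return evicted


ctx = LifecycleContext(decay=0.5, expiry=0.2, budget=500)
ctx.add(Segment("sys", "system prompt", 100, pinned=True))
ctx.add(Segment("log1", "build log", 300))
ctx.add(Segment("spec", "task spec", 200))
print(ctx.step({"spec"}))  # [] : log1 utility 0.5, still relevant
print(ctx.step({"spec"}))  # [] : log1 utility 0.25
print(ctx.step({"spec"}))  # ['log1'] : 0.125 < 0.2 and 600 > 500 tokens
```

Gating eviction on both an expiry threshold and a token budget is one possible design choice. It avoids removing segments that are merely old but still useful, and it limits how often a mid-context removal forces the provider to recompute the prompt suffix.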