EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments
Summary
This paper introduces EvoArena, a benchmark that evaluates LLM agents under progressive environmental changes across terminal, software, and social domains. Existing agents reach only 39.6% average accuracy on it. The authors propose EvoMem, a patch-based memory paradigm that records environmental evolution as structured update histories, so that an agent can reason about environment dynamics from the changes stored in memory. EvoMem yields an absolute gain of 1.5 percentage points on EvoArena and also improves results on the GAIA and LoCoMo benchmarks by 6.1 and 4.8 percentage points, respectively. On chain-level tasks, which require completing a sequence of related subtasks, accuracy rises by 3.7 percentage points. Mechanistic analysis shows that EvoMem's memory evidence preserves evolving environment states more completely.
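To make the idea of a patch-based memory with structured update histories concrete, here is a minimal Python sketch. It is not EvoMem's actual implementation: the `Patch` schema, the `PatchMemory` class, and the example keys and values are all hypothetical, chosen only to show how storing changes (rather than overwriting state) lets an agent recover both the current environment and how it got there.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Patch:
    """One recorded environment change (hypothetical schema, not from the paper)."""
    step: int            # when the change was observed
    key: str             # which piece of environment state changed
    old: Optional[Any]   # value before the change (None if the key was new)
    new: Optional[Any]   # value after the change (None means the key was removed)


@dataclass
class PatchMemory:
    """Append-only update history; state is reconstructed by replaying patches."""
    patches: List[Patch] = field(default_factory=list)

    def record(self, step: int, key: str, new: Optional[Any]) -> None:
        # Assumes patches are recorded in non-decreasing step order.
        old = self.state_at(step).get(key)
        self.patches.append(Patch(step, key, old, new))

    def state_at(self, step: Optional[int] = None) -> Dict[str, Any]:
        """Replay patches up to and including `step` (all patches if None)."""
        state: Dict[str, Any] = {}
        for p in self.patches:
            if step is not None and p.step > step:
                continue
            if p.new is None:
                state.pop(p.key, None)
            else:
                state[p.key] = p.new
        return state

    def history(self, key: str) -> List[Patch]:
        """All recorded changes to one key, oldest first."""
        return [p for p in self.patches if p.key == key]


# Illustrative usage with made-up environment facts.
mem = PatchMemory()
mem.record(1, "config_path", "/etc/app/v1.conf")
mem.record(2, "cli_flag", "--verbose")
mem.record(3, "config_path", "/etc/app/v2.conf")

current_state = mem.state_at()
state_after_first_step = mem.state_at(1)
config_history = mem.history("config_path")
```

Because every patch keeps both the old and the new value, a query such as "what did the config path used to be?" can be answered from `config_history` instead of being lost when the state is updated.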
Key points
The EvoArena benchmark models dynamic environment evolution as sequences of progressive updates in the terminal, software, and social domains.
Current LLM agents average only 39.6% accuracy on EvoArena, exposing a serious robustness gap.
EvoMem, a patch-based memory that records structured update histories, improves results by 1.5 percentage points on EvoArena, 6.1 on GAIA, and 4.8 on LoCoMo.
On chain-level tasks, which consist of sequences of related subtasks, accuracy rises by 3.7 percentage points, a gain attributed to better capture of evolving states in memory evidence.
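The difference between subtask-level and chain-level accuracy can be sketched as follows. This summary does not say how the paper scores chains, so the rule used here (a chain counts as correct only if every subtask in it is correct) is an assumption, and the function names and sample outcomes are invented for illustration.

```python
from typing import List


def subtask_accuracy(chains: List[List[bool]]) -> float:
    """Fraction of individual subtasks solved, pooled over all chains."""
    outcomes = [ok for chain in chains for ok in chain]
    if not outcomes:
        raise ValueError("no subtask outcomes provided")
    return sum(outcomes) / len(outcomes)


def chain_accuracy(chains: List[List[bool]]) -> float:
    """Fraction of chains in which every subtask was solved.

    Assumed scoring rule (not confirmed by the source): one failed
    subtask makes the whole chain count as failed.
    """
    if not chains:
        raise ValueError("no chains provided")
    return sum(all(chain) for chain in chains) / len(chains)


# Made-up outcomes: each inner list is one chain of related subtasks.
results = [
    [True, True, True],
    [True, False, True],
    [True, True, False],
    [True, True],
]

per_subtask = subtask_accuracy(results)  # 9 of 11 subtasks solved
per_chain = chain_accuracy(results)      # 2 of 4 chains fully solved
```

Under this assumed rule a single stale memory entry can sink an entire chain, which is one way a chain-level score could be more sensitive to incomplete tracking of evolving state than a per-subtask score.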