AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents
English summary
The paper introduces AgenticSTS, a bounded-memory contract for long-horizon LLM agents: every decision is made from a fresh user message constructed via typed retrieval, no raw cross-decision transcript is appended, and the prompt size is bounded independently of run length. The contract is instantiated in Slay the Spire 2, a closed-rule deck-building game that requires hundreds of tactical and strategic decisions. In a public online benchmark on the same game, frontier LLMs recorded zero wins at the lowest difficulty, while the developer-reported human win rate at that difficulty is 16%, indicating that the task is hard but not saturated. In an ablation within the authors' harness, a baseline with no triggered strategic skills won 3 of 10 games, and enabling the skill layer raised this to 6 of 10 (a directional result; Fisher exact p≈0.37). The authors release a reproducible testbed comprising 298 completed trajectories with condition tags, frozen memory/skill snapshots, prompt records, and analysis scripts.
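The bounded-memory contract in the summary can be pictured with a small sketch. This is not the authors' harness: the names MemoryItem and build_prompt, the kind labels, the per-kind limits, and the character budget are all hypothetical, chosen only to show how a prompt built from typed retrieval stays the same size no matter how many decisions have already been made.

```python
from dataclasses import dataclass

# Hypothetical names throughout; none of these come from the paper.


@dataclass(frozen=True)
class MemoryItem:
    kind: str     # memory type, e.g. "strategy_note" or "card_fact"
    text: str     # content that may be shown to the model
    score: float  # relevance to the current decision, higher is better


def build_prompt(state: str, store: list[MemoryItem],
                 per_kind: dict[str, int], max_chars: int) -> str:
    """Assemble a fresh user message for a single decision.

    No earlier prompts or model replies are appended. Only the current
    state and at most per_kind[k] retrieved items of each kind k are
    included, and the result is cut to max_chars characters.
    """
    sections = [f"STATE\n{state}"]
    for kind, limit in per_kind.items():
        # Typed retrieval: rank items of this kind and keep the top few.
        hits = sorted((m for m in store if m.kind == kind),
                      key=lambda m: m.score, reverse=True)[:limit]
        if hits:
            body = "\n".join(f"- {m.text}" for m in hits)
            sections.append(f"{kind.upper()}\n{body}")
    # Hard cap, so the bound holds even if the state text is long.
    return "\n\n".join(sections)[:max_chars]


# The store can grow across a run; the prompt does not.
store = [MemoryItem("strategy_note", f"note {i:04d}", float(i))
         for i in range(1000)]
prompt = build_prompt("hp=50", store,
                      {"strategy_note": 3, "card_fact": 2}, max_chars=500)
```

A real harness would more likely budget in tokens than in characters and would score relevance against the current game state; both are simplified here.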
Key points
Proposes a bounded-memory contract in which each decision prompt is assembled from typed retrieval with no raw transcript appended, keeping the prompt bounded regardless of run length.
Instantiates the contract in Slay the Spire 2, a closed-rule deck-building game requiring hundreds of long-horizon tactical and strategic decisions.
A public benchmark records zero wins for frontier LLMs at the lowest difficulty, against a developer-reported human win rate of 16% at that difficulty, indicating the task is hard but not saturated.
An ablation shows that adding a triggered strategic-skills layer to a no-store baseline raises wins from 3/10 to 6/10 (a directional result; Fisher exact p≈0.37).
Releases a reproducible testbed with 298 completed trajectories carrying condition tags, frozen memory/skill snapshots, prompt records, and analysis scripts.
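The ablation figure quoted above can be checked by hand. The sketch below computes a Fisher exact p-value for 3 wins in 10 games against 6 wins in 10 games using only the Python standard library. It assumes a two-sided test, which the summary does not state; the function name is illustrative and is not taken from the authors' analysis scripts.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability of k in the top-left cell
        # with all row and column totals held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(k) for k in range(lo, hi + 1)]
    # Sum every table that is no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(p for p in probs if p <= observed * (1 + 1e-9))


# Rows: baseline, with skill layer. Columns: wins, losses.
p_value = fisher_exact_two_sided(3, 7, 6, 4)
print(round(p_value, 2))  # 0.37
```

Under this assumption the result matches the reported p≈0.37, which is well above the usual 0.05 threshold and is consistent with describing the 3/10 to 6/10 change as directional.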