Causal Consequence-Penalized Learning: Correcting the TD Target for Stochastic Delay and Action Attribution
Abstract
The paper identifies three fundamental flaws in constrained reinforcement learning. First, unknown stochastic consequence delay yields provably incorrect TD targets. Second, the agent's causal effect is conflated with consequences of earlier actions that are already in the pipeline, causing systematic over- or under-penalization. Third, embedding the multiplier into a single Q-function renders Bellman targets non-stationary under multiplier updates. CCPL (Causal Consequence-Penalized Learning) introduces a delay-corrected Bellman operator that learns the full delay distribution and computes an adaptive effective discount, with a novel contraction proof for this setting. It proves that a state-conditioned λ(s) strictly dominates any scalar λ, closing a prior theoretical gap, and replaces the cost estimate with the marginal causal contribution learned via an Interventional Consequence Net pretrained on ground-truth labels from the environment's structural causal model. CCPL maintains separate reward and constraint Q-functions, keeping targets stationary and combining them only at inference. Empirically, CCPL is the only agent among eight baselines to achieve both high reward (+4.84) and full constraint satisfaction (100%) across six environments, including adversarial settings, and its core theorems are machine-verified on every training run.
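To make the idea of an adaptive effective discount concrete, here is a minimal sketch. It assumes, purely for illustration, that the effective discount is the expectation of γ^d under the learned delay distribution, with delays d ≥ 1; the summary does not give the exact formula, and the function name `effective_discount` and the example distribution are hypothetical.

```python
def effective_discount(delay_probs, gamma):
    """Return E[gamma ** d] under an assumed delay distribution.

    delay_probs[k] is the assumed probability that the consequence
    arrives k + 1 steps after the action (delays start at 1 step).
    This form is an illustrative assumption, not the paper's definition.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if any(p < 0 for p in delay_probs) or abs(sum(delay_probs) - 1.0) > 1e-9:
        raise ValueError("delay_probs must be a probability distribution")
    # Each delay d = k + 1 contributes gamma ** d, weighted by its probability.
    return sum(p * gamma ** (k + 1) for k, p in enumerate(delay_probs))


gamma = 0.9
# Hypothetical learned distribution over delays of 1, 2 and 3 steps.
g_eff = effective_discount([0.5, 0.3, 0.2], gamma)
```

Under this assumed form, every delay is at least one step, so each term γ^d is at most γ and the weighted average cannot exceed γ. That matches the kind of bound a contraction argument needs, although the paper's actual operator may differ.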
Key Points
Identifies three flaws in constrained RL: unknown stochastic consequence delay makes TD targets incorrect; the agent's causal effect is conflated with consequences of earlier actions that are already in the pipeline; and Bellman targets become non-stationary when the multiplier is updated.
Proposes a delay-corrected Bellman operator that learns the full delay distribution and computes an adaptive effective discount; proves that the operator is a contraction with constant γ_eff ≤ γ < 1.
Proves that a state-conditioned multiplier λ(s) strictly dominates any scalar λ, closing a theoretical gap in Lagrangian constrained RL.
Replaces the cost estimate with the marginal causal contribution ΔC(s,a,h), estimated via an Interventional Consequence Net pretrained on ground-truth labels from the environment's structural causal model.
Keeps the reward and constraint Q-functions separate, which keeps Bellman targets stationary; the penalized value is computed only at inference and is never baked into TD targets.
Empirically, CCPL is the only agent among eight baselines to achieve both high reward (+4.84) and full constraint satisfaction (100%) across six environments, including adversarial settings; the theorems are machine-verified on each run.
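The separation of reward and constraint Q-functions can be illustrated with a small sketch of inference-time action selection. It assumes the standard Lagrangian form Q_r(s,a) − λ(s)·Q_c(s,a) for the penalized value; the summary does not state the exact combination rule, and the function name and the toy Q-values below are hypothetical.

```python
def penalized_greedy_action(q_reward, q_cost, lam_s):
    """Pick argmax_a of Q_r(s, a) - lambda(s) * Q_c(s, a) for one state.

    q_reward and q_cost are per-action values from two separately
    trained Q-functions; lam_s is the state-conditioned multiplier.
    The combination rule is an assumed Lagrangian form.
    """
    if len(q_reward) != len(q_cost):
        raise ValueError("q_reward and q_cost must cover the same actions")
    # The penalty is applied only here, at inference time; neither
    # Q-function's TD target ever depends on lam_s.
    scores = [qr - lam_s * qc for qr, qc in zip(q_reward, q_cost)]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy values for a single state with three actions (hypothetical).
q_r = [1.0, 2.0, 3.0]
q_c = [0.0, 0.5, 4.0]
unconstrained = penalized_greedy_action(q_r, q_c, lam_s=0.0)
penalized = penalized_greedy_action(q_r, q_c, lam_s=1.0)
```

With a zero multiplier the highest-reward action (index 2) wins; with λ(s) = 1 its large cost value pushes the choice to index 1. Because λ(s) enters only in this final scoring step, changing it does not alter what either Q-function is regressing toward.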