Ion Matei et al. present a framework for aerial wildfire suppression planning that integrates a hybrid neural-cellular automaton fire spread model with gradient-based optimization. The model predicts spatially varying fire behavior from terrain, fuel, and wind inputs, while the intervention module decides binary drop actions with continuous location and orientation parameters. Water and retardant are modeled as distinct interventions: water immediately reduces active burning, while retardant persistently lowers future spread. Aleatoric uncertainty is captured via Monte Carlo sampling of daily fire states, and epistemic uncertainty via spatially correlated prediction-error perturbations. A case study on the 2020 Bear Fire demonstrates the framework's ability to generate coherent suppression schedules and support uncertainty-aware strategy analysis.
Papers | Source: arXiv | Importance: 3/5
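To make the distinction between the two suppressants and the Monte Carlo treatment of aleatoric uncertainty concrete, here is a toy sketch. It is not the authors' hybrid neural-cellular automaton: it assumes a plain probabilistic grid automaton with a uniform spread probability, and every name in it (`simulate`, `expected_burned`, `retardant_factor`) is hypothetical.

```python
import random

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def simulate(size, ignition, p_spread, steps, water=(), retardant=(),
             retardant_factor=0.2, seed=0):
    """Run one stochastic realization; return the number of cells touched by fire."""
    rng = random.Random(seed)
    water, retardant = set(water), set(retardant)
    # Water acts once, at drop time: burning cells in its footprint go out.
    burning = set(ignition) - water
    burned = set(ignition) & water
    for _ in range(steps):
        newly_lit = set()
        for (r, c) in burning:
            for dr, dc in NEIGHBORS:
                cell = (r + dr, c + dc)
                if not (0 <= cell[0] < size and 0 <= cell[1] < size):
                    continue
                if cell in burning or cell in burned or cell in newly_lit:
                    continue
                # Retardant persists: it scales down spread into treated cells
                # for the whole run (0.2 is an arbitrary illustrative factor).
                p = p_spread * (retardant_factor if cell in retardant else 1.0)
                if rng.random() < p:
                    newly_lit.add(cell)
        burned |= burning      # cells burn for one step, then are burned out
        burning = newly_lit
    return len(burned | burning)

def expected_burned(n_runs, **kwargs):
    """Monte Carlo estimate of mean burned area over independent seeds."""
    return sum(simulate(seed=s, **kwargs) for s in range(n_runs)) / n_runs
```

Comparing `expected_burned` for different `water` and `retardant` footprints gives a crude version of the strategy comparison described above. The paper optimizes drop parameters with gradients, which this discrete toy does not attempt.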
The paper reframes shield synthesis in reinforcement learning from a runtime safety mechanism into a design-time analytical tool for assessing network defensibility. It instantiates this via a constrained two-player safety game for network defense, which yields a binary defensibility verdict, the winning region, a shield, and topology-level metrics derived from attractor computation. These formal measures are combined with post-convergence behavior from adversarial multi-agent reinforcement learning to form a defensibility fingerprint. A what-if analysis demonstrates that formal defensibility and operational effectiveness capture distinct aspects of security, with small architectural changes causing large shifts in operational outcomes while leaving formal safety margins nearly unchanged. The work concludes that shield synthesis is most valuable as a framework for answering architectural questions about whether, where, and how a system can be defended.
Papers | Source: arXiv | Importance: 4/5
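The attractor computation behind the defensibility verdict, winning region, and shield can be illustrated with the standard backward fixed-point for two-player safety games. The sketch below is a generic textbook construction, not the paper's constrained game: the state names, the `owner` labels, and the helper names are all hypothetical, and it assumes every state has at least one successor.

```python
from collections import deque

def attacker_attractor(succ, owner, unsafe):
    """States from which the attacker can force a visit to an unsafe state."""
    pred = {s: set() for s in succ}
    for s, targets in succ.items():
        for t in set(targets):
            pred[t].add(s)
    # For defender states: number of moves not yet known to be losing.
    escapes = {s: len(set(succ[s])) for s in succ}
    attractor = set(unsafe)
    queue = deque(attractor)
    while queue:
        t = queue.popleft()
        for s in pred[t]:
            if s in attractor:
                continue
            if owner[s] == "defender":
                escapes[s] -= 1
                if escapes[s] > 0:
                    continue  # the defender still has a safe-looking move
            # Attacker states need one losing move; defender states need all.
            attractor.add(s)
            queue.append(s)
    return attractor

def synthesize_shield(succ, owner, unsafe):
    """Return the defender's winning region and the moves the shield permits."""
    winning = set(succ) - attacker_attractor(succ, owner, unsafe)
    shield = {s: [t for t in succ[s] if t in winning]
              for s in winning if owner[s] == "defender"}
    return winning, shield

# Hypothetical five-state network-defense game.
SUCC = {
    "ok": ["probe", "patched"],
    "probe": ["breach", "ok"],
    "patched": ["patched"],
    "exposed": ["probe"],
    "breach": ["breach"],
}
OWNER = {"ok": "defender", "probe": "attacker", "patched": "defender",
         "exposed": "defender", "breach": "attacker"}

winning, shield = synthesize_shield(SUCC, OWNER, {"breach"})
defensible = "ok" in winning  # binary verdict for initial state "ok"
```

In this toy game the defender wins from `ok` only by moving to `patched`, so the shield removes the move to `probe`. The size of the winning region relative to the state space is one simple example of a topology-level metric that falls out of the same computation.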
The paper introduces FORGE, a benchmark that measures how often search-augmented LLMs recommend fake products when retrieved web pages are polluted. FORGE rewrites real product descriptions into fake ones across 225 products, 15 categories, and 5 consumer scenarios, then tests 12 commercial and open-weights LLMs. A single polluted page causes fooled-recommendation rates of up to 27%, and replacing the top-3 search results raises the rate to 73.8%. Vulnerability varies by category, with less familiar products more easily exploited, and reasoning models sometimes worsen the problem by fabricating social proof. Three defenses are evaluated: skepticism prompting, consensus filtering over model priors, and cross-document evidence. However, skepticism can backfire and filtering may suppress legitimate recommendations.
This paper introduces a data-centric post-training pipeline that applies interpretability protocols to preference datasets, uncovering latent concepts that distinguish preferred from dispreferred model outputs and making them explicit for user feedback. The approach diagnoses undesirable signals such as over-stylization and sycophancy, and mitigates off-target learning by intervening on the learning signal at the concept level. It unifies several interpretability-based training protocols as ways of shaping rewards through feature or data interventions. Empirically, the method amplifies desired properties like safeguards and model personality, turning opaque scalar reward optimization into an auditable process of sculpting the training signal.
The paper introduces ALIGNBEAM, a training-free method that transfers safety alignment from an anchor model to a target specialist during inference, even when they have different vocabularies. It works by translating anchor logits token-by-token into the target vocabulary at each decoding step, then using a small LLM judge to select the safest among K candidate continuations. No model weights are altered, and the safety-utility trade-off can be tuned at deployment. Across both cross-vocabulary and same-vocabulary settings, ALIGNBEAM significantly increases refusal on adversarial safety benchmarks while maintaining task accuracy and keeping inference overhead practical. The results demonstrate that safety alignment can be effectively transferred between model families at inference time without modifying either model.
The paper proposes a reference architecture for runtime governance of production AI agents, addressing the breakdown of traditional data-boundary controls in agentic workflows. The architecture decomposes governance into five planes: a reasoning plane that adjudicates intent and four enforcement planes (network, identity, endpoint, data) that realize decisions. It introduces composite principals with capability attenuation to model authority delegation, stop-anywhere mediation, and a tamper-evident audit substrate. A taxonomy of six interruption primitives generalizes allow/deny. Four correctness invariants are proven, and the design is shown to foreclose seven production-agent threats across five concrete workflows. A reference implementation validates the design: adjudication runs in single-digit microseconds, attenuation correctness and evidence reconstructability hold on every trial, and the audit substrate exhibits exact tamper-evidence. The scope is restricted to governing delegated action, not model behavior, and a live-agent benchmark evaluation is proposed as a next step.
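Two of the mechanisms named above, capability attenuation and a tamper-evident audit substrate, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's reference implementation: attenuation is modeled as set intersection, tamper evidence as a SHA-256 hash chain, and all function names and capability strings are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder digest preceding the first record

def attenuate(parent_caps, requested_caps):
    """Delegation may only narrow authority: the delegate gets the intersection."""
    return frozenset(parent_caps) & frozenset(requested_caps)

def _digest(prev, event):
    payload = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_record(log, event):
    """Append an event whose digest also covers the previous record's digest."""
    prev = log[-1]["digest"] if log else GENESIS
    log.append({"prev": prev, "event": event, "digest": _digest(prev, event)})

def verify(log):
    """Recompute the chain; any edited, reordered, or dropped interior record fails."""
    prev = GENESIS
    for record in log:
        if record["prev"] != prev or record["digest"] != _digest(prev, record["event"]):
            return False
        prev = record["digest"]
    return True

# A user delegates to an agent, which asks for more than the user holds.
user_caps = {"read:tickets", "write:tickets", "send:email"}
agent_caps = attenuate(user_caps, {"read:tickets", "send:email", "delete:db"})

audit_log = []
for action in ("read:tickets", "delete:db"):
    decision = "allow" if action in agent_caps else "deny"
    append_record(audit_log, {"principal": "user/agent",
                              "action": action, "decision": decision})
```

Because each digest covers its predecessor, rewriting an earlier decision invalidates every later record. A plain hash chain does not by itself detect truncation of the most recent records. Catching that requires anchoring the latest digest somewhere outside the log, which this sketch leaves out.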