The paper proposes a unified Dirichlet framework for spatial-temporal risk assessment, proving that a single Dirichlet posterior per cell with an additive evidence-update rule is the unique update–predictor pair satisfying four axioms and is limit-equivalent to seven classical methods (including AHP, Dempster–Shafer, Hawkes, and kernel density estimation). The framework yields both a severity score and a threat characterization from the same posterior. On a large-scale benchmark of 41 regions × 10,000 cells × 365 days, it achieves a one-vs-rest AUROC of 0.666 and a severity AUROC of 0.725, significantly outperforming 15 structured baselines (Holm-corrected p < 10⁻²⁶), while delivering threat characterization accuracy of 79.1%, compared with only 0–26% for competitors with comparable AUROC. Real-world transfer to 1.69M London and 119K Chicago crime events preserves the dual-output advantage, and a pre-registered specialization experiment confirms that the operational configuration beats the matched specialist. The method requires 3.6× less memory than seven independent models (128 vs. 464 bytes/cell) at a throughput of 41K signals/sec.
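The per-cell Dirichlet posterior with an additive evidence update can be illustrated with a minimal sketch. This is not the paper's implementation: the class name `DirichletCell`, the convention that category 0 means "no threat", the severity definition (posterior mass outside category 0), and the arg-max threat characterization are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DirichletCell:
    """Dirichlet posterior for one spatial cell.

    Assumption: category 0 is 'no threat'; categories 1..K-1 are threat types.
    """
    alpha: List[float]  # concentration parameters, one per category

    def update(self, evidence: Sequence[float]) -> None:
        """Additive update: add non-negative evidence weights to alpha."""
        if len(evidence) != len(self.alpha):
            raise ValueError("evidence must have one entry per category")
        if any(e < 0 for e in evidence):
            raise ValueError("evidence weights must be non-negative")
        self.alpha = [a + e for a, e in zip(self.alpha, evidence)]

    def predictive(self) -> List[float]:
        """Posterior mean of the category probabilities: alpha_k / sum(alpha)."""
        total = sum(self.alpha)
        return [a / total for a in self.alpha]

    def severity(self) -> float:
        """Hypothetical severity score: predictive mass on any threat category."""
        return 1.0 - self.predictive()[0]

    def dominant_threat(self) -> int:
        """Hypothetical threat characterization: most probable threat category."""
        probs = self.predictive()
        return max(range(1, len(probs)), key=lambda k: probs[k])


# Uniform prior over three categories, then one batch of evidence.
cell = DirichletCell(alpha=[1.0, 1.0, 1.0])
cell.update([0.0, 3.0, 1.0])
print(cell.predictive(), cell.severity(), cell.dominant_threat())
```

Because the update is a plain sum, the posterior does not depend on the order in which evidence arrives, and both outputs are read from the same stored vector rather than from separate models.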
IncidentMind is a token-budget multi-agent system for autonomous root cause analysis of production AI failures. It pre-syncs Slack, Confluence, and Jira into a HydraDB temporal knowledge graph via MCP, converting all agent queries into a single graph traversal. A tri-tier inference strategy uses minilm-l6 for sync-time tasks, quantized Llama-3-14B for agent reasoning, and GPT-4o-mini only when confidence falls below 85%, reducing per-incident cost from $1.50 to $0.003. Structured token budgeting compresses 50,000 raw log tokens to 1,050 tokens (98% reduction). Across 847 production incidents, IncidentMind achieved 91% fix accuracy and reduced mean time to detect from 4.2 hours to 3 minutes.
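The confidence-gated escalation and the token-budget arithmetic described above can be sketched as follows. This is an illustrative outline only: `answer_with_escalation`, the `Model` callable type, and the stub models are hypothetical names, and the sketch assumes each model returns an answer together with a confidence in [0, 1], which the summary does not specify.

```python
from typing import Callable, Tuple

# Hypothetical interface: a model maps a prompt to (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]

CONFIDENCE_THRESHOLD = 0.85  # escalate only when confidence falls below 85%


def answer_with_escalation(
    prompt: str,
    local_model: Model,
    fallback_model: Model,
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Tuple[str, str]:
    """Try the local model first; call the fallback only on low confidence.

    Returns the answer and the name of the tier that produced it.
    """
    answer, confidence = local_model(prompt)
    if confidence < threshold:
        answer, _ = fallback_model(prompt)
        return answer, "fallback"
    return answer, "local"


def token_reduction(raw_tokens: int, compressed_tokens: int) -> float:
    """Fraction of tokens removed by compression."""
    return 1.0 - compressed_tokens / raw_tokens


# Stub models standing in for the local and hosted tiers.
def confident_local(prompt: str) -> Tuple[str, float]:
    return "local answer", 0.93


def unsure_local(prompt: str) -> Tuple[str, float]:
    return "local answer", 0.40


def hosted(prompt: str) -> Tuple[str, float]:
    return "hosted answer", 0.99


print(answer_with_escalation("why did the job fail?", confident_local, hosted))
print(answer_with_escalation("why did the job fail?", unsure_local, hosted))
print(f"{token_reduction(50_000, 1_050):.1%}")
```

With the figures quoted above, 1,050 of 50,000 tokens corresponds to a reduction of about 97.9%, which rounds to the stated 98%.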
The paper introduces Self-Aligned Reward (SAR), a fine-grained RL signal that complements verifiable rewards to improve both accuracy and efficiency of LLM reasoning. SAR is defined as the relative perplexity difference between a query-conditioned answer and the standalone answer, thereby favoring concise, query-specific responses and penalizing redundancy. Quantitative analysis confirms that SAR reliably ranks answer quality, assigning higher scores to concise correct answers than to verbose ones. Integrating SAR with PPO or GRPO reduces average answer length by 30% while boosting accuracy by 4% across four model families and seven benchmarks, with strong out-of-domain generalization. The approach achieves a Pareto-optimal frontier between correctness and efficiency, shortening unnecessary elaboration without hurting advanced reasoning behaviors. Code and data are publicly released.
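The idea of a relative perplexity difference can be made concrete with a short sketch. The exact form below is an assumption, not taken from the summary: it normalizes by the standalone perplexity and orients the sign so that an answer that becomes more predictable once the query is given scores higher. The function names and the example log-probabilities are hypothetical, and the token log-probabilities would in practice come from the policy model.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def self_aligned_reward(
    logprobs_given_query: Sequence[float],
    logprobs_standalone: Sequence[float],
) -> float:
    """Assumed form: (ppl(a) - ppl(a | q)) / ppl(a).

    Positive when conditioning on the query makes the answer more predictable,
    i.e. when the answer is specific to the query rather than generic.
    """
    ppl_conditioned = perplexity(logprobs_given_query)
    ppl_standalone = perplexity(logprobs_standalone)
    return (ppl_standalone - ppl_conditioned) / ppl_standalone


# Made-up log-probabilities for a two-token answer.
query_specific = self_aligned_reward([-0.1, -0.1], [-1.0, -1.0])
generic = self_aligned_reward([-1.0, -1.0], [-1.0, -1.0])
print(query_specific, generic)
```

Under this assumed form, an answer whose likelihood is unchanged by the query receives a reward of zero, so generic or redundant text earns no credit from this term; it would be combined with the verifiable correctness reward during PPO or GRPO training.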