The paper introduces NEXIS, a method for identifying heterogeneous treatment effects (HTEs) in controlled experiments by re-framing the problem as Markov-blanket discovery on sufficient, aligned multi-modal pre-treatment representations. NEXIS iteratively selects latent interactors with provably consistent selection, avoiding spurious causal characterizations that arise from unmeasured effect modifiers. The approach is deployed on two anti-poverty programs in Africa, augmenting each with satellite imagery to capture previously unmeasured environmental modifiers. The results produce novel, interpretable prescriptive guidelines for optimizing the programs' next iterations.
Researchers introduce TuneJury, an open instance-level pairwise reward model for text-to-music that predicts preference scores from a text prompt and an audio clip. The model is trained on publicly available human-preference labels including arena votes, metric-alignment pairs, crowdsourced comparisons, and expert aesthetic ratings. Its score margin is well-calibrated on a held-out test split, enabling data filtering via a simple threshold, and it generalizes to out-of-distribution benchmarks. For generators released after training, the paper proposes anchor calibration, a post-hoc Bradley-Terry calibration that recovers agreement efficiently without retraining. The frozen reward drives consistent gains in three downstream tasks: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available open-source on GitHub.
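As a rough illustration of how a pairwise score margin can be used, the sketch below maps a margin to a preference probability, filters pairs by a margin threshold, and performs best-of-N selection. It assumes a standard Bradley-Terry (logistic) link between score margin and preference; the function names, the threshold value, and the toy reward are hypothetical and are not taken from the TuneJury code.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that item A is preferred over item B.

    Assumption: preference follows a logistic function of the score margin.
    """
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def filter_pairs(pairs, min_margin: float):
    """Keep only (score_a, score_b) pairs whose absolute margin is at least min_margin."""
    return [(a, b) for a, b in pairs if abs(a - b) >= min_margin]


def best_of_n(candidates, reward_fn):
    """Return the candidate with the highest reward under a frozen reward function."""
    return max(candidates, key=reward_fn)


if __name__ == "__main__":
    print(preference_probability(1.2, 0.4))
    print(filter_pairs([(1.0, 0.9), (2.0, 0.5)], min_margin=0.5))
    # Toy reward (string length) stands in for a real text-to-music reward model.
    print(best_of_n(["a", "bbb", "cc"], reward_fn=len))
```

In practice the reward function would score a (prompt, audio) pair; the string-length reward here only keeps the example self-contained.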
ARXIV · Highlight
This paper proposes a Bayesian inference framework to audit frontier AI evaluations using public leaderboard archives such as LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, and tau-bench. It demonstrates that terminal-only performance claims are ambiguous: a single snapshot can be compatible with vastly different pre-terminal histories, whose time to approach a ceiling differs by a factor of more than three. Experiments show that a candidate selection-aware frontier model fails on synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration, so the audit gates reject its stronger claims. The study introduces an archive-and-adjudication protocol that reconstructs public evaluation histories, isolates a verified timing boundary, and falsifies unsupported frontier assertions, providing a rigorous method for interpreting leaderboard data.
ARXIV
This paper proves that computing an approximate stationary point of a min-max optimization problem over the hypercube is PPAD-hard when the objective is a quadratic polynomial. The hardness result holds even under strong restrictions: the polynomials are multilinear, each variable appears in at most three monomials, and the desired approximation factor is only inverse polynomial. As a direct corollary, the authors obtain the first PPAD-hardness results for two-team zero-sum polymatrix games. This establishes a fundamental computational barrier for a simple class of min-max problems.
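To make the notion of an approximate stationary point concrete, the sketch below checks one common definition on a toy instance: a point is ε-stationary if a projected gradient descent step for the min player and a projected ascent step for the max player each move the point by at most ε. Both the definition and the objective f(x, y) = (x - 0.5)(y - 0.5) on [0, 1]² are assumptions chosen for illustration; the paper's exact definition and hard instances may differ.

```python
def clip01(v: float) -> float:
    """Project a scalar onto the interval [0, 1]."""
    return min(1.0, max(0.0, v))


def grad(x: float, y: float):
    """Gradient of the toy multilinear objective f(x, y) = (x - 0.5) * (y - 0.5)."""
    return (y - 0.5, x - 0.5)


def stationarity_gap(x: float, y: float) -> float:
    """Largest movement caused by one projected descent (x) / ascent (y) step."""
    gx, gy = grad(x, y)
    x_step = clip01(x - gx)  # min player descends in x
    y_step = clip01(y + gy)  # max player ascends in y
    return max(abs(x_step - x), abs(y_step - y))


def is_approx_stationary(x: float, y: float, eps: float) -> bool:
    """True if (x, y) is eps-stationary under the assumed definition."""
    return stationarity_gap(x, y) <= eps


if __name__ == "__main__":
    print(is_approx_stationary(0.5, 0.5, eps=1e-6))  # interior saddle point
    print(is_approx_stationary(0.0, 0.0, eps=0.1))   # corner of the square
```

Verifying a candidate point in this way is cheap; the hardness result concerns finding such a point, not checking one.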
ARXIV
ARXIV · Highlight
ActiveSAM is a training-free, zero-shot inference framework that adapts SAM 3 for open-vocabulary semantic segmentation by pruning the full dataset vocabulary to an image-conditional active subset via a low-resolution presence preview. Only the retained classes are decoded at full resolution using the frozen SAM 3 decoder with bucketed prompt multiplexing and margin-aware background calibration. On eight OVSS benchmarks, ActiveSAM outperforms the prior state-of-the-art SegEarth-OV3 by +1.4 mIoU on average while running up to 5.5× faster on large-vocabulary datasets. The method requires no target-dataset training, no weight updates, and no oracle class-presence labels. It also exhibits strong robustness under image corruption, making it suitable for noisy-input domains like autonomous driving. Code is available at https://github.com/VILA-Lab/ActiveSAM.