The paper presents SpatialClaw, a training-free framework that uses code execution as the action interface for agentic spatial reasoning. It maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, allowing a VLM-backed agent to write one executable cell per step based on all prior outputs. Evaluated on 20 static and dynamic 3D/4D spatial reasoning benchmarks, SpatialClaw achieves an average accuracy of 59.9%, outperforming the prior spatial agent by 11.2 percentage points. The gains are consistent across six vision-language model backbones from two model families, with no benchmark‑ or model‑specific tuning. The results demonstrate that a flexible, iterative code‑based interface significantly outperforms single‑pass or structured tool‑call designs for open‑ended spatial tasks.
Papers | Source: arXiv | Importance: 4/5
The paper presents Agents-K1, an end-to-end pipeline that transforms raw documents into agent-native scientific knowledge graphs. It combines a multimodal parser using a five-module schema to capture entities, evidence, citations, and typed cross-entity relations from full papers, a 4B information-extraction backbone trained with GRPO under a rule-based reward, and a GraphAnything CLI that unifies web search, multimodal graph retrieval, and cross-document traversal. The authors process 2.46 million scientific papers across six subjects to construct Scholar-KG and release a one-million-paper subset. Experiments show superior performance on scientific information extraction, knowledge graph construction, and multi-hop scientific reasoning. The pipeline is extensible to general-domain corpora and schema-conformant data synthesis.
Papers | Source: arXiv | Importance: 3/5
The paper analyzes on-policy distillation (OPD), a post-training method that combines on-policy student trajectories with dense teacher supervision. The study finds that OPD-style updates are small and coordinate-sparse, spread across layers, and concentrated in FFN parameters. Training only the discovered sparse subnetwork recovers nearly full OPD performance, but the sparsity-inducing SGD optimizer underperforms AdamW because dense supervision preserves heterogeneous gradient scales that benefit from adaptive scaling. Geometrically, the updates are numerically full-rank but spectrally concentrated, lying away from the principal singular subspaces of the source weights and disproportionately on coordinates where source weights are near zero. The results show that OPD retains geometric signatures of on-policy post-training rather than behaving as dense parameter rewriting.
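To make "coordinate-sparse" and "near-zero source weights" concrete, here is a minimal sketch of how such statistics could be computed for a flattened weight vector. It is not the paper's measurement code: the function names, the thresholds `update_tol` and `near_zero`, and the toy weights are all assumptions chosen for illustration.

```python
def update_stats(source, tuned, update_tol=1e-6, near_zero=1e-2):
    """Return (sparsity, frac_on_small) for a flattened weight vector.

    sparsity      : fraction of coordinates left (almost) unchanged
    frac_on_small : fraction of changed coordinates whose source weight
                    is near zero
    Both thresholds are hypothetical, not values from the paper.
    """
    deltas = [t - s for s, t in zip(source, tuned)]
    changed = [i for i, d in enumerate(deltas) if abs(d) > update_tol]
    sparsity = 1.0 - len(changed) / len(deltas)
    on_small = sum(1 for i in changed if abs(source[i]) < near_zero)
    frac_on_small = on_small / len(changed) if changed else 0.0
    return sparsity, frac_on_small


def masked_step(weights, grads, mask, lr):
    """Plain gradient step applied only where mask is True.

    Illustrates training a fixed sparse subnetwork; the optimizer used
    in the paper may differ.
    """
    return [w - lr * g if m else w for w, g, m in zip(weights, grads, mask)]


# Toy example (made-up numbers): two of six coordinates change,
# and both started near zero.
source = [0.5, 0.0, -0.3, 0.001, 0.8, -0.002]
tuned = [0.5, 0.02, -0.3, -0.015, 0.8, -0.002]
sparsity, frac_on_small = update_stats(source, tuned)
```

In this toy case four of six coordinates are untouched, and every changed coordinate had a near-zero source weight, which is the qualitative pattern the summary describes.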
The paper evaluates four families of speech representations for speech-driven 3D facial animation, comparing facial reconstruction quality across two facial decoders using objective metrics and perceptual evaluation. It also includes probing analyses linking tokenized representations to phonetic units and articulatory deformations. The study finds that encoding phonetic classes improves facial animation accuracy, and that semantic and label-based representations achieve comparable performance. Building on the label-based representations, the authors propose an Audio Visual Text-to-Speech (AVTTS) pipeline that uses discrete representations as a shared space to decode both speech and 3D facial motion.
DIRECT is a routing framework that dynamically allocates test-time compute per prompt in embodied Vision-Language Model (VLM) planners by analyzing multimodal scene context. It examines three scaling axes—chain-of-thought depth, model size, and memory history—and reveals that naively scaling test-time compute yields uneven and often diminishing returns. Experiments on VLABench and RoboMME demonstrate that DIRECT significantly improves the success–cost Pareto frontier over fixed model selection. Validation on a physical Franka arm in a DROID setup shows that the router matches or exceeds a stronger model's success rate while cutting average latency by up to 65%. The results confirm that intelligent compute allocation enables frontier-level embodied planning at a fraction of the cost.
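The idea of per-prompt compute allocation can be sketched as a cheapest-feasible selection over planner configurations. This is only an illustration: the summary does not describe how DIRECT's router is built, so the `Config` fields, the `capability` scores, the configuration names, and the relative costs below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str          # hypothetical label for a planner setup
    cost: float        # hypothetical relative latency/compute cost
    capability: float  # hypothetical: hardest difficulty it handles, in [0, 1]


# Made-up configurations spanning the three scaling axes
# (chain-of-thought depth, model size, memory history).
CONFIGS = [
    Config("small_no_cot", cost=1.0, capability=0.3),
    Config("small_cot", cost=2.0, capability=0.6),
    Config("large_cot_memory", cost=5.0, capability=0.9),
]


def route(difficulty, configs):
    """Pick the cheapest config expected to handle the prompt.

    `difficulty` stands in for whatever score a router derives from the
    multimodal scene context. If no config is judged sufficient, fall
    back to the most capable one.
    """
    feasible = [c for c in configs if c.capability >= difficulty]
    if feasible:
        return min(feasible, key=lambda c: c.cost)
    return max(configs, key=lambda c: c.capability)


easy = route(0.2, CONFIGS)
medium = route(0.5, CONFIGS)
hard = route(0.95, CONFIGS)
```

Easy prompts are sent to the cheapest setup and only hard ones pay for the expensive one, which is how a router can lower average latency without lowering success on difficult cases.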
The paper proposes Latent World Recovery (LWR), a framework for multimodal learning under missing modalities. LWR aligns modality-specific embeddings in a shared latent space and fuses only the modalities available at inference time to construct a unified representation, avoiding explicit imputation of missing data. Each modality is treated as a partial observation of an underlying latent state, and availability-aware representation learning is performed directly from observed modalities. Evaluated on real-world incomplete multi-omics benchmarks, LWR shows effective performance on cancer phenotype classification and survival prediction, addressing bioscience scenarios where not all modalities are always present.
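Fusing only the available modalities, without imputing the missing ones, can be sketched as follows. The sketch assumes the embeddings already live in a shared latent space and uses a simple mean as the fusion operator; the summary does not state LWR's actual fusion rule, and the function name and modality names are hypothetical.

```python
def fuse_available(embeddings):
    """Average the embeddings of the modalities that are present.

    embeddings: dict mapping modality name -> list of floats, or None
    when that modality is missing for this sample. Missing modalities
    are skipped rather than imputed. Mean fusion is an assumption.
    """
    present = [v for v in embeddings.values() if v is not None]
    if not present:
        raise ValueError("at least one modality must be available")
    dim = len(present[0])
    if any(len(v) != dim for v in present):
        raise ValueError("all embeddings must share the latent dimension")
    return [sum(v[i] for v in present) / len(present) for i in range(dim)]


# Toy sample with one missing omics modality (made-up values).
sample = {
    "rna": [1.0, 2.0],
    "methylation": None,
    "protein": [3.0, 4.0],
}
fused = fuse_available(sample)
```

The fused vector has the same dimensionality however many modalities are present, so a downstream classifier or survival model can consume it unchanged.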