The paper presents EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. It argues the key bottleneck has shifted from designing agent workflows to engineering agent environments that amplify productive behaviors and suppress harmful ones. EurekAgent engineers environments across four dimensions: permissions engineering for bounded execution and isolated evaluation, artifact engineering for filesystem and Git-based collaboration, budget engineering for budget-aware exploration, and human-in-the-loop engineering for easy oversight. The system achieves new state-of-the-art results on mathematics, kernel engineering, and machine learning tasks, including a novel 26-circle packing solution discovered with under $11 total API cost. Code and results are open-sourced, and the authors call for environment engineering as a core research direction for reliable autonomous research agents.
Papers | Source: arXiv | Importance: 4/5
The paper introduces SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, comprising 31 datasets across 7 task types. Evaluation of 31 embedding models shows that large instruction-tuned multilingual models perform best, while existing Slovak-specific NLU models transfer poorly to embedding tasks. The authors develop e5-sk-small (45M parameters) and e5-sk-large (365M parameters) by vocabulary trimming and fine-tuning Multilingual E5 models. Despite size reductions of up to 62%, these open-source models perform competitively with proprietary APIs and are suitable for local deployment in semantic search and RAG. The benchmark, models, datasets, and code are released openly, offering a replicable path for other under-resourced languages.
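The summary names vocabulary trimming without giving the procedure, so the following is only a minimal sketch of the general idea: keep the token ids observed in a target-language corpus (plus special tokens), renumber them, and slice the embedding table to match. The function name `trim_vocabulary`, the toy vocabulary, and the one-dimensional embeddings are all hypothetical and are not taken from the paper.

```python
def trim_vocabulary(vocab, embeddings, corpus_token_ids, always_keep=()):
    """Keep only token ids seen in the corpus (plus always_keep) and renumber them.

    vocab: dict mapping token string -> old id
    embeddings: list of embedding rows indexed by old id
    Returns (new_vocab, new_embeddings, old_to_new).
    """
    keep = sorted(set(corpus_token_ids) | set(always_keep))
    old_to_new = {old: new for new, old in enumerate(keep)}
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    new_vocab = {id_to_token[old]: new for old, new in old_to_new.items()}
    new_embeddings = [embeddings[old] for old in keep]
    return new_vocab, new_embeddings, old_to_new


# Toy multilingual vocabulary (hypothetical); id 0 is a special token.
vocab = {"<pad>": 0, "the": 1, "mač": 2, "ka": 3, "猫": 4}
embeddings = [[0.0], [0.1], [0.2], [0.3], [0.4]]

# Token ids observed when tokenizing a (hypothetical) Slovak corpus.
corpus_token_ids = [2, 3, 2]

new_vocab, new_embeddings, old_to_new = trim_vocabulary(
    vocab, embeddings, corpus_token_ids, always_keep=(0,)
)
reduction = 1 - len(new_vocab) / len(vocab)  # fraction of vocabulary rows removed
```

In a real model the embedding table is a tensor and the tokenizer files must be rewritten consistently with `old_to_new`; the sketch only shows the bookkeeping that makes the size reduction possible.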
Papers | Source: arXiv | Importance: 4/5
The paper presents ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts like model cards, datasets, and evaluation reports. It addresses challenges of defining and reconciling dependencies by formalizing direct vs. indirect relations and resolving artifact identities across inconsistent documentation. Applied to four public LLM releases, ModSleuth recovered 1,060 source-verified dependencies, revealing multi-hop license obligations, train-evaluation coupling, and discrepancies between released and training-time artifacts. The system and dependency graphs are released to enable transparent analysis of increasingly complex LLM development ecosystems.
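To illustrate what a multi-hop license obligation looks like in a dependency graph, here is a small sketch that walks a graph breadth-first and records how many hops separate each artifact from a release. The artifact names, licenses, and the `transitive_dependencies` helper are invented for illustration; they do not come from ModSleuth or its released graphs.

```python
from collections import deque


def transitive_dependencies(graph, root):
    """Return {artifact: hop_count} for every artifact reachable from root."""
    hops = {}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        for dep in graph.get(node, []):
            if dep != root and dep not in hops:
                hops[dep] = depth + 1
                queue.append((dep, depth + 1))
    return hops


# Hypothetical release: edges point from an artifact to what it was built from.
graph = {
    "model-A": ["dataset-X", "model-B"],
    "model-B": ["dataset-Y"],
    "dataset-Y": ["dataset-Z"],
}
licenses = {
    "dataset-X": "MIT",
    "model-B": "Apache-2.0",
    "dataset-Y": "CC-BY-NC-4.0",
    "dataset-Z": "CC-BY-4.0",
}

hops = transitive_dependencies(graph, "model-A")

# Indirect dependencies (more than one hop away) whose licenses still apply.
indirect = {name: licenses[name] for name, h in hops.items() if h > 1}
```

Here `dataset-Y` never appears in the direct dependencies of `model-A`, yet its non-commercial license is reachable through `model-B`, which is the kind of obligation that only surfaces once the graph is followed past the first hop.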
Claw-SWE-Bench is a multilingual SWE-bench-style benchmark with 350 issue-resolution instances across 8 languages and 43 repositories, designed to fairly compare heterogeneous agent harnesses (claws) through a standardized adapter protocol that includes fixed prompts, runtime budgets, and patch extraction. A cost-aware Lite subset of 80 instances is provided for faster validation. Using the same GLM 5.1 backbone, OpenClaw's Pass@1 jumps from 19.1% with a minimal direct-diff adapter to 73.4% with the full adapter, demonstrating that adapter design is essential for harness performance. A sweep over nine models and five harnesses shows that model choice and harness choice independently shift Pass@1 by about 29 pp and 27 pp, respectively, while total API cost varies substantially even among systems with similar accuracy. The benchmark thus treats harness architecture and cost as first-class evaluation axes for coding agents.
PROJECTMEM is an open-source, local-first memory and judgment layer that logs AI coding agent development as an append-only, plain-text event stream and projects it into compact, AI-readable summaries via the Model Context Protocol (MCP). It includes a deterministic pre-action gate that warns the agent before it repeats a previously failed fix or edits a file with a record of fragility, framed as Memory-as-Governance. The system runs fully offline, serves as a provenance trail, and ships as a 3-dependency Python package with 14 MCP tools, 19 CLI commands, and 37 automated tests. A two-month self-study across 10 projects and 207 logged events demonstrates that it eliminates the 5,000–20,000 tokens typically spent re-deriving context each session and prevents redundant debugging attempts.
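The combination of an append-only plain-text log and a deterministic pre-action gate can be sketched in a few lines. This is an illustrative toy, not PROJECTMEM's actual schema or API: the event fields (`kind`, `file`, `fix`), the function names, and the `fragile_after` threshold are all assumptions made for the example.

```python
import json


def append_event(log_lines, event):
    """Append one event as a JSON line; earlier lines are never rewritten."""
    log_lines.append(json.dumps(event, sort_keys=True))


def pre_action_gate(log_lines, action, fragile_after=2):
    """Deterministically check a proposed action against the log.

    Returns a list of warning strings; an empty list means no objection.
    """
    events = [json.loads(line) for line in log_lines]
    failures_on_file = [
        e for e in events
        if e["kind"] == "fix_failed" and e["file"] == action["file"]
    ]
    warnings = []
    if any(e["fix"] == action["fix"] for e in failures_on_file):
        warnings.append("repeat of a previously failed fix")
    if len(failures_on_file) >= fragile_after:
        warnings.append("file has a record of fragility")
    return warnings


# Hypothetical session history.
log = []
append_event(log, {"kind": "fix_failed", "file": "auth.py", "fix": "bump timeout"})
append_event(log, {"kind": "fix_failed", "file": "auth.py", "fix": "retry on 401"})
append_event(log, {"kind": "fix_succeeded", "file": "db.py", "fix": "add index"})

repeat = pre_action_gate(log, {"file": "auth.py", "fix": "bump timeout"})
fresh = pre_action_gate(log, {"file": "db.py", "fix": "add index"})
```

Because the gate is a pure function of the log and the proposed action, the same history always produces the same warnings, which is what makes the check auditable without calling a model.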
iOSWorld is the first interactive benchmark built on a native iOS simulator with a persistent user identity across 26 newly built apps. It includes 133 tasks across single-app, multi-app, and memory/personalization categories, testing agents' ability to reason over personal data. Evaluated models achieve at most 52% overall accuracy, with multi-app tasks proving especially challenging at 37%. The benchmark is released open-source, including all apps, seeded data, and evaluation code.