Cleo is an open-source text-to-SQL model built by fine-tuning Qwen3.5-2B-Base, designed to fit full analyst behavior into a 2B-parameter model. The system uses the same structured harness for training, evaluation, and inference, implementing a gather-repair-answer contract that feeds live execution evidence into the candidate query search. Key design choices include co-optimizing the model contract, SQL safety layer, dialect handling, timeouts, and clarification behavior. The model, harness, and datasets are fully open-source on GitHub and Hugging Face. The project demonstrates how tightly coupling training and inference in a single harness can enable small models to handle complex SQL generation and interactive debugging.
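A minimal sketch of what a gather-repair-answer loop with live execution evidence might look like, using Python's built-in sqlite3 module. This is not Cleo's harness: `is_read_only`, `run_candidate`, `gather_repair_answer`, and `stub_propose` are hypothetical names, the safety check is a toy, and the stub stands in for the model. Timeouts, dialect handling, and clarification behavior are left out.

```python
import sqlite3


def is_read_only(sql):
    """Toy safety layer (assumption): allow one SELECT or WITH statement only."""
    stripped = sql.strip().rstrip(";").strip()
    if ";" in stripped:
        return False  # reject stacked statements
    return stripped.lower().startswith(("select", "with"))


def run_candidate(conn, sql):
    """Execute one candidate query and return (rows, error) as execution evidence."""
    if not is_read_only(sql):
        return None, "blocked by safety layer"
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


def gather_repair_answer(conn, propose, max_rounds=3):
    """propose(feedback) returns SQL; feedback is the previous error or None."""
    feedback = None
    for _ in range(max_rounds):
        sql = propose(feedback)                 # gather: ask for a candidate
        rows, error = run_candidate(conn, sql)  # collect live evidence
        if error is None:
            return {"sql": sql, "rows": rows}   # answer
        feedback = error                        # repair: feed the error back
    return {"sql": None, "rows": None, "error": feedback}


# Hypothetical in-memory database for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 32.5)])


def stub_propose(feedback):
    # Stand-in for the model: the first attempt uses a wrong column name,
    # the second attempt is issued after seeing the execution error.
    if feedback is None:
        return "SELECT SUM(amount) FROM orders"
    return "SELECT SUM(total) FROM orders"


result = gather_repair_answer(conn, stub_propose)
```

Here the first candidate fails at execution time, the error string becomes the feedback, and the second candidate is returned together with its rows. The same loop could serve training, evaluation, and inference if all three call it through one interface, which is the coupling the summary describes.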
Reddit user /u/summerday10 released FeynRL, an open-source framework designed to make reinforcement learning post-training for large language models, vision-language models, and agents fully transparent and modifiable. The framework exposes the entire training loop (data loading, rollout generation, reward computation, loss construction, optimization, and evaluation) so researchers can develop new algorithms without fighting hidden systems. It currently includes examples for supervised fine-tuning, DPO, and RL-style training and supports single-GPU, multi-GPU, and cluster setups. The project was motivated by the belief that open weights alone are insufficient; open training codebases that keep algorithms explicit and systems separate are necessary for advancing open ML/AI research.
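To illustrate what an explicit training loop means in practice, here is a toy REINFORCE loop in plain Python where each stage is a separate, visible function. It does not use FeynRL or reflect its API; the two-action task, the function names, and the hyperparameters are all assumptions made for illustration, and data loading is omitted because the toy task has no dataset.

```python
import math
import random

REWARDS = [0.0, 1.0]  # assumed toy task: action 1 is always the better one


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rollout(logits, rng):
    """Rollout generation: sample one action from the current policy."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs)[0]
    return action, probs


def reward_fn(action):
    """Reward computation."""
    return REWARDS[action]


def policy_gradient(action, probs, reward):
    """Objective construction: gradient of reward * log pi(action) w.r.t. logits."""
    return [reward * ((1.0 if i == action else 0.0) - p) for i, p in enumerate(probs)]


def train(steps=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        action, probs = rollout(logits, rng)
        reward = reward_fn(action)
        grad = policy_gradient(action, probs, reward)
        # Optimization: plain gradient ascent on the expected reward.
        logits = [w + lr * g for w, g in zip(logits, grad)]
    return logits


def evaluate(logits):
    """Evaluation: probability that the policy picks the better action."""
    return softmax(logits)[1]


final_logits = train()
```

Because every stage is an ordinary function, swapping in a different reward or update rule means editing one function rather than working around a hidden trainer.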
This paper, presented at ACM CAIS 2026, studies safety evaluation in tool-using LLM agents. It categorizes outcomes into safe success, unsafe success, and failure, and proposes a two-tier verification architecture: deterministic policy/tool checks followed by an LLM-based verifier. Using τ-bench tool-use scenarios, the authors find that verification can reduce unsafe success but also reduces task completion as the task horizon increases. They term this horizon-dependent tradeoff between safety and successful task completion the 'Verifier Tax'. The work highlights that unsafe completion should be treated as a category distinct from safe success.
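The three outcome categories and the two-tier check can be sketched as follows. This is an illustrative reading, not the paper's implementation: the allowlist rule, the `stub_llm_verifier` stand-in, the refund limit, and all function names are assumptions. `completion_rate` is an assumed toy model (independent per-step pass probability) included only to show how a small per-step cost could compound over longer horizons; the paper's own analysis may differ.

```python
from enum import Enum


class Outcome(Enum):
    SAFE_SUCCESS = "safe_success"
    UNSAFE_SUCCESS = "unsafe_success"
    FAILURE = "failure"


def classify(task_completed, violated_policy):
    """Incomplete tasks are failures; completed tasks are split by safety."""
    if not task_completed:
        return Outcome.FAILURE
    return Outcome.UNSAFE_SUCCESS if violated_policy else Outcome.SAFE_SUCCESS


def deterministic_check(call, allowed_tools):
    """Tier 1: a cheap rule-based check (here, a tool allowlist)."""
    return call["tool"] in allowed_tools


def two_tier_verify(call, allowed_tools, llm_verifier):
    """Tier 2 (the LLM-based verifier) runs only if tier 1 passes."""
    if not deterministic_check(call, allowed_tools):
        return False
    return llm_verifier(call)


def stub_llm_verifier(call):
    # Stand-in for an LLM judgment: reject amounts above an assumed limit of 100.
    return call.get("args", {}).get("amount", 0) <= 100


def completion_rate(per_step_pass, horizon):
    """Assumed toy model: every step must independently pass verification."""
    return per_step_pass ** horizon


allowed = {"refund", "lookup_order"}
small_refund = {"tool": "refund", "args": {"amount": 50}}
large_refund = {"tool": "refund", "args": {"amount": 500}}
unknown_tool = {"tool": "delete_account", "args": {}}
```

Under the toy model, a fixed per-step pass probability yields a lower completion rate at horizon 50 than at horizon 10, which is one way a verification cost could depend on task length.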
Phinite launched a multi-agent operating system that provides a registry for first-class agent identity (ID, version, owner, skill graph). It replaces traditional unit tests with behavioral evaluation, using compound reliability scoring and behavioral regression to handle non-deterministic agent execution. Skills are versioned, reusable, and agent-inheritable, enabling composability without rebuilding. The platform is cloud-agnostic, model-agnostic, and includes built-in observability (traces, cost attribution, drift detection). It is SOC 2 Type II compliant and offers free credits for testing.
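The summary does not define compound reliability scoring or behavioral regression, so the following is only one plausible reading, not Phinite's method: reliability of a multi-step run is taken as the product of per-step pass rates, and a regression is flagged when a new version's pass rate over repeated trials drops below a baseline by more than a tolerance. All names, the tolerance, and the stub agent are hypothetical.

```python
import math


def pass_rate(run_agent, trials):
    """Run a possibly non-deterministic agent `trials` times; return the pass fraction."""
    return sum(1 for i in range(trials) if run_agent(i)) / trials


def compound_reliability(step_rates):
    """Assumed definition: product of per-step pass rates."""
    return math.prod(step_rates)


def has_regressed(baseline, candidate, tolerance=0.02):
    """Flag a regression when the candidate falls below baseline minus tolerance."""
    return candidate < baseline - tolerance


def stub_agent(trial_index):
    # Stand-in for an agent run: fails on every tenth trial, passes otherwise.
    return trial_index % 10 != 0


candidate_rate = pass_rate(stub_agent, trials=20)
chain_score = compound_reliability([0.99, 0.95, 0.9])
```

Scoring over repeated trials rather than a single assertion is what distinguishes this style of evaluation from a unit test, since one passing run says little about an agent whose output varies between runs.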
A developer shares production experience building an agent with 140 MCP tools, finding that semantic embeddings for tool selection gave only 64% top-1 accuracy and were confidently wrong. BM25 over tool metadata achieved 81% accuracy, outperforming a hybrid approach that scored 78%. The key insight is that tool descriptions are short and keyword-dependent, making BM25 more effective than embeddings. Indexing schema fields like property names further improved performance. The author recommends testing on the specific corpus at hand rather than assuming document-RAG defaults transfer to tool selection.
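A small pure-Python sketch of BM25 over tool metadata, including schema property names in the indexed text. The three tools, the dict layout (`name`, `description`, `schema`), and the parameter values `k1=1.5` and `b=0.75` are assumptions for illustration; the author's actual index and tool set are not shown in the summary.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Split on non-alphanumerics so "due_date" yields "due" and "date".
    return re.findall(r"[a-z0-9]+", text.lower())


def tool_document(tool):
    """Index text = name + description + schema property names."""
    props = " ".join(tool["schema"].get("properties", {}))
    return f'{tool["name"]} {tool["description"]} {props}'


class BM25Index:
    def __init__(self, tools, k1=1.5, b=0.75):
        self.names = [t["name"] for t in tools]
        self.tokens = [tokenize(tool_document(t)) for t in tools]
        self.k1, self.b = k1, b
        n = len(self.tokens)
        self.avg_len = sum(len(toks) for toks in self.tokens) / n
        doc_freq = Counter(term for toks in self.tokens for term in set(toks))
        self.idf = {
            term: math.log(1 + (n - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def score(self, query, i):
        tf = Counter(self.tokens[i])
        norm = self.k1 * (1 - self.b + self.b * len(self.tokens[i]) / self.avg_len)
        return sum(
            self.idf[term] * tf[term] * (self.k1 + 1) / (tf[term] + norm)
            for term in tokenize(query)
            if term in tf
        )

    def top(self, query):
        best = max(range(len(self.names)), key=lambda i: self.score(query, i))
        return self.names[best]


# Hypothetical tool metadata.
TOOLS = [
    {"name": "create_invoice", "description": "Create a new invoice for a customer",
     "schema": {"properties": {"customer_id": {}, "amount": {}, "due_date": {}}}},
    {"name": "send_email", "description": "Send an email message",
     "schema": {"properties": {"recipient": {}, "subject": {}, "body": {}}}},
    {"name": "list_customers", "description": "List customers in the account",
     "schema": {"properties": {"page": {}, "limit": {}}}},
]

index = BM25Index(TOOLS)
```

In this toy corpus the words "due" and "date" appear only through the `due_date` property name, so a query about due dates can only reach `create_invoice` because schema fields were indexed. Whether that carries over to a real tool set is exactly what the author suggests measuring.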
This Reddit post announces a new open-source package for Multi-Agent Reinforcement Learning (MARL) drone environments built on MuJoCo. The package, available on GitHub, aims to unify various drone objectives for the RL community. The author seeks feedback and contributions to improve the package and fix any issues. The repository includes research publications from the author related to RL.