Release b9637 of llama.cpp introduces a dedicated chat parser for the Cohere2MoE model architecture, referred to as North Code. The parser is implemented via PR #24615 to ensure correct conversation formatting for Cohere's mixture-of-experts variant. The release ships pre-built binaries for macOS, Linux, Windows, and Android across CPU, CUDA, Vulkan, ROCm, SYCL, and other backends. No other functional changes are noted in the release notes beyond this parser addition and some internal renames.
Repos | Source: GitHub | Importance: 2/5
The llama.cpp project tagged release b9632. The primary change, made in PR #24606, adds the count, d, and e filter aliases to the Jinja template engine; in standard Jinja these are short names for the length, default, and escape filters. Pre-built binaries are published for a wide range of platforms: macOS arm64 with optional KleidiAI, Linux (CPU, Vulkan, ROCm 7.2, OpenVINO, SYCL FP32/FP16), Android arm64, and Windows (CPU, CUDA 12/13, Vulkan, SYCL, HIP). Several configurations are disabled in this release, including macOS Intel, iOS XCFramework, and openEuler 310p/910b builds.
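To illustrate what a filter alias means in practice, here is a minimal Python sketch of an alias table that resolves short names to the same callables as their canonical filters. It is not llama.cpp's implementation: the names FILTERS, ALIASES, and resolve_filter are hypothetical, the alias-to-filter mapping follows standard Jinja rather than the PR itself, and None stands in for Jinja's "undefined" value.

```python
from html import escape

# Hypothetical filter table; canonical names follow standard Jinja.
FILTERS = {
    "length": len,
    # None is used here as a stand-in for Jinja's "undefined" value.
    "default": lambda value, fallback="": fallback if value is None else value,
    "escape": lambda value: escape(str(value)),
}

# Short aliases map onto canonical filter names (assumed mapping).
ALIASES = {"count": "length", "d": "default", "e": "escape"}


def resolve_filter(name):
    """Return the filter callable for a canonical name or one of its aliases."""
    return FILTERS[ALIASES.get(name, name)]
```

With a table like this, a template expression written with `count`, `d`, or `e` behaves exactly like one written with the longer name, because both resolve to the same function.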
Repos | Source: GitHub | Importance: 2/5
This release of llama.cpp adds the cohere2moe tokenizer to llama-vocab, enabling inference with the TINY_AYA model. The change was contributed via pull request #24601. Build artifacts are provided for macOS, Linux, Windows, and Android across various backends.
Repos | Source: GitHub | Importance: 3/5
Andrew Ng has released a new open-source repository called aisuite, which provides a simple, unified interface to multiple generative AI providers. The tool abstracts away differences in provider APIs, making it easier for developers to switch between various AI services. The repository description does not list specific supported providers. It aims to simplify integration and experimentation with different AI models.
The b9627 release of llama.cpp is a minor maintenance update containing a single bug fix for the llama-ui-embed utility. The fix resolves a crash that occurred when the tool was launched without specifying an asset directory, addressing issue #24597. No new features, model support, or performance changes are included. The release ships pre-built binaries across the usual set of platforms including macOS, Linux, Android, Windows, and openEuler, with various GPU backends available.
Release b9626 of llama.cpp introduces support for the Cohere2 Mixture of Experts (MoE) architecture under the new arch name "cohere2moe". It fixes sliding window attention pattern handling, resolves multi-token prediction (MTP) failures by switching to interleaved sliding window attention (iSWA), and adjusts the shared expert combination to (routed+shared)*0.5. Redundant gating function checks, lmhead tensor checks, and tokenizer type definitions were removed; the tokenizer is kept as tiny_aya. Platform builds are provided for macOS (Apple Silicon/Intel), Linux (x64/arm64 with Vulkan, ROCm, OpenVINO, SYCL), Android, and Windows (CPU/CUDA/Vulkan/SYCL/HIP), along with UI support.
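The (routed+shared)*0.5 combination amounts to an element-wise average of the two expert outputs. The sketch below shows that arithmetic on plain Python lists; the function name combine_experts and the list-based representation are assumptions for illustration, not the tensor code used in llama.cpp.

```python
def combine_experts(routed, shared):
    """Average the routed-expert and shared-expert outputs element-wise."""
    if len(routed) != len(shared):
        raise ValueError("routed and shared outputs must have the same length")
    # (routed + shared) * 0.5, applied per element
    return [(r + s) * 0.5 for r, s in zip(routed, shared)]
```

For example, combining routed outputs [1.0, 2.0] with shared outputs [3.0, 4.0] yields [2.0, 3.0].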