The llama.cpp project released tag b9659, which fixes a bug in the mtmd component that caused n_tokens to be miscounted (PR #24656). The release also provides pre-built binaries for a wide range of platforms: macOS (ARM64, Intel), Linux (x64, ARM64, s390x, with Vulkan, ROCm, OpenVINO, and SYCL variants), Android (ARM64), and Windows (x64, ARM64, with CUDA 12/13, Vulkan, SYCL, and HIP variants). The macOS Apple Silicon build with KleidiAI enabled is marked as disabled, while the iOS XCFramework artifact is available.
The llama.cpp project released build b9658. Its key change improves chat debugging: when a parse error occurs, the debug output now includes the full unparsed prompt. The release also provides pre-built binaries for many platforms, including macOS (Apple Silicon, Intel), Linux (CPU, Vulkan, ROCm, OpenVINO, SYCL), Android (arm64 CPU), and Windows (CPU, CUDA, Vulkan, SYCL, HIP). The KleidiAI-enabled macOS Apple Silicon build is disabled in this release.
llama.cpp release b9656 hardens PEG-native tool call parsing. The parser now accepts an optional leading "type":"function" field to accommodate OpenAI-style tool call serialization; this lenient handling is gated behind an analysis flag. On a final parse failure, the parser returns a clean error and logs the unparsed fragment instead of throwing raw internal state. When the arguments string is not valid JSON, the raw string is preserved so that prompt rendering no longer aborts. Parse failures are surfaced with clearer error messages, which eliminates silent empty assistant turns.
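The behaviour described above can be illustrated with a small Python sketch. This is not llama.cpp's implementation (which is a C++ PEG parser); it only mimics the three described behaviours on plain JSON input. All names here (`parse_tool_call`, `ToolCall`, `ToolCallParseError`, and the `lenient_type_field` parameter standing in for the analysis flag) are hypothetical, and the accepted JSON shapes are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Any


class ToolCallParseError(ValueError):
    """Clean error that carries the fragment which failed to parse."""

    def __init__(self, message: str, fragment: str):
        super().__init__(message)
        self.fragment = fragment


@dataclass
class ToolCall:
    name: str
    arguments: Any  # parsed JSON value, or the raw string if it was not valid JSON


def parse_tool_call(text: str, lenient_type_field: bool = True) -> ToolCall:
    """Hypothetical sketch: parse one serialized tool call."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError as exc:
        # Report a clean error plus the unparsed fragment, not internal state.
        raise ToolCallParseError(f"invalid tool call JSON: {exc.msg}", text) from None
    if not isinstance(obj, dict):
        raise ToolCallParseError("tool call must be a JSON object", text)

    if "type" in obj:
        # Assumption: the "type" field is tolerated only when the flag is on.
        if not lenient_type_field or obj["type"] != "function":
            raise ToolCallParseError("unexpected 'type' field in tool call", text)
        inner = obj.get("function")
        # Accept both a nested {"function": {...}} body and a flat one.
        obj = inner if isinstance(inner, dict) else {
            k: v for k, v in obj.items() if k != "type"
        }

    name = obj.get("name")
    if not isinstance(name, str) or not name:
        raise ToolCallParseError("tool call has no function name", text)

    args = obj.get("arguments", {})
    if isinstance(args, str):
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            pass  # keep the raw string rather than aborting
    return ToolCall(name=name, arguments=args)
```

A caller would catch `ToolCallParseError`, log its `fragment` attribute, and report the message, so that a malformed call produces a visible error rather than an empty assistant turn. Unlike the "leading" field mentioned in the release note, this sketch ignores key order because it works on already-decoded JSON.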
The llama.cpp project released tag b9655, which fixes a long-standing ("oldie but goodie") grammar generator bug in the chat code that recent changes brought to the surface (PR #24653). An erroneous case in the PEG parser test was also corrected. The release provides pre-built binaries for a wide range of platforms, including macOS (Apple Silicon, Intel, KleidiAI), Linux (x64, arm64, s390x, Vulkan, ROCm, OpenVINO, SYCL), Android (arm64), and Windows (x64, arm64, CUDA 12/13, Vulkan, SYCL, HIP). openEuler builds and UI components are also included.
llama.cpp release b9654 adds a post-decode callback to mtmd, the project's multimodal module (PR #24645). The development was assisted by the Qwen3.6-27B language model. Pre-built binaries are provided for macOS Apple Silicon, Linux x64/arm64, Windows x64/arm64, and Android, with various GPU backends (Vulkan, CUDA 12/13, ROCm, SYCL, HIP); some configurations are disabled.
The b9653 release of llama.cpp extends the Vulkan backend to handle additional CONCAT tensor operation types, improving compatibility for models that rely on these operations. It also ships pre-built binaries for macOS (Apple Silicon, Intel), Linux (multiple GPU backends including Vulkan, ROCm, OpenVINO, SYCL), Android, Windows (CUDA 12/13, Vulkan, SYCL, HIP), and openEuler platforms. The release was published automatically on June 15, 2026.