Krill, an AI relay service, is running a 618 promotion from June 15–18, 2026, that cuts base Codex model rates to as low as 0.15 and offers a 66% discount coupon on Codex plans. With a 10-person group buy, the effective rate reaches 0.1 Chinese yuan per US dollar. Users who hold a Codex plan on June 15 will have their quotas adjusted to the 0.1 level. Claude model access is discounted only through balance top-ups, not through plans. The service uses Pro accounts and emphasizes cost transparency.
A systems-level deep dive that exposes the hidden microarchitectural costs of GPU time-slicing in Kubernetes when running concurrent LLM agents. It quantifies the actual overhead of co-locating agentic AI workloads and explains what it means for operational efficiency.
Repos | Source: GITHUB | Importance: 2/5
llama.cpp b9631 addresses a command-line interface bug where preserved tokens were not correctly copied, as tracked in issue #24258. The release includes pre-compiled binaries for macOS (Apple Silicon and Intel), Linux (x64, arm64, s390x, Vulkan, ROCm, OpenVINO, SYCL), Android (arm64), Windows (CPU, CUDA, Vulkan, SYCL, HIP), and openEuler platforms. This is a routine patch release primarily focused on a single CLI fix.
Repos | Source: GITHUB | Importance: 2/5
This release of llama.cpp adds the cohere2moe tokenizer to llama-vocab, enabling inference with the TINY_AYA model. The change was contributed via pull request #24601. Build artifacts are provided for macOS, Linux, Windows, and Android across various backends.
Repos | Source: GITHUB | Importance: 2/5
The b9628 release of llama.cpp integrates SYCL backend validation into the continuous integration and release testing pipeline. The new check-release workflow now covers SYCL FP32 and FP16 builds on Ubuntu x64 and SYCL on Windows x64, ensuring Intel GPU acceleration is regularly tested. The release also maintains existing test matrices for macOS, Linux (CPU, Vulkan, ROCm, OpenVINO), Android, and Windows (CUDA, Vulkan, HIP).
Tutorials | Source: MEDIUM LARGE LANGUAGE MODELS | Importance: 1/5
The provided article body contains only an introductory teaser sentence, with the full content inaccessible behind Medium's continue-reading wall. No concrete information about KV caching, specific models, or inference optimizations is present in the raw content.