PrintGuard 2.0, an open-source FDM failure detector, reuses the same ShuffleNetV2 encoder with nearest-prototype classification but completely rewrites the runtime. The model is exported as a ~5 MB TFLite file via LiteRT, enabling deployment on CPython (hub mode) and in the browser (Pyodide + LiteRT.js WASM) from a single codebase. A Platform abstraction layer isolates all non-portable operations (inference, camera discovery, image encoding), so the Python engine runs unchanged in both environments. The system introduces a dynamic, fairness-aware inference scheduler that uses smoothed latency estimates and max-min fairness to allocate inference capacity across cameras. A fail-safe design gates inference on printer state, stopping only when the printer is positively known not to be printing, and a watchdog monitors camera feeds and printer services for dropouts.
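The scheduling idea in the PrintGuard item can be illustrated with a minimal sketch. It is not PrintGuard's actual code: the names `CameraStats`, `max_min_fair`, and `schedule`, the use of an exponentially weighted moving average for smoothing, and the one-second compute budget are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CameraStats:
    """Per-camera scheduling state (hypothetical structure)."""
    demand_fps: float        # inference rate the camera would like
    latency_s: float = 0.1   # smoothed per-inference latency; must stay > 0

    def observe(self, sample_s: float, alpha: float = 0.2) -> None:
        # Exponentially weighted moving average of measured latency.
        self.latency_s = alpha * sample_s + (1 - alpha) * self.latency_s


def max_min_fair(demands: dict[str, float], capacity: float) -> dict[str, float]:
    """Water-filling max-min fair split of `capacity` across `demands`."""
    alloc = {name: 0.0 for name in demands}
    remaining = capacity
    pending = sorted(demands, key=lambda n: demands[n])  # smallest demand first
    while pending:
        share = remaining / len(pending)
        name = pending[0]
        if demands[name] <= share:
            # Small demands are satisfied in full; the surplus is redistributed.
            alloc[name] = demands[name]
            remaining -= demands[name]
            pending.pop(0)
        else:
            # Every remaining demand exceeds the equal share, so all get that share.
            for n in pending:
                alloc[n] = share
            break
    return alloc


def schedule(cameras: dict[str, CameraStats], budget_s: float = 1.0) -> dict[str, float]:
    """Return the inference rate (fps) granted to each camera.

    Demands are converted to compute time per wall-clock second using the
    smoothed latency, split fairly, then converted back to a rate.
    """
    time_demand = {n: c.demand_fps * c.latency_s for n, c in cameras.items()}
    time_alloc = max_min_fair(time_demand, budget_s)
    return {n: time_alloc[n] / cameras[n].latency_s for n in cameras}
```

Under these assumptions, a camera asking for 1 fps next to one asking for 10 fps (both at 0.1 s per inference) keeps its full 1 fps, while the heavier camera is capped at roughly 9 fps.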
Developer Knok0932 updated an open-source C++ implementation of PaddleOCR to support text detection and recognition models from PP-OCR v3 through the latest v6. The project uses the ncnn inference framework instead of the official Paddle C++ runtime, which is described as complex and heavy with many dependencies. The ncnn-based approach reportedly offers faster inference for the author's tasks and greatly simplifies deployment. The code is available on GitHub at https://github.com/Avafly/PaddleOCR-ncnn-CPP.
The author proposes an open-source edge semantic cache architecture for LLMs aimed at reducing latency and API costs. It uses Rust compiled to WebAssembly, running on CDN edge nodes (e.g., Cloudflare Workers), to intercept user prompts. On a cache hit (similarity ≥ 0.88), a cached response is returned in ~5 ms without calling the LLM; on a miss, the request is proxied to the provider and the cache is updated asynchronously. Key components include a lightweight embedding model such as bge-small-en-v1.5, a vector similarity check against an edge vector database, and an edge KV store for response texts. The author seeks community feedback on realistic semantic cache hit rates in production, potential edge caching pitfalls, and interest in an open-source template.
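The hit/miss flow of the semantic cache can be sketched as follows. The proposal targets Rust compiled to WebAssembly; this Python version is only an illustration of the logic. The `SemanticCache` class, the `answer` helper, and the in-memory list and dict that stand in for the edge vector database and KV store are hypothetical, and the embedding function is supplied by the caller rather than being a real bge-small-en-v1.5 model.

```python
import math
from typing import Callable, Optional, Sequence

SIMILARITY_THRESHOLD = 0.88  # hit threshold quoted in the post


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """In-memory stand-in for the edge vector database plus KV store."""

    def __init__(self, embed: Callable[[str], Sequence[float]],
                 threshold: float = SIMILARITY_THRESHOLD) -> None:
        self.embed = embed
        self.threshold = threshold
        self.vectors: list[tuple[Sequence[float], str]] = []  # (embedding, key)
        self.responses: dict[str, str] = {}                   # key -> response text

    def lookup(self, prompt: str) -> Optional[str]:
        """Return the cached response of the nearest prompt, if similar enough."""
        query = self.embed(prompt)
        best_key, best_sim = None, -1.0
        for vec, key in self.vectors:  # linear scan; a real vector DB would index
            sim = cosine(query, vec)
            if sim > best_sim:
                best_key, best_sim = key, sim
        if best_key is not None and best_sim >= self.threshold:
            return self.responses[best_key]
        return None

    def store(self, prompt: str, response: str) -> None:
        key = f"resp:{len(self.vectors)}"
        self.vectors.append((self.embed(prompt), key))
        self.responses[key] = response


def answer(prompt: str, cache: SemanticCache, call_llm: Callable[[str], str]) -> str:
    """Serve from cache on a hit; otherwise call the LLM and populate the cache."""
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached
    response = call_llm(prompt)
    cache.store(prompt, response)  # the post updates asynchronously; synchronous here
    return response
```

A fixed threshold trades hit rate against the risk of returning an answer to a different question, which is the production concern the author asks about.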
The paper proposes a parameter-free adaptive token allocation method for video tokenization that exploits temporal redundancy in the latent space of a frozen continuous video tokenizer. It drops spatial positions whose per-position temporal-L1 differences fall below a fixed threshold, achieving content-driven compression rates. A lightweight Latent Inpainting Transformer (LIT) with factorised spatial-temporal attention reconstructs the dropped tokens. The pipeline requires only a single encoder pass and one LIT forward pass, eliminating auxiliary routing networks. On TokenBench and DAVIS benchmarks, the method delivers competitive reconstruction fidelity with a 31x inference speedup over ElasticTok-CV and 2x over InfoTok.
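The token-dropping rule in the video tokenization item can be sketched in a few lines. The summary does not give the exact definition, so this sketch assumes that each position is compared with the same position in the previous frame, that the L1 difference is summed over channels, and that the first frame is always kept. The names `keep_mask` and `compression_rate` are hypothetical, and the Latent Inpainting Transformer that reconstructs dropped tokens is not modelled.

```python
def keep_mask(latents, threshold):
    """Decide which latent tokens to keep.

    latents[t][p] is the channel vector at frame t and flattened spatial
    position p. Returns mask[t][p], True where the token is kept.
    """
    if not latents:
        return []
    # Assumption: the first frame has no predecessor, so all of it is kept.
    mask = [[True] * len(latents[0])]
    for prev, cur in zip(latents, latents[1:]):
        row = []
        for v_prev, v_cur in zip(prev, cur):
            # Per-position temporal L1 difference, summed over channels.
            l1 = sum(abs(a - b) for a, b in zip(v_cur, v_prev))
            row.append(l1 >= threshold)  # drop positions that barely changed
        mask.append(row)
    return mask


def compression_rate(mask):
    """Ratio of total tokens to kept tokens (higher means more compression)."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return total / kept
```

Because the mask depends only on the latents and one fixed threshold, the compression rate follows the content: static scenes drop most tokens, while fast motion keeps most of them.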
This Reddit post from r/MachineLearning discusses the real-world adoption of privacy-preserving ML techniques like differential privacy, federated learning, and on-device inference. The author asks industry practitioners whether these methods are deployed in production, what engineering challenges arise, and how privacy requirements affect model performance and infrastructure costs. The post also invites stories about specific use cases where these approaches have proven valuable or where tradeoffs made adoption difficult.
The post discusses whether quantization-aware training (QAT) is tied to one specific quantization scheme, such as Google's own for Gemma-4, or whether alternative quantizations such as Unsloth's are equally valid. Unsloth's quantizations of Gemma-4-QAT reportedly produce results closer to those of the QAT fine-tuned models. The author questions whether this closeness is beneficial or undermines the purpose of QAT, which is to emulate a particular inference-time quantization. The discussion highlights a potential trade-off between accuracy preservation and adherence to the original quantization scheme.
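Why QAT might be sensitive to the choice of scheme can be seen in a toy example. The sketch below uses a generic symmetric, per-block, round-to-nearest quantizer; it is not Google's or Unsloth's actual format, and `fake_quant`, the 4-bit width, and the block sizes are illustrative assumptions. Weights that already sit on the grid of one scheme (block size 4) round-trip almost exactly under that scheme, but pick up error under a different one (block size 8).

```python
def fake_quant(weights, bits=4, block=4):
    """Symmetric per-block quantize-dequantize with round-to-nearest."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit symmetric
    out = []
    for i in range(0, len(weights), block):
        chunk = weights[i:i + block]
        scale = max(abs(w) for w in chunk) / qmax
        if scale == 0.0:
            out.extend(0.0 for _ in chunk)  # all-zero block stays zero
            continue
        out.extend(round(w / scale) * scale for w in chunk)
    return out


def max_error(a, b):
    """Largest absolute element-wise difference between two weight lists."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical weights that lie on the block-size-4 grid, as QAT targeting
# that scheme would encourage.
weights = [0.7, 0.1, -0.3, 0.2, 0.07, 0.01, -0.03, 0.02]

error_same_scheme = max_error(weights, fake_quant(weights, block=4))
error_other_scheme = max_error(weights, fake_quant(weights, block=8))
```

Here `error_same_scheme` is at floating-point noise level, while `error_other_scheme` is about 0.03 because the second block's small weights are rounded against the first block's larger scale. Whether a real alternative quantization loses or preserves accuracy depends on its actual format, which this toy does not model.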