AI intelligence feed

HUGGINGFACE | Jul 1, 2026 | Highlight

Multimodal Continuous Reasoning via Asymmetric Mutual Variational Learning

This paper proposes Asymmetric Mutual Variational Learning (AMVL), a framework that addresses the train-inference mismatch in continuous latent reasoning for multimodal large language models. The mismatch arises because standard variational training forces the inference-time prior to mimic a posterior conditioned on ground-truth answers, causing answer leakage. AMVL uses a forward KL divergence to align the prior with the posterior and a novel reverse KL divergence to regularize the posterior, preventing collapse into inference-incompatible regions. The method is instantiated in a latent-integrated MLLM and evaluated on the BLINK benchmark, where it improves the average score by +10.83 and achieves gains of up to +32.00 on individual reasoning tasks, with analyses showing improved latent-space stability.
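
To make the two divergence directions concrete, here is a minimal sketch that assumes both the posterior q and the prior p are diagonal Gaussians. The function names, the loss weights, and the choice of which direction to call "forward" are illustrative assumptions, not taken from the paper, and the sketch does not model which distribution each term updates during training.

```python
import math

def kl_diag_gaussian(mu_a, var_a, mu_b, var_b):
    """KL( N(mu_a, var_a) || N(mu_b, var_b) ) for diagonal Gaussians, summed over dims."""
    return sum(
        0.5 * (math.log(vb / va) + (va + (ma - mb) ** 2) / vb - 1.0)
        for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b)
    )

def asymmetric_mutual_loss(posterior, prior, w_forward=1.0, w_reverse=0.1):
    """Weighted sum of both KL directions. The weights are hypothetical."""
    (mu_q, var_q), (mu_p, var_p) = posterior, prior
    forward = kl_diag_gaussian(mu_q, var_q, mu_p, var_p)  # assumed "forward": KL(q || p)
    reverse = kl_diag_gaussian(mu_p, var_p, mu_q, var_q)  # assumed "reverse": KL(p || q)
    return w_forward * forward + w_reverse * reverse, forward, reverse

# Toy 2-D latents given as (means, variances).
posterior = ([0.0, 1.0], [1.0, 0.5])
prior = ([0.0, 0.0], [4.0, 1.0])
loss, forward, reverse = asymmetric_mutual_loss(posterior, prior)
print(f"forward={forward:.3f} reverse={reverse:.3f} loss={loss:.3f}")
```

The two directions generally give different values for the same pair of distributions, which is why penalizing both is not redundant.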

HUGGINGFACE | Jul 1, 2026 | Highlight

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

ELDR is a decode router for prefill-decode (PD) disaggregated serving of mixture-of-experts (MoE) models. It addresses latency differences caused by per-batch expert activation patterns. ELDR constructs an expert signature from a request's prefill activations to predict which experts will be used during generation. It then uses offline balanced K-means to partition the signature space across decode workers, and a locality-band online policy that routes each request to the least-loaded worker among those best matching its signature. A signature cache co-indexed with the KV cache at KV-block granularity maintains exact signatures under prefix caching. Implemented in vLLM and tested with up to 40 GPUs across three MoE models and two workloads, ELDR reduced median time-per-output-token (TPOT) by 5.9–13.9% relative to the strongest of four load-balancing baselines while keeping model outputs unchanged.
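
The routing step can be sketched roughly as follows. This is a simplified illustration, not ELDR's implementation: the `route` function, the Euclidean distance metric, and the additive `band` threshold are all assumptions, and the per-worker centroids are taken as given rather than produced by balanced K-means.

```python
import math

def route(signature, centroids, loads, band=0.1):
    """Pick the least-loaded worker among those whose centroid is nearly the best match.

    signature: expert-activation vector for the request (assumed representation)
    centroids: one signature-space centroid per decode worker
    loads:     current load per worker (e.g. in-flight requests)
    band:      hypothetical slack on distance that defines the locality band
    """
    dists = [math.dist(signature, c) for c in centroids]
    best = min(dists)
    # Workers inside the locality band: close enough to the best-matching one.
    candidates = [w for w, d in enumerate(dists) if d <= best + band]
    return min(candidates, key=lambda w: loads[w])

centroids = [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0]]
loads = [5, 2, 0]
print(route([1.0, 0.0], centroids, loads))            # worker 1: in band and less loaded
print(route([1.0, 0.0], centroids, loads, band=0.0))  # worker 0: strict best match only
```

Widening the band trades expert locality for load balance; a band of zero reduces to pure nearest-centroid routing.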

HUGGINGFACE | Jul 1, 2026 | Highlight

Nvidia Releases NVFP4-Quantized Version of Mistral-Medium-3.5-128B

Nvidia has published a quantized variant of the Mistral-Medium-3.5-128B large language model on Hugging Face. The model employs NVFP4, a 4-bit floating point precision format, to reduce memory footprint and potentially accelerate inference. It is labeled as conversational and text-generation compatible, using the safetensors format. The repository indicates the model is based on the original Mistral-Medium-3.5-128B from Mistral AI and is shared under a custom license.
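
As a rough back-of-the-envelope illustration of the memory claim, the sketch below estimates weight storage at 16 and 4 bits per parameter. It assumes the "128B" in the model name means about 128 billion parameters, and it ignores scaling-factor metadata, any layers left in higher precision, activations, and the KV cache, so the real footprint will differ.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

PARAMS = 128e9  # assumption: parameter count inferred from the model name

bf16_gb = weight_memory_gb(PARAMS, 16)  # 16-bit baseline, used here for comparison
fp4_gb = weight_memory_gb(PARAMS, 4)    # idealized 4-bit storage, no metadata

print(f"16-bit weights: ~{bf16_gb:.0f} GB")
print(f"4-bit weights:  ~{fp4_gb:.0f} GB ({bf16_gb / fp4_gb:.0f}x smaller)")
```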

HUGGINGFACE | Jun 30, 2026 | Highlight

SpheRoPE: Zero-Shot Optimization-Free 360 Panorama Generation with Spherical RoPE

The paper introduces SpheRoPE, a zero-shot, training-free, and optimization-free framework for generating 360° panoramic images and videos by directly injecting spherical priors into pre-trained diffusion transformers. Spherical RoPE replaces standard rotary position embeddings: low-frequency channels are re-parameterized as 3D Cartesian coordinates to natively encode the spherical manifold, while high-frequency channels are harmonically quantized to enforce exact periodicity. A complementary Semantic Distortion classifier-free guidance (CFG) steers geometry. The method is demonstrated on text-to-panorama using Flux.1, Flux.2, and LTX-Video backbones, achieving competitive results without any fine-tuning or inference-time optimization.
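
One way to read the two ideas above is sketched below. It is an interpretation under stated assumptions, not the paper's formulation: the function names are hypothetical, the Cartesian mapping is the standard longitude/latitude to unit-sphere conversion, and "harmonic quantization" is approximated by rounding a frequency to the nearest positive integer so that the rotary phase repeats exactly every 2π of longitude.

```python
import math

def sphere_xyz(lon, lat):
    """Map longitude/latitude (radians) to a point on the unit sphere."""
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

def quantize_frequency(freq):
    """Round to the nearest positive integer harmonic (assumed quantization rule)."""
    return max(1, round(freq))

def rotary_phase(lon, freq):
    """Cos/sin pair for one rotary channel, periodic in longitude by construction."""
    k = quantize_frequency(freq)
    return (math.cos(k * lon), math.sin(k * lon))

# The left and right edges of a panorama share both encodings.
left, right = 0.3, 0.3 + 2 * math.pi
print(sphere_xyz(left, 0.5), sphere_xyz(right, 0.5))
print(rotary_phase(left, 3.7), rotary_phase(right, 3.7))
```

With a non-integer frequency such as 3.7, the unquantized phase would differ between the two edges, which is the seam that integer harmonics avoid.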

HUGGINGFACE | Jun 30, 2026

Multi-Block Diffusion Language Models

This paper proposes Multi-Block Diffusion Language Models (MBD-LMs), which extend block diffusion LMs to decode multiple consecutive blocks in parallel, adding inter-block parallelism. To align training with multi-block inference, the authors introduce Multi-block Teacher Forcing (MultiTF), which trains on bounded noise-groups conditioned on clean prefixes with randomized noise-schedulers. A Block Buffer decoding algorithm preserves KV-cache reuse and static input shapes, translating parallelism into wall-clock speedup. On MBD-LLaDA2-Mini, average tokens per forward pass (TPF) increase from 3.47 to 6.19 while accuracy rises from 79.95% to 81.03%. Combined with DMax, the model reaches 9.34 TPF with only a 1.02% accuracy drop on math and code benchmarks.
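
For scale, the reported MBD-LLaDA2-Mini figures work out as below. The helper name is illustrative, and a higher tokens-per-forward-pass ratio is not the same as a wall-clock speedup, since each forward pass may cost more.

```python
def relative_gain(before, after):
    """Ratio of a metric after vs. before a change."""
    return after / before

tpf_gain = relative_gain(3.47, 6.19)  # about 1.78x more tokens per forward pass
acc_delta = 81.03 - 79.95             # about +1.08 accuracy points

print(f"TPF gain: {tpf_gain:.2f}x, accuracy change: {acc_delta:+.2f} points")
```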

HUGGINGFACE | Jun 30, 2026

Jackrong releases GGUF quantized version of Qwopus3.6-35B-A3B-Coder model

Jackrong has uploaded a GGUF-quantized model file for Qwopus3.6-35B-A3B-Coder on Hugging Face. The base model is a multimodal mixture-of-experts model based on Qwen3.6, designed for coding, tool use, and function calling, and it supports an image-text-to-text pipeline. This GGUF version enables efficient local inference with llama.cpp. The repository is released under the Apache 2.0 license. At the time of posting, the repository showed 62 likes and 0 downloads, and no performance benchmarks were provided.
