Microsoft has released HARC-Qwen2.5-7B-Instruct, a fine-tuned version of Qwen2.5-7B-Instruct optimized for safety and alignment in conversational AI. The transformer-based text-generation model is available on Hugging Face under the Apache 2.0 license, is distributed in safetensors format, and is compatible with text-generation-inference and Hugging Face endpoints. The release is associated with the paper arXiv:2607.00572.
Microsoft released HARC-Llama-3.1-8B-Instruct on Hugging Face. It is a text-generation model built on Meta's Llama 3.1 8B Instruct. Repository tags indicate a focus on safety, alignment, and conversational use. The model card provides no benchmarks, training details, or specific capability claims. It is distributed under the Llama 3.1 license.
This paper reveals that dense on-policy self-distillation (SDPO) accelerates in-domain specialization under stable teacher signals, but causes severe forgetting and even complete collapse during continual post-training. In contrast, on-policy reinforcement learning methods like GRPO adapt more conservatively and better preserve prior capabilities. Denser self-distillation induces larger drift in parameter and response spaces, and amplifies high-frequency formatting artifacts through a self-reinforcing teacher-student loop. The findings caution that on-policy data alone is insufficient for continual learning, and dense self-distillation should not be treated as a default stabilizer.
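For readers unfamiliar with GRPO, here is a minimal sketch of the group-relative advantage it is generally built on: each sampled response receives a single scalar, its reward standardized within the group of responses to the same prompt. The function name, the `eps` constant, and the choice of population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation plus a small eps for stability;
    real implementations differ on both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by a binary reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every response gets the same reward yields zero advantage,
# so it contributes no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

The whole response shares one advantage value, which is a much coarser signal than the per-token teacher targets used in dense self-distillation.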
SkillCoach proposes a self-evolving rubric framework that derives skill-grounded process rubrics from rollouts to evaluate agentic skill-use along four dimensions: skill selection, skill following, skill composition, and skill-grounded reflection. It separates process quality from task success by keeping an external verifier as a distinct outcome signal, enabling detection of hidden failures that final accuracy alone would miss. The evolved rubrics are then used as process supervision to select high-quality training trajectories, outperforming outcome-only filtering. Experiments show the approach improves evaluation quality and provides stronger supervision signals for enhancing agentic skill-use.
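The difference between outcome-only filtering and rubric-based selection can be sketched as follows. This is a hypothetical illustration, not the paper's procedure: the `Trajectory` class, the [0, 1] score scale, the unweighted mean, and the `min_score` threshold are all assumptions; only the four dimension names and the separate verifier signal come from the summary above.

```python
from dataclasses import dataclass

# The four process dimensions named in the summary.
DIMENSIONS = ("selection", "following", "composition", "reflection")


@dataclass
class Trajectory:
    task_id: str
    passed: bool   # outcome signal from the external verifier
    rubric: dict   # per-dimension process scores, assumed to lie in [0, 1]


def process_score(traj):
    """Unweighted mean over the four dimensions (an assumed aggregation)."""
    return sum(traj.rubric[d] for d in DIMENSIONS) / len(DIMENSIONS)


def outcome_only_filter(trajs):
    """Keep every trajectory the verifier accepted."""
    return [t for t in trajs if t.passed]


def rubric_filter(trajs, min_score=0.7):
    """Keep trajectories that pass the verifier and also score well on process."""
    return [t for t in trajs if t.passed and process_score(t) >= min_score]


trajs = [
    Trajectory("t1", True, {"selection": 0.9, "following": 0.8,
                            "composition": 0.9, "reflection": 1.0}),
    # Hidden failure: the final answer passes, but the process is poor.
    Trajectory("t2", True, {"selection": 0.2, "following": 0.3,
                            "composition": 0.1, "reflection": 0.4}),
    Trajectory("t3", False, {"selection": 0.9, "following": 0.9,
                             "composition": 0.8, "reflection": 0.9}),
]

kept_by_outcome = [t.task_id for t in outcome_only_filter(trajs)]
kept_by_rubric = [t.task_id for t in rubric_filter(trajs)]
```

In this toy data, outcome-only filtering keeps the hidden failure `t2`, while the rubric-aware filter drops it.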
The paper adapts a mixture-of-experts discrete diffusion language model, DiffusionGemma-26B, and benchmarks it against the autoregressive Gemma-4-26B on medical visual question answering. Using the same LoRA fine-tuning recipe, the diffusion model matches or exceeds the autoregressive baseline, as scored by a verbosity-robust LLM judge, while decoding 3.5–4.4× faster. The fine-tuned model (3.8B active parameters) is competitive with frontier vision-language models. Crucially, the diffusion paradigm enables any-order infill: a radiologist can correct parts of a report and the model generates the text between the corrected spans, a capability inherent to diffusion that autoregressive models cannot easily replicate. This suits real-world radiology reports, which often vary in style and completeness across clinicians and institutions.
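To make the infill idea concrete, here is a toy sketch of how a report with clinician-fixed spans could be turned into a masked template and filled in a non-left-to-right order. Everything here is illustrative: the `[MASK]` token, both function names, and the stand-in `predict` callable are assumptions, and a real diffusion model would unmask positions over several denoising steps rather than in a fixed reverse order.

```python
MASK = "[MASK]"


def build_infill_template(tokens, keep_spans):
    """Keep tokens inside keep_spans (half-open [start, end)); mask the rest."""
    keep = set()
    for start, end in keep_spans:
        keep.update(range(start, end))
    return [tok if i in keep else MASK for i, tok in enumerate(tokens)]


def infill(template, predict):
    """Fill masked positions using `predict`, a hypothetical denoiser stand-in.

    Positions are filled right to left only to show that the order need not
    be left to right; the kept tokens on both sides are visible throughout.
    """
    out = list(template)
    masked = [i for i, tok in enumerate(out) if tok == MASK]
    for i in reversed(masked):
        out[i] = predict(out, i)
    return out


report = ("No focal consolidation . Heart size is normal . "
          "No pleural effusion .").split()

# The clinician fixes the first and last sentences; the middle is regenerated.
template = build_infill_template(report, keep_spans=[(0, 4), (9, 13)])

# Placeholder predictor that just labels each generated slot.
filled = infill(template, predict=lambda seq, i: f"<gen{i}>")
```

An autoregressive decoder conditions only on the left context, so the fixed closing sentence could not constrain the regenerated middle without extra machinery.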
The paper proposes Perceive-to-Reason (P2R), a framework that decouples fine-grained visual reasoning into a two-stage process: a Perceiver that localizes question-relevant evidence in the image, and a Reasoner that answers using the annotated image and cropped regions. It introduces Perception-Reasoning Alternating GRPO (PRA-GRPO), a role-aware reinforcement learning strategy that alternates updates between perception-focused and reasoning-focused phases using only final-answer supervision. Built on Qwen3-VL-Instruct-2B/4B/8B, P2R consistently improves performance; P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its backbone. Further experiments show the benefits extend beyond high-resolution benchmarks to broader multimodal reasoning tasks.
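The alternating update pattern can be pictured with a simple schedule. This is a hedged sketch only: the summary does not state phase lengths, which role goes first, or how rewards are shaped, so `phase_len`, the perceiver-first ordering, and the function name are all assumptions.

```python
from collections import Counter


def active_role(step, phase_len=50):
    """Return which role receives updates at a given training step.

    Assumption: fixed-length phases, starting with the perception phase.
    """
    return "perceiver" if (step // phase_len) % 2 == 0 else "reasoner"


# Count how many of the first 200 steps update each role.
updates = Counter(active_role(step) for step in range(200))
```

In both phases the reward would still come from final-answer correctness alone, consistent with the summary's note that no intermediate supervision is used.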