V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

V-Zero is a framework for fine-grained visual reasoning in multimodal large language models that requires no explicit answer labels. It uses on-policy distillation and contrastive evidence gating to ground reasoning in localized image evidence.

Problem and Motivation

Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in specific image regions. Existing agentic methods typically depend on reinforcement learning with verifiable rewards, supervised fine-tuning on large-scale annotated reasoning traces, or hand-designed verification rules. These approaches incur high annotation costs, require extensive exploration, and rely heavily on textual supervision.

V-Zero Approach

V-Zero introduces answer‑label
