V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

V-Zero is a framework for fine‑grained visual reasoning in multimodal large language models that does not rely on explicit answer labels. It uses on‑policy distillation and contrastive evidence gating to ground reasoning in localized image evidence.

Problem and Motivation

Fine‑grained visual reasoning requires multimodal large language models (MLLMs) to identify task‑relevant visual evidence and ground their reasoning in specific image regions. Existing agentic methods typically depend on reinforcement learning with verifiable rewards, supervised fine‑tuning on large‑scale annotated reasoning traces, or hand‑designed verification rules. These dependencies bring high annotation costs, extensive exploration, and heavy reliance on textual supervision.

V-Zero Approach

V-Zero introduces answer‑label