Qwen-Image-2.0-RL is a post-training pipeline that applies reinforcement learning from human feedback (RLHF) and on-policy distillation (OPD) to the Qwen-Image-2.0 diffusion model. Its task-specific composite reward models are built by fine-tuning vision-language models to produce pointwise scores with chain-of-thought reasoning. These reward models are used to improve visual quality and instruction following, and the approach provides more reliable reward signals for text-to-image generation.
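To make the idea of a task-specific composite reward concrete, here is a minimal sketch of how pointwise scores might be combined into one scalar. It is not the paper's implementation: the summary does not state how the scores are aggregated. The weighted average, the score range, the task names, the weights, and every identifier (`PointwiseScores`, `TASK_WEIGHTS`, `composite_reward`) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointwiseScores:
    """Hypothetical per-image scores, assumed to lie in [0, 1]."""
    visual_quality: float
    instruction_following: float


# Hypothetical per-task weights; the source does not give any values.
TASK_WEIGHTS = {
    "text_to_image": {"visual_quality": 0.5, "instruction_following": 0.5},
    "hypothetical_other_task": {"visual_quality": 0.25, "instruction_following": 0.75},
}


def composite_reward(scores: PointwiseScores, task: str) -> float:
    """Combine pointwise scores into one scalar using task-specific weights.

    A weighted average is an assumed aggregation rule, chosen only to
    illustrate what "composite" could mean here.
    """
    weights = TASK_WEIGHTS[task]
    weighted_sum = (
        weights["visual_quality"] * scores.visual_quality
        + weights["instruction_following"] * scores.instruction_following
    )
    # Normalize so the result stays in [0, 1] even if weights do not sum to 1.
    return weighted_sum / sum(weights.values())
```

In an RLHF loop, a scalar like this would be computed for each generated image and passed to the policy update. In the described pipeline the pointwise scores come from a fine-tuned vision-language model, not from hand-set values.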
huggingface/daily-papers