DiffusionBench: Establishing a Holistic Evaluation Framework for Diffusion Transformers
Researchers introduce DiffusionBench and NanoGen to address the narrow evaluation scope of Diffusion Transformers (DiT), moving beyond class-conditional ImageNet generation toward more comprehensive text-to-image benchmarks.
The Limitation of Current DiT Evaluation
Current research on Diffusion Transformers (DiT) for image generation has largely converged on a single evaluation paradigm: class-conditional generation on the ImageNet dataset. This setup makes it easy to track Fréchet Inception Distance (FID) and related metrics, but there is growing concern within the research community that these metrics may no longer reflect genuine progress in generative modeling.
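For context, FID fits a Gaussian to Inception-v3 features of real images and another to features of generated images, then measures the Fréchet distance between the two: FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^(1/2)). The sketch below is a simplified illustration only, not the evaluation code used by the authors. It assumes diagonal covariances (so the trace term reduces to a per-dimension sum), uses population variance, and takes pre-extracted feature vectors as plain lists. The function names are hypothetical.

```python
import math
from statistics import fmean, pvariance


def gaussian_stats(features):
    """Return per-dimension mean and population variance of feature vectors."""
    columns = list(zip(*features))  # transpose: one tuple per feature dimension
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return means, variances


def frechet_distance_diag(real_features, generated_features):
    """Fréchet distance between two Gaussians, ASSUMING diagonal covariances.

    With diagonal covariances, Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) becomes
    sum_i (sqrt(s_r[i]) - sqrt(s_g[i]))^2. Full FID uses full covariance
    matrices and a matrix square root instead.
    """
    mu_r, var_r = gaussian_stats(real_features)
    mu_g, var_g = gaussian_stats(generated_features)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_g)
    )
    return mean_term + cov_term


# Toy 2-dimensional "features" standing in for Inception activations.
real = [[0.0, 0.0], [2.0, 2.0]]
generated = [[1.0, 0.0], [3.0, 8.0]]
score = frechet_distance_diag(real, generated)
```

A lower score means the two feature distributions are closer. Because the score depends only on the first two moments of one feature extractor on one dataset, a model can improve it without a matching gain in broader generative quality, which is the concern raised above.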
Moving Beyond Class-Conditional Generation
The primary alternative to class-conditional generation is text-to-image (T2I) generation. Many DiT studies have bypassed T2I on the perception that training and evaluating such models are prohibitively costly or inconvenient. The authors of this study argue that this perception is outdated and that the field needs a more holistic approach to validating new architectures.
Introducing NanoGen and DiffusionBench
To bridge this gap, the researchers introduce NanoGen, a framework designed to make the training and evaluation of text-to-image models more accessible. This effort culminates in DiffusionBench, a holistic evaluation suite intended to provide a more rigorous and diverse set of benchmarks for Diffusion Transformers, ensuring that improvements in model performance translate to real-world generative quality rather than just optimized scores on a single dataset.
Note: The source description is truncated, so the implementation details of NanoGen and the full quantitative results of DiffusionBench are not covered in this summary.