The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation
A new research paper examines the instability of the Fréchet Inception Distance (FID), revealing that the metric acts as a random variable influenced by both training and generation seeds, potentially undermining the reproducibility of reported results in generative AI.
The Fragility of the De Facto Standard
The Fréchet Inception Distance (FID) has long served as the primary benchmark for evaluating the quality of image generation models. However, a critical issue persists in current academic reporting: most research papers present a single FID score derived from a single trained model using a single sampling seed. This practice ignores the inherent stochasticity involved in the model's lifecycle, treating a volatile metric as a constant.
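For reference, FID is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated images. The notation below is chosen here for illustration and is not taken from the paper: the subscript r denotes statistics of real images and g denotes statistics of generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Because \mu_g and \Sigma_g are estimated from a finite set of samples drawn from one trained model, they inherit randomness from both the training run and the sampling run, and so does the resulting score.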
Analyzing Variance via the "FID Lottery"
In the paper "The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation," authors Nicolas Dufour, Alexei A. Efros, and Patrick Pérez challenge the reliability of these single-point estimates. The researchers treat FID as a random variable, analyzing its variance across a two-axis panel consisting of:
- Training Seeds: The randomness associated with weight initialization and optimization.
- Generation Seeds: The randomness inherent in the sampling process during inference.
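One simple way to read such a two-axis panel is to split its total variance into a between-training-seed part and a within-model (generation-seed) part. The sketch below is a minimal illustration under stated assumptions, not the authors' analysis: the function name `decompose_fid_variance`, the panel layout (rows are training seeds, columns are generation seeds), and the toy numbers are all hypothetical.

```python
from statistics import fmean, pvariance


def decompose_fid_variance(panel):
    """Split the variance of a balanced FID panel into two parts.

    panel[i][j] is assumed to hold the FID of the model trained with
    training seed i and sampled with generation seed j (hypothetical layout).
    Returns (between_training, within_generation, total), all computed as
    population variances so that the first two sum to the third.
    """
    if not panel or any(len(row) != len(panel[0]) for row in panel):
        raise ValueError("panel must be a non-empty, rectangular list of rows")

    # Variance of per-model mean FID: spread attributable to training seeds.
    row_means = [fmean(row) for row in panel]
    between_training = pvariance(row_means)

    # Average variance inside each row: spread attributable to generation seeds.
    within_generation = fmean([pvariance(row) for row in panel])

    # Variance over every cell of the panel.
    total = pvariance([score for row in panel for score in row])
    return between_training, within_generation, total


# Toy numbers for illustration only; they are not results from the paper.
toy_panel = [[1.0, 3.0], [5.0, 7.0]]
print(decompose_fid_variance(toy_panel))
```

This is the law of total variance applied to a balanced panel, so it is a descriptive split. The between-training term still contains a small share of generation noise, and a formal variance-components estimate would correct for that.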
Methodology and Experimental Setup
To quantify this variance, the authors conducted an extensive empirical study using several hundred SiT (Scalable Interpolant Transformer) networks trained on class-conditional ImageNet at 256×256 resolution. By measuring how FID fluctuates across these runs, the study aims to reveal how much of a model's apparent "superiority" may be the result of a "lucky" seed rather than of an architectural or algorithmic improvement.
Implications for Model Reproducibility
The findings suggest that high variance in FID scores can lead to misleading conclusions about model performance. If the metric is highly sensitive to seed selection, reported state-of-the-art results may not be consistently reproducible. This points to a need for more robust reporting standards, such as reporting the mean and variance across multiple seeds, to ensure scientific rigor in generative-model evaluation.
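As a rough illustration of what multi-seed reporting could look like in practice, the sketch below summarizes a list of FID scores as mean, sample standard deviation, and run count. The helper names `summarize_fid` and `format_fid` and the example scores are hypothetical and are not drawn from the paper.

```python
from statistics import fmean, stdev


def summarize_fid(scores):
    """Return (mean, sample standard deviation, count) for a list of FID scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate a spread")
    return fmean(scores), stdev(scores), len(scores)


def format_fid(scores):
    """Format FID scores as 'mean +/- std (n=count)' for a results table."""
    mean, std, n = summarize_fid(scores)
    return f"{mean:.2f} +/- {std:.2f} (n={n})"


# Made-up scores from three hypothetical seeds, for illustration only.
example_scores = [2.0, 2.2, 2.4]
print(format_fid(example_scores))  # prints "2.20 +/- 0.20 (n=3)"
```

Reporting the run count next to the spread lets readers judge whether a gap between two models is larger than the seed-to-seed noise.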