LISA: Likelihood Score Alignment for Visual-Condition Controllable Generation
Researchers introduce LISA, an approach to visual-condition controllable generation that re-examines the dual-branch paradigm through the lens of score-based generative modeling, aiming to clarify the role of side-network encoding and to improve training efficiency.
Revisiting the Dual-Branch Paradigm
In visual-condition controllable generation, the prevailing architecture is the dual-branch paradigm. A dedicated side network is trained to encode the visual condition, and its intermediate-layer features are fused into a frozen, pretrained main network. Although this strategy has been highly successful, the role the side branch actually plays and the efficiency of its training remain largely underexplored.
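The dual-branch layout described above can be illustrated with a toy sketch in plain Python. It is not the LISA implementation or any particular library's API: the class names `FrozenMain` and `SideBranch`, the two-dimensional weights, additive fusion, and the zero-initialized output projection (a convention borrowed from ControlNet-style designs) are all assumptions made for illustration.

```python
def linear(weights, x):
    """Dense layer without bias: `weights` is a sequence of rows, `x` a list of floats."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


class FrozenMain:
    """Hypothetical stand-in for the pretrained main network (weights never updated)."""

    def __init__(self):
        # Tuples signal that these weights stay frozen.
        self.w_in = ((1.0, -1.0), (0.5, 2.0))
        self.w_out = ((1.0, 1.0), (2.0, -1.0))

    def forward(self, x, side_features=None):
        hidden = relu(linear(self.w_in, x))
        if side_features is not None:
            # Fuse side-branch features into the intermediate layer by addition.
            hidden = [h + s for h, s in zip(hidden, side_features)]
        return linear(self.w_out, hidden)


class SideBranch:
    """Hypothetical trainable side network that encodes the visual condition."""

    def __init__(self):
        self.encoder = [[1.0, 0.0], [0.0, 1.0]]
        # Assumption: zero-initialized projection, so the side branch
        # contributes nothing before any training has happened.
        self.out_proj = [[0.0, 0.0], [0.0, 0.0]]

    def forward(self, condition):
        return linear(self.out_proj, relu(linear(self.encoder, condition)))


def generate(main, side, x, condition):
    """Run the main network with side-branch features fused in."""
    return main.forward(x, side_features=side.forward(condition))


main, side = FrozenMain(), SideBranch()
x, condition = [1.0, 2.0], [0.5, -0.5]

print(generate(main, side, x, condition))  # [4.5, -4.5], same as main.forward(x)

# Pretend training has updated only the side branch.
side.out_proj = [[1.0, 0.0], [0.0, 1.0]]
print(generate(main, side, x, condition))  # [5.0, -3.5]
```

Only `SideBranch` holds weights that change. The main network's output without side features is identical before and after the simulated update, which is the sense in which the pretrained model is preserved.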
Score-Based Generative Modeling Perspective
The authors of the LISA paper revisit this mainstream paradigm theoretically, analyzing it within the framework of score-based generative modeling. They examine how the main network preserves the learned data distribution and how the side branch introduces conditional guidance, with the aim of improving the alignment between the likelihood score and the visual conditions.
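For orientation, the standard identity behind this kind of analysis is the Bayes decomposition of the conditional score. The notation here is assumed rather than taken from the paper: $x_t$ is the noisy sample at diffusion time $t$, $c$ is the visual condition, and $p_t$ is the corresponding density. Since $p_t(x_t \mid c) = p_t(x_t)\, p_t(c \mid x_t) / p(c)$ and $p(c)$ does not depend on $x_t$, taking the gradient of the logarithm gives:

```latex
\nabla_{x_t} \log p_t(x_t \mid c)
  = \underbrace{\nabla_{x_t} \log p_t(x_t)}_{\text{prior score}}
  + \underbrace{\nabla_{x_t} \log p_t(c \mid x_t)}_{\text{likelihood score}}
```

Reading the frozen main network as supplying the first term and the side branch as supplying the second is an interpretation of the description above, not a formula quoted from the paper.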
Key Objectives of the Research
The primary goal of the LISA framework is to address the inefficiencies of current side-network training. By focusing on "Likelihood Score Alignment," the researchers seek to refine how visual conditions steer the generation process without degrading the generative capabilities of the pretrained main network.