LISA: Likelihood Score Alignment for Visual-Condition Controllable Generation

Researchers introduce LISA, an approach to visual-condition controllable generation that re-examines the dual-branch paradigm through the lens of score-based generative modeling, with the aim of clarifying the role of the side network and improving its training efficiency.

Revisiting the Dual-Branch Paradigm

In visual-condition controllable generation, the prevailing architecture is the dual-branch paradigm: a dedicated side network is trained to encode the visual condition, and its intermediate-layer features are fused into a frozen, pretrained main network. Although this strategy has been highly successful, the role the side branch actually plays and the efficiency of its training remain largely underexplored.
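The dual-branch layout described above can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: the class names `FrozenMainNetwork` and `SideNetwork`, the toy dimensions, the additive fusion, and the zero-initialised `gate` (a device borrowed from ControlNet-style designs). It is not the LISA implementation, which the source does not describe.

```python
import math
import random


def linear(weights, x):
    """Multiply a weight matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class FrozenMainNetwork:
    """Hypothetical stand-in for the pretrained backbone; weights are never updated."""

    def __init__(self, dim, num_layers, rng):
        self.weights = [random_matrix(dim, dim, rng) for _ in range(num_layers)]

    def forward(self, x, side_features=None):
        h = list(x)
        for i, w in enumerate(self.weights):
            h = [math.tanh(v) for v in linear(w, h)]
            if side_features is not None:
                # Fusion step (assumed additive): inject the side branch's
                # feature for this layer into the backbone's activation.
                h = [a + b for a, b in zip(h, side_features[i])]
        return h


class SideNetwork:
    """Hypothetical trainable branch that encodes the visual condition."""

    def __init__(self, cond_dim, dim, num_layers, rng):
        self.weights = [random_matrix(dim, cond_dim, rng) for _ in range(num_layers)]
        # Assumption: a scalar gate starting at zero, so the fused model
        # initially reproduces the frozen backbone exactly.
        self.gate = 0.0

    def forward(self, condition):
        # One feature vector per backbone layer.
        return [[self.gate * v for v in linear(w, condition)] for w in self.weights]


rng = random.Random(0)
main = FrozenMainNetwork(dim=4, num_layers=2, rng=rng)
side = SideNetwork(cond_dim=3, dim=4, num_layers=2, rng=rng)

x = [0.1, -0.2, 0.3, 0.4]       # toy input to the backbone
condition = [1.0, 0.0, -1.0]    # toy visual condition, e.g. a flattened edge map

unconditional = main.forward(x)
at_init = main.forward(x, side.forward(condition))
side.gate = 1.0                 # pretend training has opened the gate
conditioned = main.forward(x, side.forward(condition))
```

With the gate at zero, `at_init` equals `unconditional`, so the pretrained behaviour is untouched at the start of training. Once the gate opens, only the side branch's parameters change what the backbone produces.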

Score-Based Generative Modeling Perspective

The authors of the LISA paper revisit this mainstream paradigm theoretically, analyzing it within the framework of score-based generative modeling. By examining how the frozen main network preserves the learned data distribution and how the side branch introduces conditional guidance, the research aims to improve the alignment between the likelihood score and the visual conditions.
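For context, score-based modeling commonly splits a conditional score with Bayes' rule: grad_x log p(x | c) = grad_x log p(x) + grad_x log p(c | x). The first term is the prior score and the second is the likelihood score. Reading the main network as supplying the prior score and the side branch as supplying the likelihood score is an interpretation assumed here; the source does not give the paper's exact formulation. The toy below checks the identity numerically for a one-dimensional Gaussian case, where every term has a closed form. The function names and the Gaussian setup are illustrative only.

```python
def prior_score(x):
    """Score of the prior p(x) = N(0, 1)."""
    return -x


def likelihood_score(x, c, sigma):
    """Gradient w.r.t. x of log p(c | x), where p(c | x) = N(c; x, sigma^2)."""
    return (c - x) / sigma**2


def posterior_score(x, c, sigma):
    """Score of the exact posterior p(x | c), which is Gaussian in this toy."""
    var = sigma**2 / (1.0 + sigma**2)
    mean = c / (1.0 + sigma**2)
    return -(x - mean) / var


x, c, sigma = 0.7, 2.0, 0.5

# Bayes' rule in score form: prior score plus likelihood score.
combined = prior_score(x) + likelihood_score(x, c, sigma)
exact = posterior_score(x, c, sigma)

print(combined, exact)  # the two values agree up to floating-point error
```

In this picture, a frozen backbone already provides the first term, so the conditioning branch only has to account for the second.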

Key Objectives of the Research

The primary goal of the LISA framework is to address the inefficiencies inherent in current side-network training. By focusing on "Likelihood Score Alignment," the researchers seek to refine how visual conditions steer the generation process without compromising the integrity of the pretrained main network's generative capabilities.
