From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion

Researchers propose replacing the 2D feature grids traditionally used as shared representations in multimodal image fusion with a compact 1D token interface, aiming to better balance local detail preservation with global appearance consistency.

The Challenge of Multimodal Image Fusion

Multimodal image fusion integrates complementary information from different imaging modalities into a single composite image. The primary technical objective is a fused output that preserves fine local detail while keeping the global appearance consistent across the image.

Current state-of-the-art methods typically build their shared representations on 2D feature grids. These grids are highly effective at modeling local spatial structure and fine-grained texture, but they often struggle to capture and use image-level global appearance factors, which can leave local precision and global coherence out of step.
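
The locality of a grid-based shared representation can be illustrated with a small sketch. The element-wise max rule, the function name fuse_grids, and the toy values are assumptions chosen for illustration; the source does not describe how existing methods actually fuse grid features.

```python
# Illustrative only: per-location fusion on a shared 2D feature grid.
# The max rule and every name here are assumptions, not taken from the paper.

def fuse_grids(grid_a, grid_b):
    """Fuse two H x W grids by keeping the larger response at each location."""
    return [
        [max(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(grid_a, grid_b)
    ]

# Two toy single-channel feature maps, one per modality.
modality_a = [[0.9, 0.1], [0.2, 0.3]]
modality_b = [[0.4, 0.8], [0.1, 0.7]]

fused = fuse_grids(modality_a, modality_b)
print(fused)  # [[0.9, 0.8], [0.2, 0.7]]

# Perturb one location in modality A: only the matching output cell changes.
perturbed = [row[:] for row in modality_a]
perturbed[1][1] = 5.0
fused_perturbed = fuse_grids(perturbed, modality_b)
changed = [
    (i, j)
    for i in range(2)
    for j in range(2)
    if fused_perturbed[i][j] != fused[i][j]
]
print(changed)  # [(1, 1)]
```

Each output cell depends only on the input cells at the same position. In a representation like this, an image-level factor such as overall brightness has no dedicated slot and is instead spread across every cell.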

Proposed Solution: The 1D Token Interface

To address the limitations of 2D grids, the authors introduce a reformed shared representation built around a compact 1D token interface. A frozen pretrained image tokenizer converts the visual input into 1D tokens, so the shared representation is no longer tied to a spatial grid. According to the authors, this provides a more efficient mechanism for managing the trade-off between local structural integrity and global consistency.

Key Technical Innovations

  • Tokenization: A frozen pretrained image tokenizer converts visual data into 1D tokens.
  • Representation Reform: The shared representation moves away from 2D feature grids, which are limited in modeling global appearance factors.
  • Optimized Fusion: A streamlined interface is designed to integrate complementary multimodal information more effectively than traditional spatial-grid-based methods.
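
The idea of a compact 1D token interface can be sketched in a few lines of dependency-free Python. This is not the authors' tokenizer: the fixed random queries stand in for a frozen pretrained tokenizer, the attention-style pooling and the averaging fusion rule are assumptions, and all function names are hypothetical. The sketch only shows that the number of tokens is fixed regardless of grid size and that each token summarizes the whole image.

```python
import math
import random

def make_frozen_queries(num_tokens, dim, seed=0):
    """Stand-in for a frozen tokenizer: fixed query vectors, never updated."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tokenize(grid, queries):
    """Pool an H x W x C grid into len(queries) tokens of size C."""
    cells = [cell for row in grid for cell in row]  # flatten to N x C
    tokens = []
    for query in queries:
        # Every token attends over ALL cells, so it carries image-level context.
        scores = [sum(q * c for q, c in zip(query, cell)) for cell in cells]
        weights = softmax(scores)
        tokens.append([
            sum(w * cell[d] for w, cell in zip(weights, cells))
            for d in range(len(query))
        ])
    return tokens

def fuse_tokens(tokens_a, tokens_b):
    """Hypothetical fusion rule: average the two token sequences."""
    return [
        [(x + y) / 2 for x, y in zip(token_a, token_b)]
        for token_a, token_b in zip(tokens_a, tokens_b)
    ]

def random_grid(height, width, dim, seed):
    """Toy H x W x C feature grid with values in [0, 1)."""
    rng = random.Random(seed)
    return [
        [[rng.random() for _ in range(dim)] for _ in range(width)]
        for _ in range(height)
    ]

queries = make_frozen_queries(num_tokens=4, dim=3)
tokens_a = tokenize(random_grid(16, 16, 3, seed=1), queries)
tokens_b = tokenize(random_grid(16, 16, 3, seed=2), queries)
fused_tokens = fuse_tokens(tokens_a, tokens_b)

# The interface size is set by the queries, not by the image resolution.
tokens_large = tokenize(random_grid(32, 32, 3, seed=3), queries)
print(len(fused_tokens), len(tokens_large))  # 4 4
```

A complete system would also need a decoder that maps the fused tokens back to an image; the available source does not describe that step, so it is left out here.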

Note: The provided source material was truncated; specific implementation details regarding the tokenizer's architecture and the quantitative results of the fusion performance are not available in the provided snippet.
