Multimodal Foundation Model Architecture
Autoregressive models excel at discrete data generation, while diffusion models excel at continuous data generation. In this work, we integrate the two into a unified, LLM-like architecture for multimodal generation. Specifically, we switch the attention mask to be bidirectional among image tokens and run a multi-step denoising process to generate an image. This strategy not only simplifies the model structure but also lets the model attend to the full context during generation, ultimately leading to substantial performance improvements.
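The hybrid attention pattern described above can be sketched as follows: a minimal, hypothetical construction (not the paper's actual implementation) that starts from a standard causal mask and opens up bidirectional attention within each contiguous block of image tokens, so that every image token can see the whole image being denoised while text tokens remain strictly autoregressive. The function name and the `"text"`/`"image"` token typing are illustrative assumptions.

```python
import numpy as np

def hybrid_attention_mask(token_types):
    """Build a boolean attention mask (True = may attend).

    token_types: sequence of "text" or "image", one entry per position.
    Text tokens attend causally; tokens inside the same contiguous
    image block attend to each other bidirectionally. (Illustrative
    sketch, not the authors' implementation.)
    """
    n = len(token_types)
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal base mask

    # Find contiguous image spans and make attention within each
    # span bidirectional (a sentinel "text" flushes a trailing span).
    start = None
    for i, t in enumerate(list(token_types) + ["text"]):
        if t == "image" and start is None:
            start = i
        elif t != "image" and start is not None:
            mask[start:i, start:i] = True  # full attention inside the image
            start = None
    return mask
```

During generation, this mask stays fixed while the image positions are refined over multiple denoising steps, so each step conditions on the full (text + image) context rather than only on earlier tokens.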