Multimodal Foundation Models

We aim to develop a comprehensive multimodal foundation model capable of handling a wide range of tasks across diverse modalities, including text, image, and video understanding and generation. The model will be designed for easy adaptation to various downstream applications and scenarios.

Our research will focus on advancing the architecture of the foundation model, optimizing multimodal training algorithms, improving inference accuracy and efficiency, and exploring real-world applications.

Multimodal Foundation Model Architecture

Autoregressive and diffusion models excel in different domains: discrete and continuous data generation, respectively. In this work, we integrate them into a unified LLM-like architecture for multimodal generation. Specifically, we make the attention mask bidirectional among image tokens and run a multi-step denoising process to generate an image. This strategy not only simplifies the model structure but also lets the model attend to the whole context during generation, ultimately leading to substantial performance improvements.
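The hybrid masking idea above can be sketched as follows: attention stays causal over the text tokens, while every token inside an image span may attend to every other token in that span. The function name and the span representation are illustrative assumptions, not the model's actual implementation.

```python
def build_hybrid_mask(seq_len, image_spans):
    """Build an attention mask that is causal by default but bidirectional
    within each image-token span. mask[i][j] == True means token i may
    attend to token j; image_spans is a list of half-open (start, end) ranges."""
    # Standard causal mask: each token sees itself and all earlier tokens.
    mask = [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    # Within each image span, allow full bidirectional attention.
    for start, end in image_spans:
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = True
    return mask
```

For a sequence of six tokens where positions 2-4 hold image tokens, `build_hybrid_mask(6, [(2, 5)])` lets token 2 attend forward to token 4, which a purely causal mask would forbid, while text positions 0-1 and 5 keep ordinary causal visibility.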

Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling

We introduce Openstory++, a large-scale dataset that augments both images and text with instance-level annotations. The dataset can be used to train multimodal generative models, in particular instance-focused story visualization models. We further develop a tailored training methodology that emphasizes entity-centric image-text generation, ensuring that models learn to effectively interweave visual and textual information. Concretely, Openstory++ streamlines keyframe extraction from open-domain videos, employing vision-language models to generate captions that are then polished by a large language model for narrative continuity.

When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance

We analyze how language bias contributes to hallucinations in VLMs and then introduce Cross-Modal Guidance (CMG), a training-free decoding method that mitigates hallucinations by leveraging the difference between the output distributions of the original model and a variant with degraded visual-language attention. In practice, we adaptively mask the attention weights of the most influential image tokens in selected transformer layers, corrupting visual-language perception as a concrete form of degradation. This degradation-contrastive decoding emphasizes the perception of visual context and therefore significantly reduces language bias without harming the general capabilities of the VLM.
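One plausible way to realize the contrast between the two output distributions is to amplify, at each decoding step, the part of the prediction that disappears when visual attention is degraded. The combination rule and the guidance strength `alpha` below are illustrative assumptions, not CMG's exact formulation.

```python
def cmg_adjust_logits(orig_logits, degraded_logits, alpha=1.0):
    """Sketch of Cross-Modal-Guidance-style decoding.
    orig_logits: next-token logits from the unmodified VLM.
    degraded_logits: logits from the forward pass with masked image-token
    attention, i.e. the language-bias-dominated prediction.
    Tokens whose logits drop under degradation are visually grounded and
    get boosted; tokens driven purely by language priors are suppressed."""
    return [o + alpha * (o - d) for o, d in zip(orig_logits, degraded_logits)]
```

For example, a token with original logit 2.0 that falls to 0.5 under degradation (strongly image-dependent) is boosted to 3.5, while a token whose logit rises without visual input (a likely language-bias hallucination) is pushed down.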