We show that LLMs can be finetuned into unified multimodal models with instruction tuning.
We extend Visual Instruction Tuning to Visual-Predictive Instruction Tuning (VPiT) to study unified multimodal models. This simple yet effective approach enables LLMs to predict both visual and text tokens through instruction tuning, without requiring extensive architectural changes or pretraining.
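As a concrete illustration of the idea (field names and formatting below are hypothetical, not the released data schema), both task types can be written as the same kind of instruction–response pair; the only difference is whether image tokens appear in the instruction or in the response:

```python
# Hypothetical sketch of VPiT-style instruction-tuning pairs.

# Understanding: the image sits in the instruction, the answer is text.
understanding_pair = {
    "instruction": ["<image>", "Which mountain is shown in this photo?"],
    "response": ["This is K2, the second-highest mountain in the world."],
}

# Generation: the prompt is text, and the response contains visual tokens
# that the LLM must predict (later rendered to pixels by a separate decoder).
generation_pair = {
    "instruction": ["Generate an image of Chhogori"],
    "response": [
        "Chhogori, also known as K2, is the second-highest mountain in the world.",
        "<image>",
    ],
}
```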
Watch as MetaMorph handles both visual understanding and generation tasks (even implicit reasoning) after instruction tuning
Visual Understanding and Visual Generation are Coupled!
Visual generation capability can be unlocked with significantly less data when co-trained with visual understanding tasks. Our experiments show that 5,000 examples are enough to trigger visual generation, and 200K samples are sufficient to generate high-quality visual tokens when combined with understanding tasks, compared to the millions needed for pure generation training.
There is a strong correlation between visual understanding and generation capabilities: as models improve at understanding, their generation capabilities improve as well, and vice versa, creating a powerful synergistic effect.
The darker the heatmap, the better the performance. Understanding data proves to be significantly more valuable than generation data for both understanding and generation tasks. For example, with the same amount of total data (5M), 4M VQA + 1M Generation performs better than 1M VQA + 4M Generation data on both understanding and generation tasks.
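The comparison above is only about how a fixed data budget is split between the two task types. A minimal sketch of assembling such a co-training mixture (variable names and counts are illustrative, not the released pipeline):

```python
import random

def build_mixture(vqa_data, gen_data, n_vqa, n_gen, seed=0):
    """Subsample each source and interleave into one instruction-tuning set."""
    rng = random.Random(seed)
    mix = rng.sample(vqa_data, n_vqa) + rng.sample(gen_data, n_gen)
    rng.shuffle(mix)
    return mix

# Same 5M total budget, different splits (as in the comparison above):
# mix_a = build_mixture(vqa_data, gen_data, n_vqa=4_000_000, n_gen=1_000_000)
# mix_b = build_mixture(vqa_data, gen_data, n_vqa=1_000_000, n_gen=4_000_000)
```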
Visual generation capabilities show strong correlation with visually demanding tasks but weaker correlation with knowledge-based tasks. This suggests that visual understanding and generation are fundamental visual capabilities in autoregressive models.
A simple yet effective extension enabling LLMs to predict both visual and text tokens
VPiT extends visual instruction tuning with:
- Three major data categories formatted as instruction-tuning pairs
- A training and inference recipe for predicting both visual and text tokens (a loss sketch follows after this list)
- A unified model that demonstrates true multimodal capabilities
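As a rough illustration only, a joint objective for such a model could combine standard next-token cross-entropy on text positions with a regression loss on positions where the model predicts continuous visual tokens. Everything below (function and tensor names, the mask convention, the cosine-distance form of the visual loss) is an assumption for illustration, not MetaMorph's released implementation:

```python
import torch
import torch.nn.functional as F

def vpit_loss(text_logits, text_targets, pred_vis_tokens, target_vis_tokens,
              text_mask, vis_mask, vis_weight=1.0):
    """Sketch of a joint objective: cross-entropy on text tokens plus a
    cosine-distance regression on continuous visual tokens. The boolean
    masks mark which sequence positions carry text vs. visual targets."""
    # Text positions: ordinary next-token prediction over the vocabulary.
    ce = F.cross_entropy(
        text_logits[text_mask], text_targets[text_mask], reduction="mean")

    # Visual positions: regress the predicted continuous token toward the
    # target embedding (e.g., from a frozen vision encoder).
    cos = F.cosine_similarity(
        pred_vis_tokens[vis_mask], target_vis_tokens[vis_mask], dim=-1)
    vis_loss = (1.0 - cos).mean()

    return ce + vis_weight * vis_loss
```

Note that this sketch covers only training targets; at inference the predicted visual tokens would still need to be decoded back to pixels by a separate image decoder, a step not shown here.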
MetaMorph demonstrates competitive performance across a wide range of tasks. The model achieves strong results on understanding and generation benchmarks despite using only 64 tokens per image. The table below reports image QA benchmarks (MMBench-EN through TextVQA), video QA (MV-Bench), and text-to-image generation quality (FID, lower is better).
Method | Base LLM | MMBench-EN | SEED | RealWorldQA | MMVP | SQA | MMMU | VStar | ChartQA | TextVQA | MV-Bench | FID ↓ |
---|---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4V* | - | 75.8 | 69.1 | 61.4 | 50.0 | 75.7 | 56.8 | 55.0 | 78.5 | 78.0 | 43.5 | - |
T2I Models | | | | | | | | | | | | |
Stable Diffusion 1.5* | - | - | - | - | - | - | - | - | - | - | - | 9.6 |
Dalle 2* | - | - | - | - | - | - | - | - | - | - | - | 10.4 |
Imagen* | - | - | - | - | - | - | - | - | - | - | - | 7.3 |
Unified Models | | | | | | | | | | | | |
EMU-3* | - | 58.5 | 68.2 | 57.4 | 36.6† | 89.2 | 31.6 | 51.8† | 68.6 | 64.7 | - | 12.8 |
Janus* | DeepSeek 1.3B | 69.4 | 63.7 | - | - | - | 30.5 | - | - | - | - | 8.5 |
VILA-U256† | LLaMA-2 7B | 66.6 | 57.1 | 46.6 | 22.0 | 67.1 | 32.2 | 38.7 | 11.4 | 48.3* | 40.8 | 19.6 |
Transfusion* | - | - | - | - | - | - | - | - | - | - | - | 6.7 |
Chameleon-7B† | - | 35.7 | 27.2 | 19.6 | 0.0 | 50.3 | 28.4 | 37.1 | 0.0 | 0.0 | - | 26.7* |
Ours | LLaMA-3.1 8B | 75.2 | 71.8 | 58.3 | 48.3 | 83.2 | 41.8 | 44.0 | 37.1 | 60.5 | 48.8 | 11.8 |
MetaMorph demonstrates the ability to leverage the knowledge and capabilities of the pretrained language model for visual generation tasks:
These examples provide strong evidence that unifying language and vision in a single model allows visual generation to be guided by the linguistic reasoning and background knowledge of language models.
Generate an image of Chhogori
Chhogori, also known as K2, is the second-highest mountain in the world.
Generate an image of an Oncilla
The oncilla, also known as the little spotted cat, is a small and elusive wild feline native to Central and South America.
Generate an image of the view of Chizarira
Chizarira is a remote and rugged national park in Zimbabwe, known for its dramatic escarpments, diverse wildlife, and pristine wilderness ideal for off-the-beaten-path safaris.
Generate an image of a glass without water
MetaMorph successfully generates an empty glass, differentiating it from a full glass of water.
Generate an image of a glass filled with water
MetaMorph successfully generates a glass filled with water instead of an empty glass.
MetaMorph also demonstrates "reasoning capabilities" in multimodal tasks that go beyond simple mappings from text to image. In Physics of Language Models, the authors observe that LLMs precompute reasoning graphs before generating any tokens. Here we show that this phenomenon extends to multimodal generation:
These reasoning capabilities go far beyond traditional text-to-image models. By leveraging the language understanding and composition abilities of the LLM, MetaMorph can interpret and solve complex prompts that require multiple steps of reasoning, demonstrating a deeper and more flexible multimodal model.