We demonstrate that Large Language Models (LLMs) can be effectively finetuned into unified multimodal models capable of both understanding and generation using instruction tuning.
We release the MetaMorph training code. You can find it in our GitHub repository. Explore the code and experiment with Visual-Predictive Instruction Tuning!
We extend Visual Instruction Tuning to Visual-Predictive Instruction Tuning (VPiT) to study unified multimodal models. This simple yet effective approach enables LLMs to predict both visual and text tokens through instruction tuning, without requiring extensive architectural changes or pretraining.
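To make the training signal concrete, here is a minimal sketch of how a mixed-modality loss could be computed, assuming the backbone emits next-token logits at text positions and continuous embeddings at visual-token positions, with standard cross-entropy on text and a cosine-similarity regression loss on visual tokens. All function and argument names are illustrative, not the released MetaMorph code.

```python
import torch
import torch.nn.functional as F

def vpit_loss(text_logits, text_labels, pred_visual, target_visual, visual_mask, vis_weight=1.0):
    """Illustrative mixed-modality loss: cross-entropy on text positions plus a
    cosine-similarity regression loss on continuous visual-token positions."""
    # Language-modeling loss over text targets (-100 marks positions to ignore).
    ce = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_labels.reshape(-1),
        ignore_index=-100,
    )

    # Regression loss over visual targets: 1 - cosine similarity per visual token.
    if visual_mask.any():
        pred = pred_visual[visual_mask]      # (num_visual_tokens, dim)
        tgt = target_visual[visual_mask]     # (num_visual_tokens, dim)
        reg = (1.0 - F.cosine_similarity(pred, tgt, dim=-1)).mean()
    else:
        reg = pred_visual.new_zeros(())

    return ce + vis_weight * reg

# Toy usage with random tensors: batch=2, seq=8, vocab=100, visual dim=16.
B, T, V, D = 2, 8, 100, 16
loss = vpit_loss(
    text_logits=torch.randn(B, T, V),
    text_labels=torch.randint(0, V, (B, T)),
    pred_visual=torch.randn(B, T, D),
    target_visual=torch.randn(B, T, D),
    visual_mask=torch.rand(B, T) > 0.5,
)
```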
See our unified multimodal model performing various tasks, including understanding, generation, and implicit reasoning.
Watch as MetaMorph handles both visual understanding and generation tasks after instruction tuning.
Visual Understanding and Visual Generation are Coupled!
Visual generation can be unlocked with significantly less data when co-trained with visual understanding tasks. In our experiments, as few as 5,000 samples are enough to trigger visual generation, and 200K samples suffice for high-quality visual tokens when combined with understanding data, compared with the millions typically needed when training on generation data alone.
There is a strong correlation between visual understanding and generation capabilities. As models improve at understanding (e.g., higher VQA scores), their generation quality improves as well (e.g., lower FID), and vice versa, creating a synergistic effect.
The darker the heatmap cell, the better the performance. Understanding data proves to be significantly more valuable than generation data for improving both understanding (e.g., MMBench) and generation (e.g., FID) tasks. For example, with the same total data amount (5M samples), using 4M VQA + 1M Generation data yields better results on both task types than using 1M VQA + 4M Generation data.
Visual generation capabilities (measured by FID) show strong correlation with visually demanding understanding tasks (like MMBench-General, MMBench-V&C, MMBench-VisionCentric) but weaker correlation with knowledge-based tasks (MMBench-Knowledge). This suggests that visual understanding and generation are fundamental visual capabilities intertwined within autoregressive models.
A simple yet effective extension enabling LLMs to predict both visual and text tokens.
VPiT extends visual instruction tuning with special tokens `<image_start>` and `<image_end>` that delineate visual token sequences. Three major data categories are formatted into instruction-tuning pairs, for example:

- Understanding: Prompt: `{<visual_tokens>, <text prompt>}`, Response: `{<text response>}`
- Generation: Prompt: `{<text prompt>}`, Response: `{"Here is an image...", <image_start>, <visual_tokens>, <image_end>}`
To visualize the continuous visual tokens predicted by MetaMorph, we map them back to pixel space with a separately finetuned diffusion model conditioned on these tokens.
Note: This visualization step is primarily for analysis and demonstrating the model's capabilities, not for competing with state-of-the-art high-fidelity image generation models.
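As a rough illustration of how such a visualization head could be wired up, the sketch below projects predicted visual tokens into the conditioning space of a separately trained image decoder. The `VisualTokenProjector` module, its dimensions, and the usage are placeholder assumptions, not the released visualizer.

```python
import torch
import torch.nn as nn

class VisualTokenProjector(nn.Module):
    """Hypothetical adapter that maps predicted continuous visual tokens into the
    conditioning space expected by a separately trained image decoder."""

    def __init__(self, token_dim: int = 1152, cond_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(token_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, token_dim), e.g. 64 tokens per image.
        return self.proj(visual_tokens)

projector = VisualTokenProjector()
conditioning = projector(torch.randn(1, 64, 1152))  # -> (1, 64, 768)
# `conditioning` would then be fed to an image decoder (e.g., a finetuned diffusion
# model) purely to render the predicted tokens for inspection.
```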
A unified model demonstrating true multimodal capabilities through VPiT.
Column groups: Image QA (MMBench-EN, SEED, RealWorldQA, MMVP, SQA, MMMU, VStar, ChartQA, TextVQA), Video QA (MV-Bench), Generation (FID ↓).

| Method | Base LLM | MMBench-EN | SEED | RealWorldQA | MMVP | SQA | MMMU | VStar | ChartQA | TextVQA | MV-Bench | FID ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4V* | - | 75.8 | 69.1 | 61.4 | 50.0 | 75.7 | 56.8 | 55.0 | 78.5 | 78.0 | 43.5 | - |
| **T2I Models** | | | | | | | | | | | | |
| Stable Diffusion 1.5* | - | - | - | - | - | - | - | - | - | - | - | 9.6 |
| Dalle 2* | - | - | - | - | - | - | - | - | - | - | - | 10.4 |
| Imagen* | - | - | - | - | - | - | - | - | - | - | - | 7.3 |
| **Unified Models** | | | | | | | | | | | | |
| EMU-3* | - | 58.5 | 68.2 | 57.4 | 36.6† | 89.2 | 31.6 | 51.8† | 68.6 | 64.7 | - | 12.8 |
| Janus* | DeepSeek 1.3B | 69.4 | 63.7 | - | - | - | 30.5 | - | - | - | - | 8.5 |
| VILA-U256† | LLaMA-2 7B | 66.6 | 57.1 | 46.6 | 22.0 | 67.1 | 32.2 | 38.7 | 11.4 | 48.3* | 40.8 | 19.6 |
| Transfusion* | - | - | - | - | - | - | - | - | - | - | - | 6.7 |
| Chameleon-7B† | - | 35.7 | 27.2 | 19.6 | 0.0 | 50.3 | 28.4 | 37.1 | 0.0 | 0.0 | - | 26.7* |
| MetaMorph (Ours) | LLaMA-3.1 8B | 75.2 | 71.8 | 58.3 | 48.3 | 83.2 | 41.8 | 44.0 | 37.1 | 60.5 | 48.8 | 11.8 |
MetaMorph demonstrates strong performance across a wide range of tasks, achieving results competitive with or superior to other open-source unified models of similar size. Notably, it performs well on both complex visual understanding benchmarks (like MMBench, SEED, MMMU) and shows reasonable visual generation quality (FID score), despite using only 64 tokens per image/frame and relying solely on instruction tuning without extensive pretraining stages common in other models.
MetaMorph demonstrates the ability to leverage the knowledge and capabilities embedded within the pretrained language model for visual generation tasks:
These examples suggest that unifying language and vision allows visual generation to be guided by the linguistic reasoning and background knowledge inherent in LLMs.
Generate an image of Chhogori
Chhogori, also known as K2, is the second-highest mountain in the world. The model draws on this knowledge to generate the correct mountain.
Generate an image of an Oncilla
The oncilla is a small spotted cat. The model accesses its knowledge base to visualize this specific animal.
Generate an image of the view of Chizarira
Chizarira is a national park in Zimbabwe. The model generates a plausible landscape view based on this geographic knowledge.
Generate an image of a glass without water
MetaMorph correctly interprets the negation and generates an empty glass, demonstrating semantic understanding.
Generate an image of a glass filled with water
Contrasting the previous example, the model accurately generates a glass containing water when negation is removed.
MetaMorph exhibits reasoning capabilities during generation, going beyond simple text-to-image mapping. Similar to how LLMs might precompute reasoning steps internally before generating text, MetaMorph appears to perform implicit reasoning before generating visual tokens:
These capabilities highlight the potential of unified models like MetaMorph to perform more complex, compositional tasks by leveraging the underlying LLM's reasoning abilities.