MetaMorph

Multimodal Understanding and Generation via Instruction Tuning

We show that LLMs can be finetuned into unified multimodal models with instruction tuning.

An LLM is already VERY CLOSE to a Unified Model!
MetaMorph Overview

Shengbang Tong1,2,*,†, David Fan1, Jiachen Zhu1,2,*, Yunyang Xiong3, Xinlei Chen1,
Koustuv Sinha1, Michael Rabbat1, Yann LeCun1,2, Saining Xie2, Zhuang Liu1,†

*Work done at Meta. †Corresponding authors.

We extend Visual Instruction Tuning to Visual-Predictive Instruction Tuning to study unified multimodal models. This simple yet effective approach enables LLMs to predict both visual and text tokens through instruction tuning, without requiring extensive architectural changes or pretraining.

1. We discover that generation and understanding are mutually beneficial. Through extensive experiments, we reveal that visual generation emerges naturally as models improve at understanding, requiring as few as 200K samples when co-trained, compared to the millions needed traditionally.
2. Our Visual-Predictive Instruction Tuning (VPiT) extends existing instruction tuning to predict continuous visual tokens alongside discrete text tokens. This simple modification unlocks powerful multimodal capabilities while maintaining the efficiency of instruction tuning.
3. We train the MetaMorph model using VPiT, achieving competitive performance across benchmarks. More importantly, we find intriguing evidence of modality unification: the model can leverage LLM knowledge for generation and perform implicit reasoning before generating visual tokens.

MetaMorph Examples

MetaMorph Demo

Multimodal Understanding & Generation

Watch as MetaMorph handles both visual understanding and generation tasks (including implicit reasoning) after instruction tuning

Key Findings

Visual Understanding and Visual Generation are Coupled!

Finding 1
Visual Generation Emerges Naturally From Understanding
Only a very small number of samples is needed to unlock visual generation when co-training with understanding tasks.
Data Efficiency Comparison

Visual generation capability can be unlocked with significantly less data when co-trained with visual understanding tasks. Our experiments show that 5,000 examples are enough to trigger visual generation, and 200K samples are sufficient to generate high-quality visual tokens when combined with understanding tasks, compared to the millions needed for pure generation training.

Finding 2
Visual Understanding and Generation are Synergistic
Better understanding leads to better generation and vice versa
Understanding-Generation Correlation

There is a strong correlation between visual understanding and generation capabilities. As models improve their understanding abilities, their generation capabilities naturally improve, and vice versa, creating a powerful synergistic effect.

Finding 3
Understanding Data is Much More Effective for Both Understanding and Generation
Understanding data significantly improves both understanding and generation performance compared to generation data.
Data Impact Comparison

The darker the cell in the heatmap, the better the performance. Understanding data proves significantly more valuable than generation data for both understanding and generation tasks. For example, with the same total data budget (5M samples), a mix of 4M VQA + 1M generation samples outperforms 1M VQA + 4M generation samples on both understanding and generation tasks.

Finding 4
Visual Generation Aligns With Visually Demanding Understanding Tasks
Generation ability correlates strongly with general, text & chart, and vision-centric tasks, but not with knowledge-based tasks
Task Correlation Analysis

Visual generation capabilities show strong correlation with visually demanding tasks but weaker correlation with knowledge-based tasks. This suggests that visual understanding and generation are fundamental visual capabilities in autoregressive models.

Visual Predictive Instruction Tuning

A simple yet effective extension enabling LLMs to predict both visual and text tokens

Multimodal Input
Process visual and text tokens in any sequence order
Unified Processing
LLM with separate text and vision heads
Token Generation
Generate text and visual tokens with diffusion visualization
Training Process
Multimodal next-token prediction with instruction tuning

VPiT extends visual instruction tuning with:

  • Multimodal Input Processing:
    • Visual inputs processed through pretrained vision encoder
    • Interpolation to 64 visual tokens
    • Trainable projection layer for dimension matching
  • Model Architecture:
    • Language Head: Generates probability distribution over vocabulary for text token prediction, trained with cross-entropy loss
    • Vision Head: Produces continuous visual embeddings matching encoder dimensions, trained with cosine similarity loss
    • Adapter layer between vision encoder and LLM
  • Token Prediction:
    • Text tokens: Cross-entropy loss on language head
    • Visual tokens: Cosine similarity loss on vision head
    • Special tokens ⟨image_start⟩ and ⟨image_end⟩ for visual sequences
  • Unified Framework:
    • Same architecture and next-token paradigm for both modalities
    • Single model processes both text and vision inputs
    • Gradients flow only through response tokens (see the loss sketch after this list)
    • Maintains efficiency of instruction tuning approach
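
To make the two objectives above concrete, here is a minimal PyTorch-style sketch of the dual-head loss, assuming equal weighting of the two loss terms and targets already shifted for next-token prediction; the class and argument names (VPiTHeads, vision_dim, and so on) are illustrative, not the released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class VPiTHeads(nn.Module):
    """Illustrative dual-head setup: a language head for text tokens and a
    vision head that regresses continuous visual embeddings."""

    def __init__(self, hidden_dim=4096, vocab_size=128256, vision_dim=1152):
        super().__init__()
        self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)  # logits over the text vocabulary
        self.vision_head = nn.Linear(hidden_dim, vision_dim)          # continuous visual embeddings

    def loss(self, hidden_states, text_targets, text_mask, vision_targets, vision_mask):
        # hidden_states: (B, T, hidden_dim) from the LLM backbone.
        # text_mask / vision_mask mark response positions whose next token is a
        # text / visual token; prompt positions contribute no loss.
        text_loss = F.cross_entropy(
            self.lm_head(hidden_states)[text_mask],   # (N_text, vocab_size)
            text_targets[text_mask],                  # (N_text,)
        )
        pred_vis = self.vision_head(hidden_states)[vision_mask]          # (N_vis, vision_dim)
        cos = F.cosine_similarity(pred_vis, vision_targets[vision_mask], dim=-1)
        vision_loss = (1.0 - cos).mean()              # cosine-similarity regression loss
        return text_loss + vision_loss                # equal weighting assumed in this sketch
```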
VPiT Training Process
Training Data Types
Broad range of multimodal data in instruction format
Data Categories

Three major data categories, each formatted as instruction-tuning pairs (a serialization sketch follows this list):

  • Visual Understanding Data:
    • ImageQA: Cambrian-7M collection of instruction-tuning datasets
    • VideoQA: VideoStar and ShareVideo, processed at 1 FPS
    • Format Example:
      • Prompt: {⟨visual_tokens⟩, ⟨text prompt⟩}
      • Response: {⟨text response⟩}
  • Visual Generation Data:
    • Up to 5M image-text pairs from MetaCLIP pipeline
    • Curated into instruction format like "Generate an image of X"
    • Format Example:
      • Prompt: {⟨text prompt⟩}
      • Response: {"Here is an image based on your request", ⟨image_tokens⟩}
  • Other Visual Data:
    • Video Data:
      • HowTo100M and SomethingSomethingV2
      • Tasks: frame prediction, sequence completion, temporal reasoning
    • Visual Thinking Data:
      • Curated from Visual CoT and VStar
      • Includes visual generation steps within the reasoning, such as zoom-in views, before answering
    • Image-to-Image Data:
      • InstructPix2Pix and Aurora datasets
      • For conditioned image transformation tasks
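
As a concrete illustration of the formats above, here is a small Python sketch of how understanding and generation samples might be serialized into prompt/response pairs. The field names and special-token strings are assumptions for illustration, not the exact preprocessing code.

```python
# Illustrative serialization of the data categories above into instruction pairs.
IMAGE_START, IMAGE_END = "<image_start>", "<image_end>"

def understanding_sample(visual_tokens_placeholder, question, answer):
    # ImageQA / VideoQA: visual tokens appear in the prompt; only the text response is supervised.
    return {
        "prompt": f"{IMAGE_START}{visual_tokens_placeholder}{IMAGE_END} {question}",
        "response": answer,
    }

def generation_sample(caption, visual_tokens_placeholder):
    # Text-to-image: visual tokens appear in the response and are supervised via the vision head.
    return {
        "prompt": f"Generate an image of {caption}",
        "response": f"Here is an image based on your request: {IMAGE_START}{visual_tokens_placeholder}{IMAGE_END}",
    }

print(understanding_sample("[64 visual tokens]", "What animal is in the picture?", "A monarch butterfly."))
print(generation_sample("an oncilla", "[64 visual tokens]"))
```

In both cases, loss is computed only on the response side, consistent with the instruction-tuning setup described earlier.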
Visual Token Visualization
Diffusion-based approach for visualizing predicted tokens
Token Visualization

The visualization module works as follows during training and inference:

  • Training Stage:
    • Uses the concept of a "diffusion autoencoder": the diffusion model is conditioned on image embeddings
    • Finetunes an existing diffusion model (e.g., Stable Diffusion 1.5)
    • Training Configuration:
      • 2-layer MLP projector matches SigLIP embedding to cross-attention dimension
      • Freezes VAE encoder and SigLIP encoder during training
      • Uses held-out training data from MetaCLIP pipeline
      • Standard latent diffusion training
  • Inference Pipeline (see the sketch below):
    • Step 1: VPiT Model Prediction
      • Model processes input sequence (text/images)
      • Generates continuous visual tokens via vision head
    • Step 2: Token Visualization
      • Predicted tokens fed into finetuned diffusion model
      • Diffusion model conditions on these embeddings
      • Generates final pixel-space visualization
Our goal is to use visualization as a tool to explore and analyze the properties of models trained with VPiT, rather than compete with SOTA high-fidelity image generation models.
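
As a rough illustration of the two-step inference pipeline, the sketch below assumes a diffusers-style Stable Diffusion 1.5 pipeline whose UNet cross-attention has been finetuned to accept projected visual tokens; generate_multimodal, projector, and the other names are hypothetical placeholders rather than actual MetaMorph APIs.

```python
import torch

@torch.no_grad()
def visualize_prediction(vpit_model, projector, diffusion_pipe, prompt):
    # Step 1: the VPiT-trained model autoregressively produces text tokens and,
    # between <image_start>/<image_end>, continuous visual tokens from its vision head.
    text, visual_tokens = vpit_model.generate_multimodal(prompt)   # visual_tokens: (64, vision_dim)

    # Step 2: project the predicted tokens to the diffusion model's cross-attention
    # width and use them as conditioning in place of text-encoder embeddings.
    cond = projector(visual_tokens).unsqueeze(0)                   # (1, 64, cross_attn_dim)
    image = diffusion_pipe(prompt_embeds=cond, num_inference_steps=50).images[0]
    return text, image
```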

MetaMorph Model

A unified model that demonstrates true multimodal capabilities

Competitive Performance
Strong results across understanding and generation benchmarks

MetaMorph demonstrates competitive performance across a wide range of tasks. The model achieves strong results on understanding and generation benchmarks, despite using only 64 tokens for each image. The table below summarizes the performance of MetaMorph on various benchmarks.

Image QA benchmarks: MMBenchEN through TextVQA; Video QA: MV-Bench; Generation: FID (lower is better).

| Method | Base LLM | MMBenchEN | SEED | RealWorldQA | MMVP | SQA | MMMU | VStar | ChartQA | TextVQA | MV-Bench | FID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4V* | - | 75.8 | 69.1 | 61.4 | 50.0 | 75.7 | 56.8 | 55.0 | 78.5 | 78.0 | 43.5 | - |
| T2I Models | | | | | | | | | | | | |
| Stable Diffusion 1.5* | - | - | - | - | - | - | - | - | - | - | - | 9.6 |
| Dalle 2* | - | - | - | - | - | - | - | - | - | - | - | 10.4 |
| Imagen* | - | - | - | - | - | - | - | - | - | - | - | 7.3 |
| Unified Models | | | | | | | | | | | | |
| EMU-3* | - | 58.5 | 68.2 | 57.4 | 36.6† | 89.2 | 31.6 | 51.8† | 68.6 | 64.7 | - | 12.8 |
| Janus* | DeepSeek 1.3B | 69.4 | 63.7 | - | - | - | 30.5 | - | - | - | - | 8.5 |
| VILA-U256† | LLaMA-2 7B | 66.6 | 57.1 | 46.6 | 22.0 | 67.1 | 32.2 | 38.7 | 11.4 | 48.3* | 40.8 | 19.6 |
| Transfusion* | - | - | - | - | - | - | - | - | - | - | - | 6.7 |
| Chameleon-7B† | - | 35.7 | 27.2 | 19.6 | 0.0 | 50.3 | 28.4 | 37.1 | 0.0 | 0.0 | - | 26.7* |
| Ours | LLaMA-3.1 8B | 75.2 | 71.8 | 58.3 | 48.3 | 83.2 | 41.8 | 44.0 | 37.1 | 60.5 | 48.8 | 11.8 |
Table 1: Comparison with state-of-the-art models. (*): Numbers from original papers. (†): Results reproduced using released weights.
LLM Knowledge Leverage
Successfully utilizes pretrained LLM knowledge and capabilities for visual generation

MetaMorph demonstrates the ability to leverage the knowledge and capabilities of the pretrained language model for visual generation tasks:

  • It can generate accurate visual representations for highly specialized concepts like "Chhogori" (the second highest mountain in the world), "Oncilla" (a small wild cat native to South America), and "Chizarira" (a remote national park in Zimbabwe). This shows MetaMorph is tapping into the broad world knowledge embedded in the LLM to inform its visual generation.
  • The model exhibits a nuanced understanding of complex semantics, correctly visualizing prompts involving negation ("a glass without water"). This semantic knowledge comes from the language understanding capabilities of the LLM.
  • By transferring knowledge from the language domain to the visual domain, MetaMorph can generate meaningful visuals for a much wider range of concepts compared to traditional text-to-image models that rely on more limited text encoders. The LLM's encyclopedic knowledge becomes a rich source of visual understanding.

These examples provide strong evidence that unifying language and vision in a single model allows visual generation to be guided by the linguistic reasoning and background knowledge of language models.

Chhogori example

Prompt:

Generate an image of Chhogori

Explanation:

Chhogori, also known as K2, is the second-highest mountain in the world.

Oncilla example

Prompt:

Generate an image of an Oncilla

Explanation:

The oncilla, also known as the little spotted cat, is a small and elusive wild feline native to Central and South America.

Chizarira example

Prompt:

Generate an image of the view of Chizarira

Explanation:

Chizarira is a remote and rugged national park in Zimbabwe, known for its dramatic escarpments, diverse wildlife, and pristine wilderness ideal for off-the-beaten-path safaris.

Empty Glass example

Prompt:

Generate an image of a glass without water

Explanation:

MetaMorph successfully generates an empty glass, distinguishing it from a full glass of water.

Filled Glass example

Prompt:

Generate an image of a glass filled with water

Explanation:

MetaMorph successfully generates a glass filled with water instead of an empty glass.

Multimodal Reasoning Capabilities
Demonstrates implicit reasoning in multimodal generation tasks

MetaMorph also demonstrates "reasoning capabilities" in multimodal tasks that go beyond simple text-to-image mappings. In "Physics of LLM", the authors observe that LLMs precompute reasoning graphs before generating any future tokens. Here we show that the same phenomenon appears in multimodal generation:

  • The model can break down complex, multi-step prompts and solve them implicitly to generate the correct image. For example, given the prompt "Generate an image of the animal resulting from a monarch caterpillar's metamorphosis", MetaMorph reasons that a monarch caterpillar undergoes metamorphosis to become a butterfly and generates an image of a butterfly without being explicitly told each step.
  • Importantly, MetaMorph performs this multi-step reasoning without explicit chaining or intermediate prompts. The reasoning is fully internal to the model.

These reasoning capabilities go far beyond traditional text-to-image models. By leveraging the language understanding and composition abilities of the LLM, MetaMorph can interpret and solve complex problems that require multiple steps of reasoning, demonstrating a deeper, more flexible multimodal model.

"Generate an image of the animal resulting from a monarch caterpillar's metamorphosis"
Monarch Butterfly
Model's Implicit Reasoning Process
1
2
3
Initial Understanding
Identifies the starting point: monarch caterpillar
"Generate an image of the national flag of the country where Yellowstone National Park is located"
American Flag
Model's Implicit Reasoning Process
1
2
3
Geographic Knowledge
Locates Yellowstone National Park
Generate an image of the flower celebrated in spring festivals in the country where sushi originated
Cherry Blossom
Model's Implicit Reasoning Process
1
2
3
Geographic Knowledge
Identifies the country where sushi originated: Japan.
Generate an image of the pet animal whose name is a rearrangement of the letters in the word 'tca'
Cat
Model's Implicit Reasoning Process
1
2
3
Lexical Knowledge
Rearranges the letters in the word 'tca' to form 'cat'.