Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs


Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities of recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings.

Searching for Visual Mistakes in CLIP and MLLM models

To understand the visual shortcomings of multimodal LLMs, we delve into the visual encoder (CLIP models). We find ambiguities in CLIP embeddings via "CLIP-blind pairs": images that are visually different yet encoded similarly by CLIP models.

MMVP Framework
We start by finding CLIP-blind pairs: images that have similar CLIP embeddings but different DINOv2 embeddings. We manually inspect the differences between the two images in each pair and formulate questions based on those differences. We then ask MLLMs the question for each image in the CLIP-blind pair. The model receives a score only when both questions for the CLIP-blind pair are answered correctly.
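The pair-mining step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the CLIP and DINOv2 embeddings have already been computed as row-aligned arrays, that similarity is measured with cosine similarity, and that the two thresholds (`clip_min`, `dino_max`) are tunable parameters whose default values here are assumptions. The function names are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T


def find_clip_blind_pairs(clip_emb, dino_emb, clip_min=0.95, dino_max=0.6):
    """Return index pairs (i, j), i < j, that CLIP sees as similar but DINOv2 does not.

    Row i of `clip_emb` and row i of `dino_emb` must describe the same image.
    The default thresholds are illustrative assumptions.
    """
    clip_sim = cosine_similarity_matrix(np.asarray(clip_emb, dtype=float))
    dino_sim = cosine_similarity_matrix(np.asarray(dino_emb, dtype=float))
    mask = (clip_sim > clip_min) & (dino_sim < dino_max)
    # Keep only the strict upper triangle so each pair appears once
    # and an image is never paired with itself.
    rows, cols = np.where(np.triu(mask, k=1))
    return [(int(i), int(j)) for i, j in zip(rows, cols)]


# Toy embeddings: images 0 and 1 are close for CLIP but orthogonal for DINOv2.
toy_clip = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
toy_dino = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
toy_pairs = find_clip_blind_pairs(toy_clip, toy_dino)
```

Candidate pairs found this way would still need the manual inspection described above before they become benchmark questions.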

We assess the questions on SOTA open-source models (LLaVA-1.5, InstructBLIP, Mini-GPT4) and closed-source models (GPT-4V, Gemini, Bard). We also evaluate human performance through user studies. There is a significant performance gap between humans and MLLMs, even though the latter often demonstrate impressive results. All models except GPT-4V and Gemini score below the random-guess level (25%). Even the most advanced models, GPT-4V and Gemini, struggle with basic visual grounding questions.
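The pair-level scoring rule and the 25% random-guess level can be illustrated with a short sketch. It assumes each question offers two answer options, which is consistent with the 25% figure (0.5 × 0.5) but is an assumption here; the function names are hypothetical.

```python
def pair_accuracy(results):
    """Fraction of pairs for which BOTH questions were answered correctly.

    `results` is a list of (first_correct, second_correct) booleans,
    one tuple per CLIP-blind pair.
    """
    if not results:
        raise ValueError("no pairs to score")
    return sum(1 for first, second in results if first and second) / len(results)


def random_guess_pair_accuracy(num_options=2):
    """Chance of getting both questions in a pair right by uniform guessing.

    `num_options=2` is an assumption about the question format.
    """
    return (1.0 / num_options) ** 2


# Four pairs; only the first and last have both answers correct.
example_results = [(True, True), (True, False), (False, False), (True, True)]
example_score = pair_accuracy(example_results)
```

Because a pair only counts when both answers are right, a model that answers one image of each pair correctly and the other incorrectly scores zero, which is how a model can fall below the random-guess level.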

There is a huge gap between human performance and MLLMs' performance on the simple visual questions in the MMVP benchmark.

Visual Patterns that challenge CLIP models

Having identified the CLIP-blind pairs, we summarize systematic visual patterns that the CLIP vision encoders might consistently misinterpret. We turn to the questions and options from the MMVP benchmark. With these questions, we transform abstract visual patterns in images into clearer, language-based descriptors that are easier to categorize. We identify 9 visual patterns:

Orientation and Direction
Presence of Specific Features
State and Condition
Quantity and Count
Color and Appearance
Positional and Relational Context
Structural and Physical Characteristics
Texts
Viewpoint and Perspective

Scaling Up CLIP Doesn't Help Visual Patterns

CLIP models have developed and scaled up over the years. We evaluate MMVP on a variety of CLIP models. These models vary in aspects like size, training data, and methodology. As evidenced in the table, increasing network size and training data only aids in identifying two visual patterns: “color and appearance” and “state and condition”. The rest of the visual patterns continue to challenge all CLIP-based models.

Models scaled up in resolution show minimal improvement, whereas a slight advantage is observed when scaling up the network.

CLIP Mistakes and MLLM Mistakes Are Correlated

We plot CLIP’s performance and MLLMs' performance for each visual pattern. When the CLIP vision encoder underperforms on a certain visual pattern, the MLLM tends to exhibit similar shortcomings. Open-source models such as LLaVA-1.5 and InstructBLIP that explicitly use the CLIP vision encoder display a strong correlation in performance.

If CLIP performs poorly on a visual pattern such as "orientation", MLLMs also underperform on that visual pattern.
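One way to quantify this relationship is a correlation coefficient over per-pattern scores. The sketch below uses the Pearson coefficient, which is an assumption about the measure; the function name is hypothetical, and the numbers in the usage lines are made-up placeholders, not results from the benchmark.

```python
import numpy as np


def pattern_correlation(clip_scores, mllm_scores):
    """Pearson correlation between per-pattern CLIP and MLLM accuracies.

    Both inputs hold one score per visual pattern, in the same pattern order.
    """
    clip = np.asarray(clip_scores, dtype=float)
    mllm = np.asarray(mllm_scores, dtype=float)
    if clip.ndim != 1 or clip.shape != mllm.shape or clip.size < 2:
        raise ValueError("need two equal-length 1-D score lists with at least 2 entries")
    return float(np.corrcoef(clip, mllm)[0, 1])


# Placeholder scores only (NOT measured values): a perfectly linear
# relationship and a perfectly inverted one.
r_aligned = pattern_correlation([10.0, 20.0, 30.0], [20.0, 40.0, 60.0])
r_inverted = pattern_correlation([10.0, 20.0, 30.0], [30.0, 20.0, 10.0])
```

A value near 1 would indicate that an MLLM is weak on the same patterns where its CLIP encoder is weak.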

Mixture-of-Features (MoF) MLLM

If open-source MLLMs' visual shortcomings come from the CLIP vision encoder, how do we build a more competent visual encoder? We take initial steps to answer the question by studying Mixture-of-Features (MoF), which mixes vision-only SSL (DINOv2) features and CLIP features.

Different Mixture-of-Features (MoF) strategies in MLLMs. Left: Standard MLLM that uses CLIP as an off-the-shelf pretrained vision encoder; Middle: Additive-MoF (A-MoF) MLLM: linearly mixing CLIP and DINOv2 features before the adapter; Right: Interleaved-MoF (I-MoF) MLLM: spatially interleaving CLIP visual tokens and DINOv2 visual tokens after the adapter.

Vision-Only SSL features: Better Vision, Worse Language

We add a pretrained DINOv2 encoder to the MLLM and linearly mix its features with those of the pretrained CLIP encoder. Our study reveals that:

  1. As the proportion of DINOv2 features increases, the MLLM exhibits a decline in its instruction-following capability. Notably, there is a sharp decrease when the DINOv2 proportion reaches 87.5%.
  2. A higher proportion of DINOv2 features enhances the model’s visual grounding capability, but this advantage diminishes when the DINOv2 proportion surpasses 75%, at which point instruction-following is notably impaired.
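The linear mixing studied above can be sketched as a convex combination of the two feature maps. This is a minimal illustration, not the authors' implementation: it assumes the CLIP and DINOv2 features have already been brought to the same shape (same number of tokens and channels), and the name `additive_mof` and the parameter `dino_ratio` are hypothetical.

```python
import numpy as np


def additive_mof(clip_feats, dino_feats, dino_ratio):
    """Linearly mix CLIP and DINOv2 features.

    `dino_ratio` is the DINOv2 proportion in [0, 1]; 0 gives pure CLIP
    features and 1 gives pure DINOv2 features.
    """
    if not 0.0 <= dino_ratio <= 1.0:
        raise ValueError("dino_ratio must lie in [0, 1]")
    clip = np.asarray(clip_feats, dtype=float)
    dino = np.asarray(dino_feats, dtype=float)
    if clip.shape != dino.shape:
        raise ValueError("CLIP and DINOv2 features must share a shape (assumption)")
    return dino_ratio * dino + (1.0 - dino_ratio) * clip


# Toy features: 2 tokens with 3 channels each.
toy_clip_feats = np.ones((2, 3))
toy_dino_feats = np.zeros((2, 3))
mixed = additive_mof(toy_clip_feats, toy_dino_feats, dino_ratio=0.75)
```

Sweeping `dino_ratio` over values such as 0.75 and 0.875 corresponds to the proportions discussed in the two observations above.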

Interleaved-MoF: Combining advantages from CLIP and DINOv2 features

We propose Interleaved-MoF to leverage advantages from both CLIP and DINOv2 embeddings to enhance image representation. We take the processed features from CLIP and DINOv2 and interleave them while maintaining their original spatial order. Interleaved-MoF significantly enhances visual grounding, with a 10.7% increase observed in MMVP, without compromising the model’s ability to follow instructions. This experiment is replicated with the LLaVA-1.5 setting and under various image resolution settings, yielding similar enhancements in performance.

Interleaved-MoF improves visual grounding while maintaining the same level of instruction-following ability.
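The interleaving step can be sketched as alternating tokens from the two encoders so that tokens for the same spatial position stay adjacent. This is a minimal illustration under stated assumptions: both token sequences have already passed through their adapters, they have the same length and width, and the CLIP token is placed before the DINOv2 token at each position (the ordering is an assumption). The function name is hypothetical.

```python
import numpy as np


def interleave_mof(clip_tokens, dino_tokens):
    """Interleave two (num_tokens, dim) sequences into one (2 * num_tokens, dim) sequence.

    Output order: clip[0], dino[0], clip[1], dino[1], ...
    so the original spatial order of each sequence is preserved.
    """
    clip = np.asarray(clip_tokens)
    dino = np.asarray(dino_tokens)
    if clip.ndim != 2 or clip.shape != dino.shape:
        raise ValueError("expected two (num_tokens, dim) arrays of equal shape")
    num_tokens, dim = clip.shape
    out = np.empty((2 * num_tokens, dim), dtype=np.result_type(clip, dino))
    out[0::2] = clip  # even slots hold CLIP tokens
    out[1::2] = dino  # odd slots hold DINOv2 tokens
    return out


# Toy tokens: 2 spatial positions, 2 channels.
toy_clip_tokens = np.array([[1, 1], [2, 2]])
toy_dino_tokens = np.array([[10, 10], [20, 20]])
interleaved = interleave_mof(toy_clip_tokens, toy_dino_tokens)
```

Unlike the additive mix, this keeps both feature sets intact at the cost of doubling the number of visual tokens passed to the language model.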


  title={Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs},
  author={Shengbang Tong and Zhuang Liu and Yuexiang Zhai and Yi Ma and Yann LeCun and Saining Xie}