Peter Tong

Hi, I am Peter Tong; I also go by Shengbang Tong (童晟邦). I am a second-year PhD student in Computer Science at NYU Courant, advised by Professor Yann LeCun and Professor Saining Xie. I recently graduated from UC Berkeley with a triple major in Computer Science, Applied Mathematics (Honors), and Statistics (Honors). I am from Nanjing, China, and Melbourne, Australia.


Research

I am a second-year CS PhD student at NYU Courant, advised by Prof. Yann LeCun and Prof. Saining Xie. Before that, I graduated from UC Berkeley with a triple major and was a researcher in the Berkeley Artificial Intelligence Research (BAIR) lab, advised by Prof. Yi Ma and Prof. Jacob Steinhardt. I am interested in world models, unsupervised/self-supervised learning, generative models, and multimodal models. I would like to thank all my mentors, Yubei, Xili, and Erik, and my collaborators for the incredible journey I had during my undergrad.

Publications

MetaMorph

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Technical Report

Visual understanding and visual generation are mutually beneficial in unified models, but visual understanding data is much more effective than visual generation data. Capabilities of LLMs, such as implicit reasoning, can also transfer to unified models!

Cambrian

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

NeurIPS 2024 Oral

We provide a vision-centric exploration, or cookbook, for MLLMs, systematically studying visual representations, vision-language connectors, instruction tuning data, training recipes, and evaluation protocols. We propose new vision-centric benchmarks, a spatially-aware connector, and a pipeline for collecting and curating instruction data, and we release highly competitive 8B, 13B, and 34B models on par with GPT-4V and Gemini.

MMVP

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

CVPR 2024 Oral

Is vision good enough for language? Our research reveals that the visual capabilities of recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. We identify 'CLIP-blind pairs': images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark.
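As a rough illustration (a minimal sketch, not the exact recipe from the paper), CLIP-blind pairs can be mined by contrasting CLIP image similarity with a vision-only encoder such as DINOv2: pairs that CLIP embeds as nearly identical but that DINOv2 separates are candidates. The model checkpoints and thresholds below are illustrative assumptions.

```python
# Hedged sketch: mine candidate "CLIP-blind pairs" by comparing CLIP and
# DINOv2 image-embedding similarities. Checkpoints and thresholds are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
dino = AutoModel.from_pretrained("facebook/dinov2-base").eval()
dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

@torch.no_grad()
def embed(paths):
    """Return L2-normalized CLIP and DINOv2 embeddings for a list of image paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    clip_emb = clip.get_image_features(**clip_proc(images=images, return_tensors="pt"))
    # DINOv2: use the CLS token of the last hidden state as the image embedding.
    dino_emb = dino(**dino_proc(images=images, return_tensors="pt")).last_hidden_state[:, 0]
    return F.normalize(clip_emb, dim=-1), F.normalize(dino_emb, dim=-1)

def clip_blind_pairs(paths, clip_thresh=0.95, dino_thresh=0.6):
    """Pairs that look alike to CLIP but not to a vision-only encoder."""
    c, d = embed(paths)
    clip_sim, dino_sim = c @ c.T, d @ d.T  # pairwise cosine similarities
    return [
        (paths[i], paths[j])
        for i in range(len(paths))
        for j in range(i + 1, len(paths))
        if clip_sim[i, j] > clip_thresh and dino_sim[i, j] < dino_thresh
    ]
```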

MultiMon

Mass-Producing Failures of Multimodal Systems with Language Models

Shengbang Tong*, Erik Jones*, Jacob Steinhardt
NeurIPS 2023

Deployed multimodal systems can fail in ways that evaluators did not anticipate. To find these failures before deployment, we introduce MultiMon, a system that automatically identifies systematic failures.

© 2025 Peter Tong. Last updated: March 2025.