Hi, I am Peter Tong; I also go by the name Shengbang Tong (童晟邦). I am a second-year PhD student in computer science at NYU Courant, advised by Professor Yann LeCun and Professor Saining Xie. I recently graduated from UC Berkeley with a triple major in Computer Science, Applied Mathematics (Honors), and Statistics (Honors). I am from Nanjing, China and Melbourne, Australia.
At Berkeley, I was a researcher in the Berkeley Artificial Intelligence Research (BAIR) Lab, advised by Prof. Yi Ma and Prof. Jacob Steinhardt. I am interested in world models, unsupervised/self-supervised learning, generative models, and multimodal models. I would like to thank all my mentors (Yubei, Xili, Erik) and collaborators for the incredible journey I had in my undergrad.
Visual understanding and visual generation are mutually beneficial in unified models! However, visual understanding data is much more effective than visual generation data. Capabilities of the underlying LLM, such as implicit reasoning, can also transfer to unified models!
We provide a vision-centric exploration, or cookbook, for MLLMs, systematically studying visual representations, vision-language connectors, instruction-tuning data, training recipes, and evaluation protocols. We propose new vision-centric benchmarks, a spatially aware connector, and a pipeline for collecting and curating instruction data, and we release highly competitive 8B, 13B, and 34B models on par with GPT-4V and Gemini.
Is vision good enough for language? Our research reveals that the visual capabilities of recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. We identify "CLIP-blind pairs": images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark.
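The idea behind CLIP-blind pairs can be sketched as follows: flag image pairs whose CLIP embeddings are nearly identical while a vision-only encoder (e.g. DINOv2) still tells them apart. This is a minimal illustration with toy vectors and made-up thresholds, not the exact recipe from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_clip_blind(clip_a, clip_b, vis_a, vis_b,
                  clip_thresh=0.95, vis_thresh=0.6):
    """A pair is 'CLIP-blind' if CLIP sees the images as near-identical
    while a vision-only encoder sees them as clearly different.
    Thresholds here are illustrative assumptions."""
    return (cosine(clip_a, clip_b) > clip_thresh and
            cosine(vis_a, vis_b) < vis_thresh)

# Toy embeddings standing in for real model outputs.
clip_a, clip_b = [0.9, 0.1, 0.0], [0.88, 0.12, 0.01]  # near-identical to CLIP
vis_a, vis_b = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]        # orthogonal to the vision encoder
print(is_clip_blind(clip_a, clip_b, vis_a, vis_b))     # True
```

In practice the embeddings would come from the actual CLIP and vision-only models, and the flagged pairs would then be inspected to write benchmark questions.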
Deployed multimodal systems can fail in ways that evaluators did not anticipate. To surface these failures before deployment, we introduce MULTIMON, a system that automatically identifies systematic failures.