Peter Tong

Hi, I am Peter Tong, and I also go by the name Shengbang Tong (童晟邦). I am a first-year PhD student in NYU Courant CS, advised by Professor Yann LeCun and Professor Saining Xie. I recently graduated from UC Berkeley with a triple major in Computer Science, Applied Mathematics (Honors) and Statistics (Honors). I am from Nanjing, China and Melbourne, Australia.

Email  /  Resume  /  Twitter  /  Google Scholar  /  Github

Research

I recently graduated from UC Berkeley with a triple major. I am a first-year CS PhD student at NYU Courant, advised by Prof. Yann LeCun and Prof. Saining Xie. I was a researcher in the Berkeley Artificial Intelligence Research (BAIR) Lab, advised by Prof. Yi Ma and Prof. Jacob Steinhardt. I am interested in world models, unsupervised/self-supervised learning, generative models and multimodal models. I would like to thank all my mentors (Yubei, Xili and Erik) and collaborators for the incredible journey I had in my undergrad.

News
Publications & Preprints (* means equal contribution)
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Shengbang Tong*, Ellis Brown*, Penghao Wu*, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, Saining Xie
NeurIPS 2024 Oral

We provide a vision-centric exploration, or cookbook, of MLLMs. In other words, we systematically study visual representations, vision-language connectors, instruction tuning data, training recipes and evaluation protocols in MLLMs. We propose new vision-centric benchmarks, a spatial-aware connector, collection and curation of instruction data, and more! We also release very competitive 8B, 13B and 34B models on par with GPT-4V and Gemini.

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie
CVPR 2024 Oral

Is vision good enough for language? Our research reveals that the visual capabilities of recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. We identify "CLIP-blind pairs" - images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities.

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
Yuexiang Zhai, Hao Bai*, Zipeng Lin*, Jiayi Pan*, Shengbang Tong*, Yifei Zhou*, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine
NeurIPS 2024

We can use RL to train MLLMs instead of SFT! Training with RL from environment feedback unlocks the model's decision-making ability, going beyond the limitations of SFT.

Mass-Producing Failures of Multimodal Systems with Language Models
Shengbang Tong*, Erik Jones*, Jacob Steinhardt
NeurIPS 2023

Deployed multimodal systems can fail in ways that evaluators did not anticipate. In order to find these failures before deployment, we introduce MULTIMON, a system that automatically identifies systematic failures.

EMP-SSL: Towards Self-Supervised Learning in One Epoch
Shengbang Tong*, Yubei Chen*, Yi Ma, Yann LeCun
Under Review

Inspired by a newly proposed principle, our work proposes a minimalist method for self-supervised learning that tremendously reduces the number of epochs SSL methods need to converge.

Investigating the Catastrophic Forgetting in Multimodal Large Language Models
Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma
CPAL 2024

TLDR: Fine-tuning multimodal large language models (MLLMs) leads to catastrophic forgetting.

Emergence of Segmentation with Minimalistic White-Box Transformers
Yaodong Yu*, Tianzhe Chu*, Shengbang Tong, Ziyang Wu, Druv Pai, Sam Buchanan, Yi Ma
CPAL 2024

Through extensive experimental results, we demonstrate that when employing a white-box transformer-like architecture known as CRATE, whose design explicitly models and pursues low-dimensional structures in the data distribution, segmentation properties, at both the whole and parts levels, already emerge with a minimalistic supervised training recipe. Layer-wise finer-grained analysis reveals that the emergent properties strongly corroborate the designed mathematical functions of the white-box network.

Image Clustering via the Principle of Rate Reduction in the Age of Pretrained Models
Tianzhe Chu*, Shengbang Tong*, Tianjiao Ding*, Xili Dai, Benjamin Haeffele, Rene Vidal, Yi Ma
ICLR 2024

In this paper, we propose a novel image clustering pipeline that leverages the powerful feature representations of large pre-trained models such as CLIP and clusters images effectively and efficiently at scale. We show that the pre-trained features become significantly more structured when further optimized with the rate reduction objective.

White-Box Transformers via Sparse Rate Reduction
Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Benjamin Haeffele, Yi Ma
NeurIPS 2023

In this paper, we contend that the objective of representation learning is to compress and transform the distribution of the data towards a mixture of low-dimensional Gaussian distributions supported on incoherent subspaces. From this perspective, popular deep networks such as transformers can be naturally viewed as realizing iterative schemes to optimize this objective incrementally.

Unsupervised Learning of Structured Representation via Closed-Loop Transcription
Shengbang Tong*, Xili Dai*, Yubei Chen, Mingyang Li, Zengyi Li, Brent Yi, Yann LeCun, Yi Ma
CPAL 2024

This paper proposes a new unsupervised method to learn a structured representation that may serve both discriminative and generative purposes.

Closed-Loop Transcription Via Convolutional Sparse Coding
Xili Dai, Ke Chen, Shengbang Tong, Jingyuan Zhang, Xingjian Gao, Mingyang Li, Druv Pai, Yuexiang Zhai, Xiaojun Yuan, Heung Yeung Shum, Lionel M. Ni, Yi Ma
CPAL 2024

This paper explores the natural inverse in Convolutional Sparse Coding neural networks and its application in generative models.

Unsupervised Manifold Linearizing and Clustering
Tianjiao Ding, Shengbang Tong, Kwan Ho Ryan Chan, Xili Dai, Yi Ma, Benjamin David Haeffele
ICCV 2023

This paper proposes a new unsupervised method to learn a representation and cluster real-world datasets such as CIFAR-10, CIFAR-100 and Tiny-ImageNet-200.

Revisiting Sparse Convolutional Model for Visual Recognition
Xili Dai*, Mingyang Li*, Pengyuan Zhai, Shengbang Tong, Xingjian Gao, Shaolun Huang, Zhihui Zhu, Chong You, Yi Ma
NeurIPS 2022

Our method uses differentiable optimization layers that are defined from convolutional sparse coding as drop-in replacements of standard convolutional layers in conventional deep neural networks. We show that such models have equally strong empirical performance on CIFAR-10, CIFAR-100 and ImageNet datasets when compared to conventional neural networks.

Incremental Learning of Structured Memory via Closed-Loop Transcription
Shengbang Tong, Xili Dai, Ziyang Wu, Mingyang Li, Brent Yi, Yi Ma
ICLR 2023

We propose a minimal computational model for learning a structured memory of multiple object classes in an incremental setting.

Closed-Loop Data Transcription to an LDR via Minimaxing Rate Reduction
Xili Dai*, Shengbang Tong*, Mingyang Li*, Ziyang Wu*, Kwan Ho Ryan Chan, Pengyuan Zhai, Yaodong Yu, Michael Psenka, Xiaojun Yuan, Heung Yeung Shum, Yi Ma
Entropy Journal

We propose a new computational framework for learning an explicit generative model for real-world datasets.