Peter Tong
Hi, I am Peter Tong (Shengbang Tong, 童晟邦). I am a first-year PhD student in Computer Science at NYU Courant, advised by Professor Yann LeCun and Professor Saining Xie. I recently graduated from UC Berkeley with a triple major in Computer Science, Applied Mathematics (Honors), and Statistics (Honors). I am from Nanjing, China, and Melbourne, Australia.
Email / Resume / Twitter / Google Scholar / Github
Research
I recently graduated from UC Berkeley and am now a first-year CS PhD student at NYU Courant, advised by Prof. Yann LeCun and Prof. Saining Xie. Previously, I was a researcher in the Berkeley Artificial Intelligence Research (BAIR) Lab, advised by Prof. Yi Ma and Prof. Jacob Steinhardt. I am interested in world models, unsupervised/self-supervised learning, generative models, and multimodal models. I would like to thank all my mentors, Yubei, Xili, and Erik, and my collaborators for the incredible journey I had in my undergrad.
Publications & Preprints (* denotes equal contribution)
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Shengbang Tong*,
Ellis Brown*,
Penghao Wu*,
Sanghyun Woo,
Manoj Middepogu,
Sai Charitha Akula,
Jihan Yang,
Shusheng Yang,
Adithya Iyer,
Xichen Pan,
Austin Wang,
Rob Fergus,
Yann LeCun,
Saining Xie
NeurIPS 2024 Oral
We provide a vision-centric exploration of, and a cookbook for, multimodal LLMs (MLLMs). In other words, we systematically study visual representations, vision-language connectors, instruction-tuning data, training recipes, and evaluation protocols in MLLMs. We propose new vision-centric benchmarks, a spatially-aware connector, a pipeline for collecting and curating instruction data, and more. We also release competitive 8B, 13B, and 34B models on par with GPT-4V and Gemini.
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie
CVPR 2024 Oral
Is vision good enough for language? Our research reveals that the visual capabilities of recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. We identify "CLIP-blind pairs": images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities.
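As a rough illustration of how such pairs can be mined, the sketch below flags image pairs that CLIP embeds as near-duplicates while a vision-only self-supervised model (e.g. DINOv2) clearly separates them. This is a hypothetical sketch, not the paper's exact procedure: the function name, thresholds, and precomputed embedding arrays are all illustrative assumptions.

```python
import numpy as np

def find_clip_blind_pairs(clip_emb, ssl_emb, clip_min=0.95, ssl_max=0.6):
    """Flag pairs (i, j) that CLIP sees as near-identical but a vision-only
    SSL model separates. clip_emb, ssl_emb: (N, D) row-aligned embeddings."""
    # L2-normalize so dot products become cosine similarities.
    clip_emb = clip_emb / np.linalg.norm(clip_emb, axis=1, keepdims=True)
    ssl_emb = ssl_emb / np.linalg.norm(ssl_emb, axis=1, keepdims=True)
    clip_sim = clip_emb @ clip_emb.T
    ssl_sim = ssl_emb @ ssl_emb.T
    n = clip_emb.shape[0]
    # High CLIP similarity + low SSL similarity = candidate CLIP-blind pair.
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if clip_sim[i, j] > clip_min and ssl_sim[i, j] < ssl_max]
```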
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
Yuexiang Zhai, Hao Bai*, Zipeng Lin*, Jiayi Pan*, Shengbang Tong*, Yifei Zhou*, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine
NeurIPS 2024
We can use RL, rather than SFT alone, to train MLLMs! Training on environment feedback with RL unlocks the model's decision-making abilities, going beyond the limitations of supervised fine-tuning.
Mass-Producing Failures of Multimodal Systems with Language Models
Shengbang Tong*, Erik Jones*, Jacob Steinhardt
NeurIPS 2023
Deployed multimodal systems can fail in ways that evaluators did not anticipate. In order to find these failures before deployment, we introduce MULTIMON, a system that automatically identifies systematic failures.
EMP-SSL: Towards Self-Supervised Learning in One Epoch
Shengbang Tong*, Yubei Chen*, Yi Ma, Yann LeCun
Under Review
Inspired by a recently proposed principle, our work presents a minimalist method for self-supervised learning that drastically reduces the number of epochs SSL methods need to converge.
Investigating the Catastrophic Forgetting in Multimodal Large Language Models
Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma
CPAL 2024
TL;DR: Fine-tuning multimodal large language models (MLLMs) leads to catastrophic forgetting.
Emergence of Segmentation with Minimalistic White-Box Transformers
Yaodong Yu*, Tianzhe Chu*, Shengbang Tong, Ziyang Wu, Druv Pai, Sam Buchanan, Yi Ma
CPAL 2024
Through extensive experimental results, we demonstrate that when employing a white-box transformer-like architecture known as CRATE, whose design explicitly models and pursues low-dimensional structures in the data distribution, segmentation properties, at both the whole and parts levels, already emerge with a minimalistic supervised training recipe. Layer-wise finer-grained analysis reveals that the emergent properties strongly corroborate the designed mathematical functions of the white-box network.
Image Clustering via the Principle of Rate Reduction in the Age of Pretrained Models
Tianzhe Chu*, Shengbang Tong*, Tianjiao Ding*, Xili Dai, Benjamin Haeffele, Rene Vidal, Yi Ma
ICLR 2024
In this paper, we propose a novel image clustering pipeline that leverages the powerful feature representations of large pre-trained models such as CLIP to cluster images effectively and efficiently at scale. We show that the pre-trained features become significantly more structured when further optimized with the rate reduction objective.
White-Box Transformers via Sparse Rate Reduction
Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Benjamin Haeffele, Yi Ma
NeurIPS 2023
In this paper, we contend that the objective of representation learning is to compress and transform the distribution of the data towards a mixture of low-dimensional Gaussian distributions supported on incoherent subspaces. From this perspective, popular deep networks such as transformers can be naturally viewed as realizing iterative schemes to optimize this objective incrementally.
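For reference, the sparse rate reduction objective has roughly the following form (a sketch from memory, up to normalization constants: R is the coding rate of the features Z, R^c the rate with respect to learned subspaces U_[K], and the ℓ0 penalty encourages sparsity):

```latex
\max_{f}\; R(\mathbf{Z}) \;-\; R^{c}\!\big(\mathbf{Z};\, \mathbf{U}_{[K]}\big) \;-\; \lambda\, \lVert \mathbf{Z} \rVert_{0},
\qquad \text{where } R(\mathbf{Z}) = \tfrac{1}{2}\log\det\!\Big(\mathbf{I} + \tfrac{d}{n\varepsilon^{2}}\, \mathbf{Z}\mathbf{Z}^{\top}\Big).
```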
Unsupervised Learning of Structured Representation via Closed-Loop Transcription
Shengbang Tong*, Xili Dai*, Yubei Chen, Mingyang Li, Zengyi Li, Brent Yi, Yann LeCun, Yi Ma
CPAL 2024
This paper proposes a new unsupervised method to learn a structured representation that may serve both discriminative and generative purposes.
Closed-Loop Transcription Via Convolutional Sparse Coding
Xili Dai, Ke Chen, Shengbang Tong, Jingyuan Zhang, Xingjian Gao, Mingyang Li, Druv Pai, Yuexiang Zhai, Xiaojun Yuan, Heung-Yeung Shum, Lionel M. Ni, Yi Ma
CPAL 2024
This paper explores the natural inverse of a convolutional sparse coding network and its application to generative models.
Unsupervised Manifold Linearizing and Clustering
Tianjiao Ding, Shengbang Tong, Kwan Ho Ryan Chan, Xili Dai, Yi Ma, Benjamin David Haeffele
ICCV 2023
This paper proposes a new unsupervised method to learn a representation and clusters for real-world datasets such as CIFAR-10, CIFAR-100, and Tiny-ImageNet-200.
Revisiting Sparse Convolutional Model for Visual Recognition
Xili Dai*, Mingyang Li*, Pengyuan Zhai, Shengbang Tong, Xingjian Gao, Shaolun Huang, Zhihui Zhu, Chong You, Yi Ma
NeurIPS 2022
Our method uses differentiable optimization layers, defined from convolutional sparse coding, as drop-in replacements for standard convolutional layers in conventional deep neural networks; a minimal sketch of such a layer is shown below. We show that such models achieve equally strong empirical performance on the CIFAR-10, CIFAR-100, and ImageNet datasets compared to conventional neural networks.
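For intuition, a differentiable sparse-coding layer of this kind can be written as a few unrolled ISTA iterations. The PyTorch sketch below is a hypothetical minimal illustration, not the paper's implementation: the class name, step size, and sparsity weight are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSCLayer(nn.Module):
    """Hypothetical sketch: a conv layer replaced by a few unrolled ISTA steps
    solving convolutional sparse coding, min_z ||x - D*z||^2 + lam * |z|_1."""
    def __init__(self, in_ch, code_ch, kernel_size=3, n_steps=3, lam=0.1, step=0.1):
        super().__init__()
        # Dictionary D: maps code_ch sparse-code channels back to in_ch image channels.
        self.W = nn.Parameter(0.02 * torch.randn(in_ch, code_ch, kernel_size, kernel_size))
        self.n_steps, self.lam, self.step = n_steps, lam, step
        self.pad = kernel_size // 2

    def forward(self, x):
        z = x.new_zeros(x.size(0), self.W.size(1), x.size(2), x.size(3))
        for _ in range(self.n_steps):
            recon = F.conv2d(z, self.W, padding=self.pad)                   # D z
            grad = F.conv_transpose2d(recon - x, self.W, padding=self.pad)  # D^T (D z - x)
            z = z - self.step * grad                                        # gradient step
            z = torch.sign(z) * torch.clamp(z.abs() - self.step * self.lam, min=0.0)  # soft-threshold
        return z  # sparse codes play the role of the conv layer's output
```

Under these assumptions, `CSCLayer(3, 64)` would stand in where `nn.Conv2d(3, 64, 3, padding=1)` normally appears in a backbone.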
Incremental Learning of Structured Memory via Closed-Loop Transcription
Shengbang Tong, Xili Dai, Ziyang Wu, Mingyang Li, Brent Yi, Yi Ma
ICLR 2023
We propose a minimal computational model for learning a structured memory of multiple object classes in an incremental setting.
Closed-Loop Data Transcription to an LDR via Minimaxing Rate Reduction
Xili Dai*, Shengbang Tong*, Mingyang Li*, Ziyang Wu*, Kwan Ho Ryan Chan, Pengyuan Zhai, Yaodong Yu, Michael Psenka, Xiaojun Yuan, Heung-Yeung Shum, Yi Ma
Entropy
We propose a new computational framework for learning an explicit generative model for real-world datasets.