Shengbang Tong
Authored papers (10)
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
arXiv 2026
From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models
arXiv 2025
Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
arXiv 2025
Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
arXiv 2025
Diffusion Transformers with Representation Autoencoders
arXiv 2025
From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
arXiv 2025
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
arXiv 2024
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
CVPR 2024
Image Clustering via the Principle of Rate Reduction in the Age of Pretrained Models
arXiv 2023
Mass-Producing Failures of Multimodal Systems with Language Models