Saining Xie
- Papers: 40
Authored papers (40)
Solaris: Building a Multiplayer Video World Model in Minecraft
arXiv 2026
Self-Refining Video Sampling
arXiv 2026
Repurposing Geometric Foundation Models for Multi-view Diffusion
arXiv 2026
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
arXiv 2026
REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers
arXiv 2025
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
arXiv 2025
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
arXiv 2025
Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
CVPR 2025 1
Science-T2I: Addressing Scientific Illusions in Image Synthesis
CVPR 2025 1
FrontierCS: Evolving Challenges for Evolving Intelligence
arXiv 2025
What matters for Representation Alignment: Global Information or Spatial Structure?
arXiv 2025
PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop
arXiv 2025
Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs
arXiv 2025
Next-Embedding Prediction Makes Strong Vision Learners
arXiv 2025
Diffusion Transformers with Representation Autoencoders
arXiv 2025
Meta CLIP 2: A Worldwide Scaling Recipe
arXiv 2025
Spatial Mental Modeling from Limited Views
arXiv 2025
Flow Map Distillation Without Data
arXiv 2025
DiffusionGuard: A Robust Defense Against Malicious Diffusion-based Image Editing
arXiv 2024
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
arXiv 2024
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
arXiv 2024
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
arXiv 2024
On Scaling Up 3D Gaussian Splatting Training
arXiv 2024
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
CVPR 2025 1
V-IRL: Grounding Virtual Intelligence in Real Life
arXiv 2024
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
CVPR 2024 1
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
CVPR 2023 1
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
arXiv 2023
Going Denser with Open-Vocabulary Part Segmentation
ICCV 2023 1
CiT: Curation in Training for Effective Vision-Language Data
ICCV 2023 1
Scalable Diffusion Models with Transformers
ICCV 2023 1
A ConvNet for the 2020s
CVPR 2022 1
Masked Autoencoders Are Scalable Vision Learners
CVPR 2022 1
Masked Feature Prediction for Self-Supervised Visual Pre-Training
CVPR 2022 1
SLIP: Self-supervision meets Language-Image Pre-training
arXiv 2021
An Empirical Study of Training Self-Supervised Vision Transformers
ICCV 2021 10
Sample-Efficient Neural Architecture Search by Learning Action Space
arXiv 2019
Decoupling Representation and Classifier for Long-Tailed Recognition
ICLR 2020 1
Aggregated Residual Transformations for Deep Neural Networks
CVPR 2017
Holistically-Nested Edge Detection
ICCV 2015