Saining Xie

Papers
40

Attribution

Affiliations & profile
Semantic Scholar

Authored papers

40

Solaris: Building a Multiplayer Video World Model in Minecraft

arXiv 2026

2026

Self-Refining Video Sampling

arXiv 2026

2026

Repurposing Geometric Foundation Models for Multi-view Diffusion

arXiv 2026

2026

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

arXiv 2026

2026

REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers

arXiv 2025

2025

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

arXiv 2025

2025

LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

arXiv 2025

2025

Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

CVPR 2025

2025

Science-T2I: Addressing Scientific Illusions in Image Synthesis

CVPR 2025

2025

FrontierCS: Evolving Challenges for Evolving Intelligence

arXiv 2025

2025

What matters for Representation Alignment: Global Information or Spatial Structure?

arXiv 2025

2025

PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop

arXiv 2025

2025

Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs

arXiv 2025

2025

Next-Embedding Prediction Makes Strong Vision Learners

arXiv 2025

2025

Diffusion Transformers with Representation Autoencoders

arXiv 2025

2025

Meta CLIP 2: A Worldwide Scaling Recipe

arXiv 2025

2025

Spatial Mental Modeling from Limited Views

arXiv 2025

2025

Flow Map Distillation Without Data

arXiv 2025

2025

DiffusionGuard: A Robust Defense Against Malicious Diffusion-based Image Editing

arXiv 2024

2024

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

arXiv 2024

2024

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

arXiv 2024

2024

SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

arXiv 2024

2024

On Scaling Up 3D Gaussian Splatting Training

arXiv 2024

2024

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

CVPR 2025

2024

V-IRL: Grounding Virtual Intelligence in Real Life

arXiv 2024

2024

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

CVPR 2024

2024

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

CVPR 2023

2023

V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

arXiv 2023

2023

Going Denser with Open-Vocabulary Part Segmentation

ICCV 2023

2023

CiT: Curation in Training for Effective Vision-Language Data

ICCV 2023

2023

Scalable Diffusion Models with Transformers

ICCV 2023

2022

A ConvNet for the 2020s

CVPR 2022

2022

Masked Autoencoders Are Scalable Vision Learners

CVPR 2022

2021

Masked Feature Prediction for Self-Supervised Visual Pre-Training

CVPR 2022

2021

SLIP: Self-supervision meets Language-Image Pre-training

arXiv 2021

2021

An Empirical Study of Training Self-Supervised Vision Transformers

ICCV 2021

2021

Sample-Efficient Neural Architecture Search by Learning Action Space

arXiv 2019

2019

Decoupling Representation and Classifier for Long-Tailed Recognition

ICLR 2020

2019

Aggregated Residual Transformations for Deep Neural Networks

2016

Holistically-Nested Edge Detection

2015

Affiliations

No known affiliations.

Frequent co-authors

10

from 40 papers