Xiaodan Liang

SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

arXiv 2025

SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning

arXiv 2025

CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation

arXiv 2025

Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?

arXiv 2025

ComposeAnyone: Controllable Layout-to-Human Generation with Decoupled Multimodal Conditions

arXiv 2025

TreeRPO: Tree Relative Policy Optimization

arXiv 2025

ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving

arXiv 2024

Realistic and Efficient Face Swapping: A Unified Approach with Diffusion Models

arXiv 2024

DreamFit: Garment-Centric Human Generation via a Lightweight Anything-Dressing Encoder

arXiv 2024

AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation

arXiv 2024

CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models

arXiv 2024

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

arXiv 2024

Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task

arXiv 2024

Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

arXiv 2024

OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

arXiv 2024

HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance

arXiv 2024

PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data

arXiv 2024

OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling

arXiv 2024

MLP Can Be A Good Transformer Learner

CVPR 2024

DriveMM: All-in-One Large Multimodal Model for Autonomous Driving

arXiv 2024

Sitcom-Crafter: A Plot-Driven Human Motion Generation System in 3D Scenes

arXiv 2024

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

CVPR 2025

Learning Interaction-aware 3D Gaussian Splatting for One-shot Hand Avatars

arXiv 2024

FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving

arXiv 2024

Surfer: Progressive Reasoning with World Models for Robotic Manipulation

arXiv 2023

CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation

arXiv 2023

Fashion Matrix: Editing Photos by Just Talking

arXiv 2023

DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning

arXiv 2023

TRIGO: Benchmarking Formal Mathematical Proof Reduction for Generative Language Models

arXiv 2023

AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations

arXiv 2023

Improving Multi-turn Emotional Support Dialogue Generation with Lookahead Strategy Planning

arXiv 2022

UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression

arXiv 2022

Composable Text Controls in Latent Space with ODEs

arXiv 2022

LogicSolver: Towards Interpretable Math Word Problem Solving with Logical Prompt-enhanced Learning

arXiv 2022

GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning

Findings (ACL) 2021

BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search

ICCV 2021

Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning

ACL 2021

UltraPose: Synthesizing Dense Pose with 1 Billion Points by Human-body Decoupling 3D Model

Towards Quantifiable Dialogue Coherence Evaluation

ACL 2021

Don't Take It Literally: An Edit-Invariant Sequence Loss for Text Generation
