Xiangtai Li
- Papers: 56
Authored papers (56)
SAMTok: Representing Any Mask with Two Words
arXiv 2026
Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models
arXiv 2026
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
arXiv 2025
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
arXiv 2025
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
arXiv 2025
BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation
arXiv 2025
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
ICCV 2025
OmniAudio: Generating Spatial Audio from 360-Degree Video
arXiv 2025
CyberV: Cybernetics for Test-time Scaling in Video Understanding
arXiv 2025
DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency
arXiv 2025
Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future
arXiv 2025
DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training
arXiv 2025
Visual Spatial Tuning
arXiv 2025
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
arXiv 2025
From Masks to Worlds: A Hitchhiker's Guide to World Models
arXiv 2025
PairUni: Pairwise Training for Unified Multimodal Language Models
arXiv 2025
DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers
arXiv 2025
PixelThink: Towards Efficient Chain-of-Pixel Reasoning
arXiv 2025
An Empirical Study of GPT-4o Image Generation Capabilities
arXiv 2025
On Path to Multimodal Generalist: General-Level and General-Bench
arXiv 2025
Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models
arXiv 2025
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
arXiv 2025
RecTok: Reconstruction Distillation along Rectified Flow
arXiv 2025
DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World
arXiv 2025
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
arXiv 2025
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
arXiv 2025
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer
ICCV 2025
Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
ICCV 2025
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
arXiv 2025
An Open and Comprehensive Pipeline for Unified Object Grounding and Detection
arXiv 2024
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
CVPR 2025
MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection
arXiv 2024
SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model
CVPR 2025
OMG-Seg: Is One Model Good Enough For All Segmentation?
CVPR 2024
Mamba or RWKV: Exploring High-Quality and High-Efficiency Segment Anything Model
arXiv 2024
RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything
arXiv 2024
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
arXiv 2024
Point Cloud Mamba: Point Cloud Learning via State Space Model
arXiv 2024
EMOv2: Pushing 5M Vision Model Frontier
arXiv 2024
SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified Flow
arXiv 2024
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing
arXiv 2024
RelationBooth: Towards Relation-Aware Customized Object Generation
arXiv 2024
DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries
arXiv 2024
Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
arXiv 2024
Video Prediction Transformers without Recurrence or Convolution
arXiv 2024
Towards Semantic Equivalence of Tokenization in Multimodal LLM
arXiv 2024
GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning
arXiv 2024
EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM
arXiv 2023
Panoptic Video Scene Graph Generation
CVPR 2023
Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation
ICCV 2023
Rethinking Mobile Block for Efficient Attention-based Models
ICCV 2023
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
arXiv 2023
MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation
arXiv 2023
OV-VG: A Benchmark for Open-Vocabulary Visual Grounding
arXiv 2023
Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition
arXiv 2022
Involution: Inverting the Inherence of Convolution for Visual Recognition
CVPR 2021
Frequent co-authors: 10 (from 56 papers)