Xiangtai Li

Papers
56

Attribution

Affiliations & profile
Semantic Scholar

Authored papers

56

SAMTok: Representing Any Mask with Two Words

arXiv 2026

2026

Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models

arXiv 2026

2026

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

arXiv 2025

2026

Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding

arXiv 2025

2025

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

arXiv 2025

2025

BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation

arXiv 2025

2025

The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

ICCV 2025

2025

OmniAudio: Generating Spatial Audio from 360-Degree Video

arXiv 2025

2025

CyberV: Cybernetics for Test-time Scaling in Video Understanding

arXiv 2025

2025

DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency

arXiv 2025

2025

Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

arXiv 2025

2025

DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training

arXiv 2025

2025

Visual Spatial Tuning

arXiv 2025

2025

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

arXiv 2025

2025

From Masks to Worlds: A Hitchhiker's Guide to World Models

arXiv 2025

2025

PairUni: Pairwise Training for Unified Multimodal Language Models

arXiv 2025

2025

DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers

arXiv 2025

2025

PixelThink: Towards Efficient Chain-of-Pixel Reasoning

arXiv 2025

2025

An Empirical Study of GPT-4o Image Generation Capabilities

arXiv 2025

2025

On Path to Multimodal Generalist: General-Level and General-Bench

arXiv 2025

2025

Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models

arXiv 2025

2025

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

arXiv 2025

2025

RecTok: Reconstruction Distillation along Rectified Flow

arXiv 2025

2025

DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World

arXiv 2025

2025

Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

arXiv 2025

2025

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

arXiv 2025

2025

Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer

ICCV 2025

2025

Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs

ICCV 2025

2025

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

arXiv 2025

2025

An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

arXiv 2024

2024

DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

CVPR 2025

2024

MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection

arXiv 2024

2024

SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model

CVPR 2025

2024

OMG-Seg: Is One Model Good Enough For All Segmentation?

CVPR 2024

2024

Mamba or RWKV: Exploring High-Quality and High-Efficiency Segment Anything Model

arXiv 2024

2024

RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything

arXiv 2024

2024

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

arXiv 2024

2024

Point Cloud Mamba: Point Cloud Learning via State Space Model

arXiv 2024

2024

EMOv2: Pushing 5M Vision Model Frontier

arXiv 2024

2024

SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified Flow

arXiv 2024

2024

HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing

arXiv 2024

2024

RelationBooth: Towards Relation-Aware Customized Object Generation

arXiv 2024

2024

DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries

arXiv 2024

2024

Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis

arXiv 2024

2024

Video Prediction Transformers without Recurrence or Convolution

arXiv 2024

2024

Towards Semantic Equivalence of Tokenization in Multimodal LLM

arXiv 2024

2024

GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning

arXiv 2024

2024

EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM

arXiv 2023

2023

Panoptic Video Scene Graph Generation

2023

Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation

ICCV 2023

2023

Rethinking Mobile Block for Efficient Attention-based Models

ICCV 2023

2023

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

arXiv 2023

2023

MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation

arXiv 2023

2023

OV-VG: A Benchmark for Open-Vocabulary Visual Grounding

arXiv 2023

2023

Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition

arXiv 2022

2022

Involution: Inverting the Inherence of Convolution for Visual Recognition

CVPR 2021

2021

Affiliations

No known affiliations.

Frequent co-authors

10

from 56 papers