Xiangtai Li

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

arXiv 2025

BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation

arXiv 2025

The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

ICCV 2025

OmniAudio: Generating Spatial Audio from 360-Degree Video

arXiv 2025

CyberV: Cybernetics for Test-time Scaling in Video Understanding

arXiv 2025

DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency

arXiv 2025

Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

arXiv 2025

DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training

arXiv 2025

Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer

ICCV 2025

Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs

ICCV 2025

PixelThink: Towards Efficient Chain-of-Pixel Reasoning

arXiv 2025

Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models

arXiv 2025

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

arXiv 2025

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

arXiv 2025

RecTok: Reconstruction Distillation along Rectified Flow

arXiv 2025

Visual Spatial Tuning

arXiv 2025

DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World

arXiv 2025

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

arXiv 2025

From Masks to Worlds: A Hitchhiker's Guide to World Models

arXiv 2025

Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

arXiv 2025

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

arXiv 2025

PairUni: Pairwise Training for Unified Multimodal Language Models

arXiv 2025

DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers

arXiv 2025

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

arXiv 2025

An Empirical Study of GPT-4o Image Generation Capabilities

arXiv 2025

On Path to Multimodal Generalist: General-Level and General-Bench

arXiv 2025

An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

arXiv 2024

DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

CVPR 2025

MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection

arXiv 2024

SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model

CVPR 2025

OMG-Seg: Is One Model Good Enough For All Segmentation?

CVPR 2024

Mamba or RWKV: Exploring High-Quality and High-Efficiency Segment Anything Model

arXiv 2024

Point Cloud Mamba: Point Cloud Learning via State Space Model

arXiv 2024

Video Prediction Transformers without Recurrence or Convolution

arXiv 2024

Towards Semantic Equivalence of Tokenization in Multimodal LLM

arXiv 2024

EMOv2: Pushing 5M Vision Model Frontier

arXiv 2024

DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries

arXiv 2024

Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis

arXiv 2024

RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything

arXiv 2024

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

arXiv 2024

SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified Flow

arXiv 2024

HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing

arXiv 2024

GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning

arXiv 2024

RelationBooth: Towards Relation-Aware Customized Object Generation

arXiv 2024

EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM

arXiv 2023

Panoptic Video Scene Graph Generation

MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation

arXiv 2023

Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation

ICCV 2023

Rethinking Mobile Block for Efficient Attention-based Models

ICCV 2023

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

arXiv 2023

OV-VG: A Benchmark for Open-Vocabulary Visual Grounding

arXiv 2023