Xinggang Wang

Papers
47

Attribution (affiliations & profile): Semantic Scholar

Authored papers

47

UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving

arXiv 2026

2026

Mixture-of-Depths Attention

arXiv 2026

2026

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

arXiv 2026

2026

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

2025

DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

arXiv 2025

2025

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

arXiv 2025

2025

AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

arXiv 2025

2025

PixelHacker: Image Inpainting with Structural and Semantic Consistency

2025

MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling

ICCV 2025

2025

GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding

ICCV 2025

2025

4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer

arXiv 2025

2025

Towards Scalable Pre-training of Visual Tokenizers for Generation

arXiv 2025

2025

MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices

arXiv 2025

2025

Visual Generation Tuning

arXiv 2025

2025

Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs

arXiv 2025

2025

Few-step Flow for 3D Generation via Marginal-Data Transport Distillation

arXiv 2025

2025

Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation

arXiv 2025

2025

InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

arXiv 2025

2025

OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models

arXiv 2025

2025

DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving

CVPR 2025 1

2024

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

arXiv 2024

2024

YOLO-World: Real-Time Open-Vocabulary Object Detection

CVPR 2024 1

2024

VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

arXiv 2024

2024

Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

arXiv 2024

2024

GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding

CVPR 2025 1

2024

Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation

CVPR 2025 1

2024

ControlAR: Controllable Image Generation with Autoregressive Models

arXiv 2024

2024

EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

arXiv 2024

2024

ViTGaze: Gaze Following with Interaction Features in Vision Transformers

arXiv 2024

2024

GaraMoSt: Parallel Multi-Granularity Motion and Structural Modeling for Efficient Multi-Frame Interpolation in DSA Images

arXiv 2024

2024

MoSt-DSA: Modeling Motion and Structural Interactions for Direct Multi-Frame Interpolation in DSA Images

arXiv 2024

2024

ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention

arXiv 2024

2024

LKCell: Efficient Cell Nuclei Instance Segmentation with Large Convolution Kernels

arXiv 2024

2024

4D Gaussian Splatting for Real-Time Dynamic Scene Rendering

CVPR 2024 1

2023

TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

ICCV 2023 1

2023

Matte Anything: Interactive Natural Image Matting with Segment Anything Models

arXiv 2023

2023

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

arXiv 2023

2023

SparseTrack: Multi-Object Tracking by Performing Scene Decomposition based on Pseudo-Depth

arXiv 2023

2023

GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models

CVPR 2024 1

2023

ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers

arXiv 2023

2023

TouchStone: Evaluating Vision-Language Models by Language Models

arXiv 2023

2023

ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

arXiv 2023

2023

PD-Quant: Post-Training Quantization based on Prediction Difference Metric

CVPR 2023 1

2022

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

ICCV 2023 1

2022

ByteTrack: Multi-Object Tracking by Associating Every Detection Box

2021

You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection

NeurIPS 2021 12

2021

CCNet: Criss-Cross Attention for Semantic Segmentation

2018

Affiliations

No known affiliations.

Frequent co-authors

10

from 47 papers