Song Han
- Papers: 57
Authored papers
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
arXiv 2026
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
arXiv 2026
Flash-KMeans: Fast and Memory-Efficient Exact K-Means
arXiv 2026
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
arXiv 2026
Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs
arXiv 2026
StreamingVLM: Real-Time Understanding for Infinite Video Streams
arXiv 2025
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
arXiv 2026
Scaling RL to Long Videos
arXiv 2025
SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation
arXiv 2025
XAttention: Block Sparse Attention with Antidiagonal Scoring
arXiv 2025
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
arXiv 2025
Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
arXiv 2025
VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference
arXiv 2025
Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
arXiv 2025
ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference
arXiv 2025
Optimizing Mixture of Block Attention
arXiv 2025
FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos
arXiv 2025
Fast-dLLM v2: Efficient Block-Diffusion LLM
arXiv 2025
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
arXiv 2025
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
arXiv 2025
QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
arXiv 2025
Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
arXiv 2025
DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer
arXiv 2025
SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity
arXiv 2025
Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
arXiv 2025
Scaling Vision Pre-Training to 4K Resolution
CVPR 2025
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
arXiv 2024
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
arXiv 2024
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
arXiv 2024
NVILA: Efficient Frontier Visual Language Models
CVPR 2025
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
arXiv 2024
COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
arXiv 2024
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
arXiv 2024
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
arXiv 2024
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
CVPR 2024
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
arXiv 2024
BitDelta: Your Fine-Tune May Only Be Worth One Bit
arXiv 2024
Wolf: Captioning Everything with a World Summarization Framework
arXiv 2024
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
arXiv 2023
VILA: On Pre-training for Visual Language Models
CVPR 2024
Efficient Streaming Language Models with Attention Sinks
arXiv 2023
FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
arXiv 2023
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
arXiv 2023
Offsite-Tuning: Transfer Learning without Full Model
arXiv 2023
BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
arXiv 2022
TorchSparse: Efficient Point Cloud Inference Engine
arXiv 2022
Lite Pose: Efficient Architecture Design for 2D Human Pose Estimation
CVPR 2022
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
arXiv 2022
Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models
arXiv 2022
TinyTL: Reduce Activations, Not Trainable Parameters for Efficient On-Device Learning
HAT: Hardware-Aware Transformers for Efficient Natural Language Processing
APQ: Joint Search for Network Architecture, Pruning and Quantization Policy
Once-for-All: Train One Network and Specialize it for Efficient Deployment
arXiv 2019
AMC: AutoML for Model Compression and Acceleration on Mobile Devices
Path-Level Network Transformation for Efficient Architecture Search
Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
arXiv 2016
Affiliations
Frequent co-authors
10 from 57 papers