Beidi Chen

Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation

arXiv 2025

HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading

arXiv 2025

Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation

arXiv 2025

APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding

arXiv 2025

Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

arXiv 2024

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

arXiv 2024

LLM Inference Unveiled: Survey and Roofline Model Insights

arXiv 2024

Memory Mosaics

arXiv 2024

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

arXiv 2024

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

arXiv 2024

MagicPIG: LSH Sampling for Efficient LLM Generation

arXiv 2024

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

arXiv 2024

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

arXiv 2024

On the Surprising Effectiveness of Attention Transfer for Vision Transformers

arXiv 2024

Sirius: Contextual Sparsity with Correction for Efficient LLMs

arXiv 2024

LoCoCo: Dropping In Convolutions for Long Context Compression

arXiv 2024

Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild

arXiv 2024

Nearest Neighbor Speculative Decoding for LLM Generation and Attribution

arXiv 2024

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

arXiv 2024

SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

arXiv 2024

Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding

arXiv 2024

Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training

arXiv 2024

Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference

arXiv 2024

It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF

arXiv 2024

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

arXiv 2024

Efficient Streaming Language Models with Attention Sinks

arXiv 2023

H₂O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

arXiv 2023

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

arXiv 2023

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

arXiv 2023