Beidi Chen
- Papers: 36
Authored papers (36)
The Last Human-Written Paper: Agent-Native Research Artifacts
arXiv 2026
AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs
arXiv 2026
STEM: Scaling Transformers with Embedding Modules
arXiv 2026
Kinetics: Rethinking Test-Time Scaling Laws
arXiv 2025
Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation
arXiv 2025
HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
arXiv 2025
Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation
arXiv 2025
APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding
arXiv 2025
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
arXiv 2024
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
arXiv 2024
LLM Inference Unveiled: Survey and Roofline Model Insights
arXiv 2024
Memory Mosaics
arXiv 2024
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
arXiv 2024
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
arXiv 2024
MagicPIG: LSH Sampling for Efficient LLM Generation
arXiv 2024
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
arXiv 2024
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
arXiv 2024
SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
arXiv 2024
Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding
arXiv 2024
Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training
arXiv 2024
Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
arXiv 2024
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
arXiv 2024
On the Surprising Effectiveness of Attention Transfer for Vision Transformers
arXiv 2024
Sirius: Contextual Sparsity with Correction for Efficient LLMs
arXiv 2024
LoCoCo: Dropping In Convolutions for Long Context Compression
arXiv 2024
Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild
arXiv 2024
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
arXiv 2024
It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF
arXiv 2024
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
arXiv 2024
Efficient Streaming Language Models with Attention Sinks
arXiv 2023
H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
arXiv 2023
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
arXiv 2023
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
arXiv 2023
Monarch: Expressive Structured Matrices for Efficient and Accurate Training
arXiv 2022
Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models
Scatterbrain: Unifying Sparse and Low-rank Attention Approximation
Frequent co-authors: 10, from 36 papers