Torsten Hoefler

HALO: Hadamard-Assisted Lower-Precision Optimization for LLMs

arXiv 2025

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

arXiv 2025

Reasoning Language Models: A Blueprint

arXiv 2025

SliceGPT: Compress Large Language Models by Deleting Rows and Columns

arXiv 2024

MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models

arXiv 2024

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

arXiv 2024

Multi-Head RAG: Solving Multi-Aspect Problems with LLMs

arXiv 2024

High Performance Unstructured SpMM Computation Using Tensor Cores

arXiv 2024

All models are wrong, some are useful: Model Selection with Limited Labels

arXiv 2024

SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

arXiv 2023

Graph of Thoughts: Solving Elaborate Problems with Large Language Models

arXiv 2023

QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models

arXiv 2023

Co-design Hardware and Algorithm for Vector Search

arXiv 2023