Martin Jaggi
- Papers: 17
Authored papers
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
arXiv 2025
Gradient-Normalized Smoothness for Optimization with Approximate Hessians
arXiv 2025
Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
arXiv 2025
Benchmarking Optimizers for Large Language Model Pretraining
arXiv 2025
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
arXiv 2024
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
arXiv 2024
MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
arXiv 2023
MultiModN - Multimodal, Multi-Task, Interpretable Modular Networks
arXiv 2023
Layer-wise Linear Mode Connectivity
arXiv 2023
Landmark Attention: Random-Access Infinite Context Length for Transformers
arXiv 2023
Faster Causal Attention Over Large Sequences Through Sparse Flash Attention
arXiv 2023
Multiplication-Free Transformer Training via Piecewise Affine Operations
Learning from History for Byzantine Robust Optimization
arXiv 2020
Evaluating the Search Phase of Neural Architecture Search
ICLR 2020
Model Fusion via Optimal Transport
NeurIPS 2020
PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization
Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features
Frequent co-authors
10 from 17 papers
Matteo Pagliardini
Amirkeivan Mohtashami
Andrei Semenov
Alejandro Hernández Cano
Alexander Hägele
Angelika Romanou
Antoine Bosselut
Atli Kosson
Bettina Messmer
Kyle Matoba