Luca Soldaini

Papers: 26

Attribution

Affiliations & profile: Semantic Scholar

Authored papers

How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs

arXiv 2026

2026

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

arXiv 2025

2025

Olmo 3

arXiv 2025

2025

OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

arXiv 2025

2025

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

arXiv 2025

2025

Teaching Models to Understand (but not Generate) High-risk Data

arXiv 2025

2025

Bolmo: Byteifying the Next Generation of Language Models

arXiv 2025

2025

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

arXiv 2025

2025

FlexOlmo: Open Language Models for Flexible Data Use

arXiv 2025

2025

2 OLMo 2 Furious

arXiv 2024

2024

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

preprint

2024

OLMo: Accelerating the Science of Language Models

arXiv 2024

2024

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

CVPR 2025 1

2024

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

arXiv 2024

2024

OLMoE: Open Mixture-of-Experts Language Models

arXiv 2024

2024

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

arXiv 2024

2024

SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

arXiv 2024

2024

RouterRetriever: Routing over a Mixture of Expert Embedding Models

arXiv 2024

2024

FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions

arXiv 2024

2024

AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters

arXiv 2024

2024

Language models scale reliably with over-training and on downstream tasks

arXiv 2024

2024

What's In My Big Data?

arXiv 2023

2023

The Semantic Scholar Open Data Platform

arXiv 2023

2023

Paragraph-based Transformer Pre-training for Multi-Sentence Inference

NAACL 2022 7

2022

Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems

arXiv 2022

2022

The Cascade Transformer: an Application for Efficient Answer Sentence Selection

2020

Affiliations

No known affiliations.

Frequent co-authors

from 26 papers

Kyle Lo

14 shared papers

Hannaneh Hajishirzi

professor

12 shared papers

Noah A. Smith

11 shared papers

Dirk Groeneveld

9 shared papers

Jacob Morrison

research engineer

8 shared papers

Luke Zettlemoyer

professor

Pang Wei Koh

Pete Walsh

Akshita Bhagia

Dustin Schwenk