Luca Soldaini

Papers
26

Authored papers

How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs

arXiv 2026

2026

olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

arXiv 2025

2025

Olmo 3

arXiv 2025

2025

OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

arXiv 2025

2025

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

arXiv 2025

2025

Bolmo: Byteifying the Next Generation of Language Models

arXiv 2025

2025

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

arXiv 2025

2025

FlexOlmo: Open Language Models for Flexible Data Use

arXiv 2025

2025

Teaching Models to Understand (but not Generate) High-risk Data

arXiv 2025

2025

2 OLMo 2 Furious

arXiv 2024

2024

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

preprint

2024

OLMo: Accelerating the Science of Language Models

arXiv 2024

2024

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

CVPR 2025

2024

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

arXiv 2024

2024

OLMoE: Open Mixture-of-Experts Language Models

arXiv 2024

2024

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

arXiv 2024

2024

SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

arXiv 2024

2024

RouterRetriever: Routing over a Mixture of Expert Embedding Models

arXiv 2024

2024

Language models scale reliably with over-training and on downstream tasks

arXiv 2024

2024

FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions

arXiv 2024

2024

AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters

arXiv 2024

2024

What's In My Big Data?

arXiv 2023

2023

The Semantic Scholar Open Data Platform

arXiv 2023

2023

Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems

arXiv 2022

2022

Paragraph-based Transformer Pre-training for Multi-Sentence Inference

NAACL 2022

2022

The Cascade Transformer: an Application for Efficient Answer Sentence Selection

2020

Affiliations

No known affiliations.

Frequent co-authors

10

from 26 papers