Jacob Hilton

President of the Alignment Research Center (ARC) and former OpenAI safety researcher. His work covers RLHF, scalable oversight, and ARC's theoretical alignment agenda.

Role: researcher
Currently at: Alignment Research Center (ARC)
Twitter: twitter.com/JacobHHilton
GitHub: github.com/jacobhilton
Scholar: scholar.google.com/citations
Papers: 7

Attribution

Affiliations & profile: scholar.google.com/citations

7 papers · 1 eval contribution

Authored papers

Obfuscated Activations Bypass LLM Latent-Space Defenses

arXiv

2024

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

TMLR

2022

Training language models to follow instructions with human feedback

NeurIPS

2022

Teaching Models to Express Their Uncertainty in Words

arXiv

2022

Training Verifiers to Solve Math Word Problems

preprint

2021

TruthfulQA: Measuring How Models Mimic Human Falsehoods

ACL

2021

Batch size-invariance for policy optimization

2021

Eval contributions

TruthfulQA

Future of Humanity Institute (Oxford)

817 questions targeting common human misconceptions, measuring whether a model gives factually true answers or repeats popular falsehoods.

Saturated · Hallucination · Factual Recall

Affiliations

Currently at

Alignment Research Center (ARC)

researcher · nonprofit

Previously

OpenAI · frontier lab

Frequent co-authors

from 7 papers

John Schulman

co-founder

3 shared papers

Owain Evans

founder

3 shared papers

Stephanie Lin

researcher

3 shared papers

Alex Ray

researcher

2 shared papers

Amanda Askell

researcher

2 shared papers

Karl Cobbe

research scientist

2 shared papers

Aarohi Srivastava

researcher

1 shared paper

Abhay Sheshadri

1 shared paper

Abhinav Rastogi

researcher

1 shared paper

Abhishek Rao

1 shared paper