Jacob Andreas
- Papers: 22
Authored papers
Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models
arXiv 2026
Self-Steering Language Models
arXiv 2025
Training Language Models to Explain Their Own Computations
arXiv 2025
The Surprising Effectiveness of Test-Time Training for Few-Shot Learning
arXiv 2024
A Multimodal Automated Interpretability Agent
arXiv 2024
In-Context Language Learning: Architectures and Algorithms
arXiv 2024
Language Modeling with Editable External Knowledge
arXiv 2024
Elements of World Knowledge (EWOK): A cognition-inspired framework for evaluating basic world knowledge in language models
arXiv 2024
Inspecting and Editing Knowledge Representations in Language Models
arXiv 2023
LILO: Learning Interpretable Libraries by Compressing and Documenting Code
arXiv 2023
From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought
arXiv 2023
Eliciting Human Preferences with Language Models
arXiv 2023
Decision-Oriented Dialogue for Human-AI Collaboration
arXiv 2023
Guiding Pretraining in Reinforcement Learning with Large Language Models
arXiv 2023
Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
arXiv 2023
FIND: A Function Description Benchmark for Evaluating Interpretability Methods
Grokking of Hierarchical Structure in Vanilla Transformers
arXiv 2023
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?
arXiv 2023
Interpreting User Requests in the Context of Natural Language Standing Instructions
arXiv 2023
PromptBoosting: Black-Box Text Classification with Ten Forward Passes
arXiv 2022
Towards Tracing Factual Knowledge in Language Models Back to the Training Data
arXiv 2022
Toward a Visual Concept Vocabulary for GAN Latent Space
Frequent co-authors
10 from 22 papers