Sanmi Koyejo

End-to-End Test-Time Training for Long Context

arXiv 2025

AlphaApollo: Orchestrating Foundation Models and Professional Tools into a Self-Evolving System for Deep Agentic Reasoning

arXiv 2025

UQ: Assessing Language Models on Unsolved Questions

arXiv 2025

Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs

arXiv 2025

Fantastic Bugs and Where to Find Them in AI Benchmarks

arXiv 2025

Reliable and Efficient Amortized Model-based Evaluation

arXiv 2025

Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs

arXiv 2025

The Leaderboard Illusion

preprint

Structured Prompting Enables More Robust Evaluation of Language Models

arXiv 2025

Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models

arXiv 2025

Pantograph: A Machine-to-Machine Interaction Interface for Advanced Theorem Proving, High Level Reasoning, and Data Extraction in Lean 4

arXiv 2024

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

arXiv 2024

Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models

arXiv 2024

Best-of-N Jailbreaking

arXiv 2024

Context Clues: Evaluating Long Context Models for Clinical Prediction Tasks on EHRs

arXiv 2024

Representation Engineering: A Top-Down Approach to AI Transparency

arXiv 2023

Learning to (Learn at Test Time)

arXiv 2023

HiFA: High-fidelity Text-to-3D Generation with Advanced Diffusion Guidance

arXiv 2023

Principled Federated Domain Adaptation: Gradient Projection and Auto-Weighting

arXiv 2023