Tushar Khot

DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

arXiv 2024

Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning

arXiv 2024

AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

arXiv 2024

SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories

arXiv 2024

Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance

arXiv 2023

Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback

arXiv 2023

Specializing Smaller Language Models towards Multi-Step Reasoning

arXiv 2023

Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs

arXiv 2023

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

TMLR

Decomposed Prompting: A Modular Approach for Solving Complex Tasks

arXiv 2022

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

arXiv 2022

Teaching Broad Reasoning Skills for Multi-Step QA by Generating Hard Contexts

arXiv 2022

MuSiQue: Multihop Questions via Single-hop Question Composition

arXiv 2021

GooAQ: Open Question Answering with Diverse Answer Types

Findings (EMNLP), November 2021

Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts

NAACL, July 2022

Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

arXiv 2021

Hey AI, Can You Solve Complex Tasks by Talking to Agents?

Findings (ACL), May 2022