Arman Cohan

Papers
75

Authored papers

75

QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

arXiv 2026

2026

Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning

arXiv 2025

2026

ANCHOR: Branch-Point Data Generation for GUI Agents

arXiv 2026

2026

SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

arXiv 2026

2026

RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation

arXiv 2026

2026

OpenComputer: Verifiable Software Worlds for Computer-Use Agents

arXiv 2026

2026

ResearchGym: Evaluating Language Model Agents on Real-World AI Research

arXiv 2026

2026

Step-level Optimization for Efficient Computer-use Agents

arXiv 2026

2026

Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

arXiv 2026

2026

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

arXiv 2026

2026

References Improve LLM Alignment in Non-Verifiable Domains

arXiv 2026

2026

Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL

arXiv 2026

2026

LocAgent: Graph-Guided LLM Agents for Code Localization

arXiv 2025

2025

IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery

arXiv 2025

2025

AlphaResearch: Accelerating New Algorithm Discovery with Language Models

arXiv 2025

2025

Table-R1: Inference-Time Scaling for Table Reasoning

arXiv 2025

2025

Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers

arXiv 2025

2025

CellForge: Agentic Design of Virtual Cell Models

arXiv 2025

2025

FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain

arXiv 2025

2025

SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks

arXiv 2025

2025

MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs

arXiv 2025

2025

MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search

arXiv 2025

2025

ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning

arXiv 2025

2025

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

CVPR 2025 1

2025

Z1: Efficient Test-time Scaling with Code

arXiv 2025

2025

MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning

arXiv 2025

2025

TESS 2: A Large-Scale Generalist Diffusion Language Model

arXiv 2025

2025

Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

arXiv 2025

2025

SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification

arXiv 2025

2025

AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research

arXiv 2025

2025

MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation

arXiv 2025

2025

FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering

arXiv 2025

2025

PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles

arXiv 2025

2025

PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving

arXiv 2025

2025

OLMo: Accelerating the Science of Language Models

arXiv 2024

2024

SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

arXiv 2024

2024

RouterRetriever: Routing over a Mixture of Expert Embedding Models

arXiv 2024

2024

M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models

arXiv 2024

2024

FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions

arXiv 2024

2024

Understanding Reference Policies in Direct Preference Optimization

arXiv 2024

2024

Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models

arXiv 2024

2024

SciDQA: A Deep Reading Comprehension Dataset over Scientific Papers

arXiv 2024

2024

Bayesian Calibration of Win Rate Estimation with LLM Evaluators

arXiv 2024

2024

ReIFE: Re-evaluating Instruction-Following Evaluation

arXiv 2024

2024

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

arXiv 2024

2024

Evaluating LLMs at Detecting Errors in LLM Responses

arXiv 2024

2024

TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models

arXiv 2024

2024

MDCure: A Scalable Pipeline for Multi-Document Instruction-Following

arXiv 2024

2024

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

arXiv 2023

2023

MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning

arXiv 2023

2023

The Semantic Scholar Open Data Platform

arXiv 2023

2023

Investigating Table-to-Text Generation Capabilities of LLMs in Real-World Information Seeking Scenarios

arXiv 2023

2023

Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

arXiv 2023

2023

QTSumm: Query-Focused Summarization over Tabular Data

arXiv 2023

2023

Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization

arXiv 2023

2023

TESS: Text-to-Text Self-Conditioned Simplex Diffusion

arXiv 2023

2023

Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering

arXiv 2023

2023

On Learning to Summarize with Large Language Models as References

arXiv 2023

2023

FinanceMath: Knowledge-Intensive Math Reasoning in Finance Domains

arXiv 2023

2023

DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents

arXiv 2023

2023

FOLIO: Natural Language Reasoning with First-Order Logic

arXiv 2022

2022

PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization

ACL 2022 5

2021

A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers

NAACL 2021 4

2021

Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity

NAACL 2022 7

2021

MultiVerS: Improving scientific claim verification with weak supervision and full-document context

Findings (NAACL) 2022 7

2021

CDLM: Cross-Document Language Modeling

Findings (EMNLP) 2021 11

2021

SPECTER: Document-level Representation Learning using Citation-informed Transformers

2020

TLDR: Extreme Summarization of Scientific Documents

Findings of the Association for Computational Linguistics 2020

2020

Longformer: The Long-Document Transformer

arXiv 2020

2020

ParsiNLU: A Suite of Language Understanding Challenges for Persian

arXiv 2020

2020

SciBERT: A Pretrained Language Model for Scientific Text

2019

Structural Scaffolds for Citation Intent Classification in Scientific Publications

2019

CEDR: Contextualized Embeddings for Document Ranking

arXiv 2019

2019

Pretrained Language Models for Sequential Sentence Classification

2019

A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents

2018

Affiliations

No known affiliations.
