Arman Cohan
- Papers: 75
Authored papers
QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs
arXiv 2026
Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning
arXiv 2025
ANCHOR: Branch-Point Data Generation for GUI Agents
arXiv 2026
SAGE: Benchmarking and Improving Retrieval for Deep Research Agents
arXiv 2026
RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation
arXiv 2026
OpenComputer: Verifiable Software Worlds for Computer-Use Agents
arXiv 2026
ResearchGym: Evaluating Language Model Agents on Real-World AI Research
arXiv 2026
Step-level Optimization for Efficient Computer-use Agents
arXiv 2026
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
arXiv 2026
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems
arXiv 2026
References Improve LLM Alignment in Non-Verifiable Domains
arXiv 2026
Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL
arXiv 2026
LocAgent: Graph-Guided LLM Agents for Code Localization
arXiv 2025
IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery
arXiv 2025
AlphaResearch: Accelerating New Algorithm Discovery with Language Models
arXiv 2025
Table-R1: Inference-Time Scaling for Table Reasoning
arXiv 2025
Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers
arXiv 2025
CellForge: Agentic Design of Virtual Cell Models
arXiv 2025
FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain
arXiv 2025
SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks
arXiv 2025
MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs
arXiv 2025
MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search
arXiv 2025
ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning
arXiv 2025
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
CVPR 2025 1
Z1: Efficient Test-time Scaling with Code
arXiv 2025
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning
arXiv 2025
TESS 2: A Large-Scale Generalist Diffusion Language Model
arXiv 2025
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers
arXiv 2025
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
arXiv 2025
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
arXiv 2025
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
arXiv 2025
FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering
arXiv 2025
PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles
arXiv 2025
PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving
arXiv 2025
OLMo: Accelerating the Science of Language Models
arXiv 2024
SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature
arXiv 2024
RouterRetriever: Routing over a Mixture of Expert Embedding Models
arXiv 2024
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models
arXiv 2024
FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
arXiv 2024
Understanding Reference Policies in Direct Preference Optimization
arXiv 2024
Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models
arXiv 2024
SciDQA: A Deep Reading Comprehension Dataset over Scientific Papers
arXiv 2024
Bayesian Calibration of Win Rate Estimation with LLM Evaluators
arXiv 2024
ReIFE: Re-evaluating Instruction-Following Evaluation
arXiv 2024
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
arXiv 2024
Evaluating LLMs at Detecting Errors in LLM Responses
arXiv 2024
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
arXiv 2024
MDCure: A Scalable Pipeline for Multi-Document Instruction-Following
arXiv 2024
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code
arXiv 2023
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning
arXiv 2023
The Semantic Scholar Open Data Platform
arXiv 2023
Investigating Table-to-Text Generation Capabilities of LLMs in Real-World Information Seeking Scenarios
arXiv 2023
Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?
arXiv 2023
QTSumm: Query-Focused Summarization over Tabular Data
arXiv 2023
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization
arXiv 2023
TESS: Text-to-Text Self-Conditioned Simplex Diffusion
arXiv 2023
Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering
arXiv 2023
On Learning to Summarize with Large Language Models as References
arXiv 2023
FinanceMath: Knowledge-Intensive Math Reasoning in Finance Domains
arXiv 2023
DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents
arXiv 2023
FOLIO: Natural Language Reasoning with First-Order Logic
arXiv 2022
PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization
ACL 2022 5
A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers
NAACL 2021 4
Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity
NAACL 2022 7
MultiVerS: Improving scientific claim verification with weak supervision and full-document context
Findings (NAACL) 2022 7
CDLM: Cross-Document Language Modeling
Findings (EMNLP) 2021 11
SPECTER: Document-level Representation Learning using Citation-informed Transformers
TLDR: Extreme Summarization of Scientific Documents
Findings (EMNLP) 2020
Longformer: The Long-Document Transformer
arXiv 2020
ParsiNLU: A Suite of Language Understanding Challenges for Persian
arXiv 2020
SciBERT: A Pretrained Language Model for Scientific Text
Structural Scaffolds for Citation Intent Classification in Scientific Publications
CEDR: Contextualized Embeddings for Document Ranking
arXiv 2019
Pretrained Language Models for Sequential Sentence Classification
A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents