Yilun Zhao
- Papers: 47
Authored papers
OpenComputer: Verifiable Software Worlds for Computer-Use Agents
arXiv 2026
Step-level Optimization for Efficient Computer-use Agents
arXiv 2026
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems
arXiv 2026
Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL
arXiv 2026
TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction
arXiv 2026
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
arXiv 2026
ANCHOR: Branch-Point Data Generation for GUI Agents
arXiv 2026
SAGE: Benchmarking and Improving Retrieval for Deep Research Agents
arXiv 2026
RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation
arXiv 2026
Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing
arXiv 2026
AlphaResearch: Accelerating New Algorithm Discovery with Language Models
arXiv 2025
Table-R1: Inference-Time Scaling for Table Reasoning
arXiv 2025
Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers
arXiv 2025
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
arXiv 2025
FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain
arXiv 2025
SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks
arXiv 2025
MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search
arXiv 2025
ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning
arXiv 2025
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
CVPR 2025
Z1: Efficient Test-time Scaling with Code
arXiv 2025
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning
arXiv 2025
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers
arXiv 2025
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
arXiv 2025
IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval
arXiv 2025
Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure
arXiv 2025
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
arXiv 2025
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
arXiv 2025
Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers
arXiv 2025
FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering
arXiv 2025
PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles
arXiv 2025
PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving
arXiv 2025
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models
arXiv 2024
ReIFE: Re-evaluating Instruction-Following Evaluation
arXiv 2024
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
arXiv 2024
Evaluating LLMs at Detecting Errors in LLM Responses
arXiv 2024
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
arXiv 2024
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code
arXiv 2023
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning
arXiv 2023
Investigating Table-to-Text Generation Capabilities of LLMs in Real-World Information Seeking Scenarios
arXiv 2023
Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?
arXiv 2023
FinanceMath: Knowledge-Intensive Math Reasoning in Finance Domains
arXiv 2023
DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents
arXiv 2023
QTSumm: Query-Focused Summarization over Tabular Data
arXiv 2023
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization
arXiv 2023
FOLIO: Natural Language Reasoning with First-Order Logic
arXiv 2022
Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
arXiv 2022
ReasTAP: Injecting Table Reasoning Skills During Pre-training via Synthetic Reasoning Examples
arXiv 2022
Frequent co-authors
10 from 47 papers