Xiang Yue

CMU / OSU postdoc; co-author of MMMU, MMLU-Pro, MMMU-Pro benchmarks; works on multimodal LLM evaluation.

Role
researcher
Papers
29

Authored papers

29

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

arXiv 2026

2026

Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time

arXiv 2025

2025

On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

arXiv 2025

2025

Demystifying Long Chain-of-Thought Reasoning in LLMs

arXiv 2025

2025

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

arXiv 2025

2025

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

arXiv 2025

2025

VisCoder2: Building Multi-Language Visualization Coding Agents

arXiv 2025

2025

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

arXiv 2025

2025

VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation

arXiv 2025

2025

ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations

arXiv 2025

2025

VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search

arXiv 2025

2025

Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators

arXiv 2025

2025

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

NeurIPS 2024

2024

Data Engineering for Scaling Language Models to 128K Context

arXiv 2024

2024

Evaluating Vision-Language Models as Evaluators in Path Planning

CVPR 2025

2024

Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

arXiv 2024

2024

Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents

arXiv 2024

2024

Long-context LLMs Struggle with Long In-context Learning

arXiv 2024

2024

MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

arXiv 2024

2024

Machine Unlearning of Pre-trained Large Language Models

arXiv 2024

2024

Evaluating Language Models as Synthetic Data Generators

arXiv 2024

2024

ASCIIEval: Benchmarking Models' Visual Perception in Text Strings via ASCII Art

arXiv 2024

2024

LIME: Less Is More for MLLM Evaluation

arXiv 2024

2024

MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

arXiv 2024

2024

Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

arXiv 2024

2024

AttributionBench: How Hard is Automatic Attribution Evaluation?

arXiv 2024

2024

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

CVPR 2024

2023

Automatic Evaluation of Attribution by Large Language Models

arXiv 2023

2023

VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation

arXiv 2023

2023

Affiliations

Currently at

Carnegie Mellon University

researcher · university lab

Frequent co-authors

10

from 29 papers