Xiang Yue
CMU / OSU postdoc; co-author of MMMU, MMLU-Pro, MMMU-Pro benchmarks; works on multimodal LLM evaluation.
- Role: researcher
- Currently at: Carnegie Mellon University
- Twitter: twitter.com/xiangyue96
- GitHub: github.com/xiangyue9607
- Scholar: scholar.google.com/citations
- Papers: 29
Authored papers (29)
On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists
arXiv 2026
Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time
arXiv 2025
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
arXiv 2025
Demystifying Long Chain-of-Thought Reasoning in LLMs
arXiv 2025
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
arXiv 2025
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
arXiv 2025
VisCoder2: Building Multi-Language Visualization Coding Agents
arXiv 2025
Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents
arXiv 2025
VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation
arXiv 2025
ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations
arXiv 2025
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
arXiv 2025
Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators
arXiv 2025
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
NeurIPS
Data Engineering for Scaling Language Models to 128K Context
arXiv 2024
Evaluating Vision-Language Models as Evaluators in Path Planning
CVPR 2025
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
arXiv 2024
Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents
arXiv 2024
Long-context LLMs Struggle with Long In-context Learning
arXiv 2024
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
arXiv 2024
Machine Unlearning of Pre-trained Large Language Models
arXiv 2024
Evaluating Language Models as Synthetic Data Generators
arXiv 2024
ASCIIEval: Benchmarking Models' Visual Perception in Text Strings via ASCII Art
arXiv 2024
LIME: Less Is More for MLLM Evaluation
arXiv 2024
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
arXiv 2024
Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA
arXiv 2024
AttributionBench: How Hard is Automatic Attribution Evaluation?
arXiv 2024
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
CVPR 2024
Automatic Evaluation of Attribution by Large Language Models
arXiv 2023
VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation
arXiv 2023
Frequent co-authors
10 from 29 papers