Wenhu Chen
University of Waterloo professor known for MMLU-Pro, TheoremQA, multimodal reasoning benchmarks, and Vector Institute work on retrieval-augmented LLMs.
- Role: professor
- Currently at: University of Waterloo
- Twitter: twitter.com/WenhuChen
- GitHub: github.com/wenhuchen
- Scholar: scholar.google.com/citations
- Papers: 77
Authored papers (77)
ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning
arXiv 2026
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
arXiv 2026
RewardHarness: Self-Evolving Agentic Post-Training
arXiv 2026
RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time
arXiv 2026
Context Forcing: Consistent Autoregressive Video Generation with Long Context
arXiv 2026
ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks
arXiv 2026
OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis
arXiv 2026
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
arXiv 2026
Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction
arXiv 2026
ClawBench: Can AI Agents Complete Everyday Online Tasks?
arXiv 2026
VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction
arXiv 2026
Breaking the Batch Barrier (B3) of Contrastive Learning via Smart Batch Mining
arXiv 2025
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
ICCV 2025
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
arXiv 2025
StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs
arXiv 2025
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
arXiv 2025
ABC: Achieving Better Control of Multimodal Embeddings using VLMs
arXiv 2025
PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
arXiv 2025
BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
arXiv 2025
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
arXiv 2025
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
arXiv 2025
WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents
arXiv 2025
Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning
arXiv 2025
VisCoder2: Building Multi-Language Visualization Coding Agents
arXiv 2025
VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation
arXiv 2025
NeuralOS: Towards Simulating Operating Systems via Neural Generative Models
arXiv 2025
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions
arXiv 2025
Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem
arXiv 2025
Language Models Can Learn from Verbal Feedback Without Scalar Rewards
arXiv 2025
ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations
arXiv 2025
TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
arXiv 2025
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
arXiv 2025
MoCha: Towards Movie-Grade Talking Character Synthesis
arXiv 2025
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
arXiv 2025
VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
arXiv 2025
YuE: Scaling Open Foundation Models for Long-Form Music Generation
arXiv 2025
TheoremExplainAgent: Towards Video-based Multimodal Explanations for LLM Theorem Understanding
arXiv 2025
General-Reasoner: Advancing LLM Reasoning Across All Domains
arXiv 2025
Towards Trustworthy GUI Agents: A Survey
arXiv 2025
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series
arXiv 2024
ChatMusician: Understanding and Generating Music Intrinsically with LLM
arXiv 2024
AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks
arXiv 2024
ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
arXiv 2024
MANTIS: Interleaved Multi-Image Instruction Tuning
arXiv 2024
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
arXiv 2024
Foundation Models for Music: A Survey
arXiv 2024
Long-context LLMs Struggle with Long In-context Learning
arXiv 2024
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
arXiv 2024
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
arXiv 2024
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
arXiv 2024
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval
arXiv 2024
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback
arXiv 2024
T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design
arXiv 2024
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
NeurIPS
UniRAG: Universal Retrieval Augmentation for Large Vision Language Models
arXiv 2024
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
CVPR 2024 1
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
arXiv 2023
MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing
NeurIPS 2023 11
ImagenHub: Standardizing the evaluation of conditional image generation models
arXiv 2023
TheoremQA: A Theorem-driven Question Answering dataset
arXiv 2023
TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks
arXiv 2023
VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation
arXiv 2023
MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response
arXiv 2023
Kosmos-G: Generating Images in Context with Multimodal Large Language Models
arXiv 2023
Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024)
arXiv 2023
Knowledge of Knowledge: Exploring Known-Unknowns Uncertainty with Large Language Models
arXiv 2023
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
arXiv 2022
Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models
arXiv 2022
Large Language Models are few(1)-shot Table Reasoners
arXiv 2022
Controllable Dialogue Simulation with In-Context Learning
arXiv 2022
A Dataset for Answering Time-Sensitive Questions
arXiv 2021
FinQA: A Dataset of Numerical Reasoning over Financial Data
EMNLP 2021 11
A Systematic Investigation of KB-Text Embedding Alignment at Scale
ACL 2021 5
Attacking Open-domain Question Answering by Injecting Misinformation
VIOLIN: A Large-Scale Dataset for Video-and-Language Inference
Logical Natural Language Generation from Open-Domain Tables
TabFact: A Large-scale Dataset for Table-based Fact Verification
ICLR 2020 1
Eval contributions (1)
Affiliations
Previously
Frequent co-authors (10 from 77 papers)
Ping Nie
Ge Zhang (researcher)
Dongfu Jiang
Cong Wei
Wenhao Huang
Xiang Yue (researcher)
Max Ku (grad-student)
Weiming Ren (grad-student)
Yubo Wang (grad-student)
Kai Zou