Ganqu Cui

Tsinghua/Shanghai AI Lab researcher known for UltraFeedback, PRM800K-style process reward modeling, and open alignment data.

Role: researcher
Currently at: Shanghai AI Laboratory
Twitter: twitter.com/cgq2333
GitHub: github.com/cgq15
Scholar: scholar.google.com/citations
Papers: 31

Attribution

Affiliations & profile: scholar.google.com/citations

31 papers · 1 tool contribs

Authored papers

31

Post-Trained MoE Can Skip Half Experts via Self-Distillation

arXiv 2026

InCoder-32B: Code Foundation Model for Industrial Scenarios

arXiv 2026

TEMPO: Scaling Test-time Training for Large Reasoning Models

arXiv 2026

Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

arXiv 2026

P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads

arXiv 2026

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

arXiv 2025

TTRL: Test-Time Reinforcement Learning

arXiv 2025

MiniCPM4: Ultra-Efficient LLMs on End Devices

arXiv 2025

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

arXiv 2025

Learning to Reason under Off-Policy Guidance

arXiv 2025

Process Reinforcement through Implicit Rewards

arXiv 2025

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

arXiv 2025

FlowRL: Matching Reward Distributions for LLM Reasoning

arXiv 2025

A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond

arXiv 2025

JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

arXiv 2025

A Survey of Reinforcement Learning for Large Reasoning Models

arXiv 2025

P1: Mastering Physics Olympiads with Reinforcement Learning

arXiv 2025

From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones

arXiv 2025

RLPR: Extrapolating RLVR to General Domains without Verifiers

arXiv 2025

UltraIF: Advancing Instruction Following from the Wild

arXiv 2025

From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning

arXiv 2025

RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness

CVPR 2025 1

Advancing LLM Reasoning Generalists with Preference Trees

arXiv 2024

UltraMedical: Building Specialized Generalists in Biomedicine

arXiv 2024

Free Process Rewards without Process Labels

arXiv 2024

Noise Contrastive Alignment of Language Models with Explicit Rewards

arXiv 2024

Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment

arXiv 2024

UltraFeedback: Boosting Language Models with High-quality Feedback

ICML

RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

CVPR 2024 1

Tool Learning with Foundation Models

arXiv 2023

Exploring the Universal Vulnerability of Prompt-based Learning Paradigm

Findings (NAACL) 2022 7

Tool contributions

1

UltraFeedback

OpenBMB

OpenBMB's 64k-prompt preference dataset built with GPT-4 critiques across instruction-following, truthfulness, honesty, and helpfulness - the de facto open DPO baseline.

Preference · Instruction Following · Hallucination · Safety

Affiliations

Currently at

Shanghai AI Laboratory

researcher · research group

Previously

Tsinghua University

university lab

Frequent co-authors

10

from 31 papers

Ning Ding

researcher

21 shared papers

Zhiyuan Liu

professor

17 shared papers

Bowen Zhou

professor

13 shared papers

Maosong Sun

professor

11 shared papers

Yu Cheng

10 shared papers

Kaiyan Zhang

9 shared papers

Lifan Yuan

grad-student

9 shared papers

Yuxin Zuo

9 shared papers

Weize Chen

7 shared papers

Bingxiang He

6 shared papers