Jiaheng Liu
- Papers: 65
Authored papers (65)
CodeTracer: Towards Traceable Agent States
arXiv 2026
OProver: A Unified Framework for Agentic Formal Theorem Proving
arXiv 2026
DR^{3}-Eval: Towards Realistic and Reproducible Deep Research Evaluation
arXiv 2026
EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
arXiv 2026
Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
arXiv 2026
Flow-GRPO: Training Flow Matching Models via Online RL
arXiv 2025
Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library
arXiv 2025
YuE: Scaling Open Foundation Models for Long-Form Music Generation
arXiv 2025
TaskCraft: Automated Generation of Agentic Tasks
arXiv 2025
A Comprehensive Survey on Long Context Language Modeling
arXiv 2025
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
arXiv 2025
Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
arXiv 2025
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
arXiv 2025
AutoMV: An Automatic Multi-Agent System for Music Video Generation
arXiv 2025
A Survey on Latent Reasoning
arXiv 2025
T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
arXiv 2025
How Far Are We from Genuinely Useful Deep Research Agents?
arXiv 2025
Efficient Agents: Building Effective Agents While Reducing Cost
arXiv 2025
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
arXiv 2025
Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution
arXiv 2025
ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding
arXiv 2025
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
arXiv 2025
VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
arXiv 2025
KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation
arXiv 2025
Deconstructing Long Chain-of-Thought: A Structured Reasoning Optimization Framework for Long CoT Distillation
arXiv 2025
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
arXiv 2025
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
arXiv 2025
ViDiC: Video Difference Captioning
arXiv 2025
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
arXiv 2025
"See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models
arXiv 2025
Think-J: Learning to Think for Generative LLM-as-a-Judge
arXiv 2025
ProgCo: Program Helps Self-Correction of Large Language Models
arXiv 2025
AIR: Complex Instruction Generation via Automatic Iterative Refinement
arXiv 2025
ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
arXiv 2025
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
arXiv 2025
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
arXiv 2025
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
arXiv 2025
Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
arXiv 2025
Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL
arXiv 2025
CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization
arXiv 2025
IF-VidCap: Can Video Caption Models Follow Instructions?
arXiv 2025
MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues
arXiv 2025
Multilingual Multimodal Software Developer for Code Generation
arXiv 2025
Distillation Quantification for Large Language Models
arXiv 2025
VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation
arXiv 2025
USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models
arXiv 2025
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
arXiv 2025
FullStack Bench: Evaluating LLMs as Full Stack Coders
arXiv 2024
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series
arXiv 2024
OmniBench: Towards The Future of Universal Omni-Language Models
arXiv 2024
AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions
arXiv 2024
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models
arXiv 2024
MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models
arXiv 2024
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models
arXiv 2024
McEval: Massively Multilingual Code Evaluation
arXiv 2024
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
arXiv 2024
MIO: A Foundation Model on Multimodal Tokens
arXiv 2024
FuzzCoder: Byte-level Fuzzing Test via Large Language Model
arXiv 2024
Can MLLMs Understand the Deep Implication Behind Chinese Images?
arXiv 2024
I-SHEEP: Self-Alignment of LLM from Scratch through an Iterative Self-Enhancement Paradigm
arXiv 2024
ING-VP: MLLMs cannot Play Easy Vision-based Games Yet
arXiv 2024
OWL: A Large Language Model for IT Operations
arXiv 2023
RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models
arXiv 2023
MT4CrossOIE: Multi-stage Tuning for Cross-lingual Open Information Extraction
arXiv 2023
Affiliations
Frequent co-authors
10 from 65 papers