Ge Zhang
M-A-P / TIGER-Lab co-founder; researcher building open evaluation benchmarks and music/multimodal LLMs.
- Role: researcher
- Currently at: TIGER-Lab
- Twitter: twitter.com/zhangge6
- GitHub: github.com/zhangysk
- Scholar: scholar.google.com/citations
- Papers: 88
Authored papers (88)
In-Place Test-Time Training
arXiv 2026
$OneMillion-Bench: How Far are Language Agents from Human Experts?
arXiv 2026
ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation
arXiv 2026
Context Forcing: Consistent Autoregressive Video Generation with Long Context
arXiv 2026
Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
arXiv 2026
Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities
arXiv 2026
YuE: Scaling Open Foundation Models for Long-Form Music Generation
arXiv 2025
General-Reasoner: Advancing LLM Reasoning Across All Domains
arXiv 2025
TaskCraft: Automated Generation of Agentic Tasks
arXiv 2025
FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis
arXiv 2025
A Comprehensive Survey on Long Context Language Modeling
arXiv 2025
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
ICCV 2025
P2P: Automated Paper-to-Poster Generation and Fine-Grained Benchmark
arXiv 2025
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
arXiv 2025
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models
arXiv 2025
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
arXiv 2025
AutoMV: An Automatic Multi-Agent System for Music Video Generation
arXiv 2025
A Survey on Latent Reasoning
arXiv 2025
WideSearch: Benchmarking Agentic Broad Info-Seeking
arXiv 2025
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
arXiv 2025
A Systematic Analysis of Hybrid Linear Attention
arXiv 2025
Efficient Agents: Building Effective Agents While Reducing Cost
arXiv 2025
Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL
arXiv 2025
A^2FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning
arXiv 2025
Reverse-Engineered Reasoning for Open-Ended Generation
arXiv 2025
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
arXiv 2025
Multilingual Multimodal Software Developer for Code Generation
arXiv 2025
Audio-FLAN: A Preliminary Release
arXiv 2025
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
arXiv 2025
Generating Symbolic World Models via Test-time Scaling of Large Language Models
arXiv 2025
Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures
arXiv 2025
VeriGUI: Verifiable Long-Chain GUI Dataset
arXiv 2025
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
arXiv 2025
Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving
arXiv 2025
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
arXiv 2025
Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
arXiv 2025
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
arXiv 2025
Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution
arXiv 2025
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling
arXiv 2025
CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization
arXiv 2025
KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation
arXiv 2025
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
arXiv 2025
Towards Personalized Deep Research: Benchmarks and Evaluations
arXiv 2025
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
arXiv 2025
DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains
arXiv 2025
ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
arXiv 2025
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
NeurIPS 2024
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
arXiv 2024
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series
arXiv 2024
OmniBench: Towards The Future of Universal Omni-Language Models
arXiv 2024
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
arXiv 2024
ChatMusician: Understanding and Generating Music Intrinsically with LLM
arXiv 2024
General Preference Modeling with Preference Representations for Aligning Language Models
arXiv 2024
Yi: Open Foundation Models by 01.AI
arXiv 2024
Foundation Models for Music: A Survey
arXiv 2024
Towards a Unified View of Preference Learning for Large Language Models: A Survey
arXiv 2024
MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models
arXiv 2024
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models
arXiv 2024
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
arXiv 2024
ComposerX: Multi-Agent Symbolic Music Composition with LLMs
arXiv 2024
ING-VP: MLLMs cannot Play Easy Vision-based Games Yet
arXiv 2024
MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models
arXiv 2024
MORE-3S: Multimodal-based Offline Reinforcement Learning with Shared Semantic Spaces
arXiv 2024
AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions
arXiv 2024
ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
arXiv 2024
Long-context LLMs Struggle with Long In-context Learning
arXiv 2024
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model
arXiv 2024
VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text
arXiv 2024
McEval: Massively Multilingual Code Evaluation
arXiv 2024
Kun: Answer Polishment for Chinese Self-Alignment with Instruction Back-Translation
arXiv 2024
Read to Play (R2-Play): Decision Transformer with Multimodal Game Instruction
arXiv 2024
MIO: A Foundation Model on Multimodal Tokens
arXiv 2024
II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models
arXiv 2024
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval
arXiv 2024
FuzzCoder: Byte-level Fuzzing Test via Large Language Model
arXiv 2024
Can MLLMs Understand the Deep Implication Behind Chinese Images?
arXiv 2024
I-SHEEP: Self-Alignment of LLM from Scratch through an Iterative Self-Enhancement Paradigm
arXiv 2024
LIME: Less Is More for MLLM Evaluation
arXiv 2024
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
CVPR 2024
AutoAgents: A Framework for Automatic Agent Generation
arXiv 2023
Chinese Open Instruction Generalist: A Preliminary Release
arXiv 2023
Align on the Fly: Adapting Chatbot Behavior to Established Norms
arXiv 2023
MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response
arXiv 2023
Massive Editing for Large Language Models via Meta Learning
arXiv 2023
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
arXiv 2023
Training Socially Aligned Language Models on Simulated Social Interactions
arXiv 2023
TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks
arXiv 2023
LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT
arXiv 2023
Frequent co-authors
10 from 88 papers