Songyang Zhang

CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

arXiv 2025

CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards

arXiv 2025

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

arXiv 2025

Rethinking Verification for LLM Code Generation: From Generation to Testing

arXiv 2025

Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement

arXiv 2025

PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model

arXiv 2025

Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective

arXiv 2025

SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence

arXiv 2025

Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning

arXiv 2025

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

arXiv 2025

NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?

arXiv 2024

UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

arXiv 2024

Are Your LLMs Capable of Stable Reasoning?

arXiv 2024

Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

arXiv 2024

CIBench: Evaluating Your LLMs with a Code Interpreter Plugin

arXiv 2024

InternLM-Law: An Open Source Chinese Legal Large Language Model

arXiv 2024

GTA: A Benchmark for General Tool Agents

arXiv 2024

Adapting LLaMA Decoder to Vision Transformer

arXiv 2024

HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models

arXiv 2024

ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs

arXiv 2024

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

arXiv 2024

LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models

arXiv 2024

Improving Pixel-based MIM by Reducing Wasted Modeling Capability

ICCV 2023

BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues

arXiv 2023

T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

arXiv 2023

Fake Alignment: Are LLMs Really Aligned Well?

arXiv 2023