Wenhao Huang

Seed1.5-VL Technical Report

arXiv 2025

FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis

arXiv 2025

A Comprehensive Survey on Long Context Language Modeling

arXiv 2025

P2P: Automated Paper-to-Poster Generation and Fine-Grained Benchmark

arXiv 2025

FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models

arXiv 2025

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

arXiv 2025

AutoMV: An Automatic Multi-Agent System for Music Video Generation

arXiv 2025

Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

arXiv 2025

A Survey on Latent Reasoning

arXiv 2025

WideSearch: Benchmarking Agentic Broad Info-Seeking

arXiv 2025

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

arXiv 2025

A Systematic Analysis of Hybrid Linear Attention

arXiv 2025

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

arXiv 2025

DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World

arXiv 2025

MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

arXiv 2025

ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding

arXiv 2025

Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures

arXiv 2025

NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

arXiv 2025

TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

arXiv 2025

CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization

arXiv 2025

Reverse-Engineered Reasoning for Open-Ended Generation

arXiv 2025

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

arXiv 2025

COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values

arXiv 2025

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

arXiv 2025

DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

arXiv 2025

OmniBench: Towards The Future of Universal Omni-Language Models

arXiv 2024

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

arXiv 2024

ChatMusician: Understanding and Generating Music Intrinsically with LLM

arXiv 2024

AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation

arXiv 2024

AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

arXiv 2024

ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation

arXiv 2024

Foundation Models for Music: A Survey

arXiv 2024

A Comparative Study on Reasoning Patterns of OpenAI's o1 Model

arXiv 2024

SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents

arXiv 2024

Yi: Open Foundation Models by 01.AI

arXiv 2024

Kun: Answer Polishment for Chinese Self-Alignment with Instruction Back-Translation

arXiv 2024

MIO: A Foundation Model on Multimodal Tokens

arXiv 2024

II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

arXiv 2024

SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval

arXiv 2024

Can MLLMs Understand the Deep Implication Behind Chinese Images?

arXiv 2024

I-SHEEP: Self-Alignment of LLM from Scratch through an Iterative Self-Enhancement Paradigm

arXiv 2024

LIME: Less Is More for MLLM Evaluation

arXiv 2024

ING-VP: MLLMs cannot Play Easy Vision-based Games Yet

arXiv 2024

MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models

arXiv 2024

m2mKD: Module-to-Module Knowledge Distillation for Modular Transformers

arXiv 2024

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

CVPR 2024

Chinese Open Instruction Generalist: A Preliminary Release

arXiv 2023

LLaSM: Large Language and Speech Model

arXiv 2023

MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training

arXiv 2023

Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation

arXiv 2023

TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks

arXiv 2023

Can Large Language Models Understand Real-World Complex Instructions?

arXiv 2023

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

arXiv 2023