Kai Chen

Papers
109

Attribution

Affiliations & profile
Semantic Scholar

Authored papers

109

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

arXiv 2026

2026

FrameSkip: Learning from Fewer but More Informative Frames in VLA Training

arXiv 2026

2026

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

arXiv 2026

2026

Innovator-VL: A Multimodal Large Language Model for Scientific Discovery

arXiv 2026

2026

DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

arXiv 2026

2026

PhysBrain 1.0 Technical Report

arXiv 2026

2026

IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

arXiv 2026

2026

InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

arXiv 2026

2026

TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers

arXiv 2026

2026

How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

arXiv 2026

2026

Visual-ERM: Reward Modeling for Visual Equivalence

arXiv 2026

2026

P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads

arXiv 2026

2026

Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

arXiv 2026

2026

Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning

arXiv 2026

2026

LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

arXiv 2026

2026

ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning

arXiv 2026

2026

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

arXiv 2025

2025

MemOS: A Memory OS for AI System

arXiv 2025

2025

Qwen-Image Technical Report

arXiv 2025

2025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

arXiv 2025

2025

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

arXiv 2025

2025

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

arXiv 2025

2025

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

arXiv 2025

2025

Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs

arXiv 2025

2025

CritiQ: Mining Data Quality Criteria from Human Preferences

arXiv 2025

2025

MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space

arXiv 2025

2025

NTIRE 2025 Challenge on UGC Video Enhancement: Methods and Results

arXiv 2025

2025

Redundancy Principles for MLLMs Benchmarks

arXiv 2025

2025

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

arXiv 2025

2025

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

arXiv 2025

2025

SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery

arXiv 2025

2025

Rectifying LLM Thought from Lens of Optimization

arXiv 2025

2025

P1: Mastering Physics Olympiads with Reinforcement Learning

arXiv 2025

2025

SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution

arXiv 2025

2025

CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

arXiv 2025

2025

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

arXiv 2025

2025

CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards

arXiv 2025

2025

InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models

arXiv 2025

2025

Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks

arXiv 2025

2025

Pre-Trained Policy Discriminators are General Reward Models

arXiv 2025

2025

Rethinking Verification for LLM Code Generation: From Generation to Testing

arXiv 2025

2025

JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

arXiv 2025

2025

CharacterShot: Controllable and Consistent 4D Character Animation

arXiv 2025

2025

IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards

arXiv 2025

2025

ExpVid: A Benchmark for Experiment Video Understanding & Reasoning

arXiv 2025

2025

Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models

arXiv 2025

2025

Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement

arXiv 2025

2025

Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction

arXiv 2025

2025

LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?

arXiv 2025

2025

Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM

arXiv 2025

2025

Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning

arXiv 2025

2025

Tady: A Neural Disassembler without Structural Constraint Violations

arXiv 2025

2025

Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective

arXiv 2025

2025

CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning

arXiv 2025

2025

Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning

arXiv 2025

2025

Large Language Models for Cyber Security: A Systematic Literature Review

arXiv 2024

2024

NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?

arXiv 2024

2024

Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation

arXiv 2024

2024

FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

arXiv 2024

2024

MindSearch: Mimicking Human Minds Elicits Deep AI Searcher

arXiv 2024

2024

StyleShot: A Snapshot on Any Style

arXiv 2024

2024

Are Your LLMs Capable of Stable Reasoning?

arXiv 2024

2024

Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

arXiv 2024

2024

Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models

arXiv 2024

2024

OMG-Seg: Is One Model Good Enough For All Segmentation?

CVPR 2024 1

2024

HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

arXiv 2024

2024

RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything

arXiv 2024

2024

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

arXiv 2024

2024

GTA: A Benchmark for General Tool Agents

arXiv 2024

2024

AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation

arXiv 2024

2024

Can AI Assistants Know What They Don't Know?

arXiv 2024

2024

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

CVPR 2025 1

2024

Adapting LLaMA Decoder to Vision Transformer

arXiv 2024

2024

HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models

arXiv 2024

2024

CriticEval: Evaluating Large Language Model as Critic

arXiv 2024

2024

4D Contrastive Superflows are Dense 3D Representation Learners

arXiv 2024

2024

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

arXiv 2024

2024

Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models

arXiv 2024

2024

InternLM-Law: An Open Source Chinese Legal Large Language Model

arXiv 2024

2024

CIBench: Evaluating Your LLMs with a Code Interpreter Plugin

arXiv 2024

2024

IsamasRed: A Public Dataset Tracking Reddit Discussions on Israel-Hamas Conflict

arXiv 2024

2024

InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems

arXiv 2024

2024

YOLOv10: Real-Time End-to-End Object Detection

arXiv 2024

2024

Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study

arXiv 2024

2024

What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices

arXiv 2024

2024

AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

arXiv 2024

2024

ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs

arXiv 2024

2024

LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models

arXiv 2024

2024

How Susceptible are Large Language Models to Ideological Manipulation?

arXiv 2024

2024

Dormant: Defending against Pose-driven Human Image Animation

arXiv 2024

2024

Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models

arXiv 2024

2024

STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering

arXiv 2024

2024

Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models

arXiv 2024

2024

Improving Pixel-based MIM by Reducing Wasted Modeling Capability

ICCV 2023 1

2023

BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues

arXiv 2023

2023

PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models

CVPR 2024 1

2023

GlyphControl: Glyph Conditional Control for Visual Text Generation

2023

T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

arXiv 2023

2023

Evaluating Hallucinations in Chinese Large Language Models

arXiv 2023

2023

Deep Fusion Transformer Network with Weighted Vector-Wise Keypoints Voting for Robust 6D Object Pose Estimation

ICCV 2023 1

2023

Safer-Instruct: Aligning Language Models with Automated Preference Data

arXiv 2023

2023

GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

arXiv 2023

2023

A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting

arXiv 2023

2023

MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

arXiv 2023

2023

Segment Any Point Cloud Sequences by Distilling Vision Foundation Models

NeurIPS 2023 11

2023

RTMDet: An Empirical Study of Designing Real-Time Object Detectors

arXiv 2022

2022

Consistent-Teacher: Towards Reducing Inconsistent Pseudo-targets in Semi-supervised Object Detection

2022

NTIRE 2022 Challenge on Super-Resolution and Quality Enhancement of Compressed Video: Dataset, Methods and Results

arXiv 2022

2022

Efficient Estimation of Word Representations in Vector Space

arXiv 2013

2013

Affiliations

No known affiliations.

Frequent co-authors

10

from 109 papers