Kai Chen
- Papers: 109
Authored papers
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
arXiv 2026
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
arXiv 2026
EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
arXiv 2026
Innovator-VL: A Multimodal Large Language Model for Scientific Discovery
arXiv 2026
DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning
arXiv 2026
PhysBrain 1.0 Technical Report
arXiv 2026
IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
arXiv 2026
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
arXiv 2026
TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers
arXiv 2026
How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
arXiv 2026
Visual-ERM: Reward Modeling for Visual Equivalence
arXiv 2026
P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads
arXiv 2026
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
arXiv 2026
Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning
arXiv 2026
LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
arXiv 2026
ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning
arXiv 2026
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
arXiv 2025
MemOS: A Memory OS for AI System
arXiv 2025
Qwen-Image Technical Report
arXiv 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
arXiv 2025
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
arXiv 2025
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
arXiv 2025
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
arXiv 2025
Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs
arXiv 2025
CritiQ: Mining Data Quality Criteria from Human Preferences
arXiv 2025
MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
arXiv 2025
NTIRE 2025 Challenge on UGC Video Enhancement: Methods and Results
arXiv 2025
Redundancy Principles for MLLMs Benchmarks
arXiv 2025
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
arXiv 2025
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
arXiv 2025
SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery
arXiv 2025
Rectifying LLM Thought from Lens of Optimization
arXiv 2025
P1: Mastering Physics Olympiads with Reinforcement Learning
arXiv 2025
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution
arXiv 2025
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
arXiv 2025
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
arXiv 2025
CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards
arXiv 2025
InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
arXiv 2025
Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
arXiv 2025
Pre-Trained Policy Discriminators are General Reward Models
arXiv 2025
Rethinking Verification for LLM Code Generation: From Generation to Testing
arXiv 2025
JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence
arXiv 2025
CharacterShot: Controllable and Consistent 4D Character Animation
arXiv 2025
IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards
arXiv 2025
ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
arXiv 2025
Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models
arXiv 2025
Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
arXiv 2025
Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction
arXiv 2025
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
arXiv 2025
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
arXiv 2025
Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning
arXiv 2025
Tady: A Neural Disassembler without Structural Constraint Violations
arXiv 2025
Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
arXiv 2025
CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
arXiv 2025
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
arXiv 2025
Large Language Models for Cyber Security: A Systematic Literature Review
arXiv 2024
NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?
arXiv 2024
Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation
arXiv 2024
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
arXiv 2024
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher
arXiv 2024
StyleShot: A Snapshot on Any Style
arXiv 2024
Are Your LLMs Capable of Stable Reasoning?
arXiv 2024
Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks
arXiv 2024
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models
arXiv 2024
OMG-Seg: Is One Model Good Enough For All Segmentation?
CVPR 2024 1
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation
arXiv 2024
RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything
arXiv 2024
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
arXiv 2024
GTA: A Benchmark for General Tool Agents
arXiv 2024
AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation
arXiv 2024
Can AI Assistants Know What They Don't Know?
arXiv 2024
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
CVPR 2025 1
Adapting LLaMA Decoder to Vision Transformer
arXiv 2024
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models
arXiv 2024
CriticEval: Evaluating Large Language Model as Critic
arXiv 2024
4D Contrastive Superflows are Dense 3D Representation Learners
arXiv 2024
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
arXiv 2024
Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models
arXiv 2024
InternLM-Law: An Open Source Chinese Legal Large Language Model
arXiv 2024
CIBench: Evaluating Your LLMs with a Code Interpreter Plugin
arXiv 2024
IsamasRed: A Public Dataset Tracking Reddit Discussions on Israel-Hamas Conflict
arXiv 2024
InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems
arXiv 2024
YOLOv10: Real-Time End-to-End Object Detection
arXiv 2024
Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study
arXiv 2024
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices
arXiv 2024
AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data
arXiv 2024
ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
arXiv 2024
LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models
arXiv 2024
How Susceptible are Large Language Models to Ideological Manipulation?
arXiv 2024
Dormant: Defending against Pose-driven Human Image Animation
arXiv 2024
Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models
arXiv 2024
STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering
arXiv 2024
Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models
arXiv 2024
Improving Pixel-based MIM by Reducing Wasted Modeling Capability
ICCV 2023 1
BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues
arXiv 2023
PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
CVPR 2024 1
GlyphControl: Glyph Conditional Control for Visual Text Generation
T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step
arXiv 2023
Evaluating Hallucinations in Chinese Large Language Models
arXiv 2023
Deep Fusion Transformer Network with Weighted Vector-Wise Keypoints Voting for Robust 6D Object Pose Estimation
ICCV 2023 1
Safer-Instruct: Aligning Language Models with Automated Preference Data
arXiv 2023
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
arXiv 2023
A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting
arXiv 2023
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
arXiv 2023
Segment Any Point Cloud Sequences by Distilling Vision Foundation Models
NeurIPS 2023 11
RTMDet: An Empirical Study of Designing Real-Time Object Detectors
arXiv 2022
Consistent-Teacher: Towards Reducing Inconsistent Pseudo-targets in Semi-supervised Object Detection
NTIRE 2022 Challenge on Super-Resolution and Quality Enhancement of Compressed Video: Dataset, Methods and Results
arXiv 2022
Efficient Estimation of Word Representations in Vector Space
arXiv 2013
Frequent co-authors
10 from 109 papers