Hao Chen

Glance and Focus Reinforcement for Pan-cancer Screening

arXiv 2026

OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

arXiv 2026

BitDance: Scaling Autoregressive Generative Models with Binary Tokens

arXiv 2026

Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

arXiv 2026

Exploring Spatial Intelligence from a Generative Perspective

arXiv 2026

Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

arXiv 2026

MARBLE: Multi-Aspect Reward Balance for Diffusion RL

arXiv 2026

$OneMillion-Bench: How Far are Language Agents from Human Experts?

arXiv 2026

Progressive Residual Warmup for Language Model Pretraining

arXiv 2026

TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

arXiv 2026

A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models

arXiv 2025

SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

arXiv 2025

Masked Autoencoders Are Effective Tokenizers for Diffusion Models

arXiv 2025

Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

arXiv 2025

Robust Latent Matters: Boosting Image Generation with Sampling Error

arXiv 2025

Unleashing Hour-Scale Video Training for Long Video-Language Understanding

arXiv 2025

POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D Reconstruction

ICCV 2025

ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation

arXiv 2025

ConceptCLIP: Towards Trustworthy Medical AI via Concept-Enhanced Contrastive Language-Image Pre-training

arXiv 2025

DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks

arXiv 2025

SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

arXiv 2025

ACTIVE-o3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning

arXiv 2025

UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation

arXiv 2025

RynnVLA-002: A Unified Vision-Language-Action and World Model

arXiv 2025

The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation

arXiv 2025

Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models

arXiv 2025

UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

arXiv 2025

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

arXiv 2025

WorldVLA: Towards Autoregressive Action World Model

arXiv 2025

Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

arXiv 2025

Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

arXiv 2025

GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

arXiv 2025

TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

arXiv 2025

MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book

arXiv 2025

Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning

arXiv 2025

Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization

arXiv 2025

Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology

arXiv 2025

Image Tokenizer Needs Post-Training

arXiv 2025

DIDS: Domain Impact-aware Data Sampling for Large Language Model Training

arXiv 2025

Adversarial Flow Models

arXiv 2025

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

arXiv 2025

Open-CD: A Comprehensive Toolbox for Change Detection

arXiv 2024

Large-Scale 3D Medical Image Pre-training with Geometric Context Priors

arXiv 2024

MambaMIL: Enhancing Long Sequence Modeling with Sequence Reordering in Computational Pathology

arXiv 2024

AgentReview: Exploring Peer Review Dynamics with LLM Agents

arXiv 2024

MedIAnomaly: A comparative study of anomaly detection in medical images

arXiv 2024

ControlVAR: Exploring Controllable Visual Autoregressive Modeling

arXiv 2024

Towards A Generalizable Pathology Foundation Model via Unified Knowledge Distillation

arXiv 2024

FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition

CVPR 2024

Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?

arXiv 2024

GSCo: Towards Generalizable AI in Medicine via Generalist-Specialist Collaboration

arXiv 2024

Prototypical Information Bottlenecking and Disentangling for Multimodal Cancer Survival Prediction

arXiv 2024

GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models

arXiv 2024

DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data

CVPR 2024

Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation

arXiv 2024

FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior

arXiv 2024

A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model

arXiv 2024

Aligning Medical Images with General Knowledge from Large Language Models

arXiv 2024

GPT as Psychologist? Preliminary Evaluations for GPT-4V on Visual Affective Computing

arXiv 2024

WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model

arXiv 2024

What Matters When Repurposing Diffusion Models for General Dense Perception Tasks?

arXiv 2024

LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior

Metric3Dv2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation

TableGPT2: A Large Multimodal Model with Tabular Data Integration

arXiv 2024

Modeling the Label Distributions for Weakly-Supervised Semantic Segmentation

arXiv 2024

$\text{R}^2$-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations

arXiv 2024

Knowledge-to-SQL: Enhancing SQL Generation with Data Expert LLM

arXiv 2024

GameGen-X: Interactive Open-world Game Video Generation

arXiv 2024

VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis

CVPR 2024

RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation based on Visual Foundation Model

arXiv 2023

A Survey on Evaluation of Large Language Models

arXiv 2023

CTVIS: Consistent Training for Online Video Instance Segmentation

ICCV 2023

Time Travelling Pixels: Bitemporal Features Integration with Foundation Model for Remote Sensing Image Change Detection

arXiv 2023

Multimodal Optimal Transport-based Co-Attention Transformer with Global Structure Consistency for Survival Prediction

ICCV 2023

PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation

arXiv 2023

HNeRV: A Hybrid Neural Representation for Videos

CVPR 2023

DoNet: Deep De-overlapping Network for Cytology Instance Segmentation

CVPR 2023

Diffusion Models for Imperceptible and Transferable Adversarial Attack

arXiv 2023

AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort

arXiv 2023

SegPrompt: Boosting Open-world Segmentation via Category-level Prompt Learning

ICCV 2023

Object-aware Inversion and Reassembly for Image Editing

arXiv 2023

Cross-Modal Translation and Alignment for Survival Analysis

ICCV 2023

LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning

arXiv 2023

Multi-view Self-supervised Disentanglement for General Image Denoising

ICCV 2023

Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online Reinforcement Learning

Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks

arXiv 2023

Better Zero-Shot Reasoning with Role-Play Prompting

arXiv 2023

Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching

arXiv 2023

PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization

arXiv 2023

Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models

arXiv 2023

Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey

arXiv 2023

PromptBench: A Unified Library for Evaluation of Large Language Models

arXiv 2023