Juncheng Li

Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder

arXiv 2025

Expanding the Action Space of LLMs to Reason Beyond Language

arXiv 2025

MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models

arXiv 2025

BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models

arXiv 2025

WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation

arXiv 2025

The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

ICCV 2025

HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization

arXiv 2025

On Path to Multimodal Generalist: General-Level and General-Bench

arXiv 2025

WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing

arXiv 2025

Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation

arXiv 2024

Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning

arXiv 2024

HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing

arXiv 2024

HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

arXiv 2024

Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

arXiv 2023

Self-supervised Meta-Prompt Learning with Meta-Gradient Regularization for Few-shot Generalization

arXiv 2023

HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data

CVPR 2024 1

Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World

ICCV 2023 1