Chunyuan Li
- Papers: 41
Authored papers
LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
arXiv 2026
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence
arXiv 2026
LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
arXiv 2025
Seed1.5-VL Technical Report
arXiv 2025
Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning
arXiv 2025
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
arXiv 2025
LLaVA-OneVision: Easy Visual Task Transfer
arXiv 2024
Long Context Transfer from Language to Vision
arXiv 2024
SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners
arXiv 2024
TrustLLM: Trustworthiness in Large Language Models
arXiv 2024
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
arXiv 2024
MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine
arXiv 2024
Graphic Design with Large Multimodal Model
arXiv 2024
Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
arXiv 2024
Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation
arXiv 2024
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
arXiv 2024
An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models
arXiv 2023
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
arXiv 2023
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
arXiv 2023
A Simple Framework for Open-Vocabulary Segmentation and Detection
ICCV 2023 1
Visual In-Context Prompting
CVPR 2024 1
Instruction Tuning with GPT-4
arXiv 2023
Semantic-SAM: Segment and Recognize Anything at Any Granularity
arXiv 2023
Multimodal Foundation Models: From Specialists to General-Purpose Assistants
arXiv 2023
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
arXiv 2023
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
arXiv 2023
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
arXiv 2023
GLIGEN: Open-Set Grounded Text-to-Image Generation
CVPR 2023 1
Towards Building the Federated GPT: Federated Instruction Tuning
arXiv 2023
HallE-Control: Controlling Object Hallucination in Large Multimodal Models
arXiv 2023
Parameter-efficient Model Adaptation for Vision Transformers
arXiv 2022
Generalized Decoding for Pixel, Image, and Language
CVPR 2023 1
Focal Modulation Networks
arXiv 2022
RegionCLIP: Region-based Language-Image Pretraining
CVPR 2022 1
LAFITE: Towards Language-Free Training for Text-to-Image Generation
arXiv 2021
Florence: A New Foundation Model for Computer Vision
arXiv 2021
Contrastive Attraction and Contrastive Repulsion for Representation Learning
Few-shot Natural Language Generation for Task-Oriented Dialog
Findings of the Association for Computational Linguistics 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
ECCV 2020 8
POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training
EMNLP 2020 11
Measuring the Intrinsic Dimension of Objective Landscapes
Affiliations
Frequent co-authors
10 from 41 papers