Jianwei Yang
- Papers: 36
Authored papers (36)
LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
arXiv 2025
Magma: A Foundation Model for Multimodal AI Agents
CVPR 2025
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
arXiv 2025
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
arXiv 2025
MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
arXiv 2025
OmniParser for Pure Vision Based GUI Agent
arXiv 2024
Efficient Modulation for Vision Networks
arXiv 2024
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
CVPR 2025
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
arXiv 2024
Matryoshka Multimodal Models
arXiv 2024
Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation
arXiv 2024
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
arXiv 2024
Pix2Gif: Motion-Guided Diffusion for GIF Generation
arXiv 2024
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
arXiv 2023
An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models
arXiv 2023
detrex: Benchmarking Detection Transformers
arXiv 2023
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
arXiv 2023
GLIGEN: Open-Set Grounded Text-to-Image Generation
CVPR 2023
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
arXiv 2023
A Simple Framework for Open-Vocabulary Segmentation and Detection
ICCV 2023
Semantic-SAM: Segment and Recognize Anything at Any Granularity
arXiv 2023
Multimodal Foundation Models: From Specialists to General-Purpose Assistants
arXiv 2023
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
arXiv 2023
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
arXiv 2023
VCoder: Versatile Vision Encoders for Multimodal Large Language Models
CVPR 2024
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
arXiv 2023
Visual In-Context Prompting
CVPR 2024
Interfacing Foundation Models' Embeddings
arXiv 2023
Focal Modulation Networks
arXiv 2022
Parameter-efficient Model Adaptation for Vision Transformers
arXiv 2022
Generalized Decoding for Pixel, Image, and Language
CVPR 2023
RegionCLIP: Region-based Language-Image Pretraining
CVPR 2022
Image Scene Graph Generation (SGG) Benchmark
arXiv 2021
VinVL: Revisiting Visual Representations in Vision-Language Models
CVPR 2021
Florence: A New Foundation Model for Computer Vision
arXiv 2021
Joint Unsupervised Learning of Deep Representations and Image Clusters