Jianwei Yang

Papers
36

Authored papers

36

LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

arXiv 2025

2025

Magma: A Foundation Model for Multimodal AI Agents

CVPR 2025

2025

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

arXiv 2025

2025

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

arXiv 2025

2025

MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

arXiv 2025

2025

OmniParser for Pure Vision Based GUI Agent

arXiv 2024

2024

Efficient Modulation for Vision Networks

arXiv 2024

2024

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

CVPR 2025

2024

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

arXiv 2024

2024

Matryoshka Multimodal Models

arXiv 2024

2024

Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation

arXiv 2024

2024

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

arXiv 2024

2024

Pix2Gif: Motion-Guided Diffusion for GIF Generation

arXiv 2024

2024

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

arXiv 2023

2023

An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

arXiv 2023

2023

detrex: Benchmarking Detection Transformers

arXiv 2023

2023

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

arXiv 2023

2023

GLIGEN: Open-Set Grounded Text-to-Image Generation

CVPR 2023

2023

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

arXiv 2023

2023

A Simple Framework for Open-Vocabulary Segmentation and Detection

ICCV 2023

2023

Semantic-SAM: Segment and Recognize Anything at Any Granularity

arXiv 2023

2023

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

arXiv 2023

2023

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

arXiv 2023

2023

LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

arXiv 2023

2023

VCoder: Versatile Vision Encoders for Multimodal Large Language Models

CVPR 2024

2023

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

arXiv 2023

2023

Visual In-Context Prompting

CVPR 2024

2023

Interfacing Foundation Models' Embeddings

arXiv 2023

2023

Focal Modulation Networks

arXiv 2022

2022

Parameter-efficient Model Adaptation for Vision Transformers

arXiv 2022

2022

Generalized Decoding for Pixel, Image, and Language

CVPR 2023

2022

RegionCLIP: Region-based Language-Image Pretraining

CVPR 2022

2021

Image Scene Graph Generation (SGG) Benchmark

arXiv 2021

2021

VinVL: Revisiting Visual Representations in Vision-Language Models

CVPR 2021

2021

Florence: A New Foundation Model for Computer Vision

arXiv 2021

2021

Joint Unsupervised Learning of Deep Representations and Image Clusters

CVPR 2016

2016

Affiliations

No known affiliations.

Frequent co-authors

10

from 36 papers