Shilong Liu

On Path to Multimodal Historical Reasoning: HistBench and HistAgent

arXiv 2025

Web World Models

arXiv 2025

A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence

arXiv 2025

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

arXiv 2024

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

arXiv 2024

Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection

arXiv 2024

T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy

arXiv 2024

CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

arXiv 2024

MMedAgent: Learning to Use Medical Tools with Multi-modal Agent

arXiv 2024

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

arXiv 2023

detrex: Benchmarking Detection Transformers

arXiv 2023

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

arXiv 2023

A Simple Framework for Open-Vocabulary Segmentation and Detection

ICCV 2023

Detection Transformer with Stable Matching

ICCV 2023

Visual In-Context Prompting

CVPR 2024

Recognize Anything: A Strong Image Tagging Model

arXiv 2023

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

arXiv 2023

Interfacing Foundation Models' Embeddings

arXiv 2023

Semantic-SAM: Segment and Recognize Anything at Any Granularity

arXiv 2023

Neural Interactive Keypoint Detection

ICCV 2023

InstructPix2NeRF: Instructed 3D Portrait Editing from a Single Image

arXiv 2023