LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and can activate relevant tools based on users' inputs to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and compositions. Empirical results show that LLaVA-Plus outperforms LLaVA in existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout the entire human-AI interaction sessions, significantly improving tool use performance and enabling new scenarios.
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
LLaVA-Plus, a general-purpose multimodal assistant, extends large multimodal models with a repository of pre-trained vision and vision-language models used as tools. It carries out tool-assisted tasks and improves interaction by grounding the image query directly.
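The skill-repository idea described above can be illustrated with a minimal sketch. This is not the LLaVA-Plus implementation: in the paper the model learns from instruction-following data when to invoke a tool, whereas here a plain keyword match stands in for that learned decision. All names (`Skill`, `SkillRepository`, `answer`, the two toy tools) are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Skill:
    """One entry in a hypothetical skill repository."""
    name: str
    keywords: List[str]             # crude trigger words (assumption, not learned)
    run: Callable[[str, str], str]  # (image_ref, instruction) -> tool output


class SkillRepository:
    """Holds registered skills and picks the ones relevant to an instruction."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def select(self, instruction: str) -> List[Skill]:
        # Stand-in for the model's learned tool selection: keyword matching.
        text = instruction.lower()
        return [s for s in self._skills.values()
                if any(k in text for k in s.keywords)]


def answer(repo: SkillRepository, image_ref: str, instruction: str) -> str:
    """Run every selected skill on the same image and join the outputs."""
    selected = repo.select(instruction)
    if not selected:
        return f"[no tool] direct answer about {image_ref}"
    # The image reference is passed to every tool, so it stays in play
    # for the whole turn rather than being consumed once.
    return "; ".join(f"{s.name}: {s.run(image_ref, instruction)}" for s in selected)


# Toy tools that only echo what a real detector or captioner would return.
repo = SkillRepository()
repo.register(Skill("detector", ["detect", "find"],
                    lambda img, q: f"boxes for {img}"))
repo.register(Skill("captioner", ["describe", "caption"],
                    lambda img, q: f"caption for {img}"))
```

Calling `answer(repo, "cat.png", "Find objects and describe the scene")` triggers both toy tools, which loosely mirrors the "compositions" case mentioned in the abstract.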
- Year: 2023
- Venue: arXiv 2023
- Authors: 13
- Hosting: Abstract only
Attribution
- Abstract & full text: arxiv.org/abs/2311.05437
- TL;DR: Semantic Scholar