Peng Jin
- Papers: 20
Authored papers
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
arXiv 2025
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
arXiv 2025
MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation
arXiv 2025
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
arXiv 2025
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
arXiv 2024
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
arXiv 2024
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
arXiv 2024
MoH: Multi-Head Attention as Mixture-of-Head Attention
arXiv 2024
MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts
arXiv 2024
Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection
arXiv 2024
Next Patch Prediction for Autoregressive Visual Generation
arXiv 2024
Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation
arXiv 2024
LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
arXiv 2024
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
ICCV 2023
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
CVPR 2023
Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable 2D Repainting
arXiv 2023
FreestyleRet: Retrieving Images from Style-Diversified Queries
arXiv 2023
Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment
arXiv 2023
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
CVPR 2024
Frequent co-authors
10 from 20 papers