Peng Jin

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

arXiv 2025

Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

arXiv 2025

MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation

arXiv 2025

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

arXiv 2024

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

arXiv 2024

MoH: Multi-Head Attention as Mixture-of-Head Attention

arXiv 2024

MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts

arXiv 2024

Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection

arXiv 2024

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

arXiv 2024

Next Patch Prediction for Autoregressive Visual Generation

arXiv 2024

LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

arXiv 2024

Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation

arXiv 2024

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable 2D Repainting

arXiv 2023

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

CVPR 2024

DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

ICCV 2023

Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

CVPR 2023

FreestyleRet: Retrieving Images from Style-Diversified Queries

arXiv 2023

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment

arXiv 2023