Zhe Gan
- Papers: 34
Authored papers
Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants
arXiv 2026
Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling
arXiv 2026
GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing
arXiv 2025
Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
arXiv 2025
PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection
arXiv 2025
Multimodal Autoregressive Pre-training of Large Vision Encoders
CVPR 2025
Improve Vision Language Model Chain-of-thought Reasoning
arXiv 2024
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
arXiv 2024
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts
arXiv 2024
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
arXiv 2024
Ferret: Refer and Ground Anything Anywhere at Any Granularity
arXiv 2023
Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?
arXiv 2023
Guiding Instruction-based Image Editing via Multimodal Large Language Models
arXiv 2023
Multimodal Foundation Models: From Specialists to General-Purpose Assistants
arXiv 2023
VeCLIP: Improving CLIP Training via Visual-enriched Captions
arXiv 2023
MOFI: Learning Image Representations from Noisy Entity Annotated Images
arXiv 2023
Compressing LLMs: The Truth is Rarely Pure and Never Simple
arXiv 2023
Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation
arXiv 2023
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Exploring Discrete Diffusion Models for Image Captioning
arXiv 2022
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
CVPR 2023
GIT: A Generative Image-to-text Transformer for Vision and Language
arXiv 2022
NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis
arXiv 2022
Generalized Decoding for Pixel, Image, and Language
CVPR 2023
GRiT: A Generative Region-to-text Transformer for Object Understanding
arXiv 2022
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
CVPR 2023
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
CVPR 2022
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
arXiv 2021
VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling
arXiv 2021
VIOLIN: A Large-Scale Dataset for Video-and-Language Inference
Graph Optimal Transport for Cross-Domain Alignment
ICML 2020
POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training
EMNLP 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
EMNLP 2020
UNITER: UNiversal Image-TExt Representation Learning
ECCV 2020