Zhe Gan

Papers
34

Authored papers

34

Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants

arXiv 2026

2026

Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling

arXiv 2026

2026

GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

arXiv 2025

2025

Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing

arXiv 2025

2025

PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection

arXiv 2025

2025

Multimodal Autoregressive Pre-training of Large Vision Encoders

CVPR 2025 1

2024

Improve Vision Language Model Chain-of-thought Reasoning

arXiv 2024

2024

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

arXiv 2024

2024

How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts

arXiv 2024

2024

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

arXiv 2024

2024

Ferret: Refer and Ground Anything Anywhere at Any Granularity

arXiv 2023

2023

Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?

arXiv 2023

2023

Guiding Instruction-based Image Editing via Multimodal Large Language Models

arXiv 2023

2023

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

arXiv 2023

2023

VeCLIP: Improving CLIP Training via Visual-enriched Captions

arXiv 2023

2023

MOFI: Learning Image Representations from Noisy Entity Annotated Images

arXiv 2023

2023

Compressing LLMs: The Truth is Rarely Pure and Never Simple

arXiv 2023

2023

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation

arXiv 2023

2023

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

2022

Exploring Discrete Diffusion Models for Image Captioning

arXiv 2022

2022

An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

CVPR 2023 1

2022

GIT: A Generative Image-to-text Transformer for Vision and Language

arXiv 2022

2022

NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

arXiv 2022

2022

Generalized Decoding for Pixel, Image, and Language

CVPR 2023 1

2022

GRiT: A Generative Region-to-text Transformer for Object Understanding

arXiv 2022

2022

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

CVPR 2023 1

2022

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

CVPR 2022 1

2021

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

arXiv 2021

2021

VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling

arXiv 2021

2021

VIOLIN: A Large-Scale Dataset for Video-and-Language Inference

2020

Graph Optimal Transport for Cross-Domain Alignment

ICML 2020 1

2020

POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training

EMNLP 2020 11

2020

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

EMNLP 2020 11

2020

UNITER: UNiversal Image-TExt Representation Learning

ECCV 2020 8

2019

Affiliations

No known affiliations.

Frequent co-authors

10

from 34 papers