Zilong Huang

Papers
23

Authored papers

Let ViT Speak: Generative Language-Image Pre-training

arXiv 2026

2026

EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

arXiv 2026

2026

Mixture-of-Depths Attention

arXiv 2026

2026

Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation

arXiv 2026

2026

Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding

arXiv 2025

2025

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

arXiv 2025

2025

Seed1.5-VL Technical Report

arXiv 2025

2025

The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

ICCV 2025

2025

GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

arXiv 2025

2025

GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

ICCV 2025

2025

RealGen: Photorealistic Text-to-Image Generation via Detector-Guided Rewards

arXiv 2025

2025

ThinkGen: Generalized Thinking for Visual Generation

arXiv 2025

2025

MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts

arXiv 2025

2025

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

arXiv 2025

2025

DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World

arXiv 2025

2025

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

arXiv 2025

2025

Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation

arXiv 2025

2025

Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immerse Exploration

CVPR 2025

2025

Depth Anything V2

arXiv 2024

2024

Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

CVPR 2024

2024

Classification Done Right for Vision-Language Pre-Training

arXiv 2024

2024

LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models

arXiv 2024

2024

CCNet: Criss-Cross Attention for Semantic Segmentation

2018

Affiliations

No known affiliations.

Frequent co-authors

10

from 23 papers