Zicheng Liu
- Papers: 42
Authored papers
DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
arXiv 2026
TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents
arXiv 2026
Stabilizing Efficient Reasoning with Step-Level Advantage Selection
arXiv 2026
Masked Autoencoders Are Effective Tokenizers for Diffusion Models
arXiv 2025
Unleashing Hour-Scale Video Training for Long Video-Language Understanding
arXiv 2025
Instella: Fully Open Language Models with Stellar Performance
arXiv 2025
Where LLM Agents Fail and How They can Learn From Failures
arXiv 2025
Directional Reasoning Injection for Fine-Tuning MLLMs
arXiv 2025
MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
CVPR 2025 1
Taming LLMs by Scaling Learning Rates with Gradient Grouping
arXiv 2025
Self-Taught Agentic Long Context Understanding
arXiv 2025
CaptionQA: Is Your Caption as Useful as the Image Itself?
arXiv 2025
Switch EMA: A Free Lunch for Better Flatness and Sharpness
arXiv 2024
Peer Review as A Multi-Turn and Long-Context Dialogue with Role-Based Interactions
arXiv 2024
Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning
arXiv 2024
B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens
arXiv 2024
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
arXiv 2024
A Survey on Mixup Augmentations and Beyond
arXiv 2024
IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation
arXiv 2024
OpenSTL: A Comprehensive Benchmark of Spatio-Temporal Predictive Learning
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
arXiv 2023
ORES: Open-vocabulary Responsible Visual Synthesis
arXiv 2023
DisCo: Disentangled Control for Realistic Human Dance Generation
CVPR 2024 1
Segment and Caption Anything
CVPR 2024 1
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
arXiv 2023
Equivariant Similarity for Vision-Language Foundation Models
ICCV 2023 1
SemiReward: A General Reward Model for Semi-supervised Learning
arXiv 2023
Adaptive Human Matting for Dynamic Videos
CVPR 2023 1
RDesign: Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design
arXiv 2023
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
CVPR 2023 1
NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis
arXiv 2022
GRiT: A Generative Region-to-text Transformer for Object Understanding
arXiv 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Exploring Discrete Diffusion Models for Image Captioning
arXiv 2022
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
CVPR 2023 1
GIT: A Generative Image-to-text Transformer for Vision and Language
arXiv 2022
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
CVPR 2022 1
Florence: A New Foundation Model for Computer Vision
arXiv 2021
End-to-End Semi-Supervised Object Detection with Soft Teacher
ICCV 2021 10
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
arXiv 2021
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
arXiv 2021
Cross-Domain Complementary Learning Using Pose for Multi-Person Part Segmentation
arXiv 2019
Affiliations
Frequent co-authors
10 from 42 papers