Zicheng Liu

Papers
42

Authored papers

DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference

arXiv 2026

2026

TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents

arXiv 2026

2026

Stabilizing Efficient Reasoning with Step-Level Advantage Selection

arXiv 2026

2026

Masked Autoencoders Are Effective Tokenizers for Diffusion Models

arXiv 2025

2025

Unleashing Hour-Scale Video Training for Long Video-Language Understanding

arXiv 2025

2025

Instella: Fully Open Language Models with Stellar Performance

arXiv 2025

2025

Where LLM Agents Fail and How They can Learn From Failures

arXiv 2025

2025

Directional Reasoning Injection for Fine-Tuning MLLMs

arXiv 2025

2025

MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

CVPR 2025 1

2025

Taming LLMs by Scaling Learning Rates with Gradient Grouping

arXiv 2025

2025

Self-Taught Agentic Long Context Understanding

arXiv 2025

2025

CaptionQA: Is Your Caption as Useful as the Image Itself?

arXiv 2025

2025

Switch EMA: A Free Lunch for Better Flatness and Sharpness

arXiv 2024

2024

Peer Review as A Multi-Turn and Long-Context Dialogue with Role-Based Interactions

arXiv 2024

2024

Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning

arXiv 2024

2024

B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens

arXiv 2024

2024

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities

arXiv 2024

2024

A Survey on Mixup Augmentations and Beyond

arXiv 2024

2024

IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation

arXiv 2024

2024

OpenSTL: A Comprehensive Benchmark of Spatio-Temporal Predictive Learning

2023

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

arXiv 2023

2023

ORES: Open-vocabulary Responsible Visual Synthesis

arXiv 2023

2023

DisCo: Disentangled Control for Realistic Human Dance Generation

CVPR 2024 1

2023

Segment and Caption Anything

CVPR 2024 1

2023

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

arXiv 2023

2023

Equivariant Similarity for Vision-Language Foundation Models

ICCV 2023 1

2023

SemiReward: A General Reward Model for Semi-supervised Learning

arXiv 2023

2023

Adaptive Human Matting for Dynamic Videos

CVPR 2023 1

2023

RDesign: Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design

arXiv 2023

2023

An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

CVPR 2023 1

2022

NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

arXiv 2022

2022

GRiT: A Generative Region-to-text Transformer for Object Understanding

arXiv 2022

2022

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

2022

Exploring Discrete Diffusion Models for Image Captioning

arXiv 2022

2022

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

CVPR 2023 1

2022

GIT: A Generative Image-to-text Transformer for Vision and Language

arXiv 2022

2022

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

CVPR 2022 1

2021

Florence: A New Foundation Model for Computer Vision

arXiv 2021

2021

End-to-End Semi-Supervised Object Detection with Soft Teacher

ICCV 2021 10

2021

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

arXiv 2021

2021

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

arXiv 2021

2021

Cross-Domain Complementary Learning Using Pose for Multi-Person Part Segmentation

arXiv 2019

2019

Affiliations

No known affiliations.

Frequent co-authors

10

from 42 papers