Renrui Zhang

Papers
49

Attribution

Affiliations & profile
Semantic Scholar

Authored papers

49

Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

arXiv 2026

2026

Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation

arXiv 2026

2026

VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

arXiv 2026

2026

PEARL: Personalized Streaming Video Understanding Model

arXiv 2026

2026

GENIUS: Generative Fluid Intelligence Evaluation Suite

arXiv 2026

2026

Seed1.5-VL Technical Report

arXiv 2025

2025

MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

arXiv 2025

2025

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

ICCV 2025

2025

T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

arXiv 2025

2025

MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs

arXiv 2025

2025

TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving

arXiv 2025

2025

VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging

arXiv 2025

2025

IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models

arXiv 2025

2025

DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

arXiv 2025

2025

Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following

arXiv 2025

2025

Generative Universal Verifier as Multimodal Meta-Reasoner

arXiv 2025

2025

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

arXiv 2025

2025

Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking

arXiv 2025

2025

BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities

arXiv 2025

2025

Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

arXiv 2025

2025

Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

arXiv 2025

2025

LLaVA-OneVision: Easy Visual Task Transfer

arXiv 2024

2024

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

arXiv 2024

2024

Training-free Regional Prompting for Diffusion Transformers

arXiv 2024

2024

CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

arXiv 2024

2024

MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine

arXiv 2024

2024

MC-LLaVA: Multi-Concept Personalized Vision-Language Model

arXiv 2024

2024

Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

ICCV 2025

2024

TerDiT: Ternary Diffusion Models with Transformers

arXiv 2024

2024

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models

arXiv 2024

2024

Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models

arXiv 2024

2024

SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners

arXiv 2024

2024

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

arXiv 2024

2024

ImageBind-LLM: Multi-modality Instruction Tuning

arXiv 2023

2023

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

arXiv 2023

2023

Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement

ICCV 2023 1

2023

ViDA: Homeostatic Visual Domain Adapter for Continual Test Time Adaptation

arXiv 2023

2023

Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners

CVPR 2023 1

2023

Gradient-based Parameter Selection for Efficient Fine-Tuning

CVPR 2024 1

2023

PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation

arXiv 2023

2023

Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking

arXiv 2023

2023

Personalize Segment Anything Model with One Shot

arXiv 2023

2023

Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following

arXiv 2023

2023

RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision

arXiv 2023

2023

EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding

CVPR 2023 1

2022

MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection

ICCV 2023 1

2022

PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning

ICCV 2023 1

2022

Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders

CVPR 2023 1

2022

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

arXiv 2021

2021

Affiliations

No known affiliations.

Frequent co-authors

10

from 49 papers