Renrui Zhang
- Papers: 49
Authored papers (49)
Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models
arXiv 2026
Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation
arXiv 2026
VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification
arXiv 2026
PEARL: Personalized Streaming Video Understanding Model
arXiv 2026
GENIUS: Generative Fluid Intelligence Evaluation Suite
arXiv 2026
Seed1.5-VL Technical Report
arXiv 2025
MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
arXiv 2025
From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
ICCV 2025
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
arXiv 2025
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
arXiv 2025
TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving
arXiv 2025
VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging
arXiv 2025
IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models
arXiv 2025
DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
arXiv 2025
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
arXiv 2025
Generative Universal Verifier as Multimodal Meta-Reasoner
arXiv 2025
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
arXiv 2025
Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking
arXiv 2025
BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities
arXiv 2025
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
arXiv 2025
Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
arXiv 2025
LLaVA-OneVision: Easy Visual Task Transfer
arXiv 2024
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
arXiv 2024
Training-free Regional Prompting for Diffusion Transformers
arXiv 2024
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
arXiv 2024
MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine
arXiv 2024
MC-LLaVA: Multi-Concept Personalized Vision-Language Model
arXiv 2024
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
ICCV 2025
TerDiT: Ternary Diffusion Models with Transformers
arXiv 2024
SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models
arXiv 2024
Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models
arXiv 2024
SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners
arXiv 2024
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
arXiv 2024
ImageBind-LLM: Multi-modality Instruction Tuning
arXiv 2023
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
arXiv 2023
Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement
ICCV 2023
ViDA: Homeostatic Visual Domain Adapter for Continual Test Time Adaptation
arXiv 2023
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners
CVPR 2023
Gradient-based Parameter Selection for Efficient Fine-Tuning
CVPR 2024
PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation
arXiv 2023
Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking
arXiv 2023
Personalize Segment Anything Model with One Shot
arXiv 2023
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following
arXiv 2023
RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision
arXiv 2023
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
CVPR 2023
MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection
ICCV 2023
PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning
ICCV 2023
Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders
CVPR 2023
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
arXiv 2021
Frequent co-authors
10 from 49 papers