Rongrong Ji
Authored papers (60)
Motion-Aware Caching for Efficient Autoregressive Video Generation
arXiv 2026
SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models
arXiv 2026
A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation
arXiv 2026
SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning
arXiv 2026
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
arXiv 2025
RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning
arXiv 2025
CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models
arXiv 2025
QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension
arXiv 2025
Zooming In on Fakes: A Novel Dataset for Localized AI-Generated Image Detection with Forgery Amplification Approach
arXiv 2025
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
arXiv 2025
Speculative Decoding Reimagined for Multimodal Large Language Models
arXiv 2025
Grounded Chain-of-Thought for Multimodal Large Language Models
arXiv 2025
Training Long-Context LLMs Efficiently via Chunk-wise Optimization
arXiv 2025
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
arXiv 2025
SVFR: A Unified Framework for Generalized Video Face Restoration
CVPR 2025
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
arXiv 2025
Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective
arXiv 2025
A Light and Tuning-free Method for Simulating Camera Motion in Video Generation
arXiv 2025
DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion
arXiv 2025
ComfyGPT: A Self-Optimizing Multi-Agent System for Comprehensive ComfyUI Workflow Generation
arXiv 2025
SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence
arXiv 2025
Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text
arXiv 2024
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
arXiv 2024
AccDiffusion: An Accurate Method for Higher-Resolution Image Generation
arXiv 2024
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
arXiv 2024
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model
arXiv 2024
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models
arXiv 2024
UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs
arXiv 2024
UniVST: A Unified Framework for Training-free Localized Video Style Transfer
arXiv 2024
GraCo: Granularity-Controllable Interactive Segmentation
CVPR 2024
FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression
arXiv 2024
TraDiffusion: Trajectory-Based Training-Free Image Generation
arXiv 2024
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
arXiv 2024
AccDiffusion v2: Towards More Accurate Higher-Resolution Diffusion Extrapolation
arXiv 2024
AffineQuant: Affine Transformation Quantization for Large Language Models
arXiv 2024
Multi-branch Collaborative Learning Network for 3D Visual Grounding
arXiv 2024
Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation
arXiv 2024
DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
CVPR 2024
ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion
arXiv 2024
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings
arXiv 2024
Aligning and Prompting Everything All at Once for Universal Visual Perception
arXiv 2023
You Only Segment Once: Towards Real-Time Panoptic Segmentation
CVPR 2023
Bi-directional Masks for Efficient N:M Sparse Training
arXiv 2023
I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of Post-Training ViTs Quantization
arXiv 2023
DiffRate: Differentiable Compression Rate for Efficient Vision Transformers
ICCV 2023
JM3D & JM3D-LLM: Elevating 3D Understanding with Joint Multi-modal Cues
arXiv 2023
Pseudo-label Alignment for Semi-supervised Instance Segmentation
ICCV 2023
X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation
arXiv 2023
InterFormer: Real-time Interactive Image Segmentation
ICCV 2023
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs
arXiv 2023
AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration
ICCV 2023
Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective
CVPR 2023
X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic Textual Guidance
ICCV 2023
MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples
arXiv 2023
Discriminator-Cooperated Feature Map Distillation for GAN Compression
CVPR 2023
Exploring Target Representations for Masked Autoencoders
arXiv 2022
SMMix: Self-Motivated Image Mixing for Vision Transformers
ICCV 2023
Lottery Jackpots Exist in Pre-trained Models
arXiv 2021
OMPQ: Orthogonal Mixed Precision Quantization
arXiv 2021
Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion
arXiv 2020
Frequent co-authors
10 from 60 papers