Jiaqi Wang
- Papers: 79
Authored papers
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
arXiv 2026
ETCHR: Editing To Clarify and Harness Reasoning
arXiv 2026
Channel-wise Vector Quantization
arXiv 2026
TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
arXiv 2026
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
arXiv 2026
EasyVideoR1: Easier RL for Video Understanding
arXiv 2026
Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition
arXiv 2026
UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing
arXiv 2026
DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
arXiv 2026
DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders
arXiv 2026
Unified Personalized Reward Model for Vision Generation
arXiv 2026
Visual-ERM: Reward Modeling for Visual Equivalence
arXiv 2026
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
arXiv 2025
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
arXiv 2025
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
arXiv 2025
Visual Agentic Reinforcement Fine-Tuning
arXiv 2025
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
arXiv 2025
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
CVPR 2025
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
arXiv 2025
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
arXiv 2025
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
arXiv 2025
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
arXiv 2025
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
CVPR 2025
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
arXiv 2025
RelightVid: Temporal-Consistent Diffusion Model for Video Relighting
arXiv 2025
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
arXiv 2025
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
arXiv 2025
Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models
arXiv 2025
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
arXiv 2025
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
arXiv 2025
EtCon: Edit-then-Consolidate for Reliable Knowledge Editing
arXiv 2025
UniREditBench: A Unified Reasoning-based Image Editing Benchmark
arXiv 2025
G^2RPO: Granular GRPO for Precise Reward in Flow Models
arXiv 2025
SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction
arXiv 2025
UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
arXiv 2025
SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
arXiv 2025
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
arXiv 2025
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
arXiv 2025
Enhancing Monocular 3D Scene Completion with Diffusion Model
arXiv 2025
CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
arXiv 2025
RLFR: Extending Reinforcement Learning for LLMs with Flow Environment
arXiv 2025
SS4D: Native 4D Generative Model via Structured Spacetime Latents
arXiv 2025
Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?
arXiv 2025
Think Visually, Reason Textually: Vision-Language Synergy in ARC
arXiv 2025
SPARK: Synergistic Policy And Reward Co-Evolving Framework
arXiv 2025
STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
arXiv 2025
GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization
arXiv 2025
MM-IFEngine: Towards Multimodal Instruction Following
arXiv 2025
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
arXiv 2025
DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
arXiv 2025
Long-CLIP: Unlocking the Long-Text Capability of CLIP
arXiv 2024
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
ICCV 2025
DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models
arXiv 2024
CRAG -- Comprehensive RAG Benchmark
arXiv 2024
Are We on the Right Way for Evaluating Large Vision-Language Models?
arXiv 2024
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
arXiv 2024
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
arXiv 2024
IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations
arXiv 2024
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
arXiv 2024
SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition
arXiv 2024
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
arXiv 2024
MotionClone: Training-Free Motion Cloning for Controllable Video Generation
arXiv 2024
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
arXiv 2024
TechGPT-2.0: A large language model project to solve the task of knowledge graph construction
arXiv 2024
Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought
arXiv 2024
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
arXiv 2024
1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit
arXiv 2024
SS-GEN: A Social Story Generation Framework with Large Language Models
arXiv 2024
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
arXiv 2023
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
arXiv 2023
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
CVPR 2024
WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models
arXiv 2023
VIGC: Visual Instruction Generation and Correction
arXiv 2023
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
CVPR 2024
MLLM-DataEngine: An Iterative Refinement Approach for MLLM
arXiv 2023
OneLLM: One Framework to Align All Modalities with Language
CVPR 2024
Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases
arXiv 2023
UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers
arXiv 2023
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
arXiv 2023
Frequent co-authors
10 from 79 papers