Kaipeng Zhang
- Papers: 51
Authored papers (51)
Generative World Renderer
arXiv 2026
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
arXiv 2026
LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces
arXiv 2026
MeepleLM: A Virtual Playtester Simulating Diverse Subjective Experiences
arXiv 2026
World Craft: Agentic Framework to Create Visualizable Worlds via Text
arXiv 2026
WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG
arXiv 2026
PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference
arXiv 2026
PyVision-RL: Forging Open Agentic Vision Models via RL
arXiv 2026
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
arXiv 2025
Sekai: A Video Dataset towards World Exploration
arXiv 2025
Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning
arXiv 2025
Enhance-A-Video: Better Generated Video for Free
arXiv 2025
Yume-1.5: A Text-Controlled Interactive World Generation Model
arXiv 2025
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
arXiv 2025
Neural-Driven Image Editing
arXiv 2025
OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
arXiv 2025
PyVision: Agentic Vision with Dynamic Tooling
arXiv 2025
TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning
arXiv 2025
Symbolic Graphics Programming with Large Language Models
arXiv 2025
REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training
arXiv 2025
InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles
arXiv 2025
SVBench: Evaluation of Video Generation Models on Social Reasoning
arXiv 2025
Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction
arXiv 2025
ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy
arXiv 2025
LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
arXiv 2025
Neighboring Autoregressive Modeling for Efficient Visual Generation
ICCV 2025
MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models
ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
ICCV 2025
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
arXiv 2024
Needle In A Multimodal Haystack
arXiv 2024
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
arXiv 2024
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
arXiv 2024
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
CVPR 2025
T3M: Text Guided 3D Human Motion Synthesis from Speech
arXiv 2024
DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
CVPR 2024
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
arXiv 2024
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping
arXiv 2024
Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality
arXiv 2024
ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning
arXiv 2024
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model
arXiv 2024
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
arXiv 2024
ZipAR: Accelerating Auto-regressive Image Generation through Spatial Locality
arXiv 2024
Adapting LLaMA Decoder to Vision Transformer
arXiv 2024
HRVMamba: High-Resolution Visual State Space Model for Dense Prediction
arXiv 2024
ImageBind-LLM: Multi-modality Instruction Tuning
arXiv 2023
Meta-Transformer: A Unified Framework for Multimodal Learning
arXiv 2023
Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching
arXiv 2023
MLLMs-Augmented Visual-Language Representation Learning
arXiv 2023
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
arXiv 2023
DiffRate: Differentiable Compression Rate for Efficient Vision Transformers
ICCV 2023
OneLLM: One Framework to Align All Modalities with Language
CVPR 2024
Affiliations
Frequent co-authors
10 from 51 papers