Zhaoxiang Zhang
- Papers: 59
Authored papers (59)
MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
arXiv 2026
DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo
arXiv 2026
CodeTracer: Towards Traceable Agent States
arXiv 2026
NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
arXiv 2026
FeatureBench: Benchmarking Agentic Coding for Complex Feature Development
arXiv 2026
OProver: A Unified Framework for Agentic Formal Theorem Proving
arXiv 2026
AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark
arXiv 2026
GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction
arXiv 2026
YuE: Scaling Open Foundation Models for Long-Form Music Generation
arXiv 2025
A Comprehensive Survey on Long Context Language Modeling
arXiv 2025
LayerAnimate: Layer-specific Control for Animation
ICCV 2025
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
arXiv 2025
A Survey on Latent Reasoning
arXiv 2025
TC-Light: Temporally Coherent Generative Rendering for Realistic World Transfer
arXiv 2025
KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation
arXiv 2025
Steel-LLM: From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM
arXiv 2025
Practical Continual Forgetting for Pre-trained Vision Models
arXiv 2025
VGGT-X: When VGGT Meets Dense Novel View Synthesis
arXiv 2025
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
arXiv 2025
T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
arXiv 2025
VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
arXiv 2025
Uniform Discrete Diffusion with Metric Path for Video Generation
arXiv 2025
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
arXiv 2025
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
arXiv 2025
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
arXiv 2025
Unified Vision-Language-Action Model
arXiv 2025
CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization
arXiv 2025
IF-VidCap: Can Video Caption Models Follow Instructions?
arXiv 2025
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
arXiv 2025
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
arXiv 2025
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
arXiv 2025
Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond
arXiv 2024
CityGaussian: Real-time High-quality Large-Scale Scene Rendering with Gaussians
arXiv 2024
CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes
arXiv 2024
OmniBench: Towards The Future of Universal Omni-Language Models
arXiv 2024
Reconstructive Visual Instruction Tuning
arXiv 2024
OpenSatMap: A Fine-grained High-resolution Satellite Dataset for Large-scale Map Construction
arXiv 2024
Enhancing End-to-End Autonomous Driving with Latent World Model
arXiv 2024
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model
arXiv 2024
MixSup: Mixed-grained Supervision for Label-efficient LiDAR-based 3D Object Detection
arXiv 2024
Monocular Occupancy Prediction for Scalable Indoor Scenes
arXiv 2024
MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models
arXiv 2024
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models
arXiv 2024
MIO: A Foundation Model on Multimodal Tokens
arXiv 2024
FIRM: Flexible Interactive Reflection reMoval
arXiv 2024
Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
arXiv 2023
Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory
arXiv 2023
Blind Video Deflickering by Neural Filtering with a Flawed Atlas
CVPR 2023
RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models
arXiv 2023
Once Detected, Never Lost: Surpassing Human Performance in Offline LiDAR based 3D Object Detection
ICCV 2023
Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving
CVPR 2024
PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation
CVPR 2024
Bootstrap Masked Visual Modeling via Hard Patches Mining
arXiv 2023
FrustumFormer: Adaptive Instance-aware Resampling for Multi-view 3D Detection
CVPR 2023
DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions
Bootstrap Fine-Grained Vision-Language Alignment for Unified Zero-Shot Anomaly Localization
arXiv 2023
LMR: A Large-Scale Multi-Reference Dataset for Reference-based Super-Resolution
ICCV 2023
DDG-Net: Discriminability-Driven Graph Network for Weakly-supervised Temporal Action Localization
ICCV 2023
Pulling Target to Source: A New Perspective on Domain Adaptive Semantic Segmentation
arXiv 2023
Frequent co-authors
10 from 59 papers