Haiyang Xu
- Papers: 22
Authored papers (22)
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
arXiv 2026
AgentOCR: Reimagining Agent History via Optical Self-Compression
arXiv 2026
Qwen2.5-VL Technical Report
arXiv 2025
Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation
arXiv 2025
Qwen3-VL Technical Report
arXiv 2025
Science-T2I: Addressing Scientific Illusions in Image Synthesis
CVPR 2025
Megrez-Omni Technical Report
arXiv 2025
Mobile-Agent-v3: Fundamental Agents for GUI Automation
arXiv 2025
UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning
arXiv 2025
VideoNSA: Native Sparse Attention Scales Video Understanding
arXiv 2025
Perception-Aware Policy Optimization for Multimodal Reasoning
arXiv 2025
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
arXiv 2024
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
arXiv 2024
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
CVPR 2025
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
arXiv 2023
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
arXiv 2023
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
arXiv 2023
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
CVPR 2024
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks
arXiv 2023
UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model
arXiv 2023
Evaluation and Analysis of Hallucination in Large Vision-Language Models
arXiv 2023
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
arXiv 2022
Frequent co-authors
10 from 22 papers