Xiang Bai
- Papers: 54
Authored papers (54)
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
arXiv 2026
Multimodal OCR: Parse Anything from Documents
arXiv 2026
MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios
arXiv 2026
Towards Generalizable Robotic Manipulation in Dynamic Environments
arXiv 2026
Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously
arXiv 2026
HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
arXiv 2026
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
arXiv 2026
TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
arXiv 2026
Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
arXiv 2026
MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm
arXiv 2025
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
ICCV 2025
OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models
arXiv 2025
HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
ICCV 2025
Seeing the Future, Perceiving the Future: A Unified Driving World Model for Future Generation and Perception
arXiv 2025
Visual Text Processing: A Comprehensive Review and Unified Evaluation
arXiv 2025
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving
arXiv 2025
SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting
DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
arXiv 2025
Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution
arXiv 2025
TokBench: Evaluating Your Visual Tokenizer before Visual Generation
arXiv 2025
LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
ICCV 2025
LION: Linear Group RNN for 3D Object Detection in Point Clouds
arXiv 2024
MINIMA: Modality Invariant Image Matching
CVPR 2025 1
LLaVA-KD: A Framework of Distilling Multimodal Large Language Models
arXiv 2024
Parameter-Efficient Fine-Tuning in Spectral Domain for Point Cloud Learning
arXiv 2024
Liquid: Language Models are Scalable Multi-modal Generators
arXiv 2024
Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid
arXiv 2024
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
arXiv 2024
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
arXiv 2024
SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer
arXiv 2024
OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection
arXiv 2024
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
arXiv 2024
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
arXiv 2024
WAS: Dataset and Methods for Artistic Text Segmentation
arXiv 2024
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
arXiv 2024
R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models
arXiv 2024
Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression
arXiv 2024
Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition
arXiv 2024
SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting
arXiv 2024
DISC-FinLLM: A Chinese Financial Large Language Model based on Multiple Experts Fine-tuning
arXiv 2023
SparseTrack: Multi-Object Tracking by Performing Scene Decomposition based on Pseudo-Depth
arXiv 2023
SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model
arXiv 2023
General Object Foundation Model for Images and Videos at Scale
CVPR 2024 1
Side Adapter Network for Open-Vocabulary Semantic Segmentation
CVPR 2023 1
ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer
ICCV 2023 1
Toward Real Text Manipulation Detection: New Dataset and New Solution
arXiv 2023
Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion
arXiv 2022
CCPL: Contrastive Coherence Preserving Loss for Versatile Style Transfer
arXiv 2022
Syntax-Aware Network for Handwritten Mathematical Expression Recognition
CVPR 2022 1
An Empirical Study of End-to-End Temporal Action Detection
CVPR 2022 1
When Counting Meets HMER: Counting-Aware Network for Handwritten Mathematical Expression Recognition
arXiv 2022
Knowledge Mining with Scene Text for Fine-Grained Recognition
CVPR 2022 1
End-to-End Semi-Supervised Object Detection with Soft Teacher
ICCV 2021 10
MASTER: Multi-Aspect Non-local Network for Scene Text Recognition
arXiv 2019
Frequent co-authors
10 from 54 papers