Bin Wang
- Papers: 78
Authored papers
MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
arXiv 2026
Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs
arXiv 2026
MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
arXiv 2026
InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery
arXiv 2026
Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision
arXiv 2026
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
arXiv 2026
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
arXiv 2026
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
arXiv 2026
MoDora: Tree-Based Semi-Structured Document Analysis System
arXiv 2026
Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis
arXiv 2025
TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization
arXiv 2025
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
arXiv 2025
TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
arXiv 2025
MM-ACT: Learn from Multimodal Parallel Generation to Act
arXiv 2025
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
arXiv 2025
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
arXiv 2025
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
arXiv 2025
MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization
arXiv 2025
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
arXiv 2025
InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation
arXiv 2025
Step-Audio 2 Technical Report
arXiv 2025
ROSE: Remove Objects with Side Effects in Videos
arXiv 2025
Efficient Multi-modal Large Language Models via Progressive Consistency Distillation
arXiv 2025
Lumen: Consistent Video Relighting and Harmonious Background Replacement with Video Generative Models
arXiv 2025
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
arXiv 2025
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
arXiv 2025
GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition
arXiv 2025
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
arXiv 2025
OmniTry: Virtual Try-On Anything without Masks
arXiv 2025
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
arXiv 2025
A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
arXiv 2025
FG-CLIP: Fine-Grained Visual and Textual Alignment
arXiv 2025
SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing
arXiv 2025
LEGION: Learning to Ground and Explain for Synthetic Image Detection
ICCV 2025
PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model
arXiv 2025
SSMLoRA: Enhancing Low-Rank Adaptation with State Space Model
arXiv 2025
DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM
arXiv 2025
Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models
arXiv 2025
Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching
CVPR 2025 1
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
arXiv 2024
IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities
arXiv 2024
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
arXiv 2024
A Cross Spatio-Temporal Pathology-based Lung Nodule Dataset
arXiv 2024
Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations
arXiv 2024
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
arXiv 2024
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
CVPR 2025 1
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
arXiv 2024
AudioBench: A Universal Benchmark for Audio Large Language Models
arXiv 2024
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
ICCV 2025
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
arXiv 2024
DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models
arXiv 2024
MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
arXiv 2024
SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM
arXiv 2024
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
arXiv 2024
ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback
arXiv 2024
A New Dataset and Framework for Real-World Blurred Images Super-Resolution
arXiv 2024
Segmentation-guided Layer-wise Image Vectorization with Gradient Fills
arXiv 2024
CapS-Adapter: Caption-based MultiModal Adapter in Zero-Shot Classification
arXiv 2024
A Comprehensive Evaluation of Quantization Strategies for Large Language Models
arXiv 2024
CRAFT: Extracting and Tuning Cultural Instructions from the Wild
arXiv 2024
CoinMath: Harnessing the Power of Coding Instruction for Math LLMs
arXiv 2024
CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment
arXiv 2024
WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models
arXiv 2023
VIGC: Visual Instruction Generation and Correction
arXiv 2023
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
CVPR 2024 1
Reconstructed Convolution Module Based Look-Up Tables for Efficient Image Super-Resolution
ICCV 2023 1
SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning
arXiv 2023
Few-Shot Physically-Aware Articulated Mesh Generation via Hierarchical Deformation
ICCV 2023 1
Performance-aware Approximation of Global Channel Pruning for Multitask CNNs
arXiv 2023
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
arXiv 2023
Parrot Captions Teach CLIP to Spot Text
arXiv 2023
MLLM-DataEngine: An Iterative Refinement Approach for MLLM
arXiv 2023
Instructive Dialogue Summarization with Query Aggregations
arXiv 2023
MISC: A MIxed Strategy-Aware Model Integrating COMET for Emotional Support Conversation
ACL 2022 5
Just Rank: Rethinking Evaluation with Word and Sentence Similarities
ACL 2022 5
C3KG: A Chinese Commonsense Conversation Knowledge Graph
arXiv 2022
MoralDial: A Framework to Train and Evaluate Moral Dialogue Systems via Moral Discussions
arXiv 2022
Improving Knowledge Graph Embedding Using Simple Constraints
Affiliations
Frequent co-authors
10 from 78 papers