Tao Zhang
- Papers: 37
Authored papers
SAMTok: Representing Any Mask with Two Words
arXiv 2026
SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training
arXiv 2026
SWE-World: Building Software Engineering Agents in Docker-Free Environments
arXiv 2026
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
arXiv 2025
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
arXiv 2025
Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
arXiv 2025
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
arXiv 2025
HunyuanImage 3.0 Technical Report
arXiv 2025
DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World
arXiv 2025
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
arXiv 2025
Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
arXiv 2025
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
arXiv 2025
Baichuan-Omni-1.5 Technical Report
arXiv 2025
On Path to Multimodal Generalist: General-Level and General-Bench
arXiv 2025
IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting
arXiv 2025
Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models
arXiv 2025
S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models
arXiv 2025
Native Hybrid Attention for Efficient Sequence Modeling
arXiv 2025
Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
arXiv 2025
UNIP: Rethinking Pre-trained Attention Patterns for Infrared Semantic Segmentation
arXiv 2025
Ocean-OCR: Towards General OCR Application via a Vision-Language Model
arXiv 2025
Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
ICCV 2025
An Empirical Study of GPT-4o Image Generation Capabilities
arXiv 2025
Wavelet Diffusion Neural Operator
arXiv 2024
CFBench: A Comprehensive Constraints-Following Benchmark for LLMs
arXiv 2024
Mamba or RWKV: Exploring High-Quality and High-Efficiency Segment Anything Model
arXiv 2024
TableGPT2: A Large Multimodal Model with Tabular Data Integration
arXiv 2024
Baichuan-Omni Technical Report
arXiv 2024
DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries
arXiv 2024
Generative Regression Based Watch Time Prediction for Short-Video Recommendation
arXiv 2024
Point Cloud Mamba: Point Cloud Learning via State Space Model
arXiv 2024
Compositional Generative Inverse Design
arXiv 2024
SysBench: Can Large Language Models Follow System Messages?
arXiv 2024
MuChin: A Chinese Colloquial Description Benchmark for Evaluating Language Models in the Field of Music
arXiv 2024
Baichuan 2: Open Large-scale Language Models
arXiv 2023
DVIS: Decoupled Video Instance Segmentation Framework
ICCV 2023
DVIS++: Improved Decoupled Framework for Universal Video Segmentation
arXiv 2023
Frequent co-authors
10 from 37 papers