Yu Zhou
- Papers: 29
Authored papers (29)
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
arXiv 2026
Semantic Audio-Visual Navigation in Continuous Environments
arXiv 2026
STEP3-VL-10B Technical Report
arXiv 2026
Visual Text Processing: A Comprehensive Review and Unified Evaluation
arXiv 2025
When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding
arXiv 2025
UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
arXiv 2025
HunyuanOCR Technical Report
arXiv 2025
ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
arXiv 2025
SAM 3: Segment Anything with Concepts
arXiv 2025
MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples
arXiv 2025
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
arXiv 2025
Step-Audio 2 Technical Report
arXiv 2025
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
arXiv 2025
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
arXiv 2025
Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
arXiv 2025
VidText: Towards Comprehensive Evaluation for Video Text Understanding
arXiv 2025
DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation
arXiv 2025
MuSc: Zero-Shot Industrial Anomaly Classification and Segmentation with Mutual Scoring of the Unlabeled Images
arXiv 2024
TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control
arXiv 2024
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making
arXiv 2024
SeaS: Few-shot Industrial Anomaly Image Generation with Separation and Sharing Fine-tuning
ICCV 2025
AnomalyNCD: Towards Novel Anomaly Class Discovery in Industrial Scenarios
CVPR 2025 1
Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues
arXiv 2024
Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval
arXiv 2024
Toward Real Text Manipulation Detection: New Dataset and New Solution
arXiv 2023
UATVR: Uncertainty-Adaptive Text-Video Retrieval
ICCV 2023 1
Non-Sequential Graph Script Induction via Multimedia Grounding
arXiv 2023
Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge
arXiv 2023
Weakly Supervised Semantic Segmentation via Progressive Patch Learning
arXiv 2022
Frequent co-authors
10 from 29 papers