Zheng Ge
- Papers: 27
Authored papers (27)
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
arXiv 2026
SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments
arXiv 2026
WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics
arXiv 2026
PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning
arXiv 2026
GEBench: Benchmarking Image Generation Models as GUI Environments
arXiv 2026
STEP3-VL-10B Technical Report
arXiv 2026
Step1X-Edit: A Practical Framework for General Image Editing
arXiv 2025
Unhackable Temporal Rewarding for Scalable Video MLLMs
arXiv 2025
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
arXiv 2025
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
arXiv 2025
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
arXiv 2025
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
arXiv 2025
Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
arXiv 2025
M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?
arXiv 2025
Step-GUI Technical Report
arXiv 2025
Perception-R1: Pioneering Perception Policy with Reinforcement Learning
arXiv 2025
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
arXiv 2024
Slow Perception: Let's Perceive Geometric Figures Step-by-step
arXiv 2024
OneChart: Purify the Chart Structural Extraction via One Auxiliary Token
arXiv 2024
Reconstructive Visual Instruction Tuning
arXiv 2024
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
arXiv 2024
DreamLLM: Synergistic Multimodal Comprehension and Creation
arXiv 2023
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
arXiv 2023
Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining
arXiv 2023
Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?
arXiv 2022
MatrixVT: Efficient Multi-Camera to BEV Transformation for 3D Perception
ICCV 2023
YOLOX: Exceeding YOLO Series in 2021
arXiv 2021