Xiangyu Zhang

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

arXiv 2026

SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments

arXiv 2026

WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics

arXiv 2026

PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning

arXiv 2026

GEBench: Benchmarking Image Generation Models as GUI Environments

arXiv 2026

STEP3-VL-10B Technical Report

arXiv 2026

Step1X-Edit: A Practical Framework for General Image Editing

arXiv 2025

Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets

arXiv 2025

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

arXiv 2025

CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI

CVPR 2025

Farseer: A Refined Scaling Law in Large Language Models

arXiv 2025

Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining

arXiv 2025

Step-DeepResearch Technical Report

arXiv 2025

Unhackable Temporal Rewarding for Scalable Video MLLMs

arXiv 2025

NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

arXiv 2025

REASONEDIT: Towards Reasoning-Enhanced Image Editing Models

arXiv 2025

ASTRA: Autonomous Spatial-Temporal Red-teaming for AI Software Assistants

arXiv 2025

Step-Audio 2 Technical Report

arXiv 2025

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

arXiv 2025

Hita: Holistic Tokenizer for Autoregressive Image Generation

ICCV 2025

M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?

arXiv 2025

Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models

arXiv 2025

Step-GUI Technical Report

arXiv 2025

Perception-R1: Pioneering Perception Policy with Reinforcement Learning

arXiv 2025

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

arXiv 2025

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

arXiv 2025

Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model

arXiv 2025

Safety at Scale: A Comprehensive Survey of Large Model Safety

arXiv 2025

Efficient Dynamic Clustering-Based Document Compression for Retrieval-Augmented-Generation

arXiv 2025

Foot-In-The-Door: A Multi-turn Jailbreak for LLMs

arXiv 2025

$μ$KE: Matryoshka Unstructured Knowledge Editing of Large Language Models

arXiv 2025

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

arXiv 2024

LLMDFA: Analyzing Dataflow in Code with Large Language Models

arXiv 2024

Slow Perception: Let's Perceive Geometric Figures Step-by-step

arXiv 2024

OneChart: Purify the Chart Structural Extraction via One Auxiliary Token

arXiv 2024

Reconstructive Visual Instruction Tuning

arXiv 2024

Reflected Flow Matching

arXiv 2024

ProSec: Fortifying Code LLMs with Proactive Security Alignment

arXiv 2024

Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection

ICCV 2023

Fusion is Not Enough: Single Modal Attacks on Fusion Models for 3D Object Detection

arXiv 2023

KNOD: Domain Knowledge Distilled Tree Decoder for Automated Program Repair

arXiv 2023

DreamLLM: Synergistic Multimodal Comprehension and Creation

arXiv 2023

Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

arXiv 2023

VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking

CVPR 2023

Cross Modal Transformer: Towards Fast and Robust 3D Object Detection

ICCV 2023

LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation

arXiv 2023

Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining

arXiv 2023

Bootstrap Masked Visual Modeling via Hard Patches Mining

arXiv 2023

OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation

ICCV 2023

Symbol Preference Aware Generative Models for Recovering Variable Names from Stripped Binary

arXiv 2023

PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images

ICCV 2023

MatrixVT: Efficient Multi-Camera to BEV Transformation for 3D Perception

ICCV 2023

Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs

CVPR 2022

Self-Supervised Visual Representation Learning with Semantic Grouping

arXiv 2022

Focal Sparse Convolutional Networks for 3D Object Detection

CVPR 2022

Reversible Column Networks

arXiv 2022

FLIP: A Provable Defense Framework for Backdoor Mitigation in Federated Learning

arXiv 2022