Xiangyu Zhang
- Papers: 64
Authored papers
Step-Audio-R1.5 Technical Report
arXiv 2026
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
arXiv 2026
SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments
arXiv 2026
WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics
arXiv 2026
PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning
arXiv 2026
GEBench: Benchmarking Image Generation Models as GUI Environments
arXiv 2026
STEP3-VL-10B Technical Report
arXiv 2026
Step1X-Edit: A Practical Framework for General Image Editing
arXiv 2025
Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets
arXiv 2025
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
arXiv 2025
CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI
CVPR 2025
Farseer: A Refined Scaling Law in Large Language Models
arXiv 2025
Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining
arXiv 2025
Step-DeepResearch Technical Report
arXiv 2025
Unhackable Temporal Rewarding for Scalable Video MLLMs
arXiv 2025
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
arXiv 2025
REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
arXiv 2025
ASTRA: Autonomous Spatial-Temporal Red-teaming for AI Software Assistants
arXiv 2025
Step-Audio 2 Technical Report
arXiv 2025
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
arXiv 2025
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
arXiv 2025
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
arXiv 2025
Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
arXiv 2025
Efficient Dynamic Clustering-Based Document Compression for Retrieval-Augmented-Generation
arXiv 2025
M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?
arXiv 2025
Safety at Scale: A Comprehensive Survey of Large Model Safety
arXiv 2025
Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
arXiv 2025
$μ$KE: Matryoshka Unstructured Knowledge Editing of Large Language Models
arXiv 2025
Hita: Holistic Tokenizer for Autoregressive Image Generation
ICCV 2025
Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models
arXiv 2025
Step-GUI Technical Report
arXiv 2025
Perception-R1: Pioneering Perception Policy with Reinforcement Learning
arXiv 2025
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
arXiv 2024
LLMDFA: Analyzing Dataflow in Code with Large Language Models
arXiv 2024
Slow Perception: Let's Perceive Geometric Figures Step-by-step
arXiv 2024
OneChart: Purify the Chart Structural Extraction via One Auxiliary Token
arXiv 2024
Reconstructive Visual Instruction Tuning
arXiv 2024
Reflected Flow Matching
arXiv 2024
ProSec: Fortifying Code LLMs with Proactive Security Alignment
arXiv 2024
Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection
ICCV 2023
Fusion is Not Enough: Single Modal Attacks on Fusion Models for 3D Object Detection
arXiv 2023
DreamLLM: Synergistic Multimodal Comprehension and Creation
arXiv 2023
Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining
arXiv 2023
Symbol Preference Aware Generative Models for Recovering Variable Names from Stripped Binary
arXiv 2023
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
arXiv 2023
VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking
Cross Modal Transformer: Towards Fast and Robust 3D Object Detection
ICCV 2023
LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation
arXiv 2023
Bootstrap Masked Visual Modeling via Hard Patches Mining
arXiv 2023
OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation
ICCV 2023
KNOD: Domain Knowledge Distilled Tree Decoder for Automated Program Repair
arXiv 2023
PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images
ICCV 2023
Focal Sparse Convolutional Networks for 3D Object Detection
CVPR 2022
MatrixVT: Efficient Multi-Camera to BEV Transformation for 3D Perception
ICCV 2023
Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs
CVPR 2022
Self-Supervised Visual Representation Learning with Semantic Grouping
arXiv 2022
Reversible Column Networks
arXiv 2022
FLIP: A Provable Defense Framework for Backdoor Mitigation in Federated Learning
arXiv 2022
RepVGG: Making VGG-style ConvNets Great Again
CVPR 2021
DocTer: Documentation Guided Fuzzing for Testing Deep Learning API Functions
arXiv 2021
RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition
arXiv 2021
Channel Pruning for Accelerating Very Deep Neural Networks
Identity Mappings in Deep Residual Networks
arXiv 2016
Deep Residual Learning for Image Recognition