Heng Wang
- Papers: 30
Authored papers
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
arXiv 2026
Adaptation of Agentic AI
arXiv 2025
Kimi-VL Technical Report
arXiv 2025
Cosmos World Foundation Model Platform for Physical AI
arXiv 2025
Step-DeepResearch Technical Report
arXiv 2025
OpenCUA: Open Foundations for Computer-Use Agents
arXiv 2025
The Collapse of Patches
arXiv 2025
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
arXiv 2025
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
arXiv 2025
Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources
arXiv 2025
BannerAgency: Advertising Banner Design with Multimodal LLM Agents
arXiv 2025
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
arXiv 2025
Reward Shaping to Mitigate Reward Hacking in RLHF
arXiv 2025
Fast Prompt Alignment for Text-to-Image Generation
arXiv 2024
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing
arXiv 2024
Autoregressive Pretraining with Mamba in Vision
arXiv 2024
Gotta Hear Them All: Sound Source Aware Vision to Audio Generation
arXiv 2024
DELL: Generating Reactions and Explanations for LLM-Based Misinformation Detection
arXiv 2024
Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images
arXiv 2024
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
arXiv 2023
V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models
arXiv 2023
Can Language Models Solve Graph Problems in Natural Language?
NeurIPS 2023
Progressive Volume Distillation with Active Learning for Efficient NeRF Architecture Conversion
arXiv 2023
Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?
ICCV 2023
Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens
arXiv 2023
TwiBot-22: Towards Graph-Based Twitter Bot Detection
arXiv 2022
Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise Affinity
CVPR 2022
Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds
arXiv 2022
Is Space-Time Attention All You Need for Video Understanding?
arXiv 2021
A Closer Look at Spatiotemporal Convolutions for Action Recognition
Frequent co-authors
10 from 30 papers