Zhou Zhao
- Papers: 49
Authored papers (49)
Orient Anything V2: Unifying Orientation and Rotation Understanding
arXiv 2026
VoxMind: An End-to-End Agentic Spoken Dialogue System
arXiv 2026
Reinforced Visual Perception with Tools
arXiv 2025
Depth Anything with Any Prior
arXiv 2025
ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
arXiv 2025
LiftFeat: 3D Geometry-Aware Local Feature Matching
arXiv 2025
OmniAudio: Generating Spatial Audio from 360-Degree Video
arXiv 2025
TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis
arXiv 2025
Versatile Framework for Song Generation with Prompt-based Control
arXiv 2025
OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use
arXiv 2025
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing
arXiv 2025
DSI-Bench: A Benchmark for Dynamic Spatial Intelligence
arXiv 2025
APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization
arXiv 2025
WavReward: Spoken Dialogue Models With Generalist Reward Evaluators
arXiv 2025
UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits
arXiv 2025
CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale
arXiv 2025
PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards
arXiv 2025
Seeking and Updating with Live Visual Knowledge
arXiv 2025
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
arXiv 2024
WavChat: A Survey of Spoken Dialogue Models
arXiv 2024
Language-Codec: Bridging Discrete Codec Representations and Speech Language Models
arXiv 2024
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt
arXiv 2024
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
arXiv 2024
GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks
arXiv 2024
Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis
arXiv 2024
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models
arXiv 2024
TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control
arXiv 2024
Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition
arXiv 2024
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
arXiv 2024
Classifier-guided Gradient Modulation for Enhanced Multimodal Learning
arXiv 2024
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
arXiv 2024
A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications
arXiv 2024
FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation
arXiv 2024
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
arXiv 2023
UniAudio: An Audio Foundation Model Toward Universal Audio Generation
arXiv 2023
StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis
arXiv 2023
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers
arXiv 2023
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
arXiv 2023
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
ICCV 2023 1
Detector Guidance for Multi-Object Text-to-Image Generation
arXiv 2023
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding
ICCV 2023 1
Pseudo Numerical Methods for Diffusion Models on Manifolds
ICLR 2022
Learning the Beauty in Songs: Neural Singing Voice Beautifier
ACL 2022 5
ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech
arXiv 2022
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis
arXiv 2022
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
arXiv 2021
Parallel and High-Fidelity Text-to-Lip Generation
arXiv 2021
PortaSpeech: Portable and High-Quality Generative Text-to-Speech
NeurIPS 2021 12
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
arXiv 2019
Frequent co-authors
10 from 49 papers