Zhou Zhao

Papers
49

Authored papers

49

Orient Anything V2: Unifying Orientation and Rotation Understanding

arXiv 2026

2026

VoxMind: An End-to-End Agentic Spoken Dialogue System

arXiv 2026

2026

Reinforced Visual Perception with Tools

arXiv 2025

2025

Depth Anything with Any Prior

arXiv 2025

2025

ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting

arXiv 2025

2025

LiftFeat: 3D Geometry-Aware Local Feature Matching

arXiv 2025

2025

OmniAudio: Generating Spatial Audio from 360-Degree Video

arXiv 2025

2025

TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis

arXiv 2025

2025

Versatile Framework for Song Generation with Prompt-based Control

arXiv 2025

2025

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use

arXiv 2025

2025

ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

arXiv 2025

2025

DSI-Bench: A Benchmark for Dynamic Spatial Intelligence

arXiv 2025

2025

APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization

arXiv 2025

2025

WavReward: Spoken Dialogue Models With Generalist Reward Evaluators

arXiv 2025

2025

UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits

arXiv 2025

2025

CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale

arXiv 2025

2025

PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards

arXiv 2025

2025

Seeking and Updating with Live Visual Knowledge

arXiv 2025

2025

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

arXiv 2024

2024

WavChat: A Survey of Spoken Dialogue Models

arXiv 2024

2024

Language-Codec: Bridging Discrete Codec Representations and Speech Language Models

arXiv 2024

2024

Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt

arXiv 2024

2024

Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching

arXiv 2024

2024

GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks

arXiv 2024

2024

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

arXiv 2024

2024

Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models

arXiv 2024

2024

TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control

arXiv 2024

2024

Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition

arXiv 2024

2024

AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension

arXiv 2024

2024

Classifier-guided Gradient Modulation for Enhanced Multimodal Learning

arXiv 2024

2024

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

arXiv 2024

2024

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

arXiv 2024

2024

FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation

arXiv 2024

2024

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

arXiv 2023

2023

UniAudio: An Audio Foundation Model Toward Universal Audio Generation

arXiv 2023

2023

StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis

arXiv 2023

2023

Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers

arXiv 2023

2023

Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations

arXiv 2023

2023

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition

ICCV 2023 1

2023

Detector Guidance for Multi-Object Text-to-Image Generation

arXiv 2023

2023

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding

ICCV 2023 1

2023

Pseudo Numerical Methods for Diffusion Models on Manifolds

ICLR 2022

2022

Learning the Beauty in Songs: Neural Singing Voice Beautifier

ACL 2022 5

2022

ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

arXiv 2022

2022

FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

arXiv 2022

2022

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

arXiv 2021

2021

Parallel and High-Fidelity Text-to-Lip Generation

arXiv 2021

2021

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

NeurIPS 2021 12

2021

ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering

arXiv 2019

2019

Affiliations

No known affiliations.

Frequent co-authors

10, from 49 papers