Yuxuan Wang
- Papers: 35
Authored papers (35)
Fish Audio S2 Technical Report
arXiv 2026
Thoth: Mid-Training Bridges LLMs to Time Series Understanding
arXiv 2026
The AI Hippocampus: How Far are We From Human Memory?
arXiv 2026
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
arXiv 2025
Qwen3-Omni Technical Report
arXiv 2025
Qwen3-VL Technical Report
arXiv 2025
MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
arXiv 2025
MagiCodec: Simple Masked Gaussian-Injected Codec for High-Fidelity Reconstruction and Generation
arXiv 2025
Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
arXiv 2025
TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference
arXiv 2025
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
arXiv 2025
Hierarchical Frequency Tagging Probe (HFTP): A Unified Approach to Investigate Syntactic Structure Representations in Large Language Models and the Human Brain
arXiv 2025
Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context
arXiv 2025
From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens
arXiv 2025
Discrete Markov Bridge
arXiv 2025
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
arXiv 2024
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding
arXiv 2024
Understanding Multimodal Hallucination with Parameter-Free Representation Alignment
arXiv 2024
STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering
arXiv 2024
Progressive Confident Masking Attention Network for Audio-Visual Segmentation
arXiv 2024
TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables
arXiv 2024
SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
arXiv 2024
HawkEye: Training Video-Text LLMs for Grounding Text in Videos
arXiv 2024
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
arXiv 2024
Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge
arXiv 2024
AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining
arXiv 2023
Separate Anything You Describe
arXiv 2023
Halo: Estimation and Reduction of Hallucinations in Open-Source Weak Large Language Models
arXiv 2023
VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions
arXiv 2023
Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training
arXiv 2023
VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration
arXiv 2022
NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism
arXiv 2022
Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task
arXiv 2022
Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation
arXiv 2021
VoiceFixer: Toward General Speech Restoration with Neural Vocoder
arXiv 2021
Affiliations
Frequent co-authors
10 from 35 papers