Recently, audio generation tasks have attracted considerable research interest. Precise temporal controllability is essential for integrating audio generation into real applications. In this work, we propose a temporally controlled audio generation framework, PicoAudio. PicoAudio integrates temporal information to guide audio generation through tailored model design. It leverages data crawling, segmentation, filtering, and simulation of fine-grained temporally-aligned audio-text data. Both subjective and objective evaluations demonstrate that PicoAudio dramatically surpasses current state-of-the-art generation models in terms of timestamp and occurrence frequency controllability. The generated samples are available on the demo website https://zeyuxie29.github.io/PicoAudio.github.io.
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
PicoAudio, a temporally controlled audio generation framework, improves timestamp and occurrence frequency controllability through tailored model design and fine-grained audio-text data.
- Year
- 2024
- Venue
- arXiv 2024
- Authors
- 4
- Hosting
- Abstract only
Attribution
- Abstract & full text
- arxiv.org/abs/2407.02869v2
- TL;DR
- Semantic Scholar