Stable Audio applies latent diffusion models to audio generation. Earlier audio diffusion models were limited to generating fixed-size outputs, which makes variable-length audio, such as full songs, hard to produce. Stable Audio addresses this by conditioning on text metadata together with the audio file's duration and start time, giving control over both the content and the length of the output. The architecture can render 95 seconds of stereo audio in less than one second on an NVIDIA A100 GPU. It combines a variational autoencoder (VAE), a text encoder, and a U-Net-based conditioned diffusion model. The model was trained on a dataset from AudioSparx totaling over 19,500 hours of audio. Stable Audio is the work of Harmonai, Stability AI's research lab, and future developments are expected to include open-source models.
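
To make the timing conditioning more concrete, here is a minimal Python sketch of how a start time and a total duration might be derived for a fixed-size window. It is an illustration under stated assumptions, not Stable Audio's actual code: the names `TimingConditioning`, `seconds_start`, `seconds_total`, `sample_training_chunk`, and `inference_conditioning` are hypothetical, and the random-crop strategy is assumed rather than taken from the text. Only the 95-second window comes from the summary above.

```python
import random
from dataclasses import dataclass
from typing import Optional

# The summary above mentions 95 seconds of audio per render; treating that
# as the fixed window size of the model is an assumption of this sketch.
WINDOW_SECONDS = 95.0


@dataclass
class TimingConditioning:
    """Hypothetical container for the two timing values fed to the model."""
    seconds_start: float  # where the window begins inside the full file
    seconds_total: float  # duration of the full file (or the requested length)


def sample_training_chunk(
    track_seconds: float,
    window_seconds: float = WINDOW_SECONDS,
    rng: Optional[random.Random] = None,
) -> TimingConditioning:
    """Pick a random window from a track and record its timing values."""
    if track_seconds <= 0:
        raise ValueError("track_seconds must be positive")
    if rng is None:
        rng = random.Random()
    # Tracks shorter than the window can only start at 0.
    latest_start = max(0.0, track_seconds - window_seconds)
    start = rng.uniform(0.0, latest_start)
    return TimingConditioning(seconds_start=start, seconds_total=track_seconds)


def inference_conditioning(
    desired_seconds: float,
    window_seconds: float = WINDOW_SECONDS,
) -> TimingConditioning:
    """Request a clip of a given length that starts at the beginning."""
    if not 0 < desired_seconds <= window_seconds:
        raise ValueError("desired_seconds must be in (0, window_seconds]")
    return TimingConditioning(seconds_start=0.0, seconds_total=desired_seconds)


if __name__ == "__main__":
    # A 200-second track yields a start somewhere in the first 105 seconds.
    print(sample_training_chunk(200.0, rng=random.Random(0)))
    # Asking for a 30-second clip at generation time.
    print(inference_conditioning(30.0))
```

In this sketch the model always sees a fixed-size window, while the two timing values tell it where that window sits in the full file and how long the file is. At generation time, the same values can be set to request a specific output length.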

 