
Comparison of Phase Audio Processing Methods for Variational Autoencoders

Mateo Cámara, José Luis Blanco

Universidad Politécnica de Madrid

Abstract

This work analyzes how signal phase is handled in one of the most popular architectures for generative audio-effect synthesis: the Variational Autoencoder (VAE). Fourier-based autoencoders have routinely sidestepped phase, either storing it and reattaching it at the output or regenerating it with Griffin–Lim. We instead evaluate VAEs whose latent space carries both amplitude and phase, and test the Modulated Complex Lapped Transform (MCLT) as an alternative to the Fourier Transform. Using a new database of percussive beats, we compare FFT and MCLT autoencoders, Griffin–Lim phase regeneration, multi-channel networks, and the Complex VAE, in both objective and subjective-equivalent terms. The autoencoders learn to represent and handle phase holistically, reaching state-of-the-art quality for audio effects, and overall quality depends on how well phase and amplitude are reconstructed jointly.

The comparison

Each row is a reconstruction method; each column is one of eight percussion hits (four rubber-mallet strikes, four wide-drumstick strikes) from the beats database.

The two blocks below repeat the comparison under the STFT and the MCLT representations, so you can hear how the transform choice affects phase recovery.
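To make the idea of carrying both amplitude and phase concrete, here is a minimal NumPy sketch that splits a complex spectrogram into a magnitude channel and a phase channel, then reassembles it. The helper names (`stft`, `to_channels`, `from_channels`), the frame size, the hop, and the synthetic decaying-noise "hit" are all assumptions made for illustration; the two-channel layout is one plausible way to feed a multi-channel network and is not taken from the paper.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Hann-windowed STFT; returns a complex array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def to_channels(spec):
    """Stack magnitude and phase as two real channels: shape (2, frames, bins)."""
    return np.stack([np.abs(spec), np.angle(spec)])

def from_channels(channels):
    """Rebuild the complex spectrogram from the magnitude and phase channels."""
    magnitude, phase = channels
    return magnitude * np.exp(1j * phase)

# Synthetic stand-in for a percussive hit (assumption): exponentially decaying noise.
rng = np.random.default_rng(0)
t = np.arange(4096) / 16000
hit = rng.standard_normal(4096) * np.exp(-40 * t)

spec = stft(hit)
channels = to_channels(spec)      # what a two-channel model would see
rebuilt = from_channels(channels) # lossless if both channels are kept intact
print(channels.shape)
```

Keeping the phase channel untouched and reattaching it to a modified magnitude corresponds to the "store and reattach" strategy mentioned in the abstract; a model that must reconstruct both channels has no such shortcut.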

STFT representation

Reconstructions of eight percussion hits (rubber mallet, wide drumstick) under each STFT-based model, against Griffin–Lim phase regeneration.

Example | Mallet 1 | Mallet 2 | Mallet 3 | Mallet 4 | Stick 1 | Stick 2 | Stick 3 | Stick 4
Original
2D-VAE
3D-VAE
Griffin–Lim
Complex VAE
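
For readers unfamiliar with the Griffin–Lim row above, the following is a minimal NumPy sketch of the iteration: start from random phase, then alternate between inverting the spectrogram and re-imposing the target magnitude. The STFT parameters, the helper names, the iteration count, and the synthetic test signal are assumptions for illustration and are not the settings used for the audio samples on this page.

```python
import numpy as np

N_FFT, HOP = 256, 64
WINDOW = np.hanning(N_FFT)

def stft(x):
    """Hann-windowed STFT of shape (frames, N_FFT // 2 + 1)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP : i * HOP + N_FFT] * WINDOW for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, length):
    """Least-squares inverse: windowed overlap-add divided by the summed squared window."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP : i * HOP + N_FFT] += frame
        norm[i * HOP : i * HOP + N_FFT] += WINDOW ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, length, n_iter, seed=0):
    """Estimate a signal whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    spec = magnitude * np.exp(1j * rng.uniform(-np.pi, np.pi, magnitude.shape))
    for _ in range(n_iter):
        signal = istft(spec, length)
        # Keep the phase of the consistent spectrogram, restore the target magnitude.
        spec = magnitude * np.exp(1j * np.angle(stft(signal)))
    return istft(spec, length)

def magnitude_error(signal, magnitude):
    """Relative mismatch between the signal's STFT magnitude and the target."""
    return np.linalg.norm(np.abs(stft(signal)) - magnitude) / np.linalg.norm(magnitude)

# Synthetic stand-in for a percussive hit (assumption): exponentially decaying noise.
rng = np.random.default_rng(0)
t = np.arange(4096) / 16000
hit = rng.standard_normal(4096) * np.exp(-40 * t)
target = np.abs(stft(hit))

err_before = magnitude_error(griffin_lim(target, len(hit), n_iter=0), target)
err_after = magnitude_error(griffin_lim(target, len(hit), n_iter=50), target)
print(err_before, err_after)
```

The error reported here only measures magnitude consistency. A low value does not guarantee that the recovered phase matches the original, which is why percussive transients can still sound smeared after phase regeneration.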

MCLT representation

The same hits reconstructed under each model built on the Modulated Complex Lapped Transform.

Example | Mallet 1 | Mallet 2 | Mallet 3 | Mallet 4 | Stick 1 | Stick 2 | Stick 3 | Stick 4
Original
2D-VAE
3D-VAE
Griffin–Lim
Complex VAE
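
As background for the MCLT models above, here is a small NumPy sketch of the transform under one common convention: a sine window, a cosine-modulated real part (the MLT/MDCT) and a sine-modulated imaginary part, computed on 50%-overlapped frames. The sign of the imaginary part and of the window varies between references, so treat the convention, the helper names, the block size `M`, and the test signal as assumptions. The inverse shown uses only the real part, which is sufficient for perfect reconstruction of unmodified coefficients; it is not necessarily the inverse used by the models on this page.

```python
import numpy as np

def mclt_basis(M):
    """Windowed cosine and sine bases, each of shape (2M, M)."""
    n = np.arange(2 * M)[:, None]
    k = np.arange(M)[None, :]
    window = np.sin((n + 0.5) * np.pi / (2 * M))
    angle = (n + (M + 1) / 2) * (k + 0.5) * np.pi / M
    scale = np.sqrt(2.0 / M)
    return window * scale * np.cos(angle), window * scale * np.sin(angle)

def mclt(x, M):
    """Complex coefficients of shape (frames, M) from frames of length 2M with hop M."""
    cos_basis, sin_basis = mclt_basis(M)
    n_frames = len(x) // M - 1
    frames = np.stack([x[i * M : i * M + 2 * M] for i in range(n_frames)])
    return frames @ cos_basis - 1j * (frames @ sin_basis)

def imclt_real(coeffs, M):
    """Overlap-add inverse that uses only the real (MLT) part of the coefficients."""
    cos_basis, _ = mclt_basis(M)
    frames = coeffs.real @ cos_basis.T
    out = np.zeros((len(coeffs) + 1) * M)
    for i, frame in enumerate(frames):
        out[i * M : i * M + 2 * M] += frame
    return out

# Synthetic stand-in for a percussive hit (assumption): exponentially decaying noise.
rng = np.random.default_rng(0)
M = 64
x = rng.standard_normal(1024) * np.exp(-np.arange(1024) / 200)

coeffs = mclt(x, M)
magnitude, phase = np.abs(coeffs), np.angle(coeffs)  # the two quantities a model would handle
y = imclt_real(coeffs, M)

# Only samples covered by two overlapping frames are reconstructed exactly.
interior = slice(M, -M)
print(np.max(np.abs(y[interior] - x[interior])))
```

Unlike the STFT, this transform produces M complex coefficients per hop of M samples, so magnitude and phase are available without the extra redundancy of a heavily overlapped Fourier analysis.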