Comparison of Phase Audio Processing Methods for Variational Autoencoders

Abstract

This work analyzes how signal phase is handled in one of the most popular architectures for generative audio-effect synthesis: the Variational Autoencoder (VAE). Fourier-based autoencoders have routinely sidestepped phase, either storing it and reattaching it at the output or regenerating it with Griffin–Lim. We instead evaluate VAEs whose latent space carries both amplitude and phase, and test the Modulated Complex Lapped Transform (MCLT) as an alternative to the Fourier Transform. Using a new database of percussive beats, we compare FFT and MCLT autoencoders, Griffin–Lim phase regeneration, multi-channel networks, and the Complex VAE, both objectively and in subjective-equivalent terms. The autoencoders learn to represent and handle phase holistically, reaching state-of-the-art quality for audio effects, and overall quality follows how well phase and amplitude are reconstructed jointly.

The comparison

Each row is a reconstruction method; each column is one of eight percussion hits (four rubber-mallet strikes, four wide-drumstick strikes) from the beats database.

2D-VAE / 3D-VAE — autoencoders that encode amplitude and phase together in a latent space (2- or 3-dimensional convolutions over the time–frequency plane).
Griffin–Lim — amplitude-only reconstruction with iterative phase regeneration, the classic baseline.
Complex VAE — a network operating directly on complex spectra.
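
For readers unfamiliar with the Griffin–Lim baseline, the following is a minimal NumPy sketch of the idea: keep the amplitude fixed, start from random phase, and alternate between resynthesizing a signal and re-analyzing it to obtain a more consistent phase. It is an illustration under stated assumptions, not the code used for the examples on this page. The helper names (stft, istft, griffin_lim), the Hann window, the frame size of 512, and the hop of 128 are all hypothetical choices made for the sketch.

```python
import numpy as np


def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT; returns an array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft=512, hop=128):
    """Least-squares inverse: windowed overlap-add divided by summed window^2."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[0]
    length = n_fft + (n_frames - 1) * hop
    out = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        out[i * hop:i * hop + n_fft] += frames[i] * window
        norm[i * hop:i * hop + n_fft] += window ** 2
    # Avoid dividing by zero where the window vanishes (signal edges).
    return out / np.where(norm > 1e-12, norm, 1.0)


def griffin_lim(magnitude, n_iter=50, n_fft=512, hop=128, seed=0):
    """Estimate a time signal whose STFT amplitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    # Assumption: random initial phase; other initializations are possible.
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    signal = istft(magnitude * phase, n_fft, hop)
    for _ in range(n_iter):
        # Keep the known amplitude, adopt the phase of the re-analyzed signal.
        phase = np.exp(1j * np.angle(stft(signal, n_fft, hop)))
        signal = istft(magnitude * phase, n_fft, hop)
    return signal
```

Given an amplitude spectrogram mag = np.abs(stft(x)), calling griffin_lim(mag) returns a waveform whose amplitude spectrogram approaches mag as the number of iterations grows. The recovered phase is only one of many consistent with that amplitude, which is why this baseline can sound different from the original on sharp transients such as percussion hits.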

The two blocks below repeat the comparison under the STFT and the MCLT representations, so you can hear how the transform choice affects phase recovery.
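
As background on the second representation, here is a small sketch of a single-frame MCLT computed directly from its defining sums. It assumes the commonly cited formulation in which a block of 2M samples is weighted by a sine window and projected onto cosine and sine bases, so that the real part is the MLT/MDCT and the imaginary part is its sine-modulated counterpart. Sign and scaling conventions vary between references, and the function name mclt_frame is hypothetical; this is not necessarily the variant used to produce the examples here.

```python
import numpy as np


def mclt_frame(block):
    """Direct O(M^2) MCLT of one block of 2M samples -> M complex coefficients."""
    two_m = len(block)
    m = two_m // 2
    n = np.arange(two_m)
    k = np.arange(m)
    # Assumed sine window, which satisfies the overlap-add condition for hop M.
    window = np.sin((n + 0.5) * np.pi / two_m)
    # Modulation argument (n + (M + 1) / 2) * (k + 1/2) * pi / M, shape (M, 2M).
    arg = np.pi / m * np.outer(k + 0.5, n + (m + 1) / 2)
    scale = np.sqrt(2.0 / m)
    windowed = window * block
    real = scale * (np.cos(arg) @ windowed)    # cosine-modulated part (MLT/MDCT)
    imag = -scale * (np.sin(arg) @ windowed)   # sine-modulated part
    return real + 1j * imag
```

Blocks are taken with a hop of M samples, so consecutive blocks overlap by half. Unlike the STFT, each block yields only M complex coefficients for 2M input samples, and np.abs and np.angle of the result give the amplitude and phase that a network would have to model under this representation.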