The comparison
Each row is a reconstruction method, with the original recording as the first row for reference; each column is one of eight percussion hits (four rubber-mallet strikes, four wide-drumstick strikes) from the beats database.
- 2D-VAE / 3D-VAE — autoencoders that encode amplitude and phase together in a latent space (2- or 3-dimensional convolutions over the time–frequency plane).
- Griffin–Lim — amplitude-only reconstruction with iterative phase regeneration, the classic baseline.
- Complex VAE — a network operating directly on complex spectra.
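For readers unfamiliar with the Griffin–Lim baseline, the iterative phase regeneration can be sketched roughly as follows. This is a minimal NumPy illustration, not the code used to produce the samples on this page: the helper names (`stft`, `istft`, `griffin_lim`), the Hann window, and the parameter values (`n_fft=256`, `hop=64`, `n_iter=50`) are all assumptions chosen for the example.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Hann-windowed STFT; returns an array of shape (frames, n_fft // 2 + 1)."""
    win = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X, n_fft=256, hop=64):
    """Weighted overlap-add inverse of stft()."""
    win = np.hanning(n_fft + 1)[:-1]
    n_frames = X.shape[0]
    length = n_fft + hop * (n_frames - 1)
    y = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(X, n=n_fft, axis=1)
    for i in range(n_frames):
        y[i * hop : i * hop + n_fft] += frames[i] * win
        norm[i * hop : i * hop + n_fft] += win ** 2
    # The floor keeps the under-covered edge samples from blowing up.
    return y / np.maximum(norm, 1e-3)

def griffin_lim(mag, n_iter=50, n_fft=256, hop=64, seed=0):
    """Estimate a signal whose STFT magnitude approximates `mag`."""
    rng = np.random.default_rng(seed)
    # Start from uniformly random phase.
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        # Go to the time domain and back, which yields a consistent STFT.
        y = istft(mag * phase, n_fft, hop)
        Z = stft(y, n_fft, hop)
        # Keep only the phase of the consistent STFT; the magnitude stays fixed.
        phase = Z / np.maximum(np.abs(Z), 1e-8)
    return istft(mag * phase, n_fft, hop)
```

Calling `griffin_lim(np.abs(stft(x)))` on a recording `x` returns a waveform built from the amplitude spectrogram alone, which is the sense in which this baseline is amplitude-only.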
The two blocks below repeat the comparison under the STFT and the MCLT representations, so you can hear how the transform choice affects phase recovery.
STFT representation
Reconstructions of eight percussion hits (rubber mallet, wide drumstick) under each STFT-based model, against Griffin–Lim phase regeneration.
| Example | Mallet 1 | Mallet 2 | Mallet 3 | Mallet 4 | Stick 1 | Stick 2 | Stick 3 | Stick 4 |
|---|---|---|---|---|---|---|---|---|
| Original | ||||||||
| 2D-VAE | ||||||||
| 3D-VAE | ||||||||
| Griffin–Lim | ||||||||
| Complex VAE | ||||||||
MCLT representation
The same hits reconstructed under each model built on the Modulated Complex Lapped Transform.
| Example | Mallet 1 | Mallet 2 | Mallet 3 | Mallet 4 | Stick 1 | Stick 2 | Stick 3 | Stick 4 |
|---|---|---|---|---|---|---|---|---|
| Original | ||||||||
| 2D-VAE | ||||||||
| 3D-VAE | ||||||||
| Griffin–Lim | ||||||||
| Complex VAE | ||||||||