Decoding Vocal Articulations from Acoustic Latent Representations

Abstract

We present a neural encoder for acoustic-to-articulatory inversion built on the Pink Trombone synthesizer, whose parameters (tongue position, vocal-cord configuration, …) are physically interpretable. The system identifies which articulatory features produce the acoustic characteristics captured in a neural latent representation. We compare a variational autoencoder trained from scratch, with a "projector" sub-network on its bottleneck that decodes the synthesizer's parameters, against two pretrained encoders, EnCodec and Wav2Vec, for which only the projector needs training. We predict six parameters and evaluate them with objective metrics and ViSQOL, a subjective-equivalent metric, on both synthetic and human speech. Fed back through the synthesizer, the predicted parameters generate human-like vowel sounds. Dataset, code, and findings are released to support future work.

How it works

Two routes produce the latent embeddings the projector reads from:

VAE + Synth — a self-supervised variational autoencoder trained from scratch to reconstruct the input, with a projector on its bottleneck that decodes the Pink Trombone’s articulatory parameters.
Pretrained encoders — EnCodec and Wav2Vec supply the latent space directly, so only the projector is trained. This removes the from-scratch encoder, simplifying the pipeline and cutting compute.

Because the synthesizer is a physical model, the predicted parameters are directly interpretable as vocal-tract gestures rather than opaque features.
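The projector step can be pictured as a small network that maps one latent vector to six bounded synthesizer parameters. The following is a minimal sketch under stated assumptions, not the authors' implementation: the one-hidden-layer shape, the latent and hidden sizes, the sigmoid output scaling, and every value in `BOUNDS` are hypothetical, and the weights are random rather than trained.

```python
import math
import random

N_PARAMS = 6  # the text states that six synthesizer parameters are predicted

# Hypothetical (low, high) bounds for each parameter. The real Pink Trombone
# ranges are not given in the text, so these numbers are placeholders.
BOUNDS = [(0.0, 1.0), (0.0, 1.0), (10.0, 30.0), (2.0, 4.0), (0.0, 1.0), (75.0, 300.0)]


def init_projector(latent_dim, hidden_dim, seed=0):
    """Create random weights for a one-hidden-layer MLP (untrained)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": matrix(hidden_dim, latent_dim),
        "b1": [0.0] * hidden_dim,
        "w2": matrix(N_PARAMS, hidden_dim),
        "b2": [0.0] * N_PARAMS,
    }


def affine(weights, bias, x):
    """Compute weights @ x + bias using plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def project(latent, net):
    """Map one latent vector to six parameters that stay inside BOUNDS."""
    hidden = [max(0.0, h) for h in affine(net["w1"], net["b1"], latent)]  # ReLU
    logits = affine(net["w2"], net["b2"], hidden)
    params = []
    for z, (lo, hi) in zip(logits, BOUNDS):
        unit = 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes the logit into (0, 1)
        params.append(lo + unit * (hi - lo))  # rescale to the parameter's range
    return params


latent = [0.5] * 16  # stand-in for an embedding from the VAE, EnCodec, or Wav2Vec
net = init_projector(latent_dim=16, hidden_dim=8)
params = project(latent, net)
```

Bounding the outputs this way keeps every prediction inside a range the synthesizer can accept. With a pretrained encoder, only the weights in `net` would be updated during training.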

Tongue position during the utterance /i–e–a–o–u/. Only one articulator shifts at a time, so the trajectory is easy to read; the purple line marks the physical boundary.

Distribution of ViSQOL scores (equivalent to Figure 2 in the paper).

Listen

Each example presents the original human recording, followed by the reconstruction from each encoder’s predicted parameters in a slow variant (steady parameters) and a fast variant (up to 10 parameter changes across the utterance).
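The difference between the two variants can be illustrated with a piecewise-constant parameter track. This is only a sketch of the idea, not the procedure used to build the examples: the frame count, the normalized [0, 1] units, and the choice to re-draw all six parameters at each change point are assumptions, and `make_trajectory` and `count_changes` are hypothetical helper names.

```python
import random


def make_trajectory(n_frames, n_params=6, max_changes=0, seed=0):
    """Build a piecewise-constant parameter track in normalized [0, 1] units.

    max_changes=0 gives a "slow" track (parameters held steady);
    max_changes=10 gives a "fast" track with at most 10 changes.
    """
    rng = random.Random(seed)
    n_changes = rng.randint(0, max_changes) if max_changes else 0
    # Frame indices at which the parameter vector is re-drawn (assumption:
    # all parameters change together at each change point).
    change_points = set(rng.sample(range(1, n_frames), n_changes))
    current = [rng.random() for _ in range(n_params)]
    frames = []
    for t in range(n_frames):
        if t in change_points:
            current = [rng.random() for _ in range(n_params)]
        frames.append(list(current))
    return frames


def count_changes(frames):
    """Count the frames whose parameter vector differs from the previous one."""
    return sum(1 for a, b in zip(frames, frames[1:]) if a != b)


slow = make_trajectory(n_frames=100, max_changes=0)
fast = make_trajectory(n_frames=100, max_changes=10)
```

Each frame of such a track would be passed to the synthesizer in turn, so the slow track yields a sustained sound and the fast track a sequence of transitions.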

Cite

@article{camara2024decoding,
  title   = {Decoding Vocal Articulations from Acoustic Latent Representations},
  author  = {C{\'a}mara, Mateo and Marcos, Fernando and Blanco, Jos{\'e} Luis},
  journal = {arXiv preprint arXiv:2406.14379},
  year    = {2024}
}

Vowel reconstructions

Original human recording vs. articulatory parameters predicted by each encoder, fed back through the Pink Trombone synthesizer. "Slow" holds parameters steady; "fast" allows up to 10 parameter changes across the utterance.

Example	Original	VAE+Synth · slow	VAE+Synth · fast	Wav2Vec · slow	Wav2Vec · fast	EnCodec · slow	EnCodec · fast
/a/
/e/
/i/
/o/
/u/
/e–i–u/
roy
/i–e–a–o–u/
/o–i–u/
/a–i–o/