Parameter Optimization for a Physical Model of the Vocal System

Abstract

We evaluate black-box and grey-box optimization of the parameters of a simplified physical synthesizer, the Pink Trombone, so that it emulates male and female vocal tracts for vocalic and non-speech sounds. A genetic optimizer infers articulatory parameters from human recordings by directly comparing Mel-spectrograms of the recording and of the synthesizer's output. The optimization runs over temporal windows, with a modified objective metric that promotes temporal coherence. We also study grey-box approaches: pYIN for fundamental-frequency prediction and a ResNet-based network to seed the optimization. Subjective tests confirm that the synthesizer convincingly mimics human vocal sounds and show a marked improvement over the previous state of the art. These evaluations also let us fine-tune the ViSQOL perceptual metric, giving a calibrated tool for assessing physical-synthesizer modeling.

Watch

The interactive demo above the title lets you try a handful of yawns and vowels in the browser. Every sample used in the study is collected below.

Listen

Three comparisons, each as a grid of audio players. Drag horizontally on narrow screens — the row labels stay pinned.

Vowels — male and female vocal tracts

For each vowel, the grid gives the human recordings first, then the previous state-of-the-art method ("old") and our method ("new"), each for the female and male vocal tracts.

Example	Original ♂	Original ♀	Old method ♀	New method ♀	Old method ♂	New method ♂
/a/
/e/
/i/
/o/
/u/

Temporal penalization sweep

Effect of the temporal-coherence penalization weight on vowel sequences (original recording followed by increasing penalization).

Example	Original	Penalization 0	Penalization 1	Penalization 2	Penalization 4
/a-e-i-o-u/
/e-i-u/
roy ♂
roy ♀
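
The role of the penalization weight can be illustrated with a small sketch. It is not the objective used in the study: the squared-jump penalty, the mean-absolute spectral error, and the names `spectral_error`, `temporal_penalty`, `sequence_cost` and `toy_synth` are all assumptions made for illustration. Only the idea of a per-window spectral comparison plus a weighted temporal-coherence term comes from the description above.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def spectral_error(target: Vector, synth: Vector) -> float:
    """Mean absolute difference between two flattened spectrogram windows."""
    return sum(abs(t - s) for t, s in zip(target, synth)) / len(target)


def temporal_penalty(prev: Vector, curr: Vector) -> float:
    """Mean squared jump between consecutive parameter vectors (assumed form)."""
    return sum((c - p) ** 2 for p, c in zip(prev, curr)) / len(curr)


def sequence_cost(
    targets: List[Vector],
    params_per_window: List[Vector],
    synthesize: Callable[[Vector], Vector],
    weight: float,
) -> float:
    """Sum of per-window spectral errors plus weighted parameter jumps."""
    total = 0.0
    for i, (target, params) in enumerate(zip(targets, params_per_window)):
        total += spectral_error(target, synthesize(params))
        if i > 0:
            total += weight * temporal_penalty(params_per_window[i - 1], params)
    return total


def toy_synth(params: Vector) -> List[float]:
    """Hypothetical stand-in for the synthesizer: spectrum equals parameters."""
    return list(params)


targets = [[0.0, 0.0], [1.0, 1.0]]
exact = [[0.0, 0.0], [1.0, 1.0]]    # fits every window, large jump
smooth = [[0.4, 0.4], [0.6, 0.6]]   # worse fit, small jump

for weight in (0, 1, 2, 4):
    print(weight,
          sequence_cost(targets, exact, toy_synth, weight),
          sequence_cost(targets, smooth, toy_synth, weight))
```

In this toy setup a weight of 0 favors the exact per-window fit, while weights of 1 and above favor the smoother trajectory, which mirrors the trade-off the sweep lets you hear.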

Initialization strategy

The same target optimized from different starting points: no initialization, a multispectral-error initializer, a neural-network predictor, and pYIN for the fundamental frequency.

Example	Original	No initialization	Multispectral error	Neural network	pYIN
/a/ ♂
/a/ ♀
/a-i-o/
/a-o/
/e/ ♀
roy ♀
roy ♂
yawn 1
yawn 2
yawn 3
/e-i-u/
oi-oi-oi
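
One way to picture how a starting point can seed a genetic optimizer is sketched below. This is an illustrative assumption, not the study's implementation: the function `initial_population`, the `spread` parameter, the parameter names, and the bounds are all hypothetical. A full guess (as a neural-network predictor might give) fills every entry, a partial guess (as pYIN gives for the fundamental frequency only) leaves the other entries as `None`, and no guess at all falls back to uniform sampling.

```python
import random
from typing import List, Optional, Sequence, Tuple

Bounds = Sequence[Tuple[float, float]]


def initial_population(
    bounds: Bounds,
    size: int,
    seed_guess: Optional[Sequence[Optional[float]]] = None,
    spread: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """Build a starting population, optionally centered on a seed guess.

    Seeded parameters are drawn from a Gaussian around the guess (standard
    deviation = spread * parameter range) and clipped to the bounds.
    Parameters without a guess are drawn uniformly within their bounds.
    """
    rng = rng or random.Random()
    population = []
    for _ in range(size):
        individual = []
        for j, (low, high) in enumerate(bounds):
            guess = None if seed_guess is None else seed_guess[j]
            if guess is None:
                value = rng.uniform(low, high)
            else:
                value = rng.gauss(guess, spread * (high - low))
                value = min(high, max(low, value))
            individual.append(value)
        population.append(individual)
    return population


# Hypothetical parameters and bounds, for illustration only:
# (fundamental frequency in Hz, tongue position, tongue diameter)
bounds = [(75.0, 330.0), (12.0, 29.0), (2.0, 3.5)]

unseeded = initial_population(bounds, 20, rng=random.Random(0))
f0_seeded = initial_population(bounds, 20, seed_guess=[140.0, None, None],
                               rng=random.Random(0))
print(len(unseeded), len(f0_seeded))
```

With the partial seed, the first parameter of every individual clusters around the guessed frequency while the remaining parameters still cover their full range.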

Perceptual metric — ViSQOL for the Pink Trombone

The subjective tests let us calibrate ViSQOL for this domain. Download the fitted support-vector-regression model:

Download the SVR model →

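After installing ViSQOL (see its user guide), run the command below from the ViSQOL repository root, replacing ref1.wav and deg1.wav with your reference and degraded recordings and pointing --similarity_to_quality_model at the downloaded model file: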

./bazel-bin/visqol \
  --reference_file ref1.wav \
  --degraded_file deg1.wav \
  --similarity_to_quality_model libsvm_svr_pinktrombone_model.txt
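
To score several file pairs, the same command can be assembled from a script. The sketch below is a convenience wrapper and not part of ViSQOL: the function names `visqol_command` and `run_visqol` are hypothetical, and the only flags used are the three shown in the command above. It assumes the binary has been built at ./bazel-bin/visqol and that the model file sits in the working directory.

```python
import subprocess
from typing import List


def visqol_command(
    reference: str,
    degraded: str,
    model: str = "libsvm_svr_pinktrombone_model.txt",
    binary: str = "./bazel-bin/visqol",
) -> List[str]:
    """Return the argument list for one ViSQOL comparison."""
    return [
        binary,
        "--reference_file", reference,
        "--degraded_file", degraded,
        "--similarity_to_quality_model", model,
    ]


def run_visqol(reference: str, degraded: str, **kwargs: str) -> str:
    """Run ViSQOL on one pair and return whatever it prints to stdout."""
    result = subprocess.run(
        visqol_command(reference, degraded, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise if the binary exits with an error
    )
    return result.stdout


# Placeholder file names; replace with your own recordings.
pairs = [("ref1.wav", "deg1.wav")]
for ref, deg in pairs:
    # Printing only shows the command; call run_visqol(ref, deg) to execute it.
    print(" ".join(visqol_command(ref, deg)))
```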

The listening test

The subjective evaluation was run as a MUSHRA test on Go Listen (in Spanish; your browser can translate it). The full questionnaire, translated into English, is documented in the test appendix.