Speech-to-speech

Open-source multilingual translation & cloned speech synthesis

Mateo Cámara, Juan Gutiérrez, María Pilar Daza, José Luis Blanco

Universidad Politécnica de Madrid

Abstract

We present an open-source system for multilingual translation and speech regeneration. Whisper handles speech recognition, with Voice Activity Detection (VAD) locating the speaking intervals. Two large language models follow: the first segments the transcript into coherent sentences and the second translates them. A text-to-speech module with voice cloning then regenerates the speech in the original speaker's voice, preserving naturalness and identity. Components can run locally or via APIs, which keeps deployment cost-effective in settings such as real-time translation in video calls, speech regeneration for broadcasts, and multilingual playback over Bluetooth. A detailed analysis of latency and word accuracy demonstrates the system's potential for inclusive, real-world communication.

Pipeline

Audio in → Whisper transcription with voice-activity detection → an LLM that segments the transcript into complete sentences → a second LLM that translates them → a voice-cloning TTS that regenerates speech in the original speaker’s voice. Every stage can run locally or through an API, so the system scales from a laptop to a server.
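The stage chain above can be sketched as a small orchestrator in which each stage is a plain callable, so a local model and an API client are interchangeable. This is a minimal illustration and not the project's actual code: the class name `SpeechToSpeechPipeline`, the stage signatures, and the `fake_*` stand-ins are all assumptions made for the example, and no real Whisper, LLM, or TTS call is shown. Per-stage timing is included only because the abstract mentions a latency analysis.

```python
from dataclasses import dataclass, field
from time import perf_counter
from typing import Callable, Dict, List


@dataclass
class SpeechToSpeechPipeline:
    """Hypothetical orchestrator: each stage is any callable (local model or API client)."""
    transcribe: Callable[[bytes], str]          # ASR with VAD: audio -> transcript
    segment: Callable[[str], List[str]]         # first LLM: transcript -> sentences
    translate: Callable[[str, str], str]        # second LLM: (sentence, target language) -> text
    synthesize: Callable[[str, bytes], bytes]   # cloning TTS: (text, speaker reference) -> audio
    latency: Dict[str, float] = field(default_factory=dict)

    def _timed(self, name, fn, *args):
        # Accumulate wall-clock seconds spent in each stage.
        start = perf_counter()
        result = fn(*args)
        self.latency[name] = self.latency.get(name, 0.0) + perf_counter() - start
        return result

    def run(self, audio: bytes, target_lang: str) -> List[bytes]:
        self.latency.clear()
        transcript = self._timed("transcribe", self.transcribe, audio)
        sentences = self._timed("segment", self.segment, transcript)
        clips = []
        for sentence in sentences:
            text = self._timed("translate", self.translate, sentence, target_lang)
            # The input audio doubles as the voice-cloning reference in this sketch.
            clips.append(self._timed("synthesize", self.synthesize, text, audio))
        return clips


# Toy stand-ins so the sketch runs without any model; swap in real backends.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_segment(text: str) -> List[str]:
    return [s.strip() + "." for s in text.split(".") if s.strip()]


def fake_translate(sentence: str, lang: str) -> str:
    return f"[{lang}] {sentence}"


def fake_synthesize(text: str, speaker_ref: bytes) -> bytes:
    return text.encode("utf-8")


pipeline = SpeechToSpeechPipeline(
    fake_transcribe, fake_segment, fake_translate, fake_synthesize
)
clips = pipeline.run(b"hola mundo. que tal", "en")
```

After `run`, `clips` holds one synthesized clip per segmented sentence and `pipeline.latency` maps each stage name to the seconds spent in it. Keeping the stages as injected callables is one way to reflect the "locally or through an API" choice without changing the orchestration code.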

Results

Subjects scored the system sentence by sentence:

[Figure: subject scores for Sentences 1 to 4, one panel per sentence.]

Listen

Standalone outputs presented to listeners:

Sample 1
Sample 2

And the paired test-vs-reference comparisons:

Paired samples — test vs. reference

Each translated-and-cloned output beside the human reference shown to listeners in the subjective test.

Example    Translated + cloned    Reference
Sample 3   (audio)                (audio)
Sample 4   (audio)                (audio)