Expressive Text-to-Speech Synthesis via Deep Learning
Text-to-Speech synthesis systems have existed for decades and have recently improved with the advent of Deep Learning (DL). By learning from tens of hours of recorded speech, these systems now offer excellent speech quality.
The challenge faced by researchers has therefore evolved: the goal is now to produce remarkable voices, similar to those of actors, with a distinctive grain and a great capacity for expressiveness. This is the field of expressive speech synthesis.
My doctoral research focused on the challenges of creating highly natural and expressive synthetic speech. While modern Text-to-Speech (TTS) models can produce clear audio, they often struggle to convey the nuances of human emotion, tone, and prosody. The core of this project was to develop a deep learning-based framework that moves beyond monotone, robotic speech to generate rich, context-aware, and emotionally resonant audio.
The research involved designing and training neural networks to model the intricate relationship between text and prosody (the rhythm, stress, and intonation of speech). This included developing novel techniques for controlling and fine-tuning expressive features, allowing the model to adapt its style based on input text, speaker characteristics, or a specified emotion.
By integrating attention mechanisms and advanced sequence-to-sequence architectures, the final system demonstrated a significant improvement in the naturalness and expressiveness of the synthesized voices, paving the way for more engaging and human-like AI interactions.
Block Diagram of the Expressive TTS Pipeline

This diagram illustrates a typical end-to-end architecture for expressive TTS, which was the foundation for the research.
- The **Text Encoder** takes the input text and converts it into a sequence of meaningful embeddings.
- The **Prosody & Style Encoder** learns to extract expressive features from the text and/or a reference audio clip, allowing it to predict parameters like pitch, duration, and energy.
- The **Attention-based Decoder** combines the text and style embeddings to generate a mel-spectrogram—a visual representation of the audio’s frequency content over time.
- Finally, a high-fidelity **Vocoder** converts the mel-spectrogram into the final waveform, or audible speech. This final step is crucial for ensuring the output sounds natural and clear.
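To make the data flow concrete, here is a minimal PyTorch-style sketch of how these components could connect. Module names, layer choices, and dimensions are illustrative assumptions rather than the exact architecture used in the thesis, and the vocoder is treated as a black box.

```python
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Maps a sequence of character ids to contextual text embeddings."""
    def __init__(self, vocab_size=100, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)

    def forward(self, text_ids):
        out, _ = self.rnn(self.embed(text_ids))   # (B, T_text, dim)
        return out


class StyleEncoder(nn.Module):
    """Summarizes a reference mel-spectrogram into a single style embedding."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, ref_mel):                   # ref_mel: (B, T_ref, n_mels)
        _, h = self.rnn(ref_mel)
        return h.squeeze(0)                       # (B, dim)


class AttentionDecoder(nn.Module):
    """Attends over text embeddings, conditioned on style, and emits mel frames."""
    def __init__(self, dim=256, n_mels=80):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(dim, n_mels)

    def forward(self, text_emb, style_emb, n_frames):
        # Use the style embedding, repeated over time, as the attention queries.
        queries = style_emb.unsqueeze(1).repeat(1, n_frames, 1)
        ctx, _ = self.attn(queries, text_emb, text_emb)
        return self.out(ctx)                      # (B, n_frames, n_mels)


# A neural vocoder (e.g. WaveNet, WaveGlow, or HiFi-GAN) would then convert the
# predicted mel-spectrogram into a waveform; it is omitted here.
if __name__ == "__main__":
    text_ids = torch.randint(0, 100, (1, 20))     # dummy character ids
    ref_mel = torch.randn(1, 120, 80)             # dummy reference mel-spectrogram
    mel = AttentionDecoder()(TextEncoder()(text_ids),
                             StyleEncoder()(ref_mel), n_frames=200)
    print(mel.shape)                              # torch.Size([1, 200, 80])
```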
Read the Full Papers
Here are the papers that detail this work and its findings.
Popular science presentation
This presentation comes from the “Ma thèse en 180 secondes” competition, which challenges participants to explain the research problem of a PhD thesis in simple words in two to three minutes.
ICE-Talk demo
This demonstration allows you to explore a continuous 2D space in which you can click and hear synthesized speech samples with different expressiveness.
A second interface controls the intensity of different style categories.
Source code: http://github.com/noetits/ICE-Talk

Speech with style categories
This is a demonstration of synthesized speech with different styles and controllable intensities.
I developed a multi-style TTS system with the ability to control the intensity of style categories. It is a modified version of DCTTS, a deep-learning-based TTS system. The modification feeds an encoding of the style category to the input of the decoder. During training, a simple one-hot encoding is used, so the size of the code equals the number of different styles. At synthesis time, the intensity of a style category can be modified by inputting other codes.
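As an illustration of this conditioning idea, the sketch below broadcasts a style code over time and concatenates it with the decoder's input features. Shapes, layer sizes, and the module name are assumptions chosen for clarity; this is not the actual DCTTS code.

```python
import torch
import torch.nn as nn

class StyleConditioning(nn.Module):
    """Broadcasts a style code over time and concatenates it with decoder input features."""
    def __init__(self, feat_dim=80, n_styles=3, out_dim=256):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim + n_styles, out_dim, kernel_size=1)

    def forward(self, decoder_feats, style_code):
        # decoder_feats: (B, feat_dim, T), style_code: (B, n_styles)
        style = style_code.unsqueeze(-1).expand(-1, -1, decoder_feats.size(-1))
        return self.proj(torch.cat([decoder_feats, style], dim=1))   # (B, out_dim, T)

feats = torch.randn(2, 80, 50)                        # e.g. previously generated mel frames
codes = torch.tensor([[1.0, 0.0, 0.0],                # pure "neutral"
                      [0.3, 0.0, 0.7]])               # "quite angry"
print(StyleConditioning()(feats, codes).shape)        # torch.Size([2, 256, 50])
```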
In fact, it is quite impressive that the DNN is able to interpolate between intensities without having seen intermediate styles in the database.
In this demo, we interpolate between neutral and each style.
Only the entries corresponding to neutral and to the category we interpolate with are non-zero, and the entries of a code sum to one.
Example:
Let’s assume we have three categories (neutral, happy, angry). A code to have “quite angry” speech could be
[0.3, 0, 0.7]
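Such interpolation codes can be built with a small helper like the one below. The function and the category order are hypothetical, not part of the released code base; it simply keeps the neutral and target entries non-zero and makes them sum to one.

```python
import numpy as np

STYLES = ["neutral", "happy", "angry"]          # assumed category order

def interpolation_code(target_style, intensity):
    """Build a style code: only neutral and the target are non-zero, entries sum to 1."""
    assert 0.0 <= intensity <= 1.0
    code = np.zeros(len(STYLES))
    code[STYLES.index("neutral")] = 1.0 - intensity
    code[STYLES.index(target_style)] = intensity
    return code

print(interpolation_code("angry", 0.7))         # [0.3 0.  0.7] -> "quite angry" speech
```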
The interface below allows you to play a pre-synthesized sentence by clicking on a point. The axes correspond to styles, and the distance from the center of the circle corresponds to the intensity of the style.