Summary

Text-To-Speech (TTS) synthesis systems have existed for decades and have improved considerably with the advent of Deep Neural Networks (DNNs). These systems offer excellent speech quality by learning from tens of hours of speech.

The challenge faced by researchers today has therefore evolved: it is now necessary to produce remarkable voices, similar to those of actors, with a specific grain and a great capacity for expressiveness. This is the field of expressive speech synthesis.

In this context, the main issues are the variability of vocal expression of emotions and the difficulty of annotating large databases with expressive metadata that are highly subjective and still poorly defined. Deep Learning has proven effective at handling complex data, but it requires large amounts of such annotated data.

Some databases manually annotated with emotions already exist. They are suitable for recognition systems but not for speech synthesis because of their format: as they generally consist of dyadic conversations, they contain background noise and overlapping speech.

The strategy proposed in this project is precisely to develop an automated system for the expressive annotation of large voice databases. This system would be trained on existing annotated databases suitable for recognition and then applied to high-audio-quality speech databases suitable for speech synthesis.
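As a rough illustration of this idea (not the actual system, features, or data used in the project), the pipeline can be sketched as: train an emotion classifier on a hand-annotated corpus, then use it to label a clean corpus recorded for synthesis. All arrays below are placeholders.

```python
# Minimal sketch of the proposed annotation pipeline (illustrative only):
# 1) train an emotion classifier on an existing hand-annotated corpus,
# 2) apply it to label a clean, high-quality TTS corpus automatically.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder utterance-level features and labels standing in for a
# hand-annotated emotional corpus suited to recognition.
X_annotated = rng.normal(size=(500, 64))
y_annotated = rng.integers(0, 4, size=500)      # 4 emotion classes

# Placeholder features standing in for a clean studio-quality TTS corpus.
X_tts_corpus = rng.normal(size=(200, 64))

# Train on the annotated corpus...
classifier = LogisticRegression(max_iter=1000).fit(X_annotated, y_annotated)

# ...then transfer: predict expressive labels for the synthesis corpus.
predicted_labels = classifier.predict(X_tts_corpus)
print(predicted_labels[:10])
```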

Here are two demonstrations of Controllable Expressive Speech Synthesis.

The first one allows you to explore a continuous 2D space: you can click anywhere in the space and hear synthesized speech samples with different expressiveness.

The second allows you to control the intensity of different style categories.

Popular science presentation

This presentation comes from the “Ma thèse en 180 secondes” competition, which aims to explain the research problem of a PhD thesis in simple words in three minutes.

Links

ICE-Talk demo

source code: http://github.com/noetits/ICE-Talk


Sentences used in the demo:

1. The birch canoe slid on the smooth planks.
2. Glue the sheet to the dark blue background.
3. It's easy to tell the depth of a well.
4. These days a chicken leg is a rare dish.
5. Rice is often served in round bowls.


Speech with style categories

This is a demonstration of synthesized speech with different styles and controllable intensities.

I developed a multi-style TTS system with the ability to control the intensity of style categories. It is a modified version of DCTTS, a deep-learning-based TTS system, that takes an encoding of the style category as an additional input to the decoder. During training, a simple one-hot encoding is used; the size of the code is the number of styles. At synthesis time, the intensity of a style category can be modified by feeding other codes.
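As a rough sketch of this conditioning mechanism (the real DCTTS decoder is convolutional, and this is not the actual ICE-Talk code), the style code can be broadcast over time and concatenated with the decoder's usual input; module names and dimensions below are illustrative.

```python
# Illustrative PyTorch sketch: every decoding step is conditioned on the
# style code by concatenating it with the decoder input.
import torch
import torch.nn as nn

class StyleConditionedDecoder(nn.Module):
    def __init__(self, feat_dim=80, style_dim=3, hidden_dim=256):
        super().__init__()
        # hypothetical decoder core; the real DCTTS decoder is convolutional
        self.rnn = nn.GRU(feat_dim + style_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, prev_frames, style_code):
        # prev_frames: (batch, time, feat_dim), style_code: (batch, style_dim)
        t = prev_frames.size(1)
        style = style_code.unsqueeze(1).expand(-1, t, -1)  # broadcast over time
        x = torch.cat([prev_frames, style], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)

decoder = StyleConditionedDecoder()
frames = torch.randn(2, 50, 80)
# one-hot style codes, as used during training (3 style categories)
codes = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
pred = decoder(frames, codes)
print(pred.shape)  # torch.Size([2, 50, 80])
```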

In fact, it is quite impressive that the DNN is able to interpolate intensities without having seen intermediate styles in the training database.

In this demo, we interpolate between neutral and each style.

Only the entries corresponding to neutral and the category we interpolate with are non-zero, and the entries of a code sum to one.

Example:

Let’s assume we have three categories (neutral, happy, angry). A code to have “quite angry” speech could be

[0.3, 0, 0.7]
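A small helper illustrating how such codes can be built (the style order and the function name are assumptions for illustration, not part of the actual system):

```python
# Build the interpolation codes described above: only the neutral entry and
# the chosen style entry are non-zero, and the entries sum to one.
import numpy as np

STYLES = ["neutral", "happy", "angry"]   # assumed order, for illustration

def interpolation_code(style, intensity):
    """Return a code mixing neutral and `style` with the given intensity in [0, 1]."""
    code = np.zeros(len(STYLES))
    code[STYLES.index("neutral")] = 1.0 - intensity
    code[STYLES.index(style)] = intensity
    return code

print(interpolation_code("angry", 0.7))   # [0.3 0.  0.7], i.e. "quite angry"
```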

The interface below allows you to play a pre-synthesized sentence by clicking on a point. The axes correspond to styles, and the distance from the center of the circle corresponds to the intensity of the style.
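Purely as an illustration of the mapping just described (not the actual implementation of the demo), a clicked point could be converted to a style code as follows; the style names and axis layout are assumptions.

```python
# Map a 2D click to an interpolation code: the nearest axis picks the style,
# the distance from the center gives the intensity.
import numpy as np

STYLES = ["happy", "angry", "sad"]   # hypothetical styles placed on the axes

def click_to_code(x, y, max_radius=1.0):
    """Map a clicked point to a code mixing neutral and the closest style axis."""
    angle = np.arctan2(y, x) % (2 * np.pi)
    radius = min(np.hypot(x, y), max_radius)
    # styles evenly spaced around the circle, one per axis
    axis_angles = np.arange(len(STYLES)) * 2 * np.pi / len(STYLES)
    diff = np.abs((axis_angles - angle + np.pi) % (2 * np.pi) - np.pi)
    style_idx = int(np.argmin(diff))
    intensity = radius / max_radius
    code = np.zeros(len(STYLES) + 1)          # index 0 = neutral
    code[0] = 1.0 - intensity
    code[style_idx + 1] = intensity
    return code

print(click_to_code(0.7, 0.0))   # mostly the style on the 0-degree axis
```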