Expertise in AI for Speech and Language Processing: Guiding Businesses to Success

As a seasoned professional in AI for speech and language processing, I offer my expertise to businesses seeking to develop and implement speech technology solutions tailored to their needs. Drawing on a strong background across a range of projects, I can provide practical insights and guidance on building effective and efficient speech technology systems for your business.

The projects I have worked on showcase my proficiency in AI for speech and language processing: pronunciation training systems, sentiment analysis and emotion recognition, emotional voice databases, and voice generation with control over emotional expressiveness. Together they demonstrate my ability to tackle diverse challenges and develop innovative solutions in speech technology.

By leveraging this expertise, I can help your business navigate the complexities of developing and deploying speech technology solutions. Whether you are looking to improve customer service, enhance user experiences, or create innovative products, my guidance and support can give you the foundation you need to succeed in the rapidly evolving field of AI for speech and language processing. Don’t hesitate to reach out to explore how this expertise can benefit your business.

Contact me: noe dot tits at gmail dot com

Pronunciation analysis in speech

The presented project is a complete system that provides pronunciation training to English learners, built on top of a self-supervised pre-trained model adapted for mispronunciation detection. It integrates a set of machine learning models based on speech representation learning that analyze a speech sample and provide feedback on different pronunciation aspects. The system is integrated into a mobile application that offers a variety of speaking and listening exercises as well as tutorials.
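
To give a concrete flavour of the general approach, here is a minimal sketch of phoneme-level mispronunciation detection built on a self-supervised speech model. The checkpoint name, the decoding, and the comparison logic are illustrative assumptions, not the models or rules used in the actual product.

```python
# Minimal sketch: phoneme recognition with a self-supervised model, followed by a
# naive comparison against the target phonemes. All names below are illustrative.
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

MODEL_ID = "facebook/wav2vec2-lv-60-espeak-cv-ft"  # example phoneme-level CTC checkpoint

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

def recognize_phonemes(wav_path: str) -> list[str]:
    """Decode the phoneme sequence predicted by the pre-trained model."""
    audio, _ = librosa.load(wav_path, sr=16_000)
    inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0].split()

def missing_phonemes(expected: list[str], recognized: list[str]) -> list[str]:
    """Very naive check: expected phonemes that do not appear in the output."""
    return [p for p in expected if p not in recognized]

# Example (hypothetical audio file, target phonemes of "think"):
# print(missing_phonemes(["θ", "ɪ", "ŋ", "k"], recognize_phonemes("sample.wav")))
```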

For more details:

https://flowchase.app

Flowchase

The system offers a practical solution to the lack of applications focusing on oral skills, and specifically pronunciation, in computer-assisted language learning (CALL). Through a mobile application, it guides learners through speaking exercises and analyzes their pronunciation with the help of speech technology. The system performs a forced alignment between the speech sample and its phonetic transcription, extracts information on the phonetic content of the audio, and analyzes different pronunciation aspects such as vowels, consonants, word and sentence stress, and pauses between breath groups in an utterance. It then provides feedback to the user through a set of feedback cards with advice on how to improve their pronunciation. Transfer learning, and specifically self-supervised learning, makes it possible to leverage models trained on related tasks with abundant data for tasks where little data exists. As a result, the system can perform pronunciation training with relatively little data, a significant advantage over systems that require large amounts of training data.
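
As a toy illustration of how a phoneme-level comparison can be turned into feedback cards, the sketch below uses a simple sequence alignment. The real system relies on forced alignment and dedicated models for each pronunciation aspect, so the card texts, vowel list, and rules here are purely illustrative.

```python
# Toy illustration: turn a comparison of target vs. produced phonemes into
# feedback "cards". Everything below is a simplified assumption.
from difflib import SequenceMatcher

# A small (incomplete) set of English vowel phonemes, used only to label feedback.
VOWELS = {"iː", "ɪ", "e", "æ", "ʌ", "ɑː", "ɒ", "ɔː", "ʊ", "uː", "ə", "ɜː"}

def feedback_cards(expected: list[str], produced: list[str]) -> list[str]:
    """Compare the target and produced phoneme sequences and emit one
    feedback message per problematic target phoneme."""
    cards = []
    for op, i1, i2, j1, j2 in SequenceMatcher(a=expected, b=produced).get_opcodes():
        if op == "equal":
            continue
        for target in expected[i1:i2]:
            kind = "vowel" if target in VOWELS else "consonant"
            issue = "was replaced" if op == "replace" else "was not produced"
            cards.append(f"The target {kind} /{target}/ {issue}; "
                         "listen to the model audio and try again.")
    return cards

# Example: a learner saying "sink" instead of "think"
# print(feedback_cards(["θ", "ɪ", "ŋ", "k"], ["s", "ɪ", "ŋ", "k"]))
```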

Sentiment Analysis and Emotion Recognition

This paper proposes a novel approach for Sentiment Analysis and Emotion Recognition called Transformer-based Joint-Encoding (TBJE). TBJE leverages the Transformer architecture, which is commonly used in Machine Translation tasks, and a modular co-attention mechanism inspired by Visual Question Answering. The proposed approach is highly efficient and effective, and can jointly encode one or more modalities. The resulting model outperforms other state-of-the-art models on both Sentiment Analysis and Emotion Recognition tasks.
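
As a rough illustration of the joint-encoding idea (and not the exact TBJE architecture), a simplified PyTorch block with per-modality self-attention followed by co-attention could look like the following. Dimensions, number of heads, pooling, and the prediction heads are illustrative assumptions.

```python
# Simplified sketch of joint encoding with co-attention over two modalities
# (text and audio), in the spirit of TBJE. Not the architecture from the paper.
import torch
import torch.nn as nn

class JointEncodingBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.co_text = nn.MultiheadAttention(dim, heads, batch_first=True)   # text queries audio
        self.co_audio = nn.MultiheadAttention(dim, heads, batch_first=True)  # audio queries text
        self.norm_text = nn.LayerNorm(dim)
        self.norm_audio = nn.LayerNorm(dim)

    def forward(self, text, audio):
        # modality-specific self-attention
        t, _ = self.self_text(text, text, text)
        a, _ = self.self_audio(audio, audio, audio)
        # co-attention: each modality attends to the other one
        t2, _ = self.co_text(t, a, a)
        a2, _ = self.co_audio(a, t, t)
        return self.norm_text(text + t2), self.norm_audio(audio + a2)

class SentimentEmotionHead(nn.Module):
    """Mean-pool both modalities and predict sentiment (regression) and
    emotions (multi-label) from the concatenated representation."""
    def __init__(self, dim: int = 256, n_emotions: int = 6):
        super().__init__()
        self.encoder = JointEncodingBlock(dim)
        self.sentiment = nn.Linear(2 * dim, 1)
        self.emotions = nn.Linear(2 * dim, n_emotions)

    def forward(self, text, audio):
        t, a = self.encoder(text, audio)
        pooled = torch.cat([t.mean(dim=1), a.mean(dim=1)], dim=-1)
        return self.sentiment(pooled), self.emotions(pooled)

# Usage with dummy features: batch of 2, 20 text tokens and 50 audio frames, dim 256
# sent, emo = SentimentEmotionHead()(torch.randn(2, 20, 256), torch.randn(2, 50, 256))
```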

For more details:

Noé Tits, Kevin El Haddad, and Thierry Dutoit. 2018. ASR-based Features for Emotion Recognition: A Transfer Learning Approach. In Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), pages 48–52, Melbourne, Australia. Association for Computational Linguistics.

Jean-Benoit Delbrouck, Noé Tits, Mathilde Brousmiche, and Stéphane Dupont. 2020. A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis. In Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML), pages 1–7, Seattle, USA. Association for Computational Linguistics.

Example Use-Cases

This project can be applied in a wide range of scenarios where understanding human emotions and sentiment is critical. For example, it can be used in marketing research to analyze customer feedback, in mental health analysis to monitor and predict mood disorders, and in social media analysis to extract sentiment from text, images, and videos.

Emotional Voices Database

The paper presents an open-source database of emotional speech designed for speech synthesis, speech analysis, and speech representation learning. The dataset comprises five emotion classes recorded by male and female actors in English. It addresses the lack of open-source emotional speech databases suited to deep learning-based systems by providing high-quality data. The authors showcase the data’s effectiveness by using a simple DNN system to transform neutral speech into angry speech and evaluating the results through a CMOS perception test.

The paper highlights the need for emotional speech synthesis and analysis systems and the challenges of understanding the emotional dimension in speech. The authors’ proposed solution, an open-source emotional speech database, could contribute to the development of more efficient systems for speech synthesis and analysis.

For more details:

https://github.com/numediart/EmoV-DB

Adigwe, A., Tits, N., Haddad, K. E., Ostadabbas, S., & Dutoit, T. (2018). The emotional voices database: Towards controlling the emotion dimension in voice generation systems. arXiv preprint arXiv:1806.09514.
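
For readers who want to experiment with the data, here is a minimal indexing sketch. The directory layout and the (speaker, emotion) folder convention are assumptions made for illustration; the actual organization of the release is described in the repository README.

```python
# Minimal sketch for indexing EmoV-DB samples by speaker and emotion.
# The assumed layout is <root>/<speaker>/<emotion>/<utterance>.wav, which may
# differ from the actual release; adapt the path handling accordingly.
from pathlib import Path
from collections import defaultdict

EMOTIONS = {"neutral", "amused", "angry", "sleepy", "disgust"}  # the five classes

def index_emov_db(root: str) -> dict[tuple[str, str], list[Path]]:
    """Map (speaker, emotion) pairs to the list of their wav files."""
    index = defaultdict(list)
    for wav in Path(root).rglob("*.wav"):
        speaker = wav.parts[-3]
        emotion = wav.parts[-2].lower()
        if emotion in EMOTIONS:
            index[(speaker, emotion)].append(wav)
    return dict(index)

# index = index_emov_db("EmoV-DB")
# print({k: len(v) for k, v in index.items()})
```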

Example Use-Cases

One potential use case for the open-source emotional speech database is to develop more efficient speech synthesis and analysis systems with an emotional dimension. For example, a chatbot in a customer service context could utilize emotional speech synthesis to convey empathy and provide a more human-like experience for the user. The emotional speech database could be used to train deep neural networks to learn the emotional characteristics of speech and produce emotional speech in real-time. This could contribute to improving the user experience in various applications, such as chatbots, virtual assistants, and voice-enabled devices.

Another potential use case for this emotional speech dataset is the development of an emotional voice assistant that adjusts its tone and delivery based on the user’s emotional state. With the increasing prevalence of virtual assistants such as Siri, Alexa, and Google Assistant, there is a growing demand for more personalized and emotionally responsive interactions with these systems. By combining emotion recognition with a deep learning-based speech synthesis system trained on the emotional speech dataset, developers could create a voice assistant that detects the user’s emotional state and adjusts its speech output accordingly. For example, if the user sounds sad or anxious, the voice assistant could respond with a soothing, comforting tone. Conversely, if the user sounds excited or enthusiastic, the voice assistant could respond with a more energetic and upbeat tone. This would create a more engaging and responsive experience and could increase user satisfaction and engagement with the voice assistant.

Voice generation with control over emotional expressiveness

The paper analyzes and compares different latent spaces for speech synthesis, with the goal of building controllable speech synthesis systems whose behavior is interpretable. The authors use classical feature selection techniques to evaluate how well various embedding types discriminate between speech styles.

The paper focuses on unsupervised techniques for controllable speech synthesis, which avoid the need for labeled data. The learned latent embeddings model the variation that remains in speech signals after accounting for phonetics, speaker identity, and channel effects.
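
As an illustration of this kind of analysis, the sketch below scores the dimensions of a latent embedding with a classical feature selection criterion (mutual information). It is a simplified stand-in for the evaluation protocols described in the papers, which compare several criteria and several embedding types.

```python
# Sketch: rank latent dimensions by how well they discriminate speech styles,
# using mutual information as the feature selection criterion.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_dimensions(embeddings: np.ndarray, style_labels: np.ndarray):
    """embeddings: (n_utterances, n_dims) latent vectors, one per utterance.
    style_labels: (n_utterances,) integer style/emotion labels.
    Returns dimensions sorted from most to least discriminative, with scores."""
    scores = mutual_info_classif(embeddings, style_labels, random_state=0)
    order = np.argsort(scores)[::-1]
    return order, scores[order]

# Dummy example: 200 utterances, 64-dim embeddings, 4 styles
# rng = np.random.default_rng(0)
# X, y = rng.normal(size=(200, 64)), rng.integers(0, 4, size=200)
# dims, scores = rank_dimensions(X, y)
# print(dims[:5], scores[:5])
```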

For more details:

Tits, N., Wang, F., Haddad, K.E., Pagel, V., Dutoit, T. (2019) Visualization and Interpretation of Latent Spaces for Controlling Expressive Speech Synthesis Through Audio Analysis. Proc. Interspeech 2019, 4475-4479, DOI: 10.21437/Interspeech.2019-1426.

Tits, N.; El Haddad, K.; Dutoit, T. Analysis and Assessment of Controllability of an Expressive Deep Learning-Based TTS System. Informatics 2021, 8, 84. https://doi.org/10.3390/informatics8040084

Example Use-Cases

The proposed methodology can be used to build controllable speech synthesis systems that allow users to generate speech in different styles or manners. For example, a speech synthesis system could be developed that produces speech with different emotions, accents, or speaking rates.

A potential use case for this research is the development of more natural human-machine interfaces and the creation of virtual agents with specific voices and characteristics. By controlling the variability of the speech produced by machines, users can interact with them in a more natural and intuitive way. For instance, a speech synthesis system could be developed that generates speech in a more empathetic manner, which could be useful in healthcare or customer service settings.