Developing a precise and data-efficient pronunciation training system for English learners using advanced self-supervised learning models.

Abstract

This project addresses the critical need for effective and scalable pronunciation training tools for language learners. We developed a novel system that moves beyond traditional speech recognition by focusing on granular phoneme-level analysis. The core of this system is a transformer-based foundational audio model, specifically wav2vec 2.0, which was fine-tuned to predict duration-aware phoneme sequences. This innovative approach significantly reduces the need for extensive labeled datasets, enabling highly accurate mispronunciation detection and the provision of personalized, actionable feedback to users. The resulting system offers a powerful and efficient solution for AI-powered language education.

  • Key Technologies & Concepts: Transformer Models (Wav2Vec), Fine-tuning, Self-supervised Learning, Speech Representation Learning, AI in Education

Methodology & Approach

Our methodology centered on the adaptation of a pre-trained, self-supervised audio model. By fine-tuning the wav2vec architecture, we taught the model to align a user’s speech with the expected sequence and duration of phonemes. Unlike standard speech-to-text systems that simply transcribe words, our model predicts a structured sequence of phonemes, including their temporal boundaries. This granular, duration-aware prediction allows the system to pinpoint exactly where a mispronunciation occurs—whether it’s a substitution, omission, or an incorrect duration of a specific sound—and to provide highly targeted feedback.


Figure: Model pipeline of the pronunciation system. Raw speech audio → Wav2Vec2 feature extractor → context representations (self-supervised learning) → PCA dimension reduction (high-dimensional to 2D) → frame classifier (k-nearest neighbors) → probability matrix → grouping of frames from the probability matrix into phoneme segments.

Example alignment for the sentence "Don't ask me" (phoneme, end time in seconds): D 0.80, OW 0.84, N 0.86, T 0.92, AE 0.98, S 1.00, K 1.14, M 1.24, IY 1.32.

1. Self-Supervised Feature Extraction (wav2vec2 fine-tuned with CTC)

The pipeline begins with a pretrained wav2vec2 model (trained across 53 languages), which is further fine‑tuned on phoneme recognition using a CTC (Connectionist Temporal Classification) loss. This yields contextual frame-level phonetic representations without requiring explicit forced-alignment input.
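To make the CTC idea concrete, here is a minimal sketch of greedy CTC decoding: repeated frame predictions are collapsed and blank tokens are dropped to recover a phoneme sequence. The phoneme inventory and logits below are hypothetical stand-ins, not the actual model outputs.

```python
import numpy as np

# Hypothetical frame-level logits over a tiny phoneme inventory.
# Index 0 is the CTC blank token; the rest are phoneme classes.
PHONEMES = ["<blank>", "D", "OW", "N", "T"]

logits = np.array([
    [0.1, 2.0, 0.1, 0.1, 0.1],   # frame 0 -> D
    [0.1, 2.5, 0.1, 0.1, 0.1],   # frame 1 -> D (repeat, collapsed)
    [3.0, 0.1, 0.1, 0.1, 0.1],   # frame 2 -> blank (dropped)
    [0.1, 0.1, 2.2, 0.1, 0.1],   # frame 3 -> OW
    [0.1, 0.1, 0.1, 2.8, 0.1],   # frame 4 -> N
    [0.1, 0.1, 0.1, 0.1, 2.1],   # frame 5 -> T
])

def greedy_ctc_decode(logits):
    """Collapse repeated frame predictions and drop blanks."""
    best = logits.argmax(axis=1)
    decoded, prev = [], None
    for idx in best:
        if idx != prev and idx != 0:
            decoded.append(PHONEMES[idx])
        prev = idx
    return decoded

print(greedy_ctc_decode(logits))  # ['D', 'OW', 'N', 'T']
```

This is only the decoding rule; the fine-tuning itself minimizes the CTC loss over all alignments consistent with the target phoneme sequence.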


2. Feature Reduction via PCA

High-dimensional embeddings from the last hidden layer of wav2vec2 are passed through Principal Component Analysis (PCA) to reduce dimensionality while retaining most of the variance. This step ensures efficient downstream processing and reduces model complexity.


3. Frame-Level Phoneme Classification

Reduced embeddings are classified using a frame-level phoneme classifier, trained with forced-alignment labels (generated by tools like the Montreal Forced Aligner). This classifier learns to assign phoneme classes to each audio frame.


4. Probability Vectors per Frame

Each audio frame is represented as a probability distribution over phoneme classes, resulting in a probability vector for every time step.
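Steps 3 and 4 can be sketched together with a k-nearest-neighbors frame classifier (the classifier family shown in the pipeline figure). The embeddings, labels, and dimensions below are synthetic placeholders; in practice the labels come from forced alignment.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# Stand-ins: PCA-reduced frame embeddings with forced-alignment phoneme
# labels (4 hypothetical phoneme classes).
train_frames = rng.normal(size=(300, 32))
train_labels = rng.integers(0, 4, size=300)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_frames, train_labels)

# New utterance: each row of the probability matrix is one audio frame's
# distribution over phoneme classes.
test_frames = rng.normal(size=(10, 32))
prob_matrix = knn.predict_proba(test_frames)

print(prob_matrix.shape)        # one row per frame, one column per class
print(prob_matrix.sum(axis=1))  # each row sums to 1.0
```

The resulting frame-by-phoneme probability matrix is exactly the structure the next stage groups into phoneme segments.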


5. Text-Independent Phone-to-Audio Alignment via Classifier Outputs

Because the system only relies on these frame-by-frame probabilities (without any pre-specified phone sequence), it supports text-independent phone-to-audio alignment. Essentially, it aligns phoneme labels to audio segments without knowing the transcript in advance.
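The grouping of frame-level predictions into time-stamped phoneme segments can be sketched as a run-length merge over per-frame argmax labels. The 20 ms frame step and the label sequence below are illustrative assumptions, not values from the paper.

```python
import numpy as np

PHONEMES = ["D", "OW", "N", "T"]
FRAME_SEC = 0.02  # assume one frame every 20 ms

# Hypothetical per-frame argmax labels from the classifier.
frame_labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])

def frames_to_segments(labels, frame_sec=FRAME_SEC):
    """Merge runs of identical frame labels into (phoneme, start, end) spans."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((PHONEMES[labels[start]],
                             round(start * frame_sec, 2),
                             round(i * frame_sec, 2)))
            start = i
    return segments

print(frames_to_segments(frame_labels))
# [('D', 0.0, 0.06), ('OW', 0.06, 0.1), ('N', 0.1, 0.16), ('T', 0.16, 0.2)]
```

Because only the probability matrix is consumed, no reference transcript is needed: the segment labels and their boundaries are read directly off the classifier outputs.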


6. Multilingual Capability and Minimal Training

By leveraging self-supervised learning and forced-alignment labels, this method achieves high-quality phoneme alignment with minimal additional training, and without explicit text input. It also works across English variants (American and British), making it robust to accent variation and readily extensible to other languages.

Impact & Significance

The Flowchase system demonstrates the power of adapting large foundational models for specific educational applications. By fine-tuning a model on a relatively small amount of labeled data, we were able to create a highly effective tool that offers scalable and personalized feedback. This project's success lies in its ability to provide language learners with the kind of immediate, precise, and data-driven guidance that is often only available from a human tutor, making high-quality pronunciation training more accessible to a global audience.

Reference

This project's technical foundation is explained in greater detail in the following academic paper, published as part of the R&D project:

Tits N, Bhatnagar P, Dutoit T. Text-Independent Phone-to-Audio Alignment Leveraging SSL (TIPAA-SSL) Pre-Trained Model Latent Representation and Knowledge Transfer. Acoustics. 2024; 6(3):772-781. https://doi.org/10.3390/acoustics6030042