Tuesday 5 April 2022

Automatic Speech Recognition (Part I)

Conversational AI: Automatic Speech Recognition --> Natural Language Understanding --> Information Retrieval --> Natural Language Generation --> Text-to-Speech

Ŵ = argmax_W P(W|O) = argmax_W P(O|W) · P(W)
The acoustic model estimates the observation likelihood P(O|W).
The language model estimates the prior probability P(W).
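A minimal sketch of noisy-channel decoding over a tiny hypothesis set; the log-scores below are made-up placeholders, not outputs of a real acoustic or language model:

```python
# Pick the word sequence W maximizing P(O|W) * P(W), i.e. the sum of the
# acoustic and language-model log-scores. All numbers are illustrative.
candidates = {
    "recognize speech":   {"acoustic": -12.1, "lm": -4.2},  # log P(O|W), log P(W)
    "wreck a nice beach": {"acoustic": -11.8, "lm": -9.7},
}

best = max(candidates, key=lambda w: candidates[w]["acoustic"] + candidates[w]["lm"])
print(best)  # the language-model prior favours "recognize speech"
```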

Applications: YouTube closed captioning; Cortana/Siri/Alexa front end; dictation systems

Resources: phonemes, pronunciation dictionary, speech data (labelled for phonemes and words), word vocabulary
https://openslr.org/index.html

Challenges:

  • speech recording: quality and acoustic range of the microphone;
  • background noise;
  • speaker variation: speaking rate, age and accent; speech volume (distance between speaker and mic);
  • language-specific constraints: rich inflection, languages without a writing system;
  • code mixing; multilingual settings;
  • speed of ASR (especially in time-critical applications);
  • task complexity: identifying the speaker; separating speech from noise; listening continuously vs. push-to-talk;
  • training ASR for a new language: creating resources such as a pronunciation dictionary, learning HMM and n-gram probabilities

Components:

  • Acoustic Model: sampling; Fourier transform from the time domain to the frequency domain; pronunciation dictionary; HMM to map phonemes to words
  • Language Model: n-gram probabilities; smoothing; back-off; linear interpolation
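The language-model side can be sketched in a few lines; here is a toy bigram model with linear interpolation (the corpus and interpolation weights are illustrative assumptions, not tuned values):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)
L_UNI, L_BI = 0.3, 0.7  # interpolation weights; must sum to 1

def p_interp(w1, w2):
    # linear interpolation: P(w2|w1) = L_BI * P_bigram + L_UNI * P_unigram
    p_uni = unigrams[w2] / N
    p_bi = bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return L_BI * p_bi + L_UNI * p_uni

print(p_interp("the", "cat"))  # 0.7*(2/3) + 0.3*(2/9) ≈ 0.533
```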

time domain: amplitude plotted against time (the raw waveform)
frequency domain: the component frequencies with their corresponding amplitudes, extracted via the Fourier transform
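A short NumPy sketch of moving a signal from the time domain to the frequency domain; the tone frequencies and sampling rate are chosen purely for illustration:

```python
import numpy as np

fs = 8000                      # sampling rate in Hz (illustrative)
t = np.arange(0, 1.0, 1 / fs)  # one second of samples
# synthetic signal: a 440 Hz tone plus a quieter 880 Hz tone
x = 1.0 * np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

spectrum = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), 1 / fs)
amps = np.abs(spectrum) * 2 / len(x)   # normalize to recover component amplitudes

for f in (440, 880):
    idx = np.argmin(np.abs(freqs - f))
    print(f, round(amps[idx], 2))      # ~1.0 at 440 Hz, ~0.5 at 880 Hz
```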

Evaluation: Word Error Rate (WER) is the word-level edit distance between the reference and the model output, counting insertions, deletions and substitutions: WER = (S + D + I) / (number of words in the reference).
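WER reduces to word-level Levenshtein distance; a minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 words ≈ 0.33
```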

Vocabulary: humans learn ~30K words by middle age; in an open-vocabulary setting, map all out-of-vocabulary words to the special token <UNK>
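Open-vocabulary handling is a one-liner in practice; a toy sketch with an assumed three-word vocabulary:

```python
vocab = {"the", "cat", "sat"}
tokens = "the cat sat on the mat".split()
print([t if t in vocab else "<UNK>" for t in tokens])
# ['the', 'cat', 'sat', '<UNK>', 'the', '<UNK>']
```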

HMM: observations (phonemes), hidden states (words); decoding = find the best state sequence (Viterbi); training = estimate the transition and emission probabilities (e.g., forward-backward); evaluation = compute the likelihood of an observation sequence (forward algorithm)
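A toy Viterbi decoder for the decoding task above; the two-word state space and all probabilities are made up for illustration:

```python
import math

states = ["word_A", "word_B"]
start = {"word_A": 0.6, "word_B": 0.4}
trans = {"word_A": {"word_A": 0.7, "word_B": 0.3},
         "word_B": {"word_A": 0.4, "word_B": 0.6}}
emit = {"word_A": {"p1": 0.5, "p2": 0.5},
        "word_B": {"p1": 0.1, "p2": 0.9}}

def viterbi(obs):
    # v[s] = best log-probability of any path ending in state s
    v = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []                                   # back-pointers, one dict per step
    for o in obs[1:]:
        prev, v, ptr = v, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            v[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][o])
            ptr[s] = best
        back.append(ptr)
    path = [max(states, key=v.get)]             # best final state
    for ptr in reversed(back):
        path.append(ptr[path[-1]])              # walk the back-pointers
    return list(reversed(path))

print(viterbi(["p1", "p2", "p2"]))
```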

International Phonetic Alphabet (IPA): a standardized inventory of phones; English has only 26 letters to convey its roughly 44 phonemes;

CMU Pronunciation Dictionary (CMUdict): mapping between words and phones for North American English; ~134,000 entries; challenges: names, inflections, numbers
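CMUdict ships with NLTK; a quick lookup sketch (requires nltk installed and `nltk.download("cmudict")` on first use):

```python
from nltk.corpus import cmudict

pron = cmudict.dict()            # word -> list of pronunciations (phone lists)
print(pron["speech"])            # [['S', 'P', 'IY1', 'CH']]
print(len(pron["tomato"]))       # 2: variant pronunciations are listed separately
```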

ASR: CMU Sphinx (2015), Espresso (2019), NeMo, Whisper
TTS: WaveNet (2016), Tacotron (2017)

Signal basics: frequency corresponds to perceived pitch; amplitude to loudness; the waveform is the raw time-domain signal.
Front-end pipeline: digitization (sampling, quantization); spectrum computed using the Fourier transform; spectrogram; spectral features as vectors (LPC, PLP); phone likelihood estimation (using neural networks)
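A sketch of this front end up to the spectrogram using SciPy; "utterance.wav" is a placeholder filename and the frame parameters are illustrative (digitization has already happened when the file was recorded):

```python
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, samples = wavfile.read("utterance.wav")  # fs = sampling rate in Hz; assumes mono
# 25 ms frames with 15 ms overlap at fs = 16 kHz -> nperseg=400, noverlap=240
freqs, times, sxx = spectrogram(samples, fs=fs, nperseg=400, noverlap=240)
# each column of sxx is one frame's spectrum: the spectral vector that later
# stages (LPC/PLP features, phone likelihood estimation) build on
print(sxx.shape)
```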

How to train an HMM?
How to remove background noise?
How to build ASR for different dialects of English?
Do we need a resource that is labelled for phonemes?
How to map speech to phonemes?
Why not build end-to-end ASR systems?

On Construction of the ASR-oriented Indian English Pronunciation Dictionary

https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html