Ŵ = argmax_W P(O|W) · P(W)   (noisy-channel decoding: the most likely word sequence W given acoustic observations O)
The acoustic model estimates the observation likelihood P(O|W).
The language model estimates the prior probability P(W).
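The decoding rule above can be sketched as follows. The scores here are hypothetical toy log-probabilities; in a real system log P(O|W) comes from the acoustic model and log P(W) from the language model.

```python
import math

# Hypothetical toy scores for two competing transcriptions of one utterance.
acoustic_log_prob = {"recognize speech": -12.0, "wreck a nice beach": -11.5}  # log P(O|W)
lm_log_prob = {"recognize speech": -4.0, "wreck a nice beach": -9.0}          # log P(W)

def decode(candidates):
    # argmax over W of log P(O|W) + log P(W): the acoustic model slightly
    # prefers the wrong string, but the language model overrules it.
    return max(candidates, key=lambda w: acoustic_log_prob[w] + lm_log_prob[w])

best = decode(acoustic_log_prob)
```

Working in log space turns the product P(O|W)·P(W) into a sum and avoids numerical underflow.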
Applications: YouTube closed captioning; Cortana/Siri/Alexa front end; dictation systems
Resources: phonemes, pronunciation dictionary, speech data (labelled for phonemes and words), word vocabulary
https://openslr.org/index.html
Challenges:
- speech recording: quality and acoustic range of the mic;
- background noise;
- speaker: speed, age and accent of pronunciation; volume of speech (distance between speaker and mic)
- language-specific constraints: inflections, no writing system;
- code mixing; multilingual setting;
- speed of ASR (esp. in time-critical applications);
- task complexity: identify speaker; account for speech-noise; listen continuously or push-to-talk;
- train ASR for new languages: create resources such as a pronunciation dictionary; learn HMM and n-gram probabilities
Components: Acoustic Model (sampling, Fourier transform from time-domain to frequency-domain, pronunciation dictionary, HMM to map phonemes to word), Language Model (n-gram probabilities, smoothing, back-off, linear interpolation)
time-domain: amplitude vs. time plot (waveform); frequency-domain: extract component frequencies with the corresponding amplitudes
Evaluation: Word Error Rate (WER) is a measure of the word-level edit distance in terms of insertions, deletions and substitutions between the reference and the model output.
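WER as defined above is the Levenshtein (edit) distance over words, normalized by the reference length. A minimal dynamic-programming sketch:

```python
def wer(ref, hyp):
    """Word Error Rate: (substitutions + deletions + insertions) / len(ref)."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # deleting i reference words
    for j in range(len(h) + 1):
        d[0][j] = j          # inserting j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions.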
Vocabulary: humans learn ~30K words by middle age; in case of an open-vocabulary setting, consider all unknown words as the special token <UNK>
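The open-vocabulary convention above amounts to a simple preprocessing step; the vocabulary here is a hypothetical closed word list.

```python
vocab = {"the", "cat", "sat"}  # hypothetical closed vocabulary

def map_unk(tokens, vocab):
    # Replace every out-of-vocabulary token with the special <UNK> token
    # so the language model can assign it a probability.
    return [t if t in vocab else "<UNK>" for t in tokens]
```

In practice the <UNK> token also receives its own n-gram counts during training, typically by mapping rare training words to <UNK> first.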
International Phonetic Alphabet (IPA): standardized list of phones; English uses 26 letters to convey roughly 44 phonemes;
CMU Pronouncing Dictionary (CMUdict): mapping between words and phones for North American English; ~134,000 entries; challenges: names, inflections, numbers
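A pronunciation dictionary lookup is essentially a word-to-phone-sequence map. The two entries below are illustrative ARPAbet-style samples, not lines from the actual CMUdict file.

```python
# Toy word -> phone-sequence entries in CMUdict's ARPAbet style
# (illustrative samples only; the real dictionary has ~134,000 entries).
toy_dict = {
    "speech": ["S", "P", "IY1", "CH"],
    "lab": ["L", "AE1", "B"],
}

def phones(word):
    # Returns None for out-of-dictionary words (e.g. proper names, numbers),
    # which is exactly the challenge noted above.
    return toy_dict.get(word.lower())
```

Out-of-dictionary words are usually handled by a grapheme-to-phoneme (G2P) model that guesses a pronunciation from spelling.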
ASR: CMU Sphinx (2015), Espresso (2019), NeMo, Whisper
TTS: WaveNet (2016), Tacotron (2017)
perceptual correlates: frequency (pitch); amplitude (loudness)
waveform; digitization (sampling, quantization); spectrum computed using Fourier transform; spectrogram; spectral features as vectors (LPC, PLP); phone likelihood estimation (using neural networks);
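The digitization-to-spectrogram steps above can be sketched with a short-time Fourier transform. The frame/hop sizes (25 ms / 10 ms) are common defaults, assumed here rather than taken from the notes.

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Magnitude spectrogram: slice the waveform into overlapping windowed
    frames, then map each frame to the frequency domain with an FFT."""
    frame = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame)  # taper frame edges to reduce spectral leakage
    frames = [signal[i:i + frame] * window
              for i in range(0, len(signal) - frame + 1, hop)]
    # rfft extracts the component frequencies (with amplitudes) of each frame
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t), sr)
```

Each row of spec is one time frame; the frequency resolution is sample_rate / frame_length (here 40 Hz per bin), so the 440 Hz tone peaks in bin 11. Real front ends go further, e.g. mel filterbanks or MFCC/PLP features.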
How to train an HMM?
How to remove background noise?
How to build ASR for different dialects of English?
Do we need a resource that is labelled for phonemes?
How to map speech to phonemes?
Why not build end-to-end ASR systems?
On Construction of the ASR-oriented Indian English Pronunciation Dictionary
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html