Tuesday 23 January 2024

Automatic Speech Recognition (Part II)

WFST

decoding/recognition

 

A lattice is a representation of the alternative word sequences that are "sufficiently likely" for a particular utterance. To understand lattices properly, you first have to understand decoding graphs in the WFST framework.
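To make the idea concrete, here is a toy sketch of a lattice as a small graph whose arcs carry a word and a cost. The states, words, and costs are invented for illustration, and this is not the lattice format of any particular toolkit. It assumes costs behave like negative log-probabilities, so they add along a path and the lowest total is the most likely hypothesis.

```python
# Toy lattice: state -> list of (next_state, word, cost).
# All values are made up for illustration; lower cost means "more likely".
LATTICE = {
    0: [(1, "recognize", 1.0), (2, "wreck", 1.5)],
    1: [(4, "speech", 0.5), (4, "beach", 1.2)],
    2: [(3, "a", 0.2)],
    3: [(5, "nice", 0.4)],
    5: [(4, "beach", 0.9)],
    4: [],
}
START, FINAL = 0, 4

def all_paths(state=START, words=(), cost=0.0):
    """Yield (total_cost, sentence) for every path from START to FINAL."""
    if state == FINAL:
        yield cost, " ".join(words)
        return
    for next_state, word, arc_cost in LATTICE[state]:
        yield from all_paths(next_state, words + (word,), cost + arc_cost)

# Sort the alternative word sequences from most to least likely.
hypotheses = sorted(all_paths())
best_cost, best_sentence = hypotheses[0]

for cost, sentence in hypotheses:
    print(f"{cost:.1f}  {sentence}")
```

Enumerating every path only works for a tiny graph like this one. Real decoders search the graph instead of listing all paths.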

 

SPEECH RECOGNITION WITH WEIGHTED FINITE-STATE TRANSDUCERS

https://cs.nyu.edu/~mohri/pub/hbka.pdf

Audio Intro

https://ketanhdoshi.github.io/Audio-Intro/

 

Sound: variation in air pressure over time. A waveform plots that variation against time.

 

  • Amplitude is loudness (or "volume"). High amplitude is loud, low amplitude is quiet. We measure loudness in decibels (dB).
  • Frequency is pitch. High frequency is a high-pitched sound, low frequency is, well, low. We measure frequency in hertz (Hz) and kilohertz (kHz), which is thousands of hertz.
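A minimal sketch of both properties, using pure tones generated with the standard library. The function names, the 8000 Hz sample rate, and the reference amplitude of 1.0 are assumptions chosen for illustration. The decibel value here is relative to that reference, not an absolute sound pressure level.

```python
import math

def sine_wave(freq_hz, amplitude, sample_rate=8000, n_samples=80):
    """Samples of a pure tone: amplitude * sin(2*pi*freq*t)."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

def amplitude_to_db(amplitude, reference=1.0):
    """Level of an amplitude relative to a reference, in decibels."""
    return 20 * math.log10(amplitude / reference)

quiet = sine_wave(freq_hz=440, amplitude=0.1)  # same pitch, lower volume
loud = sine_wave(freq_hz=440, amplitude=1.0)
high = sine_wave(freq_hz=880, amplitude=1.0)   # same volume, one octave higher

# A tenfold drop in amplitude is a drop of roughly 20 dB.
print(amplitude_to_db(0.1))
```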

 

Fourier Transform: converts audio from the time domain (amplitude vs. time) to the frequency domain (amplitude vs. frequency).

 

The Spectrum plots all of the frequencies that are present in the signal along with the strength or amplitude of each frequency.
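As a rough illustration of the spectrum, the sketch below builds a signal from two tones and recovers their frequencies with a naive discrete Fourier transform. The sample rate, tone frequencies, and function name are assumptions picked so that each tone falls exactly on a frequency bin. In practice you would use an FFT routine from a numerical library instead of this O(n²) loop.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each frequency bin from 0 Hz up to half the sample rate."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        magnitudes.append(abs(coeff) / n)
    return magnitudes

SAMPLE_RATE = 800  # samples per second (assumed)
N = 80             # number of samples, so each bin is 800 / 80 = 10 Hz wide

# A 50 Hz tone plus a quieter 120 Hz tone.
signal = [1.0 * math.sin(2 * math.pi * 50 * t / SAMPLE_RATE)
          + 0.5 * math.sin(2 * math.pi * 120 * t / SAMPLE_RATE)
          for t in range(N)]

mags = dft_magnitudes(signal)
bin_hz = SAMPLE_RATE / N

# Pick the two strongest bins and convert bin index to frequency.
strongest = sorted(sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2])
peak_freqs = [k * bin_hz for k in strongest]
print(peak_freqs)  # [50.0, 120.0]
```

The bin heights also reflect the strength of each tone: the 50 Hz bin is twice as tall as the 120 Hz bin, matching the 1.0 vs. 0.5 amplitudes.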

It is useful to convert an audio waveform into an image such as a spectrogram so that a CNN architecture can extract features from it.
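A spectrogram can be sketched as a Fourier transform applied to short overlapping frames, which gives a 2-D grid of magnitudes (time by frequency) that can be treated like an image. The frame length, hop size, Hann window, and test signal below are assumptions for illustration. Real pipelines typically use a library STFT and often add mel scaling and a log.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window, used to taper each frame before the DFT."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def spectrogram(samples, frame_len=64, hop=32):
    """Return a list of frames; each frame is a list of bin magnitudes."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / frame_len)
                        for i, x in enumerate(frame))
            row.append(abs(coeff))
        frames.append(row)
    return frames

SAMPLE_RATE = 1600  # assumed; with 64-sample frames each bin is 25 Hz wide

# A signal whose pitch changes halfway through: 100 Hz, then 300 Hz.
signal = ([math.sin(2 * math.pi * 100 * t / SAMPLE_RATE) for t in range(256)]
          + [math.sin(2 * math.pi * 300 * t / SAMPLE_RATE) for t in range(256)])

spec = spectrogram(signal)

# Strongest frequency bin in each frame, converted to Hz.
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
peak_hz = [k * SAMPLE_RATE / 64 for k in peak_bins]
print(len(spec), len(spec[0]))
print(peak_hz[0], peak_hz[-1])
```

Here `spec` is indexed as [time][frequency]. Plotted as an image it is usually transposed so that time runs along the x-axis and frequency along the y-axis, and the jump in pitch shows up as a step in the bright band.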