Digital Speech Processing
CZR, Acad. year 2005/2006, Summer semester, 5 credits
Language of instruction
Course Web Pages
Subject specific learning outcomes and competences
Generic learning outcomes and competences
Prerequisite knowledge and skills
- Krčmová, N.: Fonetika a fonologie: zvuková stavba současné češtiny. ISBN 80-210-0137-2. Masarykova univerzita, Brno, 1990
- Rabiner, L., Juang, B.H.: Fundamentals of speech recognition, Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1993
- Rabiner, L.R., Schafer, R.W.: Digital processing of speech signals, Prentice Hall, 1978
- Psutka, J.: Komunikace s počítačem mluvenou řečí. Academia, Praha, 1995. (in Czech, available in FIT library).
- Gold, B., Morgan, N.: Speech and audio signal processing, John Wiley and Sons, 2000. (available in FIT library).
- Young, S., Jansen, J., Odell, J., Ollason, D., Woodland, P.: The HTK book, Entropic Cambridge Research Lab., 1996, Cambridge, UK. Excellent introduction to HMMs, free download at http://htk.eng.cam.ac.uk/
- http://www.fit.vutbr.cz/~cernocky/speech/ - lecture notes, labs, functions. This page's going to grow...
- http://www.fit.vutbr.cz/~cernocky/oldspeech/ - lecture notes, labs, functions. Old version, but especially some labs (everything in Matlab) might be interesting.
Syllabus of lectures
- Organization of the course, applications, sciences related to the topic, information carried by speech, demonstrations.
- Digital processing of speech signals: recording - sampling, quantization. Speech spectra - continuous Fourier transform; what do we get when we sample. Discrete Fourier transform. Random signals, power spectral density. Modification of speech - linear filters. Frequency response of a filter.
- Pre-processing of speech: dc removal, preemphasis, frames, basic parameters. Spectrogram. Speech production: articulatory organs - vocal cords and vocal tract vs. excitation and filter. Characteristics in time and frequency, influence of excitation and filter. What can be seen on long- and short-term spectrograms. How to separate excitation and filter: cepstrum, MFCC.
- Linear-predictive model: what is it good for? Separation of vocal tract characteristics from excitation - applications in coding and recognition. Prediction of a sample from past samples - linear prediction (LP). Error of LP. Obtaining the error using a single filter. Determination of vocal tract characteristics using LP analysis. Spectrum estimated by LP. Features derived from LP - LAR and LSF. LPC-cepstrum.
- Determination of fundamental frequency (F0). Terminology. Characteristics of F0 of males, females and kids. Use in speech processing systems. Methods based on autocorrelation function. NCCF. Long-term predictor and cepstral analysis for F0 determination. Reliability and problems of F0 detectors.
- Coding I.: Aims of coding. Bit-rate, objective and subjective measurements of quality. Classification of coders according to bit-rate. Waveform coders. Vocoders - LPC. Vector quantization in speech coding.
- Coding II. - CELP, Coding in GSM networks: GSM, GSM-EFR, GSM-HR, Voice over IP. Introduction to speech recognition - the task, classification of recognizers: isolated words - connected words - continuous speech, speaker dependent - speaker independent. Basic function blocks. Voice activity detection (VAD) for isolated words.
- Recognition using DTW. Recognition based on distance of speech frames - various definitions of distance. Timing: linear modification, dynamic programming (Dynamic Time Warping, DTW). Hidden Markov models (HMM I.): Introduction, motivations and relation to DTW. Structure of the model, Gaussian distributions, state sequences.
- HMM II. probability of a sequence of states, Baum-Welch and Viterbi probabilities. Training of models: Baum-Welch, recognition: Viterbi. Token passing. Connected words.
- HMM III. Continuous speech with large vocabulary: recognition of small units - phonemes... Phonetics: vowels and consonants, characteristics, classification of phonemes. International phoneme alphabets: IPA, SAMPA, TIMIT. Co-articulation. Applications in recognition: context-dependent triphones. Large vocabulary, Language modeling, lattice rescoring, forced alignment [Martin Karafiát].
- Features for recognition [Lukáš Burget, Petr Schwarz, Pavel Matějka]. What do we need: suppression of pitch, de-correlation, link with spectral envelope. How do we reach it: LPCC, MFCC, de-correlation: PCA, LDA, HLDA, channel robustness: normalization. Further tricks with features - delta, delta-delta. "Hot topics" in feature extraction: TRAPs and FeatureNet, neural nets. Tools for speech processing.
- Speech synthesis: structure of the synthesizer. Conversion of written text to speech: text-to-speech. Text normalization. Prosody (melody, accents, timing) in synthesis. Units for synthesis - manual and automatic selection, corpus-based synthesis. Generation of signal in time and frequency domains: PSOLA and HNM. Applications, SW for synthesis: EPOS, MBROLA, Festival.
- Further topics in speech processing:
  - Speaker identification/verification (principles, false acceptance, false rejection, cost function, optimal operating point, EER) [Černocký]
  - Phoneme recognition [Petr Schwarz, Petr Jenderka]
  - LVCSR [Martin Karafiát]
  - Recognizer merging [Lukáš Burget]
  - Very Low Bit Rate coding [Petr Motlíček, Černocký]
  - Audio-video recognition [Petr Motlíček]
  - Speech databases [Černocký]
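For orientation, the DTW-based recognition of isolated words named in the lecture syllabus can be sketched as follows. This is only a minimal illustration and not course material: the function names (`euclidean`, `dtw_distance`, `recognize`), the Euclidean frame distance, the unweighted horizontal/vertical/diagonal steps and the toy one-dimensional "feature vectors" are all assumptions chosen for brevity.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(ref, test, dist=euclidean):
    """Accumulated DTW cost between two sequences of feature vectors.

    Assumption: horizontal, vertical and diagonal steps are allowed,
    all with weight 1; other local constraints are equally possible.
    """
    n, m = len(ref), len(test)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    # cost[i][j] = cheapest accumulated distance aligning ref[:i] with test[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # ref frame advances
                                 cost[i][j - 1],      # test frame advances
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(templates, test):
    """Return the label of the template with the smallest DTW cost."""
    return min(templates, key=lambda label: dtw_distance(templates[label], test))


if __name__ == "__main__":
    # Toy templates: each "word" is a short sequence of 1-D feature vectors.
    templates = {
        "yes": [[0.0], [1.0], [2.0], [1.0]],
        "no": [[2.0], [2.0], [0.0], [0.0]],
    }
    # A time-stretched version of the "yes" template.
    test = [[0.0], [0.0], [1.0], [2.0], [2.0], [1.0]]
    print(recognize(templates, test))
```

In a real recognizer the frames would be feature vectors such as MFCCs, and the accumulated cost is usually normalized by the path length before templates of different durations are compared.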
Syllabus of numerical exercises
Syllabus - others, projects and individual work of students
- 4 projects at 8 pts. each - 32
- mid-semester exam - theoretical questions only - 18
- semester exam - theory and numerical examples - 50
- All materials are permitted in both exams.
- Projects: for each project, software and short documentation (how to compile, how to run, which algorithms are used) should be supplied.