Course details

Digital Speech Processing

CZR Acad. year 2006/2007 Summer semester 5 credits

Applications of speech processing, digital processing of speech signals, production and perception of speech, introduction to phonetics, pre-processing and basic parameters of speech, linear-predictive model, cepstrum, fundamental frequency estimation, coding (time domain and vocoders), recognition (DTW and HMM), synthesis. Software and libraries for speech processing.

Guarantor

Černocký Jan, prof. Dr. Ing. (DCGM)

Language of instruction

Czech

Completion

Examination

Time span

26 hrs lectures
2 hrs exercises
12 hrs PC labs
12 hrs projects

Department

Department of Computer Graphics and Multimedia (UPGM)

Subject specific learning outcomes and competences

Students will become familiar with the principal methods and algorithms of speech signal processing. They will be able to design a simple speech processing system (speech activity detector, recognizer of a limited number of isolated words), including its implementation in application programs.

The students will deepen their knowledge of signal processing. They will acquire new skills in the mathematical and visualization software Matlab and in the practical use of C/C++. During projects, they will gain experience with independent development work.

Learning objectives

To provide students with knowledge of the basic characteristics of the speech signal in relation to human speech production and hearing. To describe basic algorithms of speech analysis common to many applications. To give an overview of applications (recognition, synthesis, coding) and of the practical aspects of implementing speech algorithms.

Prerequisite knowledge and skills

Basic knowledge of signal processing.

Study literature

Krčmová, N.: Fonetika a fonologie: zvuková stavba současné češtiny. ISBN 80-210-0137-2. Masarykova univerzita, Brno, 1990
Rabiner, L., Juang, B.H.: Fundamentals of speech recognition, Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1993
Rabiner, L.R., Schafer, R.W.: Digital processing of speech signals, Prentice Hall, 1978

Fundamental literature

Psutka, J.: Komunikace s počítačem mluvenou řečí. Academia, Praha, 1995. Available in the FIT library.
Gold, B., Morgan, N.: Speech and audio signal processing, John Wiley and Sons, 2000. Available in the FIT library.
Young, S., Jansen, J., Odell, J., Ollason, D., Woodland, P.: The HTK book, Entropic Cambridge Research Lab., 1996, Cambridge, UK. An excellent introduction to HMMs, available for download at http://htk.eng.cam.ac.uk/
http://www.fit.vutbr.cz/~cernocky/speech/ - lectures, exercises, functions. Material will be added here gradually.
http://www.fit.vutbr.cz/~cernocky/oldspeech/ - lectures, exercises, functions. Old version, but some of the labs in particular (all in Matlab) may be of interest.

Syllabus of lectures

Organization of the course, applications, sciences related to the topic, information carried by speech, demonstrations.
Digital processing of speech signals: recording - sampling, quantization. Speech spectra - continuous Fourier transform; what do we get when we sample. Discrete Fourier transform. Random signals, power spectral density. Modification of speech - linear filters. Frequency response of a filter.
Pre-processing of speech: dc removal, preemphasis, frames, basic parameters. Spectrogram. Speech production: articulatory organs - vocal cords and vocal tract vs. excitation and filter. Characteristics in time and frequency, influence of excitation and filter. What can be seen on long- and short-term spectrograms. How to separate excitation and filter: cepstrum, MFCC.
Linear-predictive model: what is it good for? Separation of vocal tract characteristics from excitation - applications in coding and recognition. Prediction of a sample from past samples - linear prediction (LP). Error of LP. Obtaining the error using a single filter. Determination of vocal tract characteristics using LP analysis. Spectrum estimated by LP. Features derived from LP - LAR and LSF. LPC-cepstrum.
Determination of fundamental frequency (F0). Terminology. Characteristics of F0 of males, females and children. Use in speech processing systems. Methods based on the autocorrelation function. NCCF. Long-term predictor and cepstral analysis for F0 determination. Reliability and problems of F0 detectors.
Coding I.: Aims of coding. Bit-rate, objective and subjective measurements of quality. Classification of coders according to bit-rate. Waveform coders. Vocoders - LPC. Vector quantization in speech coding.
Coding II. - CELP, Coding in GSM networks: GSM, GSM-EFR, GSM-HR, Voice over IP. Introduction to speech recognition - the task, classification of recognizers: isolated words - connected words - continuous speech, speaker dependent - speaker independent. Basic function blocks. Voice activity detection (VAD) for isolated words.
Recognition using DTW. Recognition based on distance of speech frames - various definitions of distance. Timing: linear modification, dynamic programming (Dynamic Time Warping, DTW). Hidden Markov models (HMM I.): Introduction, motivations and relation to DTW. Structure of the model, Gaussian distributions, state sequences.
HMM II.: Probability of a sequence of states, Baum-Welch and Viterbi probabilities. Training of models: Baum-Welch, recognition: Viterbi. Token passing. Connected words.
HMM III. Continuous speech with large vocabulary: recognition of small units - phonemes... Phonetics: vowels and consonants, characteristics, classification of phonemes. International phoneme alphabets: IPA, SAMPA, TIMIT. Co-articulation. Applications in recognition: context-dependent triphones. Large vocabulary, Language modeling, lattice rescoring, forced alignment [Martin Karafiát].
Features for recognition [Lukáš Burget, Petr Schwarz, Pavel Matějka]. What do we need: suppression of pitch, de-correlation, link with spectral envelope. How do we reach it: LPCC, MFCC, de-correlation: PCA, LDA, HLDA, channel robustness: normalization. Further tricks with features - delta, delta-delta. "Hot topics" in feature extraction: TRAPs and FeatureNet, neural nets. Tools for speech processing.
Speech synthesis.
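
As an illustration of the DTW lecture above (recognition based on the distance of speech frames, with timing handled by dynamic programming), here is a minimal sketch. It is written in Python for brevity, whereas the course itself works in Matlab and C. The function names `dtw_distance` and `recognize`, the Euclidean frame distance and the three allowed path steps are assumptions made for this example, not the course's reference implementation.

```python
import math


def dtw_distance(ref, test):
    """Accumulated DTW distance between two sequences of feature vectors."""
    n, m = len(ref), len(test)
    # acc[i][j] = cost of the best path aligning ref[:i] with test[:j]
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance of two frames (Euclidean, one of several options)
            local = math.dist(ref[i - 1], test[j - 1])
            # Allowed steps: vertical, horizontal, diagonal
            acc[i][j] = local + min(acc[i - 1][j],
                                    acc[i][j - 1],
                                    acc[i - 1][j - 1])
    return acc[n][m]


def recognize(references, test):
    """Return the label of the reference with the smallest DTW distance."""
    return min(references, key=lambda word: dtw_distance(references[word], test))
```

A recognizer of isolated words could then call `recognize({"one": frames_one, "two": frames_two}, frames_test)`, where each value is a list of per-frame feature vectors. The sketch does not normalize the distance by path length, which a practical system would usually add when references differ in duration.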

Syllabus of numerical exercises

Numerical exercise 3 hrs: digital filter, LPC, DTW, HMM, spectrogram reading.

Syllabus of computer exercises

Speech processing in Matlab: reading/writing of speech files, basic operations, recording of speech.
Signal processing in Matlab: design of filter, poles, zeros, frequency response, filtering, spectral analysis: FT, PSD.
Speech in C - class for input of speech. PROJECT 1: Simple frequency analyzer using FFT (will be supplied), output using ASCII characters, height of a column corresponds to energy in a frequency band.
LPC in C: Correlation, Levinson and Durbin, short-term energy. Check with Matlab on a speech file. Preparation for coding - storing to well-defined structure.
NCCF and fundamental frequency detection, Matlab, C. Threshold determination. Storing to the structure. Advanced: median smoothing of estimates.
PROJECT 2: full LPC coder and decoder in C (without quantization of parameters). Advanced: speech output on-line using OSS (self-study).
Preparation for recognition: LPCC, voice activity detection, storing of speech files (samples and features for training of HMM and as references for DTW), preparation for the calling of recognizer.
PROJECT 3: full on-line recognizer based on DTW.
HTK - creation of a small database of numerals, work with HMMs in HTK: prototypes, training, recognition, evaluation. Models must be stored (they will be needed for project No. 4).
Preparation for HMM recognition: experience with the decoder written by Lukáš Burget - reading of models. A function for MFCC computation will be supplied, check with HTK.
PROJECT 4: HMM recognizer: writing of code for computation of output probability and Viterbi decoder using token-passing. Interfacing with voice-activity detector. Advanced: multi-threading (first thread records, second extracts features, third performs VAD, fourth recognizes).
Synthesis: database with phone labels available, synthesis from text using concatenation. Advanced: use of HNM synthesis.
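
The LPC exercise above (correlation, Levinson and Durbin, short-term energy) can be sketched as follows. The sketch is in Python for compactness, although the exercise itself is done in C and checked against Matlab. The function names and the sign convention A(z) = 1 + a[1] z^-1 + ... + a[P] z^-P for the error filter are assumptions of this example.

```python
def autocorrelation(frame, order):
    """Autocorrelation coefficients r[0..order] of one speech frame.
    r[0] is also the short-term energy of the frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + k] for t in range(n - k))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Levinson-Durbin recursion.
    Returns (a, err): a[0] = 1 and a[1..order] are the coefficients of the
    error filter A(z); err is the energy of the prediction error."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: keep remaining coefficients at 0
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient of stage i
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err
```

For a frame `s`, `levinson_durbin(autocorrelation(s, 10), 10)` would give a 10th-order model. The order 10 is only an illustrative choice, not a value prescribed by the course.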

Progress assessment

Study evaluation is based on marks obtained for specified items. The minimum number of marks to pass is 50.

Controlled instruction

4 projects at 8 pts. each - 32
half-semestral exam - theoretical questions only - 18
semestral exam - theory and numerical examples - 50

All materials are permitted in both exams.
Projects: for each project, software and short documentation (how to compile, how to run, which algorithms are used) should be supplied.

Passing boundary for ECTS assessment - 50 points