Faculty of Information Technology, BUT

Course details

Digital Speech Processing

CZR, Acad. year 2005/2006, Summer semester, 5 credits

Applications of speech processing, digital processing of speech signals, production and perception of speech, introduction to phonetics, pre-processing and basic parameters of speech, linear-predictive model, cepstrum, fundamental frequency estimation, coding (time domain and vocoders), recognition (DTW and HMM), synthesis. Software and libraries for speech processing.

Guarantor

Language of instruction

Czech

Completion

Examination

Time span

26 hrs lectures, 2 hrs exercises, 12 hrs PC labs, 12 hrs projects

Assessment points

50 exam, 25 half-term test, 12 labs, 13 projects

Department

Lecturer

Course Web Pages

Subject specific learning outcomes and competences

Students will become familiar with the principal methods and algorithms of speech signal processing. They will be able to design a simple speech processing system (speech activity detector, recognizer of a limited number of isolated words) and implement it in application programs.

Generic learning outcomes and competences

Students will deepen their knowledge of signal processing. They will acquire new skills in the mathematical and visualization software Matlab and in the practical use of C/C++. Through the projects, they will gain experience with independent development work.

Learning objectives

To provide students with knowledge of the basic characteristics of the speech signal in relation to human speech production and hearing. To describe the basic algorithms of speech analysis common to many applications. To give an overview of applications (recognition, synthesis, coding) and to cover the practical aspects of implementing speech algorithms.

Prerequisite knowledge and skills

Basic knowledge of signal processing.

Study literature

  • Krčmová, N.: Fonetika a fonologie: zvuková stavba současné češtiny. ISBN 80-210-0137-2. Masarykova univerzita, Brno, 1990
  • Rabiner, L., Juang, B.H.: Fundamentals of speech recognition, Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1993
  • Rabiner, L.R., Schafer, R.W.: Digital processing of speech signals, Prentice Hall, 1978

Fundamental literature

  • Psutka, J.: Komunikace s počítačem mluvenou řečí. Academia, Praha, 1995. (in Czech, available in FIT library).
  • Gold, B., Morgan, N.: Speech and audio signal processing, John Wiley and Sons, 2000. (available in FIT library).
  • Young, S., Jansen, J., Odell, J., Ollason, D., Woodland, P.: The HTK book, Entropic Cambridge Research Lab., 1996, Cambridge, UK. Excellent introduction to HMMs, free download at http://htk.eng.cam.ac.uk/
  • http://www.fit.vutbr.cz/~cernocky/speech/ - lecture notes, labs, functions. This page will keep growing.
  • http://www.fit.vutbr.cz/~cernocky/oldspeech/ - lecture notes, labs, functions. Old version, but especially some labs (everything in Matlab) might be interesting.

Syllabus of lectures

  1. Organization of the course, applications, sciences related to the topic, information carried by speech, demonstrations.
  2. Digital processing of speech signals: recording - sampling, quantization. Speech spectra - continuous Fourier transform; what do we get when we sample. Discrete Fourier transform. Random signals, power spectral density. Modification of speech - linear filters. Frequency response of a filter.
  3. Pre-processing of speech: DC removal, pre-emphasis, frames, basic parameters. Spectrogram. Speech production: articulatory organs - vocal cords and vocal tract vs. excitation and filter. Characteristics in time and frequency, influence of excitation and filter. What can be seen on long- and short-term spectrograms. How to separate excitation and filter: cepstrum, MFCC.
  4. Linear-predictive model: what is it good for? Separation of vocal tract characteristics from excitation - applications in coding and recognition. Prediction of a sample from past samples - linear prediction (LP). Error of LP. Obtaining the error using a single filter. Determination of vocal tract characteristics using LP analysis. Spectrum estimated by LP. Features derived from LP - LAR and LSF. LPC-cepstrum.
  5. Determination of fundamental frequency (F0). Terminology. Characteristics of F0 of males, females and children. Use in speech processing systems. Methods based on the autocorrelation function. NCCF. Long-term predictor and cepstral analysis for F0 determination. Reliability and problems of F0 detectors.
  6. Coding I.: Aims of coding. Bit-rate, objective and subjective measurements of quality. Classification of coders according to bit-rate. Waveform coders. Vocoders - LPC. Vector quantization in speech coding.
  7. Coding II. - CELP, Coding in GSM networks: GSM, GSM-EFR, GSM-HR, Voice over IP. Introduction to speech recognition - the task, classification of recognizers: isolated words - connected words - continuous speech, speaker dependent - speaker independent. Basic function blocks. Voice activity detection (VAD) for isolated words.
  8. Recognition using DTW. Recognition based on distance of speech frames - various definitions of distance. Timing: linear modification, dynamic programming (Dynamic Time Warping, DTW). Hidden Markov models (HMM I.): Introduction, motivations and relation to DTW. Structure of the model, Gaussian distributions, state sequences.
  9. HMM II.: Probability of a sequence of states, Baum-Welch and Viterbi probabilities. Training of models: Baum-Welch, recognition: Viterbi. Token passing. Connected words.
  10. HMM III.: Continuous speech with large vocabulary: recognition of small units - phonemes... Phonetics: vowels and consonants, characteristics, classification of phonemes. International phoneme alphabets: IPA, SAMPA, TIMIT. Co-articulation. Applications in recognition: context-dependent triphones. Large vocabulary, language modeling, lattice rescoring, forced alignment [Martin Karafiát].
  11. Features for recognition [Lukáš Burget, Petr Schwarz, Pavel Matějka]. What do we need: suppression of pitch, de-correlation, link with spectral envelope. How do we reach it: LPCC, MFCC, de-correlation: PCA, LDA, HLDA, channel robustness: normalization. Further tricks with features - delta, delta-delta. "Hot topics" in feature extraction: TRAPs and FeatureNet, neural nets. Tools for speech processing.
  12. Speech synthesis: structure of the synthesizer. Conversion of written text to speech: text-to-speech. Text normalization. Prosody (melody, accents, timing) in synthesis. Units for synthesis - manual and automatic selection, corpus-based synthesis. Generation of signal in time and frequency domains: PSOLA and HNM. Applications, SW for synthesis: EPOS, MBROLA, Festival.
  13. Further topics in speech processing:
    • Speaker identification/verification (principles, false acceptance, false rejection, cost function, optimal operating point, EER) [Černocký]
    • Phoneme recognition [Petr Schwarz, Petr Jenderka]
    • LVCSR [Martin Karafiát]
    • Recognizer merging [Lukáš Burget]
    • Very Low Bit Rate coding [Petr Motlíček, Černocký]
    • audio-video recognition [Petr Motlíček]
    • speech databases [Černocký].
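
As an illustration of the dynamic programming named in lecture 8, the following is a minimal sketch of DTW-based isolated-word recognition. It is not taken from the course materials: the function names dtw_distance and recognize, the Euclidean frame distance, the symmetric step pattern (horizontal, vertical, diagonal) and the absence of path-length normalization are all assumptions made for this example, and Python is used here only for brevity (the course itself works with Matlab and C/C++).

```python
import math


def dtw_distance(ref, test):
    """Accumulated DTW cost between two sequences of feature vectors.

    Assumptions of this sketch: Euclidean local distance, steps from
    (i-1, j), (i, j-1) and (i-1, j-1), no normalization by path length.
    """
    n, m = len(ref), len(test)
    # acc[i][j] = cheapest accumulated cost of aligning ref[:i] with test[:j]
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(ref[i - 1], test[j - 1])  # frame-to-frame distance
            acc[i][j] = local + min(acc[i - 1][j],      # ref frame advances
                                    acc[i][j - 1],      # test frame advances
                                    acc[i - 1][j - 1])  # both advance
    return acc[n][m]


def recognize(templates, test):
    """Return the label of the template with the smallest DTW cost.

    `templates` is a hypothetical dict mapping a word label to one
    reference sequence of feature vectors.
    """
    return min(templates, key=lambda label: dtw_distance(templates[label], test))
```

With one-dimensional "features", a time-stretched copy of a template aligns at zero cost, so recognize({"yes": [(0.0,), (1.0,), (2.0,)], "no": [(2.0,), (1.0,), (0.0,)]}, [(0.0,), (1.0,), (1.0,), (2.0,)]) selects "yes". A practical recognizer would use real feature vectors (for example MFCC frames) and usually normalizes the cost by the path length.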

Syllabus of numerical exercises

Numerical exercise (3 hrs): digital filter, LPC, DTW, HMM, spectrogram reading.

Syllabus - others, projects and individual work of students

See the program of computer labs.

Progress assessment

  1. 4 projects, 8 pts each - 32
  2. mid-semester exam - theoretical questions only - 18
  3. semester exam - theory and numerical examples - 50
  • All materials are permitted in both exams.
  • Projects: for each project, software and short documentation (how to compile, how to run, which algorithms are used) should be supplied.
Passing boundary for ECTS assessment - 50 points