Thesis Details

Adaptace jazykového modelu na cílovou doménu využívající stahování veřejných dat

Master's Thesis
Student: Gregušová Sabína
Academic Year: 2021/2022
Supervisor: Karafiát Martin, Ing., Ph.D.

English title

Domain Specific Data Crawling for Language Model Adaptation

Language

Czech

Abstract

The goal of this thesis is to implement a system for automatic language model adaptation for the Phonexia ASR system. The system takes a source text as input, analyses it, and selects appropriate terms for web search. Each web search returns a set of documents that undergo cleaning and filtering. The resulting web corpus is mixed with the Phonexia model and evaluated. To estimate the optimal parameters, I conducted three sets of experiments, for Hindi, Czech and Mandarin. The results were favourable: the implemented system decreased perplexity and Word Error Rate in most cases.
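The abstract does not say how the web corpus is mixed with the existing model. A common approach is linear interpolation, with the weight chosen to minimise perplexity on held-out in-domain text. The sketch below illustrates that idea on toy unigram models. It is an assumption for illustration, not the thesis implementation, and all function names and data are hypothetical.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(p_base, p_web, lam):
    """Linear interpolation: lam * web model + (1 - lam) * base model."""
    return {w: lam * p_web[w] + (1.0 - lam) * p_base[w] for w in p_base}


def perplexity(model, tokens):
    """Perplexity of a unigram model on a token sequence."""
    log_sum = sum(math.log(model[t]) for t in tokens)
    return math.exp(-log_sum / len(tokens))


def best_weight(p_base, p_web, dev_tokens, grid):
    """Pick the interpolation weight with the lowest held-out perplexity."""
    return min(
        grid,
        key=lambda lam: perplexity(interpolate(p_base, p_web, lam), dev_tokens),
    )


# Toy data, purely illustrative (not from the thesis).
base_text = "the cat sat on the mat".split()            # stands in for the base model's data
web_text = "the model adapts the language model".split()  # stands in for the crawled corpus
dev_text = "the language model".split()                  # held-out target-domain text
vocab = sorted(set(base_text) | set(web_text) | set(dev_text))

p_base = train_unigram(base_text, vocab)
p_web = train_unigram(web_text, vocab)

grid = [i / 10 for i in range(11)]  # candidate weights 0.0, 0.1, ..., 1.0
lam = best_weight(p_base, p_web, dev_text, grid)
mixed = interpolate(p_base, p_web, lam)

print(f"lambda={lam:.1f}")
print(f"base perplexity:  {perplexity(p_base, dev_text):.2f}")
print(f"mixed perplexity: {perplexity(mixed, dev_text):.2f}")
```

Because the grid includes a weight of 0.0, the selected mixture can never score worse than the base model on the held-out text used for tuning; a separate test set is still needed to measure the real gain.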

Keywords

speech-to-text, automatic speech recognition, language model, language model adaptation, automatic web search, automatic web document scraping, automatic assessment of web documents

Department

Department of Computer Graphics and Multimedia FIT BUT

Degree Programme

Information Technology and Artificial Intelligence, Specialization Sound, Speech and Natural Language Processing

Status

defended, grade C

Date

17 June 2022

Reviewer

Švec Ján, Ing.

Committee

Černocký Jan, prof. Dr. Ing. (DCGM FIT BUT), chair
Hradiš Michal, Ing., Ph.D. (DCGM FIT BUT), member
Janoušek Vladimír, doc. Ing., Ph.D. (DITS FIT BUT), member
Kanich Ondřej, Ing., Ph.D. (DITS FIT BUT), member
Rozman Jaroslav, Ing., Ph.D. (DITS FIT BUT), member
Zbořil František, doc. Ing., Ph.D. (DITS FIT BUT), member

Citation

GREGUŠOVÁ, Sabína. Adaptace jazykového modelu na cílovou doménu využívající stahování veřejných dat. Brno, 2022. Master's Thesis. Brno University of Technology, Faculty of Information Technology. 2022-06-17. Supervised by Karafiát Martin. Available from: https://www.fit.vut.cz/study/thesis/24957/

BibTeX

@mastersthesis{FITMT24957,
    author = "Sab\'{i}na Gregu\v{s}ov\'{a}",
    type = "Master's thesis",
    title = "Adaptace jazykov\'{e}ho modelu na c\'{i}lovou dom\'{e}nu vyu\v{z}\'{i}vaj\'{i}c\'{i} stahov\'{a}n\'{i} ve\v{r}ejn\'{y}ch dat",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2022,
    location = "Brno, CZ",
    language = "czech",
    url = "https://www.fit.vut.cz/study/thesis/24957/"
}
