Thesis Details

Poloautomatická normalizace slov z matričních záznamů

Bachelor's Thesis
Student: Hříbek David
Academic Year: 2018/2019
Supervisor: Rozman Jaroslav, Ing., Ph.D.
English title
Semi-Automatic Word Normalization in Parish Records
Language
Czech
Abstract

This thesis extends DEMoS, a web application for managing parish records, with word normalization: assigning a normalized written form to names, surnames, occupations, domiciles, and other types of words that occur in parish records. The solution uses a duplicate record detection process to sort entries from parish records into clusters of similar words. Normalized variants can then be shared within each cluster, so DEMoS suggests normalized forms for a word entered by a user based not only on identical words but also on similar ones. The thesis also proposes automatic testing of the word clustering. In total, 640 different combinations of clustering parameters were tested for each word type, and the best-performing parameters were then selected for each type. Normalizing words makes searching parish records in DEMoS significantly more efficient and makes the records easier to read.
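The idea of sharing normalized variants within clusters of similar words can be illustrated with a minimal sketch. It is not the method used in the thesis or in DEMoS: the similarity measure (Python's `difflib.SequenceMatcher`), the greedy clustering strategy, the threshold of 0.75, the function names, and the sample surnames are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1].

    difflib is a stand-in here, not the measure used in the thesis.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_words(words, threshold=0.75):
    """Greedy single-pass clustering (hypothetical strategy).

    Each word joins the first cluster whose first member is similar
    enough; otherwise it starts a new cluster.
    """
    clusters = []
    for word in words:
        for cluster in clusters:
            if similarity(word, cluster[0]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


def suggest(word, clusters, normalized):
    """Return the normalized forms known for any word in the same cluster."""
    for cluster in clusters:
        if word in cluster:
            return sorted({normalized[w] for w in cluster if w in normalized})
    return []


# Illustrative spelling variants of two surnames (made-up sample data).
words = ["Novák", "Novak", "Nowák", "Dvořák", "Dvořak"]
clusters = cluster_words(words)

# Only one word has been normalized by a user so far.
normalized = {"Novák": "Novák"}

print(clusters)
print(suggest("Nowák", clusters, normalized))   # variant inherits the cluster's form
print(suggest("Dvořak", clusters, normalized))  # no normalized form in this cluster yet
```

In this sketch the threshold plays the role of a clustering parameter: too low a value merges unrelated surnames, too high a value separates spelling variants, which is the kind of trade-off that testing many parameter combinations per word type is meant to settle.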

Keywords

parish records, data-matching, deduplication, normalization, duplicate detection, searching, DEMoS

Department
Degree Programme
Information Technology
Status
defended, grade A
Date
13 June 2019
Reviewer
Committee
Zbořil František, doc. Ing., Ph.D. (DITS FIT BUT), chair
Burget Lukáš, doc. Ing., Ph.D. (DCGM FIT BUT), member
Grézl František, Ing., Ph.D. (DCGM FIT BUT), member
Hliněná Dana, doc. RNDr., Ph.D. (DMAT FEEC BUT), member
Strnadel Josef, Ing., Ph.D. (DCSY FIT BUT), member
Citation
HŘÍBEK, David. Poloautomatická normalizace slov z matričních záznamů. Brno, 2019. Bachelor's Thesis. Brno University of Technology, Faculty of Information Technology. 2019-06-13. Supervised by Rozman Jaroslav. Available from: https://www.fit.vut.cz/study/thesis/21640/
BibTeX
@bachelorsthesis{FITBT21640,
    author = "David H\v{r}\'{i}bek",
    type = "Bachelor's thesis",
    title = "Poloautomatick\'{a} normalizace slov z matri\v{c}n\'{i}ch z\'{a}znam\r{u}",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2019,
    location = "Brno, CZ",
    language = "czech",
    url = "https://www.fit.vut.cz/study/thesis/21640/"
}