Thesis Details
Semi-Automatic Normalization of Words from Parish Records (Poloautomatická normalizace slov z matričních záznamů)
This thesis extends DEMoS, a web application for managing parish records, with support for normalization, that is, assigning a normalized written form to individual words. Normalization covers names, surnames, occupations, domiciles and other types of words that occur in parish records. The solution uses a duplicate record detection process to sort the words from parish records into clusters of similar words, which makes it possible to share normalized variants within each cluster. As a result, when a user enters a word, DEMoS can suggest normalized variants that were assigned not only to the same word but also to similar words. The thesis also proposes automatic testing of the word clustering: for each word type, 640 different combinations of clustering parameters were tested and the best parameters were then selected. Normalizing words significantly increases the efficiency of searching in parish records in DEMoS and also makes the records easier to read.
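The idea of sharing normalized variants within clusters of similar words can be illustrated with a minimal Python sketch. The abstract does not specify the similarity measure or the clustering algorithm, so the use of `difflib.SequenceMatcher`, the greedy clustering, the 0.75 threshold, the sample words and all function names are assumptions made for illustration only, not the method implemented in DEMoS.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (assumed measure, not the thesis's)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_words(words, threshold=0.75):
    """Greedy single-pass clustering: a word joins the first cluster whose
    representative (its first member) is at least `threshold` similar."""
    clusters = []
    for word in words:
        for cluster in clusters:
            if similarity(word, cluster[0]) >= threshold:
                cluster.append(word)
                break
        else:
            # No existing cluster was similar enough; start a new one.
            clusters.append([word])
    return clusters


def suggest_normalized(word, clusters, normalized):
    """Return the normalized forms already assigned to any word that shares
    a cluster with `word` (hypothetical lookup, not the DEMoS API)."""
    for cluster in clusters:
        if word in cluster:
            return sorted({normalized[w] for w in cluster if w in normalized})
    return []


# Illustrative spelling variants; not data from the thesis.
words = ["Novak", "Nowak", "Novack", "Svoboda", "Swoboda"]
clusters = cluster_words(words)

# Normalized forms entered so far by (hypothetical) users.
normalized = {"Novak": "Novák", "Svoboda": "Svoboda"}

# "Nowak" has no normalized form of its own, but its cluster does.
suggestions = suggest_normalized("Nowak", clusters, normalized)
print(clusters)
print(suggestions)
```

In a sketch like this, the threshold plays the role of one of the clustering parameters that the thesis tunes separately for each word type.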
parish records, data-matching, deduplication, normalization, duplicate detection, searching, DEMoS
Burget Lukáš, doc. Ing., Ph.D. (DCGM FIT BUT), member
Grézl František, Ing., Ph.D. (DCGM FIT BUT), member
Hliněná Dana, doc. RNDr., Ph.D. (DMAT FEEC BUT), member
Strnadel Josef, Ing., Ph.D. (DCSY FIT BUT), member
@bachelorsthesis{FITBT21640,
  author = "David H\v{r}\'{i}bek",
  type = "Bachelor's thesis",
  title = "Poloautomatick\'{a} normalizace slov z matri\v{c}n\'{i}ch z\'{a}znam\r{u}",
  school = "Brno University of Technology, Faculty of Information Technology",
  year = 2019,
  location = "Brno, CZ",
  language = "czech",
  url = "https://www.fit.vut.cz/study/thesis/21640/"
}