Thesis Details

Vyhledávání duplicitních textů

Bachelor's Thesis
Student: Pekař Tomáš
Academic Year: 2014/2015
Supervisor: Smrž Pavel, doc. RNDr., Ph.D.

English title

Duplicate Text Identification

Language

Czech

Abstract

The aim of this work is to design and implement a system for duplicate text identification. The application should be able to index documents and to search for documents in the index. The work deals with document preprocessing, fragmentation, and indexing. It also analyzes methods for duplicate text identification, which are closely tied to strategies for selecting substrings. The thesis includes a description of the basic data structures that can be used to index n-grams.
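As an illustration only, the pipeline named in the abstract (preprocessing, fragmentation into n-grams, hashing, and an inverted index) might look like the minimal Python sketch below. This is not the thesis implementation: the class `NGramIndex`, the helper names, the choice of word trigrams, the MD5-based fingerprint, and the Jaccard score are all assumptions made for the example.

```python
import hashlib
import re
from collections import defaultdict


def preprocess(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n=3):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def fingerprint(gram):
    """Hash an n-gram to a short, fixed-size key (hypothetical choice: MD5)."""
    return hashlib.md5(" ".join(gram).encode("utf-8")).hexdigest()[:16]


class NGramIndex:
    """Inverted index: n-gram fingerprint -> set of document ids."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)
        self.sizes = {}  # number of distinct fingerprints per document

    def _keys(self, text):
        return {fingerprint(g) for g in ngrams(preprocess(text), self.n)}

    def add(self, doc_id, text):
        """Index a document under the given id."""
        keys = self._keys(text)
        self.sizes[doc_id] = len(keys)
        for key in keys:
            self.postings[key].add(doc_id)

    def query(self, text):
        """Return (doc_id, Jaccard similarity) pairs, best match first."""
        keys = self._keys(text)
        shared = defaultdict(int)
        for key in keys:
            for doc_id in self.postings.get(key, ()):
                shared[doc_id] += 1
        scores = {}
        for doc_id, count in shared.items():
            union = len(keys) + self.sizes[doc_id] - count
            scores[doc_id] = count / union
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Only documents that share at least one n-gram with the query are ever scored, which is the main reason to use an inverted index here rather than comparing the query against every stored document.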

Keywords

searching, hash, duplicates, indexing, n-gram, inverted index, data structures

Department

Department of Computer Graphics and Multimedia FIT BUT

Degree Programme

Information Technology

Files

Thesis text 1019 kB

Status

defended, grade D

Date

16 June 2015

Reviewer

Kouřil Jan, Ing.

Committee

Meduna Alexander, prof. RNDr., CSc. (DIFS FIT BUT), chair
Beran Vítězslav, doc. Ing., Ph.D. (DCGM FIT BUT), member
Drábek Vladimír, doc. Ing., CSc. (DCSY FIT BUT), member
Křena Bohuslav, Ing., Ph.D. (DITS FIT BUT), member
Očenášek Pavel, Mgr. Ing., Ph.D. (DIFS FIT BUT), member

Citation

PEKAŘ, Tomáš. Vyhledávání duplicitních textů. Brno, 2015. Bachelor's Thesis. Brno University of Technology, Faculty of Information Technology. 2015-06-16. Supervised by Smrž Pavel. Available from: https://www.fit.vut.cz/study/thesis/9668/

BibTeX

@bachelorsthesis{FITBT9668,
    author = "Tom\'{a}\v{s} Peka\v{r}",
    type = "Bachelor's thesis",
    title = "Vyhled\'{a}v\'{a}n\'{i} duplicitn\'{i}ch text\r{u}",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2015,
    location = "Brno, CZ",
    language = "czech",
    url = "https://www.fit.vut.cz/study/thesis/9668/"
}
