Thesis Details

Automatizovaná extrakce informací z emailů

Bachelor's Thesis
Student: Kanda Rastislav
Academic Year: 2018/2019
Supervisor: Vídeňský František, Ing.

English title

Automated Extraction of Information from Emails

Language

Czech

Abstract

The purpose of this thesis is to become familiar with methods for extracting information from text and, based on that knowledge, to design and implement a system capable of gathering information from email messages. The proposed system should help Kiwi.com s.r.o. process incoming email messages from travel companies. These messages can currently be processed automatically, but only after a template suitable for extraction has been created manually. A possible alternative is the ROBULA+ algorithm, which generates a more robust XPath locator from a given XPath locator. Such locators should be more resistant to changes in the HTML structure. ROBULA+ is the central element of the automated creation of templates for parsing email messages. The implemented system achieves a satisfactory success rate of approximately 75%, meaning that it finds the reference to the created reservation in about three out of four cases.
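To illustrate why a more robust XPath locator matters, the following minimal sketch contrasts a position-based locator with an attribute-based one on a made-up confirmation email. It is not the ROBULA+ algorithm and not the thesis implementation. The email markup, the `booking-ref` attribute, both locator strings, and the `extract` helper are assumptions chosen for illustration, and the standard-library parser used here requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical confirmation email body (well-formed markup, assumed for this sketch).
ORIGINAL = """<html><body>
<table>
<tr><td>Passenger</td><td>Jane Doe</td></tr>
<tr><td>Booking reference</td><td id="booking-ref">ABC123</td></tr>
</table>
</body></html>"""

# The same email after the sender inserts a banner table above the content.
CHANGED = """<html><body>
<table>
<tr><td>Special offer</td><td>Save 10 percent</td></tr>
<tr><td>Valid until</td><td>Friday</td></tr>
</table>
<table>
<tr><td>Passenger</td><td>Jane Doe</td></tr>
<tr><td>Booking reference</td><td id="booking-ref">ABC123</td></tr>
</table>
</body></html>"""

# Position-based locator: depends on the exact layout of the document.
BRITTLE = "./body/table[1]/tr[2]/td[2]"
# Attribute-based locator: depends only on a stable attribute of the target cell.
ROBUST = ".//td[@id='booking-ref']"


def extract(markup, locator):
    """Return the text of the first node matching the locator, or None."""
    node = ET.fromstring(markup).find(locator)
    return node.text if node is not None else None


if __name__ == "__main__":
    for name, markup in (("original", ORIGINAL), ("changed", CHANGED)):
        print(name, extract(markup, BRITTLE), extract(markup, ROBUST))
```

In this sketch the position-based locator returns the reservation reference only for the original layout, while the attribute-based locator returns it for both versions.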

Keywords

information extraction, email, ROBULA+, automation, REST API, XPath, Python

Department

Department of Intelligent Systems FIT BUT

Degree Programme

Information Technology

Status

defended, grade B

Date

10 June 2019

Reviewer

Zbořil František, doc. Ing., Ph.D.

Committee

Smrž Pavel, doc. RNDr., Ph.D. (DCGM FIT BUT), chair
Fučík Otto, doc. Dr. Ing. (DCSY FIT BUT), member
Holík Lukáš, doc. Mgr., Ph.D. (DITS FIT BUT), member
Szőke Igor, Ing., Ph.D. (DCGM FIT BUT), member
Veselý Vladimír, Ing., Ph.D. (DIFS FIT BUT), member

Citation

KANDA, Rastislav. Automatizovaná extrakce informací z emailů. Brno, 2019. Bachelor's Thesis. Brno University of Technology, Faculty of Information Technology. 2019-06-10. Supervised by Vídeňský František. Available from: https://www.fit.vut.cz/study/thesis/22028/

BibTeX

@bachelorsthesis{FITBT22028,
    author = "Rastislav Kanda",
    type = "Bachelor's thesis",
    title = "Automatizovan\'{a} extrakce informac\'{i} z email\r{u}",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2019,
    location = "Brno, CZ",
    language = "czech",
    url = "https://www.fit.vut.cz/study/thesis/22028/"
}
