Thesis Details

Extrakce informací z Wikipedie

Bachelor's Thesis
Student: Valušek Ondřej
Academic Year: 2018/2019
Supervisor: Smrž Pavel, doc. RNDr., Ph.D.

English title

Information Extraction from Wikipedia

Language

Czech

Abstract

This thesis deals with the automatic detection of entity types in English Wikipedia articles and the extraction of their attributes. Several machine learning approaches are presented. Important attributes are also extracted, such as the date of birth in articles about people or the area in articles about lakes, among many others. Using the system presented in this thesis, one can generate a well-structured knowledge base from a file of Wikipedia articles (a dump file) and a small training set of a few correctly classified articles. Such a knowledge base can then be used for semantic enrichment of text. During this process, a file of so-called definition words is generated. Definition words are features extracted by analysing natural text, and they could also be used for purposes other than those in this thesis. There is also a component that determines which articles were added, deleted, or modified between the creation of two different knowledge bases.

Keywords

article classification, entity type detection, natural text, natural language processing, part-of-speech tagging, SpaCy, Stanford CoreNLP, Wikipedia, SVM, Support Vector Machine, machine learning, artificial intelligence, attribute extraction

Department

Department of Computer Graphics and Multimedia FIT BUT

Degree Programme

Information Technology

Status

defended, grade D

Date

10 June 2019

Reviewer

Otrusina Lubomír, Ing.

Committee

Smrž Pavel, doc. RNDr., Ph.D. (DCGM FIT BUT), chair
Fučík Otto, doc. Dr. Ing. (DCSY FIT BUT), member
Holík Lukáš, doc. Mgr., Ph.D. (DITS FIT BUT), member
Szőke Igor, Ing., Ph.D. (DCGM FIT BUT), member
Veselý Vladimír, Ing., Ph.D. (DIFS FIT BUT), member

Citation

VALUŠEK, Ondřej. Extrakce informací z Wikipedie. Brno, 2019. Bachelor's Thesis. Brno University of Technology, Faculty of Information Technology. 2019-06-10. Supervised by Smrž Pavel. Available from: https://www.fit.vut.cz/study/thesis/18942/

BibTeX

@bachelorsthesis{FITBT18942,
    author = "Ond\v{r}ej Valu\v{s}ek",
    type = "Bachelor's thesis",
    title = "Extrakce informac\'{i} z Wikipedie",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2019,
    location = "Brno, CZ",
    language = "czech",
    url = "https://www.fit.vut.cz/study/thesis/18942/"
}
