Thesis Details

Extrakce informací z Wikipedie

Bachelor's Thesis
Student: Valušek Ondřej
Academic Year: 2018/2019
Supervisor: Smrž Pavel, doc. RNDr., Ph.D.
English title
Information Extraction from Wikipedia
Language
Czech
Abstract

This thesis deals with the automatic extraction of entity types and their attributes from English Wikipedia articles. Several machine-learning approaches are presented. In addition, important attributes are extracted, such as the date of birth from articles about people or the area from articles about lakes. Using the system presented in this thesis, one can generate a well-structured knowledge base from a file of Wikipedia articles (a dump file) and a small training set of a few correctly classified articles. Such a knowledge base can then be used for the semantic enrichment of text. During this process, a file of so-called definition words is also generated. Definition words are features extracted by analyzing natural-language text, and they could also be used for purposes other than those in this thesis. The system also includes a component that determines which articles were added, deleted, or modified between the creation of two different knowledge bases.

Keywords

article classification, entity type detection, natural text, natural language processing, part-of-speech tagging, SpaCy, Stanford CoreNLP, Wikipedia, SVM, Support Vector Machine, machine learning, artificial intelligence, attribute extraction

Department
Degree Programme
Information Technology
Status
defended, grade D
Date
10 June 2019
Reviewer
Committee
Smrž Pavel, doc. RNDr., Ph.D. (DCGM FIT BUT), chair
Fučík Otto, doc. Dr. Ing. (DCSY FIT BUT), member
Holík Lukáš, doc. Mgr., Ph.D. (DITS FIT BUT), member
Szőke Igor, Ing., Ph.D. (DCGM FIT BUT), member
Veselý Vladimír, Ing., Ph.D. (DIFS FIT BUT), member
Citation
VALUŠEK, Ondřej. Extrakce informací z Wikipedie. Brno, 2019. Bachelor's Thesis. Brno University of Technology, Faculty of Information Technology. 2019-06-10. Supervised by Smrž Pavel. Available from: https://www.fit.vut.cz/study/thesis/18942/
BibTeX
@bachelorsthesis{FITBT18942,
    author = "Ond\v{r}ej Valu\v{s}ek",
    type = "Bachelor's thesis",
    title = "Extrakce informac\'{i} z Wikipedie",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2019,
    location = "Brno, CZ",
    language = "czech",
    url = "https://www.fit.vut.cz/study/thesis/18942/"
}