Thesis Details

Named Entity Recognition Exploiting Sub Word Information

Bachelor's Thesis
Student: Dobrovodský Patrik
Academic Year: 2021/2022
Supervisor: Kesiraju Santosh
Czech title
Named entity recognition exploiting sub word information
Language
English
Abstract

This thesis builds a Named Entity Recognition system based on an earlier state-of-the-art model and studies how subword information can improve the recognition of out-of-vocabulary words. Besides English, the proposed system has to support two additional European languages: German and Hungarian. The work features a deep-learning named entity tagger that uses pretrained and custom-trained word embeddings, sparse features, and character embeddings extracted by a Convolutional Neural Network. These features are then processed by a sequence-based component (a bidirectional Long Short-Term Memory network) and a feature-based component (a Conditional Random Field), with two goals: achieving an F1-score similar to that of the work it is based on, and comparing how far present-day state-of-the-art systems have evolved. The resulting system achieves a 90.98% F1-score on the CoNLL 2003 English test dataset using pretrained word embeddings, not far behind the original work's 91.26%. For the other two languages, the model scores 89.34% on the WikiAnn German test dataset and 93.04% on the WikiAnn Hungarian test dataset using custom-trained embeddings.
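
The F1-scores quoted in the abstract are presumably entity-level scores, as is customary for CoNLL 2003, although the abstract does not say how they were computed. The following is a minimal sketch of such a metric, assuming BIO-encoded tags and exact-match spans; the function names `extract_spans` and `entity_f1` are illustrative and are not taken from the thesis.

```python
from typing import List, Optional, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    # The trailing "O" is a sentinel that closes a span ending the sentence.
    for i, tag in enumerate(tags + ["O"]):
        # An I- tag continues the open span only if its type matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the span that is currently open, if any.
        if label is not None:
            spans.add((label, start, i))
            start, label = None, None
        # B- opens a new span; a stray I- is treated leniently as B-
        # (an assumption of this sketch).
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged exact-match F1 over a list of sentences."""
    true_pos = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = extract_spans(gold_tags)
        pred_spans = extract_spans(pred_tags)
        # A prediction counts only if type and both boundaries match.
        true_pos += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "I-PER", "O", "O"]]
    print(f"F1 = {entity_f1(gold, pred):.4f}")
```

Under this definition a partially matched entity earns no credit, which is why out-of-vocabulary words whose boundaries or types are mispredicted lower the score directly.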

Keywords

Natural Language Processing, Named Entity Recognition, neural networks, Convolutional Neural Network, Conditional Random Fields, Long Short-Term Memory, subword information

Status
defended, grade A
Date
15 June 2022
Committee
Černocký Jan, prof. Dr. Ing. (DCGM FIT BUT), chair
Bartík Vladimír, Ing., Ph.D. (DIFS FIT BUT), member
Češka Milan, doc. RNDr., Ph.D. (DITS FIT BUT), member
Jaroš Jiří, doc. Ing., Ph.D. (DCSY FIT BUT), member
Orság Filip, Ing., Ph.D. (DITS FIT BUT), member
Citation
DOBROVODSKÝ, Patrik. Named Entity Recognition Exploiting Sub Word Information. Brno, 2022. Bachelor's Thesis. Brno University of Technology, Faculty of Information Technology. 2022-06-15. Supervised by Kesiraju Santosh. Available from: https://www.fit.vut.cz/study/thesis/24847/
BibTeX
@bachelorsthesis{FITBT24847,
    author = "Patrik Dobrovodsk\'{y}",
    type = "Bachelor's thesis",
    title = "Named Entity Recognition Exploiting Sub Word Information",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2022,
    location = "Brno, CZ",
    language = "english",
    url = "https://www.fit.vut.cz/study/thesis/24847/"
}