Thesis Details
Named Entity Recognition Exploiting Sub Word Information
The aim of this thesis is to create a Named Entity Recognition system based on an older state-of-the-art model and to study how subword information can improve the recognition of out-of-vocabulary words. Besides English, the proposed system has to support two additional European languages: German and Hungarian. The work presents a deep-learning named entity tagger that uses pretrained and custom-trained word embeddings, sparse features, and character embeddings extracted by a Convolutional Neural Network. These features are processed by a sequence-based approach (bidirectional Long Short-Term Memory) and a feature-based approach (Conditional Random Field), with the goal of achieving an F1-score similar to that of the work it is based on and of comparing how far present-day state-of-the-art systems have evolved. The resulting system achieves a 90.98% F1-score on the CoNLL 2003 English test dataset using pretrained word embeddings, not far behind the original work's 91.26%. For the other two languages, the model scores 89.34% on the WikiAnn German test dataset and 93.04% on the WikiAnn Hungarian test dataset using custom-trained embeddings.
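For readers unfamiliar with how F1-scores such as those quoted above are typically obtained for named entity taggers, the following is a minimal sketch. It assumes BIO-style tags and the entity-level, exact-match scoring conventionally used for CoNLL 2003. The abstract does not state the thesis's own tagging scheme or evaluation script, so the function names `extract_spans` and `f1_score` and the example sentence are illustrative only.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence.

    Assumption: an I- tag that does not continue a span of the same type
    is treated leniently as the start of a new span.
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))  # close the span that just ended
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def f1_score(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Entity-level F1: a predicted span counts only if type and
    boundaries both match a gold span exactly."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = extract_spans(g), extract_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical one-sentence example: the tagger finds the person
# but misses the location.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = f1_score(gold, pred)
print(round(score, 3))  # 0.667
```

In this example precision is 1.0 and recall is 0.5, so a single missed entity lowers F1 to about 0.667. Under this kind of scoring, a partially correct span boundary earns no credit at all.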
Natural Language Processing, Named Entity Recognition, neural networks, Convolutional Neural Network, Conditional Random Fields, Long Short-Term Memory, subword information
Bartík Vladimír, Ing., Ph.D. (DIFS FIT BUT), member
Češka Milan, doc. RNDr., Ph.D. (DITS FIT BUT), member
Jaroš Jiří, doc. Ing., Ph.D. (DCSY FIT BUT), member
Orság Filip, Ing., Ph.D. (DITS FIT BUT), member
@bachelorsthesis{FITBT24847, author = "Patrik Dobrovodsk\'{y}", type = "Bachelor's thesis", title = "Named Entity Recognition Exploiting Sub Word Information", school = "Brno University of Technology, Faculty of Information Technology", year = 2022, location = "Brno, CZ", language = "english", url = "https://www.fit.vut.cz/study/thesis/24847/" }