Named Entity Recognition Exploiting Sub Word Information

Czech title

Named entity recognition exploiting sub word information

Language

English

Abstract

The aim of this thesis is to create a Named Entity Recognition system based on an older state-of-the-art model and to study how subword information can improve the recognition of out-of-vocabulary words. Besides English, the proposed system has to support two additional languages: German and Hungarian. The work features a deep-learning named entity tagger that uses pretrained and custom-trained word embeddings, sparse features, and character embeddings extracted by a Convolutional Neural Network. These features are then processed by sequence-based (bidirectional Long Short-Term Memory) and feature-based (Conditional Random Field) approaches, with the goal of achieving an F1-score similar to that of the work it is based on and of comparing how far present-day state-of-the-art systems have evolved. The resulting system achieves a 90.98% F1-score on the CoNLL 2003 English test dataset using pretrained word embeddings, not far behind the original work's 91.26%. For the other two languages, the model scores 89.34% on the WikiAnn German test dataset and 93.04% on the WikiAnn Hungarian test dataset using custom-trained embeddings.
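
For readers unfamiliar with how an F1-score is typically obtained in named entity recognition, the following is a minimal sketch of span-level (exact-match) F1 over BIO tags. It assumes the scores quoted above are span-level, as is conventional for CoNLL 2003; this page does not state the exact evaluation procedure. The function names `bio_spans` and `span_f1` are illustrative and are not taken from the thesis.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive); hypothetical representation
Span = Tuple[str, int, int]


def bio_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if label is not None and not inside:
            spans.add((label, start, i))  # current span ends before position i
            start, label = None, None
        # A new span starts on B-, or on an I- tag with no open span (lenient).
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match F1: a predicted span counts only if type and boundaries match."""
    g, p = bio_spans(gold), bio_spans(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG"]
print(span_f1(gold, pred))  # one of two spans matches on each side
```

This sketch scores a single sentence; a corpus-level score would instead sum the matched, predicted and gold span counts over all sentences before computing precision and recall.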

Keywords

Natural Language Processing, Named Entity Recognition, neural networks, Convolutional Neural Network, Conditional Random Fields, Long Short-Term Memory, subword information

Department

Department of Computer Graphics and Multimedia FIT BUT

Degree Programme

Information Technology

Status

defended, grade A

Date

15 June 2022

Reviewer

Egorova Ekaterina, Ing., Ph.D.

Committee

Černocký Jan, prof. Dr. Ing. (DCGM FIT BUT), chair
Bartík Vladimír, Ing., Ph.D. (DIFS FIT BUT), member
Češka Milan, doc. RNDr., Ph.D. (DITS FIT BUT), member
Jaroš Jiří, prof. Ing., Ph.D. (DCSY FIT BUT), member
Orság Filip, Ing., Ph.D. (DITS FIT BUT), member

Citation

DOBROVODSKÝ, Patrik. Named Entity Recognition Exploiting Sub Word Information. Brno, 2022. Bachelor's Thesis. Brno University of Technology, Faculty of Information Technology. 2022-06-15. Supervised by Kesiraju Santosh. Available from: https://www.fit.vut.cz/study/thesis/24847/

BibTeX

@bachelorsthesis{FITBT24847,
    author = "Patrik Dobrovodsk\'{y}",
    type = "Bachelor's thesis",
    title = "Named Entity Recognition Exploiting Sub Word Information",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2022,
    location = "Brno, CZ",
    language = "english",
    url = "https://www.fit.vut.cz/study/thesis/24847/"
}