Thesis Details

Využití neanotovaných dat pro trénování OCR

Master's Thesis
Student: Buchal Petr
Academic Year: 2020/2021
Supervisor: Hradiš Michal, Ing., Ph.D.
English title
OCR Trained with Unanotated Data
Language
Czech
Abstract

Building a high-quality optical character recognition (OCR) system requires a large amount of labeled data, and creating such data is costly. This thesis examines several methods that make efficient use of unlabeled data when training an OCR neural network. All of the proposed methods are self-training algorithms and follow the same general approach. First, a seed model is trained on a limited amount of labeled data. Next, the seed model, combined with a language model, produces pseudo-labels for the unlabeled data. Finally, the machine-labeled data are merged with the labeled data used to train the seed model, and the combined set is used to train the target model. The methods are evaluated on the handwritten ICFHR 2014 Bentham dataset. Experiments were conducted on two datasets representing different degrees of labeled data availability. The best model trained on the smaller dataset achieved a character error rate (CER) of 3.70 %, a relative improvement of 42 % over the seed model; the best model trained on the larger dataset achieved a CER of 1.90 %, a relative improvement of 26 % over the seed model. The results show that the proposed methods can effectively reduce the OCR error rate by exploiting unlabeled data.
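
The self-training procedure and the reported metrics can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: `train` and `decode` are hypothetical callables standing in for OCR network training and for decoding with the seed model plus a language model. CER is computed here as the character-level Levenshtein distance divided by the reference length, which is the usual definition but is assumed rather than taken from the thesis.

```python
from typing import Any, Callable, List, Sequence, Tuple

Sample = Tuple[Any, str]  # (text-line image or any input, transcription)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def cer(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Character error rate in percent over a set of text lines."""
    errors = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total = sum(len(r) for r in references)
    return 100.0 * errors / total


def relative_improvement(seed_cer: float, target_cer: float) -> float:
    """Relative CER reduction of the target model over the seed model, in percent."""
    return 100.0 * (seed_cer - target_cer) / seed_cer


def self_train(
    train: Callable[[List[Sample]], Any],   # hypothetical: trains a model
    decode: Callable[[Any, Any], str],      # hypothetical: model + LM decoding
    labeled: List[Sample],
    unlabeled: List[Any],
) -> Tuple[Any, Any]:
    """One round of self-training: seed model, pseudo-labels, target model."""
    seed = train(labeled)                                 # step 1: seed model
    pseudo = [(x, decode(seed, x)) for x in unlabeled]    # step 2: pseudo-labels
    target = train(labeled + pseudo)                      # step 3: target model
    return seed, target
```

With real components, `train` would fit the OCR network and `decode` would run beam search with the language model. The sketch labels every unlabeled line; whether the thesis methods filter or weight pseudo-labels is not stated in the abstract.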

Keywords

neural network, text recognition, self-training, unlabeled data, language model

Degree Programme
Information Technology, Field of Study Computer Graphics and Multimedia
Status
defended, grade A
Date
21 June 2021
Committee
Smrž Pavel, doc. RNDr., Ph.D. (DCGM FIT BUT), chair
Čadík Martin, doc. Ing., Ph.D. (DCGM FIT BUT), member
Češka Milan, prof. RNDr., CSc. (DITS FIT BUT), member
Hradiš Michal, Ing., Ph.D. (DCGM FIT BUT), member
Chudý Peter, doc. Ing., Ph.D. MBA (DCGM FIT BUT), member
Szőke Igor, Ing., Ph.D. (DCGM FIT BUT), member
Citation
BUCHAL, Petr. Využití neanotovaných dat pro trénování OCR. Brno, 2021. Master's Thesis. Brno University of Technology, Faculty of Information Technology. 2021-06-21. Supervised by Hradiš Michal. Available from: https://www.fit.vut.cz/study/thesis/24175/
BibTeX
@mastersthesis{FITMT24175,
    author = "Petr Buchal",
    type = "Master's thesis",
    title = "Vyu\v{z}it\'{i} neanotovan\'{y}ch dat pro tr\'{e}nov\'{a}n\'{i} OCR",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2021,
    location = "Brno, CZ",
    language = "czech",
    url = "https://www.fit.vut.cz/study/thesis/24175/"
}