Využití neanotovaných dat pro trénování OCR

English title

OCR Trained with Unannotated Data

Language

Czech

Abstract

Building a high-quality optical character recognition (OCR) system requires a large amount of labeled data, and creating such data is costly. This thesis studies several methods that use unlabeled data efficiently when training an OCR neural network. The proposed methods belong to the family of self-training algorithms and share a common procedure. First, a seed model is trained on a limited amount of labeled data. The seed model, combined with a language model, then produces pseudo-labels for the unlabeled data. Finally, the machine-labeled data are merged with the labeled data used to train the seed model, and the combined set is used to train the target model. The methods are evaluated on the handwritten ICFHR 2014 Bentham dataset. Experiments were conducted on two datasets representing different degrees of labeled data availability. The best model trained on the smaller dataset reached a character error rate (CER) of 3.70 %, a relative improvement of 42 % over the seed model; the best model trained on the larger dataset reached a CER of 1.90 %, a relative improvement of 26 % over the seed model. The results show that the proposed methods can use unlabeled data to reduce the OCR error rate.
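
The self-training procedure summarized in the abstract can be sketched roughly as follows. This is a minimal illustration and not the thesis implementation: the `train` and `decode_with_lm` callables, the `Sample` and `Model` type aliases, and the function names are all hypothetical placeholders. The `cer` helper assumes the usual definition of character error rate (Levenshtein distance divided by reference length, in percent).

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, str]      # (text-line identifier, transcription); hypothetical
Model = Callable[[str], str]  # maps a text-line identifier to recognized text


def self_train(
    labeled: Sequence[Sample],
    unlabeled: Sequence[str],
    train: Callable[[Sequence[Sample]], Model],
    decode_with_lm: Callable[[Model, str], str],
) -> Model:
    """One round of self-training with pseudo-labels (illustrative sketch)."""
    # 1. Train the seed model on the limited labeled data.
    seed_model = train(labeled)
    # 2. Pseudo-label the unlabeled lines with the seed model plus a language model.
    pseudo_labeled: List[Sample] = [
        (line, decode_with_lm(seed_model, line)) for line in unlabeled
    ]
    # 3. Train the target model on labeled and machine-labeled data together.
    return train(list(labeled) + pseudo_labeled)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent: edit distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(reference)
```

In practice, `train` would wrap the training of an OCR network and `decode_with_lm` would wrap decoding with a language model; both are left abstract here because the abstract does not specify them.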

Keywords

neural network, text recognition, self-training, unlabeled data, language model

Department

Department of Computer Graphics and Multimedia FIT BUT

Degree Programme

Information Technology, Field of Study Computer Graphics and Multimedia

Status

defended, grade A

Date

21 June 2021

Reviewer

Dobeš Petr, Ing.

Committee

Smrž Pavel, doc. RNDr., Ph.D. (DCGM FIT BUT), chair
Čadík Martin, doc. Ing., Ph.D. (DCGM FIT BUT), member
Češka Milan, prof. RNDr., CSc. (DITS FIT BUT), member
Hradiš Michal, Ing., Ph.D. (DCGM FIT BUT), member
Chudý Peter, doc. Ing., Ph.D. MBA (DCGM FIT BUT), member
Szőke Igor, Ing., Ph.D. (DCGM FIT BUT), member

Citation

BUCHAL, Petr. Využití neanotovaných dat pro trénování OCR. Brno, 2021. Master's Thesis. Brno University of Technology, Faculty of Information Technology. 2021-06-21. Supervised by Hradiš Michal. Available from: https://www.fit.vut.cz/study/thesis/24175/

BibTeX

@mastersthesis{FITMT24175,
    author = "Petr Buchal",
    type = "Master's thesis",
    title = "Vyu\v{z}it\'{i} neanotovan\'{y}ch dat pro tr\'{e}nov\'{a}n\'{i} OCR",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2021,
    location = "Brno, CZ",
    language = "czech",
    url = "https://www.fit.vut.cz/study/thesis/24175/"
}