Result details

AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions

KIŠŠ, M.; BENEŠ, K.; HRADIŠ, M. AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions. In Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021. Lecture Notes in Computer Science. Lausanne: Springer Nature Switzerland AG, 2021. p. 463-477. ISBN: 978-3-030-86336-4.

Type

conference proceedings paper

Language

English

Authors

Kišš Martin, Ing., UPGM (FIT)
Beneš Karel, Ing., Ph.D., UPGM (FIT)
Hradiš Michal, Ing., Ph.D., UAMT (FEKT), UPGM (FIT)

Abstract

This paper addresses text recognition for domains with limited manual annotations by a simple self-training strategy. Our approach should reduce human annotation effort when target domain data is plentiful, such as when transcribing a collection of a single person's correspondence or a large manuscript. We propose to train a seed system on large-scale data from related domains mixed with available annotated data from the target domain. The seed system transcribes the unannotated data from the target domain, which is then used to train a better system. We study several confidence measures and eventually decide to use the posterior probability of a transcription for data selection. Additionally, we propose to augment the data using an aggressive masking scheme. By self-training, we achieve up to 55 % reduction in character error rate for handwritten data and up to 38 % on printed data. The masking augmentation itself reduces the error rate by about 10 % and its effect is more pronounced in the case of difficult handwritten data.

Keywords

self-training, text recognition, language model, unlabelled data, confidence measures, data augmentation

URL

https://pero.fit.vutbr.cz/publications

Year

2021

Pages

463–477

Proceedings

Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021

Series

Lecture Notes in Computer Science

Volume

12824

Conference

International Conference on Document Analysis and Recognition

ISBN

978-3-030-86336-4

Publisher

Springer Nature Switzerland AG

Place

Lausanne

DOI

10.1007/978-3-030-86337-1_31

UT WoS

000711880100031

EID Scopus

2-s2.0-85115292729

BibTeX

@inproceedings{BUT175776,
  author="Martin {Kišš} and Karel {Beneš} and Michal {Hradiš}",
  title="AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions",
  booktitle="Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition - ICDAR 2021",
  year="2021",
  series="Lecture Notes in Computer Science",
  volume="12824",
  pages="463--477",
  publisher="Springer Nature Switzerland AG",
  address="Lausanne",
  doi="10.1007/978-3-030-86337-1_31",
  isbn="978-3-030-86336-4",
  url="https://pero.fit.vutbr.cz/publications"
}

Projects

Neural Representations in Multi-modal and Multi-lingual Modelling, GAČR, Grant Projects of Excellence in Basic Research EXPRO - 2019, GX19-26934X, start: 2019-01-01, end: 2023-12-31, completed
OCR, ClassificAtion & Machine Translation, EU, Connecting Europe Facility (CEF), start: 2019-10-01, end: 2021-09-30, completed
Advanced extraction and recognition of the content of printed and handwritten digitized documents to improve their accessibility and usability, MK, Programme to Support Applied Research and Experimental Development of National and Cultural Identity for 2016 to 2022 (NAKI II), DG18P02OVV055, start: 2018-03-01, end: 2022-12-31, completed

Research groups

Speech Data Mining Research Group BUT Speech@FIT (VZ SPEECH)
Computer Graphics Research Group (VZ GRAPH)

Department

Department of Computer Graphics and Multimedia (UPGM)