Project Details

Jazyková paměť regionů České republiky. Metody strojového učení pro uchování, dokumentaci a prezentaci nářečí českého jazyka

Project Period: 1. 3. 2023 – 31. 12. 2027

Project Type: grant

Code: DH23P03OVV010

Agency: Ministry of Culture of the Czech Republic (Ministerstvo kultury ČR)

Program: NAKI III – program to support applied research in the area of national and cultural identity for the years 2023 to 2030

English title

Language memory of the regions of the Czech Republic. Machine learning methods for preservation, documentation and presentation of the dialects of the Czech language

Keywords

Czech language, dialects, dialectology, artificial intelligence, speech and
language data, automatic dialect identification, automatic speech recognition,
interactive maps, language memory of regions

Abstract

Language is a fundamental connecting element of every nation and its territorial
dialects are an important part of regional identity. In the modern world,
dialects are gradually disappearing, their variability is diminishing and they
are gradually assimilating into the language represented by the mainstream media
and the Internet. Due to the significant costs of acquiring and annotating
training language data, the dialects have virtually zero support in modern
artificial intelligence (AI) and machine learning (ML) technologies, represented
mainly by automatic speech recognition (ASR). In Czechia, the dialectology
department of the Czech Language Institute of the Czech Academy of Sciences
(ÚJČ AV ČR) systematically researches colloquial phenomena of the Czech
national language and is dedicated to the study of dialects. However, ÚJČ lacks any
modern technology for automatic processing, storage, documentation and
presentation of dialects. Also, the outputs of the dialectology department are
available primarily to the scientific community; there is a lack of modern
interactive web applications or services that could be used by the general
public. The project, proposed by ASR specialists (BUT), dialectologists (ÚJČ) and
interactive map imaging experts (UPOL), aims to adapt existing technologies and
develop new procedures for automatic processing, storage, documentation and
presentation of Czech language dialects. A detailed methodology for the transfer
of structured knowledge from dialectology to machine learning (where work with
data is dominant) will be developed. The existing Archive of Sound Recordings of
Dialect Speech (built in ÚJČ from 1952 to the present and containing over 750
hours of recordings) will be supplemented with metadata and prepared for machine
learning. As a prerequisite, we will develop software that detects dialects
from audio recordings.

Team members

Karafiát Martin, Ing., Ph.D. (DCGM) – research leader
Kocour Martin, Ing. (DCGM)
Kotolan Martin (DFIT-ISD)
Plchot Oldřich, Ing., Ph.D. (DCGM)
Sedláček Šimon, Ing. (DCGM)
Yusuf Bolaji (DCGM)
Žižka Josef, Ing. (DCGM)

Publications

2024

BENEŠ, K.; KOCOUR, M.; BURGET, L. Hystoc: Obtaining Word Confidences for Fusion of End-To-End ASR Systems. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. Seoul: IEEE Signal Processing Society, 2024. p. 11276-11280. ISBN: 979-8-3503-4485-1.

2023

MATĚJKA, P.; SILNOVA, A.; SLAVÍČEK, J.; MOŠNER, L.; PLCHOT, O.; KLČO, M.; PENG, J.; STAFYLAKIS, T.; BURGET, L. Description and Analysis of ABC Submission to NIST LRE 2022. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. Proceedings of Interspeech. Dublin: International Speech Communication Association, 2023. p. 511-515. ISSN: 1990-9772.

Products

2025

Methodology for Transferring Structured Knowledge from Dialectology into Machine Learning, realized certified methodology, 2025
Authors: ŠIMEČKOVÁ, M.; STUPŇÁNEK, B.; KARAFIÁT, M.; VONDRÁKOVÁ, A.; VOŽENÍLEK, V.; NÉTEK, R.

2024

Automatic Dialect Detector Based on Audio Recording, software, 2024
Authors: PLCHOT, O.; ODEHNAL, O.; KARAFIÁT, M.; ŽIŽKA, J.; ŠIMEČKOVÁ, M.