FIT team helps build a unique dialect map

A team from the Faculty of Information Technology at Brno University of Technology (FIT BUT), led by Martin Karafiát, is involved in a unique dialect mapping project. In cooperation with the Academy of Sciences of the Czech Republic and Palacký University in Olomouc, they are creating a website where you can select a region of the Czech Republic and listen to the dialects characteristic of that place. In addition, the project team is categorising the recordings, which go back to the 1950s, according to various criteria, such as the themes of the narratives.

Researchers from the Dialectology Department of the Institute of Czech Language of the Czech Academy of Sciences have long been trying to map and preserve the various dialects across the Czech Republic. In 2023, they enlisted the help of experts from the Speech@FIT group, who are currently working on a system that would be able to identify the dialect and also produce an automatic transcription of the recordings. "Our speech group has had great success in the areas of language identification, speaker identification and speech transcription. So the primary idea is to bring these areas together, work with unique data and create a system that will be able to automatically transcribe the audio data, which will be a huge help to researchers at the Academy of Sciences. Especially because the data is so specific that the standard transcription services from Google or Microsoft fail on it," explains Martin Karafiát from FIT BUT. Marta Šimečková from the Institute of Czech Language of the Czech Academy of Sciences confirms this. "Our aim is to create a set of tools that would make our work as dialectologists easier. On the one hand, this means software that automatically recognises a particular dialect from audio recordings, and on the other hand, software that would transcribe dialectal speech for us. These transcriptions use a special dialectological notation, which differs in many ways from standard written Czech," Šimečková explains.

[img]
The project will offer, among other things, a map with unique recordings of various dialects.

The archive of dialect recordings has been built up since the 1950s, and new data is still being added. "There used to be one reel-to-reel tape recorder in the dialectology department. In addition, tapes were expensive, so they were used sparingly and only short sections were recorded. Today, however, the recordings are left to run for several hours. The data stored on the old sound carriers were digitised in cooperation with Czech Radio, then annotated and catalogued. However, that cataloguing system is now inadequate, so a thorough revision of the recordings was undertaken and a new, modern catalogue was created, in which the data are annotated in a uniform manner. Among other things, it also provides information about their content," says Martin Karafiát.

In the future, anyone interested should be able to easily search the recordings by selected dialect and topic. "We want people to be able to say, for example, that they are interested in how it sounds when someone in Haná talks about baking bread. And the system will immediately offer them such a recording," says Karafiát. According to Marta Šimečková, most traditional dialects have already been mapped. "Especially thanks to the collections that took place in the 1960s and 1970s. The recordings from that time form the core of our sound archive. The only blank spots are the borderlands, an area that has no native dialect of its own and so tended not to be explored in the past. Our main effort will be to add recordings from traditionally dialectal areas, which will make it possible to track some of the shifts in dialects over time," she adds.

Municipal authorities and folklore associations help to find witnesses who still speak the dialect.

Although at first glance the project may not seem like a difficult challenge for the FIT BUT researchers, Martin Karafiát points out the complexity of transcribing dialects. "It is similar to Vietnamese, for example. Vietnamese is also written in the Latin alphabet, but it relies on a set of auxiliary symbols that determine how a particular character should be pronounced," Karafiát explains, adding that they will have to teach the system to record, for example, the so-called enveloped l, which is typical of some dialects in southern Moravia and the Jablunkov region.

The team created a first version of the dialect identification system last autumn. "It can distinguish the four main dialect groups with roughly 90 percent accuracy. But when we divide them into 13 subgroups, the success rate is only around 60 percent. It will improve over time, because the system has not yet been trained on data from the Academy of Sciences. Our neural network has been trained on 106 foreign languages, but it has not yet seen dialectal Czech," Martin Karafiát points out.

Within the project, Karafiát himself is focusing on text transcription. Other team members are working on automatic dialect identification and web interface development. In cooperation with Palacký University in Olomouc, they are also working on the creation of a dialect map and visual data processing. "This will be the first map of its kind in the Czech Republic. Users will be able to play dialect samples from different regions and at the same time browse their transcripts. The map will allow the data to be displayed on different backgrounds and the recordings and transcriptions to be searched by various parameters, for example by the topic of the narrative," says Marta Šimečková.

Municipal authorities, schools and folklore associations help researchers in the search for witnesses who still speak the traditional dialect. "Usually we manage to record at least four speakers in each village; we are only rarely turned down. We mainly document the speech of older people, who are often happy that someone is willing to listen to their stories. In addition, it serves a good cause, namely the preservation of linguistic cultural heritage for future generations," adds Marta Šimečková. At the beginning of the project, Martin Karafiát's team was also considering the creation of a chatbot that would have a command of individual dialects and in the future would be able to communicate with users in the dialect they select. "But transcribing the speech well and teaching the system to reliably identify that it is, for example, a Southwest Bohemian or East Moravian dialect, will be enough work in itself," Martin Karafiát concludes with a laugh.

Author: Horná Petra, Mgr.

Last modified: 2024-02-08 10:39:11
