Day: 7 February 2024
FIT team helps build a unique dialect map
A team from the Faculty of Information Technology at Brno University of Technology (FIT BUT), led by Martin Karafiát, is involved in a unique dialect mapping project. In cooperation with the Academy of Sciences of the Czech Republic and Palacký University in Olomouc, they are creating a website where you can select a region of the Czech Republic and listen to the dialects characteristic of that place. In addition, the project team is categorising the recordings, which go back to the 1950s, according to various criteria, such as the themes of the narratives.
Researchers from the Dialectology Department of the Institute of Czech Language of the Czech Academy of Sciences have long been trying to map and preserve the various dialects across the Czech Republic. In 2023, they enlisted the help of experts from the Speech@FIT group, who are currently working on a system that would be able to identify the dialect and also create an automatic transcription of the recordings. "Our speech group has had great success in the areas of language identification, speaker identification and speech transcription. So the primary idea is to bring these areas together, work with unique data and create a system that will be able to automatically transcribe the audio data, which will be a huge help to researchers at the Academy of Sciences. Especially because the data is specific and the classic transcribers from Google or Microsoft fail," explains Martin Karafiát from FIT BUT. Marta Šimečková from the Institute of Czech Language of the Czech Academy of Sciences confirms this. "Our aim is to create a set of tools that would make our work as dialectologists easier. On the one hand, it is software for automatic recognition of a particular dialect based on audio recordings, and on the other hand, software that would transcribe dialectal speech for us. These are transcriptions in a special dialectological notation, which differs in many ways from ordinary orthographic transcription," Šimečková explains.
The project will offer, among other things, a map with unique recordings of various dialects.
The archive of dialect recordings has been built up since the 1950s, and new data are still being added. "There used to be one reel-to-reel tape recorder in the dialectology department. In addition, tapes were expensive, so they were used sparingly and only short sections were recorded. Today, however, the recordings are left to run for several hours. The data stored on the old sound carriers were digitised in cooperation with Czech Radio, then annotated and catalogued. However, the cataloguing system is now inadequate, so a thorough revision of the recordings was undertaken and a new, modern catalogue was created, in which the data are annotated in a uniform manner. It also provides information about their content, among other things," says Martin Karafiát.
In the future, interested parties should be able to easily search the recordings by selected dialect and topic. "We want people to be able to say, for example, that they are interested in how it sounds when someone in Haná talks about baking bread. And the system will immediately offer them such a recording," says Karafiát. According to Marta Šimečková, most traditional dialects have already been mapped. "Especially thanks to the collections that took place in the 1960s and 1970s. The recordings from that time form the core of our sound archive. The only blank spots are the borderlands, an area with no native dialect, which therefore tended not to be explored in the past. Our main effort will be to add recordings from traditionally dialectal areas, which will make it possible to track some of the shifts in dialects over time," she adds.
The team created a first version of the dialect identification system last autumn. "It can distinguish the four main dialect groups with roughly 90 percent accuracy. But when we divide them into 13 subgroups, the success rate is only around 60 percent. It will improve over time because the system has not yet been trained on data from the Academy of Sciences. Our neural network has been trained on 106 foreign languages, but it has not yet seen dialectal Czech," Martin Karafiát points out.
Within the project, he himself is focusing on text transcription. Other team members are working on automatic dialect identification and web interface development. In cooperation with Palacký University in Olomouc, they are also working on the creation of a dialect map and visual data processing. "This will be the first map of its kind in the Czech Republic. Users will be able to play dialect samples from different regions and at the same time browse their transcripts. The map will allow the data to be displayed on different backgrounds and the recordings and transcriptions to be searched by various parameters, for example by the topic of the narration," says Marta Šimečková.
Author: Horná Petra, Mgr.
Last modified: 2024-02-08T10:39:11