Project Details

EOARD - Improving the capacity of language recognition systems to handle rare languages using radio broadcast data

Project Period: 15. 10. 2008 – 14. 12. 2010

Project Type: contract

Czech title
EOARD - Zlepšení schopnosti detekce méně známých jazyků systémy pro automatickou identifikaci jazyka s použitím rozhlasových dat
Keywords

language recognition, broadcast data

Abstract

Current situation in language recognition
The most recent NIST Language Recognition Evaluations (LRE) have shown
substantial improvement in the performance of LRE systems. Both acoustic and
phonotactic approaches have reached a certain maturity, both in the actual
modeling of target languages and in coping with the adverse influence of
changing channels. There are several ways to further improve current LRE
systems, and some of them were investigated in the Brno University of
Technology (BUT) 2007 submission to this evaluation, for example:

  • discriminative training and channel compensation techniques
    for both acoustic and phonotactic modeling.
  • use of large vocabulary
    continuous speech recognition (LVCSR) followed by confidence measures.

However, with all this beautiful science, we still face the old problem of
training and testing any recognizer: the lack of data. While it is easy to
train and test an LRE system for languages with established speech and
language resources, such as English or Mandarin, rare languages lack these
standard resources. Consider the example of Thai: this language is spoken by 65
million speakers, but for the NIST 2007 LRE evaluation we had less than 2 hours
of data at our disposal, distributed by NIST as part of the development package.
We contacted several Thai speech processing labs, but a large spontaneous
telephone database for this language simply does not exist.

The proposed solution

This proposal aims to fill this gap by using data acquired from public
sources, namely radio broadcasts. This approach (which is fairly intuitive, and
we do not claim that Speech@FIT is the only place to have had this idea) should
provide us with plenty of data, which we believe will lead to:

  • improved performance for known languages.
  • ability to process
    languages that were so far excluded because of unavailability of data.

This approach is, however, far from "we record the data, push a button
and will have a much better LRE system within a month". It requires a significant
amount of work, especially on data selection and channel normalization.


Team members
Burget Lukáš, doc. Ing., Ph.D. (DCGM) – research leader
Černocký Jan, prof. Dr. Ing. (DCGM)
Hubeika Valiantsina, Ing.
Matějka Pavel, Ing., Ph.D.
Plchot Oldřich, Ing., Ph.D. (DCGM)
Schwarz Petr, Ing., Ph.D. (DCGM)