Result detail

Study of Large Data Resources for Multilingual Training and System Porting

GRÉZL, F.; EGOROVA, E.; KARAFIÁT, M. Study of Large Data Resources for Multilingual Training and System Porting. In Procedia Computer Science. Yogyakarta: Elsevier Science, 2016. no. 81, p. 15-22. ISSN: 1877-0509.
Type
conference proceedings paper
Language
English
Authors
Grézl František, Ing., Ph.D., UPGM (FIT)
Egorova Ekaterina, Ing., Ph.D., UPGM (FIT)
Karafiát Martin, Ing., Ph.D., UPGM (FIT)
Abstract

This study investigates the behavior of a feature extraction neural network (NN) model trained on a large amount of single-language data ("source language") on a set of under-resourced target languages. The coverage of the source language acoustic space was changed in two ways: (1) by changing the amount of training data and (2) by altering the level of detail of acoustic units (by changing the triphone clustering). We observe the effect of these changes on target-language performance in two scenarios: (1) the source-language NNs were used directly, and (2) the NNs were first ported to the target language.

The results show that increasing the coverage as well as the level of detail on the source language improves target-language system performance in both scenarios. In the first scenario, both source-language characteristics have about the same effect. In the second scenario, the amount of source-language data is more important than the level of detail.

The possibility of including large data in a multilingual training set was also investigated. Our experiments point to a possible risk of over-weighting the NNs towards the source language with large data. This degrades performance on some of the target languages compared to the setting where the amounts of data per language are balanced.

Keywords

Stacked Bottle-Neck; feature extraction; multilingual training; large data; Fisher database

URL
http://www.sciencedirect.com/science/article/pii/S1877050916300382
Year
2016
Pages
15–22
Journal
Procedia Computer Science, vol. 2016, no. 81, ISSN 1877-0509
Proceedings
Procedia Computer Science
Conference
The 5th International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU'16)
Publisher
Elsevier Science
Place
Yogyakarta
DOI
10.1016/j.procs.2016.04.024
UT WoS
000387446500002
BibTeX
@inproceedings{BUT130953,
  author="František {Grézl} and Ekaterina {Egorova} and Martin {Karafiát}",
  title="Study of Large Data Resources for Multilingual Training and System Porting",
  booktitle="Procedia Computer Science",
  year="2016",
  journal="Procedia Computer Science",
  volume="2016",
  number="81",
  pages="15--22",
  publisher="Elsevier Science",
  address="Yogyakarta",
  doi="10.1016/j.procs.2016.04.024",
  issn="1877-0509",
  url="http://www.sciencedirect.com/science/article/pii/S1877050916300382"
}
Projects
Big Speech Data Analytics for Contact Centers, EU, Horizon 2020, start: 2015-01-01, end: 2017-12-31, completed
IARPA Building speech recognizers for keyword search in a new language with limited training data within a week (BABEL) - Babelon, BBN, start: 2012-03-05, end: 2016-11-04, completed
Meeting assistant (MINT), TAČR, ALFA Programme of Applied Research and Experimental Development, TA04011311, start: 2014-10-01, end: 2017-12-31, completed