Result detail

A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages

ANIKINA, T.; ČEGIŇ, J.; ŠIMKO, J.; OSTERMANN, S. A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages. Suzhou, China: Association for Computational Linguistics, 2025. p. 8293-8314. ISBN: 979-8-89176-332-6.
Type
conference proceedings paper
Language
English
Authors
Anikina Tatiana
Čegiň Ján, Ing., UPGM (FIT)
Šimko Jakub, doc. Ing., PhD., UPGM (FIT)
Ostermann Simon
Abstract

Large Language Models (LLMs) are increasingly used to generate synthetic textual
data for training smaller specialized models. However, a comparison of various
generation strategies for low-resource language settings is lacking. While
various prompting strategies have been proposed (such as demonstrations,
label-based summaries, and self-revision), their comparative effectiveness remains
unclear, especially for low-resource languages. In this paper, we systematically
evaluate the performance of these generation strategies and their combinations
across 11 typologically diverse languages, including several extremely
low-resource ones. Using three NLP tasks and four open-source LLMs, we assess
downstream model performance on generated versus gold-standard data. Our results
show that strategic combinations of generation methods, particularly
target-language demonstrations with LLM-based revisions, yield strong
performance, narrowing the gap with real data to as little as 5% in some
settings. We also find that smart prompting techniques can reduce the advantage
of larger LLMs, highlighting efficient generation strategies for synthetic data
generation in low-resource scenarios with smaller models.

Keywords

multilingual evaluation, less-resourced languages, model analysis, synthetic data
generation

URL
https://aclanthology.org/2025.emnlp-main.418/
Year
2025
Pages
8293–8314
Conference
Conference on Empirical Methods in Natural Language Processing
ISBN
979-8-89176-332-6
Publisher
Association for Computational Linguistics
Place
Suzhou, China
DOI
10.18653/v1/2025.emnlp-main.418
BibTeX
@inproceedings{BUT198568,
  author="Tatiana {Anikina} and Ján {Čegiň} and Jakub {Šimko} and Simon {Ostermann}",
  title="A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages",
  year="2025",
  pages="8293--8314",
  publisher="Association for Computational Linguistics",
  address="Suzhou, China",
  doi="10.18653/v1/2025.emnlp-main.418",
  isbn="979-8-89176-332-6",
  url="https://aclanthology.org/2025.emnlp-main.418/"
}