Result Details

A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages

ANIKINA, T.; ČEGIŇ, J.; ŠIMKO, J.; OSTERMANN, S. A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages. Suzhou, China: Association for Computational Linguistics, 2025. p. 8293-8314. ISBN: 979-8-89176-332-6.
Type
conference paper
Language
English
Authors
Anikina Tatiana
Čegiň Ján, Ing., DCGM (FIT)
Šimko Jakub, doc. Ing., PhD., DCGM (FIT)
Ostermann Simon
Abstract

Large Language Models (LLMs) are increasingly used to generate synthetic textual
data for training smaller specialized models. However, a comparison of various
generation strategies for low-resource language settings is lacking. While
various prompting strategies have been proposed (such as demonstrations,
label-based summaries, and self-revision), their comparative effectiveness remains
unclear, especially for low-resource languages. In this paper, we systematically
evaluate the performance of these generation strategies and their combinations
across 11 typologically diverse languages, including several extremely
low-resource ones. Using three NLP tasks and four open-source LLMs, we assess
downstream model performance on generated versus gold-standard data. Our results
show that strategic combinations of generation methods, particularly
target-language demonstrations with LLM-based revisions, yield strong
performance, narrowing the gap with real data to as little as 5% in some
settings. We also find that smart prompting techniques can reduce the advantage
of larger LLMs, highlighting efficient generation strategies for synthetic data
generation in low-resource scenarios with smaller models.

Keywords

multilingual evaluation, less-resourced languages, model analysis, synthetic data
generation

URL
https://aclanthology.org/2025.emnlp-main.418/
Published
2025
Pages
8293–8314
Conference
Conference on Empirical Methods in Natural Language Processing
ISBN
979-8-89176-332-6
Publisher
Association for Computational Linguistics
Place
Suzhou, China
DOI
10.18653/v1/2025.emnlp-main.418
BibTeX
@inproceedings{BUT198568,
  author="Tatiana {Anikina} and Ján {Čegiň} and Jakub {Šimko} and Simon {Ostermann}",
  title="A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages",
  year="2025",
  pages="8293--8314",
  publisher="Association for Computational Linguistics",
  address="Suzhou, China",
  doi="10.18653/v1/2025.emnlp-main.418",
  isbn="979-8-89176-332-6",
  url="https://aclanthology.org/2025.emnlp-main.418/"
}