A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages
Čegiň Ján, Ing., DCGM (FIT)
Šimko Jakub, doc. Ing., PhD., DCGM (FIT)
Ostermann Simon
Large Language Models (LLMs) are increasingly used to generate synthetic textual
data for training smaller specialized models. However, a comparison of various
generation strategies for low-resource language settings is lacking. While
various prompting strategies have been proposed (such as demonstrations,
label-based summaries, and self-revision), their comparative effectiveness remains
unclear, especially for low-resource languages. In this paper, we systematically
evaluate the performance of these generation strategies and their combinations
across 11 typologically diverse languages, including several extremely
low-resource ones. Using three NLP tasks and four open-source LLMs, we assess
downstream model performance on generated versus gold-standard data. Our results
show that strategic combinations of generation methods, particularly
target-language demonstrations with LLM-based revisions, yield strong
performance, narrowing the gap with real data to as little as 5% in some
settings. We also find that smart prompting techniques can reduce the advantage
of larger LLMs, highlighting efficient strategies for synthetic data
generation in low-resource scenarios with smaller models.
multilingual evaluation, less-resourced languages, model analysis, synthetic data
generation
@inproceedings{BUT198568,
author="{} and Ján {Čegiň} and Jakub {Šimko} and {}",
title="A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages",
year="2025",
pages="8293--8314",
publisher="Association for Computational Linguistics",
address="Suzhou, China",
doi="10.18653/v1/2025.emnlp-main.418",
isbn="979-8-89176-332-6",
url="https://aclanthology.org/2025.emnlp-main.418/"
}