Result Details

A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages

ANIKINA, T.; ČEGIŇ, J.; ŠIMKO, J.; OSTERMANN, S. A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages. Suzhou, China: Association for Computational Linguistics, 2025. p. 8293-8314. ISBN: 979-8-89176-332-6.

Type

conference paper

Language

English

Authors

Anikina Tatiana
Čegiň Ján, Ing., DCGM (FIT)
Šimko Jakub, doc. Ing., PhD., DCGM (FIT)
Ostermann Simon

Abstract

Large Language Models (LLMs) are increasingly used to generate synthetic textual
data for training smaller specialized models. However, a comparison of various
generation strategies for low-resource language settings is lacking. While
various prompting strategies have been proposed, such as demonstrations,
label-based summaries, and self-revision, their comparative effectiveness remains
unclear, especially for low-resource languages. In this paper, we systematically
evaluate the performance of these generation strategies and their combinations
across 11 typologically diverse languages, including several extremely
low-resource ones. Using three NLP tasks and four open-source LLMs, we assess
downstream model performance on generated versus gold-standard data. Our results
show that strategic combinations of generation methods, particularly
target-language demonstrations with LLM-based revisions, yield strong
performance, narrowing the gap with real data to as little as 5% in some
settings. We also find that smart prompting techniques can reduce the advantage
of larger LLMs, highlighting efficient generation strategies for synthetic data
generation in low-resource scenarios with smaller models.

Keywords

multilingual evaluation, less-resourced languages, model analysis, synthetic data
generation

URL

https://aclanthology.org/2025.emnlp-main.418/

Published

2025

Pages

8293–8314

Conference

Conference on Empirical Methods in Natural Language Processing

ISBN

979-8-89176-332-6

Publisher

Association for Computational Linguistics

Place

Suzhou, China

DOI

10.18653/v1/2025.emnlp-main.418

BibTeX

@inproceedings{BUT198568,
  author="Tatiana {Anikina} and Ján {Čegiň} and Jakub {Šimko} and Simon {Ostermann}",
  title="A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages",
  year="2025",
  pages="8293--8314",
  publisher="Association for Computational Linguistics",
  address="Suzhou, China",
  doi="10.18653/v1/2025.emnlp-main.418",
  isbn="979-8-89176-332-6",
  url="https://aclanthology.org/2025.emnlp-main.418/"
}

Departments

Department of Computer Graphics and Multimedia (DCGM)