Result detail

LLMs vs Established Text Augmentation Techniques for Classification: When do the Benefits Outweight the Costs?

ČEGIŇ, J.; ŠIMKO, J. LLMs vs Established Text Augmentation Techniques for Classification: When do the Benefits Outweight the Costs?. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Albuquerque, New Mexico: Association for Computational Linguistics, 2025. p. 10476-10496. ISBN: 979-8-8917-6189-6.
Type
article in conference proceedings
Language
English
Authors
Čegiň Ján, Ing., UPGM (FIT)
Šimko Jakub, doc. Ing., PhD., UPGM (FIT)
and others
Abstract

Generative large language models (LLMs) are increasingly being used for data
augmentation tasks, where text samples are LLM-paraphrased and then used for
classifier fine-tuning. Previous studies have compared LLM-based augmentations
with established augmentation techniques, but the results are contradictory:
some report the superiority of LLM-based augmentations, while others report
only marginal increases (and even decreases) in the performance of downstream
classifiers. Research that would confirm a clear cost-benefit advantage of
LLMs over more established augmentation methods is largely missing. To study
if (and when) LLM-based augmentation is advantageous, we compared the effects
of recent LLM augmentation methods with established ones on 6 datasets,
3 classifiers and 2 fine-tuning methods. We also varied the number of seeds
and collected samples to better explore the downstream model accuracy space.
Finally, we performed a cost-benefit analysis and show that LLM-based methods
are worthy of deployment only when a very small number of seeds is used.
Moreover, in many cases, established methods lead to similar or better model
accuracies.
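
To illustrate the two augmentation styles the abstract contrasts, below is a minimal, hypothetical Python sketch: a toy "established" technique (synonym replacement) and an LLM-based paraphrasing step, each used to expand a small seed set before classifier fine-tuning. The seed sentences, the synonym table and the placeholder llm_paraphrase helper are assumptions for illustration only, not the authors' implementation or data.

import random

# Tiny hypothetical seed set: label -> seed texts (illustration only).
seeds = {
    "positive": ["the movie was a delight from start to finish"],
    "negative": ["the plot dragged and the acting felt flat"],
}

# Toy synonym table standing in for an established augmentation resource.
SYNONYMS = {
    "movie": ["film", "picture"],
    "plot": ["story", "storyline"],
    "delight": ["joy", "pleasure"],
    "flat": ["lifeless", "dull"],
}

def synonym_augment(text: str) -> str:
    """Established-style augmentation: swap known words for random synonyms."""
    return " ".join(random.choice(SYNONYMS.get(w, [w])) for w in text.split())

def llm_paraphrase(text: str) -> str:
    """LLM-style augmentation: placeholder for a chat-completion call that
    asks a model to paraphrase `text` while preserving its label."""
    raise NotImplementedError("plug in an LLM client here")

def augment(texts, method, n_per_seed=2):
    """Collect n_per_seed augmented samples per seed text; the result would
    then be added to the classifier's fine-tuning set."""
    return [method(t) for t in texts for _ in range(n_per_seed)]

if __name__ == "__main__":
    for label, texts in seeds.items():
        print(label, augment(texts, synonym_augment))
        # augment(texts, llm_paraphrase) would build the LLM-based variant.

The cost-benefit question studied in the paper then amounts to whether the extra expense of the LLM calls buys enough downstream accuracy over the cheap, local alternative.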

Keywords

data-efficient training, data augmentation, analysis

URL
https://aclanthology.org/2025.naacl-long.526/
Year
2025
Pages
10476–10496
Proceedings
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Conference
2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics
ISBN
979-8-8917-6189-6
Publisher
Association for Computational Linguistics
Place
Albuquerque, New Mexico
DOI
10.18653/v1/2025.naacl-long.526
BibTeX
@inproceedings{BUT193745,
  author="Ján {Čegiň} and Jakub {Šimko}",
  title="LLMs vs Established Text Augmentation Techniques for Classification: When do the Benefits Outweight the Costs?",
  booktitle="Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
  year="2025",
  pages="10476--10496",
  publisher="Association for Computational Linguistics",
  address="Albuquerque, New Mexico",
  doi="10.18653/v1/2025.naacl-long.526",
  isbn="979-8-8917-6189-6",
  url="https://aclanthology.org/2025.naacl-long.526/"
}
Department
UPGM (FIT)