Result detail

A Comparative Study of Text Retrieval Models on DaReCzech

ŠTĚTINA, J.; FAJČÍK, M.; HRADIŠ, M.; ŠTEFÁNIK, M. A Comparative Study of Text Retrieval Models on DaReCzech. Recent Advances in Slavonic Natural Language Processing, 2024, no. 7, p. 85-100.
Type
journal article
Language
English
Authors
Štětina Jakub, Bc., FIT (FIT)
Fajčík Martin, Ing., Ph.D., UPGM (FIT)
Hradiš Michal, Ing., Ph.D., UAMT (FEKT), UPGM (FIT)
Štefánik Michal
Abstract

This article presents a comprehensive evaluation of off-the-shelf document retrieval models (SPLADE, PLAID, PLAID-X, SimCSE, Contriever, OpenAI ADA, and Gemma2) on the Czech retrieval dataset DaReCzech. The primary objective of our experiments is to estimate the quality of modern retrieval approaches in the Czech language. Our analyses cover retrieval quality, speed, and memory footprint. Secondly, we analyze whether it is better to apply a model directly to Czech text or to machine-translate the text into English and then retrieve in English. Our experiments identify the most effective option for Czech information retrieval. The findings reveal notable performance differences among the models, with Gemma2 achieving the highest precision and recall and Contriever performing poorly. Overall, the SPLADE and PLAID models offer a balance of efficiency and performance.

Keywords

Information Retrieval; Evaluation; Comparison; Czech Language; Performance Assessment; Document Retrieval; Model Analysis

Year
2024
Pages
85–100
Journal
Recent Advances in Slavonic Natural Language Processing, no. 7, ISSN 2336-4289
BibTeX
@article{BUT193747,
  author="Jakub {Štětina} and Martin {Fajčík} and Michal {Hradiš} and Michal {Štefánik}",
  title="A Comparative Study of Text Retrieval Models on DaReCzech",
  journal="Recent Advances in Slavonic Natural Language Processing",
  year="2024",
  number="7",
  pages="85--100",
  issn="2336-4289",
  url="https://www.scopus.com/inward/record.uri?eid=2-s2.0-105004656879&partnerID=40&md5=952c98d094a89128a0781f11c0c3b8b8"
}
Projects
Multilingual and cross-cultural interactions for context-aware and bias-controlled dialogue systems for safety-critical applications, EU, HORIZON EUROPE, start: 2024-01-01, end: 2026-12-31, in progress
semANT - Semantic Explorer of Textual Cultural Heritage, MK, NAKI III – programme to support applied research in the area of national and cultural identity for the years 2023 to 2030, DH23P03OVV060, start: 2023-03-01, end: 2027-12-31, in progress