Dissertation Topic

Measuring output quality of large language models

Academic Year: 2024/2025

Supervisor: doc. Ing. Jakub Šimko, PhD.

Programs:
Information Technology (DIT) - combined study
Information Technology (DIT-EN) - combined study

The advent of large language models (LLMs) raises research questions about how to measure the quality and properties of their outputs. Such measures are needed for benchmarking, model improvement, and prompt engineering. Some evaluation techniques pertain to specific domains and scenarios of use (e.g., how accurate are the answers to factual questions in a given domain? how well can the generated answers be used to train a model for a specific task?), while others are more general (e.g., how diverse are the paraphrases generated by an LLM? how easily can it be detected that the content is machine-generated?).
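
As an illustration of one such general measure (an illustrative choice, not part of the topic assignment), the diversity of a paraphrase set is often quantified with lexical statistics such as distinct-n: the ratio of unique n-grams to all n-grams across the generations. A minimal Python sketch, assuming naive whitespace tokenization:

```python
from collections import Counter

def distinct_n(paraphrases: list[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across all paraphrases."""
    ngrams = Counter()
    for text in paraphrases:
        tokens = text.lower().split()  # naive whitespace tokenization (assumption)
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

# Higher values indicate more lexically diverse generations.
outputs = [
    "how do I reset my password",
    "what is the way to reset my password",
    "I forgot my password, can you help",
]
print(f"distinct-2 = {distinct_n(outputs, n=2):.3f}")
```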

Through replication studies, benchmarking experiments, metric design, prompt engineering and other approaches, the candidate will advance the methods and experimental methodologies of LLM output quality measurement. Of particular interest are two general scenarios:

  1. Dataset generation and/or augmentation, where LLMs are prompted with (comparatively small) sets of seeds to create much larger datasets. Such an approach can be very useful when dealing with a domain/task with limited availability of original (labelled) training data (such as disinformation detection); see the first sketch after this list.
  2. Detection of generated content, where stylometric, deep learning-based, statistics-based, or hybrid methods are used to estimate whether a piece of content was generated or modified by a machine. The detection ability is crucial for many real-world scenarios (e.g., detection of disinformation or fraud), but it also feeds back into research methodologies (e.g., detecting the presence of generated content in published datasets or in crowdsourced data); see the second sketch after this list.
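
A minimal sketch of scenario 1: growing a labelled dataset by prompting an LLM with a small set of seeds. Here `call_llm` is a hypothetical stand-in for whatever completion API the candidate would use, and the prompt template, label inheritance, and deduplication step are illustrative assumptions only:

```python
def call_llm(prompt: str) -> list[str]:
    """Hypothetical LLM call returning candidate paraphrases, one per list item."""
    raise NotImplementedError("replace with a real completion API client")

def augment(seeds: list[tuple[str, str]], per_seed: int = 5) -> list[tuple[str, str]]:
    """Expand (text, label) seed pairs into a larger labelled dataset."""
    augmented, seen = [], {text for text, _ in seeds}
    for text, label in seeds:
        prompt = (
            f"Write {per_seed} paraphrases of the following sentence, "
            f"one per line, preserving its meaning:\n{text}"
        )
        for candidate in call_llm(prompt):
            candidate = candidate.strip()
            # keep only novel, non-empty generations; the label is inherited from the seed
            if candidate and candidate not in seen:
                seen.add(candidate)
                augmented.append((candidate, label))
    return seeds + augmented
```

Deduplicating against the seeds is a crude but common safeguard against the model echoing its inputs; in practice, the quality of such datasets is exactly what this topic's evaluation methods would need to measure.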
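
And a minimal sketch of the statistics-based family in scenario 2 (one possible baseline, not the topic's prescribed method): score a text by its mean token log-likelihood under a small proxy language model and apply a threshold, since machine-generated text tends to be more predictable to such a model. The GPT-2 proxy and the threshold value are illustrative assumptions, and a real detector would calibrate the threshold on held-out data:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def mean_log_likelihood(text: str) -> float:
    """Mean per-token log-likelihood of `text` under the proxy model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # passing labels=input_ids makes the model return the mean cross-entropy loss
    loss = model(ids, labels=ids).loss
    return -loss.item()  # higher = more predictable to the proxy model

def looks_generated(text: str, threshold: float = -3.5) -> bool:
    # the threshold here is an uncalibrated placeholder (assumption)
    return mean_log_likelihood(text) > threshold
```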

The candidate will select one of the two general scenarios (but will not be limited to it), identify and refine specific research questions, and answer them experimentally.

Relevant publications:

  • Cegin, J., Simko, J. and Brusilovsky, P., 2023. ChatGPT to Replace Crowdsourcing of Paraphrases for Intent Classification: Higher Diversity and Comparable Model Robustness. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. https://arxiv.org/pdf/2305.12947.pdf
  • Macko, D., Moro, R., Uchendu, A., Lucas, J.S., Yamashita, M., Pikuliak, M., Srba, I., Le, T., Lee, D., Simko, J. and Bielikova, M., 2023. MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. https://arxiv.org/pdf/2310.13606.pdf

The research will be performed at the Kempelen Institute of Intelligent Technologies (KInIT, https://kinit.sk) in Bratislava in cooperation with industrial partners or researchers from highly respected research units. A combined (external) form of study and full employment at KInIT are expected.
