Dissertation Topic

Human-AI collaboration in dataset creation

Academic Year: 2024/2025

Supervisor: Šimko Jakub, doc. Ing., PhD.

Programs:
Information Technology (DIT) - combined study
Information Technology (DIT-EN) - combined study

Machine learning models can only be as good as the data on which they are trained. Researchers and practitioners thus strive to provide their training processes with the best data possible. It is not uncommon to invest considerable human effort upfront to achieve good overall data quality (e.g. through annotation). Yet sometimes, upfront dataset preparation cannot be done properly, sufficiently or at all.

In such cases, so-called human-in-the-loop solutions employ human effort to improve machine-learned models through actions taken during the training process and/or during the deployment of the models (e.g. user feedback on automated translations). They are particularly useful for surgical improvements of training data through the identification and resolution of border cases.

Human-in-the-loop approaches draw from a wide palette of techniques, including active and interactive learning, human computation, and crowdsourcing (also with motivation schemes such as gamification and serious games). With the recent emergence of large language models (LLMs), the original human-in-the-loop techniques can be further boosted to create extensive synthetic training sets with comparatively little human effort.

Human-in-the-loop approaches are applied predominantly in domains with highly heterogeneous and volatile data. Such domains include online false information detection, online information spreading (including the spreading of narratives or memes), auditing of social media algorithms and their tendency to spread disinformation, support of manual/automated fact-checking, and more.

Relevant publications:

Cegin, J., Simko, J. and Brusilovsky, P., 2023. ChatGPT to Replace Crowdsourcing of Paraphrases for Intent Classification: Higher Diversity and Comparable Model Robustness. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. https://arxiv.org/pdf/2305.12947.pdf
Šimko, J. and Bieliková, M. Semantic Acquisition Games: Harnessing Manpower for Creating Semantics. 1st Edition. Springer Int. Publ. Switzerland. 150 p. https://link.springer.com/book/10.1007/978-3-319-06115-3

The research will be performed at the Kempelen Institute of Intelligent Technologies (KInIT, https://kinit.sk) in Bratislava in cooperation with industrial partners or researchers from highly respected research units. A combined (external) form of study and full employment at KInIT are expected.