Dissertation Topic

Machine Learning for Information Identification on the Web

Academic Year: 2024/2025

Supervisor: Burget Radek, doc. Ing., Ph.D.

Department: Department of Information Systems

Programs:
Information Technology (DIT) - full-time study
Information Technology (DIT) - combined study
Information Technology (DIT-EN) - full-time study
Information Technology (DIT-EN) - combined study

Although technologies exist for publishing data on the WWW in machine-readable form (such as JSON-LD or RDFa), a large amount of structured data is still published on the web as plain HTML/CSS code, which greatly limits its further use.

Recently, new machine learning methods (especially deep learning methods) have been gaining importance and show interesting results, e.g., in recognizing important entities in weakly structured or unstructured data (such as text or images). However, web document processing has not received much attention from this perspective. Existing work deals with the identification of simple data items and neglects structured data and more complex usage scenarios.

The goal of this topic is to analyze and develop web content models suitable as input for machine learning and, at the same time, machine learning methods suitable for recognizing structured data in web documents.
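
As a purely illustrative sketch of what a "web content model suitable as input for machine learning" might look like, the following Python snippet turns each text node of an HTML fragment into a small feature record. The feature set (tag, CSS class, depth, tag path, digit ratio), the class name NodeFeatureExtractor, and the sample markup are assumptions made for this example only; they are not prescribed by the topic, and a real model would likely also use rendered visual properties.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NodeFeatureExtractor(HTMLParser):
    """Collects one feature dict per non-empty text node (hypothetical model)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open (tag, css_class) pairs, root first
        self.nodes = []  # extracted feature records

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        tag, css_class = self.stack[-1]
        self.nodes.append({
            "text": text,
            "tag": tag,
            "css_class": css_class,
            "depth": len(self.stack),
            "path": "/".join(t for t, _ in self.stack),
            # Share of digit characters: a crude hint for prices, dates, etc.
            "digit_ratio": sum(c.isdigit() for c in text) / len(text),
        })


def extract_features(html):
    """Return a list of feature records, one per text node in the fragment."""
    parser = NodeFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes


# Assumed sample markup: a product record published as plain HTML.
SAMPLE = ('<div class="product"><h2 class="name">Desk lamp</h2>'
          '<span class="price">24.90</span></div>')
features = extract_features(SAMPLE)
```

Records of this kind could then be fed to a classifier that labels each node (for example as a name or a price). Recognizing that the two nodes belong to one structured record is the harder problem the topic targets.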