Thesis Details

Inteligentní extrakce dat ve webovém prohlížeči

Bachelor's Thesis
Student: Maštera František
Academic Year: 2020/2021
Supervisor: Burget Radek, doc. Ing., Ph.D.

English title

Intelligent Data Scraping in a Web Browser

Language

Czech

Abstract

The goal of this thesis is to extract data from web pages without prior knowledge of their internal structure. An algorithm recognizes the page structure from user-supplied information about the content to be extracted, and the content extraction itself follows the structure analysis. An average success rate of over 80% was achieved on selected sets of websites. The resulting algorithm represents a new approach to data extraction; it can be deployed in practice or serve as a basis for further development.

Keywords

Document processing, data extraction, document structure recognition, web, TypeScript, Puppeteer

Department

Department of Information Systems FIT BUT

Degree Programme

Information Technology

Status

defended, grade A

Date

16 June 2021

Reviewer

Bartík Vladimír, Ing., Ph.D.

Committee

Smrž Pavel, doc. RNDr., Ph.D. (DCGM FIT BUT), chair
Burgetová Ivana, Ing., Ph.D. (DIFS FIT BUT), member
Kreslíková Jitka, doc. RNDr., CSc. (DIFS FIT BUT), member
Peringer Petr, Dr. Ing. (DITS FIT BUT), member
Strnadel Josef, Ing., Ph.D. (DCSY FIT BUT), member

Citation

MAŠTERA, František. Inteligentní extrakce dat ve webovém prohlížeči. Brno, 2021. Bachelor's Thesis. Brno University of Technology, Faculty of Information Technology. 2021-06-16. Supervised by Burget Radek. Available from: https://www.fit.vut.cz/study/thesis/23533/

BibTeX

@bachelorsthesis{FITBT23533,
    author = "Franti\v{s}ek Ma\v{s}tera",
    type = "Bachelor's thesis",
    title = "Inteligentn\'{i} extrakce dat ve webov\'{e}m prohl\'{i}\v{z}e\v{c}i",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2021,
    location = "Brno, CZ",
    language = "czech",
    url = "https://www.fit.vut.cz/study/thesis/23533/"
}
