Thesis Details

Inteligentní extrakce dat ve webovém prohlížeči

Bachelor's Thesis
Student: Maštera František
Academic Year: 2020/2021
Supervisor: Burget Radek, doc. Ing., Ph.D.
English title
Intelligent Data Scraping in a Web Browser
Language
Czech
Abstract

The goal of this thesis is to extract data from web pages without prior knowledge of their internal structure. An algorithm recognizes the structure from user-supplied input describing the content to be extracted; the content extraction itself follows this structure analysis. An average success rate of over 80% was achieved on selected sets of websites. The resulting algorithm represents a new approach to data extraction and can be deployed in practice or serve as a basis for further development.

Keywords

Document processing, data extraction, document structure recognition, web, TypeScript, Puppeteer

Degree Programme
Information Technology
Status
defended, grade A
Date
16 June 2021
Reviewer
Committee
Smrž Pavel, doc. RNDr., Ph.D. (DCGM FIT BUT), chair
Burgetová Ivana, Ing., Ph.D. (DIFS FIT BUT), member
Kreslíková Jitka, doc. RNDr., CSc. (DIFS FIT BUT), member
Peringer Petr, Dr. Ing. (DITS FIT BUT), member
Strnadel Josef, Ing., Ph.D. (DCSY FIT BUT), member
Citation
MAŠTERA, František. Inteligentní extrakce dat ve webovém prohlížeči. Brno, 2021. Bachelor's Thesis. Brno University of Technology, Faculty of Information Technology. 2021-06-16. Supervised by Burget Radek. Available from: https://www.fit.vut.cz/study/thesis/23533/
BibTeX
@bachelorsthesis{FITBT23533,
    author = "Franti\v{s}ek Ma\v{s}tera",
    type = "Bachelor's thesis",
    title = "Inteligentn\'{i} extrakce dat ve webov\'{e}m prohl\'{i}\v{z}e\v{c}i",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2021,
    location = "Brno, CZ",
    language = "czech",
    url = "https://www.fit.vut.cz/study/thesis/23533/"
}