Thesis Details

Metody extrakce dat z webových stránek (Methods of Data Extraction from Web Pages)

Bachelor's Thesis
Student: Perina Lukáš
Academic Year: 2020/2021
Supervisor: Burget Radek, doc. Ing., Ph.D.
Language
Slovak
Abstract

The goal of this bachelor's thesis is to design the architecture of an application for data extraction (web scraping) from web documents and then to implement it. Unlike conventional methods, the extraction is based on defining the data types and regular expressions of the requested elements. It does not require knowledge of the detailed structure of a given web document, and a single definition can be used to detect the requested elements on different web pages. The algorithm achieves an overall accuracy of 85.51% and a recall of 80.28%. This approach can significantly reduce the time required to analyse web pages, and the structure of a page's code no longer has to be a determining factor when creating web scraping requests.
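
To make the idea in the abstract concrete, here is a minimal JavaScript sketch of structure-independent extraction driven by data-type definitions and regular expressions. It is an illustration under stated assumptions, not the thesis's implementation: the `definitions` format, the function names `textFragments` and `extract`, and the sample pages are all hypothetical. Tag stripping stands in for the DOM traversal that a browser-based tool would perform.

```javascript
// Hypothetical illustration only: NOT the implementation described in the thesis.
// Each definition names a data type and the regular expression a value must match.
const definitions = [
  { name: "price", pattern: /^\d+(?:[.,]\d{2})?\s?(?:€|EUR|Kč)$/ },
  { name: "date", pattern: /^\d{1,2}\.\s?\d{1,2}\.\s?\d{4}$/ },
];

// Split markup into visible text fragments without relying on tag names,
// class names, or nesting depth. This is a simplification: a real scraper
// would walk the rendered DOM instead of stripping tags with a regex.
function textFragments(html) {
  return html
    .replace(/<[^>]*>/g, "\n")
    .split("\n")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Return every fragment that matches a definition, labelled with its type.
function extract(html, defs) {
  const results = [];
  for (const text of textFragments(html)) {
    for (const def of defs) {
      if (def.pattern.test(text)) {
        results.push({ type: def.name, value: text });
      }
    }
  }
  return results;
}

// Two made-up pages with different markup, processed with the same definitions.
const pageA =
  '<div class="p"><span>Laptop</span><b>499,90 €</b><i>16. 6. 2021</i></div>';
const pageB =
  "<table><tr><td>Phone</td><td>1299 Kč</td><td>1.2.2020</td></tr></table>";

console.log(extract(pageA, definitions));
console.log(extract(pageB, definitions));
```

Both calls use the same `definitions` array even though one page is built from `div`/`span` elements and the other from a table, which is the property the abstract highlights: one definition, several page structures.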

Keywords

Web scraping, JavaScript, Node.js, Google Chrome, Chromium, JSON, data extraction, scraping, web, DOM, CSS, HTML, Puppeteer

Department
Degree Programme
Information Technology
Status
defended, grade B
Date
16 June 2021
Reviewer
Committee
Smrž Pavel, doc. RNDr., Ph.D. (DCGM FIT BUT), chair
Burgetová Ivana, Ing., Ph.D. (DIFS FIT BUT), member
Kreslíková Jitka, doc. RNDr., CSc. (DIFS FIT BUT), member
Peringer Petr, Dr. Ing. (DITS FIT BUT), member
Strnadel Josef, Ing., Ph.D. (DCSY FIT BUT), member
Citation
PERINA, Lukáš. Metody extrakce dat z webových stránek. Brno, 2021. Bachelor's Thesis. Brno University of Technology, Faculty of Information Technology. 2021-06-16. Supervised by Burget Radek. Available from: https://www.fit.vut.cz/study/thesis/23941/
BibTeX
@bachelorsthesis{FITBT23941,
    author = "Luk\'{a}\v{s} Perina",
    type = "Bachelor's thesis",
    title = "Metody extrakce dat z webov\'{y}ch str\'{a}nek",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2021,
    location = "Brno, CZ",
    language = "slovak",
    url = "https://www.fit.vut.cz/study/thesis/23941/"
}