Thesis Details

Metody extrakce dat z webových stránek (Methods of Data Extraction from Web Pages)

Bachelor's Thesis
Student: Perina Lukáš
Academic Year: 2020/2021
Supervisor: Burget Radek, doc. Ing., Ph.D.

Language

Slovak

Abstract

This bachelor's thesis designs the architecture of, and then implements, an application for extracting data from web documents (web scraping). Unlike conventional methods, the extraction is based on defining the data types and regular expressions of the requested elements. As a result, the detailed structure of a given web document does not need to be known, and a single definition can be used to detect the requested elements on different web pages. The algorithm achieves an overall accuracy of 85.51% and a recall of 80.28%. This approach can significantly reduce the time required to analyze web pages, and the structure of the page code no longer has to be a determining factor when creating web scraping requests.
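The idea of describing requested elements by data type and regular expression, rather than by their position in the page structure, can be illustrated with a minimal sketch. This is not the thesis implementation: the `definitions` array, the `extract` function, the patterns, and the sample pages are all hypothetical, and the text fragments are assumed to have already been collected from the page's text nodes (for example with a headless browser such as Puppeteer).

```javascript
// Hypothetical definitions: each requested element is described by a data
// type and a regular expression, not by a CSS selector or a DOM path.
const definitions = [
  { name: "price", type: "number", pattern: /\d+(?:[.,]\d{2})?\s?(?:€|EUR|Kč)/ },
  { name: "date", type: "date", pattern: /\b\d{1,2}\.\s?\d{1,2}\.\s?\d{4}\b/ },
];

// Test every text fragment against every definition and collect the matches.
// The fragments are assumed to come from the page's text nodes; how they are
// gathered is outside the scope of this sketch.
function extract(textNodes, defs) {
  const results = [];
  for (const text of textNodes) {
    for (const def of defs) {
      const match = def.pattern.exec(text);
      if (match) {
        results.push({ name: def.name, type: def.type, value: match[0] });
      }
    }
  }
  return results;
}

// Two made-up pages with different layouts, processed with the same definitions.
const pageA = ["Special offer", "Price: 129,90 Kč", "Published 16. 6. 2021"];
const pageB = ["Now only 19.99 €", "Footer"];

const fromA = extract(pageA, definitions);
const fromB = extract(pageB, definitions);
```

Because the definitions refer only to the content of the text, the same `definitions` array is applied unchanged to both sample pages, which is the property the abstract describes as using one definition across different web pages.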

Keywords

Web scraping, JavaScript, Node.js, Google Chrome, Chromium, JSON, data extraction, scraping, web, DOM, CSS, HTML, Puppeteer

Department

Department of Information Systems FIT BUT

Degree Programme

Information Technology

Status

defended, grade B

Date

16 June 2021

Reviewer

Křivka Zbyněk, Ing., Ph.D.

Committee

Smrž Pavel, doc. RNDr., Ph.D. (DCGM FIT BUT), chair
Burgetová Ivana, Ing., Ph.D. (DIFS FIT BUT), member
Kreslíková Jitka, doc. RNDr., CSc. (DIFS FIT BUT), member
Peringer Petr, Dr. Ing. (DITS FIT BUT), member
Strnadel Josef, Ing., Ph.D. (DCSY FIT BUT), member

Citation

PERINA, Lukáš. Metody extrakce dat z webových stránek. Brno, 2021. Bachelor's Thesis. Brno University of Technology, Faculty of Information Technology. 2021-06-16. Supervised by Burget Radek. Available from: https://www.fit.vut.cz/study/thesis/23941/

BibTeX

@bachelorsthesis{FITBT23941,
    author = "Luk\'{a}\v{s} Perina",
    type = "Bachelor's thesis",
    title = "Metody extrakce dat z webov\'{y}ch str\'{a}nek",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2021,
    location = "Brno, CZ",
    language = "slovak",
    url = "https://www.fit.vut.cz/study/thesis/23941/"
}