Thesis Details
Methods of Data Extraction from Web Pages (Metody extrakce dat z webových stránek)
The purpose of this bachelor's thesis is to design the architecture of, and then implement, an application for data extraction (web scraping) from web documents. Unlike conventional methods, the extraction is based on defining the data types and regular expressions of the requested elements. It does not require detailed knowledge of the structure of a given web document, and a single definition can be used to detect the requested elements on different web pages. The algorithm achieves an overall accuracy of 85.51% and a recall of 80.28%. This approach can significantly reduce the time required to analyze web pages, and the structure of the page code no longer has to be a determining factor when creating web scraping requests.
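The idea of extracting by data type and regular expression, rather than by document structure, can be illustrated with a minimal sketch. It is not the thesis implementation: the `definition` object, the field names, the patterns, and the helpers `coerce`, `textFragments`, and `extract` are all hypothetical, and a crude tag-stripping step stands in for a real DOM obtained from a browser (for example through Puppeteer), so that the example runs in plain Node.js.

```javascript
// Hypothetical definition: each requested element is described only by
// a data type and a regular expression, never by a CSS selector or XPath.
const definition = {
  price: { type: "number", pattern: /(\d+(?:[.,]\d{2})?)\s?(?:€|EUR|Kč)/ },
  date: { type: "date", pattern: /\b(\d{1,2})\.\s?(\d{1,2})\.\s?(\d{4})\b/ },
};

// Convert a regex match into a value of the declared data type.
function coerce(type, match) {
  if (type === "number") return Number(match[1].replace(",", "."));
  if (type === "date") {
    const [, d, m, y] = match;
    return `${y}-${m.padStart(2, "0")}-${d.padStart(2, "0")}`;
  }
  return match[0];
}

// Assumption: a crude replacement for walking the DOM's text nodes.
// It drops script/style blocks and splits the rest on tags.
function textFragments(html) {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, " ")
    .split(/<[^>]+>/)
    .map((s) => s.trim())
    .filter(Boolean);
}

// Return the first match for every field, regardless of where it sits
// in the markup.
function extract(html, def) {
  const result = {};
  for (const fragment of textFragments(html)) {
    for (const [field, { type, pattern }] of Object.entries(def)) {
      if (field in result) continue;
      const match = pattern.exec(fragment);
      if (match) result[field] = coerce(type, match);
    }
  }
  return result;
}

// Two pages with different structure, one shared definition.
const pageA = '<div class="p"><span>Cena: 1299,90 Kč</span><i>12. 3. 2021</i></div>';
const pageB = "<table><tr><td>3.4.2020</td><td><b>45 EUR</b></td></tr></table>";

console.log(extract(pageA, definition));
console.log(extract(pageB, definition));
```

Both sample pages are handled by the same definition even though one uses nested `div`/`span` elements and the other a table, which is the property the abstract emphasizes. A real implementation would also have to deal with several candidate matches per field, which this sketch sidesteps by taking the first one.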
Web scraping, JavaScript, Node.js, Google Chrome, Chromium, JSON, data extraction, scraping, web, DOM, CSS, HTML, Puppeteer
Burgetová Ivana, Ing., Ph.D. (DIFS FIT BUT), member
Kreslíková Jitka, doc. RNDr., CSc. (DIFS FIT BUT), member
Peringer Petr, Dr. Ing. (DITS FIT BUT), member
Strnadel Josef, Ing., Ph.D. (DCSY FIT BUT), member
@bachelorsthesis{FITBT23941, author = "Luk\'{a}\v{s} Perina", type = "Bachelor's thesis", title = "Metody extrakce dat z webov\'{y}ch str\'{a}nek", school = "Brno University of Technology, Faculty of Information Technology", year = 2021, location = "Brno, CZ", language = "slovak", url = "https://www.fit.vut.cz/study/thesis/23941/" }