Thesis Details

Rozšíření Apache Tika o extrakci textu ze souborů průmyslových formátů

Bachelor's Thesis
Student: Rešetár René
Academic Year: 2020/2021
Supervisor: Rychlý Marek, RNDr., Ph.D.
English title
Extension of Apache Tika with Industrial File Formats Text Extraction
Language
Czech
Abstract

The goal of this bachelor's thesis was to extend the parsers of the Apache Tika project to extract data and tables from industrial document formats produced by laboratory instruments. The extracted data are stored in a structured format that conforms to a defined schema. The theoretical part examines the supplied industrial formats, the Apache Tika project, and the options for extending it. In the practical part, a tool was designed and implemented that uses Apache Tika to classify documents, processes them, produces structured data from them in JSON format, and then validates that data. Finally, a set of tests was created to verify and demonstrate the properties of the solution.

Keywords

Java, Apache Tika, Maven, Weka, .arff, JSON, PDF, XLSX, CSV, software, laboratories, control laboratories, paperless laboratories, SVP, pharmaceutical industry, data integrity, Service Provider, structured data, MIME types, data extraction, table extraction

Department
Degree Programme
Information Technology
Status
defended, grade C
Date
14 June 2021
Reviewer
Committee
Kolář Dušan, doc. Dr. Ing. (DIFS FIT BUT), chair
Chudý Peter, doc. Ing., Ph.D. MBA (DCGM FIT BUT), member
Lengál Ondřej, Ing., Ph.D. (DITS FIT BUT), member
Rychlý Marek, RNDr., Ph.D. (DIFS FIT BUT), member
Vašíček Zdeněk, doc. Ing., Ph.D. (DCSY FIT BUT), member
Citation
REŠETÁR, René. Rozšíření Apache Tika o extrakci textu ze souborů průmyslových formátů. Brno, 2021. Bachelor's Thesis. Brno University of Technology, Faculty of Information Technology. 2021-06-14. Supervised by Rychlý Marek. Available from: https://www.fit.vut.cz/study/thesis/23586/
BibTeX
@bachelorsthesis{FITBT23586,
    author = "Ren\'{e} Re\v{s}et\'{a}r",
    type = "Bachelor's thesis",
    title = "Roz\v{s}\'{i}\v{r}en\'{i} Apache Tika o extrakci textu ze soubor\r{u} pr\r{u}myslov\'{y}ch form\'{a}t\r{u}",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2021,
    location = "Brno, CZ",
    language = "czech",
    url = "https://www.fit.vut.cz/study/thesis/23586/"
}