Thesis Details
Extending Apache Tika with Text Extraction from Industrial File Formats (Rozšíření Apache Tika o extrakci textu ze souborů průmyslových formátů)
The goal of this bachelor's thesis was to extend the parsers of the Apache Tika project with data and table extraction for industrial document formats produced by laboratory instruments. The extracted data are stored in a structured format that conforms to a defined schema. The theoretical part examines the supplied industrial formats, the Apache Tika project, and the options for extending it. In the practical part, a tool was designed and implemented that uses Apache Tika to classify documents, processes them, produces structured data from them in JSON format, and then validates that data. Finally, a set of tests was created to verify and demonstrate the properties of the solution.
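The classify, extract, serialize, and validate steps named in the abstract can be illustrated with a minimal Java sketch. It is not the thesis implementation: the class `ExtractionSketch` and all of its methods are hypothetical, the extension lookup in `classify` is only a stand-in for Apache Tika's MIME-type detection, and the "schema" is reduced to a list of required column names. Only the Java standard library is used, and the CSV handling assumes no quoted fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExtractionSketch {

    // Stand-in for Tika's MIME detection (assumption): a plain extension lookup.
    static String classify(String fileName) {
        String lower = fileName.toLowerCase();
        if (lower.endsWith(".csv")) return "text/csv";
        if (lower.endsWith(".pdf")) return "application/pdf";
        return "application/octet-stream";
    }

    // Turn simple comma-separated text (no quoting) into rows keyed by header name.
    static List<Map<String, String>> parseCsv(String text) {
        String[] lines = text.trim().split("\\R");
        String[] header = lines[0].split(",");
        List<Map<String, String>> rows = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            String[] cells = lines[i].split(",", -1); // -1 keeps trailing empty cells
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < header.length; c++) {
                row.put(header[c].trim(), c < cells.length ? cells[c].trim() : "");
            }
            rows.add(row);
        }
        return rows;
    }

    // Escape backslashes and double quotes for a JSON string literal.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Serialize the rows as a JSON array of objects with string values.
    static String toJson(List<Map<String, String>> rows) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < rows.size(); i++) {
            if (i > 0) sb.append(",");
            sb.append("{");
            int k = 0;
            for (Map.Entry<String, String> e : rows.get(i).entrySet()) {
                if (k++ > 0) sb.append(",");
                sb.append('"').append(escape(e.getKey())).append("\":\"")
                  .append(escape(e.getValue())).append('"');
            }
            sb.append("}");
        }
        return sb.append("]").toString();
    }

    // Toy validation: every row must have a non-empty value for each required key.
    static boolean isValid(List<Map<String, String>> rows, List<String> required) {
        for (Map<String, String> row : rows) {
            for (String key : required) {
                String value = row.get(key);
                if (value == null || value.isEmpty()) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String name = "measurement.csv";            // hypothetical input file name
        String content = "sample,ph\nA1,7.2\nA2,6.9\n"; // hypothetical instrument output
        if (classify(name).equals("text/csv")) {
            List<Map<String, String>> rows = parseCsv(content);
            System.out.println(toJson(rows));
            System.out.println("valid: " + isValid(rows, List.of("sample", "ph")));
        }
    }
}
```

In a real setup, the extension lookup would presumably give way to Tika's content-based detection, and the required-key check to validation against a full schema.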
Java, Apache Tika, Maven, Weka, .arff, JSON, PDF, XLSX, CSV, software, laboratories, control laboratories, paperless laboratories, SVP, pharmaceutical industry, data integrity, Service Provider, structured data, MIME types, data extraction, table extraction
Chudý Peter, doc. Ing., Ph.D. MBA (DCGM FIT BUT), member
Lengál Ondřej, Ing., Ph.D. (DITS FIT BUT), member
Rychlý Marek, RNDr., Ph.D. (DIFS FIT BUT), member
Vašíček Zdeněk, doc. Ing., Ph.D. (DCSY FIT BUT), member
@bachelorsthesis{FITBT23586,
    author = "Ren\'{e} Re\v{s}et\'{a}r",
    type = "Bachelor's thesis",
    title = "Roz\v{s}\'{i}\v{r}en\'{i} Apache Tika o extrakci textu ze soubor\r{u} pr\r{u}myslov\'{y}ch form\'{a}t\r{u}",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2021,
    location = "Brno, CZ",
    language = "czech",
    url = "https://www.fit.vut.cz/study/thesis/23586/"
}