Information Extraction Tools from CEUR Workshop Pages

Czech title

Nástroje pro extrakci informací ze stránek workshopů CEUR

Type

software

License

Any other entity must always acquire a license in order to use the result

License Fee

The licensor does not require a license fee for the result

Authors

Burget Radek, doc. Ing., Ph.D. (DIFS)
Milička Martin, Ing.

Keywords

information extraction, web mining, document analysis, text classification

Description

This project implements applications and tools for automatic information extraction from the CEUR-WS.org workshop proceedings pages. The tools take the CEUR HTML pages as input and produce a structured linked dataset in RDF format. The implementation is based on the existing FITLayout document analysis framework, with many extensions specific to the given task. The resulting data may be used for evaluating the quality of the individual CEUR workshops. The tools were created as a proposed solution to Task 1 of the Semantic Publishing Challenge 2015, co-located with the Extended Semantic Web Conference 2015, where they received the awards for the Best performing tool and the Most innovative approach. They also serve as a case study that demonstrates the developed document analysis methods.
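
The HTML-to-RDF step described above can be illustrated with a minimal Python sketch. It is not the FITLayout-based method used by the tools; it swaps in plain tag parsing from the standard library only to show the shape of the input and output. The class names CEURTITLE and CEURAUTHOR, the example.org namespace, the paper-N identifiers, and the helper names PaperParser and to_ntriples are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical base namespace for generated resources (an assumption of this sketch).
BASE = "http://example.org/ceur/"
# Dublin Core terms used as predicates; the choice of vocabulary is also an assumption.
TITLE = "http://purl.org/dc/terms/title"
CREATOR = "http://purl.org/dc/terms/creator"


class PaperParser(HTMLParser):
    """Collects titles and authors from spans with assumed CEUR-style class names."""

    def __init__(self):
        super().__init__()
        self.papers = []    # one dict per <li> element
        self._field = None  # class of the span currently being read

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li":
            self.papers.append({"title": "", "authors": []})
        elif tag == "span" and cls in ("CEURTITLE", "CEURAUTHOR"):
            self._field = cls

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        text = data.strip()
        if not self.papers or self._field is None or not text:
            return
        if self._field == "CEURTITLE":
            self.papers[-1]["title"] += text
        else:
            self.papers[-1]["authors"].append(text)


def escape(value):
    """Escapes backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(html, volume):
    """Returns N-Triples lines for every paper found in the given HTML."""
    parser = PaperParser()
    parser.feed(html)
    lines = []
    for index, paper in enumerate(parser.papers, start=1):
        subject = f"<{BASE}{volume}/paper-{index}>"
        lines.append(f'{subject} <{TITLE}> "{escape(paper["title"])}" .')
        for author in paper["authors"]:
            lines.append(f'{subject} <{CREATOR}> "{escape(author)}" .')
    return lines


# Invented sample input; real CEUR pages are larger and less regular.
SAMPLE = """
<ul>
  <li><span class="CEURTITLE">A Sample Paper</span>
      <span class="CEURAUTHOR">Alice Example</span>,
      <span class="CEURAUTHOR">Bob Example</span></li>
</ul>
"""

if __name__ == "__main__":
    for line in to_ntriples(SAMPLE, "Vol-0000"):
        print(line)
```

Real proceedings pages vary considerably in markup between volumes, which is why the actual tools rely on a document analysis framework rather than fixed tag patterns like the ones assumed here.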

Location

https://github.com/FitLayout/ToolsEswc

License Conditions

Free software under the terms of the GNU GPL license.

Projects

IT4Innovations Centre of Excellence, Ministry of Education, Youth and Sports (MŠMT), Operational Programme Research and Development for Innovations, ED1.1.00/02.0070, start: 2011-01-01, end: 2015-12-31, completed
Research of Advanced ICT Methods and Their Applications, BUT, BUT internal projects, FIT-S-14-2299, start: 2014-01-01, end: 2016-12-31, completed

Research groups

Information and Database Systems Research Group (RG IS)

Departments

Department of Information Systems (DIFS)