Result details
Layout Based Information Extraction from HTML Documents
BURGET, R. Layout Based Information Extraction from HTML Documents. 9th International Conference on Document Analysis and Recognition ICDAR 2007. Curitiba: IEEE Computer Society, 2007. p. 624-629. ISBN: 0-7695-2822-8.
Type
conference proceedings paper
Language
English
Authors
Burget Radek, doc. Ing., Ph.D., UIFS (FIT)
Abstract
We propose a method of information extraction from HTML documents based on modelling the visual information in the document. A page segmentation algorithm first detects the document layout; the extraction process then analyses the mutual positions of the detected blocks and their visual features. This approach is more robust than the traditional DOM-based methods and opens new possibilities for specifying the extraction task.
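The idea in the abstract, extraction driven by the mutual positions and visual features of detected blocks rather than by DOM paths, can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the `Block` type, the `value_right_of` helper, and the sample coordinates are all hypothetical, and the blocks are assumed to come from some earlier page segmentation step that is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Block:
    """A visual block as a hypothetical segmentation step might report it."""
    text: str
    x: float          # left edge in pixels
    y: float          # top edge in pixels
    width: float
    height: float
    font_size: float  # a simple visual feature


def overlaps_vertically(a: Block, b: Block) -> bool:
    """True when the two blocks share part of the same horizontal band."""
    return a.y < b.y + b.height and b.y < a.y + a.height


def value_right_of(label_text: str, blocks: List[Block]) -> Optional[str]:
    """Return the text of the nearest block to the right of the label block."""
    wanted = label_text.lower()
    label = next(
        (b for b in blocks if b.text.strip().rstrip(":").lower() == wanted),
        None,
    )
    if label is None:
        return None
    candidates = [
        b for b in blocks
        if b is not label
        and overlaps_vertically(label, b)
        and b.x >= label.x + label.width
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda b: b.x).text


# Illustrative data only: coordinates and texts are made up.
blocks = [
    Block("Product overview", 20, 10, 400, 30, 24.0),
    Block("Price:", 20, 60, 60, 16, 12.0),
    Block("120 EUR", 90, 60, 70, 16, 12.0),
    Block("Date:", 20, 80, 60, 16, 12.0),
    Block("1 January 2000", 90, 80, 140, 16, 12.0),
]

# A visual feature (largest font) picks the heading without touching the DOM.
heading = max(blocks, key=lambda b: b.font_size).text
price = value_right_of("Price", blocks)
```

Because the rule refers only to rendered geometry, it would keep working if the page switched from a table to positioned `div` elements, which is the kind of robustness the abstract contrasts with DOM-based methods.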
Keywords
page segmentation, layout analysis, information extraction
Year
2007
Pages
624–629
Proceedings
9th International Conference on Document Analysis and Recognition ICDAR 2007
Conference
9th International Conference on Document Analysis and Recognition
ISBN
0-7695-2822-8
Publisher
IEEE Computer Society
Place
Curitiba
BibTeX
@inproceedings{BUT28821,
author="Radek {Burget}",
title="Layout Based Information Extraction from HTML Documents",
booktitle="9th International Conference on Document Analysis and Recognition ICDAR 2007",
year="2007",
pages="624--629",
publisher="IEEE Computer Society",
address="Curitiba",
isbn="0-7695-2822-8"
}
Projects
Security-Oriented Research in Information Technology, MŠMT, Institutional funds of the Czech state budget (e.g. research plans, research centres), MSM0021630528, start: 2007-01-01, end: 2013-12-31, in progress
Research groups
Departments
Department of Information Systems (UIFS)