Result details

Layout Based Information Extraction from HTML Documents

BURGET, R. Layout Based Information Extraction from HTML Documents. 9th International Conference on Document Analysis and Recognition ICDAR 2007. Curitiba: IEEE Computer Society, 2007. p. 624-629. ISBN: 0-7695-2822-8.

Type

conference proceedings paper

Language

English

Authors

Burget Radek, doc. Ing., Ph.D., UIFS (FIT)

Abstract

We propose a method of information extraction from HTML documents based on modelling the visual information in the document. A page segmentation algorithm is used to detect the document layout; the extraction process is then based on an analysis of the mutual positions of the detected blocks and their visual features. This approach is more robust than the traditional DOM-based methods, and it opens new possibilities for specifying the extraction task.
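The idea of extracting data from the mutual positions and visual features of detected blocks can be illustrated with a small sketch. This is not the algorithm from the paper: the `Block` type, its fields, and the helper functions `value_right_of` and `most_prominent` are hypothetical names introduced here, and the blocks are assumed to have already been produced by some page segmentation step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Block:
    """A visual block, assumed to come from a page segmentation step."""
    text: str
    x: float          # left edge in pixels
    y: float          # top edge in pixels
    width: float
    height: float
    font_size: float = 12.0
    bold: bool = False

    @property
    def right(self) -> float:
        return self.x + self.width

    @property
    def bottom(self) -> float:
        return self.y + self.height


def vertically_overlaps(a: Block, b: Block) -> bool:
    """True if the two blocks share some vertical range (same visual row)."""
    return a.y < b.bottom and b.y < a.bottom


def value_right_of(label_text: str, blocks: List[Block]) -> Optional[Block]:
    """Find the nearest block to the right of a label, on the same row."""
    wanted = label_text.lower()
    label = next(
        (b for b in blocks if b.text.strip().rstrip(":").lower() == wanted),
        None,
    )
    if label is None:
        return None
    candidates = [
        b for b in blocks
        if b is not label and b.x >= label.right and vertically_overlaps(label, b)
    ]
    if not candidates:
        return None
    # The closest candidate horizontally is taken as the label's value.
    return min(candidates, key=lambda b: b.x - label.right)


def most_prominent(blocks: List[Block]) -> Optional[Block]:
    """Pick a likely heading using visual features only (size, then weight)."""
    if not blocks:
        return None
    return max(blocks, key=lambda b: (b.font_size, b.bold))


# Hypothetical segmentation output for a simple product page.
sample = [
    Block("Big Title", 10, 10, 300, 40, font_size=24, bold=True),
    Block("Price:", 10, 100, 50, 20, bold=True),
    Block("42 USD", 70, 102, 60, 18),
    Block("Footer", 10, 500, 300, 20),
]
```

Because the rules refer only to geometry and appearance, the same rule would still apply if the page's markup changed from a table to nested `div` elements, provided the rendered layout stayed the same. That is the kind of robustness the abstract contrasts with DOM-based methods.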

Keywords

page segmentation, layout analysis, information extraction

Year

2007

Pages

624–629

Proceedings

9th International Conference on Document Analysis and Recognition ICDAR 2007

Conference

9th International Conference on Document Analysis and Recognition

ISBN

0-7695-2822-8

Publisher

IEEE Computer Society

Place

Curitiba

BibTeX

@inproceedings{BUT28821,
  author="Radek {Burget}",
  title="Layout Based Information Extraction from HTML Documents",
  booktitle="9th International Conference on Document Analysis and Recognition ICDAR 2007",
  year="2007",
  pages="624--629",
  publisher="IEEE Computer Society",
  address="Curitiba",
  isbn="0-7695-2822-8"
}

Projects

Security-Oriented Research in Information Technology, MŠMT (Ministry of Education, Youth and Sports), Institutional funds from the state budget of the Czech Republic (e.g. VZ, VC), MSM0021630528, start: 2007-01-01, end: 2013-12-31, status: in progress

Research groups

Information and Database Systems Research Group (VZ IS)

Departments

Department of Information Systems (UIFS)