Result Details

Layout Based Information Extraction from HTML Documents

BURGET, R. Layout Based Information Extraction from HTML Documents. 9th International Conference on Document Analysis and Recognition ICDAR 2007. Curitiba: IEEE Computer Society, 2007. p. 624-629. ISBN: 0-7695-2822-8.

Type

conference paper

Language

English

Authors

Burget Radek, doc. Ing., Ph.D., DIFS (FIT)

Abstract

We propose a method of information extraction from HTML documents based on modelling the visual information in the document. A page segmentation algorithm is used for detecting the document layout, and the extraction process is subsequently based on the analysis of the mutual positions of the detected blocks and their visual features. This approach is more robust than the traditional DOM-based methods and opens new possibilities for specifying the extraction task.
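
The idea of extracting data from the mutual positions and visual features of detected blocks can be illustrated with a minimal sketch. This is not the paper's algorithm: the `Block` structure, the `same_row` and `value_right_of` helpers, and the sample data are all hypothetical, and the blocks are assumed to come from some earlier page segmentation step. The sketch finds a bold label block and returns the nearest block to its right on the same row.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """A visual block as a segmentation step might report it (hypothetical)."""
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float
    bold: bool = False


def same_row(a: Block, b: Block) -> bool:
    """True when the vertical extents of the two blocks overlap."""
    return a.y < b.y + b.height and b.y < a.y + a.height


def value_right_of(label: str, blocks: List[Block]) -> Optional[Block]:
    """Return the nearest block to the right of the bold block named `label`."""
    # Visual feature: only a bold block is accepted as a label (an assumption).
    anchor = next(
        (b for b in blocks if b.bold and b.text.strip().lower() == label.lower()),
        None,
    )
    if anchor is None:
        return None
    # Mutual position: same row, starting at or beyond the label's right edge.
    candidates = [
        b for b in blocks
        if b is not anchor and same_row(anchor, b) and b.x >= anchor.x + anchor.width
    ]
    return min(candidates, key=lambda b: b.x, default=None)


# Invented sample layout, for illustration only.
blocks = [
    Block("Price", x=10, y=100, width=60, height=20, bold=True),
    Block("42 EUR", x=80, y=100, width=70, height=20),
    Block("Author", x=10, y=130, width=60, height=20, bold=True),
    Block("J. Doe", x=80, y=130, width=70, height=20),
]
result = value_right_of("Price", blocks)
print(result.text if result else None)
```

Because the lookup depends only on rendered geometry and font weight, it would give the same answer whether the page is built from tables, nested `div` elements, or any other markup, which is the kind of robustness against DOM changes that the abstract refers to.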

Keywords

page segmentation, layout analysis, information extraction

Published

2007

Pages

624–629

Proceedings

9th International Conference on Document Analysis and Recognition ICDAR 2007

Conference

9th International Conference on Document Analysis and Recognition

ISBN

0-7695-2822-8

Publisher

IEEE Computer Society

Place

Curitiba

BibTeX

@inproceedings{BUT28821,
  author="Radek {Burget}",
  title="Layout Based Information Extraction from HTML Documents",
  booktitle="9th International Conference on Document Analysis and Recognition ICDAR 2007",
  year="2007",
  pages="624--629",
  publisher="IEEE Computer Society",
  address="Curitiba",
  isbn="0-7695-2822-8"
}

Projects

Security-Oriented Research in Information Technology, MŠMT, Institutional funds of the Czech state budget (e.g. research plans, research centres), MSM0021630528, start: 2007-01-01, end: 2013-12-31, running

Research groups

Information and Database Systems Research Group (RG IS)

Departments

Department of Information Systems (DIFS)