Extracting Visually Presented Element Relationships from Web Documents

BURGET, R.; SMRŽ, P. Extracting Visually Presented Element Relationships from Web Documents. International Journal of Cognitive Informatics and Natural Intelligence, 2013, vol. 2013, no. 2, p. 13-29. ISSN: 1557-3958.
Type
journal article
Language
English
Authors
Radek Burget, Pavel Smrž
Abstract

Many documents in the World Wide Web present structured information that consists of multiple pieces of data with certain relationships among them. Although it is usually not difficult to identify the individual data values in the document text, their relationships are often not explicitly described in the document content. They are expressed by visual presentation of the document content that is expected to be interpreted by a human reader. In this paper, we propose a formal generic model of logical relationships in a document based on an interpretation of visual presentation patterns in the documents. The model describes the visually expressed relationships between individual parts of the contents independently of the document format and the particular way of presentation. Therefore, it can be used as an appropriate document model in many information retrieval or extraction applications. We formally define the model, we introduce a method of extracting the relationships between the content parts based on the visual presentation analysis and we discuss the expected applications. We also present a new dataset consisting of programmes of conferences and other scientific events and we discuss its suitability for the task in hand. Finally, we use the dataset to evaluate results of the implemented system.

Keywords

logical document structure; page segmentation; document analysis; web documents

Published
2013
Pages
13–29
Journal
International Journal of Cognitive Informatics and Natural Intelligence, vol. 2013, no. 2, ISSN 1557-3958
DOI
10.4018/ijcini.2013040102
BibTeX
@article{BUT105971,
  author="Radek {Burget} and Pavel {Smrž}",
  title="Extracting Visually Presented Element Relationships from Web Documents",
  journal="International Journal of Cognitive Informatics and Natural Intelligence",
  year="2013",
  volume="2013",
  number="2",
  pages="13--29",
  doi="10.4018/ijcini.2013040102",
  issn="1557-3958",
  url="https://www.fit.vut.cz/research/publication/10468/"
}
Projects
IT4Innovations Centre of Excellence, Ministry of Education, Youth and Sports (MŠMT), Operational Programme Research and Development for Innovations, ED1.1.00/02.0070, start: 2011-01-01, end: 2015-12-31, completed
Digital Environment for Cultural Interfaces; Promoting Heritage, Education and Research, Ministry of Education, Youth and Sports (MŠMT), Support for projects of the Seventh Framework Programme of the European Community for research, technological development and demonstration (2007 to 2013) under Act No. 171/2007 Coll., 7E11023, start: 2011-01-01, end: 2013-12-31, running