Course details

Data Storage and Preparation

UPA, Acad. year 2020/2021, Winter semester, 5 credits

The course focuses on modern database systems as typical data sources for knowledge discovery, and on preparing data for knowledge discovery. It covers extended relational (object-relational, with support for XML and JSON documents), spatial, and NoSQL database systems, explaining for each the underlying database model, the way of working with data, and selected indexing methods. Within the knowledge discovery process, attention is paid to descriptive characteristics of data and to visualization techniques used for data understanding. The course also explains approaches to typical data pre-processing tasks, such as data cleaning, integration, transformation, and reduction, and presents approaches to information extraction from the web together with several real case studies. As a part of the course, students solve a project focused on ...

Guarantor

Deputy Guarantor

Language of instruction

Czech

Completion

Credit+Examination (written)

Time span

26 hrs lectures, 6 hrs exercises, 6 hrs pc labs, 14 hrs projects

Assessment points

60 exam, 20 mid-term test, 20 projects

Department

Lecturer

Instructor

Subject specific learning outcomes and competences

Students will be able to store and manipulate data in suitable database systems, and to explore and prepare data for modelling within the knowledge discovery process.

Generic learning outcomes and competences

  • Student is better able to work with data in various situations.
  • Student improves in solving small projects in a small team.

Learning objectives

The aim of the course is to explain the historical development of database technologies, the motivation for knowledge discovery from data, and the basic steps of the knowledge discovery process; to explain the essence, properties, and use of extended relational and NoSQL databases as data sources for knowledge discovery; and to explain the approaches and methods used for data understanding and data pre-processing.

Why the course is taught

The course demonstrates how to work with the complex data around us: how to store such data, how to find one's way around it and obtain useful descriptive characteristics from it, and how to prepare it for extracting hidden information or knowledge using machine learning and other advanced analytical methods.

Prerequisite knowledge and skills

  • Fundamentals of relational databases and SQL.
  • Object-oriented paradigm.
  • Fundamentals of XML.
  • Fundamentals of computational geometry.
  • Fundamentals of statistics and probability.

Study literature

  • Lecture materials (slides, scripts, etc.)
  • Lemahieu, W., Broucke, S., Baesens, B.: Principles of Database Management. Cambridge University Press, 2018, 780 p.
  • Kim, W. (ed.): Modern Database Systems. ACM Press, 1995, ISBN 0-201-59098-0.
  • Melton, J.: Advanced SQL:1999 - Understanding Object-Relational and Other Advanced Features. Morgan Kaufmann, 2002, 562 p., ISBN 1-558-60677-7.
  • Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Third Edition. Morgan Kaufmann Publishers, 2012, 703 p., ISBN 978-0-12-381479-1.
  • Skiena, S.S.: The Data Science Design Manual. Springer, 2017, 445 p., ISBN 978-3-319-55443-3.
  • Shekhar, S., Chawla, S.: Spatial Databases: A Tour. Prentice Hall, 2002/2003, 262 p., ISBN 0-13-017480-7.
  • Gaede, V., Günther, O.: Multidimensional Access Methods. ACM Computing Surveys, Vol. 30, No. 2, 1998, pp. 170-231.

Syllabus of lectures

  1. History of database technology and knowledge discovery, process of knowledge discovery.
  2. Object-oriented approach in databases.
  3. NoSQL databases I - introduction to NoSQL, CAP theorem and BASE, key-value databases, data partitioning and distribution.
  4. NoSQL databases II - data models in NoSQL databases (column, document, and graph databases), querying and data aggregation, NewSQL databases.
  5. Web scraping.
  6. Data preparation - data understanding: descriptive characteristics, visualization techniques, correlation analysis.
  7. Data preparation - data pre-processing I: data cleaning and integration.
  8. Data preparation - data pre-processing II: data reduction, imbalanced data, data transformation, other data pre-processing tasks.
  9. Mid-term exam.
  10. Languages and systems for knowledge discovery, real case studies.
  11. Support for working with XML and JSON documents in databases.
  12. Spatial databases.
  13. Indexing of multidimensional data.

Syllabus of numerical exercises

Demo exercises

  1. Object-relational and spatial databases, data definition and manipulation, peculiarities
  2. Multimedia and XML databases, data indices
  3. NoSQL databases

Syllabus of computer exercises

  1. Application binding to object-relational databases, application building in spatial databases
  2. Multimedia and XML databases, building and exploiting data indices
  3. NoSQL databases in applications

Syllabus - others, projects and individual work of students

  1. Creating and demonstrating a solution for processing both structured and unstructured data of various kinds.

Progress assessment

  • Mid-term exam, held on a single date with no possibility of a second attempt.
  • One project, to be completed and submitted by a given date during the term.

Controlled instruction

  • Mid-term written exam. There is no resit; excused absences are handled by the guarantor.
  • Formulation of the data mining task by the prescribed date; excused absences are handled by the assistant.
  • Presentation of the project results on the prescribed date; excused absences are handled by the assistant.
  • Final exam. A student must obtain at least 20 points from the final exam; otherwise no points are awarded for it. Excused absences are handled by the guarantor.

Exam prerequisites

By the end of the term, a student must have obtained at least 50% of the points available during the term, that is, at least 20 points out of 40.
Plagiarism or unauthorized collaboration will result in the students involved not being graded, and disciplinary action may be initiated.
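
The point thresholds stated above can be checked with a short calculation. The following Python sketch is only an illustration: the function names are hypothetical, and it assumes that the 40 term points are the 20 mid-term test points plus the 20 project points, and that term points are kept when the exam score falls below the minimum. It is not an official grading procedure.

```python
TERM_MAX = 40   # assumed: 20 mid-term test + 20 project points
EXAM_MIN = 20   # minimum exam score that is counted at all


def admitted_to_exam(midterm: float, project: float) -> bool:
    """Exam prerequisite: at least 50% of the term points (20 of 40)."""
    return midterm + project >= TERM_MAX / 2


def total_points(midterm: float, project: float, exam: float) -> float:
    """Total course points under the assumptions stated above."""
    term = midterm + project
    if not admitted_to_exam(midterm, project):
        return term  # assumption: no exam points without admission
    # An exam score below the minimum contributes no points.
    counted_exam = exam if exam >= EXAM_MIN else 0
    return term + counted_exam


if __name__ == "__main__":
    print(admitted_to_exam(10, 10))
    print(total_points(15, 15, 19))
    print(total_points(15, 15, 20))
```

For example, a student with 15 mid-term and 15 project points who scores 19 on the exam keeps only the 30 term points under these assumptions, while a score of 20 on the exam gives 50 points in total.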

Schedule

Day | Type | Weeks | Room | Start | End | Lect. grp | Groups | Info
Mon | comp. lab | 4., 5., 10. of lectures | N103 N104 N105 | 12:00 | 13:50 | 1MIT 2MIT | xx |
Mon | comp. lab | 2020-12-07 | N103 N104 | 12:00 | 13:50 | | | project defence (reserve)
Tue | lecture | lectures | E104 E105 E112 | 08:00 | 09:50 | 1MIT 2MIT | NBIO - NSPE NHPC - NEMB NISY NSEC - NGRI xx |
Tue | lecture | 5. of lectures | E112v | 08:00 | 09:50 | | | TM, MST
Tue | comp. lab | 2020-12-08 | N103 N104 | 12:00 | 14:50 | | | project defence
Tue | comp. lab | 4., 5., 10. of lectures | N103 N104 N105 | 13:00 | 14:50 | 1MIT 2MIT | xx |
Wed | comp. lab | 4., 5., 10. of lectures | N204 N205 | 14:00 | 15:50 | 1MIT 2MIT | xx |
Wed | comp. lab | 2020-12-09 | N204 N205 | 14:00 | 16:50 | | | project defence
Thu | exam | 2021-01-07 | A112 A113 C228 D0206 D0207 D105 E104 E105 E112 G202 M103 M104 M105 N103 N104 N105 N203 N204 N205 | 09:00 | 11:50 | 1MIT 2MIT | | regular term
Thu | comp. lab | 2020-12-10 | N103 N104 | 10:00 | 12:50 | | | project defence
Thu | exam | 2021-01-21 | D0206 D0207 D105 E104 E105 E112 G202 M103 M104 M105 N103 N104 | 11:00 | 13:50 | 1MIT 2MIT | | 1st resit
Thu | comp. lab | 4., 5., 10. of lectures | N103 N104 N105 | 11:00 | 12:50 | 1MIT 2MIT | xx |
Thu | exam | 2021-01-07 | D0206 D0207 D105 E104 E105 E112 G202 | 12:00 | 14:50 | 1MIT 2MIT | | regular term
Thu | exam | 2021-02-04 | A112 D0206 D0207 D105 E105 E112 G202 | 13:00 | 15:50 | 1MIT 2MIT | | 2nd resit
Fri | exercise | 3., 4., 9. of lectures | D105 | 16:00 | 17:50 | 1MIT 2MIT | NBIO - NSPE NHPC - NEMB NISY NSEC - NGRI xx | demo
