Data Integration (#22771)

Last day to withdraw: MARCH 29

Instructor

Dr. Jan Chomicki, Associate Professor

Location and time

Norton 214, TTh 12:30-1:50.

Textbook

Anhai Doan, Alon Y. Halevy, Zachary Ives. Principles of Data Integration. Morgan Kaufmann Publishers, 2012.

Piazza

Handouts

  1. Datalog
  2. XML
  3. Querying Data through Ontologies (slides for the book Web Data Management by S. Abiteboul et al.)
  4. RDF and SPARQL (Marcelo Arenas) bibliography
  5. Provenance
  6. Schema matching and mapping
  7. Querying integrated databases
  8. Source capabilities
  9. Consistent query answering (my ICDT'07 keynote)
  10. CQA: query rewriting

Resources

Tests

  1. Test 1 (due 03/05/13)
  2. Test 2 (due 05/07/13)

Summary

The availability of integrated data from multiple independent, heterogeneous data sources is crucial for many applications. Data integration requires combining and matching information from different sources, and resolving a variety of discrepancies. XML is becoming a de facto data integration standard. Recently, Semantic Web techniques have been considered for data integration.

This course will survey selected issues arising in data integration, focusing on the theoretical foundations of the area, and algorithms and software systems facilitating integration. The students in the class will be working on team projects (2-3 people) involving research and/or programming, and will give class presentations about their projects. There will also be two take-home exams.

The project in this course can be used as a CSE M.S. project (under the condition that the course grade is at least B+).

Projects

Prerequisites

Good knowledge of database systems, some knowledge of logic and computational complexity.

Policies

Grading:

  1. projects (50%, includes class presentation and final report)
  2. midterm (25%)
  3. final exam (25%)

Academic integrity policy: I will follow the UB and CSE academic integrity policies.

Make-up policy: Requests for a make-up test must be made sufficiently in advance of the test and for valid reasons.

Late submission policy: The submissions are due at midnight on the due date. No late submissions are accepted. Exceptions will be made only for medical reasons. Questions about the grading have to be raised within a week after the graded assignment has been returned.

Course outline (tentative)

  1. Datalog, negation, recursive query evaluation.
  2. XML: data model, schemas, integrity constraints, logics.
  3. XML query languages: XPath, XQuery, query evaluation.
  4. Semantic Web: RDF, SPARQL, OWL, description logics.
  5. Schema matching and mapping.
  6. Query containment and rewriting.
  7. Query evaluation, source descriptions and limitations.
  8. Data exchange, source-to-target dependencies.
  9. Database incompleteness and inconsistency, nulls, consistent query answers.
  10. Schematic discrepancies, metadata, SchemaSQL.
  11. Data cleaning.
  12. Data quality.
  13. Data provenance.

Bibliography (under construction)

Information Integration Systems

System                         Institution/Company  Type                  Technology
DB2 Information Integrator     IBM                  Commercial            SQL
Oracle Data Integration Suite  Oracle               Commercial            Java
JBoss EDSP                     JBoss                Commercial            SQL
XQuare Fusion/Bridge           XQuare               Open source           XQuery
Raccoon                        UCI                  Academic prototype    XML
Garlic                         IBM Almaden          Industrial prototype  Object-relational

Useful URLs

Tutorials

Massive data integration and mining projects

Real-life stories

On-line bibliography

This bibliography is far from complete and typically does not contain references to papers for which I haven't been able to find a freely available on-line version. Any additions/modifications are appreciated.

Collections of articles

General background reference:

Schema integration

Schematic discrepancies

Data cleaning

Consistent query answers

Query Evaluation for Distributed Data Sources

Source limitations

XPath/XQuery

Incremental validation of XML documents

Repairing XML documents

XML security

XML data exchange

XML query relaxation

XML indexing

Mediators and wrappers for semistructured data and XML

Storing semistructured data in relational DBMS

Semantic Web

Provenance

Keyword queries

Combining rank information