Data Integration

Instructor

Dr. Jan Chomicki, Associate Professor, University at Buffalo (office hours: Thursday, 16:00-17:00).

Location and time

Lecture: Thursday, 14:15-16:00, Room 5820 (no lectures May 3, May 10)

Recitations: Friday, 14:15-16:00, Room 5840 (no recitations April 6, April 13, May 4, May 11)

Presentation schedule

Date | Topic | Presenter
May 18 | Preference XPath 2.0 | P. Buczek, M. Dukielska

Handouts

  1. Datalog
  2. Relational Calculus
  3. Negation
  4. Schema mapping
  5. Query evaluation
  6. Consistent query answering (ICDT'07 keynote)
  7. CQA: query rewriting
  8. Metadata
  9. Query containment (A. Cali)
  10. XML
  11. XPath/XQuery
  12. RDF
  13. XML Transducers (W. Fan)

Resources

  1. Problem set #1

Tests

  1. Test 1 (due April 23, 2007)
  2. Test 2 (due June 15, 2007)

Summary

The availability of integrated data from multiple independent, heterogeneous data sources is crucial for many applications. Data integration requires combining and matching information in different sources, and resolving a variety of discrepancies. XML is becoming a de facto data integration standard. This course will survey selected issues arising in data integration, focusing on the theoretical foundations of the area. The students in the class will work on team projects (2-3 people) involving research and/or programming, and will give class presentations about their projects. There will also be 1-2 take-home exams.

Projects

Prerequisites

Some knowledge of database systems, logic and computational complexity.

Course outline (tentative):

1. Datalog: syntax, semantics, query evaluation.

2. Datalog with negation: stratified programs, stable models.

3. Schema mapping and integration.

4. Data integration and exchange, source-to-target dependencies.

5. Schematic discrepancies, SchemaSQL.

6. Query rewriting for distributed data sources, query containment, source capabilities.

7. Database inconsistency and incompleteness, consistent query answers.

8. XML: data model, schemas, types, integrity constraints, logics.

9. XML: XPath, XQuery, query evaluation.

10. Semantic Web: RDF, OWL, description logics.

Bibliography (under construction)

Information Integration Systems

System | Institution/Company | Type | Technology
XQuark Bridge/Fusion | XQuark | Open source | XML/XQuery
Liquid Data (Enosys) | BEA | Commercial | XML
Nimble | Actuate | Commercial | XML
DB2 Information Integrator | IBM | Commercial | SQL
Power Center | Informatica | Commercial | Web services
XML Information Workbench | XML Global | Commercial | XML
Callixa | callixa.com | Commercial | SQL
Metamatrix | metamatrix.com | Commercial | XML/SQL
Xyleme | xyleme.com | Commercial | XML
Infomaster | Stanford U. | Academic, operational | SQL
SIMS | ISI (USC) | Academic prototype | Relational
Tukwila | U. Washington | Academic prototype | XML
Raccoon | UCI | Academic prototype | XML
Garlic | IBM Almaden | Industrial prototype | Object-relational

Useful URLs

Tutorials

Massive data integration and mining projects

Real-life stories

On-line bibliography

This bibliography is far from complete and typically does not contain references to papers for which I haven't been able to find a freely available on-line version. Any additions/modifications are appreciated.

Collections of articles

General background reference: J. D. Ullman, J. Widom: "A First Course in Database Systems," 2nd edition, Prentice Hall, 2002.

Datalog and negation

Schema integration

Schematic discrepancies

Data cleaning

Consistent query answers

Query Evaluation for Distributed Data Sources

Source limitations

OEM and XML basics

XQuery

XPath/XQuery

Incremental validation of XML documents

Repairing XML documents

XML security

XML data exchange

XML query relaxation

XML indexing

Mediators and wrappers for semistructured data and XML

Storing semistructured data in relational DBMS

Semantic Web

Combining rank information