Data Integration (#170481)

Final presentations:

12/7, 12/9 (138 Bell).

12/13, 11:45-2:45 (117 Baldy).

Instructor

Dr. Jan Chomicki, Associate Professor

Location and time

138 Bell, T R 2:00-3:20.

Course blog

http://ub-cse-636.blogspot.com

Handouts

  1. Datalog
  2. XML
  3. XPath/XQuery
  4. Schema mapping
  5. Querying integrated databases
  6. Unfolding example
  7. Source capabilities
  8. Metadata
  9. Consistent query answering (ICDT'07 keynote)
  10. CQA: query rewriting
  11. RDF and SPARQL (Marcelo Arenas); bibliography
  12. Description Logic Reasoning (Ian Horrocks)
  13. Description logics
  14. Provenance

Tests

  1. Test 1 (due 10/21/2010)
  2. Test 2 (NEW DATE 12/13/2010)

Summary

The availability of integrated data from multiple independent, heterogeneous data sources is crucial for many applications. Data integration requires combining and matching information from different sources, and resolving a variety of discrepancies. XML is becoming a de facto data integration standard.

This course will survey selected issues arising in data integration, focusing on the theoretical foundations of the area and on software systems that facilitate integration. Students will work in teams of 2-3 people on projects involving research and/or programming, and will give class presentations about their projects. There will also be two take-home exams.

The project in this course can be used as the CSE M.S. project, provided that the course grade is at least B+.

Projects

Prerequisites

Good knowledge of database systems and some knowledge of logic and computational complexity.

Policies

Grading:

  1. projects (50%, includes class presentation and final report)
  2. midterm (25%)
  3. final exam (25%)

Academic integrity policy: I will follow the UB and CSE academic integrity policies.

Make-up policy: Requests for a make-up test must be made sufficiently in advance of the test and for a valid reason.

Late submission policy: Submissions are due at midnight on the due date. Late submissions are not accepted, except for medical reasons. Questions about grading must be raised within a week after the graded assignment has been returned.

Course outline (tentative)

  1. Relational query languages: SQL, conjunctive queries, Datalog, query evaluation.
  2. Query containment and rewriting.
  3. XML: data model, schemas, integrity constraints, logics.
  4. XML query languages: XPath, XQuery, query evaluation.
  5. Schema matching and mapping.
  6. Query evaluation, source descriptions and limitations.
  7. Database incompleteness and inconsistency, nulls, consistent query answers.
  8. Data exchange, source-to-target dependencies.
  9. Schematic discrepancies, metadata, SchemaSQL.
  10. Data cleaning.
  11. Data provenance.
  12. Semantic Web: RDF, OWL, description logics.

Bibliography (under construction)

Information Integration Systems

System                      Institution/Company  Type                   Technology
XQuark Bridge/Fusion        XQuark               Open source            XML/XQuery
Liquid Data (Enosys)        BEA                  Commercial             XML
Nimble                      Actuate              Commercial             XML
DB2 Information Integrator  IBM                  Commercial             SQL
Power Center                Informatica          Commercial             Web services
XML Information Workbench   XML Global           Commercial             XML
Callixa                     callixa.com          Commercial             SQL
Metamatrix                  metamatrix.com       Commercial             XML/SQL
Xyleme                      xyleme.com           Commercial             XML
Infomaster                  Stanford U.          Academic, operational  SQL
SIMS                        ISI (USC)            Academic prototype     Relational
Tukwila                     U. Washington        Academic prototype     XML
Raccoon                     UCI                  Academic prototype     XML
Garlic                      IBM Almaden          Industrial prototype   Object-relational

Useful URLs

Tutorials

Massive data integration and mining projects

Real-life stories

On-line bibliography

This bibliography is far from complete and typically does not contain references to papers for which I haven't been able to find a freely available on-line version. Any additions/modifications are appreciated.

Collections of articles

General background reference: J. D. Ullman, J. Widom: "A First Course in Database Systems," 3rd edition, Prentice Hall, 2008.

Datalog and negation

Schema integration

Schematic discrepancies

Data cleaning

Consistent query answers

Query Evaluation for Distributed Data Sources

Source limitations

OEM and XML basics

XQuery

XPath/XQuery

Incremental validation of XML documents

Repairing XML documents

XML security

XML data exchange

XML query relaxation

XML indexing

Mediators and wrappers for semistructured data and XML

Storing semistructured data in relational DBMS

Semantic Web

Combining rank information