University at Buffalo, The State University of New York

CSE 636: Data Integration

Fall 2006 (#295856)


What's New

Please check the newsgroup for announcements. Lecture notes will be posted on this web page. A substantial part of the site's content will be in HTML and PDF formats. You can download Adobe Acrobat Reader here.

Staff

Newsgroup

sunyab.cse.636

Course Description

Data integration has been recognized as a research topic of great practical importance. The availability of integrated data from multiple independent, heterogeneous data sources is crucial for many applications. Data integration requires combining and matching information from different sources and resolving a variety of conflicts. XML is becoming a de facto data integration standard. With the number of data sources growing very quickly, data integration is bound to become even more important in the future. This course will survey selected theoretical and practical issues arising in data integration.

The course is based on the instructor's lectures and guest lectures. The lectures cover most of the material in the course.

Schedule

  • Lecture: Mon, Wed, Fri, 2:00pm - 2:50pm, 213 Norton Hall
  • Instructor Office Hours: Mon, Fri, 3:30pm - 4:30pm, 246 Bell Hall
  • TA Office Hours: Wed, 3:00pm - 4:00pm, and Thu, 3:30pm - 4:30pm, 329 Bell Hall

The following is a tentative schedule of lectures. Changes will be announced in the "What's New" section above.

Lectures meet Monday, Wednesday, and Friday. Each week below is dated by its Monday and lists its entries in chronological order.

Week 1 (08/28)
  • Introduction & Overview
    Lecture Slides: Set 1, Set 2
  • XML Data Model / Document Type Definition (DTD)
    Lecture Slides: Set 1
  • XML Data Model / Document Type Definition (DTD)
    Assignment 1 Out

Week 2 (09/04)
  • No Lecture (Labor Day)
  • XML Schema
    Lecture Slides: Set 1

Week 3 (09/11)
  • XPath / XQuery
    Lecture Slides: Set 1, Set 2
    The XML files used on the XQuery slides

Week 4 (09/18)
  • Schema Design Issues
    Assignment 1 Due (2:00pm)
  • Web Services
    Lecture Slides: Set 1
  • In-Class XML Schema Design
    Project 1 Out

Week 5 (09/25)
  • Project Discussion
  • Global-As-View / Local-As-View
    Lecture Slides: Set 1

Week 6 (10/02)
  • No Lecture (Yom Kippur)
  • Global-As-View Example
    Example Files

Week 7 (10/09)
  • Limited Source Capabilities
    Lecture Slides: Set 1
    Project 1 Due (2:00pm)
  • Distributed Query Processing
    Lecture Slides: Set 1
    Midterm Out
  • No Lecture (UB Closed)

Week 8 (10/16)
  • No Lecture (UB Closed)
  • Distributed Query Processing
    Project 2 Out
    Midterm Due (2:00pm)
  • Datalog
    Lecture Slides: Set 1

Week 9 (10/23)
  • Datalog
  • Query Containment
    Lecture Slides: Set 1
  • Query Containment
    Assignment 2 Out

Week 10 (10/30)
  • Answering Queries Using Views Algorithms (Overview)
    Lecture Slides: Set 1
  • Answering Queries Using Views Algorithms (Bucket Algorithm)
    Lecture Slides: Set 1

Week 11 (11/06)
  • Answering Queries Using Views Algorithms (MiniCon Algorithm)
    Lecture Slides: Set 1
  • Interactive Query Formulation
    Lecture Slides: Set 1

Week 12 (11/13)
  • Interactive Query Formulation
    Assignment 2 Due (2:00pm)
  • Assignment 2 Discussion
  • Community-Based Integration
    Lecture Slides: Set 1
    Project 2 Due (2:00pm)

Week 13 (11/20)
  • Community-Based Integration
  • No Lectures (Fall Recess)

Week 14 (11/27)
  • SchemaSQL
    Lecture Slides: Set 1
    Project 3 Out
  • SchemaSQL
    Lecture Slides: Set 2
  • SchemaSQL Exercise

Week 15 (12/04)
  • Schema Matching
    Lecture Slides: Set 1
  • Conflict Resolution, Repairs, Consistent Query Answers
    Guest Lectures by Jan Chomicki
    Lecture Slides

Week 16 (12/11)
  • Final Exam & Project 2 Test Cases Due
    Tuesday 12/12 3:30pm
    Bell Hall 246
  • Project 3 Due (2:00pm)
    Friday 12/15

Prerequisites

An introductory database course equivalent to CSE 562 (Database Systems)

Text

None required. The following books are recommended:

Grade Computation

  • Assignments: 10% (set of 2, 5% each)
  • Midterm: 20%
  • Final: 20%
  • Project: 45%
  • Participation: 5%
  • Grades

Practice Problems

Project

Resources

XML & DTDs

XML Schema
  • XML Schema W3C Recommendation
    Consists of three specifications. Start with XML Schema Part 0: Primer.
  • XML Schemas: Best Practices
  • XML Schema Tutorials ( ZVON, W3Schools )
  • Apache Xerces
    XML Parser, XML Schema Parser and XML Schema Validator. Implementation of both DOM & SAX.
    In order to validate your XML Schemas and your XML files against them, follow these steps:
    • Make sure you have Java JDK 5.0 installed on your computer. If not, download the latest version from here.
    • Download the latest binary distribution of Apache Xerces for Java from here: .zip | .gz
    • Extract all files from the Apache Xerces .zip or .gz file on your computer.
    • From the command line, change the current directory to where you extracted Apache Xerces.
    • Execute the following command:

      On Windows
      java -cp xercesImpl.jar;xercesSamples.jar dom.Counter -v -s <path to your XML file>
      Example: java -cp xercesImpl.jar;xercesSamples.jar dom.Counter -v -s ..\..\project\data.xml

      On Linux
      java -cp xercesImpl.jar:xercesSamples.jar dom.Counter -v -s <path to your XML file>
      Example: java -cp xercesImpl.jar:xercesSamples.jar dom.Counter -v -s ../../project/data.xml

    • The above command will validate your XML file against the XML Schema specified in the file. It prints any validation errors, followed by a summary line with counts of the elements, attributes, and characters in the document.
    • Here is an XML Schema and an XML file you can use to test the above command.
    • In the example XML file, note how you specify the XML Schema that it conforms to.
  • Another Online XML Schema Validator
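
The validation steps above can also be carried out from a program. The following is a minimal sketch, not part of the course materials, that uses the standard javax.xml.validation API (JAXP) rather than the Xerces dom.Counter sample. Unlike the command line above, it passes the schema explicitly instead of reading the schema location from the XML file. The class name SchemaCheck, the method isValid, and the sample schema and documents are hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class SchemaCheck {
    // Hypothetical sample schema: a <course> element containing exactly one <title> string.
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xs:element name=\"course\">"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name=\"title\" type=\"xs:string\"/>"
      + "</xs:sequence></xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

    // Returns true if the XML text conforms to the schema text.
    // Note: a malformed schema is also reported as false in this sketch.
    static boolean isValid(String xsd, String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            // With no custom error handler, validation errors are thrown as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            // Not expected when reading from in-memory strings.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String good = "<course><title>Data Integration</title></course>";
        String bad = "<course><name>Data Integration</name></course>";
        System.out.println("conforming document valid: " + isValid(XSD, good));
        System.out.println("non-conforming document valid: " + isValid(XSD, bad));
    }
}
```

To validate files instead of strings, the StreamSource objects can be constructed from java.io.File instances in place of the StringReader instances.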
XQuery/XPath

Web Services

Datalog

Rules & Policies

Exam Rules

Group study and discussion are encouraged, but midterms and finals must be your own work. For coding assignments, if you use code that you did not write yourself, add a comment identifying its source. There is zero tolerance for plagiarism and cheating: consult the University Code of Conduct for details on the consequences of academic misconduct, and see also the academic integrity policy of the CSE department.

Midterms & Finals

Midterms and finals are due at the beginning of the lecture of the due date. Late midterms and finals will not be accepted.

Grading Policies

Write clear arguments. Be neat and precise. Getting the right answer may not be enough. The derivation and the quality of writing count! Don't write many different things in the hope that you'll get the points if one of them is right. Indeed, you will lose points if you follow such a strategy.