Data Integration
Instructor
Dr. Jan Chomicki, Associate Professor, University at Buffalo
(office hours: Thursday, 16:00-17:00).
Location and time
Lecture: Thursday, 14:15-16:00, Room 5820 (no lectures May 3, May 10)
Recitations: Friday, 14:15-16:00, Room 5840 (no recitations April 6, April 13, May 4, May 11)
Presentation schedule
| Date | Topic | Presenter |
| May 18 | Preference XPath 2.0 | P. Buczek, M. Dukielska |
Handouts
- Datalog
- Relational Calculus
- Negation
- Schema mapping
- Query evaluation
- Consistent query answering (ICDT'07 keynote)
- CQA: query rewriting
- Metadata
- Query containment (A. Cali)
- XML
- XPath/XQuery
- RDF
- XML Transducers (W. Fan)
Resources
- Problem set #1
Tests
- Test 1 (due April 23, 2007)
- Test 2 (due June 15, 2007)
Summary
The availability of integrated data from multiple independent,
heterogenous data sources is crucial for many applications. Data
integration requires combining and matching information in different
sources, and resolving a variety of discrepancies. XML is becoming a
de facto data integration standard. This course will survey selected
issues arising in data integration, focusing on the theoretical
foundations of the area. The students in the class will be working on
team projects (2-3 people) involving research and/or programming, and
will give class presentations about their projects. There will also be
1-2 take-home exams.
Projects
Prerequisites
Some knowledge of database systems, logic and computational complexity.
Course outline (tentative):
1. Datalog: syntax, semantics, query evaluation.
2. Datalog with negation: stratified programs, stable models.
3. Schema mapping and integration.
4. Data integration and exchange, source-to-target dependencies.
5. Schematic discrepancies, SchemaSQL.
6. Query rewriting for distributed data sources, query containment, source capabilities.
7. Database inconsistency and incompleteness, consistent query answers.
8. XML: data model, schemas, types, integrity constraints, logics.
9. XML: XPath, XQuery, query evaluation.
10. Semantic Web: RDF, OWL, description logics.
Bibliography (under construction)
Information Integration Systems
Useful URL's
Tutorials
Massive data integration and mining projects
Real-life stories
On-line bibliography
This bibliography is far from complete and typically does not contain references to papers for which I haven't been able
to find a freely available on-line version. Any additions/modifications are appreciated.
Collections of articles
- Special
issue on Information Integration on the Web,
IEEE Intelligent Systems, Sept/Oct. 2003.
- Special issue
on Information Integration, IBM Systems Journal, 41(4), 2002.
-
Special issue on Integration Management, Data Engineering Bulletin, Sept. 2002, IEEE Computer Society.
-
Special issue on Data Cleaning, Data Engineering Bulletin, Dec. 2000, IEEE Computer Society.
-
Special issue on XML Data Management, Data Engineering Bulletin, June 2001, IEEE Computer Society.
General background reference:
J. D. Ullman, J. Widom: "A First Course in Database Systems," 2nd. edition, Prentice Hall, 2002.
Datalog and negation
Schema integration
- E. Rahm, P. A. Bernstein:
A survey of approaches to automatic schema matching. VLDB Journal, 2001.
- S. Melnik, H. Garcia-Molina, E. Rahm.
Similarity Flooding: A Versatile Graph Matching Algorithm. ICDE'02.
Full version.
RECOMMENDED
- A. Doan, P. Domingos, and A. Halevy.
Learning to Match the Schemas of Databases: A Multistrategy Approach.
Machine Learning, 2003. Doan's Ph.D. won the 2003 ACM Doctoral Dissertation Award!
- J. Madhavan, P. A. Bernstein, E. Rahm.
Generic Schema Matching with CUPID.
VLDB'01.
- L. Popa et al.
Translating Web Data. VLDB'02.
RECOMMENDED
- R. J. Miller, L. M. Haas, M. A. Hernandez.
Schema Mapping as Query Discovery.
VLDB'00.
Schematic discrepancies
- Ravi Krishnamurthy, Witold Litwin, William Kent:
Language Features for Interoperability of Databases with Schematic Discrepancies.
SIGMOD 1991, 40-49.
- Laks V. S. Lakshmanan, Fereidoon Sadri, Iyer N. Subramanian:
SchemaSQL: An extension to SQL for multidatabase interoperability.
ACM TODS, 26(4), December 2001.
- Catharine M. Wyss, Edward M. Robinson: Relational languages for metadata integration.
ACM TODS, 30(2), June 2005.
RECOMMENDED
Data cleaning
- Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, C-A. Saita:
Declarative Data Cleaning: Language, Models, and Algorithms. VLDB'01.
RECOMMENDED
- Vijayshankar Raman, Joseph M. Hellerstein.
Potter's Wheel: An Interactive Data Cleaning System.
VLDB'01.
- M. Hernandez and S. Stolfo:
Real-World Data is Dirty: Data Cleansing and The Merge/Purge Problem.
Journal of Data Mining and Knowledge Discovery, 1997.
- S. Chaudhuri, K. Ganjam, V. Ganti, R. Motwani.
Robust and Efficient Fuzzy Match for Online Data Cleaning.
SIGMOD'03.
- Alvaro E. Monge.
An adaptive and efficient algorithm for detecting approximately duplicate
database records. Submitted.
Consistent query answers
- J. Chomicki:
Consistent Query Answering: Five Easy Pieces. ICDT'07.
- L. Bertossi:
Consistent Query Answering in Databases. ACM SIGMOD Record, June 2006.
RECOMMENDED
- L. Bertossi, J. Chomicki:
Query Answering in Inconsistent Databases. In Logics for
Emerging Applications of Databases, Springer-Verlag, 2003.
RECOMMENDED
- M. Arenas, L. Bertossi, J. Chomicki:
Consistent Query Answers in Inconsistent Databases.
PODS'99.
- M. Arenas, L. Bertossi, J. Chomicki, X. He, V. Raghavan, J. Spinrad:
Scalar Aggregation in Inconsistent Databases.
TCS, 2003.
Query Evaluation for Distributed Data Sources
- Jeffrey D. Ullman:
Information Integration Using Logical Views. ICDT 1997: 19-40.
- M. Lenzerini: Data Integration; A Theoretical Perspective, Proc. ACM Symposium
on Principles of Database Systems, 2002.
Slides.
- Alon Y. Halevy:
Theory of Answering Queries Using Views, SIGMOD Record 29(4), December 2000.
- Alon Y. Levy:
Answering Queries Using Views: A Survey,VLDB Journal, 2001.
- Alon Y. Levy, Anand Rajaraman, Joann J. Ordille:
Query-Answering Algorithms for Information Agents.
AAAI 1996.
- Oliver M. Duschka:
Query Planning and Optimization in Information Integration.
Ph.D. Thesis, Stanford University,
Computer Science Technical Report STAN-CS-TR-97-1598,
1997.
- Rachel Pottinger, Alon Levy:
A Scalable Algorithm for Answering Queries Using Views, VLDB Journal, 2001.
- Oliver Duschka, Michael Genesereth, Alon Levy:
Recursive Query Plans for Data Integration.
Journal of Logic Programming, special issue on Logic Based
Heterogeneous Information Systems, 2000.
RECOMMENDED
-
Jarek Gryz:
Query Rewriting Using Views in the Presence of Functional and
Inclusion Dependencies.
Information Systems, Vol. 24, No. 7, 1999, pp. 597-612.
-
R. Fagin, P. G. Kolaitis, R. J. Miller, L. Popa.
Data Exchange: Semantics and Query Answering.
ICDT'03. RECOMMENDED
Source limitations
- Y.Papakonstantinou, A. Gupta, H. Garcia-Molina, J. Ullman: A Query Translation Scheme
for Rapid Implementation of Wrappers. Deductive and
Object-Oriented Databases (DOOD) 95.
- Y. Papakonstantinou, A. Gupta, L. Haas: Capabilities-Based Query
Rewriting in Mediator Systems. Distributed And Parallel
Databases, 6(1), Jan 98.
- Alon Levy, Anand Rajaraman, Jeffrey Ullman:
Answering Queries Using Limited External Query Processors.
Journal of Computer and System Sciences, special issue dedicated to PODS-96, 1999.
OEM and XML basics
XQuery
XPath/XQuery
- M. Benedikt, Ch. Koch:
XPath Leashed. ACM TOCL, to appear.
- G. Gottlob, Ch. Koch, R. Pichler:
XPath Processing in a Nutshell. SIGMOD Record, March'03.
- G. Gottlob, Ch. Koch, R. Pichler:
Efficient Algorithms for Processing XPath Queries.
ACM TODS, June 2005.
- David DeHaan, David Toman, Mariano P. Consens, and M. Tamer Qzsu:
A Comprehensive XQuery to SQL Translation using Dynamic Interval Encoding.
SIGMOD'03.
Incremental validation of XML documents
- A. Balmin, Y. Papakonstantinou, V. Vianu:
Incremental Validation of XML Documents. ACM TODS, December 2004.
- Denilson Barbosa, Alberto Mendelzon, Leonid Libkin, Laurent Mignet and Marcelo Arenas:
Efficient Incremental Validation of XML Documents. ICDE'04.
- B. Bouchou, M. Halfeld Ferrari Alves:
Updates and Incremental Validation of XML Documents. DBPL'03.
- Maria Adriana Abrão, Béatrice Bouchou, Mirian Halfeld Ferrari Alves, Dominique Laurent, Martin A. Musicante:
Incremental Constraint Checking for XML Documents. XSym'04.
Repairing XML documents
XML security
XML data exchange
XML query relaxation
XML indexing
Mediators and wrappers for semistructured data and XML
- G.Wiederhold:
Value-added Middleware: Mediators.
- H. Garcia-Molina et al:
The TSIMMIS Approach to Mediation: Data Models and Languages.
JIIS.
- Y.Papakonstantinou, A. Gupta, H. Garcia-Molina, J. Ullman:
A Query Translation Scheme for Rapid Implementation of Wrappers.
DOOD'95.
- Y. Papakonstantinou, H. Garcia-Molina, J. Ullman:
MedMaker: A Mediation System Based on Declarative Specifications.
ICDE'96.
- Y.Papakonstantinou, S. Abiteboul, H. Garcia-Molina:
Object fusion in Mediator systems.VLDB'96.
- J-R. Gruser, L. Raschid, M. E. Vidal, L. Bright:
Wrapper Generation for Web Accessible Data Sources.
CoopIS'98.
Wrapper Generation Project.
- I. Manolescu, D. Florescu, D. Kossmann:
Answering XML Queries over
Heterogenous Data Sources. VLDB'01.
Storing semistructured data in relational DBMS
Semantic Web
Combining rank information
- R. Fagin:
Combining Fuzzy Information: an Overview.
ACM SIGMOD Record 31, 2, 2002. RECOMMENDED
- R. Fagin, A. Lotem, M. Naor:
Optimal aggregation algorithms for middleware.
Extended abstract in PODS'01.
- W.F. Cody, L.M. Haas, W. Niblack, M. Arya, M.J. Carey, R. Fagin, D. Lee, D. Petkovic, P.M. Schwarz, J. Thomas,
M. Tork Roth, J.H. Williams and E.L. Wimmers:
Querying Multimedia Data from Multiple Repositories by Content: The Garlic Project.
IFIP 2.6 Third Working Conference on Visual Database Systems (VDB-3), 1995.
- R. Fagin:
Combining Fuzzy Information From Multiple Systems.
JCSS, 1999.
- E.L. Wimmers, L.M. Haas, M. Tork Roth, Ch. Braendli:
Using Fagin's Algorithm for Merging Ranked Results in Multimedia Middleware.
COOPIS'99.
- R. Fagin, E.L. Wimmers:
A Formula for Incorporating Weights into Scoring Rules.
ICDT'97.
- R. Fagin:
Fuzzy Queries in Multimedia Database Systems.
PODS'98, invited talk.