CSE 703 Provenance and lineage

Registration #20919


Dr. Jan Chomicki, Professor. Office hours: TBD.

Time and location

Mon 11:00 am-1:50 pm, Davis 338A.




To access the papers in the UB digital library, you may need to use the proxy server and reload the appropriate page:
Many papers can also be found on the authors' web pages or retrieved from


  1. Prepare and present a talk based on one or more papers from the current computer science literature (I will distribute the papers and help with the presentation).
  2. The presentations will be problem-oriented, not paper-oriented. It may be necessary to read more than one paper and/or split the work with another presenter. For example, one presenter may give a general introduction to the area while the other presents a specific approach in depth.
  3. Prepare a report based on the same material.
  4. Attend all the classes and participate in the discussions.
  5. There will also be presentations by the instructor and/or invited speakers.


Required background: a course in databases. Some knowledge of logic is helpful.


The seminar is graded S/U and can be taken for 3 credits. An implementation project is a possibility: see the instructor.


  1. Academic integrity
  2. Class attendance is mandatory. No late arrivals in class.
  3. Other relevant university policies.

Accessibility resources

Students with physical or learning disabilities should register with Accessibility Resources in order to receive accommodations.


Provenance is one of the central topics in Big Data. Data provenance keeps track of how data is derived from its sources. Workflow provenance (lineage) records information about specific processes and makes it possible to query and replay them. One important class of provenance applications deals with query result explanation; another deals with scientific workflows. This seminar will address practical and theoretical issues in provenance, lineage, and related areas. Each participating student will present one or more research papers from the current database literature and prepare a report.



James Cheney, Laura Chiticariu, Wang Chiew Tan: Provenance in Databases: Why, How, and Where. Foundations and Trends in Databases 1(4): 379-474 (2009).

Todd J. Green, Val Tannen: The Semiring Framework for Database Provenance. PODS 2017: 93-99.


Yael Amsterdamer, Susan B. Davidson, Daniel Deutch, Tova Milo, Julia Stoyanovich, Val Tannen: Putting Lipstick on Pig: Enabling Database-style Workflow Provenance. PVLDB 5(4): 346-357 (2011).

Peter Buneman, Adriane Chapman, James Cheney: Provenance management in curated databases. SIGMOD Conference 2006: 539-550.

Chen Chen, Harshal Tushar Lehri, Lay Kuan Loh, Anupam Alur, Limin Jia, Boon Thau Loo, Wenchao Zhou: Distributed Provenance Compression. SIGMOD Conference 2017: 203-218.

Amit Chavan, Silu Huang, Amol Deshpande, Aaron J. Elmore, Samuel Madden, Aditya G. Parameswaran: Towards a Unified Query Language for Provenance and Versioning. TaPP 2015.

Luc Moreau, Paul T. Groth, James Cheney, Timothy Lebo, Simon Miles: The rationale of PROV. J. Web Sem. 35: 235-257 (2015).

Umut A. Acar, Peter Buneman, James Cheney, Jan Van den Bussche, Natalia Kwasnikowska, Stijn Vansummeren: A Graph Model of Data and Workflow Provenance. TaPP 2010.

Peter Buneman, James Cheney, Stijn Vansummeren: On the expressiveness of implicit provenance in query and update languages. ACM Trans. Database Syst. 33(4): 28:1-28:47 (2008).

Yael Amsterdamer, Daniel Deutch, Val Tannen: Provenance for aggregate queries. PODS 2011: 153-164.

Yael Amsterdamer, Daniel Deutch, Tova Milo, Val Tannen: On Provenance Minimization. ACM Trans. Database Syst. 37(4): 30:1-30:36 (2012).


Daniel Deutch, Yuval Moskovitch, Val Tannen: Provenance-based analysis of data-centric processes. VLDB J. 24(4): 583-607 (2015).

Marcin Wylot, Philippe Cudré-Mauroux, Manfred Hauswirth, Paul T. Groth: Storing, Tracking, and Querying Provenance in Linked Data. IEEE Trans. Knowl. Data Eng. 29(8): 1751-1764 (2017).

Daniel Deutch, Nave Frost, Amir Gilad: Provenance for Natural Language Queries. PVLDB 10(5): 577-588 (2017).

Val Tannen: Provenance analysis for FOL model checking. SIGLOG News 4(1): 24-36 (2017).

Peter Buneman, Susan B. Davidson, James Frew: Why data citation is a computational problem. Commun. ACM 59(9): 50-57 (2016).

Peter Buneman, Sanjeev Khanna, Keishi Tajima, Wang Chiew Tan: Archiving scientific data. ACM Trans. Database Syst. 29: 2-42 (2004).

Zhepeng Yan, Val Tannen, Zachary G. Ives: Fine-grained Provenance for Linear Algebra Operators. TaPP 2016.

Saumen C. Dey, Sven Köhler, Shawn Bowers, Bertram Ludäscher: Computing Location-Based Lineage from Workflow Specifications to Optimize Provenance Queries. IPAW 2014: 180-193.

Grigoris Karvounarakis, Todd J. Green, Zachary G. Ives, Val Tannen: Collaborative data sharing via update exchange and provenance. ACM Trans. Database Syst. 38(3): 19:1-19:42 (2013).

David W. Archer, Lois M. L. Delcambre, David Maier: User Trust and Judgments in a Curated Database with Explicit Provenance. In Search of Elegance in the Theory and Practice of Computation 2013: 89-111.


Rajendra Bose, James Frew: Lineage retrieval for scientific data processing: a survey. ACM Comput. Surv. 37(1): 1-28 (2005).

Susan B. Davidson, Sanjeev Khanna, Tova Milo: To Show or Not to Show in Workflow Provenance. In Search of Elegance in the Theory and Practice of Computation 2013: 217-226.

Peter Alvaro, Joshua Rosen, Joseph M. Hellerstein: Lineage-driven Fault Injection. SIGMOD Conference 2015: 331-346.

Paolo Missier, Norman W. Paton, Khalid Belhajjame: Fine-grained and efficient lineage querying of collection-based workflow provenance. EDBT 2010: 299-310.

Query result explanation, causality

Seokki Lee, Sven Köhler, Bertram Ludäscher, Boris Glavic: A SQL-Middleware Unifying Why and Why-Not Provenance for First-Order Queries. ICDE 2017: 485-496.

Sudeepa Roy, Laurel Orr, Dan Suciu: Explaining Query Answers with Explanation-Ready Databases. PVLDB 9(4): 348-359 (2015).

Alexandra Meliou, Wolfgang Gatterbauer, Katherine F. Moore, Dan Suciu: The Complexity of Causality and Responsibility for Query Answers and non-Answers. PVLDB 4(1): 34-45 (2010).

Alexandra Meliou, Wolfgang Gatterbauer, Joseph Y. Halpern, Christoph Koch, Katherine F. Moore, Dan Suciu: Causality in Databases. IEEE Data Eng. Bull. 33(3): 59-67 (2010).

Melanie Herschel: A Hybrid Approach to Answering Why-Not Questions on Relational Query Results. J. Data and Information Quality 5(3): 10:1-10:29 (2015).

Data profiling

Ziawasch Abedjan, Lukasz Golab, Felix Naumann: Profiling relational data: a survey. VLDB J. 24(4): 557-581 (2015).

Jens Ehrlich, Mandy Roick, Lukas Schulze, Jakob Zwiener, Thorsten Papenbrock, Felix Naumann: Holistic Data Profiling: Simultaneous Discovery of Various Metadata. EDBT 2016: 305-316.

Schema evolution

Carlo Curino, Hyun Jin Moon, Alin Deutsch, Carlo Zaniolo: Automating the database schema evolution process. VLDB J. 22(1): 73-98 (2013).

Ramanathan V. Guha, Dan Brickley, Steve Macbeth: Schema.org: evolution of structured data on the web. Commun. ACM 59(2): 44-51 (2016).