Project Title: A Semantic Summarization Approach to Data Warehousing and Online Analytical Processing

Project Award Number: IIS-0308001


Principal Investigator

Aidong Zhang
Department of Computer Science and Engineering
State University of New York at Buffalo

201 Bell Hall

Buffalo, NY 14260-2000
Phone Number: (716) 645-3180
Fax Number: (716) 645-3464
Email: azhang@cse.buffalo.edu
URL: http://www.cse.buffalo.edu/faculty/azhang/

Co-PI

Jian Pei
School of Computing Science
Simon Fraser University
8888 University Drive
Burnaby, BC Canada V5A 1S6
Phone Number: (604) 268-6851
Fax Number: (604) 291-3045
Email: jpei@cs.sfu.ca
URL: http://www.cs.sfu.ca/~jpei

Keywords

data warehouse
online analytical processing
semantic summarization
OLAP query answering
data mining

Project Summary

Data warehousing and online analytical processing (OLAP) are essential facilities for many data analysis tasks and applications, including business planning, market analysis, financial analysis, and security management. A data warehouse is often huge in practice. Without any "semantic navigation", it is hard for a user to browse and analyze the data in a data warehouse effectively. Moreover, it is hard to compress a data warehouse and use the compressed database directly to answer various queries and conduct advanced analysis. Our preliminary research results indicate that the critical obstacle is the lack of a semantic summary structure in a data warehouse.

This project systematically studies and develops a semantic summarization approach for data warehousing and online analytical processing. The research consists of (1) development of semantic summarization methods in data cubes; and (2) efficient implementation and use of semantic summarization cubes, including data storage, indexing, and query answering.

Broader impacts: The results of this project will provide novel methods for effective and efficient data exploration that supports decision making. They will also benefit related research on advanced data analysis, including data mining, data visualization, and interactive data exploration. The techniques developed in this project will be illustrated in a research prototype, which will be used in the related courses and also made available on the Internet. A course on data warehousing and data mining will be developed.

Publications and Products (most of the publications are available online at http://www.cs.sfu.ca/~jpei/publications)

The materials published below are partially based upon work supported by the National Science Foundation under Grant No. 0308001. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

  1. L. Wang, A. Zhang and M. Ramanathan, BioStar Models of Clinical and Genomic Data for Biomedical Data Warehouse Design, International Journal of Bioinformatics Research and Applications (IJBRA), 2005, Vol. 1, No.1, pp. 63-80.

  2. J. Pei, H. Wang, J. Liu, K. Wang, J. Wang, and P. S. Yu. "Discovering Frequent Closed Partial Orders from Strings". IEEE Transactions on Knowledge and Data Engineering, IEEE Computer Society.

  3. I. Pekerskaya, J. Pei, and K. Wang. "Mining Changing Regions from Access-Constrained Snapshots: A Cluster-Embedded Decision Tree Approach". Journal of Intelligent Information Systems (Special Issue on Mining Spatio-Temporal Data), Springer-Verlag.

  4. Y. Huang, J. Pei, and H. Xiong. "Co-location Mining with Rare Spatial Features". Journal of GeoInformatica, Springer Science+Business Media B.V.

  5. J. Pei, J. Han, and W. Wang. "Constraint-Based Sequential Pattern Mining: The Pattern-Growth Methods". Journal of Intelligent Information Systems, Springer-Verlag.

  6. M. Cho, J. Pei, and K. Wang. "Answering Ad Hoc Aggregate Queries from Data Streams Using Prefix Aggregate Trees". Knowledge and Information Systems: An International Journal, Springer-Verlag.

  7. D. Jiang, J. Pei, M. Ramanathan, C. Lin, C. Tang, and A. Zhang. "Mining Gene-Sample-Time Microarray Data: A Coherent Gene Cluster Discovery Approach". Knowledge and Information Systems: An International Journal, Springer-Verlag.

  8. J. Wang, J. Han, and J. Pei. "Closed Constrained-Gradient Mining in Retail Databases". IEEE Transactions on Knowledge and Data Engineering, Volume 18, Number 6, pages 764-769, June, 2006, IEEE Computer Society.

  9. J. Han, Y. Chen, G. Dong, J. Pei, B. W. Wah, J. Wang, and Y. D. Cai. "Stream Cube: An Architecture for Multi-Dimensional Analysis of Data Streams". Distributed and Parallel Databases, Volume 18, Number 2, pages 173-197, September 2005, Springer Science + Business Media.

  10. D. Jiang, J. Pei, and A. Zhang. "An Interactive Approach to Mining Gene Expression Data". IEEE Transactions on Knowledge and Data Engineering, Volume 17, Number 10, pages 1363-1378, October 2005, IEEE Computer Society.

  11. M. Cho, J. Pei, H. Wang, and W. Wang. "Preference-based Frequent Pattern Mining". International Journal of Data Warehousing and Mining, Volume 1, Number 4, pages 56-77, October-December 2005, Idea Group, Inc.

  12. J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.C. Hsu. "Mining Sequential Patterns by Pattern-growth: The  PrefixSpan Approach". IEEE Transactions on Knowledge and Data Engineering, Volume 16, Number 11, pages 1424-1440, November 2004, IEEE Computer Society.

  13. J. Pei, G. Dong, W. Zou, and J. Han. "Mining Condensed Frequent Pattern Bases". Knowledge and Information Systems: An International Journal, Volume 6, Number 5, pages 570-594, September 2004, Springer-Verlag.

  14. G. Dong, J. Han, J. Lam, J. Pei, K. Wang, and W. Zou. "Mining Constrained Gradients in Large Databases". IEEE Transactions on Knowledge and Data Engineering, Volume 16, Number 8, pages 922-938, August 2004, IEEE Computer Society.

  15. J. Pei, J. Han, and L.V.S. Lakshmanan. "Pushing Convertible Constraints in Frequent Itemset Mining". Data Mining and Knowledge Discovery: An International Journal, Volume 8, Issue 3, pages 227-252, May, 2004, Kluwer Academic Publishers.

  16. J. Han, J. Pei, and X. Yan. "From Sequential Pattern Mining to Structured Pattern Mining: A Pattern-Growth Approach". (Invited paper) Journal of Computer Science and Technology, Vol. 19, No. 3, pages 257-279. May 2004. Allerton Press, Inc.

  17. J. Han, J. Pei, Y. Yin, and R. Mao. "Mining Frequent Patterns without Candidate Generation: A Frequent-pattern Tree Approach". Data Mining and Knowledge Discovery: An International Journal, Volume 8, Issue 1, pages 53-87, January 2004,  Kluwer Academic Publishers.

  18. D. Jiang, J. Pei, and A. Zhang. "Towards Interactive Exploration of Gene Expression Patterns". ACM SIGKDD Explorations (Special Issue on Microarray Data Analysis), Volume 5, Issue 2, pages 79-90, 2003.

  19. W. Zhu, J. Pei, J. Yin, and Y. Xie. "Granularity Adaptive Density Estimation and on-Demand Clustering of Concept-Drifting Data Streams". In Proceedings of the 8th International Conference on Data Warehousing and Knowledge Discovery (DaWaK'06), Krakow, Poland, September 4-8, 2006.

  20. J. Li, R. C-W. Wong, A. W-C. Fu, and J. Pei. "Achieving k-Anonymity by Clustering in Attribute Hierarchical Structures". In Proceedings of the 8th International Conference on Data Warehousing and Knowledge Discovery (DaWaK'06), Krakow, Poland, September 4-8, 2006.

  21. C. Aggarwal, J. Pei, and B. Zhang. "On Privacy Preservation against Adversarial Data Mining". In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), Philadelphia, PA, USA, August 20-23, 2006.

  22. J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A. W-C. Fu. "Utility-Based Anonymization Using Local Recoding". In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), Philadelphia, PA, USA, August 20-23, 2006.

  23. J. Li, H. Li, L. Wong, J. Pei, and G. Dong. "Minimum Description Length (MDL) Principle: Generators Are Preferable to Closed Patterns". In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI'06), Boston, MA, USA, July 16-20, 2006.

  24. T. Xie and J. Pei. "MAPO: Mining API Usages from Open Source Repositories" (short paper). In Proceedings of the 3rd International Workshop on Mining Software Repositories (MSR 2006), Shanghai, China, May 22-23, 2006.

  25. B.-W. On, D. Lee, E. Elmacioglu, J. Kang, and J. Pei. "An Effective Approach to Entity Resolution Problem Using Quasi-Clique and its Application to Digital Libraries" (short paper). In Proceedings of the ACM/IEEE 2006 Joint Conf. on Digital Libraries (JCDL'06), Chapel Hill, NC, USA, June 11-15, 2006.

  26. Y. Tao, X. Xiao, and J. Pei. "SUBSKY: Efficient Computation of Skylines in Subspaces". In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), Atlanta, GA, USA, April 3-7, 2006.

  27. J. Pei, J. Liu, H. Wang, K. Wang, P. S. Yu, and J. Wang. "Efficiently Mining Frequent Closed Partial Orders". In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM'05), New Orleans, Louisiana, USA, November 27-30, 2005.

  28. H. Wang and J. Pei. "A Random Method for Quantifying Changing Distributions in Data Streams". In Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'05), Porto, Portugal, October 3-7, 2005.

  29. J. Ye, X. Zhou, J. Pei, L. Chen, and L. Zhang. "A Stratification-Based Approach to Accurate and Fast Image Annotation". In Proceedings of the 6th International Conference on Web-Age Information Management (WAIM'05), Hangzhou, China, October 11-13, 2005.

  30. C. Liu, K. Wu, and J. Pei. "A Dynamic Clustering and Scheduling Approach to Energy Saving in Data Collection from Wireless Sensor Networks". In Proceedings of the 2nd Annual IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (SECON'05), Santa Clara, California, USA, September 26-29, 2005.

  31. J. Pei, W. Jin, M. Ester, and Y. Tao. "Catching the Best Views of Skyline: A Semantic Approach Based on Decisive Subspaces". In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB'05), Trondheim, Norway, August 30-September 2, 2005.

  32. J. Pei, D. Jiang, and A. Zhang. "On Mining Cross-Graph Quasi-Cliques". In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05), Chicago, IL, USA, August 21-24, 2005.

  33. H. Wang, J. Pei, and P. S. Yu. "Pattern based Similarity Search for Microarray Data" (Industrial and Government Track poster paper). In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05), Chicago, IL, USA, August 21-24, 2005.

  34. H. Yu, J. Pei, S. Tang, and D. Yang. "Mining Most General Multidimensional Summarization of Probable Groups in Data Warehouses". In Proceedings of the 17th International Scientific and Statistical Database Management Conference (SSDBM'05), Santa Barbara, California, USA, June 27-29, 2005.

  35. W. Wang, C. Wang, Y. Zhu, B. Shi, J. Pei, X. Yan, and J. Han. "GraphMiner: A Structural Pattern Mining System for Large Disk-Based Graph Databases and Its Applications" (demo paper). In Proceedings of the 24th ACM SIGMOD International Conference on Management of Data (SIGMOD'05), Baltimore, Maryland, USA, June 14-16, 2005.

  36. M. Cho, J. Pei and D. Cheung. "Cross Table Cubing: Mining Iceberg Cubes from Data Warehouses" (poster paper). In Proceedings of the 5th SIAM International Conference on Data Mining (SDM'05), Newport Beach, CA, USA,  April 21-23, 2005.

  37. D. Jiang, J. Pei and A. Zhang. "A General Approach to Mining Quality Pattern-based Clusters from Gene Expression Data". In Proceedings of the 10th International Conference on Database Systems for Advanced Applications (DASFAA'05), Beijing, China, April 18-20, 2005.

  38. G. Dong, C. Jiang, J. Pei, J. Li and L. Wong. "Mining Succinct Systems of Minimal Generators of Formal Concepts". In Proceedings of the 10th International Conference on Database Systems for Advanced Applications (DASFAA'05), Beijing, China, April 18-20, 2005.

  39. J. Pei, D. Jiang and A. Zhang. "Mining Cross-graph Quasi-cliques in Gene Expression and Protein Interaction Data" (research poster paper). In Proceedings of the 21st International Conference on Data Engineering (ICDE 2005), Tokyo, Japan, April 5-8, 2005.

  40. C. Wang, W. Wang, J. Pei, Y. Zhu and B. Shi. "Scalable Mining of Large Disk-based Graph Databases" (research full paper). In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA, USA, August 22 - 25, 2004.

  41. D. Jiang, J. Pei, M. Ramanathan, C. Tang and A. Zhang. "Mining Coherent Gene Clusters from Gene-Sample-Time Microarray Data" (industrial full paper, Runner-up for the best application paper award). In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA, USA, August 22-25, 2004.

  42. L. Deng, J. Pei, J. Ma and D.L. Lee. "A Rank Sum Test Method for Informative Gene Discovery" (industrial full paper). In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA, USA, August 22-25, 2004.

  43. D. Jiang, J. Pei and A. Zhang. "GPX: Interactive Mining of Gene Expression Data" (demo paper). In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB'04), Toronto, ON, Canada, August 30-September 3, 2004.

  44. H. Wang, F. Chu, W. Fan, P.S. Yu and J. Pei. "A Fast Algorithm for Subspace Clustering by Pattern Similarity" (full paper). In Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), Santorini Island, Greece, 21-23 June 2004.

  45. C. Wang, M. Hong, J. Pei, H. Zhou, W. Wang and B. Shi. "Efficient Pattern-Growth Methods for Frequent Tree Pattern Mining" (full paper). In Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'04), Sydney, Australia, May 26-28, 2004.

  46. J. Pei, X. Zhang, M. Cho, H. Wang and P.S. Yu. "MaPle: A Fast Algorithm for Maximal Pattern-based Clustering" (Regular paper). In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM'03), Melbourne, Florida, USA, November 19-22, 2003.

  47. D. Jiang, J. Pei and A. Zhang. "Interactive Exploration of Coherent Patterns in Time-Series Gene Expression Data". In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03), Washington, DC, USA, August 24-27, 2003.

  48. C. Tang, A. Zhang, and J. Pei. "Mining Phenotypes and Informative Genes from Gene Expression Data". In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03), Washington, DC, USA, August 24-27, 2003.

  49. J. Wang, J. Han, and J. Pei. "CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets". In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03), Washington, DC, USA, August 24-27, 2003.

  50. L.V.S. Lakshmanan, J. Pei, and Y. Zhao. "Efficacious Data Cube Exploration by Semantic Summarization and Compression" (demo paper). In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB'03), Berlin, Germany, September 9-12, 2003.

  51. L.V.S. Lakshmanan, J. Pei, and Y. Zhao. "QC-Trees: An Efficient Summary Structure for Semantic OLAP". In Proceedings of the 2003 ACM-SIGMOD International Conference on Management of Data (SIGMOD'03), San Diego, CA, June 9-12, 2003.

  52. L.V.S. Lakshmanan, J. Pei, and Y. Zhao. "SOCQET: Semantic OLAP with Compressed Cube and Summarization" (demo paper). In Proceedings of the 2003 ACM-SIGMOD International Conference on Management of Data (SIGMOD'03), San Diego, CA, June 9-12, 2003.

  53. Jian Pei. "A General Model for Online Analytical Processing of Complex Data". In Proceedings of the 22nd International Conference on Conceptual Modeling (ER'03), Chicago, IL, October 13-16, 2003.

Project Impact

Human resources: One Ph.D. student and one master's student supported by this project have graduated from SUNY Buffalo.

Education and curriculum development: The PI and co-PI offered a graduate course on data warehousing and data mining at SUNY Buffalo (Fall 2003, Fall 2005, and Fall 2006) and at Shanghai Jiao Tong University (Shanghai, China, December 2003), during which the related research results were discussed. The course has been offered regularly at the PI and co-PI's universities. The PI and co-PI have offered graduate seminars on related topics, too.

Goals, Objectives and Targeted Activities

The PI and co-PI have worked on the following issues:

  • Approximate quotient cubes;
  • Advanced OLAP queries over data cubes;
  • Privacy-preserving OLAP operations by auditing; and
  • Data warehousing and biomedical applications.

Area Background

Many data analysis tasks need data warehouses as essential infrastructure. A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process [Inmon, 1996]. Conceptually, the core of a data warehouse can be implemented as a data cube, which is a multi-level, multi-dimensional database with various aggregate data.
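As a rough illustration of the data cube idea described above, the following Python sketch materializes one aggregate (SUM) for every group-by over a tiny fact table. The function name compute_cube, the "*" marker for an aggregated-away dimension, and the sample rows are all assumptions made for this example; it is a naive enumeration, not the project's prototype or any of the cube algorithms cited in the references.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # hypothetical marker: the dimension has been aggregated away


def compute_cube(rows, dims, measure):
    """Return {cell: SUM(measure)} for every group-by over dims.

    A cell is a tuple with one entry per dimension; an entry is either
    a concrete value or ALL. With d dimensions, each base row
    contributes to 2**d cells.
    """
    cube = defaultdict(float)
    for row in rows:
        for k in range(len(dims) + 1):
            for kept in combinations(range(len(dims)), k):
                cell = tuple(
                    row[d] if i in kept else ALL for i, d in enumerate(dims)
                )
                cube[cell] += row[measure]
    return dict(cube)


# Toy fact table (invented data).
rows = [
    {"store": "S1", "product": "P1", "season": "spring", "sales": 6},
    {"store": "S1", "product": "P2", "season": "spring", "sales": 12},
    {"store": "S2", "product": "P1", "season": "fall", "sales": 9},
]
cube = compute_cube(rows, ["store", "product", "season"], "sales")
print(cube[("S1", ALL, ALL)])   # total sales of store S1
print(cube[(ALL, ALL, ALL)])    # grand total
```

Even three base rows over three dimensions yield 18 distinct cells here, which hints at why a real cube lattice grows so quickly.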

A data warehouse is often huge in practice. There are some inherent problems that the current techniques cannot handle well.

  • Problem 1: A data cube often registers insufficient information about the semantic relations among aggregate cells. The semantic relations among aggregate cells are critical for achieving effective OLAP services.
  • Problem 2: A data cube usually does not support the semantic navigation of aggregate cells. In practice, a data cube lattice could be huge. Exploring a huge lattice is far from trivial. Suppose that a manager wants to identify exceptions by browsing the data. Without proper navigation, the manager has no idea which dimensions should be used to roll up or drill down. Many steps in her exploration may be fruitless. To provide an effective navigation service, the data cube needs to store more semantics of the data than just the roll-up/drill-down relation.
  • Problem 3: No semantic compression of data cubes. It is well recognized that data cubes in many applications tend to be huge. Thus, many studies have focused on compressing data cubes. However, almost all approaches proposed previously are syntactic instead of semantic. That is, they do not consider the relations among cells in a cube. The compressed database may not be used directly for query answering, data browsing, or advanced analysis.
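One possible reading of a "semantic relation among aggregate cells", sketched here purely for illustration, is that two cells are related when they aggregate exactly the same set of base tuples, so any aggregate function returns the same value for both. Grouping cells this way is in the spirit of the quotient cube idea named among the project's topics, but the function name cover_classes, the "*" marker, and the sample rows are assumptions of this sketch and not the project's actual structures.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # hypothetical marker: the dimension has been aggregated away


def cover_classes(rows):
    """Group cube cells by their cover set.

    The cover set of a cell is the set of indexes of base rows that
    roll up into it. Cells sharing a cover set necessarily share every
    aggregate value, so one stored value per class would suffice.
    """
    n_dims = len(rows[0])
    cover = defaultdict(set)
    for idx, row in enumerate(rows):
        for k in range(n_dims + 1):
            for kept in combinations(range(n_dims), k):
                cell = tuple(v if i in kept else ALL for i, v in enumerate(row))
                cover[cell].add(idx)
    classes = defaultdict(list)
    for cell, idxs in cover.items():
        classes[frozenset(idxs)].append(cell)
    return dict(classes)


# Toy base table over (store, product, season); invented data.
rows = [
    ("S1", "P1", "spring"),
    ("S1", "P2", "spring"),
    ("S2", "P1", "fall"),
]
classes = cover_classes(rows)
for idxs, cells in classes.items():
    print(sorted(idxs), sorted(cells))
```

In this toy table the cells (S1, *, *), (*, *, spring), and (S1, *, spring) fall into one class because each covers exactly the first two rows. A summary that stores classes rather than individual cells is smaller and also tells the user that drilling from one of these cells to another reveals nothing new.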

In summary, the key problem is that conventional data warehousing and OLAP techniques do not incorporate effective semantic summarization in data cubes. Without proper semantic summarization, users may not be able to understand and use the information from a huge data cube. This motivates our proposal of a systematic study on a semantic summarization approach for data warehousing and OLAP.

Area References

  • S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. 1996 Int. Conf. Very Large Data Bases (VLDB'96), pages 506-521, Bombay, India, Sept. 1996.
  • Andrey Balmin, Thanos Papadimitriou, and Yannis Papakonstantinou. Hypothetical queries in an OLAP environment. In Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 220-231. Morgan Kaufmann, 2000.
  • K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), pages 359-370, Philadelphia, PA, June 1999.
  • Claudio Carpineto and Giovanni Romano. A lattice conceptual clustering system and its application to browsing retrieval. Machine Learning, 24:95-122, 1996.
  • S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26:65-74, 1997.
  • J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
  • V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'96), pages 205-216, Montreal, Canada, June 1996.
  • W. H. Inmon. Building the Data Warehouse. John Wiley & Sons, 1996.
  • Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, and Divesh Srivastava. Answering queries using views. In Proceedings of the Fourteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 22-25, 1995, San Jose, California, pages 95-104. ACM Press, 1995.
  • K. Ross and D. Srivastava. Fast computation of sparse datacubes. In Proc. 1997 Int. Conf. Very Large Data Bases (VLDB'97), pages 116-125, Athens, Greece, Aug. 1997.
  • Nick Roussopoulos, Yannis Kotidis, and Mema Roussopoulos. Cubetree: Organization of and bulk updates on the data cube. In Joan Peckham, editor, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15, 1997, Tucson, Arizona, USA, pages 89-99. ACM Press, 1997.
  • S. Sarawagi. Indexing OLAP data. Bulletin of the Technical Committee on Data Engineering, 20:36-43, 1997.
  • Yannis Sismanis, Nick Roussopoulos, Antonios Deligiannakis, and Yannis Kotidis. Dwarf: Shrinking the petacube. In Proc. 2002 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'02), Madison, Wisconsin, June 2002.
  • W. Wang, H. Lu, J. Feng, and J. X. Yu. Condensed cube: An effective approach to reducing data cube size. In Proc. 2002 Int. Conf. Data Engineering (ICDE'02), San Francisco, CA, April 2002.

Potential Related Projects

Data warehousing, online analytical processing, data mining, query optimization

Project Websites
http://www.cse.buffalo.edu/DBGROUP/JianPei/index.htm
The homepage of the project.

The materials in this website are based upon work supported by the National Science Foundation under Grant No. 0308001. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.