Data warehousing and online analytical processing (OLAP) are essential
facilities for many data analysis tasks and applications, including business
planning, market analysis, financial analysis, and security management. A data
warehouse is often huge in practice. Without "semantic navigation", it is
hard for a user to browse and analyze the data in a data warehouse effectively.
Moreover, it is hard to compress a data warehouse and use the compressed
database directly to answer various queries and conduct advanced analysis. Our
preliminary research results indicate that the critical obstacle is the lack of
a semantic summary structure in data warehouses.
This project systematically studies and develops a semantic summarization
approach for data warehousing and online analytical processing. The research
consists of (1) development of semantic summarization methods for data cubes; and
(2) efficient implementation and use of semantic summarization cubes,
including data storage, indexing, and query answering.
Broader impacts: The results of this project will provide novel
methods for effective and efficient data exploration in support of decision
making. They will also benefit related research on
advanced data analysis, including data mining, data visualization, and
interactive data exploration. The techniques developed in this project will be
demonstrated in a research prototype, which will be used in related courses
and made available on the Internet. A course on data warehousing and data
mining will be developed.
The materials published below are partially based upon work supported by the National Science Foundation under Grant No. 0308001. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
L. Wang, A. Zhang and M. Ramanathan, BioStar Models of Clinical and Genomic Data for Biomedical Data Warehouse Design, International Journal of Bioinformatics Research and Applications (IJBRA), 2005, Vol. 1, No.1, pp. 63-80.
J. Pei, H. Wang, J. Liu, K. Wang, J. Wang, and P. S. Yu. "Discovering Frequent Closed Partial Orders from Strings". IEEE Transactions on Knowledge and Data Engineering, IEEE Computer Society.
I. Pekerskaya, J. Pei, and K. Wang. "Mining Changing Regions from Access-Constrained Snapshots: A Cluster-Embedded Decision Tree Approach". Journal of Intelligent Information Systems (Special Issue on Mining Spatio-Temporal Data), Springer-Verlag.
Y. Huang, J. Pei, and H. Xiong. "Co-location Mining with Rare Spatial Features". Journal of GeoInformatica, Springer Science+Business Media B.V.
J. Pei, J. Han, and W. Wang. "Constraint-Based Sequential Pattern Mining: The Pattern-Growth Methods". Journal of Intelligent Information Systems, Springer-Verlag.
M. Cho, J. Pei, and K. Wang. "Answering Ad Hoc Aggregate Queries from Data Streams Using Prefix Aggregate Trees". Knowledge and Information Systems: An International Journal, Springer-Verlag.
D. Jiang, J. Pei, M. Ramanathan, C. Lin, C. Tang, and A. Zhang. "Mining Gene-Sample-Time Microarray Data: A Coherent Gene Cluster Discovery Approach". Knowledge and Information Systems: An International Journal, Springer-Verlag.
J. Wang, J. Han, and J. Pei. "Closed Constrained-Gradient Mining in Retail Databases". IEEE Transactions on Knowledge and Data Engineering, Volume 18, Number 6, pages 764-769, June, 2006, IEEE Computer Society.
J. Han, Y. Chen, G. Dong, J. Pei, B. W. Wah, J. Wang, and Y. D. Cai. "Stream Cube: An Architecture for Multi-Dimensional Analysis of Data Streams". Distributed and Parallel Databases, Volume 18, Number 2, pages 173-197, September 2005, Springer Science + Business Media.
D. Jiang, J. Pei, and A. Zhang. "An Interactive Approach to Mining Gene Expression Data". IEEE Transactions on Knowledge and Data Engineering, Volume 17, Number 10, pages 1363-1378, October 2005, IEEE Computer Society.
M. Cho, J. Pei, H. Wang, and W. Wang. "Preference-based Frequent Pattern Mining". International Journal of Data Warehousing and Mining, Volume 1, Number 4, pages 56-77, October-December 2005, Idea Group, Inc.
J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.C. Hsu. "Mining Sequential Patterns by Pattern-growth: The PrefixSpan Approach". IEEE Transactions on Knowledge and Data Engineering, Volume 16, Number 11, pages 1424-1440, November 2004, IEEE Computer Society.
J. Pei, G. Dong, W. Zou, and J. Han. "Mining Condensed Frequent Pattern Bases". Knowledge and Information Systems: An International Journal, Volume 6, Number 5, pages 570-594, September 2004, Springer-Verlag.
G. Dong, J. Han, J. Lam, J. Pei, K. Wang, and W. Zou. "Mining Constrained Gradients in Large Databases". IEEE Transactions on Knowledge and Data Engineering, Volume 16, Number 8, pages 922-938, August 2004, IEEE Computer Society.
J. Pei, J. Han, and L.V.S. Lakshmanan. "Pushing Convertible Constraints in Frequent Itemset Mining". Data Mining and Knowledge Discovery: An International Journal, Volume 8, Issue 3, pages 227-252, May, 2004, Kluwer Academic Publishers.
J. Han, J. Pei, and X. Yan. "From Sequential Pattern Mining to Structured Pattern Mining: A Pattern-Growth Approach". (Invited paper) Journal of Computer Science and Technology, Vol. 19, No. 3, pages 257-279. May 2004. Allerton Press, Inc.
J. Han, J. Pei, Y. Yin, and R. Mao. "Mining Frequent Patterns without Candidate Generation: A Frequent-pattern Tree Approach". Data Mining and Knowledge Discovery: An International Journal, Volume 8, Issue 1, pages 53-87, January 2004, Kluwer Academic Publishers.
D. Jiang, J. Pei, and A. Zhang. "Towards Interactive Exploration of Gene Expression Patterns". ACM SIGKDD Explorations (Special Issue on Microarray Data Analysis), Volume 5, Issue 2, pages 79-90, 2003.
W. Zhu, J. Pei, J. Yin, Y. Xie. "Granularity Adaptive Density Estimation and on-Demand Clustering of Concept-Drifting Data Streams". In Proceedings of the 8th International Conference on Data Warehousing and Knowledge Discovery (DaWaK'06), Krakow, Poland, September 4-8, 2006.
J. Li, R. C-W. Wong, A. W-C. Fu, and J. Pei. "Achieving k-Anonymity by Clustering in Attribute Hierarchical Structures". In Proceedings of the 8th International Conference on Data Warehousing and Knowledge Discovery (DaWaK'06), Krakow, Poland, September 4-8, 2006.
C. Aggarwal, J. Pei, and B. Zhang. "On Privacy Preservation against Adversarial Data Mining". In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), Philadelphia, PA, USA, August 20-23, 2006.
J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A. W-C. Fu. "Utility-Based Anonymization Using Local Recoding". In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), Philadelphia, PA, USA, August 20-23, 2006.
J. Li, H. Li, L. Wong, J. Pei, and G. Dong. "Minimum Description Length (MDL) Principle: Generators Are Preferable to Closed Patterns". In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI'06), Boston, MA, USA, July 16-20, 2006.
T. Xie and J. Pei. "MAPO: Mining API Usages from Open Source Repositories" (short paper). In Proceedings of the 3rd International Workshop on Mining Software Repositories (MSR 2006), Shanghai, China, May 22-23, 2006.
B.-W. On, D. Lee, E. Elmacioglu, J. Kang, and J. Pei. "An Effective Approach to Entity Resolution Problem Using Quasi-Clique and its Application to Digital Libraries" (short paper). In Proceedings of the ACM/IEEE 2006 Joint Conf. on Digital Libraries (JCDL'06), Chapel Hill, NC, USA, June 11-15, 2006.
Y. Tao, X. Xiao, and J. Pei. "SUBSKY: Efficient Computation of Skylines in Subspaces". In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), Atlanta, GA, USA, April 3-7, 2006.
J. Pei, J. Liu, H. Wang, K. Wang, P. S. Yu, and J. Wang. "Efficiently Mining Frequent Closed Partial Orders". In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM'05), New Orleans, Louisiana, USA, November 27-30, 2005.
H. Wang and J. Pei. "A Random Method for Quantifying Changing Distributions in Data Streams". In Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'05), Porto, Portugal, October 3-7, 2005.
J. Ye, X. Zhou, J. Pei, L. Chen, and L. Zhang. "A Stratification-Based Approach to Accurate and Fast Image Annotation". In Proceedings of the 6th International Conference on Web-Age Information Management (WAIM'05), Hangzhou, China, October 11-13, 2005.
C. Liu, K. Wu, and J. Pei. "A Dynamic Clustering and Scheduling Approach to Energy Saving in Data Collection from Wireless Sensor Networks". In Proceedings of the 2nd Annual IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (SECON'05), Santa Clara, California, USA, September 26-29, 2005.
J. Pei, W. Jin, M. Ester, and Y. Tao. "Catching the Best Views of Skyline: A Semantic Approach Based on Decisive Subspaces". In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB'05), Trondheim, Norway, August 30-September 2, 2005.
J. Pei, D. Jiang, and A. Zhang. "On Mining Cross-Graph Quasi-Cliques". In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05), Chicago, IL, USA, August 21-24, 2005.
H. Wang, J. Pei, and P. S. Yu. "Pattern based Similarity Search for Microarray Data" (Industrial and Government Track poster paper). In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05), Chicago, IL, USA, August 21-24, 2005.
H. Yu, J. Pei, S. Tang, and D. Yang. "Mining Most General Multidimensional Summarization of Probable Groups in Data Warehouses". In Proceedings of the 17th International Scientific and Statistical Database Management Conference (SSDBM'05), Santa Barbara, California, USA, June 27-29, 2005.
W. Wang, C. Wang, Y. Zhu, B. Shi, J. Pei, X. Yan, and J. Han. "GraphMiner: A Structural Pattern Mining System for Large Disk-Based Graph Databases and Its Applications" (demo paper). In Proceedings of the 24th ACM SIGMOD International Conference on Management of Data (SIGMOD'05), Baltimore, Maryland, USA, June 14-16, 2005.
M. Cho, J. Pei and D. Cheung. "Cross Table Cubing: Mining Iceberg Cubes from Data Warehouses" (poster paper). In Proceedings of the 5th SIAM International Conference on Data Mining (SDM'05), Newport Beach, CA, USA, April 21-23, 2005.
D. Jiang, J. Pei and A. Zhang. "A General Approach to Mining Quality Pattern-based Clusters from Gene Expression Data". In Proceedings of the 10th International Conference on Database Systems for Advanced Applications (DASFAA'05), Beijing, China, April 18-20, 2005.
G. Dong, C. Jiang, J. Pei, J. Li and L. Wong. "Mining Succinct Systems of Minimal Generators of Formal Concepts". In Proceedings of the 10th International Conference on Database Systems for Advanced Applications (DASFAA'05), Beijing, China, April 18-20, 2005.
J. Pei, D. Jiang and A. Zhang. "Mining Cross-graph Quasi-cliques in Gene Expression and Protein Interaction Data" (research poster paper). In Proceedings of the 21st International Conference on Data Engineering (ICDE 2005), Tokyo, Japan, April 5-8, 2005.
C. Wang, W. Wang, J. Pei, Y. Zhu and B. Shi. "Scalable Mining of Large Disk-based Graph Databases" (research full paper). In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA, USA, August 22 - 25, 2004.
D. Jiang, J. Pei, M. Ramanathan, C. Tang and A. Zhang. "Mining Coherent Gene Clusters from Gene-Sample-Time Microarray Data" (industrial full paper, Runner-up for the best application paper award). In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA, USA, August 22-25, 2004.
L. Deng, J. Pei, J. Ma and D.L. Lee. "A Rank Sum Test Method for Informative Gene Discovery" (industrial full paper). In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA, USA, August 22-25, 2004.
D. Jiang, J. Pei and A. Zhang. "GPX: Interactive Mining of Gene Expression Data" (demo paper). In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB'04), Toronto, ON, Canada, August 30-September 3, 2004.
H. Wang, F. Chu, W. Fan, P.S. Yu and J. Pei. "A Fast Algorithm for Subspace Clustering by Pattern Similarity" (full paper). In Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), Santorini Island, Greece, 21-23 June 2004.
C. Wang, M. Hong, J. Pei, H. Zhou, W. Wang and B. Shi. "Efficient Pattern-Growth Methods for Frequent Tree Pattern Mining" (full paper). In Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'04), Sydney, Australia, May 26-28, 2004.
J. Pei, X. Zhang, M. Cho, H. Wang and P.S. Yu. "MaPle: A Fast Algorithm for Maximal Pattern-based Clustering" (Regular paper). In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM'03), Melbourne, Florida, USA, November 19-22, 2003.
D. Jiang, J. Pei and A. Zhang. "Interactive Exploration of Coherent Patterns in Time-Series Gene Expression Data". In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03), Washington, DC, USA, August 24-27, 2003.
C. Tang, A. Zhang, and J. Pei. "Mining Phenotypes and Informative Genes from Gene Expression Data". In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03), Washington, DC, USA, August 24-27, 2003.
J. Wang, J. Han, and J. Pei. "CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets". In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03), Washington, DC, USA, August 24-27, 2003.
L.V.S. Lakshmanan, J. Pei, and Y. Zhao. "Efficacious Data Cube Exploration by Semantic Summarization and Compression" (demo paper). In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB'03), Berlin, Germany, September 9-12, 2003.
L.V.S. Lakshmanan, J. Pei, and Y. Zhao. "QC-Trees: An Efficient Summary Structure for Semantic OLAP". In Proceedings of the 2003 ACM-SIGMOD International Conference on Management of Data (SIGMOD'03), San Diego, CA, June 9-12, 2003.
L.V.S. Lakshmanan, J. Pei, and Y. Zhao. "SOCQET: Semantic OLAP with Compressed Cube and Summarization" (demo paper). In Proceedings of the 2003 ACM-SIGMOD International Conference on Management of Data (SIGMOD'03), San Diego, CA, June 9-12, 2003.
Jian Pei. "A General Model for Online Analytical Processing of Complex Data". In Proceedings of the 22nd International Conference on Conceptual Modeling (ER'03), Chicago, IL, October 13-16, 2003.
The PI and co-PI have worked on the following issues:
Many data analysis tasks need data warehouses as essential infrastructure. A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process [Inmon, 1996]. Conceptually, the core of a data warehouse can be implemented as a data cube, which is a multi-level, multi-dimensional database that stores aggregates of the data at various levels of granularity.
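To make the data cube notion concrete, the following is a minimal sketch in Python, not the project's implementation. It assumes a tiny hypothetical sales table with dimensions store, product, and season, a SUM aggregate over a hypothetical "amount" measure, and the common convention of writing "*" for a dimension that has been aggregated away. The function name data_cube and all data values are illustrative only.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # marks a dimension that has been aggregated away (assumed convention)


def data_cube(rows, dims, measure):
    """Return {cell: SUM(measure)} for every group-by over the given dimensions.

    A cell is a tuple with one entry per dimension: either a concrete value
    or ALL. With n dimensions there are 2**n group-bys.
    """
    cube = defaultdict(int)
    for k in range(len(dims) + 1):
        for kept in combinations(dims, k):
            for row in rows:
                cell = tuple(row[d] if d in kept else ALL for d in dims)
                cube[cell] += row[measure]
    return dict(cube)


# Hypothetical base table; the values are made up for illustration.
sales = [
    {"store": "S1", "product": "P1", "season": "spring", "amount": 6},
    {"store": "S1", "product": "P2", "season": "spring", "amount": 12},
    {"store": "S2", "product": "P1", "season": "fall", "amount": 9},
]

cube = data_cube(sales, ["store", "product", "season"], "amount")

# Print the cells from the most general (all "*") to the most specific.
for cell, total in sorted(cube.items(), key=lambda kv: (-kv[0].count(ALL), kv[0])):
    print(cell, total)
```

Even this three-row table yields 18 aggregate cells, which hints at why a real data cube grows so large and why some form of summarization is attractive.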
A data warehouse is often huge in practice. There are some inherent problems that the current techniques cannot handle well.
In summary, the key problem is that conventional data warehousing and OLAP techniques do not provide effective semantic summarization of data cubes. Without proper semantic summarization, users may not be able to understand and use the information in a huge data cube. This motivates our proposal of a systematic study of a semantic summarization approach for data warehousing and OLAP.
Data warehousing, online analytical processing, data mining, query optimization