CSE 601: Data Mining and Bioinformatics
Fall 2012
Basic Information
Overview
This course covers the fundamental techniques of data mining: data warehousing, frequent pattern mining, clustering, classification, anomaly detection, and feature selection. To show how these techniques are applied in practice, the course uses the design of bioinformatics software systems as its application domain, discussing applications of data warehousing and data mining in biological and biomedical fields. The class will examine a range of software systems so that students gain a comprehensive understanding of the bioinformatics field. Projects will be based on these applications.
Textbooks
- Data Mining: Concepts and Techniques, 3rd ed. Jiawei Han and Micheline Kamber, ISBN-13: 978-1-55860-901-3, Morgan Kaufmann Publishers.
- Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Addison Wesley.
- Data Warehousing. Paulraj Ponniah. John Wiley & Sons, Inc.
References
- Bioinformatics: Managing Scientific Data. Zoe Lacroix and Terence Critchlow. 2003. Morgan Kaufmann Publishers.
- Advanced Analysis of Gene Expression Microarray Data. Aidong Zhang. ISBN 981-256-645-7. World Scientific Publishing Co.
Grading Policy
Grades will be computed based on the following factors (subject to change):
- Class participation and presentation -- 15%
- Quizzes -- 5%
- Projects (3) -- 60%
- Homework (3-4) -- 20%
Course Topics and Schedule
The lecture slides were developed from materials drawn from several sources. Please see the copyright notice.
Date | Topic | Assignment | Readings |
August 28 | Introduction and Overview | N/A | N/A |
August 30 | Data Warehouse | N/A | Chapters 4&5 (Han), Ponniah |
September 4 | Data Warehouse | Homework 1 out | Chapters 4&5 (Han), Ponniah |
September 6 | Data Warehouse | N/A | Chapters 4&5 (Han), Ponniah |
September 11 | Homework 1 Presentation | Homework 1 due | N/A |
September 13 | Homework 1 Presentation | N/A | N/A |
September 18 | Association Rule | Project 1 out | Chapters 6&7 (Han), Chapters 6&7 (Tan) |
September 20 | Association Rule | Homework 2 out | Chapters 6&7 (Han), Chapters 6&7 (Tan) |
September 25 | Clustering Basics | N/A | Chapters 8&9 (Tan), Chapters 10&11 (Han), [Jain10] |
September 27 | Clustering Basics; Partitional Clustering | N/A | Chapters 8&9 (Tan), Chapters 10&11 (Han), [Jain10] |
October 2 | Hierarchical Clustering; Density-based Clustering | N/A | Chapters 8&9 (Tan), Chapters 10&11 (Han) |
October 4 | Mixture Model; Spectral Clustering | N/A | [Luxburg07], [ShMa00] |
October 11 | Project 1 Demo | Project 1 due | N/A |
October 16 | Spectral Clustering; Clustering Ensemble | Project 2 out | [Luxburg07], [ShMa00], [GFJ10], [GhAc11] |
October 18 | MapReduce | N/A | [Lin10], [SPY10] |
October 23 | Clustering: Other Topics | N/A | Chapters 10&11 (Han) |
October 25 | Classification: Basics | N/A | Chapters 4&5 (Tan), Chapters 8&9 (Han) |
October 30 | Classification: Basics; Classification: Methods | N/A | Chapters 4&5 (Tan), Chapters 8&9 (Han) |
November 1 | Classification: Methods | N/A | Chapters 4&5 (Tan), Chapters 8&9 (Han), [Mitchell05], [NgNotes] |
November 6 | Classification: Methods | N/A | Chapters 4&5 (Tan), Chapters 8&9 (Han), [SeEl10] |
November 8 | Classification: Advanced Topics | N/A | [GFJ10], [Zhu08], [PaYa10] |
November 13 | Network Mining | Project 2 due; Project 3 out | |
November 15 | Network Mining | Homework 3 out | |
November 20 | Anomaly Detection | N/A | Chapter 10 (Tan), Chapter 12 (Han), [CBK09] |
November 27 | Anomaly Detection | N/A | Chapter 10 (Tan), Chapter 12 (Han), [CBK09] |
November 29 | Data Mining Competitions | N/A | [KDDCUP] |
December 4 | Netflix Competition | N/A | [Netflix] |
December 6 | Research Overview | Homework 3 due | N/A |
Homeworks & Projects
Homework 1: Schema Design for Data Warehouse: Due September 11.
Project 1: Data Warehouse/OLAP System: Due October 11.
Homework 2: Mining Association Rules from Gene Expression Data: Due October 1.
Homework 3: Clustering Analysis for Complex Networks: Due December 6 before class.
Supplementary Materials
[Jain10] Anil K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31(8): 651-666, 2010. [Paper]
[Luxburg07] Ulrike von Luxburg. A Tutorial on Spectral Clustering. Statistics and Computing 17(4), 2007. [Paper]
[ShMa00] Jianbo Shi and Jitendra Malik. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 2000. [Paper]
[GFJ10] Jing Gao, Wei Fan, and Jiawei Han. On the Power of Ensemble: Supervised and Unsupervised Methods Reconciled. Tutorial at the SIAM Data Mining Conference (SDM), Columbus, OH, 2010. [Link]
[GhAc11] Joydeep Ghosh and Ayan Acharya. Cluster Ensembles. WIREs Data Mining and Knowledge Discovery 1(4), 2011. [Paper]
[Lin10] Jimmy Lin. Data-Intensive Information Processing Applications. University of Maryland, 2010. [Link]
[Mitchell05] Tom Mitchell. Machine Learning (sample chapter on Naive Bayes and Logistic Regression), 2005. [Chapter]
[NgNotes] Andrew Ng. CS 229 Lecture Notes on Support Vector Machines. [Notes]
[PaYa10] Sinno Jialin Pan and Qiang Yang. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering 22(10): 1345-1359, 2010. [Link]
[SeEl10] Giovanni Seni and John F. Elder. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Morgan & Claypool Publishers, 2010.
[SPY10] Jimeng Sun, Spiros Papadimitriou, and Rong Yan. Large-scale data mining: MapReduce and beyond. Tutorial at the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Washington, DC, 2010. [Video]
[Zhu08] Xiaojin Zhu. Semi-supervised Learning Literature Survey. University of Wisconsin Madison, 2008. [Link]
[CBK09] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly Detection: A Survey. ACM Computing Surveys 41(3), Article 15, 2009. [Link]
[Netflix] Netflix Prize. [Link]