CSE 601: Data Mining and Bioinformatics
Fall 2013
Basic Information
Overview
This course covers fundamental data mining techniques: data warehousing, frequent pattern mining, clustering, classification, anomaly detection, and feature selection. To show how these techniques apply across domains, the course focuses on the design of bioinformatics software systems and discusses applications of data warehousing and data mining in biological and biomedical fields. Lectures will examine a range of software systems to give students a comprehensive understanding of the bioinformatics field. Projects will be based on these applications.
Textbooks
- Data Mining: Concepts and Techniques, 3rd ed. Jiawei Han and Micheline Kamber, ISBN-13: 978-1-55860-901-3, Morgan Kaufmann Publishers.
- Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Addison Wesley.
- Data Warehousing. Paulraj Ponniah. John Wiley & Sons, Inc.
References
- Bioinformatics: Managing Scientific Data. Zoe Lacroix and Terence Critchlow. 2003. Morgan Kaufmann Publishers.
- Advanced Analysis of Gene Expression Microarray Data. Aidong Zhang. ISBN 981-256-645-7. World Scientific Publishing Co.
Grading Policy
Grades will be computed based on the following factors (subject to change):
- Class participation -- 5%
- Quizzes -- 25%
- Projects (3) -- 45%
- Homework (3-4) -- 25%
Course Topics and Schedule
The lecture slides were developed based on materials from several sources. Please see copyright notice.
Date | Topic | Assignment | Readings
August 27 | Introduction | N/A | N/A
August 29 | Data Warehouse | N/A | Chapters 4&5 (Han), Ponniah
September 3 | Data Warehouse | N/A | Chapters 4&5 (Han), Ponniah
September 10 | Homework 1 Presentation | Homework 1 due | N/A
September 12 | Data Warehouse | N/A | Chapters 4&5 (Han), Ponniah
September 17 | Data Warehouse | N/A | Chapters 4&5 (Han), Ponniah
September 19 | Association Rule | N/A | Chapters 6&7 (Han), Chapters 6&7 (Tan)
September 24 | Clustering Basics | N/A | Chapters 8&9 (Tan), Chapters 10&11 (Han), [Jain10]
September 26 | Clustering Basics; Partitional Clustering | N/A | Chapters 8&9 (Tan), Chapters 10&11 (Han), [Jain10]
October 1 | Partitional Clustering; Hierarchical Clustering | N/A | Chapters 8&9 (Tan), Chapters 10&11 (Han)
October 3 | Hierarchical Clustering; Density-based Clustering | Homework 2 due | Chapters 8&9 (Tan), Chapters 10&11 (Han)
October 8 | Mixture Model; Spectral Clustering | Project 2 out | [DoBa08], [Luxburg07], [ShMa00]
October 10 | Project 1 Demo | Project 1 due | N/A
October 15 | Spectral Clustering; MapReduce | N/A | [Luxburg07], [ShMa00], [Lin10], [SPY10]
October 17 | MapReduce | N/A | [Lin10], [SPY10]
October 22 | Principal Component Analysis | N/A | [Smith02]
October 24 | Clustering: Other Topics | N/A | Chapters 10&11 (Han)
October 29 | Project 2 Demo | Project 2 due | N/A
October 31 | Network Mining | N/A |
November 5 | Network Mining | Homework 3 out |
November 7 | Classification: Basics | N/A | Chapters 4&5 (Tan), Chapters 8&9 (Han)
November 12 | Classification: Basics; Classification: Methods | N/A | Chapters 4&5 (Tan), Chapters 8&9 (Han)
November 14 | Classification: Basics; Classification: Methods | N/A | Chapters 4&5 (Tan), Chapters 8&9 (Han), [Mitchell05], [NgNotes]
November 19 | Classification: Methods | Project 3 out | Chapters 4&5 (Tan), Chapters 8&9 (Han), [SeEl10]
November 21 | Classification: Methods; Classification: Advanced Topics | N/A | [SeEl10], [GFJ10], [Zhu08], [PaYa10]
November 26 | Classification: Advanced Topics | N/A | [GFJ10], [Zhu08], [PaYa10]
December 3 | Anomaly Detection | N/A | Chapter 10 (Tan), Chapter 12 (Han), [CBK09]
December 5 | Anomaly Detection | N/A | [GGA+13]
Homework & Projects
Homework 1: Schema Design for Data Warehouse: Due September 10.
Project 1: Data Warehouse/OLAP System: Due October 10.
Homework 2: Mining Association Rules from Gene Expression Data: Due October 3.
Homework 3: Clustering Analysis for Complex Networks: Due November 21 before class.
Supplementary Materials
[Jain10] Anil K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31(8): 651-666, 2010. [Paper]
[Luxburg07] Ulrike von Luxburg. A Tutorial on Spectral Clustering. Statistics and Computing 17(4), 2007. [Paper]
[ShMa00] Jianbo Shi and Jitendra Malik. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 2000. [Paper]
[DoBa08] Chuong B. Do and Serafim Batzoglou. What is the expectation maximization algorithm? Nature Biotechnology 26(8): 897-899, 2008. [Paper]
[Lin10] Jimmy Lin. Data-Intensive Information Processing Applications. University of Maryland, 2010. [Link]
[Mitchell05] Tom Mitchell. Machine Learning (sample chapter on Naive Bayes and Logistic Regression), 2005. [Chapter]
[NgNotes] Andrew Ng. CS 229 Lecture Notes on Support Vector Machines. [Notes]
[SeEl10] Giovanni Seni and John F. Elder. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Morgan & Claypool Publishers, 2010.
[SPY10] Jimeng Sun, Spiros Papadimitriou, and Rong Yan. Large-scale data mining: MapReduce and beyond. Tutorial at the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Washington, DC, 2010. [Video]
[Smith02] Lindsay I. Smith. A Tutorial on Principal Components Analysis, 2002. [Paper]
[GFJ10] Jing Gao, Wei Fan, and Jiawei Han. On the Power of Ensemble: Supervised and Unsupervised Methods Reconciled. Tutorial at the SIAM Data Mining Conference (SDM), Columbus, OH, 2010. [Link]
[Zhu08] Xiaojin Zhu. Semi-supervised Learning Literature Survey. University of Wisconsin Madison, 2008. [Link]
[PaYa10] Sinno Jialin Pan and Qiang Yang. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering 22(10): 1345-1359, 2010. [Link]
[CBK09] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly Detection: A Survey. ACM Computing Surveys 41(3), Article 15, 2009. [Link]
[GGA+13] Manish Gupta, Jing Gao, Charu Aggarwal, and Jiawei Han. Outlier Detection for Temporal Data. Tutorial at the ACM International Conference on Information and Knowledge Management (CIKM), San Francisco, CA, 2013. [Link]