CSE 601: Data Mining and Bioinformatics

Fall 2012

Basic Information
Overview

This course covers the fundamental techniques in data mining, including data warehousing, frequent pattern mining, clustering, classification, anomaly detection, and feature selection. To demonstrate how these techniques are applied across domains, we focus on the design of bioinformatics software systems and discuss applications of data warehousing and data mining in biological and biomedical fields. The class will discuss various software systems and provide insight that will help students gain a comprehensive understanding of the bioinformatics field. Projects are based on these applications.

Textbooks
  • Data Mining: Concepts and Techniques, 3rd ed. Jiawei Han, Micheline Kamber, and Jian Pei. ISBN-13: 978-0-12-381479-1. Morgan Kaufmann Publishers.

  • Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Addison Wesley.

  • Data Warehousing. Paulraj Ponniah. John Wiley & Sons, Inc.

References
  • Bioinformatics: Managing Scientific Data. Zoe Lacroix and Terence Critchlow. 2003. Morgan Kaufmann Publishers.

  • Advanced Analysis of Gene Expression Microarray Data. Aidong Zhang. ISBN 981-256-645-7. World Scientific Publishing Co.

Grading Policy

Grades will be computed based on the following factors (subject to change):
  • Class participation and presentation -- 15%

  • Quizzes -- 5%

  • Projects (3) -- 60%

  • Homework (3-4) -- 20%
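
As a rough illustration of how the weights above combine, here is a minimal sketch in Python. The weights are taken from the list above and may change; the function name, the dictionary keys, and the sample scores are hypothetical, and the sketch assumes each component has already been averaged to a single score on a 0-100 scale (how individual projects or homework are combined is not specified here).

```python
# Weights (in percent) from the grading policy above; subject to change.
WEIGHTS = {
    "participation_and_presentation": 15,
    "quizzes": 5,
    "projects": 60,
    "homework": 20,
}


def final_grade(scores):
    """Return the weighted course grade on a 0-100 scale.

    `scores` maps each component name in WEIGHTS to a score in [0, 100].
    This is an illustrative helper, not an official grading tool.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    # Weighted sum divided by the total weight (100 percent).
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


# Made-up sample scores for one student.
sample = {
    "participation_and_presentation": 90,
    "quizzes": 80,
    "projects": 85,
    "homework": 70,
}
print(final_grade(sample))
```

Because projects carry 60% of the weight, a change in the project score moves the final grade three times as much as the same change in the homework score.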

Course Topics and Schedule

The lecture slides were developed based on materials from several sources. Please see copyright notice.

| Date | Topic | Assignment | Readings |
|---|---|---|---|
| August 28 | Introduction and Overview | N/A | N/A |
| August 30 | Data Warehouse | N/A | Chapters 4&5 (Han), Ponniah |
| September 4 | Data Warehouse | Homework 1 out | Chapters 4&5 (Han), Ponniah |
| September 6 | Data Warehouse | N/A | Chapters 4&5 (Han), Ponniah |
| September 11 | Homework 1 Presentation | Homework 1 due | N/A |
| September 13 | Homework 1 Presentation | N/A | N/A |
| September 18 | Association Rule | Project 1 out | Chapters 6&7 (Han), Chapters 6&7 (Tan) |
| September 20 | Association Rule | Homework 2 out | Chapters 6&7 (Han), Chapters 6&7 (Tan) |
| September 25 | Clustering Basics | N/A | Chapters 8&9 (Tan), Chapters 10&11 (Han), [Jain10] |
| September 27 | Clustering Basics; Partitional Clustering | N/A | Chapters 8&9 (Tan), Chapters 10&11 (Han), [Jain10] |
| October 2 | Hierarchical Clustering; Density-based Clustering | N/A | Chapters 8&9 (Tan), Chapters 10&11 (Han) |
| October 4 | Mixture Model; Spectral Clustering | N/A | [Luxburg07], [ShMa00] |
| October 11 | Project 1 Demo | Project 1 due | N/A |
| October 16 | Spectral Clustering; Clustering Ensemble | Project 2 out | [Luxburg07], [ShMa00], [GFJ10], [GhAc11] |
| October 18 | MapReduce | N/A | [Lin10], [SPY10] |
| October 23 | Clustering: Other Topics | N/A | Chapters 10&11 (Han) |
| October 25 | Classification: Basics | N/A | Chapters 4&5 (Tan), Chapters 8&9 (Han) |
| October 30 | Classification: Basics; Classification: Methods | N/A | Chapters 4&5 (Tan), Chapters 8&9 (Han) |
| November 1 | Classification: Methods | N/A | Chapters 4&5 (Tan), Chapters 8&9 (Han), [Mitchell05], [NgNotes] |
| November 6 | Classification: Methods | N/A | Chapters 4&5 (Tan), Chapters 8&9 (Han), [SeEl10] |
| November 8 | Classification: Advanced Topics | N/A | [GFJ10], [Zhu08], [PaYa10] |
| November 13 | Network Mining | Project 2 due; Project 3 out | |
| November 15 | Network Mining | Homework 3 out | |
| November 20 | Anomaly Detection | N/A | Chapter 10 (Tan), Chapter 12 (Han), [CBK09] |
| November 27 | Anomaly Detection | N/A | Chapter 10 (Tan), Chapter 12 (Han), [CBK09] |
| November 29 | Data Mining Competitions | N/A | [KDDCUP] |
| December 4 | Netflix Competition | N/A | [Netflix] |
| December 6 | Research Overview | Homework 3 due | N/A |

Homework & Projects

Homework 1: Schema Design for Data Warehouse: Due September 11.
Project 1: Data Warehouse/OLAP System: Due October 11.
Homework 2: Mining Association Rules from Gene Expression Data: Due October 1.
Project 2: Clustering Algorithms: Code due November 12 and report due November 13 before class.
Homework 3: Clustering Analysis for Complex Networks: Due December 6 before class.
Project 3: Classification Algorithms: Code, results, and report due December 13 at 11am. [Supplementary Description]

Supplementary Materials

[Jain10] Anil K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31(8): 651-666, 2010. [Paper]
[Luxburg07] Ulrike von Luxburg. A Tutorial on Spectral Clustering. Statistics and Computing 17(4), 2007. [Paper]
[ShMa00] Jianbo Shi and Jitendra Malik. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 2000. [Paper]
[GFJ10] Jing Gao, Wei Fan, and Jiawei Han. On the Power of Ensemble: Supervised and Unsupervised Methods Reconciled. Tutorial at the SIAM Data Mining Conference (SDM), Columbus, OH, 2010. [Link]
[GhAc11] Joydeep Ghosh and Ayan Acharya. Cluster Ensembles. WIREs Data Mining and Knowledge Discovery 1(4), 2011. [Paper]
[Lin10] Jimmy Lin. Data-Intensive Information Processing Applications. University of Maryland, 2010. [Link]
[Mitchell05] Tom Mitchell. Machine Learning (sample chapter on Naive Bayes and Logistic Regression), 2005. [Chapter]
[NgNotes] Andrew Ng. CS 229 Lecture Notes on Support Vector Machines. [Notes]
[PaYa10] Sinno Jialin Pan and Qiang Yang. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering 22(10): 1345-1359, 2010. [Link]
[SeEl10] Giovanni Seni and John F. Elder. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Morgan & Claypool Publishers, 2010.
[SPY10] Jimeng Sun, Spiros Papadimitriou, and Rong Yan. Large-scale data mining: MapReduce and beyond. Tutorial at the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Washington, DC, 2010. [Video]
[Zhu08] Xiaojin Zhu. Semi-supervised Learning Literature Survey. University of Wisconsin-Madison, 2008. [Link]
[CBK09] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly Detection: A Survey. ACM Computing Surveys 41(3), Article 15, 2009. [Link]
[KDDCUP] ACM KDD Cup. [Link]
[Netflix] Netflix Prize. [Link]