CSE 722 Selected Topics in Data Mining

Spring 2012

Basic Information

Instructor: Jing Gao (jing@buffalo.edu)
Time: Monday 4-6pm
Location: 113A Davis Hall
Office Hours: Thursday 10am-12pm
Office: 350 Davis Hall

Overview

Data Mining is the process of discovering new and insightful knowledge from large bodies of data. The amount of data in our world has been exploding, and nearly every industry is desperate to infer actionable knowledge from data. There are great opportunities as well as numerous research challenges for data mining in social media analysis, medical domains, computer security and many other fields. This seminar will provide an overview of the state-of-the-art data mining techniques that arise in real applications. We will cover advanced techniques and algorithms for data mining as well as emerging data mining applications. This course will be highly beneficial to students whose research interests are in data mining, machine learning, bioinformatics, databases, information retrieval, artificial intelligence, and also to those who may need to apply data mining to any application.

Announcements

February 6: The office hours on February 9 have been moved to February 14, 10am-12pm.

February 6: Class location is temporarily changed to 338A on February 20 and March 5.

February 6: Suggested project topics have been uploaded. The project proposal is due February 20.

January 30: Course schedule is updated with presenter and reviewer names.

January 24: The link for registering papers to present and to review has been sent by email. If you did not receive the email, contact the instructor.

Course Credits and Prerequisites

1 credit (default): Requires fundamental knowledge of data mining, machine learning, or statistical analysis.

3 credits (based on instructor approval): You must have taken courses and have research experience in data mining and machine learning.

Grading Policy and Course Structure

We discuss two papers each week. Grading is S/U, and student performance will be evaluated from the following perspectives:

Reading: Read the two papers listed on the class schedule before each class.

Presentation: Each student should present at least one paper in class (1 credit: 1 paper to present, 3 credits: 2 papers to present). The presentation of each paper is approximately 30 minutes.

Reviews: Each student should be a reviewer for two papers selected from the presented papers excluding the paper he/she presents. The reviewers should prepare for questions regarding the paper, take notes in class, and write a half-page review for the paper based on notes and class discussion.

In-class discussion: Students are expected to attend every class, actively participate in discussions and make comments. Discussion on each paper takes approximately 10 minutes.

Course project (only for 3 credits): Students taking three credits need to complete a course project related to data mining research. A project proposal is due on February 20. Before that, you should come to my office during office hours or make an appointment with me to discuss the topic of your project. The project presentation will be on April 30 and the final report is due on May 7.

Course Topics and Schedule

Instructions on registering for the papers to present and to review have been sent out.

Date	Topic	Papers	Presenters	Reviewers
Week 1 (January 23)	Introduction and Overview	Slides	Jing	N/A
Week 2 (January 30)	Semi-supervised Clustering	[Basu et al. 02] [Bilenko et al. 04]	Gang Jing	N/A Gang
Week 3 (February 6)	Clustering Ensemble	[Strehl and Ghosh 03] [Fern and Brodley 04]	Jing Nan	Gang Vishrawas
Week 4 (February 13)	Semi-supervised Learning	[Zhou et al. 03] [Zhu et al. 03]	Hui Ying	Nan,Ran Shubhra,Suchismit
Week 5 (February 20)	Transfer Learning	[Zadrozny 04] [Zhuang et al. 10]	Shubhra Xiaoyi	Sagar,Suchismit Hui,Ran
Week 6 (February 27)	Multi-view Learning	[Blum and Mitchell 98] [Zhou et al. 07]	Xiaoyi Yiming	Nan,Zhiliang Shubhra,Ying
Week 7 (March 5)	Relational Learning	[Sen et al. 08] [Backstrom and Leskovec 11]	Suxin Liang	Juehui,Zhiliang Yi,Suxin
March 12	Spring Break (No Class)	N/A	N/A	N/A
Week 8 (March 19)	Pattern Summarization	[Yan et al. 05] [Wang and Parthasarathy 06]	Vishrawas Hui	Zhiliang,Juehui Xiaoyi,Suxin
Week 9 (March 26)	Discriminative Pattern Mining	[Cheng et al. 07] [Fang et al. 11]	Juehui Zhiliang	Yiming Yiming,Juehui
Week 10 (April 2)	Massive Data Analytics	[Chu et al. 06] [Cordeiro et al. 11]	Suxin Suchismit	Liang Yi,Xiaoyi
Week 11 (April 9)	Anomaly Detection	[Lakhina et al. 05] [Valko et al. 11]	Sagar Ran	Shubhra Gang
Week 12 (April 16)	Trustworthiness Analysis	[Yin et al. 07] [Galland et al. 10]	Vishrawas Nan	Ying Hui
Week 13 (April 23)	Spam Detection	[Jindal and Liu 08] [Lee et al. 10]	Yi Yi	Liang Vishrawas,Sagar
Week 14 (April 30)	Project Presentation	N/A	Students	N/A

Reading List

[Backstrom and Leskovec 11] Lars Backstrom and Jure Leskovec. Supervised Random Walks: Predicting and Recommending Links in Social Networks. In Proc. of International Conference on Web Search and Web Data Mining (WSDM'11), 2011. [Paper]

[Basu et al. 02] Sugato Basu, Arindam Banerjee, and Raymond Mooney. Semi-supervised Clustering by Seeding. In Proc. of International Conference on Machine Learning (ICML'02), 2002. [Paper]

[Bilenko et al. 04] Mikhail Bilenko, Sugato Basu, and Raymond J. Mooney. Integrating Constraints and Metric Learning in Semi-Supervised Clustering. In Proc. of International Conference on Machine Learning (ICML'04), 2004. [Paper]

[Blum and Mitchell 98] Avrim Blum and Tom Mitchell. Combining Labeled and Unlabeled Data with Co-Training. In Proc. of Annual Conference on Computational Learning Theory (COLT'98), 1998. [Paper]

[Cheng et al. 07] Hong Cheng, Xifeng Yan, Jiawei Han and Chih-Wei Hsu. Discriminative Frequent Pattern Analysis for Effective Classification. In Proc. of IEEE International Conference on Data Engineering (ICDE'07), 2007. [Paper]

[Chu et al. 06] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Ng, and Kunle Olukotun. Map-Reduce for Machine Learning on Multicore. Advances in Neural Information Processing Systems (NIPS'06), 2006. [Paper]

[Cordeiro et al. 11] Robson L.F. Cordeiro, Caetano Traina Jr., Agma J.M. Traina, Julio López, U Kang, and Christos Faloutsos. Clustering Very Large Multi-dimensional Datasets with MapReduce. In Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'11), 2011. [Paper]

[Fang et al. 11] Gang Fang, Wen Wang, Benjamin Oatley, Brian Van Ness, Michael Steinbach, and Vipin Kumar. Characterizing Discriminative Patterns. Computing Research Repository (CoRR) abs/1102.4104, 2011. [Paper]

[Fern and Brodley 04] Xiaoli Zhang Fern and Carla E. Brodley. Solving Cluster Ensemble Problems by Bipartite Graph Partitioning. In Proc. of International Conference on Machine Learning (ICML'04), 2004. [Paper]

[Galland et al. 10] Alban Galland, Serge Abiteboul, Amélie Marian, and Pierre Senellart. Corroborating Information from Disagreeing Views. In Proc. of International Conference on Web Search and Web Data Mining (WSDM'10), 2010. [Paper]

[Jindal and Liu 08] Nitin Jindal and Bing Liu. Opinion Spam and Analysis. In Proc. of International Conference on Web Search and Web Data Mining (WSDM'08), 2008. [Paper]

[Lakhina et al. 05] Anukool Lakhina, Mark Crovella, and Christophe Diot. Mining Anomalies Using Traffic Feature Distributions. In Proc. of ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM'05), 2005. [Paper]

[Lee et al. 10] Kyumin Lee, James Caverlee, and Steve Webb. Uncovering Social Spammers: Social Honeypots + Machine Learning. In Proc. of International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'10), 2010. [Paper]

[Sen et al. 08] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective Classification in Network Data. AI Magazine, Vol. 29, No. 3, pages 93--106, 2008. [Paper]

[Strehl and Ghosh 03] Alexander Strehl and Joydeep Ghosh. Cluster Ensembles--A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research, Vol. 3, pages 583--617, 2003. [Paper]

[Valko et al. 11] Michal Valko, Branislav Kveton, Hamed Valizadegan, Gregory F. Cooper, and Milos Hauskrecht. Conditional Anomaly Detection with Soft Harmonic Functions. In Proc. of IEEE International Conference on Data Mining (ICDM'11), 2011. [Paper]

[Wang and Parthasarathy 06] Chao Wang and Srinivasan Parthasarathy. Summarizing Itemset Patterns Using Probabilistic Models. In Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), 2006. [Paper]

[Yan et al. 05] Xifeng Yan, Hong Cheng, Jiawei Han, and Dong Xin. Summarizing Itemset Patterns: A Profile-Based Approach. In Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05), 2005. [Paper]

[Yin et al. 07] Xiaoxin Yin, Jiawei Han, and Philip S. Yu. Truth Discovery with Multiple Conflicting Information Providers on the Web. In Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'07), 2007. [Paper]

[Zadrozny 04] Bianca Zadrozny. Learning and Evaluating Classifiers under Sample Selection Bias. In Proc. of International Conference on Machine Learning (ICML'04), 2004. [Paper]

[Zhou et al. 03] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with Local and Global Consistency. Advances in Neural Information Processing Systems (NIPS'03), 2003. [Paper]

[Zhou et al. 07] Zhi-Hua Zhou, De-Chuan Zhan, and Qiang Yang. Semi-Supervised Learning with Very Few Labeled Training Examples. In Proc. of National Conference on Artificial Intelligence (AAAI'07), 2007. [Paper]

[Zhu et al. 03] Xiaojin Zhu, John Lafferty, and Zoubin Ghahramani. Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. In Proc. of International Conference on Machine Learning (ICML'03), 2003. [Paper]

[Zhuang et al. 10] Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong Xiong, and Zhongzhi Shi. Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization. In Proc. of SIAM International Conference on Data Mining (SDM'10), 2010. [Paper]