CSE 722 Selected Topics in Data Mining

Spring 2013

Basic Information

Instructor: Jing Gao (jing@buffalo.edu)
Time: 1-3pm Wednesday
Location: 113A Davis Hall
Office Hours: 1:30-3:30pm Tuesday
Office: 350 Davis Hall

Overview

Data Mining is the process of discovering new and insightful knowledge from large bodies of data. The amount of data in our world has been exploding, and nearly every industry is desperate to infer actionable knowledge from data. There are great opportunities as well as numerous research challenges for data mining in social media analysis, medical domains, computer security and many other fields. This seminar will provide an overview of the state-of-the-art data mining techniques that arise in real applications. We will cover advanced techniques and algorithms for data mining as well as emerging data mining applications. This course will be highly beneficial to students whose research interests are in data mining, machine learning, bioinformatics, databases, information retrieval, artificial intelligence, and also to those who may need to apply data mining to any application.

Prerequisites

Students must have taken at least one course in data mining, machine learning, pattern recognition, information retrieval, or another field related to data analytics.

Course Structure

We discuss two papers each week. The papers will be selected from recent publications and surveys in top conferences and journals in data mining, machine learning, or other relevant domains. Grading is S/U, and each student is required to present one or two papers in class. Students registered for 3 credits are also required to complete a research project in data mining. (As stated in Prerequisites, you MUST have taken at least one course in a data mining related area.)

Course Topics and Schedule

This semester, we will focus on topics relevant to mining big data. Three features define big data: 1) Volume (i.e., unprecedentedly large data sets), 2) Velocity (i.e., evolving and streaming data), and 3) Variety (i.e., data from multiple sources). The topics we will discuss include integration of multi-source data, evolutionary pattern discovery from streaming data, and parallel processing algorithms for large-scale data mining.

Date	Papers	Presenters
Week 1 (January 16)	Introduction	Jing Gao
Week 2 (January 23)	Truth Discovery with Multiple Conflicting Information Providers on the Web; A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration	Yaliang Li; Qi Li
Week 3 (January 30)	Making Better Informed Trust Decisions with Generalized Fact-Finding; Content-driven Trust Propagation Framework	Feng Shen; Wei Chen
Week 4 (February 6)	Integrating Conflicting Data: The Role of Source Dependence; Truth Finding on the Deep Web: Is the Problem Solved?	Feng Chen; Vineet Aguiar
Week 5 (February 13)	The Multidimensional Wisdom of Crowds; Learning from the Wisdom of Crowds by Minimax Entropy	Wei Chen; Kemal Guner
Week 6 (February 20)	Learning from Crowds in the Presence of Schools of Thought; Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers	Qi Li; Jingyu Ji
Week 7 (February 27)	Some objects are more equal than others: measuring and predicting importance; Estimating Entity Importance via Counting Set Covers	Vineet Aguiar; Kang Li
Week 8 (March 6)	Efficient Elastic Burst Detection in Data Streams; Review Spam Detection via Temporal Pattern Discovery	Thomas Fusillo; Rohit Pandey
March 13	Spring Break (no class)	N/A
Week 9 (March 20)	Evolutionary Clustering; Analyzing Communities and Their Evolutions in Dynamic Social Networks	Jingyu Ji; Kang Li
Week 10 (March 27)	Streaming Graph Partitioning for Large Distributed Graphs; On Evolutionary Spectral Clustering	Thomas Fusillo; David Johnson
Week 11 (April 3)	Towards Evolutionary Nonnegative Matrix Factorization; A Sparsity-Inducing Formulation for Evolutionary Co-Clustering	Liang Ge; Nan Du
Week 12 (April 10)	Map-Reduce for Machine Learning on Multicore; PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce	Rohit Pandey; Hu Ding
Week 13 (April 17)	Fully Distributed EM for Very Large Datasets; Parallel Spectral Clustering in Distributed Systems	Yaliang Li; David Johnson
Week 14 (April 24)	Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud; PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations	Hu Ding; Kemal Guner