CSE 722: Selected Topics in Data Mining

Spring 2013

Basic Information
Overview

Data mining is the process of discovering new and insightful knowledge from large bodies of data. The amount of data in the world has been growing explosively, and nearly every industry is eager to extract actionable knowledge from it. There are great opportunities, as well as numerous research challenges, for data mining in social media analysis, medical domains, computer security, and many other fields. This seminar will provide an overview of state-of-the-art data mining techniques that arise in real applications. We will cover advanced techniques and algorithms for data mining as well as emerging data mining applications. This course will be highly beneficial to students whose research interests are in data mining, machine learning, bioinformatics, databases, information retrieval, or artificial intelligence, and to those who need to apply data mining in any application.

Prerequisites

Students must have taken at least one course in data mining, machine learning, pattern recognition, information retrieval, or another field related to data analytics.

Course Structure

We discuss two papers each week. The papers will be selected from recent publications and surveys from top conferences and journals in data mining, machine learning, or other relevant domains. Grading is S/U, and each student is required to present one or two papers in class. Students who register for 3 credits are required to complete a research project in data mining. (As stated in Prerequisites, you MUST have taken at least one course in a data mining related area.)

Course Topics and Schedule

This semester, we will focus on topics relevant to mining big data. Three features define big data: 1) Volume (i.e., unprecedentedly large data sets), 2) Velocity (i.e., evolving and streaming data), and 3) Variety (i.e., data from multiple sources). The topics we will discuss include integration of multi-source data, evolutionary pattern discovery from streaming data, and parallel processing algorithms for large-scale data mining.

Date | Papers | Presenters
Week 1 (January 16) | Introduction | Jing Gao
Week 2 (January 23) | Truth Discovery with Multiple Conflicting Information Providers on the Web; A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration | Yaliang Li; Qi Li
Week 3 (January 30) | Making Better Informed Trust Decisions with Generalized Fact-Finding; Content-driven Trust Propagation Framework | Feng Shen; Wei Chen
Week 4 (February 6) | Integrating Conflicting Data: The Role of Source Dependence; Truth Finding on the Deep Web: Is the Problem Solved? | Feng Chen; Vineet Aguiar
Week 5 (February 13) | The Multidimensional Wisdom of Crowds; Learning from the Wisdom of Crowds by Minimax Entropy | Wei Chen; Kemal Guner
Week 6 (February 20) | Learning from Crowds in the Presence of Schools of Thought; Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers | Qi Li; Jingyu Ji
Week 7 (February 27) | Some objects are more equal than others: measuring and predicting importance; Estimating Entity Importance via Counting Set Covers | Vineet Aguiar; Kang Li
Week 8 (March 6) | Efficient Elastic Burst Detection in Data Streams; Review Spam Detection via Temporal Pattern Discovery | Thomas Fusillo; Rohit Pandey
March 13 (Spring Break) | No Class | N/A
Week 9 (March 20) | Evolutionary Clustering; Analyzing Communities and Their Evolutions in Dynamic Social Networks | Jingyu Ji; Kang Li
Week 10 (March 27) | Streaming Graph Partitioning for Large Distributed Graphs; On Evolutionary Spectral Clustering | Thomas Fusillo; David Johnson
Week 11 (April 3) | Towards Evolutionary Nonnegative Matrix Factorization; A Sparsity-Inducing Formulation for Evolutionary Co-Clustering | Liang Ge; Nan Du
Week 12 (April 10) | Map-Reduce for Machine Learning on Multicore; PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce | Rohit Pandey; Hu Ding
Week 13 (April 17) | Fully Distributed EM for Very Large Datasets; Parallel Spectral Clustering in Distributed Systems | Yaliang Li; David Johnson
Week 14 (April 24) | Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud; PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations | Hu Ding; Kemal Guner