Data-Intensive Computing Research and Education
Project partially funded by National Science Foundation Grant NSF-DUE-CCLI-0920335



Data-intensive computing has received much attention as a collective solution to the data deluge brought about by tremendous advances in distributed systems and Internet-based computing. Innovative programming models such as MapReduce, together with the peta-scale distributed file systems that support them, have fundamentally changed approaches to large-scale data storage and processing. These data-intensive computing approaches are expected to have a profound impact on any application domain that deals with large-scale data, from healthcare delivery to military intelligence. Given the omnipresent nature of large-scale data and its tremendous impact on a wide variety of application domains, it is imperative to ready our workforce to face the challenges in this area. This project aims to improve the big-data preparedness of a diverse STEM audience by defining a comprehensive framework for education and research in data-intensive computing.
Dr. Bina Ramamurthy is the director and principal investigator of this project.


CSE487/587 Data-Intensive Computing is a new course designed to address the big-data preparedness of our workforce.


CSE487/587 Course Description



Date
Topics
Lecture material
Demos/reading material
1/15
Data-intensive computing
What is it?

1/17
Two case studies: one from Fourth Paradigm text; another from Bioinformatics
DI:Climate
DI: BioInf

1/21
Introduction to MapReduce framework and Hadoop Ecosystem
MR Parallel Processing

1/24
Attend Distinguished Speaker's talk
John H. Reppy, The University of Chicago, Diderot: A Parallel Domain-Specific Language for Image Analysis And Visualization 3.30-4.30 PM
101 Davis
1/29
Hadoop infrastructure
HDFS Getting started with Hadoop Yahoo Hadoop
T. White's MR
Prelim Prj1
1/31
Introduction to CCR Hadoop Cluster
Hands-on tutorial; please bring your laptops
CCRIntro
HadoopHowTo
2/5
Inside MapReduce: programming in MapReduce
Dean & Ghemawat's paper;
Lin and Dyer (LD) Ch.1 and Ch.2
Ubuntu, Helios, MR Perspective
2/7
MapReduce Algorithm design
Ch.3 (LD)
Complete data aggregation by Tue 2/12
2/13
Continue with Ch.3: Best practices for MR algorithm
Design patterns for MR:
Pairs and stripes
Demo of co-occurrence using pairs and stripes
2/15
MR best practices continued


2/18
Exam review; let's apply the MR best practices to Text Retrieval
Exam review
Inverted Index Ch.4 (LD)
See Yahoo tutorial chapter 4
2/21
MR.Graph algorithms
GraphAlg
Prj1Addendum
2/28
Prj2 discussion; continue MR.Graph and MR.RevIndex
InvertedIndexMore
Prj2
3/5
Midterm exam


3/7
MR.Graphs (continued); MR.Classification
MR.Classification

3/9-16
Spring Break


3/19
Graph processing; PageRank
Graph.PageRank (updated with pagerank details)

3/21
Introduction to Amazon AWS; esp. security model
AWS

3/26
Hive
MR.Hive

3/28
Fundamentals of Security
Security

3/28
Declarative script for MR: Pig
Pig

4/2
Pig continued
Developing Pig Script. Demo on aemr.
4/4
Design of Pig
DP.MR
4/9
NoSQL Database: HBase
Pig; some old material
HbaseIntro

4/11
HBase (contd.)
HBase
4/16
HBase (contd.)
Web services
HBase on EMR
4/18
KTAC; review for exam
4/23
KTAC; review for exam
4/25
Exam 2