Data-Intensive Computing Research and Education
Project partially funded by National Science Foundation Grant NSF-DUE-CCLI-0920335



Data-intensive computing has been receiving much attention as a collective solution to the data deluge brought about by tremendous advances in distributed systems and Internet-based computing. Innovative programming models such as MapReduce, together with the peta-scale distributed file systems that support them, have fundamentally changed approaches to large-scale data storage and processing. These data-intensive computing approaches are expected to have a profound impact on any application domain that deals with large-scale data, from healthcare delivery to military intelligence. A new forum, the big-data computing group, has been formed by a consortium of industrial stakeholders and agencies including NSF and CRA (Computing Research Association) to promote wider dissemination of big-data solutions and to transform mainstream applications. Given the omnipresent nature of large-scale data and its tremendous impact on a wide variety of application domains, it is imperative to ready our workforce to face the challenges in this area. This project aims to improve the big-data preparedness of a diverse STEM audience by defining a comprehensive framework for education and research in data-intensive computing.
Dr. Bina Ramamurthy is the principal investigator of this project, and Dr. John Benschoten and Dr. Vipin Chaudhary are the co-PIs.


A new certificate program in data-intensive computing has been approved by SUNY and is offered by the University.
Details are available on the undergraduate catalog page.
Invited Presentations
           
International Conference on Advances and Emerging Trends in Computing Technologies (ICAET 2010), Chennai, India
        6/23/2010: Women-in-Computing Conference: Data-intensive Computing
        6/24/2010: Cloud: The next generation computer
CUBRC and General Dynamics
        11/4/2010: Cloud Computing: Capabilities and Limitations
Monroe Community Library System (Rochester, NY)
        12/2/2010: Cloud Computing for Small Enterprises
WIPRO, Chennai, India
        6/28/2011: Cloud Computing: Concepts, Technologies, and Business Implications
Erie Community College
        12/5/2011: Computer Science Education Week: Data-intensive Computing on the Cloud: Concepts, Technology, and Applications
Niagara Community College
        2/7/2012: Data-intensive Computing: Concepts, Technologies, and Applications
Metropolitan State University Symposium on Big Data Science and Engineering
        10/19/2012: Adopting Big Data Across Undergraduate Curriculum (CCR run)
            
Grants related to Big Data

Participant in a grant awarded to CUBRC, Buffalo, NY (1 summer month + 1 graduate student)
Participant in a grant awarded to Industrial and Systems Engineering (1 academic month + 1 graduate student support)
Participant in two New York state-level (suny.edu) Innovative Instructional Technology Grants (nominal stipend)

Undergraduate Researchers:

Bich Vu: Working on Security Concepts
Andrew Small (Ingram Micro)
Xiang Lin (CSTEP student): Exploring Gutenberg: Text Analysis using MapReduce
Austin Miller (Honors Thesis): A Methodology for Transforming Common Algorithms to the MapReduce Framework (Google)
Regina May (CSTEP researcher): Disaster Recovery on the Cloud (M&T Bank)
Mohit Bansal (Honors Thesis): Extracting Information from Large-scale Data using Probabilistic Methods (J.P. Morgan); research presented at the Conference on Academic Excellence, April 2010, Buffalo, NY


Graduate Researchers:

Rikson Vareed: MasterSolver: Parallel Processing of Large-Scale Tree Graphs (billion-node tree and graph processing)
Ying Yang: Analysis of Twitter data (ongoing)
Suhani Gupta: Evaluation of HBase for large-scale data (ongoing)
Eric Nagler: Lucene Indexing Wikipedia Data using MapReduce (CUBRC)
Abhishek Agarwal: MOPS: A Modified Priority Scheduler for Improved Resource Utilization, Cluster 2010
Hingsik Kim: Pop!World: An Evolutionary Biology Tool (deployed on the Google App Engine)
Amol Agarwal: Hosting Applications on the Cloud

Laboratory Facility

An excellent facility for running Hadoop MapReduce jobs on a real cluster of hundreds of nodes is available at
the Center for Computational Research (CCR) at Buffalo. See CCR Buffalo.
A five-node Hadoop-based system, NEXOS, was built from old commodity machines as a student research project. (This prototype was used in a demo at CCSCNE 2009, with remote access to Buffalo from Plattsburgh, NY.)
We use the Amazon EC2 MapReduce workflow and the Google App Engine for the projects.
Students and users can also install and run a standalone two-node cluster on their laptop or desktop; a small HDFS smoke test for such an install is sketched below.
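As a quick sanity check of such a local install, here is a minimal Java sketch; the class name HdfsSmokeTest and the file path are illustrative and not part of the course material. It writes and reads a small file through the Hadoop FileSystem API, which resolves to the local file system on a standalone install and to HDFS on a pseudo-distributed or real cluster.

// Minimal sketch: write a line to a file and read it back through the Hadoop FileSystem API.
// Class name and path are illustrative.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSmokeTest {
  public static void main(String[] args) throws Exception {
    // Picks up the file system URI (fs.default.name / fs.defaultFS) from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hdfs-smoke-test.txt");   // illustrative path
    FSDataOutputStream out = fs.create(file, true);     // overwrite if the file already exists
    out.writeBytes("Hello, HDFS\n");
    out.close();

    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
    System.out.println(in.readLine());                  // should print: Hello, HDFS
    in.close();
  }
}

Compile the class against the Hadoop jars and run it with the hadoop command (for example, hadoop jar smoke-test.jar HdfsSmokeTest, where the jar name is again illustrative) so that the cluster configuration is picked up automatically.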

Getting started:
          
Hadoop Distributed File System (HDFS) and MapReduce (a minimal WordCount sketch is given below)
References and useful links
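To give a concrete sense of the MapReduce programming model, here is a minimal WordCount sketch in Java, assuming a Hadoop installation that provides the org.apache.hadoop.mapreduce API (on older 0.20.x releases, new Job(conf, "word count") would be used in place of Job.getInstance). The two command-line arguments are HDFS paths; the output directory must not already exist.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in each input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer for local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A typical submission would look like hadoop jar wordcount.jar WordCount input output, where the jar name and the input/output paths are illustrative.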

For more information, contact bina@buffalo.edu.