Top Critiques for Week 3: Hive and HadoopDB

(in no particular order)

Hive is an extension of the Hadoop framework that lets developers define their MapReduce jobs as SQL queries. Hive integrates SQL semantics into Hadoop as follows: tables are stored (and partitioned) as plain HDFS files; SQL queries are compiled into a MapReduce plan (map: selection, projection; reduce: join, aggregation); and results are stored as HDFS files. The Hive system provides several benefits:

There are some drawbacks as well:

(credit: Anonymous)
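
The compilation scheme described in the critique above (map: selection, projection; reduce: aggregation) can be illustrated with a minimal Python sketch. It is not Hive's planner: the table, the query, and the function names are hypothetical, and the "shuffle" is simulated in memory rather than run on Hadoop. The query being mimicked is SELECT dept, SUM(salary) FROM emp WHERE salary > 50 GROUP BY dept.

```python
from collections import defaultdict

# Hypothetical table "emp" with rows of (name, dept, salary).
ROWS = [
    ("ann", "eng", 120),
    ("bob", "eng", 40),
    ("cat", "ops", 90),
    ("dan", "ops", 70),
]


def map_phase(rows):
    """Map side: selection (WHERE salary > 50) and projection (dept, salary)."""
    for _name, dept, salary in rows:
        if salary > 50:
            yield dept, salary


def shuffle(pairs):
    """Group mapper output by key, as the MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce side: aggregation (SUM per GROUP BY key)."""
    return {key: sum(values) for key, values in groups.items()}


def run_query(rows):
    return reduce_phase(shuffle(map_phase(rows)))


print(run_query(ROWS))
```

The filter and column pruning happen before any data is grouped, so only the rows and fields the query needs reach the reduce step.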


Abstract

HadoopDB is a new data analysis platform that combines two existing approaches: parallel DBMSs and MapReduce. MapReduce is scalable, flexible, and low-cost, but parallel DBMSs offer better performance and efficiency. The authors present the HadoopDB design as running MapReduce over relational DBMS instances, specifically PostgreSQL. HadoopDB extends Hive's HiveQL-to-MapReduce translator by pushing most of the query processing into each database instance (since certain query processing, for example joins, is more efficient in a DBMS).
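
The push-down idea can be sketched as follows. This is an illustration under stated assumptions, not HadoopDB's implementation: in-memory SQLite databases stand in for the per-node PostgreSQL instances, and the table name, columns, and function names are hypothetical. Each "node" runs the aggregation in SQL locally, and a reduce step only merges the partial results.

```python
import sqlite3
from collections import defaultdict

# SQL fragment pushed down to every node-local database (hypothetical schema).
PUSHED_DOWN_SQL = "SELECT url, SUM(revenue) FROM visits GROUP BY url"


def make_node(rows):
    """Create one node-local database holding a partition of the table.

    SQLite is an assumption standing in for a PostgreSQL instance.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, revenue REAL)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn


def map_on_node(conn):
    """Map task: let the local database do the grouping and summing."""
    return conn.execute(PUSHED_DOWN_SQL).fetchall()


def reduce_partials(partials):
    """Reduce task: merge the per-node partial sums into global totals."""
    totals = defaultdict(float)
    for rows in partials:
        for url, partial_sum in rows:
            totals[url] += partial_sum
    return dict(totals)


nodes = [
    make_node([("a.com", 1.0), ("b.com", 2.0)]),
    make_node([("a.com", 3.0)]),
]
totals = reduce_partials(map_on_node(node) for node in nodes)
print(totals)
```

Because each database returns one row per group rather than one row per visit, far less data crosses the network than if the raw rows were shipped to the reducers.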

Analysis

The main idea behind HadoopDB is simple and straightforward, considering the debate between the parallel database research community and the MapReduce community at the time.

This paper showed the feasibility of a hybrid system that takes the good features of both technologies, parallel DBMSs and MapReduce. Even though HadoopDB's performance was not comparable to Vertica's in many cases, there is still room for improvement, since Hive and Hadoop are improving and PostgreSQL is not a column-store DBMS.

Regarding fault tolerance, would it be possible to implement a parallel DBMS that recovers from a node failure without restarting the entire query?

As Vertica (a parallel DBMS) shows better performance than HadoopDB, I am wondering whether it is inherently impossible for a parallel DBMS to scale. The advantages of using Hadoop (MapReduce) are scalability and flexibility (handling unstructured data). But if we are dealing with structured data, as HadoopDB does, I think MapReduce loses one of its edges, because MapReduce-style flexibility is not required for structured data.

(credit: Anonymous)


Abstract

In this paper, the authors present an open-source data warehousing solution called Hive. Hive runs on Hadoop and makes processing large data sets on Hadoop much easier. Before Hive was introduced, such large data sets were processed only with the MapReduce programming model, which is very low level. Hive offers a data warehousing solution on top of this model with the help of metadata. It was developed by Facebook as their data processing on the Hadoop infrastructure became very complicated. They introduce a new query language called HiveQL, which is similar to SQL but with limited options. The paper explains the data model and query language with the help of a few HiveQL query examples. MapReduce scripts can also be included as part of a query. This section is followed by an explanation of how data is stored and retrieved using SerDe (serialization and deserialization) functions, and then by a presentation of the system architecture. The final section explains how Hive is used at Facebook.
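
The SerDe idea mentioned above can be illustrated with a minimal Python sketch. It is not Hive's Java SerDe interface: the schema and function names are hypothetical. The Ctrl-A field delimiter mirrors the default used by Hive's delimited text format.

```python
# Field delimiter: Ctrl-A, as in Hive's default delimited text format.
FIELD_DELIM = "\x01"

# Hypothetical table schema: column names with Python types for casting.
SCHEMA = [("user_id", int), ("page", str)]


def deserialize(line):
    """Turn one stored text line into a typed row (a dict keyed by column)."""
    parts = line.rstrip("\n").split(FIELD_DELIM)
    return {name: cast(raw) for (name, cast), raw in zip(SCHEMA, parts)}


def serialize(row):
    """Turn a typed row back into the stored text representation."""
    return FIELD_DELIM.join(str(row[name]) for name, _cast in SCHEMA)


row = deserialize("42\x01home\n")
print(row)
print(repr(serialize(row)))
```

Keeping the storage format behind a pair of functions like these is what lets a query engine read files it did not write: the query operators only ever see typed rows.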

Criticism

  1. This paper does not present enough experimental results or any comparative analysis to show how well Hive performs against similar existing solutions such as Pig and SCOPE. A comparison would have been interesting to see.
  2. Overall the paper is good, and it tries to reduce the complexity of running jobs over Hadoop and handling large data sets.

(credit: Sakthi Sundar Alagusundaram Ganesan)