jump to navigation

Big Data in Action for those familiar with relational databases and SQL May 12, 2015

Posted by Mich Talebzadeh in Big Data.

I thought it is best to start giving a practical introduction to Big Data Hadoop with tools that  SQL developers, RDBMS DBAs (Oracle, Sybase, Microsoft) are familiar with. As I have already explained here Hadoop has two main components; namely 1) MapReduce Engine and 2) Hadoop Distributed File System (HDFS). Now this does not mean that you can run a SQL against Hadoop without using some kind of API that allows SQL to be understood by Hadoop. OK. So if I know how to create a table in RDBMS. how to interrogate data in that table will I be able to do the same without using any language except my dear SQL? Well the answer is yes, by and large.

Introducing Apache Hive, your gateway to Hadoop via SQL

A while back a tool was developed called Hive which allowed SQL developers to access data in Hadoop. The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop, it provides:

-Tools to enable easy data extract/transform/load (ETL)

- A mechanism to impose structure on a variety of data formats

- Access to files stored either directly in Apache HDFS or in other data storage systems
  such as Apache HBase

- Query execution via Apache Hadoop MapReduce, Apache Tez or Apache Spark frameworks.

Hive has gone a long way since its incarnation. It supports a SQL dialect close to ANSI SQL 92 called Hive Query Language  (hql). You can create a table in Hive much like any table in Sybase or Oracle. You can partition a table in Hive, similar to range partitioning in RDBMS. You can create hash partition using a feature called clustering into n number of buckets and so forth. As of now (Hive 0.14) indexes are not supported in Hive. So we don’t have indexes and there is an optimizer in Hive but don’t confuse it with the traditional Oracle or Sybase optimizer. So if we do not have indexes and everything is going to end up as a table scan what use is that? That is a valid question. However, remember we are talking about Big Data and data warehouse. Yes a database with tables with millions of rows. This is not your usual OLTP database with point queries. You don’t go around running a query to select a handful of rows! Hive is designed around the assumption that you will be doing scans of significant amounts of data, as are most data warehousing type solutions.  Simply put it does not have the right tools to handle efficient lookup of single rows or small ranges of rows.  This means that Hive is not a suitable candidate for transactional or front office applications. Think about it. A typical RDBMS has 8 K page or block size. On the assumption that each block in Hadoop is 128 MB (default) a 1 GB data will take around 8 blocks in Hadoop. Our say 10 GB table will fit just in 80 blocks in Hadoop. So which one is more efficient for transactional work? Taking few rows out of an Oracle table or the same number of rows from Hadoop with Hive? Doing OLTP work in Hive is like cracking a nut with a sledgehammer. Hive is great for massive transformations needed in ETL type processing and full data set analytics. Hive is improving in terms of concurrency and latency however it has a fair bit to go  to beat commercial Massive Parallel Processing (MPP) solutions in terms of performance and stability. The key advantages of Hive are storage economics and its flexibility (schema on read). To be continued ..


No comments yet — be the first.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: