DaveWentzel.com            All Things Data

February 2014

Querying NoSQL with Hive and Pig

This is my next post in my evaluation of NoSQL solutions.  I want to cover Hive and Pig today, which are popular querying tools.  First, it is important to understand that with NoSQL solutions you need to have specific, predefined ways of analyzing and accessing your data.  Ad hoc querying and "advanced" query expressions that you would find in SQL (UNION, INTERSECT, etc.) are generally not a high priority in the big data world and are not implemented very well in the current solutions.  We'll get there some day, but for now the space is evolving so quickly that the maturity just isn't there yet.  Heck, even SQL Server only began supporting INTERSECT as of 2005.  The current query tools, although lacking the "advanced" features, are very good at what they are designed to do, which is managing large data sets.  

Hive...is built on Hadoop...get it?  And Pig rides on top of Hadoop too.  

During my evaluation I spent most of my time working with Hive because I found the ramp-up time and tooling much better than Pig's.  Hive is very SQL-like and rides on top of Hadoop, so it is geared to abstracting away the complexities of querying large, distributed data sets.  Under the covers Hive uses HDFS/Hadoop/MapReduce but, again, abstracts away the technical implementation of data retrieval, just as SQL does.  

But, if you remember from my "MapReduce for the RDBMS Guy" post, MapReduce works as a job scheduling system, coordinating activities across the data nodes.  This has a substantial effect on query response times.  Therefore, Hive is not really meant for real-time ad hoc querying.  There's lots of latency.  Lots.  Even small queries on dev-sized systems can be ORDERS of magnitude slower than a similar query on an RDBMS (but you really shouldn't compare the two this way).  

To further exacerbate the perceived performance problem, there is no query caching in Hive.  Repeat queries are re-submitted to MapReduce.  

So, if the performance is so bad, why does everyone expound on the virtues of these solutions?  As data sets get bigger, the overhead of Hive is dwarfed by the scale-out efficiencies of Hadoop.  Think of this as the equivalent of table scans in SQL Server.  Generally we all hate table scans and instead try to find a way to do index seeks.  But eventually we all hit a query tuning moment when we realize that a table scan, sometimes, is really better than billions of seeks.  Remember that Hive is optimized for BigData and batch processing, so touching every row is optimal.  


The HiveQL syntax is very close to ANSI SQL.  Here's an example:
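A representative query (the table and column names here are hypothetical, not from a real cluster) reads almost exactly like the SQL you'd write against a relational table:

```sql
-- HiveQL: reads just like ANSI SQL.
-- page_views is a hypothetical Hive table.
SELECT   state, COUNT(*) AS view_cnt
FROM     page_views
WHERE    view_date >= '2014-01-01'
GROUP BY state
ORDER BY view_cnt DESC;
```

Under the covers this compiles to one or more MapReduce jobs, but you never have to see them.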

HiveQL even supports joins and can generate EXPLAIN plans (query plans).  
Hmmm...seems like NoSQL isn't so not-SQL after all.  

Prunes Analysis vs Fletcher's Castoria

Did you ever have somebody ask you:  

  • At what fragmentation level should I REORG vs REBUILD?
  • How many indexes should I have on my table?  
  • How often should I update my statistics?
  • Should I change my oil every 3000 miles or 5000?
  • What is the correct federal funds rate to ensure full employment and low inflation?  

I call these "Prunes Questions".  And I call the process of arriving at a satisfactory answer to these questions "Prunes Analysis".  I named this process after the famous Fletcher's Castoria commercial from the early 1960s.  

Just like the commercial says...The problem with prunes is...if I'm constipated, is 3 enough?  Is 6 too many?  

This question haunted housewives and moms for years until the advent of Fletcher's Castoria in the mid 1800's.  Generally, the best laxative at the time was a good dose of prunes.  But how many do you eat?  If you don't eat enough, you're still constipated.  If you eat too many you'll have diarrhea.  

I know, lovely thought.  Fletcher's was supposed to eliminate this enigma.  You just give your kid the dose on the bottle and forget about it.  But of course, even with Castoria you might need to take a second and third dose.  

Where am I going with this?  People, especially IT people, like every question to have a firm (pun intended?) answer.  Given a choice between some prunes and a proper dose of Castoria, most people will take the Castoria every time.  People think of Castoria as the "known" and the prunes as the "unknown".  Known is viewed as being better.  

But not every problem has one "known" answer.  Sometimes "it depends" is a valid answer.  The questions I posited at the start of this post are examples of "prunes" questions.  Every person may have a slightly different answer, and who is to say who is right and who is wrong?  The answers are often difficult to prove.  In every case the "it depends" refers to all of the variables that cannot be held constant.  

But most people hate being told "it depends" as the answer to their query.  They view the tergiversator (ambiguous respondent) as lacking knowledge or being standoffish.  But that's not always the case.  Answering "it depends" can be especially perilous if you are a consultant.  Clients like to know that their consultants have all of the answers.  

So, unless I'm trying to be purposely equivocal, my stock answer to these types of questions is, "Is 3 enough?  Is 6 too many?"  This answer makes the questioner stop and think.  I'll then give my answer/opinion, given the knowns, without having to say "it depends."  It's usually good for a chuckle too, especially if the questioner is a bit older.  


HBase is another NoSQL solution I evaluated for a recent project to replace some SQL Servers.  HBase is a close cousin of Hadoop, modeled after Google's BigTable.  It is a distributed, column-oriented data storage system.  Hadoop is not great at random reads and writes on HDFS; HBase is a huge improvement here.  HDFS and MapReduce are great at batch processing operations over huge datasets, but what if you only need an individual record?  HBase is the tool to use.  

The point of these blog posts is to give the relational data architect a cursory understanding of various NoSQL offerings.  So how do HBase and other column-oriented databases compare to RDBMSs?  In HBase the table schemas closely map to the physical implementation, whereas this is not always the case for a relational logical/physical model.  So, where the RDBMS abstracts the storage and retrieval of data for you, with a column-oriented system the burden of this activity is the programmer's problem.  

But that's kinda the whole point of NoSQL, isn't it?  For certain problems an RDBMS may not be a good fit.  Two examples...read/write concurrency and huge dataset sizes.  What I've seen in these cases is that the data professional bends or breaks the logical modeling rules of the relational model to overcome the physical limitations of the RDBMS.  In other words, denormalizing.  HBase is one possible way to avoid doing that by substituting a different storage engine.  Namely we are swapping a row-oriented storage mechanism for a column-oriented one.  


Obnoxious SSIS Designer Layout Issue

I don't want to bloviate too much, but there are some annoyances with SSIS, most of them well documented on the blogosphere.  My biggest annoyance is that the mouse is required for EVERYTHING.  Want to do a task conditionally?  That would be an IF branch in any programming language, but because SSIS is so visual it requires lots of mouse clicks.  

MongoDB and CouchDB

This is my next post in my evaluation of NoSQL solutions.  MongoDB is a document store that can persist arbitrary collections of data as long as it can be represented using a JSON-like object hierarchy.  Mongo comes from the word "humongous", clearly indicating its intended purpose is for BigData.
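For example, a single customer document (the fields here are invented for illustration) can carry nested sub-documents and arrays in one record, with no fixed schema shared by other documents in the collection:

```json
{
  "_id": "cust1001",
  "name": "Acme Widgets",
  "state": "PA",
  "contacts": [
    { "type": "billing", "email": "ap@acme.example" },
    { "type": "support", "email": "help@acme.example" }
  ]
}
```

In an RDBMS this would likely be a customers table plus a child contacts table; in Mongo it is one document.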

Parallel Data Warehouse as a NoSQL Alternative

Another post in my NoSQL series...this one on Microsoft's Parallel Data Warehouse...PDW for short.  This is an installed appliance delivered right to your door with everything completely set up for you.  

Buffer Pool Extensions in SQL 2014

I've been very busy lately but one of my tasks is to evaluate migrating to SQL 2014.  After poking around some of the new features I stumbled across Buffer Pool Extensions.  If your db server is memory-constrained or you have an IO-intensive workload, and you can afford a cheap SSD drive, you really should research this feature.  Essentially, you can extend your buffer pool onto nonvolatile storage without buying pricey RAM.  Technically you could use rotational media and not even use an SSD, but I can't imagine you would get much of a performance boost.  

How Does It Work?

When your buffer pool is too small for your working set it is constantly performing random reads, plus random writes whenever a CHECKPOINT is issued.  Each of these small IO operations competes for your rotational media's attention.  Latency goes up, throughput goes down.  It's not uncommon on large systems to see CHECKPOINTs take a very long time and the Page Life Expectancies (PLEs) drop radically.  

One of the beauties of SSDs is that random and sequential IO have nearly identical performance characteristics, just like volatile RAM.  Buffer Pool Extensions (we'll call them BPE for the rest of this post) essentially become a middleman between the RAM and the rotational media.  It would be nice if the small, random writes of the dirtied buffer pool pages went to the SSD first and from there were written to the slower media without impacting the buffer pool.  Unfortunately, that wouldn't be a very safe thing to do.  Instead, pages that are getting ready to be aged out of the buffer cache...we'll call them warm pages...are moved to the BPE area, assuming the pages are "clean".  

Mechanics of Setting It Up

  • This can be enabled, disabled, and altered on-the-fly without service interruption, although doing so obviously has performance implications.  
  • You can query the current state of buffer pool extensions by using the DMV sys.dm_os_buffer_pool_extension_configuration.  
  • Ensure your SSD is formatted with a drive letter.  
  • MS's recommendation is for a 16:1 ratio of BPE size to MAX SERVER MEMORY.  
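Enabling it takes one statement.  A minimal sketch (the file path and sizes below are just examples; pick values appropriate for your SSD and MAX SERVER MEMORY):

```sql
-- Enable BPE on an SSD volume; path and size are examples only.
ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION ON
    (FILENAME = 'E:\SSDCACHE\ExtensionFile.BPE', SIZE = 50 GB);

-- Verify the current state:
SELECT path, state_description, current_size_in_kb
FROM sys.dm_os_buffer_pool_extension_configuration;

-- Turn it off again (drops the extension file):
ALTER SERVER CONFIGURATION SET BUFFER POOL EXTENSION OFF;
```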

Performance Implications

I work on an application where everyone complains about the PLEs dropping to under 30 seconds when we do certain operations.  I've written about this in the past...IMHO the PLE is a symptom to be monitored, but it doesn't help you determine the root cause of any performance problem I've ever seen.  At best I correlate blocking and wait statistics during periods of "bad PLEs" to determine exactly what activity was occurring, and then determine how to make it less IO intensive.  

I don't have a production load available on my SQL 2014 test box, and I'm not exactly sure how to test BPEs accurately.  Having said that, in my test environment it does appear as though PLEs are less volatile and more stable.  Elapsed checkpoint time is also better.  It will be interesting to see real perf numbers from some of the SQL experts on the blogosphere.  

This is an awesome feature.  

MapReduce for the RDBMS Guy

MapReduce is a parallel programming framework for executing jobs against large, partitioned data sets.  MapReduce is the framework Google uses to query multiple nodes of a partition-tolerant datastore.  Google owns the patent on the MapReduce framework, but the concepts are freely shared and there are a number of open-source implementations.  

MapReduce's concepts come from the functional programming world.  A map function applies an operation to each element in a list, generally returning another list.  It takes a key-value pair and emits another key-value pair.  Example...the function "plus-one" applied to the list [0,1,2,3,4] would yield a new list, [1,2,3,4,5].  The original list's data is immutable (never changed) and therefore I need not be concerned with concurrency and locking.  So, multiple map threads could work on this list if the list were partitioned across multiple nodes.  The map function runs on each subset of a collection local to a distributed node.  The map operation on any one node is completely independent of the same map operation running on another node.  This isolation is how you get cheap, effective parallel processing.  And if any node fails the map function can be restarted on a redundant node.  

Functional programming also has the concept of a reduce function, otherwise known as a "fold" function, though you sometimes hear them called "accumulate" or "inject" functions.  A fold function applies a calculation to the key-value pairs and produces a single result.  Almost like the SUM or AVG aggregate functions in RDBMS-land.  A fold (or reduce) function applied to the list [1,2,3,4,5] with addition would yield a result of 15.  
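These two concepts can be sketched in a few lines of Python using the built-in map and functools.reduce (this is just the functional-programming idea, not Hadoop itself):

```python
from functools import reduce

# The original list is immutable from the framework's point of view;
# map produces a new list rather than changing this one in place.
original = [0, 1, 2, 3, 4]

# Map: apply "plus-one" to every element, yielding a new list.
mapped = list(map(lambda x: x + 1, original))      # [1, 2, 3, 4, 5]

# Reduce (fold): collapse the mapped list to a single result.
total = reduce(lambda acc, x: acc + x, mapped, 0)  # 15

print(mapped, total)
```

Because the map step has no shared mutable state, each element (or each partition of the list) could be processed on a different node with no locking.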

Map and Reduce functions are used together to process lists of data.  The maps may run a simple calculation on many nodes of the datastore.  The result may be piped to the reducer that may run on a single master node that generates the query's answer.  The map generally must run before the reduce, but the underlying code generally runs computations in parallel.  

Maps and reduces can be thought of as algorithms.  The maps are the relationships and hierarchies among the data.  It is critical to performance that the maps are designed properly.  A map will generally "scan" the data, much like an RDBMS index scan.  Maps will generally not seek.  

A map may pass its data to another map; there can be more than one map function prior to the reduce being called.  Similarly, a reduce function can be the target of another reducer.  

When your querying requirements get more complex it is generally not a good idea to make your map and reduce jobs more complex.  Instead you should add additional mappers and reducers.  So, add more jobs versus making the existing jobs more complex.  

A Better Example

Think of a partitioned datastore of Customers on 64 nodes.  We want to know how many customers reside in Pennsylvania.  A Java Jock working on a traditional RDBMS may do something stupid like bring back all customers from each node/database and then look for the "Pennsylvanias" on the client tier.  That's wildly inefficient.  With MapReduce the requester will issue a "map" to each node that says, "Please give me a count of your node's customers in PA."  Each node sends the data back to the requester, which "reduces" the responses by saying, "Now give me a sum of all nodes' counts."  That's actually kinda efficient.  
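Here is a toy Python simulation of that example (the node data and function names are invented for illustration; in real Hadoop the map function would run on separate machines, not in a local loop):

```python
# Toy simulation of the partitioned customer count.  Each "node" holds
# a slice of the customer data; here we use 3 nodes instead of 64.
nodes = [
    [{"name": "Acme",    "state": "PA"}, {"name": "Beta",  "state": "NY"}],
    [{"name": "Gamma",   "state": "PA"}, {"name": "Delta", "state": "PA"}],
    [{"name": "Epsilon", "state": "OH"}],
]

def map_pa_count(customers):
    # Runs locally on each node: count only that node's PA customers.
    return sum(1 for c in customers if c["state"] == "PA")

def reduce_counts(counts):
    # Runs on the requester: sum the per-node counts.
    return sum(counts)

per_node = [map_pa_count(n) for n in nodes]  # one small number per node
total_pa = reduce_counts(per_node)           # 3
print(total_pa)
```

Note that only one small integer per node crosses the network, not the customer rows themselves.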

Is MapReduce that much different than what we do in RDBMSs today?

In an RDBMS the paradigm is commonly "pass the data to the algorithm" (i.e., send the customers table to the JVM to be filtered there for the customers from PA).  But in MapReduce the paradigm is reversed..."pass the algorithm to the data".  Yeah, yeah, this is really flimsy, but this is how most people think about it.  I tend to think of the map as the equivalent of a stored procedure in the RDBMS.  And if you remember that we are running the MapReduce on many, many nodes, then it rings a little more true.  
I think of reduce functions as something akin to an aggregation operation.  Aggregation reduces the number of rows in an RDBMS and a reducer does the same thing.  It reduces the amount of data travelling to the next step.  

Some MapReduce implementations have a "link" function as well, which is equivalent to a JOIN condition in an RDBMS.  

Microsoft Daytona

MS has a map/reduce framework for Azure called Daytona.