During my evaluation I spent most of my time working with Hive because I found its ramp-up time and tooling much better than Pig's. Hive is very SQL-like and rides on top of Hadoop, so it is geared toward abstracting away the complexities of querying large, distributed data sets. Under the covers Hive uses HDFS and MapReduce, but it hides the technical implementation of data retrieval, just like SQL does.
But, if you remember from my "[[MapReduce for the RDBMS Guy]]" post, MapReduce works as a job scheduling system, coordinating activities across the data nodes. This has a substantial effect on query response times, so Hive is not really meant for real-time ad hoc querying. There's lots of latency. Lots. Even small queries on dev-sized systems can be ORDERS of magnitude slower than a similar query on an RDBMS (but you really shouldn't compare the two this way).
To further exacerbate the perceived performance problem, there is no query caching in Hive. Repeat queries are re-submitted to MapReduce.
So, if the performance is so bad, why does everyone expound on the virtues of these solutions? As data sets get bigger, the fixed overhead of Hive is dwarfed by the scale-out efficiencies of Hadoop. Think of this as the equivalent of table scans in SQL Server. Generally we all hate table scans and instead try to find a way to do index seeks. But eventually we all hit a query tuning moment when we realize that a table scan is sometimes better than billions of seeks. Remember that Hive is optimized for BigData and batch processing, where touching every row is the norm.
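To make the "overhead gets dwarfed" argument concrete, here is a rough back-of-the-envelope model in Python. Every number in it is an assumption chosen for illustration (a 30-second job startup cost, one million rows scanned per second per node, a 20-node cluster), not a measurement from my evaluation, and the function names are hypothetical.

```python
def hive_seconds(rows, nodes, job_overhead_s=30.0, rows_per_node_per_s=1_000_000):
    """Toy model: fixed MapReduce job overhead plus a scan split across all nodes."""
    return job_overhead_s + rows / (rows_per_node_per_s * nodes)


def single_server_seconds(rows, rows_per_s=1_000_000):
    """Toy model: one server scans every row with no job startup cost."""
    return rows / rows_per_s


if __name__ == "__main__":
    nodes = 20  # assumed cluster size
    for rows in (10_000, 10_000_000, 10_000_000_000):
        hive = hive_seconds(rows, nodes)
        single = single_server_seconds(rows)
        print(f"{rows:>14,} rows  hive={hive:10.2f}s  single server={single:10.2f}s")
```

Under these assumptions the fixed overhead dominates small queries, which is why they look so slow, while at billions of rows the parallel scan wins by a wide margin. The crossover point depends entirely on the numbers you plug in.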
The HiveQL syntax closely resembles ANSI SQL, although it is a dialect of its own rather than a strict implementation of the standard. Here's an example:
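What follows is a minimal sketch rather than a query from my evaluation. The query sticks to syntax that, as far as I know, HiveQL shares with most SQL dialects (`SELECT`, `WHERE`, `GROUP BY`, `ORDER BY`, `LIMIT`). Because it assumes no Hive cluster, the sketch runs the statement against Python's built-in `sqlite3` module as a stand-in. The `page_views` table, its columns, and the sample rows are all made up for illustration.

```python
import sqlite3

# Uses only syntax common to HiveQL and most SQL dialects.
QUERY = """
SELECT country, COUNT(*) AS views
FROM page_views
WHERE view_date >= '2013-01-01'
GROUP BY country
ORDER BY views DESC
LIMIT 10
"""

# Hypothetical sample data: (country, view_date as an ISO-8601 string).
SAMPLE_ROWS = [
    ("US", "2013-02-01"),
    ("US", "2013-03-01"),
    ("DE", "2013-02-15"),
    ("US", "2012-12-31"),
    ("FR", "2012-06-01"),
]


def run_demo():
    """Load the sample rows into an in-memory SQLite table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE page_views (country TEXT, view_date TEXT)")
        conn.executemany("INSERT INTO page_views VALUES (?, ?)", SAMPLE_ROWS)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for country, views in run_demo():
        print(country, views)
```

On a real cluster the same statement would be submitted through the Hive client and compiled into one or more MapReduce jobs, which is where the latency described above comes from.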
Dave Wentzel