DaveWentzel.com            All Things Data

MapReduce for the RDBMS Guy

MapReduce is a parallel programming framework for executing jobs against multiple data sets.  MapReduce is the framework Google uses to query multiple nodes in a partition-tolerant datastore.  Google owns the patent on the MapReduce framework, but the concepts are freely shared and there are a number of open-source implementations.  

MapReduce's concepts come from the functional programming world.  A map function applies an operation to each element in a list, generally returning another list.  It takes a key-value pair and emits another key-value pair.  Example...function "plus-one" applied to list [0,1,2,3,4] would yield a new list of [1,2,3,4,5].  The original list's data is immutable (never changed) and therefore I need not be concerned with concurrency and locking.  So, multiple map threads could work on this list if the list could be partitioned across multiple nodes.  The map function runs on each subset of a collection local to a distributed node.  The map operation on any one node is completely independent on the same map operation running on another node.  This isolation is how you get cheap, effective parallel processing.  And if any node fails the map function can be restarted on a redundant node.  

Functional programming also has the concept of a reduce function, more commonly known as a "fold function", but you sometimes hear them called "accumulate" or "inject" functions.  A fold function applies a calculation on the key-value pairs and produces a single result.  Almost like the SUM or AVG aggregate functions in RDMBS-land.  A fold (or reduce) function on the list [1,2,3,4,5] would yield a result of 15.  

Map and Reduce functions are used together to process lists of data.  The maps may run a simple calculation on many nodes of the datastore.  The result may be piped to the reducer that may run on a single master node that generates the query's answer.  The map generally must run before the reduce, but the underlying code generally runs computations in parallel.  

Maps and reduces can be thought of as algorithms.  The maps are the relationships and hierarchies among the data.  It is critical to performance that the maps are designed properly.  A map will generally "scan" the data, much like an RDBMS index scan.  Maps will generally not seek.  

A map may pass its data to another map, there can be more than one map function prior to the reduce being called.  Similarly, a reduce function can be the target of another reducer.  

When your querying requirement gets more complex it is generally not a good idea to make you map and reduce jobs more complex.  Instead you should have additional mappers and reducers.  So, add more jobs vs making the existing jobs more complex.  

A Better Example

Think of a partitioned datastore of Customers on 64 nodes.  We want to know how many customers reside in Pennsylvania.  A Java Jock working on a traditional RDBMS may do something stupid like bring back all customers from each node/database and then look for the "Pennsylvanias" on the client tier.  That's wildly inefficient.  With MapReduce the requester will issue a "map" to each node that says, "Please give me a count of your node's customer in PA".  Each node sends the data back to the requester which "reduces" the responses by saying, "Now give me a sum of all nodes' counts."  That's actually kinda efficient.  

Is MapReduce that much different than what we do in RDBMSs today?

In an RDBMS the paradigm is commonly "pass the data to the algorithm" (ie, send the customers table to the jvm to be filtered there for the customers from PA).  But in MapReduce the algorithm is reversed..."pass the algorithm to the data".  Yeah, yeah, this is really flimsy, but this is how most people think about it.  I tend to think of the map as the equivalent of a stored procedure in the RDBMS.  And if you remember that we are running the MapReduce on many, many nodes, then it rings a little more true.  
 
I think of reduce functions as something akin to an aggregation operation.  Aggregation reduces the number of rows in an RDBMS and a reducer does the same thing.  It reduces the amount of data travelling to the next step.  
 
Some MapReduce implementations have a "link" function as well which is equivalent to a JOIN condition in an RDBMS. 
 
Microsoft Daytona
 
MS has a map/reduce framework for Azure called Daytona.  

Add new comment