DaveWentzel.com            All Things Data

What is this NoSQL thing?

You've probably seen NoSQL being bandied about on the web and in the IT press.  Whenever a "new" technology trend gets repeated press coverage, we know we need to bone up on that hot new thing.  First of all, there is a NoSQL RDBMS that runs on Unix, but this isn't what is being hyped right now.  The "NoSQL" being hyped is really just a concept: that we can store data without using an RDBMS, the SQL language, or even relational modeling!

What is it?

As we all know, RDBMSs were never meant for certain applications (streaming media) or for mixed-workload databases (heavy reporting AND transactional traffic).  NoSQL is the movement growing out of the need for data persistence in these types of applications.  This is usually accomplished by sacrificing sacred data management credos (such as ACID transactions, 3rd normal form, etc.) in an effort to make these applications "work".

In many cases entire data sets are copied to many nodes of a distributed grid, so that a query can be attacked by many different nodes concurrently, with hashing used to locate the data.  To make hashing and distributed querying even more effective, the data is sometimes stored in flat files of name-value pairs, similar to the EAV models that we may see in some physical designs.
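To make the hashing idea concrete, here is a minimal sketch in Python.  It is a toy, not how any particular product does it: the node names, the replica count, and the Grid class are all made up for illustration, and real systems typically use consistent hashing so that adding a node does not reshuffle every key.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
REPLICAS = 2  # assumed number of copies kept of each pair


def owners(key, nodes=NODES, replicas=REPLICAS):
    """Return the nodes that hold copies of `key`."""
    # Use a stable hash (not Python's per-process hash()) so every client agrees.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


class Grid:
    """Toy in-memory grid: one dict of name-value pairs per node."""

    def __init__(self, nodes=NODES):
        self.nodes = list(nodes)
        self.stores = {name: {} for name in self.nodes}

    def put(self, key, value):
        # Write the pair to every node that owns the key.
        for name in owners(key, self.nodes):
            self.stores[name][key] = value

    def get(self, key):
        # Any owner can answer; the other nodes are never consulted.
        for name in owners(key, self.nodes):
            if key in self.stores[name]:
                return self.stores[name][key]
        return None


grid = Grid()
grid.put("user:42:name", "Dave")
print(owners("user:42:name"))
print(grid.get("user:42:name"))
```

The point is that the key alone tells a client which nodes to ask, so a lookup never has to scan the whole grid.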


The implementation heard about most often is Google's BigTable.  It is often described as being like a column-oriented database.  MapReduce jobs are typically used to handle distributed querying over it.
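If you haven't run into MapReduce before, the pattern itself is simple.  The following is a single-process sketch of the map, shuffle, and reduce steps in Python.  It is not Google's implementation, and the function names are my own; a real framework runs the map and reduce phases on many nodes and does the shuffle over the network.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = map_reduce(["nosql is not sql", "sql is not going away"])
print(counts)
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread across as many nodes as you have.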

InterSystems' Caché database has had lots of ads in the IT trade rags for years now as well.  It is an object-relational database.  Many might not consider this a NoSQL implementation since it runs a dialect of SQL; however, the underlying spirit of the product, I believe, makes it worth mentioning here.

Among the popular, pure, NoSQL databases available today are Cassandra, MongoDB and CouchDB.

I'm a big fan of memcached, having implemented some of its optimization tools for my little Drupal site.  My Drupal site is running on a VM with almost 0 memory and spare CPU cycles because I don't feel like buying a new box.  Trust me that any sluggishness you are seeing is due to the ISP, *not* my Drupal implementation.  For me, memcached is a great way to relieve database load.  The big problem is that it is volatile: everything lives in RAM.  That's where MemcacheDB comes in, which adds persistence to memcached.  It's built for name-value pair traversal, unlike a typical RDBMS.  All of this is open-source.

Membase reaches general availability this month.  It looks very exciting and will likely be the dominant NoSQL implementation very soon.

Where do we use these?

  1. Wherever we know a relational DBMS won't cut it: streaming data and indexing applications
  2. Wherever we know heavy workloads are skewed to simple updates.  Think of Digg's up/down flagging system.  They recently rewrote quite a bit of their site to use Cassandra for just this reason.  

How do they do what they do?

There is no such thing as a free lunch.  The tradeoffs generally involve weak or non-existent transactional control, with a middle tier handling declarative referential integrity (DRI) if it is required at all.  Lots of commodity grid nodes with redundant data and copies of the hash tables make lookups quick.  And remember that what we are storing in NoSQL isn't standard relational data like an Orders table, but rather name-value pairs.  So the systems are optimized for that style of access, not SQL access.  Some also support XQuery since that is a little easier to learn (and more portable) than something proprietary.

More Problems

Since the essence of NoSQL is the lack of an enforced schema, you can assume that you will suddenly find yourself without everything that schemas give you.  For instance, schemas give you the constraints on your data in one nice place.  Without a schema, you can assume that these constraints will be spread all over your application(s).  If your application is object-oriented you will struggle with this, since object-orientation requires structured data.  Without structured data, manipulating the data is difficult and ultimately the usefulness of the information is limited.
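Here is a sketch of what "constraints spread all over your application" looks like in practice.  The store is just a Python dict standing in for a schemaless database, and the order fields and rules are invented for illustration.  In an RDBMS these three rules would be a column type, a NOT NULL, and a CHECK constraint declared once.

```python
store = {}  # stand-in for a schemaless store: it accepts any value under any key


def validate_order(doc):
    """Return a list of rule violations; empty means the document is acceptable."""
    errors = []
    if not isinstance(doc.get("order_id"), int):
        errors.append("order_id must be an integer")
    if not isinstance(doc.get("customer"), str) or not doc["customer"].strip():
        errors.append("customer must be a non-empty string")
    quantity = doc.get("quantity")
    if not isinstance(quantity, int) or quantity <= 0:
        errors.append("quantity must be a positive integer")
    return errors


def save_order(doc):
    """The only thing enforcing the rules is that callers go through here."""
    errors = validate_order(doc)
    if errors:
        raise ValueError("; ".join(errors))
    store[f"order:{doc['order_id']}"] = doc


save_order({"order_id": 1, "customer": "Acme", "quantity": 3})

try:
    save_order({"order_id": "2", "customer": "", "quantity": "many"})
except ValueError as exc:
    print("rejected:", exc)

# Nothing stops another program from bypassing the checks entirely:
store["order:3"] = {"quantity": "many"}
```

The last line is the real problem: every application that writes to the store has to carry its own copy of validate_order, and the store itself will never complain if one of them forgets.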

Don't believe me?  OO proponents will tell you that strong typing is critical.  But NoSQL does not (generally) enforce types.  

If you require any type of relational or mathematical operations on your data you will find this difficult to do.  In many cases our databases are used to do some type of reporting.  This reporting usually involves aggregating data (summing it) after GROUPing it into sets.  NoSQL doesn't have this concept.  What it does have is very unwieldy and not something that a report analyst can easily grasp.
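To see why, consider what a simple "total by region" report turns into when the data is just name-value pairs.  This is a sketch in Python with made-up keys and fields; the dict stands in for whatever store you are using.  In SQL the whole thing is one line: SELECT region, SUM(total) FROM orders GROUP BY region.

```python
from collections import defaultdict

# Stand-in for the store: each value is whatever the application chose to save.
orders = {
    "order:1": {"region": "east", "total": 120.0},
    "order:2": {"region": "west", "total": 80.0},
    "order:3": {"region": "east", "total": 30.0},
    "order:4": {"region": "west"},  # no total: nothing stopped this being stored
}


def total_by_region(pairs):
    """Hand-written GROUP BY region with SUM(total)."""
    sums = defaultdict(float)
    for doc in pairs.values():
        region = doc.get("region", "unknown")
        # The report writer must decide what a missing total means.
        sums[region] += doc.get("total", 0.0)
    return dict(sums)


report = total_by_region(orders)
print(report)
```

Every new grouping or aggregate means another function like this, plus a decision about how to treat records that are missing a field, which is not something most report analysts want to be writing.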