Schema-less solutions may need a little more schema after all

A few months ago my company renegotiated its SQL Server licensing agreement with Microsoft. We got a really terrible deal. Because of that our data architects have been tasked with looking at alternatives to SQL Server. We've had a few vendors pop in and give us brief demos. Our most recent visitor was a vendor for a commercial version of MongoDB. I'm much less jaded about NoSQL/BigData/schema-less database management systems than my co-workers (some of whom would NEVER EVER consider using one of these solutions). I have seen a few use cases where it would be worthwhile to consider a proof-of-concept of a NoSQL solution over a standard RDBMS.

I digress.

The vendor started out his demonstration by saying Mongo was a pure "schema-less" solution. He proceeded to persist a JSON document in the system without first declaring a "table", a "key", an "index", or anything else remotely representing a "schema" in the traditional RDBMS-sense of the word. He then went on for a few minutes spouting hyperbole about how this is a more natural process for Java guys than having to deal with a RDBMS with its object/relational impedance mismatch. No need for even the limited structure that an ORM would provide. MongoDB simply persists a document object just like a Java Jock would create it. The presenter noted how Mongo assigned an "_id" to our document. He showed us how this was basically the equivalent of a "Primary Key" that Mongo handled for you for free. Every document must have this. He finished his demo by writing some quick "queries" to retrieve the JSON and display it to the screen.

So far, hyperbole aside, I thought his demo was intellectually honest. I did question him that his "_id" column was actually a Primary Key, so wasn't this "schema" after all? He agreed, but said that was the minimal schema that Mongo needed to operate.

He proceeded with his demo where he showed how to "shard" the data across nodes. This is where the demo got interesting. He wanted to shard the data by a "region" attribute in all of his documents. He had 3 nodes and three possible values in all of his documents for "region".

I interrupted him. "Excuse me but your original document only contained the attributes _id and name, there is no region element, so what shard will contain that data?"

"For the demo we won't actually shard the original documents I inserted, we'll blow them away and use some properly-prepared documents that have the necessary region information."

I had him. "So, are you saying that if we wanted to maintain those original JSON documents we would need to perform an update and somehow assign them a region attribute?"

"That is correct."

"So, are you saying that the region element is a required element of all documents if we want to shard by that data element?"

"That is correct."

"So, doesn't that mean that there is a defined 'schema' necessary in MongoDB? Basically, Mongo does REQUIRE at least some semblance of schema to work as designed when introducing sharding, replication, and redundancy. Correct?"

At this point I thought I saw him roll his eyes. "Well, yes, if you want to implement sharding correctly so it performs. But we could've just sharded on a hash of _id and we wouldn't need any schema at all."

"Interesting. So what you are saying is that if you want a properly-performant system then, yes, some about of up-front planning is kinda required. The fact is, this Mongo solution has not really eliminated any of the logical data modeling activities that should be occuring on any system. Yes, Mongo does not require data typing at the column/property/attribute level, does not require foreign keys, does not require domain constraints, and does not require nullability checks, but then again, tables in SQL Server, Oracle, and DB2 do not require this either. The only thing that I can see that Mongo has eliminated is the most basic "CREATE TABLE blah" command that would need to be run on a standard RDBMS first. I've worked on RDBMS-based systems that had absolutely no PKs or FKs defined, this was "handled" in the Java tier. The RDBMS was simply a dumb persistence mechanism. In all honesty these systems all ran fabulously well in many cases and customers were thrilled.

Problems only really manifested themselves when the schema had to evolve due to new requirements. This is where the lack of up-front schema planning (logical data modeling if you will) meant lots and lots of rework to get the new requirements shoe-horned in. And I've never had the luxury, like you just did, of simply discarding my old data that did not fit the new schema requirement and 'start over'. In each case the data had to be migrated to the new schema, and that was difficult and painful. But it needed to be done. "

Crickets could be heard.

The vendor's eyes were fully glazed over. I allowed him to collect his thoughts and continue with his demo without further interruption by me. Other people pointed out the patently-obvious problems with NoSQL solutions, such as the lack of good report writers (Crystal Reports, etc won't work with these solutions because they only understand "SQL" and Mongo doesn't have a SQL-to-cURL bridge or SQL-to-MapReduce bridge). The vendor assured us that Cognos and Microstrategy were actively working on reporting solutions and would have something soon.

The fact is, there are lots of good uses for NoSQL products. But I think it is disingenuous to say that your product is "schema-less" when in fact a good "schema" is the only way to make your product work correctly. There is no substitute for good logical data modeling and designing good schemas. No substitute. When people ask me during interviews what is my least favorite thing about being a data architect I always say, without hesitation, "logical data modeling." I've worked on so many systems (and designed even more unfortunately) where the schemas were just flat out wrong. It was too difficult/risky to change them and performance and customer satisfaction always suffered. For those reasons I'm always fearful I'll design something that will not work.

Too many people want to short-circuit the logical design phase of a product. This is usually because they never had a good education on the subject. Instead, they rely on product-specific knowledge of modeling. In other words, they only know about modeling what they learned in a book about their particular RDBMS. Or, in the case of Java Jocks, they don't want to be constrained at all by any structure. It's not "sexy". Code is sexy. JSON is sexy. "Schema-less" is sexy. This is terribly short-sighted. We are unlearning best practices that have evolved over the last 40 years. Again, there are good uses for many NoSQL/BigData products (eventual consistency is AWESOME if you need it), but eliminating schema and design just isn't one of them.

Dave Wentzel 2013-06-04 CONTENT
data architecture nosql