DaveWentzel.com            All Things Data

June 2013

A Stored Procedure Blessing

Every day at my current gig I'm asked to "bless" some change that some developer wants to make to some database code somewhere.  

CMS

I recently turned down a consulting day because the customer insisted on using a NoSQL solution when I was convinced the requirements did not warrant it.
 
Background…  A biopharma company contacted me to help them set up their Drupal CMS.  This isn't what I normally do; I am more of a data architect.  But they said they had googled around, saw that I have experience with Drupal and that I'm local, and decided to reach out to see if I was interested.  They wanted to store FDA-regulated data in an HBase solution.  I insisted that I had no comparative advantage in either HBase or Drupal, but they REALLY wanted me to come in.  
 
So I did.  They wanted users to fill out their documents using a built-in WYSIWYG editor in Drupal and simply store the generated HTML as a big ol' unstructured BLOB document in the HBase solution.  I pointed out that, yes, Drupal (or any CMS) could store all of their HTML "data" as a schema-less blob.  This is actually extremely easy to do.  But the real question is: is that the right thing to do?
 
We all learned years ago that keeping our raw data separate from the presentation markup is a very wise thing to do.  Using things like CSS we can change the presentation layer very easily at a later date as requirements change without losing any data fidelity.  Why would we want to take a step backward and violate this basic principle?
 
As I said, this is also FDA-regulated data, and the government's requirements are highly structured.  It would make sense to maintain that structure somehow.  I suggested, at a minimum, allowing users to complete a series of metadata questions so the data could more easily meet the government requirements without having to do a bunch of MapReduce queries later.  The customer was hell-bent against using any structure.  They wanted their HBase schema-less solution at any cost.
 
We had an hour-long conversation about why I really felt this was a bad idea.  I tried everything I could think of to talk them out of doing this.  Although Drupal allows everything to be unstructured, clearly some structure would aid taxonomy searches, yielding a faster and more accurate system.  This is in fact a CMS Best Practice.
 
I even tried to convince the customer that unstructured data is a contradiction in terms.  Data has to have some kind of structure, otherwise it's just bits and bytes.  Random bits are not data.  Data has meaning.  Meaning has structure.  They weren't buying any of my hyperbole.  I even tried to convince them that HBase does actually have data "structure" behind it, because database management, at its core, is nothing but the manipulation of some data structure.  They still weren't buying it.  
 
I even asked how many BigData engineers they had on staff that could architect and engineer the solution.  (They were hoping I would help out with that as well.)  
 
At this point I realized I was dealing with another BigData Icarus Project (the second one in as many weeks).  They insisted on using a schema-less solution even though there was no requirement, functional or otherwise, to do so.  I kindly thanked them for inviting me into their company and said that I thought they could do much better than contract with me.  
 
BigData BigProblems, LittleData LittleProblems.  

BigData Icarus Project

I was invited to work on another BigData project that was doomed-to-failure.  I turned it down.  But not because it was using a NoSQL solution.  It was an "Icarus Project".  
 
The project's requirements were very simple:  Be able to store and retrieve any tiny bit of patient health data ever generated during the lifetime of a patient.  This data would be used not only to improve patient health outcomes but as a mechanism for the provider to spot "trends".  The data must be able to be queried quickly, and read activity cannot conflict with the real-time streaming of new data from the source systems.  By "any tiny bit of data" I mean absolute minutiae, such as individual heartbeats.
 
Wow, that's a lot of data.  
 
 
I have very limited BigData experience, but I definitely agreed that a typical RDBMS solution was probably not suited for this task.  Frankly, I think all of the project's functional requirements are a little bizarre and nebulous, but after we started talking through them further I realized this project was doomed to failure.
 
As I started to talk about adding a little bit of structure to the data, I was immediately told that this MUST be a schema-less solution.  We absolutely could not have any structure to the data.  Why?  Certainly some of the data has structure.  I was finally told the real requirement of this project.
This was to be the company's first foray into BigData.  Now we were getting somewhere.  The company was losing software contracts to other shops with BigData backend implementations.  This had the telltale signs of an Icarus Project.  
 
An "Icarus Project" is the term I use to describe the situation where managers or executives initiate an overly ambitious project which then fails.  Fed by excitement for the project, the executives are unable to rein in their misguided enthusiasm before it is too late to avoid failure.  In Greek mythology, Icarus and his father, Daedalus, were imprisoned by King Minos.  Daedalus constructed wings of wax and feathers for them to use to escape by flying.  Daedalus warned his son not to fly too close to the sun.  Icarus was overcome with the excitement of flying and ignored his father's warning.  As he approached the sun the wax melted and Icarus fell to his death.
 
An Icarus Project is most often seen when companies plow into new technologies that they have no experience with.  As they spend more and more money to try and catch up to companies already dominant with the new technology, they use up the cash reserves built up by their profitable products.  In the past, in the realm of data, the biggest Icarus projects were around data warehousing.  That's changed in the last two years.  Now the biggest Icarus Projects seem to be around BigData.

What Goes Around Comes Around

At the turn of the millennium, when it was clear that web architecture was the future and was going to displace client-server programming as the de facto mechanism for future applications, it occurred to me that, indeed, what goes around comes back around again.

The first computer I ever had was a 286 and every application was stand-alone.  What little networking existed was via modem and simple BBS applications.  When I got to college I maintained various VAX/VMS and Ultrix-based "dumb terminal" applications.  People hated them because of the esoteric commands they needed to be productive.  To send an email required the user to construct a command-line string similar to smtp%"email@domain.com".  Please remember that I was dealing with freshman college kids with little to no experience with networked computers, not to mention Luddite college professors who believed all innovation in "word processing" stopped with the advent of the IBM Selectric II typewriter.  I had little "tricks" I taught the professors to make things easier for them to remember.  Example:  "SMTP" was "Send Mail To Postman".  Even that was too hard to remember!  

"I typed smtm% just like you taught me and it's saying 'Invalid command'."

"I'm sorry you are having problems Professor Mankiewicz.  It's actually smtp, send mail to postman, not smtm.  Why would you think it's smtm?"

"I thought it was 'send mail to mailbox'.  Sorry about that."

"Is there anything else I can do for you today doctor?  No?  OK sir, see you in Econ 101 tomorrow.  Bye."  

 

I digress.  

My first "real" job out of college was as a programmer on an IBM mainframe TSO system.  TSO stands for Time Sharing Option.  It's quite simple: a user can log in to the system and is unaware that other users are currently interacting with it.  This is no different than a Unix login over telnet.  Around the same time TSO was introduced, it was becoming common for universities to "rent" time on their computers to interested folks who could develop their own programs.  You've probably heard of people like young Bill Gates who "rented" off-hours, unused cycles from colleges and companies that had these expensive computers.  

Of course the concept of "renting" computing resources went out the window with the advent of technologies like the web and VMWare, right?  

Wrong!  Renting off-hours, unused cycles is again en vogue with the advent of the cloud.  

(Google's first hit for TSO)


As I've mentioned in a few recent blog posts, I need to research and prototype a number of "NoSQL" solutions.  I could of course build these in-house, but some of these products are extremely difficult to set up and install unless you hold a PhD in CompSci.  It's much easier to provide Amazon with a credit card and use Amazon AWS.  After an hour and less than a hundred bucks I had a running Riak system to play with.  Very cool.  But AWS is not really time sharing.  Or so I thought, until I researched all of the AWS offerings a bit more.  

AWS and EC2

On the AWS website is a link to powerof60.com.  This explains Amazon's EC2 offering.  EC2 is an acronym for Elastic Compute Cloud.  With this offering you purchase processing cycles just like you purchase electricity.  You pay for as much as you use, and you pay more during peak hours.  Services are available just like a utility.  This sounds a lot like time-sharing to me.  Amazon has lots of case studies where companies (even NASA) have occasional need for lots of processing power.  Amazon will let you spin up the resources as you need them, when you need them, without the up-front capital expense of purchasing your own hardware and software (not to mention labor).  If you can do your processing off-hours you pay less.  You can even have Amazon spin up your processing for just a few minutes (off-peak) to save even more $$$.  

Azure

Although I can't find this documented by the Azure folks, I just heard that Windows Azure is going to a per-minute pricing strategy in the future.  Apparently you can power down your VMs when (if) your developers leave the office at 5pm, and you won't be billed again until you power them back up the next morning.  

What goes around comes around again.  


Schema-less solutions may need a little more schema after all

A few months ago my company renegotiated its SQL Server licensing agreement with Microsoft.  We got a really terrible deal.  Because of that our data architects have been tasked with looking at alternatives to SQL Server.  We've had a few vendors pop in and give us brief demos.  Our most recent visitor was a vendor for a commercial version of MongoDB.  I'm much less jaded about NoSQL/BigData/schema-less database management systems than my co-workers (some of whom would NEVER EVER consider using one of these solutions).  I have seen a few use cases where it would be worthwhile to consider a proof-of-concept of a NoSQL solution over a standard RDBMS. 

I digress.  

The vendor started out his demonstration by saying Mongo was a pure "schema-less" solution.  He proceeded to persist a JSON document in the system without first declaring a "table", a "key", an "index", or anything else remotely representing a "schema" in the traditional RDBMS sense of the word.  He then went on for a few minutes spouting hyperbole about how this is a more natural process for Java guys than having to deal with an RDBMS and its object/relational impedance mismatch.  No need for even the limited structure that an ORM would provide.  MongoDB simply persists a document object just like a Java Jock would create it.  The presenter noted how Mongo assigned an "_id" to our document.  He showed us how this was basically the equivalent of a "Primary Key" that Mongo handles for you for free.  Every document must have one.  He finished his demo by writing some quick "queries" to retrieve the JSON and display it to the screen.  

So far, hyperbole aside, I thought his demo was intellectually honest.  I did ask him whether his "_id" column was actually a Primary Key, and thus wasn't this "schema" after all?  He agreed, but said that was the minimal schema that Mongo needed to operate.  

He proceeded with his demo, showing how to "shard" the data across nodes.  This is where the demo got interesting.  He wanted to shard the data by a "region" attribute in all of his documents.  He had three nodes and three possible values of "region" across his documents.  

I interrupted him.  "Excuse me but your original document only contained the attributes _id and name, there is no region element, so what shard will contain that data?"  

"For the demo we won't actually shard the original documents I inserted, we'll blow them away and use some properly-prepared documents that have the necessary region information."  

I had him.  "So, are you saying that if we wanted to maintain those original JSON documents we would need to perform an update and somehow assign them a region attribute?"  

"That is correct."  

"So, are you saying that the region element is a required element of all documents if we want to shard by that data element?"

"That is correct."

"So, doesn't that mean that there is a defined 'schema' necessary in MongoDB?  Basically, Mongo does REQUIRE at least some semblance of schema to work as designed when introducing sharding, replication, and redundancy.  Correct?"

At this point I thought I saw him roll his eyes.  "Well, yes, if you want to implement sharding correctly so it performs.  But we could've just sharded on a hash of _id and we wouldn't need any schema at all." 

"Interesting.  So what you are saying is that if you want a properly-performant system then, yes, some about of up-front planning is kinda required.  The fact is, this Mongo solution has not really eliminated any of the logical data modeling activities that should be occuring on any system.  Yes, Mongo does not require data typing at the column/property/attribute level, does not require foreign keys, does not require domain constraints, and does not require nullability checks, but then again, tables in SQL Server, Oracle, and DB2 do not require this either.  The only thing that I can see that Mongo has eliminated is the most basic "CREATE TABLE blah" command that would need to be run on a standard RDBMS first.  I've worked on RDBMS-based systems that had absolutely no PKs or FKs defined, this was "handled" in the Java tier.  The RDBMS was simply a dumb persistence mechanism.  In all honesty these systems all ran fabulously well in many cases and customers were thrilled.  

"Problems only really manifested themselves when the schema had to evolve due to new requirements.  This is where the lack of up-front schema planning (logical data modeling, if you will) meant lots and lots of rework to get the new requirements shoe-horned in.  And I've never had the luxury, like you just did, of simply discarding my old data that did not fit the new schema requirement and 'starting over'.  In each case the data had to be migrated to the new schema, and that was difficult and painful.  But it needed to be done."  

Crickets could be heard.  

The vendor's eyes were fully glazed over.  I allowed him to collect his thoughts and continue with his demo without further interruption from me.  Other people pointed out the patently obvious problems with NoSQL solutions, such as the lack of good report writers (Crystal Reports, etc. won't work with these solutions because they only understand SQL, and Mongo doesn't have a SQL-to-cURL bridge or a SQL-to-MapReduce bridge).  The vendor assured us that Cognos and Microstrategy were actively working on reporting solutions and would have something soon.  

The fact is, there are lots of good uses for NoSQL products.  But I think it is disingenuous to say that your product is "schema-less" when in fact a good "schema" is the only way to make your product work correctly.  There is no substitute for good logical data modeling and designing good schemas.  No substitute.  When people ask me during interviews what my least favorite thing about being a data architect is, I always say, without hesitation, "logical data modeling."  I've worked on so many systems (and designed even more, unfortunately) where the schemas were just flat-out wrong.  It was too difficult/risky to change them, and performance and customer satisfaction always suffered.  For those reasons I'm always fearful I'll design something that will not work.  

Too many people want to short-circuit the logical design phase of a product.  This is usually because they never had a good education on the subject.  Instead, they rely on product-specific knowledge of modeling.  In other words, they only know about modeling what they learned in a book about their particular RDBMS.  Or, in the case of Java Jocks, they don't want to be constrained at all by any structure.  It's not "sexy".  Code is sexy.  JSON is sexy.  "Schema-less" is sexy.  This is terribly short-sighted.  We are unlearning best practices that have evolved over the last 40 years.  Again, there are good uses for many NoSQL/BigData products (eventual consistency is AWESOME if you need it), but eliminating schema and design just isn't one of them.  
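To put the CREATE TABLE point above in concrete terms, here is a minimal T-SQL sketch (table and column names are purely illustrative, not from the vendor demo) of the same "schema-light" pattern on an RDBMS: a system-assigned key plus one untyped document column, which is roughly the same amount of schema Mongo's _id gives you for free.

-- A hypothetical, nearly schema-free table: a surrogate key and a raw JSON blob.
CREATE TABLE dbo.Documents
(
    _id UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() PRIMARY KEY,
    Doc NVARCHAR(MAX) NULL  -- the document body; no further structure declared
);

-- Persist a document without declaring anything about its contents.
INSERT INTO dbo.Documents (Doc)
VALUES (N'{ "name": "my first document" }');

Whether that counts as "schema-less" or merely "schema-light" is, of course, the whole argument above.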

Just How Expensive is a Bookmark Lookup in a Query Plan

Yesterday I mentioned a conversation with a co-worker where he asked Just How Expensive is a Sort Operator in a Query Plan?  I showed using mathematics and theory that they are very costly.  As a followup my co-worker asked why, after eliminating the Sort Operators, did I focus on the Bookmark Lookups?  Why exactly are Bookmark Lookups so bad for performance?  Isn't it simply using a second index to get the data that is required?  Isn't this better than potentially scanning a clustered index directly?  

 

SQL Server is smart enough to consider using more than one index to satisfy a given query.  There are multiple ways SQL Server can do this.  One way is to join two non-clustered indexes together to "cover" all of the columns required by the query.  This is known as "index intersection."
 
If it is not possible to cover all the columns required by the query using non-clustered indexes, then the optimizer may need to access the base table to obtain the remaining columns.  This is known as a bookmark lookup operation.  If the table has a clustered index, this is known as a Key Lookup.  If the base table is a heap, this is known as a RID Lookup.  
 
Given the choice of index intersection or bookmark lookup, the former is always better. Why?
(An example of index intersection)
 
Just like a standard SQL JOIN, an "index intersection" simply joins two data structures by their key.  The data retrieval is handled as sequential reads.  This is very efficient, even over very large result sets.  
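Here is a hedged T-SQL sketch of that idea (the table, indexes, and values are hypothetical): neither single-column non-clustered index covers the query on its own, but together they do, so the optimizer can seek both and join them on the clustered key.

-- Hypothetical table with a clustered primary key on OrderID.
CREATE TABLE dbo.Orders
(
    OrderID    INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- clustered by default
    CustomerID INT      NOT NULL,
    OrderDate  DATETIME NOT NULL,
    TotalDue   MONEY    NOT NULL
);

CREATE NONCLUSTERED INDEX IX_Orders_CustomerID ON dbo.Orders (CustomerID);
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate  ON dbo.Orders (OrderDate);

-- Only CustomerID and OrderDate are referenced, so the optimizer may seek both
-- non-clustered indexes and join them on OrderID (index intersection) instead
-- of touching the base table.
SELECT CustomerID, OrderDate
FROM   dbo.Orders
WHERE  CustomerID = 42
  AND  OrderDate >= '20130601';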
 
A bookmark/key/RID lookup also joins on the key, but in this case the leading column of the non-clustered index is not the leading column of the clustered index.  The non-clustered index contains the clustered index's key, but it isn't the leading column.  The data access, therefore, will not be sequential; it requires random I/O, which is a very expensive operation, so the lookup is effective only for a relatively small number of records.  This is why it is sometimes more efficient for SQL Server NOT to use a non-clustered index that doesn't cover a given query; it may choose a clustered index scan instead.  
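Continuing the hypothetical dbo.Orders sketch above: the first query below references TotalDue, which lives only in the base table, so a seek on IX_Orders_CustomerID typically gets paired with a Key Lookup into the clustered index for every qualifying row.  Adding the extra columns to the index leaf level with INCLUDE makes the index covering, and the lookup goes away.

-- TotalDue is not in any non-clustered index, so this typically produces an
-- index seek on IX_Orders_CustomerID plus a Key Lookup per qualifying row.
SELECT CustomerID, OrderDate, TotalDue
FROM   dbo.Orders
WHERE  CustomerID = 42;

-- A covering index: the needed columns ride along at the leaf level,
-- so no lookup back to the clustered index is required.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_Covering
    ON dbo.Orders (CustomerID)
    INCLUDE (OrderDate, TotalDue);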

Just How Expensive is a Sort Operator in a Query Plan

I was looking at a query plan with a co-worker and we noticed a Sort operator.  I was contemplating various methods we could use to get rid of it when my co-worker asked, "Why do you want to get rid of the Sort operator?  It's only 4% of the query; it's not like it's 88%."  

Good question.  Sorting can be quite expensive over larger input sets (which this query had).  Sorting has N * LOG(N) algorithmic complexity, theoretically, which is definitely not linear.  So the more rows your input has, the more expensive the sort becomes on a per-row basis.  This quick breakdown I did in Excel shows how the sort's cost, in terms of the amount of work to be done, grows as the number of rows in its input grows:  
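In rough terms, assuming the work is proportional to n * log2(n) (illustrative values, rounded):

rows (n)             n * log2(n)           work per row
1,000                ~10,000               ~10
100,000              ~1,700,000            ~17
10,000,000           ~230,000,000          ~23
1,000,000,000        ~30,000,000,000       ~30

The total work grows faster than the row count, so each additional row costs a little more than the one before it.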

It gets even worse if your SORT spills to tempdb.  Your odds of this happening increase as the number of rows in your input grows.  

So, yes, the Sort operator is expensive and should be eliminated if it can be.  
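One common way to make a Sort operator disappear, sketched against the hypothetical dbo.Orders table from the bookmark-lookup post above: if an index already delivers rows in the order the query asks for, the optimizer can usually stream them and skip the Sort entirely.

-- The ORDER BY forces a Sort operator if no index supplies the order.
SELECT CustomerID, OrderDate
FROM   dbo.Orders
WHERE  CustomerID = 42
ORDER  BY OrderDate;

-- An index keyed on (CustomerID, OrderDate) returns the rows for a given
-- customer already sorted by OrderDate, so the Sort can usually be removed
-- from the plan.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_OrderDate
    ON dbo.Orders (CustomerID, OrderDate);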

More Information