DaveWentzel.com            All Things Data

Data Architecture

SAP HANA Evaluation

As I mentioned in the last couple of blog posts, I've been tasked with looking at BigData and NoSQL solutions for one of the projects I'm working on.  We are seriously considering SAP HANA.  HANA is an in-memory data platform that is deployed on an appliance or in the cloud.  It is mostly focused on real-time analytics.  SAP HANA is an optimized platform that combines column-oriented data storage, massively parallel processing, in-memory computing, and partitioning across multiple hosts.
HANA optimizes your data on the fly.  It can compress data and convert between columnstore and rowstore representations depending on how you are using your data.  For instance, if calculations are single-column-based then columnstore storage is chosen.  
It has a standard SQL language, called SQLScript, which is very similar to TSQL; it supports MDX (just like SQL Server...so it works well with Excel) and ABAP, and has standard JDBC and ODBC drivers.  It is ACID compliant.  This makes the move from a standard SQL RDBMS a little less painful.  SAP provides prepackaged algorithms optimized for HANA.  
My project is an Accounting System with lots of aggregate tables that holds summarized data at various grains of detail.  We aggregate, of course, because aggregated data is faster to read than performing the requisite calculations on the fly.  With HANA the aggregate tables are not needed.  They believe that they can retrieve the necessary aggregated data by querying the column stores directly, in-memory.  This of course would simplify our data model tremendously since we wouldn't need all kinds of logic to populate the aggregated tables.  ETL goes away.  We simply compute on the fly.  This would eliminate a lot of our data and a lot of our storage costs.  
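HANA's pitch here, dropping the aggregate tables and computing on the fly, can be sketched with any SQL engine.  Below is a toy illustration using Python's sqlite3; the ledger schema, account numbers, and amounts are invented for the example, not taken from the actual accounting system:

```python
import sqlite3

# Hypothetical ledger schema -- table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (account TEXT, period TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO ledger VALUES (?, ?, ?)",
    [("1000", "2013-01", 250.0), ("1000", "2013-01", 125.0),
     ("1000", "2013-02", 300.0), ("2000", "2013-01", 80.0)],
)

# The "aggregate table" approach: a separate summary table kept in sync by ETL.
conn.execute("""
    CREATE TABLE ledger_summary AS
    SELECT account, period, SUM(amount) AS total
    FROM ledger GROUP BY account, period
""")

# The "compute on the fly" approach HANA advocates: query the base data directly.
on_the_fly = conn.execute(
    "SELECT SUM(amount) FROM ledger WHERE account = '1000' AND period = '2013-01'"
).fetchone()[0]

pre_aggregated = conn.execute(
    "SELECT total FROM ledger_summary WHERE account = '1000' AND period = '2013-01'"
).fetchone()[0]

print(on_the_fly, pre_aggregated)  # both 375.0 -- same answer, no ETL needed
```

The claim is that with an in-memory columnstore the on-the-fly query is fast enough that the summary table, and all the ETL feeding it, becomes dead weight.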
Like many financial applications, ours is batch-oriented.  There are certain long-running, complex calculations we just can't do real-time with an RDBMS.  With SAP HANA “batch is dead”.  Even if HANA can't support your most demanding batch jobs, at a minimum they become "on demand" jobs instead.  Operational reports can run real-time on the OLTP system...no more need for an ODS.  
It will be interesting to see where this proof of concept leads us.
Where Can I Download It?
You can't.  But SAP gives you FREE development instances with all of the tools you need on their website (see below).  Here's why...SAP HANA is sold as pre-configured hardware appliances through select vendors.  It runs on SUSE Linux SLES 11.  It uses Intel E7 chips, Samsung RAM, Fusion IO SSD cards, and standard 15K rotational media for overflow and logging.  
Where Can I Get More Information?
The SAP HANA website provides lots of free information.  Lots of example use cases and code.  
The SAP HANA Essentials ebook is being written in "real time".  Google around for the free promo code and then you don't have to pay for it.  It is being continuously updated with new chapters as the content becomes available.  


This is my next post in my NoSQL series.  Sharding is not specific to NoSQL, but quite a few BigData/NoSQL solutions use sharding to scale better.  

What is sharding?

A shard is one horizontal partition of a table, relation, or database.  The difference between a shard and ordinary horizontal partitioning is that the shard is located on a separate network node.  The benefit of sharding is that each node holds less data, so the data is smaller, more likely to be held in cache, and the indexes are smaller.  Most importantly, a grid of nodes can now attack queries across shards in parallel.  This is an MPP (massively parallel processing) architecture vs an SMP (symmetric multiprocessing) architecture.  You can also view it this way: sharding is "shared nothing" while horizontal partitioning is generally "shared almost everything."  

Sharding works best when each node (or the central query node in some cases) knows exactly which shard a given data element resides on.  Therefore the shard partition key must be a well-defined range.  You want your shards to be well-balanced, and often that means using a contrived hash key as the shard key, but this is not a requirement.  
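The "contrived hash key" idea can be sketched in a few lines of Python.  The node names are made up for illustration; the point is that the routing function is deterministic, so every client computes the same shard for a given key:

```python
import hashlib

NODES = ["node0", "node1", "node2", "node3"]  # hypothetical shard hosts

def shard_for(key: str) -> str:
    """Route a key to a shard via a stable hash. Every client computes the
    same answer, so any node knows exactly where a data element lives."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# The weakness discussed below: adding a node changes len(NODES), so most
# keys suddenly route elsewhere and the shards must be re-balanced.
assert shard_for("customer:42") == shard_for("customer:42")  # deterministic
```

Note the modulo on `len(NODES)`: naive hashing like this is exactly why home-grown sharding schemes need painful re-balancing when nodes are added, which is the problem consistent-hashing schemes exist to soften.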

What products use sharding?

This is not a comprehensive list, but:

  • MySQL Cluster uses auto-sharding.  
  • MongoDB
  • Sharding is not so useful for graph databases. The highly connected nature of nodes and edges in a typical graph database can make it difficult to partition the data effectively. Many graph databases do not provide facilities for edges to reference nodes in different databases. For these databases, scaling up rather than scaling out may be a better option.  I'll talk more about graph databases in a future NoSQL blog post.  

Problems with sharding

The concept of sharding is nothing new.  The biggest problem is when you attempt to roll your own sharding.  I've seen this a lot and it never works.  Invariably the end result is a bunch of code in the app tier (or ORM) that tries to determine which shard should have a given data element.  Later we determine that the shards need to migrate or be re-balanced and this causes a lot of code changes.  

The CAP Theorem

Here is a little more of my experiences evaluating NoSQL solutions.  It is important to understand the CAP Theorem.  CAP is an acronym for Consistency, Availability, Partition Tolerance.  
In a distributed data architecture you can only have 2 of these 3 at any one time.  You, as the data architect, need to choose which 2 are the most important to your solution.  Based on your requirements you can then choose which NoSQL solution is best for you.  
If a node goes down (called a network partition) will your data still be queryable (available, but possibly not accurate) or consistent (accurate, but possibly not queryable until the nodes agree)?   
As a relational guy this makes sense to me.  What distributed data guys have figured out is that the CAP Theorem can be bent by saying that you can actually have all 3, if you don't expect to have all 3 at any particular exact moment in time.  With many NoSQL solutions the query language will have an option to allow a given request to honor availability over consistency or vice versa, depending on your requirements.  So, if I must have query accuracy then my system will be unavailable during a network partition, but if I can sacrifice accuracy/consistency then I can tolerate a network partition.  
It's usually not that simple unfortunately.  It's generally not a wise idea, for hopefully obvious reasons, to write data to one and only one node in the distributed datastore.  If we have, say, 64 nodes and every write only goes to one of those nodes, we have zero resiliency if a network partition occurs.  That node's data is lost until the node comes back online (and you may even need to restore a backup).
Instead, a distributed datastore will want each write to be acknowledged on more than one node before the client is notified of the write's success.  How many nodes a write must occur on is an implementation detail.  But clearly this means that writes will take a little longer in systems designed like this.  It's equivalent to a two phase commit (2PC).  
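A toy sketch of this "multiple node write" pattern, with the replicas modeled as plain dicts (a real datastore does this over the network, and the replica count and W value here are arbitrary assumptions):

```python
# The client's write is only reported successful once W replicas acknowledge it.
REPLICAS = [{}, {}, {}, {}, {}]  # five hypothetical nodes
W = 3                            # required acknowledgements

def write(key, value):
    acks = 0
    for node in REPLICAS:
        node[key] = value        # each successful store is one acknowledgement
        acks += 1
        if acks >= W:
            return True          # client is told "success" after W acks...
    return acks >= W

ok = write("balance:42", 100)
# ...which means the remaining replicas may briefly hold stale data:
stale = sum(1 for node in REPLICAS if node.get("balance:42") != 100)
print(ok, stale)  # True 2
```

The two nodes left stale here are exactly why the read side needs the versioning machinery described below.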
This "multiple node write" issue also means that if we query for a specific scalar data element, any two nodes may have different values depending on which was updated last.  These datastores, while allowing queries to be run against all nodes (map) and then merged to determine the correct version (reduce), therefore require a synchronization and versioning mechanism such as a vector clock.  I'll discuss vector clocks and other synchronization mechanisms in the next post.  
Other implementations may not require the "2PC" but will instead only write to one node but perform an asynchronous replication to other nodes in the background.  They may use a messaging engine such as Service Broker or JMS to do this.  Obviously, this is prone to consistency problems during a network partition.  In this type of system the designer is clearly valuing performance of writes over consistency of data.  Obviously a query may not return transactionally consistent data, always, in this type of system.  
The system designer has infinite latitude in how to bend the CAP Theorem rules based on requirements.  Most of us RDBMS people don't like this kind of flexibility because we don't like when certain rules, like ACID, are broken.  The interesting thing is, if you think about it, you can actually totally guarantee CAP in a distributed datastore if you remember that both readers and writers would need to execute against exactly half of the total nodes, plus one.  This would assume a "quorum".  Of course your performance on both reads and writes will be at their worst and will totally negate any reason for distributing data in the first place.  IMHO.  
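The "half plus one" rule in the paragraph above is simple arithmetic: if both reads and writes must touch a majority of nodes, any read set and any write set must overlap in at least one node, so a reader always sees the latest write.  A quick sketch (the 64-node figure reuses the example from the earlier paragraph):

```python
# Quorum overlap: with N nodes, a quorum is N // 2 + 1.
N = 64                      # total nodes in the hypothetical cluster
quorum = N // 2 + 1         # 33

# Worst case: reader and writer pick sets that overlap as little as possible.
write_set = set(range(quorum))            # nodes 0..32
read_set = set(range(N - quorum, N))      # nodes 31..63
overlap = write_set & read_set

print(quorum, len(overlap))  # 33 2 -- never empty, because quorum + quorum > N
```

Since 33 + 33 > 64, the intersection can never be empty, which is the whole guarantee, and also the whole cost: every operation now touches a majority of the cluster.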
An area where a distributed datastore may take some liberties that freaks out us RDBMS folks is that a write may not *really* be a write.  In the RDBMS world writes follow ACID.  This means that a write can't be considered successful if I just write it to buffer memory.  I have to write it to the REDO/transaction log and get a success from that.  We've all been told in years past to "make sure your disk controller has battery backup so you don't lose your write cache in the event of failure".  
Well, it's possible that we can "bend" the write rules a little too.  Perhaps we are willing to tolerate the "edge case" where we lose a write due to a failure between the time the buffer was dirtied and when it was flushed to disk.  So, some of these solutions will allow you to configure whether you need a "durable write" vs a plain ol' "write".  Again, the system designer needs to consciously make the tradeoff between durability and performance.  This is a "setting" I've never seen in an RDBMS, nor even seen requested as a feature.  It's just a totally foreign concept.  And in many cases the write/durable-write setting can be changed at the request level, offering even more flexibility.  In fact, it appears as though Microsoft is getting on board with this concept: Hekaton will support the concept of "delayed durability".  
I see one last concern with these solutions: the cascading failure problem.  If node N goes down then its requests will be routed to node N', effectively doubling N's load.  That has the potential to bring N' down, which would cause N'' to service thrice its normal load, and so on.  This can be overcome with more nifty algorithms; it's just worth noting that it's never wise to overtax your nodes.  This is why NoSQL people always opt for more nodes on cheaper commodity hardware. 

Presentation on Metadata Driven Database Deployments is tonight

As a reminder, I am giving a presentation on metadata driven database deployments using my tool at 5:30 tonight at Microsoft's Malvern office.  The presentation and source code can be downloaded from CodePlex.  

See you there.


Handling Conflicts with Eventual Consistency and Distributed Systems

In distributed data architectures like some NoSQL solutions, a mechanism is needed to detect and resolve conflicting updates on different nodes.  There are many different ways to do this.  You don't need to understand the nitty gritty details of how this is done, but understanding the basic concepts is fundamental to understanding the strengths and limitations of your distributed data system.  In this post I'll cover some of the methods.
Distributed Hash Tables
BitTorrent's distributed tracker uses this technology.  In fact, some key-value systems like Cassandra and memcached (I'll cover both in a later post) are just giant DHTs.  Essentially a data element is passed to a common hashing function to generate a key, which is then passed around to other nodes.  
Quorum Protocol
Very simply, technologies that use QP must "commit" on n number of nodes for the transaction to be considered successful.  The "commit" can be handled in one of two ways:
  1. Like a traditional two-phase commit (2PC).  In this case the write commit is delayed until a quorum of n acknowledge the change.  Obviously this can introduce a high amount of transaction latency.  
  2. The quorum is obtained on the read side.  When the data is read the quorum is obtained then.  Again, latency is introduced in this model.  
Gossip Protocol
This method allows nodes to become aware of other node crashes or of new nodes joining the distributed system.  Changes to data are propagated to a set of known neighbors, who in turn propagate them to a different set of neighbors.  After a certain period of time the data view becomes consistent.  The problem is that the more nodes the system contains, the longer it takes for updates to propagate, which in turn means "eventual consistency" takes longer, and therefore the possibility of conflicts occurring increases.  
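A toy simulation of gossip rounds makes the propagation behavior concrete.  The node count, fanout of 3, and fixed random seed are all arbitrary assumptions for illustration:

```python
import random

# One gossip round: each node that already knows an update tells a few
# random neighbors; we count rounds until everyone is consistent.
random.seed(42)
NODES, FANOUT = 100, 3
informed = {0}                       # node 0 originates the update

rounds = 0
while len(informed) < NODES:
    rounds += 1
    for node in list(informed):
        informed.update(random.sample(range(NODES), FANOUT))

print(rounds)  # converges in a handful of rounds, growing roughly with log(N)
```

The informed set roughly multiplies each round, which is why gossip convergence time grows slowly (logarithmically) with cluster size, but it still grows, and that window is where conflicts are born.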
Vector Clocks
A vector clock is probably the simplest way to handle conflict resolution in a distributed system.  A vector clock is a token that distributed systems pass around to keep the order of conflicting updates intact.  You could just timestamp the updates and let the last update win...if your requirements are that simple.  But if the servers are geographically disparate it may be impossible to keep the clocks synchronous.  Even using something like NTP (Network Time Protocol) on a LAN may not keep the clocks synchronized enough.  Vector clocks tag the data event so that conflicts can be handled logically by the application developer.  This is basically how Git works under-the-covers.  
There are many ways to implement vector clocks, but the simplest is for the client to stamp its data event with a "tag" containing what the client knows about all of the other clients in the distributed system at that point in time.  A typical tag may look like this:

client1:50, client2:100, client3:78, client4:90

Assuming this was client1, the above tag indicates client1 knows that client2 is on its "Revision 100", client3 is on its "Revision 78", client4 is on its "Revision 90", and it is itself on "Revision 50".  "Revision" in this sense is a monotonically increasing identifier specific to that node.  You could use a GUID, but that is rarely done.  
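A minimal vector clock following that description can be sketched in Python.  Each node increments its own slot on a local event and merges slot-by-slot maxima when it receives another node's tag; two tags conflict when neither dominates the other:

```python
# Minimal vector clock: dict of node -> revision counter.
def increment(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a, b):
    # Slot-by-slot maximum: what a node knows after seeing both tags.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def happened_before(a, b):
    # a causally precedes b iff every slot of a is <= the same slot of b.
    return all(a.get(n, 0) <= b.get(n, 0) for n in a.keys() | b.keys()) and a != b

c1 = increment({}, "client1")              # {'client1': 1}
c2 = increment(c1, "client1")              # {'client1': 2}
other = increment({}, "client2")           # a concurrent update elsewhere

print(happened_before(c1, c2))             # True  -- ordered updates
print(happened_before(c2, other) or happened_before(other, c2))  # False -- conflict
print(merge(c2, other))                    # {'client1': 2, 'client2': 1}
```

When neither tag dominates, as with `c2` and `other` above, the datastore cannot pick a winner on its own; that is exactly the case it hands back to the application developer to resolve.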
Problems with Vector Clocks
A vector clock tag grows very large as more clients participate in the system, and as the system generates more messages.  Some vector clock-based systems like Riak have a pruning algorithm to keep the vector clock tags more compact. Vector clocks are also used by Amazon's Dynamo architecture.  

Eventual Consistency or ACIDs and BASEs

This is the next post in my NoSQL series.  A fundamental concept endorsed by many NoSQL solutions is "eventual consistency."  Each product can implement eventual consistency however it likes.  RDBMS people are familiar with ACID properties (Atomicity, Consistency, Isolation, Durability).  NoSQL people like to bend the rules a bit, and the result is often described as BASE properties (Basically Available, Soft state, Eventually consistent).  

So, when would you use BASE transactions?  

One case is when your network is not reliable.  For instance, transactions that must span your local network as well as a cloud provider.  Local updates can be queued, batched, and sent later.  While the remote transactions are in-flight the "system" is in an inconsistent state, but eventually all updates will be applied and the system will be in balance again.  
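The queue-and-batch pattern just described can be sketched in a few lines.  The two dicts stand in for the local and remote datastores; everything here is invented for illustration:

```python
from collections import deque

# Local updates apply immediately; remote updates queue while the link is
# down and are replayed later, at which point the two stores converge.
local_store, remote_store = {}, {}
outbox = deque()

def update(key, value, link_up):
    local_store[key] = value
    if link_up:
        remote_store[key] = value
    else:
        outbox.append((key, value))      # defer -- system is now inconsistent

def drain_outbox():
    while outbox:
        key, value = outbox.popleft()
        remote_store[key] = value        # eventually consistent again

update("order:1", "shipped", link_up=False)
inconsistent = local_store != remote_store
drain_outbox()
print(inconsistent, local_store == remote_store)  # True True
```

The window between `update` and `drain_outbox` is the "soft state" in BASE: the system is knowingly wrong for a while, in exchange for staying available.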

Inventory systems can withstand eventual consistency, usually, thus are good candidates.  

In general, consider BASE transactions in cases where ACID transactions are not immediately required and the data eventually becomes consistent and is never lost.  

Data Models and Data Organization Methods

This is the next post in my series on NoSQL alternatives to RDBMSs.  How a datastore persists its data can be roughly divided into the following categories.  There may be others, depending on your perspective.  It can also be argued that some of the types below are subtypes of others.  
For each data model below: a description, whether it has a schema, how relationships are modeled, whether the data is typed, example products, and possible disadvantages.

Relational
  • We probably all know this by now.
  • Has schema; relationships are predefined; data is typed.
  • Examples: SQL Server, MySQL, Oracle.
  • Possible disadvantages: everybody has different opinions on this.

Hierarchical
  • Probably the oldest electronic data storage mechanism.  Data is modeled in a pyramid fashion (parent and child records).  With the advent of the relational model most hierarchical data stores died out.  However, XML started a bit of a renaissance in hierarchical data stores, and many NoSQL solutions could be categorized as hierarchical.
  • Has schema; relationships are predefined; data is typed.
  • Examples: IBM IMS, the Windows registry, XML.
  • Possible disadvantages: relationships among children are not permitted, and the schema is extremely rigid (adding a new data property usually requires rebuilding the entire data set).

Network
  • Very similar to hierarchical datastores; the next evolution of the hierarchical model, allowing a child record to have multiple parents, much like a graph.  Many healthcare data stores are based on MUMPS, which falls into this category.  "Data" is stored as data plus methods (instructions on what to do with the data).
  • Has schema; relationships are predefined; data is typed.
  • Examples: CODASYL, InterSystems Caché, and other object-oriented databases.
  • Possible disadvantages: never gained much traction because IBM wouldn't embrace it and the relational model came along and displaced both.

Graph
  • Could be considered the next generation of network data stores.  These are prevalent in the social-networking space, where applications like Twitter, LinkedIn, and Facebook want to visualize your "network" for you.  Nodes represent users who have relationships (the edges) to each other.  Modeling this relationally is challenging at best; a graph system can query these structures easily.  Basically, if you can model your data on a whiteboard, then a graph database can model it too.
  • Schema: it depends (but generally no schema); relationships are edges; data is ad hoc and usually untyped.
  • Examples: Neo4j.
  • Possible disadvantages: very complex.

Document
  • Designed for storing document-oriented data, such as XML, or semi-structured data.
  • No schema; relationships are generally not modeled; data is typed.
  • Examples: CouchDB, MongoDB.
  • Possible disadvantages: does not support ad hoc reporting tools such as Crystal Reports.

Columnar
  • Stores its data as columns or groups of columns, rather than as rows.  Good for OLAP applications and anywhere aggregation queries are important.
  • Has schema; relationships are predefined (similar to relational); data is typed.
  • Examples: HBase.
  • Possible disadvantages: generally not good for OLTP applications.

Key-Value
  • Stores its data like an EAV model.  Entirely schema-less.  Data can be stored as byte streams, so you can persist a programming language's objects directly in the key-value store.
  • No schema; relationships are links via keys; data is semi-typed.
  • Examples: Redis, Riak.

Tutorial D, D, Rel and Their Relation to NoSQL

I mentioned in my last NoSQL post that developers complain that SQL is a lousy declarative language.  On some levels I agree with that.  But those same developers then damn all relational theory and data stores because of shortcomings in the SQL language.  This is wrong.  It is usually vendor implementations of SQL that are the problem.  There are relational languages other than SQL.  This post will cover some of them and how they are better than SQL in many regards.  

Tutorial D is not really a language, but rather a specification co-authored by C.J. Date.  It describes the characteristics that a good declarative, relational language should have.  Date devised Tutorial D as an educational aid in one of his books.  The "D", I believe, stands for "data".  Date believed that SQL was not a particularly good language for expressing relational concepts.  D was supposed to be a better alternative to SQL.  

What are some of SQL's shortcomings according to Date?  D was supposed to eliminate the object-relational impedance mismatch that occurs between SQL and OO languages.  The best we have in SQL today is various ORM implementations, which are kludgy at best.  Some other SQL shortcomings: 
  • being able to declare and query vectors/arrays directly in D.  In other words, you don't pass a result set back to Java and load it into an array to display it. 
  • better transitive closure query capabilities
  • handling recursion better.  SQL's support for recursion is weak; "recursive CTEs", for instance, help, but they are verbose and awkward.  
  • better readability than SQL.  Correlated subqueries are really hard to "walk".  Don't believe me?  Look at some of the SQL generated by ORM tools.  Tutorial D allows sections of code to be "variable-ized", almost like a macro.  
  • You can "pipeline" and string commands together.  Think of the HAVING clause in SQL.  It is really just a WHERE clause generated against a previously aggregated set.  HAVING is unneeded and doesn't exist in D.  In reality you can actually pipeline in SQL using derived tables, CTEs, and temp tables.  So you can basically have a derived table that has a GROUP BY clause on it, and then a WHERE clause on that derived table.  In this regard I really like the HAVING clause...but to each his own.  
  • JOINs in D are always NATURAL JOINs, which means the ON clause is never needed.  I really, REALLY wish TSQL had this feature.  A NATURAL JOIN infers the join condition from identically named columns (typically the PK:FK pair).  This really reduces the code clutter.  In D you can always express an alternate JOIN condition; you simply use the WHERE clause.  
  • "Presentation" elements are part of the D language, ie, you can pass a properly formatted (perhaps with css tags) result set that can be used directly by your application's display engine.  Oracle does this by embedding, for instance, PL/SQL directly in Oracle Forms.  I really hated this when I used it last century, but it does work.  And some people swear by it.  
  • D just looks more like Java code, which can make it appealing for Java Jocks.  This makes it scary for report writers though.  Surprisingly, many NoSQL query languages do the exact same thing.  And one of the big complaints about NoSQL solutions is their lack of good report writing tools (there are no "plug-ins" that allow you to query, for instance, CouchDB directly from Crystal Reports).   
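You can actually try the NATURAL JOIN syntax today: SQLite happens to support it, so a quick sketch via Python's sqlite3 shows what the reduced clutter looks like.  The table names and rows are made up for the example:

```python
import sqlite3

# NATURAL JOIN: the join condition is implied by the identically named
# column (dept_id here); no ON clause is written.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (dept_id INTEGER, dept_name TEXT);
    CREATE TABLE emp  (emp_id INTEGER, emp_name TEXT, dept_id INTEGER);
    INSERT INTO dept VALUES (1, 'Accounting'), (2, 'IT');
    INSERT INTO emp  VALUES (10, 'Dave', 1), (11, 'Sue', 2);
""")

rows = conn.execute("""
    SELECT emp_name, dept_name
    FROM emp NATURAL JOIN dept      -- no ON clause: joins on dept_id
    ORDER BY emp_name
""").fetchall()

print(rows)  # [('Dave', 'Accounting'), ('Sue', 'IT')]
```

The flip side is that the join silently depends on column naming: rename `dept_id` in one table and the join becomes a cross product, which is presumably why T-SQL never adopted it.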

Tutorial D itself has no commercial real-world implementation; it is a prescriptive list of qualities that a good relational query language should espouse, intended to be educational.  Rel is an open-source implementation of Tutorial D, written in Java, but I'm not aware of any production system that formally supports it.  

Perhaps it is time that SQL people understand what the NoSQL people hate about SQL.  Then we can focus on providing extensions that cover some of the things listed above that would make all of us more productive without having to abandon the many good things of the relational model.  

What exactly is wrong with SQL and RDBMSs?

This is my third post on my evaluation of NoSQL products for a client who wishes to replace their (rather expensive) SQL Servers.  Many relational guys ask themselves, "why all the fuss about NoSQL?  What is wrong with relational data stores?"  I felt the same way 15 years ago when people wanted to replace RDBMS with XML stores.  But the movement is underway because people feel the relational model is flawed.  In my mind, NoSQL is not a direct replacement for relational data stores, NoSQL has its place, just like relational does.  But, we relational guys are making it tough on ourselves.  We don't want to understand WHY people like relational alternatives.    

This post is a series of things that annoy me with relational people.  These annoyances bother lots of people...the same people who control the decision making powers of the data persistence engines that are chosen for new projects.  If we relational guys don't change our ways we are going to be out of jobs.  My complaints, in no particular order:  

Relational database Schemas are Inflexible

I've been to numerous NoSQL presentations and this is the key point that these vendors drive home.  

How many times have you asked your DBA to add a new column to a table and they refused?  With <insert NoSQL product here> we are schema-less.  We don't care if you add new data elements to your schema or not, because we have no schema.  Add whatever data elements you want, whenever you want.  We'll support it.  

The fact is, relational databases can do this too.  It's not the RDBMS that is inflexible, it is the DBA.  More on that in the next section.  

There are no valid reasons why a relational database schema has to be inflexible.  The reasons are man-made contrivances.  I have many blog posts where I show that you can add new columns and value them while a system is up with ZERO impact to the users.  Some examples are here, here, here, and most importantly here.  

There are some schema changes that are disallowed by the major RDBMS vendors.  For instance, adding a non-nullable column to a billion row table generally means some downtime.  But that is not a flaw of the relational model, that is a flaw in the vendor's implementation of the relational model.  There is nothing in relational theory that limits schema flexibility.  The problem is that the data modeler and data architect are not sufficiently well-versed in how to decouple the logical from the physical model such that evolutionary relational databases can be a reality.  These people want to tie table structure to its on-disk representation.  Data architects and data modelers need to update their skills.  

DBAs, Data Architects, and Database Developers SUCK

I don't feel this way, but many of our fellow software developers do.  Sorry, but it's true.  

We DBAs tend to ask too many questions, and we make the developers think differently about their data.  They don't like that.  We data professionals like to think we are doing our employers a valuable service by asking the difficult questions about new data requirements, but in reality we are not.  Developers are fighting us at every turn and we don't even realize it.  

We've all seen this happen:  A developer/analyst requests a new column in a table.  What does the DBA do?  

  • We ask tons of questions about WHY the data is needed.  
  • We deny the request because
    • the requestor wanted to call it "Status" and our naming conventions require the name to be StsVal.  We direct the requestor to read the 2000 page "Data Standards and Naming Conventions Document, V3.15".  I'm not against standards, I'm against the expectation that EVERYONE knows EVERY standard.  Let's be a little more helpful.  
    • the requestor wanted to use the BIT datatype but our standard is to use TINYINT.  
    • We did not specify an appropriate DEFAULT for the new column.  
    • The new column was requested to be NOT NULLable and that is disallowed.  
    • the change would require too much data conversion code to be written.  
  • After the request is DENIED we ask that the requestor resubmit the request, making the necessary changes, and submit it for the next Change Review Board meeting, which is held the first Tuesday of every month.  So basically, wait a month.  

Get the picture?  DBAs are notorious for not being helpful.  Sorry, but that's my opinion, and the perception of many others too.  

Don't believe me?  Then why do we all see so many of the following traits in our databases:  

  • varchar(max) columns everywhere, because if the requestor asks for varchar(50) and later needs it to be varchar(100), the DBAs will deny the change.  Best to ask for "too much" rather than "too little".  
  • varchar(max) columns that end up storing JSON, XML, or data in some homespun markup language.  Developers do this to avoid data modeling and DBAs.  
  • EAVs.  An EAV has almost infinite schema flexibility.  
  • Tables with ZERO DRI.  The developers don't want to admit that there might be relationships to existing data because then they'll need to deal with DRI.  
  • Data Hoarding.  We constantly see tables with no data lifecycle management.  It's easier for a developer to say that data must always be maintained without being honest about the data retention requirements with the DBAs.  
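The EAV bullet above is worth seeing concretely: one skinny table holds any attribute of any entity, so "adding a column" is just an INSERT.  A sketch via Python's sqlite3, with invented entities and attributes:

```python
import sqlite3

# An EAV (entity-attribute-value) table: maximum schema flexibility,
# because a new attribute needs no DDL at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eav (entity TEXT, attribute TEXT, value TEXT)")
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", [
    ("cust:1", "Name", "Dave"),
    ("cust:1", "Status", "Active"),
    ("cust:2", "Name", "Sue"),
    ("cust:2", "FaxNumber", "555-1234"),   # a brand-new "column", no DDL needed
])

# The downside: reassembling one logical row means pivoting, with one
# MAX(CASE ...) (or one self-join) per attribute you want back.
row = conn.execute("""
    SELECT MAX(CASE WHEN attribute = 'Name'   THEN value END),
           MAX(CASE WHEN attribute = 'Status' THEN value END)
    FROM eav WHERE entity = 'cust:1'
""").fetchone()
print(row)  # ('Dave', 'Active')
```

That pivot query is why EAV schemas built to dodge the DBA end up being the hardest thing in the database to query, index, and type-check.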

To my DBA friends, PLEASE change your attitudes.  

I Can't "Discover" My Schema in a Relational Database

This is another complaint I hear too often.  Developers want to experiment and discover their schemas without a lot of up-front formal modeling.  And I agree.  And I do that every day too.  Nobody says you need to have DBA-approval to build some tables on your scrum server and do a proof-of-concept.  When you are ready then you should get proper change management approval.  This argument is another function of rigid policies and processes.  

Many companies practice rigid "waterfall" development...they MUST do data modeling before any coding is done.  And that is often WRONG.  In these companies the model is ALWAYS determined to be lacking late in the development phase.  The schema can't support all of the data requirements perfectly.  But again, the developers and analysts fear the DBAs, so they "work around" the schema deficiencies.  

The result?  

  • Structures that are difficult to query
  • Structures with poor referential integrity
  • Poor performance

Did you ever notice there aren't a lot of jobs for NoSQL DBAs?  This is because the NoSQL vendors don't want DBAs.  DBAs limit the creative process.  But flexibility to assist in prototyping and experimentation is not a function solely of NoSQL.  

SQL is hard to use and is not expressive enough

This is true.  But it is getting better.  Doing "paging" in SQL 10 years ago required lots of lines of code and performed poorly.  Now we have constructs like TOP, LIMIT, and OFFSET/FETCH that are easier to use.  And there will be even more improvements to the SQL language that will make this even easier; for instance, we'll likely soon have the ability to use a TOP without needing a CTE first.  That's just a guess.  
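Modern paging really is one clause now.  A sketch using Python's sqlite3 (SQLite spells it LIMIT/OFFSET; T-SQL 2012+ spells it OFFSET ... FETCH NEXT); the orders table is invented for the example:

```python
import sqlite3

# Paging with LIMIT/OFFSET -- ten years ago this took a nested
# ROW_NUMBER() or temp-table dance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO orders VALUES (?)", [(i,) for i in range(1, 101)])

PAGE_SIZE = 10

def page(n):  # n is 1-based
    return [r[0] for r in conn.execute(
        "SELECT order_id FROM orders ORDER BY order_id LIMIT ? OFFSET ?",
        (PAGE_SIZE, (n - 1) * PAGE_SIZE),
    )]

print(page(3))  # [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
```

Note the ORDER BY: without a stable sort, OFFSET-based pages can overlap or skip rows, which is the classic paging bug regardless of dialect.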

The NATURAL JOIN syntax would be a welcome addition too.  

Here are some other things SQL really needs:

  • array structures
  • macros to remove code duplication efficiently
  • performant scalar functions

It is getting better.  And I defy you to tell me that Hive or Pig is more expressive than SQL.  

And, of course, we always have ORMs to help us avoid hand-crafting SQL for mundane, repetitive tasks.  


There are good reasons to use a NoSQL solution.  There are also bad reasons.  This blog post was an attempt to outline a few egregious reasons why people choose NoSQL solutions.  These are all issues of perception and education.  

Structuring Your ETL like a Queue

This is a follow-on to Performant ETL and SSIS Patterns.  I really need to do a longer, explanatory post on this.  Two of the largest performance problems I see with SSIS packages are their lack of parallelism and the fact that they are written to run large batches during a defined time window.  The former is totally unnecessary; the latter is unnecessary if you structure your processing smartly.  

